Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification


Short duration text-independent speaker verification remains a hot research topic in recent years, and deep neural network based embeddings have shown impressive results in such conditions. Good speaker embeddings require the property of both small intra-class variation and large inter-class difference, which is critical for the ability of discrimination and generalization. Current embedding learning strategies can be grouped into two frameworks: “Cascade embedding learning” with multiple stages and “direct embedding learning” from spectral feature directly. We propose new approaches to achieve more discriminant speaker embeddings. Within the cascade framework, a neural network based deep discriminant analysis (DDA) is proposed to project i-vector to more discriminative embeddings. Within the direct embedding framework, a deep model with more advanced center loss and A-softmax loss is used, the focal loss is also investigated in this framework. Moreover, the traditional i-vector and neural embeddings are finally combined with neural network based DDA to achieve further gain. Main experiments are carried out on a short-duration text-independent speaker verification dataset generated from the SRE corpus. The results show that the newly proposed method is promising for short-duration text-independent speaker verification, and it is consistently better than traditional i-vector and neural embedding baselines. The best embeddings achieve roughly 30% relative EER reduction compared to the i-vector baseline, which could be further enhanced when combined with the i-vector system.

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1686-1696, Nov. 2019