Joint i-vector with end-to-end system for short-duration text-independent speaker verification


Factor analysis based i-vector has been the state-of-the-art method for speaker verification. Recently, researchers propose to build DNN based end-to-end speaker verification systems and achieve comparable performance with i -vector. Since these two methods possess their own property and differ from each other significantly, we explore a framework to integrate these two paradigms together to utilize their complementarity. More specifically, in this paper we develop and compare four methodologies to integrate traditional i -vector into end-to-end systems, including score fusion, embeddings concatenation, transformed concatenation and joint learning. All these approaches achieve significant gains. Moreover, the hard trial selection is performed on the end-to-end architecture which further improves the performance. Experimental results on a text-independent short-duration dataset generated from SRE 2010 reveal that the newly proposed method reduces the EER by relative 31.0% and 28.2% compared to the i-vector and end-to-end baselines respectively.

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018