What Does the Speaker Embedding Encode?


Developing good speaker embeddings has received tremendous interest in the speech community. Speaker representations such as the i-vector and d-vector have shown their superiority in speaker recognition, speaker adaptation, and other related tasks. However, little is known about exactly which properties these speaker embeddings encode. In this work, we investigate in depth three kinds of speaker embeddings: the i-vector, the d-vector, and the RNN/LSTM-based sequence-vector (s-vector). Classification tasks are carefully designed to facilitate a better understanding of these encoded speaker representations. Their abilities to encode different properties, such as speaker identity, gender, speaking rate, text content, and channel information, are revealed and compared. Moreover, a new architecture is proposed to integrate different speaker embeddings so that their advantages can be combined. The new advanced speaker embedding (i-s-vector) outperforms the others and shows a more than 50% reduction in equal error rate (EER) compared to the i-vector baseline on the RSR2015 content mismatch trials.

In the 18th Annual Conference of the International Speech Communication Association (Interspeech), Stockholm, Sweden, 2017