Focal KL-divergence based dilated convolutional neural networks for co-channel speaker identification


Recognizing the identities of multiple talkers from their overlapped speech is a challenging task and one of the main difficulties of the “cocktail party problem”. In this paper, a novel dilated convolutional neural network with a focal KL-divergence loss function is proposed to tackle this problem. During training, the relative loss for well-classified samples is automatically reduced, so that more attention is paid to hard samples. The use of the focal KL-divergence loss function leads to more stable training and improved test performance. Furthermore, a post-processing step that assigns different weights to different frames is adopted and yields further improvement. The proposed framework can be easily extended from the 2-talker to the 3-talker speaker identification scenario. Experiments on the artificially generated RSR2015 multi-talker mixed corpus show that the proposed approach significantly improves multi-talker speaker identification.
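
The focal idea described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything below is an assumption made for illustration: the modulating factor (1 - exp(-KL))^gamma, the parameter name gamma, the function names, and the toy 4-speaker distributions are hypothetical and are not taken from the paper. The sketch only shows how a focal-style factor can shrink the KL-divergence loss of a well-classified sample far more than that of a hard one.

```python
import math


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions over speakers."""
    total = 0.0
    for t, q in zip(target, predicted):
        if t > 0.0:  # terms with zero target mass contribute nothing
            total += t * math.log(t / max(q, eps))
    return max(total, 0.0)  # guard against tiny negative rounding errors


def focal_kl_loss(target, predicted, gamma=2.0):
    """KL divergence scaled by an assumed focal-style modulating factor.

    The factor (1 - exp(-KL))**gamma is an illustrative choice, not the
    paper's formula: it tends to 0 for well-matched predictions and to 1
    for badly mismatched ones.
    """
    kl = kl_divergence(target, predicted)
    weight = (1.0 - math.exp(-kl)) ** gamma
    return weight * kl


# Toy 2-talker target over 4 enrolled speakers: speakers 0 and 1 are active.
target = [0.5, 0.5, 0.0, 0.0]
easy = [0.48, 0.47, 0.03, 0.02]  # close to the target (well classified)
hard = [0.70, 0.05, 0.20, 0.05]  # misses the second talker (hard sample)

for name, pred in (("easy", easy), ("hard", hard)):
    kl = kl_divergence(target, pred)
    focal = focal_kl_loss(target, pred)
    print(f"{name}: KL={kl:.4f} focal={focal:.6f} relative weight={focal / kl:.4f}")
```

In this toy setting the relative weight of the easy sample is much smaller than that of the hard sample, which is the qualitative behavior the abstract attributes to the focal KL-divergence loss. Larger values of the assumed gamma would suppress easy samples more strongly.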

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018