I obtained my Ph.D. degree from Shanghai Jiao Tong University in September 2020, under the supervision of Kai Yu and Yanmin Qian. During my Ph.D., my research focused on deep learning based approaches for speaker recognition, speaker diarization, and voice activity detection. After graduation, I joined Tencent Games as a senior researcher, where I (informally) led a speech group and extended my research interests to speech synthesis, voice conversion, music generation, and audio retrieval. Currently, I am with the SpeechLab at the Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong (Shenzhen), led by Haizhou Li.
I am the creator of Wespeaker, a research- and production-oriented speaker representation learning toolkit. You can check my tutorial to see what speaker modeling can do and how to easily apply Wespeaker to your own tasks. You are welcome to use it and contribute!
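As a minimal sketch of getting started, assuming the pip-installable `wespeaker` Python package and its `load_model` / `extract_embedding` / `compute_similarity` interface (the exact model names and calls may differ across versions), embedding extraction and a simple verification score could look like this:

```python
# Minimal sketch (assumptions noted above): extract a speaker embedding and
# compare two utterances with Wespeaker's Python interface.
import wespeaker

# Load a pretrained speaker model (model name is an assumed example).
model = wespeaker.load_model('english')

# Extract a fixed-dimensional speaker embedding from a local wav file.
embedding = model.extract_embedding('speaker1_utt1.wav')
print(embedding.shape)

# Speaker verification: similarity score between two utterances.
score = model.compute_similarity('speaker1_utt1.wav', 'speaker2_utt1.wav')
print(f'similarity score: {score:.3f}')
```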
Services: I serve as a regular reviewer for speech and deep learning related conferences and journals, including Interspeech, ICASSP, ICME, SPL, TASLP, Neural Networks, and Pattern Recognition. I will serve as the publication chair for SLT 2024.
Openings: We are actively seeking self-motivated students to join our team as research assistants, visiting students, and prospective Ph.D. students. Multiple positions are immediately available in Shenzhen, with competitive salary and benefits. If you are interested, please drop me an email with your CV.
PhD in Computer Science and Technology, 2020
Shanghai Jiao Tong University
BSc in Software Engineering, 2014
Northwestern Polytechnical University
Work on several research papers and contribute to
AAAI 2024
We introduce a statistics-based speaker modeling method into voice conversion.
This paper describes the winning systems developed by the BUT team for the four tracks of the Second DIHARD Speech Diarization Challenge, with the source code publicly available.
We proposed the text-adaptation speaker verification task and an initial solution, the speaker-text factorization network, which can handle different text-mismatch conditions.
This paper describes the winning systems developed by the BUT team for the two tracks of the First VoxSRC Speaker Recognition Challenge; the r-vector was proposed in this paper. Update: I launched an open-source project, Wespeaker, where the implementation can be found.
We proposed a segment-level representation for phonetic information and the corresponding segment-level multi-task/adversarial training framework. We revisited the use of phonetic information for text-independent speaker embedding learning and designed experiments to verify the assumption that, for TI-SV, it can be beneficial to remove the phonetic variation from the final speaker embeddings.