News
Doctoral thesis defense
We will be holding a public doctoral dissertation defense. Anyone is welcome to attend, but if you plan to do so, please contact us via the contact page so that we can estimate the number of attendees and inform you of the admission procedure in advance.
Date and time : Wednesday, January 28, 2026 14:30-15:30
Location : Meeting Room 1810, 18th floor, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430
Presenter: Yun Liu
Paper Title: Training and Data Strategies for Target Speaker Extraction
Abstract: Target speaker extraction (TSE) aims to isolate the voice of a specific speaker from complex acoustic mixtures containing multiple speakers and background noise. Despite recent advances in deep neural network approaches, current TSE systems face significant challenges in real-world environments, primarily due to limited speaker diversity in training data and the gap between artificially constructed training conditions and authentic acoustic scenarios. This thesis addresses these fundamental limitations through a comprehensive framework combining curriculum learning strategies with synthetic speaker generation.
The thesis makes four key contributions. First, we identify and evaluate multiple difficulty measures for TSE training data, demonstrating that speaker similarity, measured as the cosine distance between speaker embeddings, provides the most reliable indicator of extraction difficulty. Second, we develop a multi-stage curriculum learning framework that progressively trains models from easier to more challenging examples, achieving up to a 0.97 dB improvement in signal-to-distortion ratio over conventional random sampling. Third, we leverage generative speech models to create synthetic speakers via voice conversion as additional training data, with optimal performance achieved at a 50%-50% balance between real and synthetic speakers. Fourth, we introduce Libri2Vox, a novel dataset that bridges the gap between controlled experimental conditions and real-world complexity by combining clean speech from LibriTTS with recordings from VoxCeleb2, covering diverse acoustic conditions and over 7,000 speakers in total.
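The idea behind the first two contributions can be illustrated with a minimal sketch: score each training mixture by the cosine similarity between the target and interfering speakers' embeddings (more similar speakers are harder to separate), then order the training set from easy to hard. All function names and the embedding representation here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def curriculum_order(mixtures: list[tuple[np.ndarray, np.ndarray]]) -> list[int]:
    """Order training mixtures from easy to hard.

    Each mixture is a (target_embedding, interferer_embedding) pair.
    Higher target/interferer similarity means a harder extraction
    problem, so we sort by similarity in ascending order.
    """
    difficulty = [cosine_similarity(t, i) for t, i in mixtures]
    return list(np.argsort(difficulty))
```

A multi-stage curriculum would then feed the model successive slices of this ordering, starting with the most dissimilar (easiest) speaker pairs.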
Experimental evaluations across multiple state-of-the-art TSE architectures (Conformer, BLSTM, SpeakerBeam, and VoiceFilter) demonstrate consistent improvements of 0.5 to 2.2 dB in extraction performance. Notably, models trained with our approach show substantial improvements on real-world recordings, achieving positive quality gains where conventional methods yield degraded performance. Overall, the integration of curriculum learning, synthetic speaker augmentation, and realistic acoustic conditions significantly enhances model robustness and generalization capability.
This thesis establishes that strategic training data utilization through curriculum design and synthetic augmentation can substantially improve TSE systems. The proposed methods are architecture-agnostic and can be readily integrated into existing TSE training pipelines. These findings have broader implications for speech processing tasks facing similar challenges of limited training data and domain mismatch, contributing to the development of more robust speech technologies for real-world deployment.