  • Technical Reports

[Translation] Objective evaluation of synthetic speech and the VoiceMOS challenge

Authors: Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

  • #Audio processing
  • #Speech synthesis
  • #Quality Evaluation

Journal of the Acoustical Society of Japan, Vol. 80 (2024) No. 7 Special Feature: Beyond MOS: Future Prospects of Speech Assessment Methods

Computer-generated synthetic speech must be evaluated from multiple perspectives: whether it is intelligible to the listener, whether it sounds natural, whether it matches the target speaker and speaking style, and whether it fulfills its intended purpose. Such evaluation is also needed to determine whether a new synthesis method outperforms previous ones, or whether a proposed change brings an improvement. Researchers have therefore developed new speech synthesis methods while also considering how to evaluate them. Traditionally, evaluation has relied mainly on listening tests with human subjects. Because humans are the ultimate listeners of synthetic speech, human opinion should be considered the gold standard. In these listening tests, subjects are presented with synthetic speech samples one at a time and asked to rate some aspect of the speech, such as its naturalness, on a Likert scale (usually 5-point); a mean opinion score (MOS) is then typically obtained for each system by averaging the individual ratings of the speech it synthesized. However, because such evaluations are costly and time-consuming, researchers have also explored automated evaluation methods to streamline the iterative experimental process. These efforts began with correlation analysis between subjective ratings and acoustic features, moved on to signal processing-based methods originally developed for telephone communication, and have most recently turned to machine learning-based approaches trained on past listening test data. In this paper, we provide an overview of recently proposed automatic evaluation methods for synthetic speech and their development.
We also discuss our experience running the VoiceMOS Challenge for two years, which provides a common database for training and evaluating machine learning-based synthetic speech quality prediction models and enables cross-comparison of prediction methods. Finally, we discuss ongoing research in this field, as well as open problems and future prospects.
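
As a small illustration of the MOS described in the abstract (the average of the individual listener ratings collected for each system), the following is a minimal sketch and not code from the paper. The function name `mos_per_system`, the system labels, and the ratings are hypothetical; a 5-point scale from 1 to 5 is assumed.

```python
from collections import defaultdict
from statistics import mean


def mos_per_system(ratings):
    """Average listener ratings per system.

    `ratings` is an iterable of (system_id, score) pairs, where each score
    is one listener's rating of one sample on an assumed 1-5 scale.
    """
    by_system = defaultdict(list)
    for system_id, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score!r} is outside the assumed 1-5 scale")
        by_system[system_id].append(score)
    # The MOS of a system is the mean of all ratings given to its samples.
    return {system_id: mean(scores) for system_id, scores in by_system.items()}


# Hypothetical ratings, made up purely for illustration.
ratings = [
    ("sys_A", 4), ("sys_A", 5), ("sys_A", 3),
    ("sys_B", 2), ("sys_B", 3),
]
mos = mos_per_system(ratings)
for system_id, score in sorted(mos.items()):
    print(f"{system_id}: MOS = {score:.2f}")
```

Real listening tests usually also report the number of ratings and a confidence interval for each system, which this sketch omits for brevity.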