The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are a part of the LibriVox project. Most of the audiobooks come from the Project Gutenberg. The training data is split into 3 partitions of 100hr, 360hr, and 500hr sets while the dev and test data are split into the ’clean’ and ’other’ categories, respectively, depending upon how well or challenging Automatic Speech Recognition systems would perform against. Each of the dev and test sets is around 5hr in audio length. This corpus also provides the n-gram language models and the corresponding texts excerpted from the Project Gutenberg books, which contain 803M tokens and 977K unique words.
1,978 PAPERS • 8 BENCHMARKS
ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
38 PAPERS • NO BENCHMARKS YET
The Varied Emotion in Syntactically Uniform Speech (VESUS) repository is a lexically controlled database collected by the NSA lab. Here, actors read a semantically neutral script of words, phrases, and sentences with different emotional inflections. VESUS contains 252 distinct phrases, each read by 10 actors in 5 emotional states (neutral, angry, happy, sad, fearful).
1 PAPER • NO BENCHMARKS YET