VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train and test speech enhancement methods that operate at 48kHz. A more detailed description can be found in the paper associated with the database. Some of the noises were obtained from the Demand database, available here: http://parole.loria.fr/DEMAND/ . The speech database was obtained from the Voice Banking Corpus, available here: http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz .
35 PAPERS • 1 BENCHMARK
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English each reading 10 phonetically-rich sentences. It also comes with the word and phone-level transcriptions of the speech.
27 PAPERS • 5 BENCHMARKS
A Brazilian Portuguese TTS dataset featuring a female voice recorded with high quality in a controlled environment, with neutral emotion and more than 20 hours of recordings. with neutral emotion and more than 20 hours of recordings. Our dataset aims to facilitate transfer learning for researchers and developers working on TTS applications: a highly professional neutral female voice can serve as a good warm-up stage for learning language-specific structures, pronunciation and other non-individual characteristics of speech, leaving to further training procedures only to learn the specific adaptations needed (e.g. timbre, emotion and prosody). This can surely help enabling the accommodation of a more diverse range of female voices in Brazilian Portuguese. By doing so, we also hope to contribute to the development of accessible and high-quality TTS systems for several use cases such as virtual assistants, audiobooks, language learning tools and accessibility solutions.
1 PAPER • NO BENCHMARKS YET
A database containing high sampling rate recordings of a single speaker reading sentences in Brazilian Portuguese with neutral voice, along with the corresponding text corpus. Intended for speech synthesis and automatic speech recognition applications, the dataset contains text extracted from a popular Brazilian news TV program, totalling roughly 20 h of audio spoken by a trained individual in a controlled environment. The text was normalized in the recording process and special textual occurrences (e.g. acronyms, numbers, foreign names etc.) were replaced by their phonetic translation to a readable text in Portuguese. There are no noticeable accidental sounds and background noise has been kept to a minimum in all audio samples.