HiFi-GAN

Introduced by Kong et al. in HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi-GAN is a generative adversarial network (GAN) for speech synthesis. It consists of one generator and two discriminators: a multi-scale discriminator and a multi-period discriminator. The generator and discriminators are trained adversarially, together with two additional losses, a mel-spectrogram loss and a feature matching loss, which improve training stability and model performance.
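The training objective described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes least-squares GAN terms for the adversarial losses, an L1 feature matching loss summed over discriminator layers, and an L1 mel-spectrogram loss, combined with the weights λ_fm = 2 and λ_mel = 45 reported in the paper. All function names are hypothetical, and plain lists of floats stand in for discriminator outputs and mel-spectrograms, which would be tensors in real training.

```python
from typing import List, Sequence

# Hypothetical helper names; scores, features and mel values are plain lists
# of floats standing in for tensors produced by real networks.


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    # Mean absolute difference between two equal-length sequences.
    return mean([abs(x - y) for x, y in zip(a, b)])


def discriminator_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    # Least-squares GAN objective for one sub-discriminator:
    # push D(x) toward 1 and D(G(s)) toward 0.
    return mean([(r - 1.0) ** 2 for r in real_scores]) + mean([f ** 2 for f in fake_scores])


def generator_adv_loss(fake_scores: Sequence[float]) -> float:
    # Least-squares GAN objective for the generator: push D(G(s)) toward 1.
    return mean([(f - 1.0) ** 2 for f in fake_scores])


def feature_matching_loss(real_feats: List[List[float]], fake_feats: List[List[float]]) -> float:
    # Sum over discriminator layers of the mean L1 distance between the
    # intermediate features of real and generated audio.
    return sum(l1(r, f) for r, f in zip(real_feats, fake_feats))


def generator_total_loss(
    fake_scores_per_d: List[List[float]],
    real_feats_per_d: List[List[List[float]]],
    fake_feats_per_d: List[List[List[float]]],
    mel_real: Sequence[float],
    mel_fake: Sequence[float],
    lambda_fm: float = 2.0,
    lambda_mel: float = 45.0,
) -> float:
    # Adversarial and feature matching terms are summed over all sub-discriminators;
    # the mel-spectrogram term is an L1 distance between flattened mel-spectrograms.
    adv = sum(generator_adv_loss(s) for s in fake_scores_per_d)
    fm = sum(feature_matching_loss(r, f) for r, f in zip(real_feats_per_d, fake_feats_per_d))
    return adv + lambda_fm * fm + lambda_mel * l1(mel_real, mel_fake)


# Toy example with a single sub-discriminator and made-up numbers.
print(discriminator_loss([0.9, 1.1], [0.2, 0.0]))  # about 0.03
print(generator_total_loss([[0.5]], [[[1.0, 2.0]]], [[[1.0, 2.0]]], [0.0, 0.0], [0.1, -0.1]))  # about 4.75
```

The large mel weight reflects that the mel-spectrogram loss acts as a strong reconstruction signal early in training, while the adversarial and feature matching terms refine perceptual detail.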

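The generator is a fully convolutional neural network. It takes a mel-spectrogram as input and upsamples it with transposed convolutions until the length of the output sequence matches the temporal resolution of the raw waveform. Each transposed convolution is followed by a multi-receptive field fusion (MRF) module, which sums the outputs of several residual blocks with different kernel sizes and dilation rates so that patterns of various lengths are observed in parallel.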
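A quick way to see how the upsampling stack reaches waveform resolution is to track sequence lengths. The sketch below is illustrative only. It assumes upsampling rates [8, 8, 2, 2] with transposed-convolution kernel sizes [16, 16, 4, 4] and padding (k - u) // 2, values we believe resemble the published V1 configuration (hop size 256), and it assumes an MRF residual block in which each dilated convolution is followed by an undilated one. The function names are hypothetical.

```python
from typing import Sequence


def conv_transpose1d_out_len(l_in: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    # Output length of a 1-D transposed convolution (dilation 1, no output padding).
    return (l_in - 1) * stride - 2 * padding + (kernel_size - 1) + 1


def generator_output_len(n_frames: int, upsample_rates: Sequence[int], kernel_sizes: Sequence[int]) -> int:
    # Apply each upsampling stage in turn; with kernel size k = 2u and
    # padding (k - u) // 2, every stage multiplies the length by exactly u.
    length = n_frames
    for u, k in zip(upsample_rates, kernel_sizes):
        length = conv_transpose1d_out_len(length, k, u, padding=(k - u) // 2)
    return length


def resblock_receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    # Assumed MRF residual block layout: for each dilation d, a conv with
    # (kernel_size, dilation d) followed by a conv with (kernel_size, dilation 1).
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d + (kernel_size - 1)
    return rf


rates, kernels = [8, 8, 2, 2], [16, 16, 4, 4]
print(generator_output_len(100, rates, kernels))  # 25600 = 100 frames * 256 samples per frame
print([resblock_receptive_field(k, [1, 3, 5]) for k in (3, 7, 11)])  # [25, 73, 121]
```

The product of the upsampling rates must equal the hop size of the mel-spectrogram so that one input frame maps to exactly one hop of audio samples; the differing receptive fields of the residual blocks are what the MRF module fuses.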

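For discrimination, HiFi-GAN uses a multi-period discriminator (MPD) consisting of several sub-discriminators (with periods 2, 3, 5, 7 and 11), each of which handles a different periodic portion of the input audio: the 1-D waveform is reshaped into 2-D with one period per row, so each sub-discriminator examines samples spaced p apart. Additionally, to capture consecutive patterns and long-term dependencies, it uses the multi-scale discriminator (MSD) proposed in MelGAN, which evaluates the audio at several scales: the raw waveform and ×2 and ×4 average-pooled versions of it.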
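The input transformations behind the two discriminators can be sketched without any neural network code. This is a minimal illustration under stated assumptions: the MPD reshape uses reflection padding at the end of the signal so its length becomes a multiple of the period, and a simple non-overlapping mean pool stands in for the MSD's average pooling (the real pooling layer may use a different window and padding). All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def period_reshape(audio: List[float], period: int) -> List[List[float]]:
    # Reflect-pad the end so the length is a multiple of `period`, then fold the
    # 1-D signal into rows of `period` samples (shape: len / period x period).
    n_pad = (-len(audio)) % period
    if n_pad >= len(audio):
        raise ValueError("reflection padding needs a signal longer than the pad")
    if n_pad:
        # Mirror the last samples, excluding the final sample itself.
        audio = audio + audio[-2:-2 - n_pad:-1]
    return [audio[i:i + period] for i in range(0, len(audio), period)]


def columns(rows: List[List[float]]) -> List[Tuple[float, ...]]:
    # Column j holds samples j, j + p, j + 2p, ...: the periodic subsequence
    # that a kernel of width 1 along the period axis would process.
    return list(zip(*rows))


def mean_pool(audio: Sequence[float], factor: int) -> List[float]:
    # Non-overlapping mean pooling; a simplified stand-in for MSD's average pooling.
    return [sum(audio[i:i + factor]) / factor for i in range(0, len(audio) - factor + 1, factor)]


def msd_inputs(audio: Sequence[float]) -> List[List[float]]:
    # Raw audio plus x2 and x4 average-pooled versions, one per MSD sub-discriminator.
    x2 = mean_pool(audio, 2)
    x4 = mean_pool(x2, 2)
    return [list(audio), x2, x4]


signal = [float(i) for i in range(10)]
for p in [2, 3, 5, 7, 11]:
    rows = period_reshape(signal, p)
    print(p, len(rows), columns(rows)[0])
print(msd_inputs([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
```

Using prime periods keeps the sub-discriminators' views from overlapping much, while the MSD's pooled inputs give progressively coarser, longer-range views of the same waveform.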

Source: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis



Tasks

Task	Papers	Share of papers
Speech Synthesis	16	22.54%
Text-To-Speech Synthesis	7	9.86%
Voice Conversion	5	7.04%
Automatic Speech Recognition (ASR)	2	2.82%
Speech Recognition	2	2.82%
Decoder	2	2.82%
Voice Cloning	2	2.82%
Speech Enhancement	2	2.82%
Speaker Verification	2	2.82%



Categories


Generative Audio Models

Generative Adversarial Networks