Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and to help junior researchers and engineers get started in audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models and architectures. We believe these visualizations help junior researchers and engineers gain a better understanding of the models.

The North-Star objective of Amphion is to offer a platform for studying the conversion of various inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to:

TTS: Text to Speech Synthesis (supported)
SVS: Singing Voice Synthesis (planned)
VC: Voice Conversion (planned)
SVC: Singing Voice Conversion (supported)
TTA: Text to Audio (supported)
TTM: Text to Music (planned)
more…

In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for comparing results consistently across generation tasks.

Key Features

TTS: Text to speech

Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems.
It supports the following models and architectures:
- FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
- VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
- VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
- NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

SVC: Singing Voice Conversion

It supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec.
It implements several state-of-the-art model architectures, including diffusion-based and Transformer-based models. The diffusion-based architecture uses a bidirectional dilated CNN and U-Net as a backend and supports DDPM, DDIM, and PNDM. Additionally, it supports single-step inference based on the Consistency Model.

TTA: Text to Audio

Amphion supports TTA with latent diffusion models, including:
- AudioLDM: A two-stage model with an autoencoder and a latent diffusion model.

Vocoder

Amphion supports both classic and state-of-the-art neural vocoders, including:
- GAN-based vocoders: MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet
- Flow-based vocoders: WaveGlow
- Diffusion-based vocoders: DiffWave
- Autoregressive vocoders: WaveNet, WaveRNN

Evaluation

Amphion provides a comprehensive objective evaluation of generated audio. The evaluation metrics include:

F0 Modeling
- F0 Pearson Coefficients
- F0 Periodicity Root Mean Square Error
- F0 Root Mean Square Error
- Voiced/Unvoiced F1 Score
Energy Modeling
- Energy Pearson Coefficients
- Energy Root Mean Square Error
Intelligibility
- Character/Word Error Rate based on Whisper
Spectrogram Distortion
- Frechet Audio Distance (FAD)
- Mel Cepstral Distortion (MCD)
- Multi-Resolution STFT Distance (MSTFT)
- Perceptual Evaluation of Speech Quality (PESQ)
- Short Time Objective Intelligibility (STOI)
- Signal to Noise Ratio (SNR)
Speaker Similarity
- Cosine similarity based on RawNet3
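
To make a few of the metrics above concrete, here is a minimal NumPy sketch of F0 RMSE, F0 Pearson correlation, voiced/unvoiced F1, SNR, and cosine similarity. It is an illustration only and not Amphion's implementation: the function names are hypothetical, the inputs are assumed to be equal-length and already time-aligned, unvoiced frames are assumed to be marked with an F0 of 0, and the speaker embeddings are assumed to come from some external model such as RawNet3.

```python
import numpy as np


def _both_voiced(f0_ref, f0_gen):
    """Return the two contours restricted to frames voiced in both (F0 > 0)."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_gen = np.asarray(f0_gen, dtype=float)
    mask = (f0_ref > 0) & (f0_gen > 0)
    return f0_ref[mask], f0_gen[mask]


def f0_rmse(f0_ref, f0_gen):
    """Root mean square error (Hz) over frames voiced in both contours."""
    ref, gen = _both_voiced(f0_ref, f0_gen)
    return float(np.sqrt(np.mean((ref - gen) ** 2)))


def f0_pearson(f0_ref, f0_gen):
    """Pearson correlation coefficient over frames voiced in both contours."""
    ref, gen = _both_voiced(f0_ref, f0_gen)
    return float(np.corrcoef(ref, gen)[0, 1])


def vuv_f1(f0_ref, f0_gen):
    """F1 score of the voiced/unvoiced decision, treating 'voiced' as positive."""
    ref_v = np.asarray(f0_ref, dtype=float) > 0
    gen_v = np.asarray(f0_gen, dtype=float) > 0
    tp = np.sum(ref_v & gen_v)    # voiced in both
    fp = np.sum(~ref_v & gen_v)   # generated voiced, reference unvoiced
    fn = np.sum(ref_v & ~gen_v)   # generated unvoiced, reference voiced
    return float(2 * tp / (2 * tp + fp + fn))


def snr_db(wav_ref, wav_gen):
    """Signal-to-noise ratio in dB, with noise defined as (reference - generated)."""
    wav_ref = np.asarray(wav_ref, dtype=float)
    wav_gen = np.asarray(wav_gen, dtype=float)
    noise = wav_ref - wav_gen
    return float(10 * np.log10(np.sum(wav_ref ** 2) / np.sum(noise ** 2)))


def cosine_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


if __name__ == "__main__":
    # Toy contours: 0 marks an unvoiced frame.
    f0_ref = [0, 100, 200, 0, 300]
    f0_gen = [0, 110, 190, 150, 0]
    print("F0 RMSE:", f0_rmse(f0_ref, f0_gen))
    print("F0 Pearson:", f0_pearson(f0_ref, f0_gen))
    print("V/UV F1:", vuv_f1(f0_ref, f0_gen))
```

The sketch omits edge-case handling: it does not guard against contours with no jointly voiced frames, identical waveforms (zero noise energy), or zero-norm embeddings, all of which a real evaluation pipeline would need to handle.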
