Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.

OpenHLT/open-mmlab-Amphion

 
 

Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.

The North-Star objective of Amphion is to offer a platform for studying the conversion of various inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to,

  • TTS: Text to Speech Synthesis (supported)
  • SVS: Singing Voice Synthesis (planned)
  • VC: Voice Conversion (planned)
  • SVC: Singing Voice Conversion (supported)
  • TTA: Text to Audio (supported)
  • TTM: Text to Music (planned)
  • more…

In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for comparing results consistently across generation tasks.

Key Features

TTS: Text to speech

  • Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems.
  • It supports the following models or architectures,
    • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
    • VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
    • Vall-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
    • NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

SVC: Singing Voice Conversion

  • It supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec.
  • It implements several state-of-the-art model architectures, including diffusion-based and Transformer-based models. The diffusion-based architecture uses a bidirectional dilated CNN and U-Net as backends and supports DDPM, DDIM, and PNDM sampling. Additionally, it supports single-step inference based on the Consistency Model.
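
To illustrate why a DDIM-style sampler can use far fewer steps than the training schedule, here is a minimal sketch of the deterministic DDIM update (eta = 0) in plain Python. It is not taken from Amphion's code: the function names (`make_alpha_bars`, `ddim_step`, `ddim_sample`), the linear beta schedule, and the "oracle" noise predictor that stands in for a trained network are all assumptions made for this example.

```python
import math

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0) applied element-wise."""
    out = []
    for x, e in zip(x_t, eps_pred):
        # Estimate the clean sample from the current noisy sample.
        x0_pred = (x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
        # Re-noise the estimate to the (lower) noise level of the next step.
        out.append(math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * e)
    return out

def ddim_sample(x_T, predict_eps, alpha_bars, timesteps):
    """Run DDIM over a decreasing subsequence of the training timesteps."""
    x = x_T
    for i, t in enumerate(timesteps):
        # After the last listed timestep we jump to the clean sample (alpha_bar = 1).
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, predict_eps(x, t), alpha_bars[t], ab_prev)
    return x

# Toy usage: an oracle that returns the true noise stands in for the model.
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -1.0, 0.25]
eps = [0.3, -0.7, 1.1]
ab_T = alpha_bars[999]
x_T = [math.sqrt(ab_T) * a + math.sqrt(1.0 - ab_T) * e for a, e in zip(x0, eps)]
recovered = ddim_sample(x_T, lambda x, t: eps, alpha_bars, [999, 749, 499, 249, 0])
```

With a perfect noise predictor, the five-step run recovers `x0` up to floating-point error. A real model only approximates the noise, so the number of steps trades speed against quality.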

TTA: Text to Audio

  • It supports TTA with a latent diffusion model, including:
    • AudioLDM: A two-stage model with an autoencoder and a latent diffusion model.

Vocoder

Evaluation

We supply a comprehensive objective evaluation for the generated audio. The evaluation metrics include:

  • F0 Modeling
    • F0 Pearson Coefficients
    • F0 Periodicity Root Mean Square Error
    • F0 Root Mean Square Error
    • Voiced/Unvoiced F1 Score
  • Energy Modeling
    • Energy Pearson Coefficients
    • Energy Root Mean Square Error
  • Intelligibility
    • Character/Word Error Rate based on Whisper
  • Spectrogram Distortion
    • Frechet Audio Distance (FAD)
    • Mel Cepstral Distortion (MCD)
    • Multi-Resolution STFT Distance (MSTFT)
    • Perceptual Evaluation of Speech Quality (PESQ)
    • Short Time Objective Intelligibility (STOI)
    • Signal to Noise Ratio (SNR)
  • Speaker Similarity
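
As a rough illustration of the F0 metrics listed above, the sketch below computes F0 RMSE, the F0 Pearson coefficient, and the voiced/unvoiced F1 score from two frame-aligned F0 contours. It is a simplified stand-in, not Amphion's implementation. The function names, the convention that `0` marks an unvoiced frame, and the choice to compare only frames voiced in both contours are assumptions for this example.

```python
import math

def _voiced_pairs(ref_f0, gen_f0):
    """Keep frames where both contours are voiced (assumed: F0 > 0 means voiced)."""
    pairs = [(r, g) for r, g in zip(ref_f0, gen_f0) if r > 0 and g > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return pairs

def f0_rmse(ref_f0, gen_f0):
    """Root mean square error in Hz over mutually voiced frames."""
    pairs = _voiced_pairs(ref_f0, gen_f0)
    return math.sqrt(sum((r - g) ** 2 for r, g in pairs) / len(pairs))

def f0_pearson(ref_f0, gen_f0):
    """Pearson correlation coefficient over mutually voiced frames."""
    pairs = _voiced_pairs(ref_f0, gen_f0)
    n = len(pairs)
    mean_r = sum(r for r, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    cov = sum((r - mean_r) * (g - mean_g) for r, g in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_g = sum((g - mean_g) ** 2 for _, g in pairs)
    return cov / math.sqrt(var_r * var_g)

def vuv_f1(ref_f0, gen_f0):
    """F1 score of the voiced decision, treating 'voiced' as the positive class."""
    tp = sum(1 for r, g in zip(ref_f0, gen_f0) if r > 0 and g > 0)
    fp = sum(1 for r, g in zip(ref_f0, gen_f0) if r <= 0 and g > 0)
    fn = sum(1 for r, g in zip(ref_f0, gen_f0) if r > 0 and g <= 0)
    return 2 * tp / (2 * tp + fp + fn)

# Toy contours in Hz, one value per frame; the last generated frame is a
# false "voiced" decision.
ref = [0.0, 100.0, 110.0, 120.0, 0.0]
gen = [0.0, 102.0, 108.0, 124.0, 50.0]
rmse = f0_rmse(ref, gen)
corr = f0_pearson(ref, gen)
f1 = vuv_f1(ref, gen)
```

In practice the two contours would come from a pitch extractor run on the reference and generated audio, and they would need to be time-aligned before these functions are applied.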
