Seed-VC

The currently released model supports zero-shot voice conversion 🔊 and zero-shot singing voice conversion 🎙. Without any training, it can clone a voice from 1~30 seconds of reference speech.

For demos and comparisons with previous voice conversion models, please visit our demo page🌐.

We are continuing to improve model quality and add more features.

Installation📥

Python 3.10 on Windows or Linux is recommended.

pip install -r requirements.txt

Usage🛠️

Checkpoints of the latest model release are downloaded automatically the first time you run inference.

Command line inference:

# For singing voice conversion, use --diffusion-steps 50~100 and --f0-condition True.
# --auto-f0-condition True auto-adjusts the source pitch to the target pitch level (normally not used for singing).
# --semi-tone-shift is the pitch shift in semitones for singing voice conversion.
python inference.py --source <source-wav> \
    --target <reference-wav> \
    --output <output-dir> \
    --diffusion-steps 25 \
    --length-adjust 1.0 \
    --inference-cfg-rate 0.7 \
    --n-quantizers 3 \
    --f0-condition False \
    --auto-f0-condition False \
    --semi-tone-shift 0

where:

source is the path to the speech file to be converted to the reference voice
target is the path to the speech file used as the voice reference
output is the path to the output directory
diffusion-steps is the number of diffusion steps to use; the default is 25. Use 50-100 for best quality or 4-10 for fastest inference
length-adjust is the length adjustment factor; the default is 1.0. Set it below 1.0 to speed up the speech or above 1.0 to slow it down
inference-cfg-rate makes a subtle difference to the output; the default is 0.7
n-quantizers is the number of FAcodec quantizers to use; the default is 3. The fewer quantizers used, the less of the source audio's prosody is preserved
f0-condition is the flag that conditions the pitch of the output on the pitch of the source audio; the default is False. Set it to True for singing voice conversion
auto-f0-condition is the flag that automatically adjusts the source pitch to the target pitch level; the default is False. It is normally not used in singing voice conversion
semi-tone-shift is the pitch shift in semitones for singing voice conversion; the default is 0
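
If you want to script several conversions, the flags above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper names build_inference_command and convert_folder are hypothetical, and it assumes inference.py is in the current working directory and accepts exactly the flags listed above.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(source, target, output, *, diffusion_steps=25,
                            length_adjust=1.0, inference_cfg_rate=0.7,
                            n_quantizers=3, f0_condition=False,
                            auto_f0_condition=False, semi_tone_shift=0,
                            script="inference.py"):
    """Return the argument list for one inference.py run.

    Defaults mirror the values documented above. Hypothetical helper:
    it only builds the command line and does not import Seed-VC.
    """
    return [
        sys.executable, script,
        "--source", str(source),
        "--target", str(target),
        "--output", str(output),
        "--diffusion-steps", str(diffusion_steps),
        "--length-adjust", str(length_adjust),
        "--inference-cfg-rate", str(inference_cfg_rate),
        "--n-quantizers", str(n_quantizers),
        # Booleans are passed as the literal strings "True" / "False",
        # matching the command shown above.
        "--f0-condition", str(f0_condition),
        "--auto-f0-condition", str(auto_f0_condition),
        "--semi-tone-shift", str(semi_tone_shift),
    ]


def convert_folder(source_dir, target, output, **options):
    """Run inference once per .wav file in source_dir, in name order."""
    for wav in sorted(Path(source_dir).glob("*.wav")):
        cmd = build_inference_command(wav, target, output, **options)
        subprocess.run(cmd, check=True)  # raises if inference.py fails
```

For example, convert_folder("songs", "reference.wav", "out", diffusion_steps=50, f0_condition=True) would apply the singing voice settings to every .wav file in a (hypothetical) songs directory.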

Gradio web interface:

python app.py

Then open the browser and go to http://localhost:7860/ to use the web interface.
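
When launching the web interface from a script, it can help to wait until the server is actually accepting connections before opening the browser. This is a small sketch under the assumption that the interface listens on localhost port 7860 as stated above; wait_for_port is a hypothetical helper and not part of the repository.

```python
import socket
import time


def wait_for_port(host="localhost", port=7860, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; retry shortly
    return False
```

A caller could start python app.py in the background, call wait_for_port(), and then open http://localhost:7860/ with the standard webbrowser module.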

TODO📝

Release code
Release v0.1 pretrained model
Hugging Face Space demo
HTML demo page (maybe with comparisons to other VC models)
Streaming inference
Singing voice conversion
Noise resiliency for source & reference audio
- This is enabled for the f0-conditioned model, but it is not yet clear whether it works well
Potential architecture improvements
- U-ViT style skip connections
- Changed input to FAcodec tokens
Code for training on custom data
Changed to BigVGAN from NVIDIA for singing voice decoding
44 kHz model for higher-quality singing voice conversion
Evaluation metrics & comparison with previous models
More to be added

CHANGELOGS🗒️

2024-09-22:
- Updated the singing voice conversion model to use BigVGAN from NVIDIA, providing a large improvement for high-pitched singing voices
- Added support for chunking and streaming output for long audio files in the Web UI
2024-09-18:
- Updated the f0-conditioned model for singing voice conversion
2024-09-14:
- Updated to the v0.2 pretrained model, which is smaller, needs fewer diffusion steps to achieve the same quality, and adds the ability to control prosody preservation
- Added command line inference script
- Added installation and usage instructions
