Doubt in CE-RVQ loss when training NaturalSpeech 2 #222
Comments
@shreeshailgan How do you generate the codes during preprocessing? If you use the "extract_encodec_token" function, you have to change the target bandwidth to 12.0 to match the NaturalSpeech 2 model. Also, can you share the code for duration preparation for the LibriTTS dataset?
@chazo1994 I am just directly running the code from EnCodec's documentation to extract the codes. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2.
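As background on why the target bandwidth matters: the bandwidth setting determines how many RVQ codebooks EnCodec emits per frame. A rough back-of-the-envelope sketch, assuming EnCodec's 24 kHz model (75 code frames per second and 1024-entry codebooks, i.e. 10 bits per code index) — the function name and constants here are our assumptions, not part of the encodec package:

```python
# Back-of-the-envelope: number of RVQ codebooks implied by a target bandwidth.
# Assumes EnCodec 24 kHz: 75 code frames per second, 1024-entry codebooks
# (10 bits per code index). These constants are assumptions for illustration.
def num_quantizers(bandwidth_kbps, frame_rate=75, bits_per_code=10):
    bits_per_second = bandwidth_kbps * 1000
    return int(bits_per_second // (frame_rate * bits_per_code))

print(num_quantizers(12.0))   # 12 kbps -> 16 codebooks per frame
print(num_quantizers(6.0))    # 6 kbps  -> 8 codebooks per frame
```

So at the 12.0 kbps setting mentioned above, each frame carries 16 code indices; if the bandwidth during extraction differs from what the NS2 model expects, the code tensor's quantizer dimension will not match.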
I also ran into a problem. I changed the target bandwidth to 12.0 to match the NaturalSpeech 2 model and used FastSpeech 2 to preprocess the duration and pitch features. But when I started training, I found that the frame length of the code features was shorter than that of the duration and pitch features. This trips the align_length assertion in ns2_dataset (`if dur_sum > min_len:`). Did I get a step wrong? I am confused and looking forward to your reply. Thanks!
Yes, I have faced this problem too. The code features mostly match the length of the pitch features, but sometimes they have one extra frame. In such cases, I just trim the code features.
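The trimming described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name, not the actual dataset code:

```python
import numpy as np

def align_code_length(code, target_len):
    """Trim (or edge-pad) a (T, Q) code array to target_len frames.
    Hypothetical helper; handles the occasional one-extra-frame case."""
    T = code.shape[0]
    if T > target_len:
        return code[:target_len]
    if T < target_len:
        # Repeat the last frame to fill the gap.
        pad = np.repeat(code[-1:], target_len - T, axis=0)
        return np.concatenate([code, pad], axis=0)
    return code

code = np.zeros((101, 16), dtype=np.int64)   # one extra frame vs pitch length 100
aligned = align_code_length(code, 100)
print(aligned.shape)   # (100, 16)
```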
I faced this error too. I don't remember now what caused it, but it was present in very few examples (~2-3 data points in a dataset with 200k training examples), so I just removed those examples before training.
Thank you for your reply. It is probably because of the LJSpeech data I used, but I feel the align_length function has some defects; I am not sure whether they are caused by the convolutional downsampling design of EnCodec. So I rewrote the code/duration align_length function this morning so that it can force alignment even when the code and duration lengths differ widely.
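The rewritten function is not shown in the thread, but one simple way to force alignment is to absorb the length difference into the final phoneme duration. A sketch under that assumption (names are ours, and it presumes the gap is small enough for the last duration to stay positive):

```python
def force_align(durations, code_len):
    """Adjust the last phoneme duration so sum(durations) == code_len.
    Illustrative sketch only; assumes the gap is small enough that the
    last duration stays positive."""
    durations = list(durations)
    durations[-1] += code_len - sum(durations)
    if durations[-1] < 1:
        raise ValueError("gap too large to absorb in the last duration")
    return durations

durs = force_align([3, 5, 4], 13)   # durations summed to 12, code has 13 frames
print(durs)          # [3, 5, 5]
print(sum(durs))     # 13
```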
When training the NS2 model, for calculating the CE-RVQ loss, we have the `diff_ce_loss` method: Amphion/models/tts/naturalspeech2/ns2_loss.py, Line 65 in d335514.
This function takes the ground-truth indices `gt_indices` and the predicted distribution `pred_dist`.
For `gt_indices`, we could pass the loaded `code` tensor directly. Instead, what is being passed is the code reconstructed from the ground-truth latent `x0`: Amphion/models/tts/naturalspeech2/ns2_trainer.py, Line 464 in d335514.
The ground-truth latent `x0` is itself inferred earlier from the loaded `code` tensor: Amphion/models/tts/naturalspeech2/ns2_trainer.py, Line 436 in d335514.
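For readers unfamiliar with CE-RVQ: conceptually the loss is a cross-entropy between the predicted per-codebook distributions and the target indices, averaged over quantizer levels. A numpy sketch under assumed shapes (this is not the actual `diff_ce_loss` implementation):

```python
import numpy as np

def ce_rvq_sketch(pred_logits, gt_indices):
    """Mean cross-entropy over RVQ levels.
    pred_logits: (Q, T, V) unnormalized scores per quantizer level;
    gt_indices:  (Q, T) target codebook indices.
    Shapes and names are assumptions; the real diff_ce_loss differs in detail."""
    # Numerically stable log-softmax over the codebook dimension.
    m = pred_logits.max(axis=-1, keepdims=True)
    logp = pred_logits - m - np.log(np.exp(pred_logits - m).sum(axis=-1, keepdims=True))
    Q, T, _ = pred_logits.shape
    # Gather the log-probability of each target index, negate, average.
    nll = -logp[np.arange(Q)[:, None], np.arange(T)[None, :], gt_indices]
    return nll.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 10, 1024))   # 16 levels, 10 frames, 1024-entry codebooks
targets = rng.integers(0, 1024, size=(16, 10))
print(ce_rvq_sketch(logits, targets))      # positive scalar, roughly log(1024)
```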
Now, ideally, the reconstructed code should match the loaded ground truth code. However, in practice I've observed that the codes are different - they only match roughly 25% of the time. This is not a major issue per se, since if you just decode the codes and listen to the wavs, there is no perceptible difference.
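The imperfect agreement is actually expected: RVQ encoding is greedy per stage, so re-quantizing a latent that was reconstructed from codes need not recover the same indices. A toy numpy round trip with random codebooks (not EnCodec's actual quantizer) illustrates the effect:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: pick the nearest entry at each stage. (T, D) -> (Q, T)."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                                       # cb: (V, D)
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes)

def rvq_decode(codes, codebooks):
    """Sum the selected codebook vectors. (Q, T) -> (T, D)."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(64, 8)) for _ in range(4)]       # 4 levels, toy sizes
codes = np.stack([rng.integers(0, 64, size=32) for _ in range(4)])
x0 = rvq_decode(codes, codebooks)                              # "ground-truth latent"
codes2 = rvq_encode(x0, codebooks)                             # re-quantize it
match = (codes == codes2).mean()
print(match)   # typically well below 1.0: greedy stages pick different entries
```

With trained codebooks (as in EnCodec) the agreement is higher than in this random toy, but the same greedy-per-stage argument explains why a perfect match is not guaranteed.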
However, I was wondering: even if the reconstruction gave an exact match, what is the need to reconstruct? Why not pass the original `code` directly? Am I missing something? Thanks.