
Doubt in CE-RVQ loss when training NaturalSpeech 2 #222

Open · shreeshailgan opened this issue Jun 20, 2024 · 5 comments

@shreeshailgan
When training the NS2 model, the CE-RVQ loss is computed with the `diff_ce_loss` method:

```python
def diff_ce_loss(pred_dist, gt_indices, mask):
```

This function takes the ground-truth indices `gt_indices` and the predicted distribution `pred_dist`. For `gt_indices`, we could pass the loaded code tensor directly. Instead, what is actually passed is the code reconstructed from the ground-truth latent `x0`:

```python
gt_indices, _ = self.model.module.latent_to_code(x0, nq=code.shape[1])
```

The ground-truth latent `x0` is itself inferred earlier from the loaded code tensor:

```python
x0 = self.model.module.code_to_latent(code)
```

Ideally, the reconstructed code should match the loaded ground-truth code. In practice, however, I've observed that the codes differ: they only match roughly 25% of the time. This is not a major issue per se, since if you decode the codes and listen to the resulting wavs, there is no perceptible difference.

But even if the reconstruction gave an exact match, why reconstruct at all? Why not pass the original code directly? Am I missing something?

Thanks.
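The degree of round-trip mismatch described above can be quantified with a small sketch. The helper name and the toy lists are illustrative stand-ins for the `(n_q, T)` code tensors, not the repository's actual data:

```python
# Toy sketch (pure-Python stand-in for the (n_q, T) code tensors):
def code_agreement(gt_code, recon_code):
    """Fraction of positions where two equally-shaped code grids agree."""
    total = 0
    same = 0
    for gt_row, recon_row in zip(gt_code, recon_code):
        for g, r in zip(gt_row, recon_row):
            total += 1
            same += (g == r)
    return same / total

# Made-up codes, NOT real EnCodec output:
gt = [[1, 2, 3, 4], [5, 6, 7, 8]]
recon = [[1, 2, 0, 4], [5, 0, 7, 8]]
assert code_agreement(gt, recon) == 0.75
```

Running the same comparison on the loaded codes versus the `latent_to_code(code_to_latent(code))` round trip is what produced the ~25% figure mentioned above.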

@chazo1994

@shreeshailgan How do you generate the codes during preprocessing? If you use the `extract_encodec_token` function, you have to change the target bandwidth to 12.0 to match the NaturalSpeech 2 model.

Also, can you share the code for duration preparation for the LibriTTS dataset?
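For reference, the target bandwidth matters because it determines how many residual codebooks EnCodec uses: the 24 kHz model emits 75 code frames per second, and each codebook index is 10 bits (1024 entries). A quick sketch of that arithmetic:

```python
# Number of residual quantizers EnCodec uses at a given target bandwidth
# (24 kHz model: 75 code frames/s, 10 bits per codebook index).
def num_quantizers(bandwidth_kbps, frame_rate=75, bits_per_codebook=10):
    return int(bandwidth_kbps * 1000 // (frame_rate * bits_per_codebook))

assert num_quantizers(6.0) == 8    # default bandwidth -> 8 codebooks
assert num_quantizers(12.0) == 16  # what NS2 expects -> 16 codebooks
```

So codes extracted at the default 6.0 kbps have only 8 codebook rows, which will not match a model trained to consume 16.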

@shreeshailgan

shreeshailgan commented Jun 23, 2024

@chazo1994 I am just running the code from EnCodec's documentation directly to extract codes. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2.

@QingliangMeng

> @chazo1994 I am just directly running the code from encodec's documentation to extract codes. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2

I also found a problem. I changed the target bandwidth to 12.0 to match the NaturalSpeech 2 model and used FastSpeech 2 to preprocess the duration and pitch features. But when I started training, I found that the code features had fewer frames than the duration and pitch features. This trips the assertion in `align_length` in `ns2_dataset`:

```python
if dur_sum > min_len:
    assert (duration[-1] - (dur_sum - min_len)) >= 0
    duration[-1] = duration[-1] - (dur_sum - min_len)
    assert duration[-1] >= 0
```

Is this a step I did wrong? I am confused and looking forward to your reply. Thanks!
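To make the failure mode concrete, here is the assertion's arithmetic on toy numbers (all values hypothetical):

```python
# The assertion fails when the total duration overshoots the shortest
# feature length by more than the last phone's duration. Toy numbers:
duration = [3, 4, 2]       # per-phone frame counts (hypothetical)
min_len = 7                # shortest of the code/pitch/duration lengths
dur_sum = sum(duration)    # 9

excess = dur_sum - min_len  # 2 frames too many
# Trimming succeeds only if the last phone can absorb the excess:
assert duration[-1] - excess >= 0  # 2 - 2 == 0, OK here
duration[-1] -= excess
assert duration == [3, 4, 0]

# With a larger gap (e.g. min_len = 5, so excess = 4 > duration[-1] = 2),
# the first assertion would raise, which is the error reported above.
```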

@shreeshailgan

Yes, I have faced this problem too. The code features mostly match the length of the pitch features, but sometimes they have one extra frame. In such cases I just trim the code features: `code = code[:sum(duration)]`. Since code lengths are usually in the hundreds, my assumption is that trimming one frame from the end wouldn't hurt.

```python
if dur_sum > min_len:
    assert (duration[-1] - (dur_sum - min_len)) >= 0
    duration[-1] = duration[-1] - (dur_sum - min_len)
    assert duration[-1] >= 0
```

I faced this error too. I don't remember now what caused it, but it was present in very few examples (~2-3 data points in a dataset with 200k training examples), so I just removed those examples before training.
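A minimal sketch of that workaround, with illustrative helper names and toy list-based data (not the repository's actual tensors):

```python
def trim_code_to_duration(code, duration):
    """Drop trailing code frames so the code matches sum(duration)."""
    target_len = sum(duration)
    return code[:target_len]

def is_alignable(duration, min_len):
    """True if the last phone can absorb the length overshoot."""
    dur_sum = sum(duration)
    return dur_sum <= min_len or duration[-1] >= dur_sum - min_len

code = list(range(10))                 # 10 code frames (toy data)
duration = [4, 5]                      # expects 9 frames
assert trim_code_to_duration(code, duration) == list(range(9))
assert is_alignable([3, 4, 2], 7)      # overshoot 2 <= last phone 2
assert not is_alignable([3, 4, 2], 5)  # overshoot 4 > last phone 2
```

Filtering out the few examples where `is_alignable` is false before training corresponds to the removal step described above.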

@QingliangMeng

> Yes, I have faced this problem too. The code features will match the length of pitch features mostly, but sometimes, they can have 1 extra frame. In such cases I just trim the code features `code = code[:sum(duration)]`. Since code lengths are usually in the hundreds, my assumption is that trimming one frame from the end wouldn't hurt.
>
> ```python
> if dur_sum > min_len:
>     assert (duration[-1] - (dur_sum - min_len)) >= 0
>     duration[-1] = duration[-1] - (dur_sum - min_len)
>     assert duration[-1] >= 0
> ```
>
> I faced this error too. Don't remember now what the cause of this error is. But this error was present in very few examples (~2-3 data points in a dataset with 200k training examples). So, I just removed these examples before training.

Thank you for your reply. It is probably because of the LJSpeech data I used, but I feel that the `align_length` function has some defects. I am not sure whether it is caused by the convolutional downsampling design of EnCodec. So I rewrote the code/duration `align_length` function this morning so that it can force alignment even when the code and duration have a huge gap.
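For anyone hitting the same gap, one possible force-alignment sketch (illustrative only, not the rewritten function from the repository) is to spread the length difference over phones from the end:

```python
def force_align_duration(duration, target_len):
    """Force sum(duration) to equal target_len, even for large gaps."""
    duration = list(duration)
    diff = sum(duration) - target_len
    if diff > 0:
        # Trim from the last phone backwards, never going below zero.
        for i in range(len(duration) - 1, -1, -1):
            take = min(duration[i], diff)
            duration[i] -= take
            diff -= take
            if diff == 0:
                break
    elif diff < 0:
        # Pad the last phone when the code is longer than the duration.
        duration[-1] += -diff
    assert sum(duration) == target_len
    return duration

assert force_align_duration([3, 4, 2], 7) == [3, 4, 0]   # small overshoot
assert force_align_duration([3, 4, 2], 5) == [3, 2, 0]   # gap > last phone
assert force_align_duration([3, 4, 2], 11) == [3, 4, 4]  # code longer
```

Unlike the original assertion, this never raises for an overshoot larger than the last phone; whether silently rewriting durations like this is acceptable depends on how noisy the alignments already are.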
