
Doubt in CE-RVQ loss when training NaturalSpeech 2 #222

Open · shreeshailgan opened this issue Jun 20, 2024 · 5 comments

@shreeshailgan
When training the NS2 model, the CE-RVQ loss is computed with the `diff_ce_loss` method:

```python
def diff_ce_loss(pred_dist, gt_indices, mask):
```

This function takes the ground-truth indices `gt_indices` and the predicted distribution `pred_dist`. For `gt_indices`, we could pass the loaded code tensor directly. Instead, what is actually passed is the code reconstructed from the ground-truth latent `x0`:

```python
gt_indices, _ = self.model.module.latent_to_code(x0, nq=code.shape[1])
```

The ground-truth latent `x0` is itself inferred earlier from the loaded code tensor:

```python
x0 = self.model.module.code_to_latent(code)
```

Ideally, the reconstructed code should match the loaded ground-truth code. In practice, however, I've observed that the codes differ: they only match roughly 25% of the time. This is not a major issue per se, since if you decode the codes and listen to the resulting wavs, there is no perceptible difference.

But even if the reconstruction gave an exact match, why reconstruct at all? Why not pass the original code directly? Am I missing something?

Thanks.
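The degree of round-trip mismatch described above can be quantified with a small sketch. The helper name and the toy lists are illustrative stand-ins for the `(n_q, T)` code tensors, not the repository's actual data:

```python
# Toy sketch (pure-Python stand-in for the (n_q, T) code tensors):
def code_agreement(gt_code, recon_code):
    """Fraction of positions where two equally-shaped code grids agree."""
    total = 0
    same = 0
    for gt_row, recon_row in zip(gt_code, recon_code):
        for g, r in zip(gt_row, recon_row):
            total += 1
            same += (g == r)
    return same / total

# Made-up codes, NOT real EnCodec output:
gt = [[1, 2, 3, 4], [5, 6, 7, 8]]
recon = [[1, 2, 0, 4], [5, 0, 7, 8]]
assert code_agreement(gt, recon) == 0.75
```

Running the same comparison on the loaded codes versus the `latent_to_code(code_to_latent(code))` round trip is what produced the ~25% figure mentioned above.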

@chazo1994

@shreeshailgan How do you generate the codes during preprocessing? If you use the `extract_encodec_token` function, you have to change the target bandwidth to 12.0 to match the NaturalSpeech 2 model.

Also, can you share the code for duration preparation for the LibriTTS dataset?
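For reference, the target bandwidth matters because it determines how many residual codebooks EnCodec uses: the 24 kHz model emits 75 code frames per second, and each codebook index is 10 bits (1024 entries). A quick sketch of that arithmetic:

```python
# Number of residual quantizers EnCodec uses at a given target bandwidth
# (24 kHz model: 75 code frames/s, 10 bits per codebook index).
def num_quantizers(bandwidth_kbps, frame_rate=75, bits_per_codebook=10):
    return int(bandwidth_kbps * 1000 // (frame_rate * bits_per_codebook))

assert num_quantizers(6.0) == 8    # default bandwidth -> 8 codebooks
assert num_quantizers(12.0) == 16  # what NS2 expects -> 16 codebooks
```

So codes extracted at the default 6.0 kbps have only 8 codebook rows, which will not match a model trained to consume 16.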

@shreeshailgan

shreeshailgan commented Jun 23, 2024

@chazo1994 I am just running the code from EnCodec's documentation directly to extract codes. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2.

@QingliangMeng

> @chazo1994 I am just directly running the code from encodec's documentation to extract codes. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2

I also found a problem. I changed the target bandwidth to 12.0 to match the NaturalSpeech 2 model and used FastSpeech 2 to preprocess the duration and pitch features. But when I started training, I found that the code features had fewer frames than the duration and pitch features. This trips the assertion in `align_length` in `ns2_dataset`:

```python
if dur_sum > min_len:
    assert (duration[-1] - (dur_sum - min_len)) >= 0
    duration[-1] = duration[-1] - (dur_sum - min_len)
    assert duration[-1] >= 0
```

Is this a step I did wrong? I am confused and looking forward to your reply. Thanks!
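To make the failure mode concrete, here is the assertion's arithmetic on toy numbers (all values hypothetical):

```python
# The assertion fails when the total duration overshoots the shortest
# feature length by more than the last phone's duration. Toy numbers:
duration = [3, 4, 2]       # per-phone frame counts (hypothetical)
min_len = 7                # shortest of the code/pitch/duration lengths
dur_sum = sum(duration)    # 9

excess = dur_sum - min_len  # 2 frames too many
# Trimming succeeds only if the last phone can absorb the excess:
assert duration[-1] - excess >= 0  # 2 - 2 == 0, OK here
duration[-1] -= excess
assert duration == [3, 4, 0]

# With a larger gap (e.g. min_len = 5, so excess = 4 > duration[-1] = 2),
# the first assertion would raise, which is the error reported above.
```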

@shreeshailgan

Yes, I have faced this problem too. The code features mostly match the length of the pitch features, but sometimes they have one extra frame. In such cases I just trim the code features: `code = code[:sum(duration)]`. Since code lengths are usually in the hundreds, my assumption is that trimming one frame from the end wouldn't hurt.

```python
if dur_sum > min_len:
    assert (duration[-1] - (dur_sum - min_len)) >= 0
    duration[-1] = duration[-1] - (dur_sum - min_len)
    assert duration[-1] >= 0
```

I faced this error too. I don't remember now what caused it, but it was present in very few examples (~2-3 data points in a dataset with 200k training examples), so I just removed those examples before training.
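A minimal sketch of that workaround, with illustrative helper names and toy list-based data (not the repository's actual tensors):

```python
def trim_code_to_duration(code, duration):
    """Drop trailing code frames so the code matches sum(duration)."""
    target_len = sum(duration)
    return code[:target_len]

def is_alignable(duration, min_len):
    """True if the last phone can absorb the length overshoot."""
    dur_sum = sum(duration)
    return dur_sum <= min_len or duration[-1] >= dur_sum - min_len

code = list(range(10))                 # 10 code frames (toy data)
duration = [4, 5]                      # expects 9 frames
assert trim_code_to_duration(code, duration) == list(range(9))
assert is_alignable([3, 4, 2], 7)      # overshoot 2 <= last phone 2
assert not is_alignable([3, 4, 2], 5)  # overshoot 4 > last phone 2
```

Filtering out the few examples where `is_alignable` is false before training corresponds to the removal step described above.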

@QingliangMeng

> Yes, I have faced this problem too. The code features will match the length of pitch features mostly, but sometimes, they can have 1 extra frame. In such cases I just trim the code features `code = code[:sum(duration)]`. Since code lengths are usually in the hundreds, my assumption is that trimming one frame from the end wouldn't hurt.
>
> ```python
> if dur_sum > min_len:
>     assert (duration[-1] - (dur_sum - min_len)) >= 0
>     duration[-1] = duration[-1] - (dur_sum - min_len)
>     assert duration[-1] >= 0
> ```
>
> I faced this error too. Don't remember now what the cause of this error is. But this error was present in very few examples (~2-3 data points in a dataset with 200k training examples). So, I just removed these examples before training.

Thank you for your reply. It is probably because of the LJSpeech data I used, but I feel that the `align_length` function has some defects. I am not sure whether it is caused by the convolutional downsampling design of EnCodec. So I rewrote the code/duration `align_length` function this morning so that it can force alignment even when the code and duration have a huge gap.
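For anyone hitting the same gap, one possible force-alignment sketch (illustrative only, not the rewritten function from the repository) is to spread the length difference over phones from the end:

```python
def force_align_duration(duration, target_len):
    """Force sum(duration) to equal target_len, even for large gaps."""
    duration = list(duration)
    diff = sum(duration) - target_len
    if diff > 0:
        # Trim from the last phone backwards, never going below zero.
        for i in range(len(duration) - 1, -1, -1):
            take = min(duration[i], diff)
            duration[i] -= take
            diff -= take
            if diff == 0:
                break
    elif diff < 0:
        # Pad the last phone when the code is longer than the duration.
        duration[-1] += -diff
    assert sum(duration) == target_len
    return duration

assert force_align_duration([3, 4, 2], 7) == [3, 4, 0]   # small overshoot
assert force_align_duration([3, 4, 2], 5) == [3, 2, 0]   # gap > last phone
assert force_align_duration([3, 4, 2], 11) == [3, 4, 4]  # code longer
```

Unlike the original assertion, this never raises for an overshoot larger than the last phone; whether silently rewriting durations like this is acceptable depends on how noisy the alignments already are.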
