This issue was moved to a discussion. You can continue the conversation there.

option --suppress_token to reduce hallucinations / output special noise descriptions #105

Closed
misutoneko opened this issue Jun 23, 2023 · 5 comments
Labels
documentation Improvements or additions to documentation question Further information is requested

Comments


misutoneko commented Jun 23, 2023

Hi again,

I've now (finally) taken a peek at the .words.json files, and it immediately paid off :D
I noticed that with the medium model (with --language en), the first token is always 50364.
It seems to be some kind of special token, but I couldn't find any direct references to it, nor do I have any idea where it comes from.

Long story short: if I suppress this token, it completely eliminates hallucinations on non-speech clips.
The clip gets a reasonable description of any noise or music instead => yay :D

So, is there a reason for this token to exist? Perhaps it should be suppressed by default.
I haven't noticed any downsides to suppressing it, though I suppose some utterances might go undetected if they genuinely contain this token.

EDIT:
It seems this token is the same in all the multilingual models.
For English-only models the token is 50363 (I didn't test the large models, but they're probably the same).


Jeronymous commented Jun 23, 2023

First, here is a small piece of code showing what this token means:

import whisper
tokenizer = whisper.tokenizer.get_tokenizer(True, task="transcribe", language="en")  # True => multilingual tokenizer
tokenizer.decode_with_timestamps([50364])
# Out[3]: '<|0.00|>'

So this token is the first timestamp token: it means we are at time 0.00, and we want the Whisper decoder to predict timestamps at the end of each segment.
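For context, the timestamp tokens form a contiguous range starting at the "timestamp begin" id (50364 for the multilingual vocabularies, 50363 for the English-only ones), with each successive token worth 0.02 s. A minimal sketch of that arithmetic, without needing the whisper package installed:

```python
# Sketch of Whisper's timestamp-token arithmetic.
# Assumption: timestamp tokens are spaced 0.02 s apart starting at
# the "timestamp begin" id of the multilingual vocabulary.
TIMESTAMP_BEGIN = 50364  # 50363 for English-only models

def token_to_time(token_id: int) -> float:
    """Convert a timestamp token id to seconds."""
    if token_id < TIMESTAMP_BEGIN:
        raise ValueError("not a timestamp token")
    return (token_id - TIMESTAMP_BEGIN) * 0.02

def time_to_token(seconds: float) -> int:
    """Convert seconds to the nearest timestamp token id."""
    return TIMESTAMP_BEGIN + round(seconds / 0.02)

print(token_to_time(50364))  # 0.0  -> '<|0.00|>'
print(token_to_time(50365))  # 0.02 -> '<|0.02|>'
print(time_to_token(1.0))    # 50414
```

This also explains why suppressing 50364 only forbids a segment from starting exactly at 0.00: the neighboring tokens (<|0.02|>, <|0.04|>, ...) remain available.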

Now I don't understand what you mean by "suppress this token": how do you do it?

There is a mode in which the Whisper model can predict the transcription without predicting timestamps.
I can imagine that it changes the behavior, and you seem to say that it reduces hallucinations (which is good to know).
But whisper-timestamped is unusable in that mode.
So again: what exactly do you do?

@misutoneko (Author)

Thanks! OK, that's interesting. Actually, the only thing I did was add the parameter --suppress_tokens 50364.
My use case is a bit special in that I do the transcription on very short clips (as described in #82).

Here's the whole invocation (for a single sample):
CUDA_VISIBLE_DEVICES=-1 /usr/local/bin/whisper_timestamped --threads 2 --device cpu --output_format srt,json --language en --model medium --vad True --suppress_tokens 50364 --output_dir . clip_19_3127.wav
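As far as I understand, whisper treats the --suppress_tokens value as a comma-separated list of token ids, with -1 standing for a built-in set of non-speech tokens. A rough pure-Python sketch of that parsing (the NON_SPEECH_TOKENS ids here are made-up placeholders, not the real list):

```python
# Hedged sketch of how a "--suppress_tokens" string could be parsed,
# modeled on openai-whisper's behavior: a comma-separated list of ids,
# where -1 expands to the tokenizer's non-speech token set.
NON_SPEECH_TOKENS = {50256, 50362}  # placeholder ids, NOT the real set

def parse_suppress_tokens(value: str) -> list[int]:
    if value == "":
        return []  # suppress nothing
    tokens = [int(t) for t in value.split(",")]
    if -1 in tokens:
        tokens.remove(-1)
        tokens.extend(NON_SPEECH_TOKENS)
    return sorted(set(tokens))

print(parse_suppress_tokens("50364"))     # [50364]
print(parse_suppress_tokens(""))          # []
print(parse_suppress_tokens("-1,50364"))  # non-speech ids plus 50364
```

So "--suppress_tokens 50364" both drops the default -1 expansion (re-enabling the non-speech words) and blocks the <|0.00|> token, which matches what is described below.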


Jeronymous commented Jun 26, 2023

Thanks a lot @misutoneko for the clarification.

Okay, so it's getting really interesting.
When you use "--suppress_tokens 50364", you are actually doing two things:

  1. You remove Whisper's ability to start a segment at <|0.00|>.
    I played a bit with that and saw that, in practice, early starts are predicted at <|0.02|>, <|0.04|>, ... instead.
  2. You allow Whisper to decode non-speech tokens (special words like "*noise*"), because suppress_tokens is -1 by default, which means suppressing the tokens that do not correspond to text.
    (Note: if you want to suppress the prediction of 50364 and also those non-speech tokens, you can use "--suppress_tokens -1,50364".)

It seems that the first point has no influence on hallucinations on silences, but the second one does.
This is a big discovery for me.
But it is still too early for me to draw conclusions and adapt the code. I am not used to seeing those special words in the transcription, so I have to run some experiments.

This also gave me the idea of making Whisper decode in the "<no timestamp>" mode (which is what I thought you were doing, in my first comment): it could also change Whisper's behavior with respect to hallucinations and omissions, because I guess the data used to train that mode could be different from the YouTube subtitled videos used to train the "with timestamps" mode.
(Reminder: all the predictions like "thanks for watching this video" that we see on silences occur because of subtitle biases in the training data.)

@misutoneko (Author)

Very nice, I knew you'd make sense of it :D
So it looks like I could have used --suppress_tokens "" just as well.
(The results will be slightly different, of course, since nothing gets suppressed then.)

@Jeronymous (Member)

Yes exactly.

On my side, I played a bit with suppress_tokens and was disappointed.
In some cases it does not remove hallucinations (at least with "accurate" decoding; it seems you are using the more "efficient" one), and in some cases it replaces hallucinations with those special words, which are not necessarily easier to filter out.
These statistical models are unpredictable...

But I'm happy for you if you have a great experience with this.

@Jeronymous Jeronymous changed the title Significance of token 50364? option --suppress_token to reduce hallucinations / output special noise descriptions Jun 27, 2023
@Jeronymous Jeronymous added documentation Improvements or additions to documentation question Further information is requested labels Jun 27, 2023
@linto-ai linto-ai locked and limited conversation to collaborators Jun 27, 2023
@Jeronymous Jeronymous converted this issue into discussion #107 Jun 27, 2023
