
[Idea] Basic timestamp validation #82

Open
misutoneko opened this issue Apr 17, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@misutoneko

I'm using whisper-timestamped with a somewhat extensive hodgepodge of preprocessing and postprocessing scripts.
I got to thinking that some of the anomalies these scripts handle could perhaps be alleviated in whisper-timestamped itself.
Actually it would be best to have no need for pre/postprocessing at all, but I'm not sure if that's realistic.
(Well, with better models, maybe...)

So, here's one example:
In .words.srt (or .words.json) there are sometimes instances where an utterance of a single word takes almost two seconds(!).
That is imo quite obviously wrong, so the postprocessing stage will split the file in half and reprocess both parts. Yeah, a bit crude perhaps, but it works well enough for me.
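A minimal sketch of that validation check. This assumes whisper-timestamped's JSON output has the usual `segments` → `words` structure with `start`/`end`/`text` keys per word; the threshold and function name are just illustrations of the heuristic described above, not anything shipped with the tool:

```python
import json

MAX_WORD_SECONDS = 2.0  # heuristic threshold from the discussion above

def flag_long_words(json_path, max_dur=MAX_WORD_SECONDS):
    """Return (start, end, text) for words whose duration looks implausible."""
    with open(json_path) as f:
        result = json.load(f)
    suspicious = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            if word["end"] - word["start"] > max_dur:
                suspicious.append((word["start"], word["end"], word["text"]))
    return suspicious
```

Anything this flags could then trigger the split-and-reprocess step on the offending clip.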

So that's just one, perhaps the most obvious, example. I have more of these corner cases if you're interested :D
(make a separate issue of each one?)

You could of course do some postprocessing in whisper-timestamped too, similar to what I now do with scripts. But maybe there are better ways to deal with these. Ofc there's always the alternative to just wait for better models that take care of petty issues like this :D

@darnn

darnn commented Apr 17, 2023

In the meanwhile, could you describe what you actually do to get better results right now?

@misutoneko
Author

Sure. I guess just releasing the code would be easier, but it's such an abomination that I won't pester the world with it :D

Here's the process briefly:
The main thing I do is preprocess the audio with (customized) libfvad and get a bunch of small .wav files back which I feed to whisper.
After whisper-timestamped has processed the clips, I then check the results and filter out the ones that seem dubious:
There can be some zero-size files, and some files have only one word like "You" or "Thanks for watching" etc.
If the clip is truly empty, it's discarded. If it's suspicious, I re-run whisper-timestamped with --language it, or with the large model.
Then there are a number of these timing-related anomalies that are dealt with in various ways. Usually just split-and-reprocess.

After all this, I still need to do some manual editing. Usually it's something very light though, like some word is missing or misspelt etc.
As a final stage, the small .srt clips are combined into a single .srt (and then potentially translated with opusMT).
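That final merge step (shifting each clip's cue times by its offset in the original audio and renumbering) could be sketched like this. The function and its `(srt_text, clip_start_seconds)` input shape are hypothetical, assuming you know each clip's start offset from the VAD stage:

```python
import re
from datetime import timedelta

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def _shift(match, offset):
    h, m, s, ms = map(int, match.groups())
    t = timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms) + offset
    total_ms = int(t.total_seconds() * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def merge_srt(clips):
    """clips: list of (srt_text, clip_start_seconds). Returns one renumbered SRT."""
    out, idx = [], 1
    for text, start in clips:
        offset = timedelta(seconds=start)
        for block in text.strip().split("\n\n"):
            lines = block.split("\n")
            lines[0] = str(idx)  # renumber the cue
            lines[1] = TS.sub(lambda m: _shift(m, offset), lines[1])  # shift both timestamps
            out.append("\n".join(lines))
            idx += 1
    return "\n\n".join(out) + "\n"
```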

There's certainly room for improvement. For example, I'm not using any initial prompt, but I think that could get rid of some spelling mistakes. I also haven't done anything with the confidence scores (or anything else having to do with json) yet.

@Jeronymous
Member

Thanks a lot @misutoneko for opening this issue.
Indeed there is a lot to do to post-process Whisper transcriptions, especially concerning hallucinations.
Food for thought!

@Jeronymous Jeronymous added the enhancement New feature or request label Apr 26, 2023
@misutoneko
Author

misutoneko commented May 25, 2023

The recent heuristics updates seem to have made a difference -- seems much better now, thanks 👍
I did update my Whisper and that probably helped too.

About the example I gave in my first post, I actually found an exception to this...
If there is an utterance of multiple alphanumeric characters in a row, they count as one word. So if there's something like a registry number or a phone number, it can easily exceed the "two seconds per word" rule.

EDIT: I've noticed that sometimes, the duration can be over 30 seconds for a single word. So it's sometimes really obvious.
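One way to reconcile the exception with the rule is to scale the allowed duration with the token's length, while keeping a hard ceiling for the truly obvious 30-second cases. The per-character rate and both bounds below are illustrative guesses, not tuned values:

```python
def max_plausible_duration(word, per_char=0.4, floor=2.0, ceiling=30.0):
    """Heuristic cap: long tokens (e.g. a spelled-out registry number like
    '4X7-9921B') get more time, but nothing is allowed 30+ seconds."""
    return min(max(floor, per_char * len(word)), ceiling)

def is_suspicious(word, start, end):
    return (end - start) > max_plausible_duration(word)
```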

@LaurinmyReha

This comment was marked as abuse.

@misutoneko
Author

Thanks, nice job. The fact that it needs to be a gated model is a bit lamentable though, as it will most likely hinder adoption.
But this might actually be the closest we can get to a "better models" type of solution.

@GioPetro

GioPetro commented Sep 9, 2024

@LaurinmyReha Great approach, but I missed how this is different from the original Whisper. Meaning, what was done to improve on it? Fine-tuned on which dataset? Or was it something different? Can you enlighten me?

@LaurinmyReha

This comment was marked as abuse.
