
[Idea] Basic timestamp validation #82

Open
misutoneko opened this issue Apr 17, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@misutoneko

I'm using whisper-timestamped with a somewhat extensive hodgepodge of preprocessing and postprocessing scripts.
I got to thinking that some of the anomalies these scripts handle could perhaps be alleviated in whisper-timestamped itself.
Actually it would be best to have no need for pre/postprocessing at all, but I'm not sure if that's realistic.
(Well, with better models, maybe...)

So, here's one example:
In .words.srt (or .words.json) there are sometimes instances where an utterance of a single word takes almost two seconds(!).
That is imo quite obviously wrong, so the postprocessing stage will split the file in half and reprocess both parts. Yeah, a bit crude perhaps, but it works well enough for me.
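A minimal sketch of that validation check. This assumes whisper-timestamped's JSON output has the usual `segments` → `words` structure with `start`/`end`/`text` keys per word; the threshold and function name are just illustrations of the heuristic described above, not anything shipped with the tool:

```python
import json

MAX_WORD_SECONDS = 2.0  # heuristic threshold from the discussion above

def flag_long_words(json_path, max_dur=MAX_WORD_SECONDS):
    """Return (start, end, text) for words whose duration looks implausible."""
    with open(json_path) as f:
        result = json.load(f)
    suspicious = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            if word["end"] - word["start"] > max_dur:
                suspicious.append((word["start"], word["end"], word["text"]))
    return suspicious
```

Anything this flags could then trigger the split-and-reprocess step on the offending clip.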

So that's just one, perhaps the most obvious, example. I have more of these corner cases if you're interested :D
(make a separate issue of each one?)

You could of course do some postprocessing in whisper-timestamped too, similar to what I now do with scripts. But maybe there are better ways to deal with these. Ofc there's always the alternative to just wait for better models that take care of petty issues like this :D

@darnn

darnn commented Apr 17, 2023

In the meanwhile, could you describe what you actually do to get better results right now?

@misutoneko
Author

Sure. I guess just releasing the code would be easier, but it's such an abomination that I won't pester the world with it :D

Here's the process briefly:
The main thing I do is preprocess the audio with (customized) libfvad and get a bunch of small .wav files back which I feed to whisper.
After whisper-timestamped has processed the clips, I then check the results and filter out the ones that seem dubious:
There can be some zero-size files, and some files have only one word like "You" or "Thanks for watching" etc.
If the clip is truly empty, it's discarded. If it's suspicious, I re-run whisper-timestamped with --language it, or with the large model.
Then there are a number of these timing-related anomalies that are dealt with in various ways. Usually just split-and-reprocess.

After all this, I still need to do some manual editing. Usually it's something very light though, like some word is missing or misspelt etc.
As a final stage, the small .srt clips are combined into a single .srt (and then potentially translated with opusMT).
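That final merge step (shifting each clip's cue times by its offset in the original audio and renumbering) could be sketched like this. The function and its `(srt_text, clip_start_seconds)` input shape are hypothetical, assuming you know each clip's start offset from the VAD stage:

```python
import re
from datetime import timedelta

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def _shift(match, offset):
    h, m, s, ms = map(int, match.groups())
    t = timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms) + offset
    total_ms = int(t.total_seconds() * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def merge_srt(clips):
    """clips: list of (srt_text, clip_start_seconds). Returns one renumbered SRT."""
    out, idx = [], 1
    for text, start in clips:
        offset = timedelta(seconds=start)
        for block in text.strip().split("\n\n"):
            lines = block.split("\n")
            lines[0] = str(idx)  # renumber the cue
            lines[1] = TS.sub(lambda m: _shift(m, offset), lines[1])  # shift both timestamps
            out.append("\n".join(lines))
            idx += 1
    return "\n\n".join(out) + "\n"
```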

There's certainly room for improvement. For example, I'm not using any initial prompt, but I think that could get rid of some spelling mistakes. I also haven't done anything with the confidence scores (or anything else having to do with json) yet.

@Jeronymous
Member

Thanks a lot @misutoneko for opening this issue.
Indeed there is a lot to do to post-process Whisper transcriptions, especially concerning hallucinations.
Food for thought!

@Jeronymous Jeronymous added the enhancement New feature or request label Apr 26, 2023
@misutoneko
Author

misutoneko commented May 25, 2023

The recent heuristics updates seem to have made a difference -- seems much better now, thanks 👍
I did update my Whisper and that probably helped too.

About the example I gave in my first post, I actually found an exception to this...
If there is an utterance of multiple alphanumeric characters in a row, they count as one word. So if there's something like a registry number or a phone number, it can easily exceed the "two seconds per word" rule.

EDIT: I've noticed that sometimes, the duration can be over 30 seconds for a single word. So it's sometimes really obvious.
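One way to reconcile the exception with the rule is to scale the allowed duration with the token's length, while keeping a hard ceiling for the truly obvious 30-second cases. The per-character rate and both bounds below are illustrative guesses, not tuned values:

```python
def max_plausible_duration(word, per_char=0.4, floor=2.0, ceiling=30.0):
    """Heuristic cap: long tokens (e.g. a spelled-out registry number like
    '4X7-9921B') get more time, but nothing is allowed 30+ seconds."""
    return min(max(floor, per_char * len(word)), ceiling)

def is_suspicious(word, start, end):
    return (end - start) > max_plausible_duration(word)
```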

@LaurinmyReha

This comment was marked as abuse.

@misutoneko
Author

Thanks, nice job. The fact that it needs to be a gated model is a bit lamentable though, as it will most likely hinder adoption.
But this might actually be the closest we can get to a "better models" type of solution.

@GioPetro

GioPetro commented Sep 9, 2024

@LaurinmyReha Great approach, but I missed how this is different from the original Whisper. Meaning, what was done to improve on it? Fine-tuned on which dataset? Or was it something different? Can you enlighten me?

@LaurinmyReha

This comment was marked as abuse.
