Eager Streaming Mode relies on confirming the currently predicted text tokens against at least one redundant historical prediction.
Whisper is susceptible to outputting tokens that differ trivially (e.g. "gonna" vs. "going to", "amortisation" vs. "amortization") for nearly identical audio input. This happens occasionally and causes unnecessary slowdown due to missed opportunities to confirm predicted text tokens earlier.
Memory and Latency Regression Tests #99 implements English text normalization, which could be integrated into the token confirmation logic in Eager Streaming Mode to avoid these unnecessary slowdowns.
Note that this would not alter the actually predicted tokens or the associated KV cache. It only relaxes the confirmation criterion so that near matches with trivial string variations count as confirmed.
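A minimal sketch of the idea: compare normalized text rather than raw strings when deciding whether a current prediction is confirmed by a historical one. The `normalize` function and its substitution table below are illustrative stand-ins, not the actual English text normalizer from #99, and `confirm_tokens` is a hypothetical helper name.

```python
import re

# Toy substitution table standing in for a full English text normalizer.
_SUBSTITUTIONS = {
    "gonna": "going to",
    "wanna": "want to",
    "amortisation": "amortization",
}

def normalize(text: str) -> str:
    """Illustrative normalization: lowercase, strip punctuation,
    and map trivial variants onto a canonical form."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    words = [_SUBSTITUTIONS.get(w, w) for w in text.split()]
    return " ".join(words)

def confirm_tokens(previous: list[str], current: list[str]) -> list[str]:
    """Return the longest prefix of `current` confirmed by `previous`,
    comparing normalized forms. Only the confirmation criterion changes;
    the raw predicted tokens (and any KV cache) stay untouched."""
    confirmed = []
    for prev_tok, cur_tok in zip(previous, current):
        if normalize(prev_tok) == normalize(cur_tok):
            confirmed.append(cur_tok)  # keep the raw current token
        else:
            break
    return confirmed

# "gonna" and "going to" normalize to the same string, so the whole
# prefix is confirmed instead of stalling at the trivial variation.
print(confirm_tokens(["I'm", "gonna", "go"], ["I'm", "going to", "go"]))
```

With plain string equality, the second token would fail to match and confirmation would stall; after normalization the full prefix is confirmed, so the earlier-confirmation opportunity is no longer missed.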