
Skip silence around hallucinations #1838
Merged (4 commits) Dec 18, 2023
Conversation

ryanheise
Contributor

This PR introduces a heuristic that determines if a segment is probably a hallucination. If that "probable" hallucination occurs after a period of silence (specified by --hallucination_silence_threshold in seconds), then we seek past the silence and reprocess from that point. Eliminating the silence before a hallucination improves the likelihood of getting a correct inference, but since this also requires extra processing time, we only do this when a probable hallucination is detected.

The heuristic itself is based on the observation that words in a hallucination are often either extremely long, extremely short, or of extremely low probability. The probabilities of later words may be less reliable, so we consider only the first 8 words of a segment and look for a certain threshold of anomalies within them.
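The per-word scoring behind this heuristic can be sketched as follows. This is a hedged reconstruction from the description above; the exact thresholds in the merged code may differ:

```python
# Hedged sketch of the anomaly scoring described above; thresholds are
# illustrative, not necessarily those in the merged code.
def word_anomaly_score(word: dict) -> float:
    probability = word.get("probability", 0.0)
    duration = word["end"] - word["start"]
    score = 0.0
    if probability < 0.15:            # extremely low-probability word
        score += 1.0
    if duration < 0.133:              # extremely short word
        score += (0.133 - duration) * 15
    if duration > 2.0:                # extremely long word
        score += duration - 2.0
    return score

# A confident, normal-length word scores 0:
print(word_anomaly_score({"probability": 0.9, "start": 1.0, "end": 1.4}))   # 0.0
# A low-probability, drawn-out word accumulates anomaly score:
print(word_anomaly_score({"probability": 0.05, "start": 1.0, "end": 4.0}))  # 2.0
```

The segment-level check then sums these scores over the first 8 non-punctuation words and flags the segment when the total crosses a threshold.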

Below are some successive test runs on the audio sample in #1783 with --word_timestamps True --hallucination_silence_threshold 2. It includes debug output to show when hallucinations were detected. I can confirm that the results are better on v2 than v3.

Sample output
v2 runs:


DETECTED HALLUCINATION: 就這樣子
[00:56.120 --> 00:57.400] 最为开放的领域
[00:57.920 --> 00:59.720] 一个汇聚精英的世界
[01:00.940 --> 01:02.480] 技术与资本联姻
[01:02.480 --> 01:04.640] 成功与风险共存
[01:05.220 --> 01:07.520] 用睿智投资未来
[01:08.200 --> 01:10.980] 百名数字英雄汇聚精诚
[01:10.980 --> 01:12.200] 献计献策
[01:12.200 --> 01:13.900] 共商发展
[01:14.520 --> 01:17.780] 百家讲坛为您奉上精彩系列节目
[01:17.780 --> 01:21.120] 2001北京互联网发展论坛
[01:21.120 --> 01:22.740] 敬请关注
DETECTED HALLUCINATION: 汇聚精英 汇聚精英
[01:51.720 --> 01:53.100] 1995年10月回国
[01:53.720 --> 01:56.420] 1998年成功推出搜狐网站
[01:56.980 --> 01:57.940] 同年10月
[01:57.940 --> 01:59.980] 被美国《时代周刊》评为



DETECTED HALLUCINATION: 了解更多,請訂閱谷歌頻道、按讚、分享、留言 decky born 會開始平台 Seagull Quiz
DETECTED HALLUCINATION: 了解更多,請訂閱谷歌頻道、按讚、分享、留言 decky born 會開始平台 Seagull Quiz
DETECTED HALLUCINATION: 中文字幕由 Amara.org 社群提供
[00:54.080 --> 00:54.920] 互联网
[00:55.600 --> 00:57.400] 一个最为开放的领域
[00:57.900 --> 00:59.740] 一个汇聚精英的世界
[01:01.000 --> 01:02.460] 技术与资本联姻
[01:02.460 --> 01:04.640] 成功与风险共存
[01:05.100 --> 01:07.740] 用睿智投资未来
[01:08.240 --> 01:10.980] 百名数字英雄汇聚精诚
[01:10.980 --> 01:12.200] 献计献策
[01:12.200 --> 01:13.600] 共商发展
[01:14.380 --> 01:17.740] 百家讲坛为您奉上精彩系列节目
[01:19.080 --> 01:21.120] 北京互联网发展论坛
[01:21.120 --> 01:22.800] 敬请关注
[01:35.820 --> 01:36.680] 张昭阳
[01:37.200 --> 01:38.960] 苏护网首席执行官
[01:39.500 --> 01:42.220] 1981年考入清华大学物理系
[01:42.740 --> 01:46.560] 1986年考取李正道奖学金赴美留学
[01:47.140 --> 01:50.460] 七年后获麻省理工学院物理学博士
[01:51.300 --> 01:53.080] 1995年10月回国
[01:53.080 --> 01:56.420] 1998年成功推出苏护网站
[01:56.980 --> 01:59.980] 同年10月被美国时代周刊评为



DETECTED HALLUCINATION: 由 Amara.org 社群提供的字幕
DETECTED HALLUCINATION: 中文字幕由Amara.org社区提供
DETECTED HALLUCINATION: 中文字幕由 Amara.org 社群提供
[00:56.180 --> 00:57.360] 最为开放的领域
[00:57.860 --> 00:59.720] 一个汇聚精英的世界
[01:00.940 --> 01:02.480] 技术与资本联姻
[01:02.480 --> 01:04.640] 成功与风险共存
[01:05.280 --> 01:07.500] 用睿智投资未来
[01:08.200 --> 01:10.980] 百名数字英雄汇聚精诚
[01:10.980 --> 01:12.200] 献计献策
[01:12.200 --> 01:13.600] 共商发展
[01:14.340 --> 01:17.780] 百家讲坛为您奉上精彩系列节目
[01:17.780 --> 01:21.120] 2001北京互联网发展论坛
[01:21.120 --> 01:22.740] 敬请关注
DETECTED HALLUCINATION: 汇聚精英 汇聚精英
[01:53.420 --> 01:56.460] 1998年成功推出搜狐网站
[01:56.960 --> 01:57.920] 同年十月
[01:57.920 --> 01:59.980] 被美国时代周刊评为




DETECTED HALLUCINATION: 哦
DETECTED HALLUCINATION: 中文字幕志愿者申请
[00:58.420 --> 00:59.720] 汇聚精英的世界
[01:00.960 --> 01:02.480] 技术与资本联姻
[01:02.480 --> 01:04.640] 成功与风险共存
[01:04.640 --> 01:07.420] 用睿智投资未来
[01:08.200 --> 01:09.780] 百名数字英雄
[01:09.780 --> 01:10.980] 汇聚精诚
[01:10.980 --> 01:12.200] 献计献策
[01:12.200 --> 01:13.620] 共商发展
[01:14.340 --> 01:15.460] 百家讲坛
[01:15.460 --> 01:17.760] 为您奉上精彩系列节目
[01:18.420 --> 01:21.120] 2001北京互联网发展论坛
[01:21.120 --> 01:22.740] 敬请关注
DETECTED HALLUCINATION: 欢迎订阅
DETECTED HALLUCINATION: 欢迎订阅
DETECTED HALLUCINATION: 欢迎订阅
DETECTED HALLUCINATION: 欢迎订阅
DETECTED HALLUCINATION: 评委书记
[01:39.760 --> 01:42.200] 1981年考入清华大学物理系
[01:42.760 --> 01:44.000] 1986年
[01:44.000 --> 01:45.640] 考取李正道奖学金
[01:45.640 --> 01:46.580] 赴美留学
[01:47.160 --> 01:50.460] 7年后获麻省理工学院物理学博士
[01:51.340 --> 01:53.100] 1995年10月回国
[01:53.100 --> 01:54.580] 1998年
[01:54.580 --> 01:56.440] 成功推出搜狐网站
[01:56.960 --> 01:57.940] 同年10月
[01:57.940 --> 01:59.760] 为美国《时代周刊》
[01:59.760 --> 01:59.980] 评委书记




DETECTED HALLUCINATION: 3
DETECTED HALLUCINATION: 字幕提供者-雪眼唷
DETECTED HALLUCINATION: 中文字幕由 Amara.org 社群提供
[00:54.080 --> 00:54.920] 互联网
[00:55.600 --> 00:57.360] 一个最为开放的领域
[00:57.900 --> 00:59.740] 一个汇聚精英的世界
[01:00.980 --> 01:02.480] 技术与资本联姻
[01:02.480 --> 01:04.640] 成功与风险共存
[01:05.100 --> 01:07.540] 用睿智投资未来
[01:08.240 --> 01:10.980] 百名数字英雄汇聚精诚
[01:10.980 --> 01:12.200] 献计献策
[01:12.200 --> 01:13.560] 共商发展
[01:14.380 --> 01:17.740] 百家讲坛为您奉上精彩系列节目
[01:17.740 --> 01:21.120] 2001北京互联网发展论坛
[01:21.120 --> 01:22.780] 敬请关注
[01:35.780 --> 01:36.700] 张昭阳
[01:37.380 --> 01:38.940] 苏护网首席执行官
[01:39.460 --> 01:42.220] 1981年考入清华大学物理系
[01:42.900 --> 01:46.540] 1986年考取李正道奖学金赴美留学
DETECTED HALLUCINATION: 1997年获麻省理工学院物理学博士
DETECTED HALLUCINATION: 1997年获麻省理工学院物理学博士
DETECTED HALLUCINATION: 定位为苏护网
[01:47.960 --> 01:50.460] 1997年获麻省理工学院物理学博士
[01:51.300 --> 01:53.080] 1995年十月回国
[01:53.080 --> 01:56.420] 1998年成功推出苏护网站
[01:56.980 --> 01:59.460] 同年十月为美国时代周刊
[01:59.460 --> 01:59.980] 定位为苏护网



DETECTED HALLUCINATION: 嗯 рас
DETECTED HALLUCINATION: 嗯 рас
DETECTED HALLUCINATION: 嗯
DETECTED HALLUCINATION: 中文字幕由 Amara.org 社群提供
[00:54.080 --> 00:54.920] 互联网
[00:55.600 --> 00:57.400] 一个最为开放的领域
[00:57.900 --> 00:59.740] 一个汇聚精英的世界
[01:01.000 --> 01:02.460] 技术与资本联姻
[01:02.460 --> 01:04.640] 成功与风险共存
[01:05.100 --> 01:07.740] 用睿智投资未来
[01:08.240 --> 01:10.980] 百名数字英雄汇聚精诚
[01:10.980 --> 01:12.200] 献计献策
[01:12.200 --> 01:13.600] 共商发展
[01:14.380 --> 01:17.740] 百家讲坛为您奉上精彩系列节目
[01:19.080 --> 01:21.120] 北京互联网发展论坛
[01:21.120 --> 01:22.800] 敬请关注
[01:35.820 --> 01:36.680] 张昭阳
[01:37.200 --> 01:38.960] 苏护网首席执行官
[01:39.500 --> 01:42.220] 1981年考入清华大学物理系
[01:42.740 --> 01:46.560] 1986年考取李正道奖学金赴美留学
[01:47.140 --> 01:50.460] 七年后获麻省理工学院物理学博士
[01:51.300 --> 01:53.080] 1995年10月回国
[01:53.080 --> 01:56.420] 1998年成功推出苏护网站
[01:56.980 --> 01:59.980] 同年10月被美国时代周刊评为



[00:21.400 --> 00:23.300] 评估
DETECTED HALLUCINATION: 請不吝點贊訂閱轉發打賞支持明鏡與點點欄目
DETECTED HALLUCINATION: 中文字幕由 Amara.org 社群提供
[00:57.840 --> 00:59.720] 一个汇聚精英的世界
[01:00.940 --> 01:02.480] 技术与资本联姻
[01:02.480 --> 01:04.640] 成功与风险共存
[01:04.640 --> 01:07.480] 用睿智投资未来
[01:08.200 --> 01:10.980] 百名数字英雄汇聚精诚
[01:10.980 --> 01:12.180] 献计献策
[01:12.180 --> 01:13.600] 共商发展
[01:14.340 --> 01:17.760] 百家讲坛为您奉上精彩系列节目
[01:17.760 --> 01:21.120] 2001北京互联网发展论坛
[01:21.120 --> 01:22.740] 敬请关注
DETECTED HALLUCINATION: 汇聚精英 汇聚精英 汇聚精英
[01:54.240 --> 01:56.420] 2018年成功推出搜狐网站
[01:56.960 --> 01:59.980] 同年10月被美国时代周刊定为



v3:

DETECTED HALLUCINATION: 字幕由 Amara.org 社群提供
DETECTED HALLUCINATION: 字幕志愿者 杨栋梁
DETECTED HALLUCINATION: 优优独播剧场——YoYo Television Series Exclusive
DETECTED HALLUCINATION: 请不吝点赞 订阅 转发 打赏支持明镜与点点栏目
[01:03.980 --> 01:21.600] 请不吝点赞 订阅 转发 打赏支持明镜与点点栏目
DETECTED HALLUCINATION: 明镜需要您的支持 欢迎订阅明镜
[01:51.440 --> 01:52.980] 1995年10月回国
[01:52.980 --> 01:56.320] 1998年成功推出搜狐网站
[01:56.320 --> 01:59.940] 同年10月被美国时代周刊评为

Also, this PR includes an option --clip_timestamps to specify a list of clips within the audio file where inference should be applied, given in the format start,end,start,end,... (each timestamp in seconds). For example, --clip_timestamps 10.5,57.8,71,103 will only run inference on the region from 10.5 to 57.8 and on the region from 71 to 103. The final end defaults to the end of the file, so --clip_timestamps 30 will run inference from the 30 second mark to the end of the file. All timestamps remain relative to the original audio file. I found this option helpful when testing the hallucination heuristic above, but it would also be very useful for someone who wants to run their own VAD model on the audio first and then pass its output into --clip_timestamps.
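As a rough illustration of that format (the helper name parse_clip_timestamps is hypothetical, not part of this PR):

```python
# Hypothetical helper illustrating the --clip_timestamps format described
# above: "start,end,start,end,..." in seconds, with a trailing start whose
# end defaults to the end of the file.
def parse_clip_timestamps(spec: str, audio_duration: float) -> list:
    points = [float(t) for t in spec.split(",") if t]
    if len(points) % 2 == 1:
        points.append(audio_duration)  # final end defaults to end of file
    return list(zip(points[::2], points[1::2]))

print(parse_clip_timestamps("10.5,57.8,71,103", 120.0))  # [(10.5, 57.8), (71.0, 103.0)]
print(parse_clip_timestamps("30", 120.0))                # [(30.0, 120.0)]
```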

One interesting observation: when running the v3 model on the linked sample audio, we can get very good results if we choose a precise clip region around the 53 second mark (I can't remember the exact values now), but if we shift that region forward or backward by a very small amount, the v3 model can completely fail to notice anything that was actually uttered in the audio, suggesting it is very sensitive to clip boundaries.

I've put each option in a separate commit (happy to split into two PRs if preferred).

@ryanheise
Contributor Author

Testing on another example from #679 (comment)

Output
v2 runs:

[00:00.000 --> 00:05.660]  spero che si ripigli un attimo, ho schiacciato qualche tasto che non dovevo
DETECTED HALLUCINATION:  non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho
DETECTED HALLUCINATION:  no
DETECTED HALLUCINATION:  no



[00:00.000 --> 00:05.660]  spero che si ripigli un attimo, ho schiacciato qualche tasto che non dovevo
DETECTED HALLUCINATION:  non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho
DETECTED HALLUCINATION:  .....
DETECTED HALLUCINATION:  .....



[00:00.000 --> 00:05.660]  spero che si ripigli un attimo, ho schiacciato qualche tasto che non dovevo
DETECTED HALLUCINATION:  non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho capito, non ho
DETECTED HALLUCINATION:  uh
DETECTED HALLUCINATION:  uh



v3 run:

[00:00.000 --> 00:04.240]  Spero che si ripigli un attimo, ho schiacciato qualche tasto che non dovevo.
DETECTED HALLUCINATION:  Grazie a tutti.
DETECTED HALLUCINATION:  E' un attimo che non dovevo.
[00:54.440 --> 00:55.700]  Ehm, ehm.

@jongwook jongwook merged commit ba3f3cd into openai:main Dec 18, 2023
8 checks passed
@ryanheise ryanheise deleted the fix-hallucinations branch December 26, 2023 12:47
@dgoryeo

dgoryeo commented Feb 1, 2024

When using --clip_timestamps, is there a need to pad before and after the clip? If so, what would be a recommended value for pad duration?

@ryanheise
Contributor Author

It should work without padding, but if the VAD is inaccurate then padding might help compensate for that.

@geosigno

geosigno commented Feb 7, 2024

Will this PR be included in the next release? If so, when is it planned?

@ryanheise
Contributor Author

Related PR: #2005 (fixes a bug in --clip_timestamps if you pass an end timestamp that is after the audio end.)

@Mileslewis

Just tested out hallucination_silence_threshold and it worked for me
thanks!

@Purfview

@ryanheise Can you show me your transcribe.py with debug stuff?

@ryanheise
Contributor Author

I don't have the exact code anymore, but you could try temporarily inserting these two lines:

                if score >= 3 or score + 0.01 >= len(words):
                    print(f"DETECTED HALLUCINATION: {segment['text']}")

before the return in this function:

            def is_segment_anomaly(segment: Optional[dict]) -> bool:
                if segment is None or not segment["words"]:
                    return False
                # Score only the first 8 words, ignoring punctuation.
                words = [w for w in segment["words"] if w["word"] not in punctuation]
                words = words[:8]
                score = sum(word_anomaly_score(w) for w in words)
                # Anomalous if the total score is high, or nearly every word scored.
                return score >= 3 or score + 0.01 >= len(words)

@Purfview

Purfview commented Feb 22, 2024

@ryanheise
Sometimes --hallucination_silence_threshold makes whole non-hallucinated segments, or parts of segments, disappear.

Below is an example where the "orange pigmentation." segment disappeared.

I'm using faster-whisper, but you should be able to reproduce it with whisper too, as the implementation is the same.
Audio file -> https://we.tl/t-U5a6Al5bRs

--language en --model=base --beam_size=5 --word_timestamps=True --hallucination_silence_threshold=None:

[02:06.620 --> 02:11.120]  White tigers carry a mutated version of this gene, which prevents them from producing
  Processing segment at 02:11.120
[02:11.120 --> 02:12.460]  orange pigmentation.
[02:15.360 --> 02:18.340]  Fewer than 4,000 tigers remain in the wild.

--language en --model=base --beam_size=5 --word_timestamps=True --hallucination_silence_threshold=2:

[02:06.620 --> 02:11.120]  White tigers carry a mutated version of this gene, which prevents them from producing
  Processing segment at 02:12.380
* HST_1: Skipping silence before possible hallucinations.
* HST_3: DETECTED HALLUCINATION:  oxygen.
  Processing segment at 02:13.380
* HST_1: Skipping silence before possible hallucinations.
[02:14.680 --> 02:18.360]  fewer than 4,000 tigers remain in the wild.

EDIT:
float32 was in use

@Purfview

Purfview commented Feb 22, 2024

I think I've noticed a pattern: it happens when if remaining_duration > threshold: is not triggered, so this line runs instead:
seek = previous_seek + segment_size

The chunks then advance by exactly 30 seconds, cutting off the word.

Chunking when --hallucination_silence_threshold=None:

  Processing segment at 00:00.000
  Processing segment at 00:26.040
  Processing segment at 00:48.280
  Processing segment at 01:14.400
  Processing segment at 01:42.380
  Processing segment at 02:11.120
  Processing segment at 02:35.400
  Processing segment at 03:05.400

Chunking by setting high threshold --hallucination_silence_threshold=40:

  Processing segment at 00:00.000
  Processing segment at 00:30.000
  Processing segment at 01:00.000
  Processing segment at 01:30.000
  Processing segment at 02:00.000
  Processing segment at 02:30.000
  Processing segment at 03:00.000

@Purfview

Purfview commented Feb 23, 2024

Another thing: this PR affects transcription even when neither of the new parameters is enabled; I mean comparing against Whisper without this PR.

This happens only sometimes, but when it happens the discrepancy is always in the last chunk.

And sometimes when the discrepancy happens, it tries to process an additional micro-chunk after it, which either produces some hallucination or fails because the no-speech threshold is met; I'm not sure if this is related to the PR or to the discrepancy.

Example of such discrepancy [audio is 05:05.877 long]:

Without this PR [perfect transcription]:

Processing segment at 04:48.000
[04:58.120 --> 05:05.260]  I just...
[05:05.260 --> 05:05.760]  I...

With this PR [all goes exactly same till the last chunk]:

  Processing segment at 04:48.000
* Compression ratio threshold is not met with temperature 0.0 (3.523810 > 2.400000)
* Compression ratio threshold is not met with temperature 0.2 (3.523810 > 2.400000)
* Compression ratio threshold is not met with temperature 0.4 (8.038462 > 2.400000)
* Compression ratio threshold is not met with temperature 0.6 (3.523810 > 2.400000)
* Compression ratio threshold is not met with temperature 0.8 (2.423077 > 2.400000)
[05:01.940 --> 05:02.900]  Okay.
[05:02.900 --> 05:04.000]  I just-
[05:04.940 --> 05:05.740]  I-
[05:05.740 --> 05:05.840]  I-
* Reset prompt. prompt_reset_on_temperature threshold is met 1.000000 > 0.500000
  Processing segment at 05:05.840
* Log probability threshold is not met with temperature 0.0 (-1.105777 < -1.000000)
* No speech threshold is met (0.772002 > 0.600000)

@ryanheise
Contributor Author

ryanheise commented Feb 23, 2024

Sometimes --hallucination_silence_threshold makes whole non-hallucinated segments, or parts of segments, disappear.

This logic is part of the original Whisper strategy of advancing by the full 30 seconds to the next window whenever the current segment is unfinished. So basically, if the segment finishes before the end of the 30 second window, Whisper crops the window to the exact end timestamp of the last word in that segment. But if the segment does not finish by the end of the 30 second window, the window is not cropped, and the speech is assumed to run all the way to the end of the window.

This logic exists whether or not hallucination_silence_threshold is enabled, and I have seen it cause problems in both cases; however, the larger models tend to be better at picking up the words across the window boundary.

In your case, the sentence in question is:

White tigers carry a mutated version of this gene, which prevents them from producing orange pigmentation.

This sentence does not fit within the 30 second window, and the word "orange" is right on the boundary. In fact, the word "orange" is slightly before the boundary and the human ear can pick it up (as can the larger models) but the smaller models fail to pick it up.

And given Whisper's logic in this case, it will assume the speech went right up to the end of the 30 second window and will resume the next window from there.

So although the larger models would probably resolve this, I think it would still be better to change Whisper's strategy and crop the window to the end timestamp of the last word even in this case, where we have an unfinished segment.

@Purfview

Purfview commented Feb 23, 2024

This logic is part of the original Whisper strategy of advancing by the full 30 seconds to the next window whenever the current segment is unfinished.

I can't connect the dots...
Then why is the segment "unfinished" when using hallucination_silence_threshold, and "finished" without it?

How does remaining_duration <= hallucination_silence_threshold mean an "unfinished" segment? The option doesn't read as a "finished/unfinished segment threshold"....

@ryanheise
Contributor Author

ryanheise commented Feb 23, 2024

Apologies, my explanation was the wrong way around. The original Whisper behaviour is that if the last segment in the window is "complete", THEN it skips to the end of the full 30 second window. If the last segment is incomplete, then it crops the window to the end timestamp of the last word.

But when hallucination_silence_threshold is set, it still applies this logic in most cases, except that it also includes a misfired heuristic that skips to the end of the full 30 second window if the end of the speech is close enough to the end of the window:

                # skip silence before possible hallucinations
                if hallucination_silence_threshold is not None:
                    threshold = hallucination_silence_threshold
                    if not single_timestamp_ending:
                        last_word_end = get_end(current_segments)
                        if last_word_end is not None and last_word_end > time_offset:
                            remaining_duration = window_end_time - last_word_end
                            if remaining_duration > threshold:  # <--- misfired heuristic
                                seek = round(last_word_end * FRAMES_PER_SECOND)
                            else:
                                seek = previous_seek + segment_size

The goal was to skip over as much silence as safely possible.

However, in hindsight this was a bit opportunistic, since single_timestamp_ending was False for good reason. You should find your example works if you remove that heuristic, i.e. delete this entire section:

                    if not single_timestamp_ending:
                        last_word_end = get_end(current_segments)
                        if last_word_end is not None and last_word_end > time_offset:
                            remaining_duration = window_end_time - last_word_end
                            if remaining_duration > threshold:  # <--- misfired heuristic
                                seek = round(last_word_end * FRAMES_PER_SECOND)
                            else:
                                seek = previous_seek + segment_size

(It's OK, the other parts of this code block are already handled elsewhere.)
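With that section removed, the net window-advance rule described above reduces to something like this sketch (next_seek is an illustrative helper, not a function in the codebase; FRAMES_PER_SECOND is Whisper's mel frame rate):

```python
# Illustrative sketch of the window-advance rule after removing the
# misfired heuristic: when the final segment is incomplete, crop to the
# end of the last word; otherwise advance by the full window.
FRAMES_PER_SECOND = 100  # mel frames per second of audio

def next_seek(previous_seek, segment_size, last_word_end, time_offset,
              single_timestamp_ending):
    if (not single_timestamp_ending
            and last_word_end is not None
            and last_word_end > time_offset):
        # incomplete segment: resume from the end of the last word
        return round(last_word_end * FRAMES_PER_SECOND)
    # complete segment: advance past the full window
    return previous_seek + segment_size

# A word ending at 131.12 s in a window starting at 120 s:
print(next_seek(12000, 3000, 131.12, 120.0, False))  # 13112
```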

@ryanheise
Contributor Author

I've created PR #2043 incorporating the above fix, based on your counterexample.

@Purfview

Thanks for the explanation, now this part of the code makes sense.
Do you have any idea why seek in the last window can be affected by this PR? -> #1838 (comment)

The goal was to skip over as much silence as safely possible.

Imho, skipping to the full 30 s window is pretty unsafe. 😆
And it contradicts the description: "skip silent periods longer than this threshold (in seconds) when a possible hallucination is detected"

@ryanheise
Contributor Author

Do you have any idea why seek in the last window can be affected by this PR? -> #1838 (comment)

Do you have an audio file to reproduce?

@Purfview

Do you have an audio file to reproduce?

This file has discrepancy in the last window/chunk:
t-001.mka -> https://we.tl/t-ecd6U1QaZp
--language en --model=base --beam_size 1 --word_timestamps=True

Whisper without this PR:

[01:53.920 --> 01:54.500]  I'll give you some advice.
[01:59.500 --> 02:00.080]  I'll give you some advice.
[02:00.080 --> 02:00.080]  I'll give you some advice.
[02:00.080 --> 02:00.980]  Say the word, General.
[02:02.300 --> 02:03.320]  Let him go.

Whisper with this PR:

[01:53.920 --> 01:55.200]  I'll give you some advice.
[01:59.500 --> 02:00.980]  Say the word, General.
[02:02.280 --> 02:03.320]  Let him go.

@ryanheise
Contributor Author

I'll test tomorrow, but does this also happen on PR #2043?

Purfview added a commit to Purfview/faster-whisper that referenced this pull request Feb 23, 2024
Removes the wishful heuristic causing more issues than it's fixing.

Same as openai/whisper#2043

Example of the issue: openai/whisper#1838 (comment)
@Purfview

Purfview commented Feb 23, 2024

I'll test tomorrow, but does this also happen on PR #2043?

Yes, because the hallucination_silence_threshold option is not relevant to the issue.

The culprit affecting only the last window is found. It happens because of this:

            mel_segment = mel[:, seek : seek + segment_size]

This is the fix [that's how it was before this PR]:

            mel_segment = mel[:, seek : seek + N_FRAMES]

Not sure why you changed it; by my observation it makes more hallucinations [probably it's random].
Anyway, the fix brings back the previous behavior.

@ryanheise
Contributor Author

That was changed for --clip_timestamps, because parts of the audio that are clipped out should not be included in the mel spectrogram. I'll take a look at your test scenario to see what's going on.

@ryanheise
Contributor Author

I've confirmed the discrepancy, which seems to be a consequence of slightly different mel spectrograms. In the two examples you gave (only the latter of which I have tested with the supplied audio file), the PR actually removed a hallucination in one example and introduced a hallucination in the other, so on balance it's hard to say whether this discrepancy is better, worse, or about the same.

So if it's not clear whether it's better or worse, do you see anything incorrect in the clipping logic? I think the difference is that I always clip exactly to the stretch of audio being examined, and then pad it. Originally, padding was added to the end as soon as the mel spectrogram was first generated, and then (in the original code) the dynamic shifting of the window starts could cause the last part of the audio to end up being padded twice, because there was no guarantee that the initial padding Whisper added at the start of the process reflected where the last window actually ended up starting.

But it's possible I've done something wrong which I can't see, so let me know if you do spot something incorrect in the logic.

@ryanheise
Contributor Author

After plotting the mel spectrograms, I noticed that the padding added when the audio is first loaded (as a whole) contains all -1.0s, while the padding added in the main loop for each 30 second window contains all 0.0s. Not sure why that is, but there are two different padding code paths, and weirdly they produce different padding results.

So in your example, the PR ends up always using the padding path that pads with 0.0s, whereas originally the end-of-file padding had -1.0s.
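The general effect can be shown with a toy example (purely illustrative, not Whisper's actual mel pipeline): padding the feature matrix with a constant after the transform is not the same as transforming zero-padded audio.

```python
import numpy as np

# Toy "log spectrogram": log of per-frame energy over fixed-size frames.
# NOT Whisper's mel pipeline; this only illustrates why padding in signal
# space and padding in feature space give different trailing values.
def log_spec(x, frame=100, eps=1e-10):
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    return np.log10(np.square(frames).sum(axis=1) + eps)

audio = np.random.randn(1000).astype(np.float32)

# Pad the waveform first, then transform:
a = log_spec(np.concatenate([audio, np.zeros(500, dtype=np.float32)]))
# Transform first, then pad the features with a constant:
b = np.concatenate([log_spec(audio), np.zeros(5)])

print(a[-5:])  # tail is log10(eps) = -10: "silence" in log space
print(b[-5:])  # tail is 0.0: a different constant entirely
```

So even when both paths pad "with zeros", the values the model actually sees at the end of the last window differ, which matches the -1.0 vs 0.0 discrepancy observed above.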

nguyendc-systran pushed a commit to SYSTRAN/faster-whisper that referenced this pull request Feb 29, 2024
Removes the wishful heuristic causing more issues than it's fixing.

Same as openai/whisper#2043

Example of the issue: openai/whisper#1838 (comment)
@Kirkezz

Kirkezz commented Sep 5, 2024

There's still a chance that a hallucination will be produced.
For me it was:

[02:15:58.100 --> 02:16:05.380]  я вам вышлю. Всего доброго, до свидания.
[02:16:28.100 --> 02:16:30.100]  Редактор субтитров И.Бойкова

i.e.

....
[02:16:28.100 --> 02:16:30.100] Subtitle Editor I. Boykova

Notably, this timestamp belongs to the end of the audio.

Model size: small. There are also some results on Google if you search for this phrase. One of them:

[24:26.800 --> 24:30.160]  Смотрите телебарометр на нашем телеканале.
[24:30.160 --> 24:32.160]  Редактор субтитров И.Бойкова
[24:32.160 --> 24:39.160]  Корректор А.Кулакова

from https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-whisperasr/BELARUSTV_20221005_161500.small.transcribe.run1.txt

@ryanheise
Contributor Author

That's certainly possible, and unfortunately there is no single choice of parameters that will be perfect in all scenarios. You can tweak the silence threshold, which is exposed on the command line. You can also try tweaking the other thresholds built into the code (such as how long a word must be before it is flagged as anomalous). If we can gather a large enough dataset of audio samples that produce hallucinations, we should be able to come up with better default settings that work well across a variety of scenarios and languages.
