
Fix silence detection #175

Open · wants to merge 24 commits into base: main

Conversation

@aigerimmmm commented Jun 26, 2024

PR for issue #27.

  • Added SilenceDetectionFilter similar to LanguageDetectionFilter.
  • Implemented detectSilence, similar to detectLanguage in TextDecoder, which performs a single forward pass to detect silence periods.
  • Added tests for SilenceDetectionFilter.
  • Added silence detection and segment skipping logic in TranscribeTask.
  • Updated models to support silence detection:
    • Added flags in DecodingOptions for ignoring prefill prompts during silence detection.
    • Changed the property that manages the silence detection threshold.

@ZachNagengast (Contributor) left a comment

Hi @aigerimmmm, thanks for this submission; this is a hugely needed feature 🎉 It looks like a good start. Please see the comments below and let me know what you think. Also, on the added files: is there a specific reason you need the new silent audio, rather than generating silence (all zeros) and appending it to the existing audio? We do that in a few tests you may want to reference. It would also be great to test the specific case where silence is above the threshold but the log probs for the full window are also above the threshold, so we keep the tokens rather than skip the window. Thanks again for working through this!


// MARK: - Helper Function

func loadAudioSamples(forResource resource: String, withExtension ext: String) -> [Float] {
Contributor

Good idea 👍

func testSilentAudio() async throws {
let whisperKit = try await WhisperKit(modelFolder: tinyModelPath(), verbose: true, logLevel: .debug)

let silentAudioSamples: [Float] = loadAudioSamples(forResource: "silent_audio", withExtension: "mp3")
Contributor

This can be [Float](repeating: 0.0, count: 16000)

Author

Thanks for the suggestion! I've updated the testSilentAudio function to use [Float](repeating: 0.0, count: 16000) for silentAudioSamples.
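For reference, a minimal sketch of the suggested change: generate one second of 16 kHz silence in the test itself rather than bundling a silent audio file, and (as the earlier review comment suggests) optionally append it to existing audio. The `existingAudio` samples here are placeholder values, not from the repo's fixtures.

```swift
// One second of silence at 16 kHz, generated in-test rather than
// loaded from a bundled audio file
let silentAudioSamples = [Float](repeating: 0.0, count: 16000)

// Appending generated silence to existing audio, as other tests do
let existingAudio: [Float] = [0.1, -0.2, 0.3] // placeholder samples
let padded = existingAudio + silentAudioSamples
```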

@@ -157,6 +157,13 @@ final class TranscribeTask {
try Task.checkCancellation()
// Send to decoder to predict text tokens with fallback
let decodingResult = try await decodeWithFallback(encoderSegment: encoderOutput, decodingOptions: options, callback: decodingCallback)

// Check for silence detection
if decodingResult.noSpeechProb > (options.noSpeechThreshold ?? 0.7) {
Contributor

I believe this should also contain a check for the overall log prob of the window, see reference: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L289

Author

Thanks for the suggestion! I've added a check for the overall log probability of the window.
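As a standalone sketch of the combined condition in openai/whisper's transcribe loop: skip the window only when the no-speech probability is high and the average log probability is low. The `DecodingSketch` struct and its field names are hypothetical stand-ins for WhisperKit's actual `DecodingResult`, not the library's API.

```swift
struct DecodingSketch {
    let noSpeechProb: Float
    let avgLogProb: Float
}

// Skip a window only when the no-speech probability is above the
// threshold AND the average log prob is below its threshold;
// confident text overrides the silence signal.
func shouldSkipSegment(_ result: DecodingSketch,
                       noSpeechThreshold: Float = 0.6,
                       logProbThreshold: Float = -1.0) -> Bool {
    guard result.noSpeechProb > noSpeechThreshold else { return false }
    if result.avgLogProb > logProbThreshold {
        // Log probs are confident despite high no-speech prob: keep tokens
        return false
    }
    return true
}
```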

if decodingResult.noSpeechProb > (options.noSpeechThreshold ?? 0.7) {
print("Detected silence with noSpeechProb \(decodingResult.noSpeechProb), skipping segment.")
// Skip processing for silent segments
break
Contributor

Instead of break, this should just increase the seek and continue the loop https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L293

Author

You are so right. I changed it to increase the seek and continue the loop.
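The difference can be illustrated with a toy loop. The window size is Whisper's standard 30 s at 16 kHz; the silent-window bookkeeping is illustrative, not WhisperKit's actual seek logic. A `break` would drop everything after the first silent window, while advancing `seek` and continuing keeps transcribing the rest.

```swift
// Toy transcribe loop: a silent window advances `seek` and continues,
// so later windows are still processed (a `break` would drop them).
let windowSamples = 480_000          // 30 s at 16 kHz, Whisper's window
let totalSamples = 3 * windowSamples
let silentWindows: Set<Int> = [1]    // pretend the middle window is silent

var seek = 0
var transcribedWindows: [Int] = []
while seek < totalSamples {
    let windowIndex = seek / windowSamples
    if silentWindows.contains(windowIndex) {
        seek += windowSamples        // skip the silence, keep going
        continue
    }
    transcribedWindows.append(windowIndex)
    seek += windowSamples
}
// transcribedWindows now holds [0, 2]: only the silent window was skipped
```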

@@ -827,7 +892,7 @@ open class TextDecoder: TextDecoding, WhisperMLModel {
let decodingFallback = DecodingFallback(
options: options,
isFirstTokenLogProbTooLow: isFirstTokenLogProbTooLow,
noSpeechProb: noSpeechProb,
Contributor

Is this intended? If I'm reading this right, I think the noSpeechProb should be passed based on the actual probability of the no speech token at the first index of predicted tokens https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L689-L693

@aigerimmmm (Author) commented Jul 8, 2024

Yes, you are right. I have made changes to ensure that noSpeechProb is calculated at the start of the transcript (SOT) token and passed to DecodingFallback. The commit is 3a3b512.
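A self-contained sketch of reading the no-speech probability at the SOT step, as in openai/whisper's decoding.py: softmax the logits from that single forward pass, then index the no-speech token. The function name and shapes are illustrative, not WhisperKit's API.

```swift
import Foundation

// Softmax over the SOT-step logits, then read off the probability
// assigned to the no-speech token
func noSpeechProbability(sotLogits: [Float], noSpeechTokenIndex: Int) -> Float {
    // Subtract the max logit for numerical stability
    let maxLogit = sotLogits.max() ?? 0
    let exps = sotLogits.map { exp(Double($0 - maxLogit)) }
    let sum = exps.reduce(0, +)
    return Float(exps[noSpeechTokenIndex] / sum)
}
```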

@@ -317,9 +319,10 @@ public struct DecodingOptions {
compressionRatioThreshold: Float? = 2.4,
logProbThreshold: Float? = -1.0,
firstTokenLogProbThreshold: Float? = -1.5,
noSpeechThreshold: Float? = 0.6,
noSpeechThreshold: Float? = 0.7,
Contributor

The default in openai/whisper is 0.6, can you stick with that value or is there a reason you're updating it here?

Author

I've reverted the noSpeechThreshold value back to 0.6 to align with the default used in openai/whisper.

concurrentWorkerCount: Int = 0,
chunkingStrategy: ChunkingStrategy? = nil
chunkingStrategy: ChunkingStrategy? = nil,
ignorePrefillPromptForNoSpeechDetection: Bool = true
Contributor

This is a bit verbose, and I'm also not sure exactly what its intended use is. Are you referencing existing code, or was this a need that came up from your testing? I can see a need for something that runs the forward pass for the SOT token when using the prefillData, but for the prefill prompt we would still want to check the no-speech prob specifically at the SOT token, after all the prefill prompt tokens have been passed through to generate their KV caches. The best place to do that would be here:

if tokenIndex < intialPromptIndex {

Author

Thank you for your suggestion! I now check the no-speech probability after all prefill prompt tokens have been processed.
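The placement being discussed can be shown with a toy decoding loop: prefill prompt tokens only populate the KV cache, and the no-speech probability is read at the first step after them (the SOT forward pass). The token ids and loop length here are purely illustrative.

```swift
// Toy decoding loop: steps below initialPromptIndex are prefill-only;
// the no-speech probability is sampled at the SOT step right after them.
let prefillPromptTokens = [50361, 50362]   // illustrative token ids
let initialPromptIndex = prefillPromptTokens.count

var noSpeechCheckedAtStep: Int? = nil
for tokenIndex in 0..<5 {
    if tokenIndex < initialPromptIndex {
        continue                            // prefill: fills KV cache only
    }
    if tokenIndex == initialPromptIndex {
        noSpeechCheckedAtStep = tokenIndex  // SOT step: read noSpeechProb here
    }
    // ... normal token sampling would happen here ...
}
// noSpeechCheckedAtStep is 2, the first post-prefill step
```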

@@ -278,3 +278,37 @@ open class LanguageLogitsFilter: LogitsFiltering {
return indexes
}
}

@available(macOS 13, iOS 16, watchOS 10, visionOS 1, *)
open class SilenceLogitsFilter: LogitsFiltering {
Contributor

On second look at the source, I'm not sure we need a filter for this, since that will suppress the log probs of the "predicted token" during inference. What you already have here https://github.com/argmaxinc/WhisperKit/pull/175/files#diff-5dd6579fc66020b1085535bce41d2c2cc399a0b2b8f0ba225fc89f39d9ebdbc8R402 is checking the specific no speech index, which does everything you would need already.

Contributor

One thing you could add to the filters is suppressing the no-speech token in the SuppressTokensFilter, similar to this https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L638-L640 (it also needs the remaining suppress tokens, but that can be fixed later)
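The suppression mechanism being referenced is simple to sketch: set the logits of the suppressed token ids to negative infinity before sampling, so they can never be chosen. This is a standalone illustration, not WhisperKit's actual filter implementation.

```swift
// Set selected token logits to -inf before sampling, the same
// mechanism openai/whisper's SuppressTokens filter uses
func suppressTokens(_ logits: [Float], suppress: [Int]) -> [Float] {
    var filtered = logits
    for tokenId in suppress where filtered.indices.contains(tokenId) {
        filtered[tokenId] = -.infinity
    }
    return filtered
}
```

With `suppressTokens([0.1, 2.0, 0.3], suppress: [1])`, token 1 can no longer win the argmax even though it had the highest raw logit.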

Sources/WhisperKit/Core/TextDecoder.swift (outdated thread, resolved)
Sources/WhisperKit/Core/TranscribeTask.swift (outdated thread, resolved)
@ZachNagengast (Contributor)

Hey @aigerimmmm just checking in, how are you feeling about the changes?

@aigerimmmm (Author)

Hi @ZachNagengast, I really appreciate all the suggestions and the detailed review; it's very helpful for me 👍 I've been working on it and I'm going to submit today.

@ZachNagengast (Contributor)

@aigerimmmm Checking in, is this ready for another review?

@aigerimmmm aigerimmmm closed this Aug 8, 2024
@aigerimmmm aigerimmmm reopened this Aug 8, 2024
@aigerimmmm (Author)

Hi @ZachNagengast, thank you for the reply. Yes, I'm ready for another review anytime.
