Whisper-large-v3 transcript is trimmed #1972

Open
2 of 4 tasks
yv0vaa opened this issue Jul 25, 2024 · 4 comments
Labels
bug Something isn't working

Comments

yv0vaa commented Jul 25, 2024

System Info

optimum 1.21.2
Ubuntu 22.04.4 LTS
CUDA 12.3
cuda-toolkit 11.7
onnxruntime 1.18.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

import os
from transformers import WhisperForConditionalGeneration, WhisperProcessor, PretrainedConfig
import torch
import torchaudio
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_name = 'openai/whisper-large-v3'
model_path = 'whisper-large-v3'

processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
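# (note: this PyTorch model is never used; it is overwritten by the ORT model below)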
device = "cuda:0" if torch.cuda.is_available() else "cpu"


model_config = PretrainedConfig.from_pretrained(model_name)
sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, 'encoder_model.onnx'),
    os.path.join(model_path, 'decoder_model.onnx'),
)
model = ORTModelForSpeechSeq2Seq(
    sessions[0], 
    sessions[1], 
    model_config, 
    model_path, 
    use_cache=False,
).to(device)

audio, sr = torchaudio.load("example.ogg")
audio = torchaudio.functional.resample(audio[0], sr, 16000)
input_features = processor(audio.cpu(), return_tensors="pt", sampling_rate=16000, max_new_tokens=1000).input_features.to(device)
predicted_ids = model.generate(input_features)[0]
transcription = processor.decode(predicted_ids)
print(transcription)

Expected behavior

For some reason the final transcript is incomplete and is cut off in the middle of the speech.
I've tried changing the max_tokens and max_new_tokens parameters, but nothing changed.
I also couldn't figure out how to pass the compute type and batch size as parameters.
PretrainedConfig and GenerationConfig don't have such parameters. Could anyone help me?
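
For context, a minimal sketch of how compute placement and batching are usually expressed with optimum (the provider/session_options arguments and list-based batching are standard optimum/transformers usage; the model directory and the waveform names here are illustrative, not from this thread):

import onnxruntime
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

session_options = onnxruntime.SessionOptions()  # threads, graph optimizations, etc.
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "whisper-large-v3",                    # directory with the exported ONNX files
    provider="CUDAExecutionProvider",      # compute placement; "CPUExecutionProvider" for CPU
    session_options=session_options,
)

# Batching: pass several 16 kHz waveforms at once; the processor pads them
# into a single (batch, n_mels, frames) input_features tensor.
batch = processor([waveform_a, waveform_b], return_tensors="pt", sampling_rate=16000)
predicted_ids = model.generate(batch.input_features)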

yv0vaa added the bug label on Jul 25, 2024
IlyasMoutawwakil (Member) commented

Hey @yv0vaa, would you have time to try out the branch in #1971 and see if it fixes your issue?

yv0vaa (Author) commented Jul 30, 2024

Good afternoon @IlyasMoutawwakil, thanks, but unfortunately it didn't help.

IlyasMoutawwakil (Member) commented Jul 30, 2024

Oh, I just noticed that you're passing max_new_tokens to the processor rather than to generate.
Is the behavior different from that of transformers?
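
To make that concrete, a minimal sketch of the corrected call (in the standard transformers/optimum API, generation-length options belong to generate(); the value 400 is illustrative, chosen to stay within Whisper's 448-position decoder context):

# the processor only extracts features; no generation arguments here
input_features = processor(audio.cpu(), return_tensors="pt", sampling_rate=16000).input_features.to(device)
# generation-length controls are passed to generate() instead
predicted_ids = model.generate(input_features, max_new_tokens=400)[0]
transcription = processor.decode(predicted_ids)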

yv0vaa (Author) commented Jul 31, 2024

Maybe I'm doing something wrong, but nothing changes. Varying max_new_tokens in both processor.__call__ and model.generate does not affect the model's behavior.
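
One possible explanation, offered as an assumption rather than something established in this thread: Whisper's feature extractor pads or truncates audio to 30-second windows, so a single generate call transcribes at most the first 30 seconds regardless of max_new_tokens, which would match a transcript cut off mid-speech. Long-form audio is commonly handled by chunked decoding, e.g. via the transformers ASR pipeline; a minimal sketch reusing the objects defined in the reproduction above:

from transformers import pipeline

# chunked long-form decoding; the ORT model is used as a drop-in model
asr = pipeline(
    "automatic-speech-recognition",
    model=model,                         # the ORTModelForSpeechSeq2Seq instance
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,                   # split long recordings into 30 s windows
)
print(asr("example.ogg")["text"])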
