Exporting to ONNX with kv_cache #1240

Open
idruker-cerence opened this issue Jul 17, 2024 · 6 comments

idruker-cerence commented Jul 17, 2024

Disclaimer
This is not a bug report but rather a question.

To Reproduce

  • Run a simplified script that downloads the mistral-7b model from Hugging Face and converts it to ONNX format. This works perfectly with kv_cache set to false, but the resulting model does not have the kv-cache.
  • Run the same script, but this time with kv_cache set to true. The conversion fails.

Olive config

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "io_config": {
                "input_names":  ["input_ids", "attention_mask", "position_ids"],
                "output_names": ["logits"],
                "input_shapes": [[2, 8], [2, 8], [2, 8]],
                "input_types":  ["int64", "int64", "int64"],
                "dynamic_axes": {
                    "input_ids":      {"0": "batch_size", "1": "sequence_length"},
                    "attention_mask": {"0": "batch_size", "1": "sequence_length"},
                    "position_ids":   {"0": "batch_size", "1": "sequence_length"}
                },
                "kv_cache": true
            },
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "gpu",
                        "execution_providers": ["CUDAExecutionProvider"]
                    }
                ]
            }
        }
    },
    "passes": {
        "onnx_conversion": {
            "type": "OnnxConversion",
            "config": {
                "device": "cuda",
                "target_opset": 14,
                "torch_dtype": "float16"
            }
        }
    },
    "engine": {
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "/mnt/genai/users/ilya_druker/models/cache",
        "output_dir": "models/with-past",
        "output_name": "mistral"
    }
}
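
For context, a minimal sketch of the per-layer cache inputs the kv_cache flag is expected to synthesize during conversion (the past_key_values.N.key / past_key_values.N.value naming follows the convention commonly used by transformers ONNX exports and is an assumption here; Mistral-7B has 32 decoder layers):

# Sketch of the extra graph inputs expected when "kv_cache" is true.
num_layers = 32  # Mistral-7B
kv_inputs = [
    f"past_key_values.{i}.{kind}"
    for i in range(num_layers)
    for kind in ("key", "value")
]
print(kv_inputs[:4])
# ['past_key_values.0.key', 'past_key_values.0.value',
#  'past_key_values.1.key', 'past_key_values.1.value']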

Olive logs
[2024-07-17 09:27:38,243] [INFO] [config.py:237:validate_evaluate_input_model] No evaluator is specified, skip to evaluate model
[2024-07-17 09:27:38,244] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2024-07-17 09:27:38,253] [INFO] [engine.py:986:save_olive_config] Saved Olive config to /mnt/genai/users/ilya_druker/models/cache/default_workflow/olive_config.json
[2024-07-17 09:27:38,258] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-07-17 09:27:38,258] [INFO] [engine.py:109:initialize] Using cache directory: /mnt/genai/users/ilya_druker/models/cache/default_workflow
[2024-07-17 09:27:38,260] [INFO] [engine.py:265:run] Running Olive on accelerator: gpu-cuda
[2024-07-17 09:27:38,260] [INFO] [engine.py:1085:_create_system] Creating target system ...
[2024-07-17 09:27:38,260] [INFO] [engine.py:1088:_create_system] Target system created in 0.000138 seconds
[2024-07-17 09:27:38,261] [INFO] [engine.py:1097:_create_system] Creating host system ...
[2024-07-17 09:27:38,261] [INFO] [engine.py:1100:_create_system] Host system created in 0.000198 seconds
[2024-07-17 09:27:38,275] [INFO] [engine.py:867:_run_pass] Running pass onnx_conversion:OnnxConversion
[2024-07-17 09:27:38,347] [INFO] [hf_config.py:112:load_hf_model] Loading Huggingface model from mistralai/Mistral-7B-v0.1
/home/ilya_druker/.local/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/mnt/homedirs/ilya_druker/.local/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/home/ilya_druker/.local/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(

Loading checkpoint shards: 100%|██████████| 2/2 [00:22<00:00, 11.12s/it]
[2024-07-17 09:28:08,970] [ERROR] [engine.py:949:_run_pass] Pass run failed.
Traceback (most recent call last):
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 937, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/systems/local.py", line 32, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/olive_pass.py", line 224, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 132, in _run_for_config
output_model = self._run_for_config_internal(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 182, in _run_for_config_internal
return self._convert_model_on_device(model, data_root, config, output_model_path, device, torch_dtype)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 439, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 285, in _export_pytorch_model
torch.onnx.export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 516, in export
_export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1612, in _export
graph, params_dict, torch_out = _model_to_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
outs = ONNXTracedModule(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 138, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 129, in wrapper
outs.append(self.inner(*trace_inputs))
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1139, in forward
outputs = self.model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1024, in forward
layer_outputs = decoder_layer(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 738, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 656, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 8 for tensor number 1 in the list.
[2024-07-17 09:28:08,984] [WARNING] [engine.py:360:run_accelerator] Failed to run Olive on gpu-cuda.
Traceback (most recent call last):
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 339, in run_accelerator
output_footprint = self.run_no_search(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 431, in run_no_search
should_prune, signal, model_ids = self._run_passes(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 829, in _run_passes
model_config, model_id = self._run_pass(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 937, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/systems/local.py", line 32, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/olive_pass.py", line 224, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 132, in _run_for_config
output_model = self._run_for_config_internal(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 182, in _run_for_config_internal
return self._convert_model_on_device(model, data_root, config, output_model_path, device, torch_dtype)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 439, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 285, in _export_pytorch_model
torch.onnx.export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 516, in export
_export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1612, in _export
graph, params_dict, torch_out = _model_to_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
outs = ONNXTracedModule(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 138, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 129, in wrapper
outs.append(self.inner(*trace_inputs))
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1139, in forward
outputs = self.model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1024, in forward
layer_outputs = decoder_layer(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 738, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 656, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 8 for tensor number 1 in the list.
[2024-07-17 09:28:08,987] [INFO] [engine.py:282:run] Run history for gpu-cuda:
[2024-07-17 09:28:08,998] [INFO] [engine.py:570:dump_run_history] run history:
+------------+-------------------+-------------+----------------+-----------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+============+===================+=============+================+===========+
| 4c8cc2fe | | | | |
+------------+-------------------+-------------+----------------+-----------+
[2024-07-17 09:28:09,000] [INFO] [engine.py:297:run] No packaging config provided, skip packaging artifacts

/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:276: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif sliding_window is None or key_value_length < sliding_window:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py:119: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_len > self.max_seq_len_cached:

Other information

  • OS: Linux, CUDA
@prashantaithal

Any update on this?

@idruker-cerence
Author

idruker-cerence commented Jul 30, 2024

Any update on this?

Yes, kind of. Just replace the OnnxConversion pass in the config with OptimumConversion.
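
For reference, a minimal sketch of that substitution in the passes section (pass options omitted; which options OptimumConversion accepts depends on the Olive version, so treat this as an assumption to check against the docs):

"passes": {
    "optimum_conversion": {
        "type": "OptimumConversion"
    }
}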

@prashantaithal

Curious whether that generates an ONNX file with the kv-cache as part of the model, or whether the kv-cache has to be implemented separately. Also, can a Llama 2 7B ONNX model be generated via Olive the same way you generated the Mistral ONNX file?

@idruker-cerence
Author

The kv-cache is already part of the original model in PyTorch format; the kv_cache flag only controls whether it is preserved when converting to ONNX.

I have not tested Llama but assume it works similarly.
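
As a quick sanity check, the exported graph can be inspected to confirm the per-layer cache inputs survived the conversion. A minimal sketch, assuming the output path follows the engine config above (the exact file name is an assumption):

import onnx

# Path assumed from output_dir/output_name in the engine config above.
model = onnx.load("models/with-past/mistral.onnx")

# With the kv-cache preserved, this list should contain per-layer entries such
# as past_key_values.0.key / past_key_values.0.value alongside input_ids,
# attention_mask, and position_ids.
print([inp.name for inp in model.graph.input])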

@prashantaithal

Can you please share the steps for how you used the Olive config file to generate the ONNX model?

@idruker-cerence
Author

You call:

olive run --config config.json
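
Note that OptimumConversion delegates to Hugging Face Optimum's ONNX exporter, so the optimum package must be available in the environment (whether additional extras are needed is environment-specific; treat this as an assumption):

pip install optimum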
