Exporting to ONNX with kv_cache #1240

Open
idruker-cerence opened this issue Jul 17, 2024 · 6 comments

idruker-cerence commented Jul 17, 2024

Disclaimer
This is not a bug report but rather a question.

To Reproduce

  • Run a simplified script that downloads the mistral-7b model from Hugging Face and converts it to ONNX format. This works perfectly with kv_cache set to false, but the resulting model does not have the kv-cache.
  • Run the same script, but this time with kv_cache set to true. The conversion fails.

Olive config

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "io_config": {
                "input_names":  ["input_ids", "attention_mask", "position_ids"],
                "output_names": ["logits"],
                "input_shapes": [[2, 8], [2, 8], [2, 8]],
                "input_types":  ["int64", "int64", "int64"],
                "dynamic_axes": {
                    "input_ids":      {"0": "batch_size", "1": "sequence_length"},
                    "attention_mask": {"0": "batch_size", "1": "sequence_length"},
                    "position_ids":   {"0": "batch_size", "1": "sequence_length"}
                },
                "kv_cache": true
            },
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "gpu",
                        "execution_providers": ["CUDAExecutionProvider"]
                    }
                ]
            }
        }
    },
    "passes": {
        "onnx_conversion": {
            "type": "OnnxConversion",
            "config": {
                "device": "cuda",
                "target_opset": 14,
                "torch_dtype": "float16"
            }
        }
    },
    "engine": {
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "/mnt/genai/users/ilya_druker/models/cache",
        "output_dir": "models/with-past",
        "output_name": "mistral"
    }
}
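
For context, a minimal sketch of the per-layer cache inputs the kv_cache flag is expected to synthesize during conversion (the past_key_values.N.key / past_key_values.N.value naming follows the convention commonly used by transformers ONNX exports and is an assumption here; Mistral-7B has 32 decoder layers):

# Sketch of the extra graph inputs expected when "kv_cache" is true.
num_layers = 32  # Mistral-7B
kv_inputs = [
    f"past_key_values.{i}.{kind}"
    for i in range(num_layers)
    for kind in ("key", "value")
]
print(kv_inputs[:4])
# ['past_key_values.0.key', 'past_key_values.0.value',
#  'past_key_values.1.key', 'past_key_values.1.value']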

Olive logs
[2024-07-17 09:27:38,243] [INFO] [config.py:237:validate_evaluate_input_model] No evaluator is specified, skip to evaluate model
[2024-07-17 09:27:38,244] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2024-07-17 09:27:38,253] [INFO] [engine.py:986:save_olive_config] Saved Olive config to /mnt/genai/users/ilya_druker/models/cache/default_workflow/olive_config.json
[2024-07-17 09:27:38,258] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-07-17 09:27:38,258] [INFO] [engine.py:109:initialize] Using cache directory: /mnt/genai/users/ilya_druker/models/cache/default_workflow
[2024-07-17 09:27:38,260] [INFO] [engine.py:265:run] Running Olive on accelerator: gpu-cuda
[2024-07-17 09:27:38,260] [INFO] [engine.py:1085:_create_system] Creating target system ...
[2024-07-17 09:27:38,260] [INFO] [engine.py:1088:_create_system] Target system created in 0.000138 seconds
[2024-07-17 09:27:38,261] [INFO] [engine.py:1097:_create_system] Creating host system ...
[2024-07-17 09:27:38,261] [INFO] [engine.py:1100:_create_system] Host system created in 0.000198 seconds
[2024-07-17 09:27:38,275] [INFO] [engine.py:867:_run_pass] Running pass onnx_conversion:OnnxConversion
[2024-07-17 09:27:38,347] [INFO] [hf_config.py:112:load_hf_model] Loading Huggingface model from mistralai/Mistral-7B-v0.1
/home/ilya_druker/.local/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/mnt/homedirs/ilya_druker/.local/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/home/ilya_druker/.local/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(

Loading checkpoint shards: 100%|██████████| 2/2 [00:22<00:00, 11.12s/it]
[2024-07-17 09:28:08,970] [ERROR] [engine.py:949:_run_pass] Pass run failed.
Traceback (most recent call last):
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 937, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/systems/local.py", line 32, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/olive_pass.py", line 224, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 132, in _run_for_config
output_model = self._run_for_config_internal(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 182, in _run_for_config_internal
return self._convert_model_on_device(model, data_root, config, output_model_path, device, torch_dtype)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 439, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 285, in _export_pytorch_model
torch.onnx.export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 516, in export
_export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1612, in _export
graph, params_dict, torch_out = _model_to_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
outs = ONNXTracedModule(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 138, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 129, in wrapper
outs.append(self.inner(*trace_inputs))
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1139, in forward
outputs = self.model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1024, in forward
layer_outputs = decoder_layer(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 738, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 656, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 8 for tensor number 1 in the list.
[2024-07-17 09:28:08,984] [WARNING] [engine.py:360:run_accelerator] Failed to run Olive on gpu-cuda.
Traceback (most recent call last):
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 339, in run_accelerator
output_footprint = self.run_no_search(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 431, in run_no_search
should_prune, signal, model_ids = self._run_passes(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 829, in _run_passes
model_config, model_id = self._run_pass(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 937, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/systems/local.py", line 32, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/olive_pass.py", line 224, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 132, in _run_for_config
output_model = self._run_for_config_internal(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 182, in _run_for_config_internal
return self._convert_model_on_device(model, data_root, config, output_model_path, device, torch_dtype)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 439, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 285, in _export_pytorch_model
torch.onnx.export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 516, in export
_export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1612, in _export
graph, params_dict, torch_out = _model_to_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
outs = ONNXTracedModule(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 138, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 129, in wrapper
outs.append(self.inner(*trace_inputs))
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1139, in forward
outputs = self.model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1024, in forward
layer_outputs = decoder_layer(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 738, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 656, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 8 for tensor number 1 in the list.
[2024-07-17 09:28:08,987] [INFO] [engine.py:282:run] Run history for gpu-cuda:
[2024-07-17 09:28:08,998] [INFO] [engine.py:570:dump_run_history] run history:
+------------+-------------------+-------------+----------------+-----------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+============+===================+=============+================+===========+
| 4c8cc2fe | | | | |
+------------+-------------------+-------------+----------------+-----------+
[2024-07-17 09:28:09,000] [INFO] [engine.py:297:run] No packaging config provided, skip packaging artifacts

/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:276: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif sliding_window is None or key_value_length < sliding_window:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py:119: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_len > self.max_seq_len_cached:

Other information

  • OS: Linux, CUDA
@prashantaithal

Any update on this?

@idruker-cerence
Author

idruker-cerence commented Jul 30, 2024

Any update on this?

Yes, kind of. Just replace the OnnxConversion pass in the config with OptimumConversion.
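
For reference, a minimal sketch of that substitution in the passes section (pass options omitted; which options OptimumConversion accepts depends on the Olive version, so treat this as an assumption to check against the docs):

"passes": {
    "optimum_conversion": {
        "type": "OptimumConversion"
    }
}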

@prashantaithal

Curious whether that generates an ONNX file with the kv-cache as part of the model, or whether the kv-cache has to be implemented separately. Also, can a Llama 2 7B ONNX model be generated via Olive the same way you generated the Mistral ONNX file?

@idruker-cerence
Author

The kv-cache is already part of the original model in PyTorch format; the kv_cache flag only controls whether it is preserved when converting to ONNX.

I have not tested Llama but assume it works similarly.
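
As a quick sanity check, the exported graph can be inspected to confirm the per-layer cache inputs survived the conversion. A minimal sketch, assuming the output path follows the engine config above (the exact file name is an assumption):

import onnx

# Path assumed from output_dir/output_name in the engine config above.
model = onnx.load("models/with-past/mistral.onnx")

# With the kv-cache preserved, this list should contain per-layer entries such
# as past_key_values.0.key / past_key_values.0.value alongside input_ids,
# attention_mask, and position_ids.
print([inp.name for inp in model.graph.input])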

@prashantaithal

Can you please share the steps for how you used the Olive config file to generate the ONNX model?

@idruker-cerence
Author

You call:

olive run --config config.json
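
Note that OptimumConversion delegates to Hugging Face Optimum's ONNX exporter, so the optimum package must be available in the environment (whether additional extras are needed is environment-specific; treat this as an assumption):

pip install optimum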
