
Quantize: CLI command to quantize input model #1356

Open · shaahji wants to merge 2 commits into main from shaahji/cliquant

Conversation

@shaahji (Contributor) commented Sep 13, 2024

Quantize: CLI command to quantize input model

Usage:
olive quantize -m <model-name> --device <cpu|gpu> --algorithms <awq,gptq> --data_config_path <data-config-path> -o <output-dir>

A few other code improvements:

  • Moved the global functions in olive/cli/base.py to be static members of BaseOliveCLICommand to avoid repeated imports in each CLI command implementation. These functions are only usable in the context of a CLI command implementation anyway.
  • Added new helper functions (add_data_config_options, add_hf_dataset_options, and add_accelerator_options) to BaseOliveCLICommand to avoid code duplication and standardize options across the different CLI command implementations (a rough sketch of one such helper follows below).
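
For illustration, a minimal sketch of what one of these shared helpers could look like; the option names and defaults below are assumptions, not the exact code in this PR:

# Illustrative sketch only: a shared argparse helper of the kind described above.
from argparse import ArgumentParser

class BaseOliveCLICommand:
    @staticmethod
    def add_accelerator_options(parser: ArgumentParser) -> None:
        """Add accelerator-related options shared by several CLI commands."""
        group = parser.add_argument_group("accelerator options")
        group.add_argument(
            "--device",
            choices=["cpu", "gpu", "npu"],
            default="cpu",
            help="Target device for the quantized/optimized model.",
        )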

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
  • Is this PR including examples changes? If yes, please remember to update example documentation in a follow-up PR.

(Optional) Issue link

@xiaoyu-work (Contributor)

Can you also add a unit test for this?

@samuel100 (Contributor)

Some feedback:

  • Whack-a-mole package install: I had to install auto-awq, which was not easy because the package name differs from the module name (the "awq module not found" error leads a user to try pip install awq, which is not correct).
  • It is not really clear what is expected in the data_config YAML/JSON. In general, data_configs are hard: what data does a user need to use, and how would different datasets impact the results? What do they need to put into the data_config? If the dataset impacts results, we really need excellent documentation and guidance on what a user needs to provide (including how to generate [synthetic] data). If the data does not really impact the results, can we drop the option and provide either dummy data or a static default (e.g. wikitext)? (A rough sketch of such a config follows this list.)
  • The help file should give more information on the different algorithms. For example, it would be good to know that AWQ will output a 4-bit model.
  • The --providers_list option assumes the user knows about ORT, e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.
  • How would a user evaluate the results for speed-up, memory utilization, and quality? Taking a step back, the motivation for quantization is to lower footprint and speed up execution without sacrificing efficacy on the task. The CLI command allows a user to try different algorithms (good), but it needs some evaluation information so that the user can decide on the "best" method.
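
For illustration, here is the rough shape such a calibration data config might take if a static dataset like wikitext were used as the default. The field names below are assumptions for illustration only, not necessarily Olive's exact data_config schema:

# Illustrative sketch only: field names are assumptions, not Olive's exact schema.
calibration_data_config = {
    "name": "wikitext2_train",
    "type": "HuggingfaceContainer",  # assumed container type
    "load_dataset_config": {
        "data_name": "wikitext",
        "subset": "wikitext-2-raw-v1",
        "split": "train",
    },
    "pre_process_data_config": {
        "max_seq_len": 2048,  # assumed field
    },
}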

An E2E flow here would be to run Quantization -> Capture the ONNX Graph for use in ORT. I therefore tried to take the output of quantization and run it through capture-onnx-graph. Below is what happened:

Firstly, I tried to use Dynamo exporter option:

olive capture-onnx-graph \
    --model_name_or_path models/qwen-awq/awq/cpu-cpu_model/model \
    --use_dynamo_exporter True \
    --use_ort_genai True \
    --output_path models/qwen-awq/captured \
    --device cpu \
    --log_level 1

This hit the following error:

[2024-09-17 09:35:36,294] [INFO] [engine.py:874:_run_pass] Running pass c:OnnxConversion
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
[2024-09-17 09:35:40,494] [ERROR] [engine.py:972:_run_pass] Pass run failed.
Traceback (most recent call last):
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/engine/engine.py", line 960, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path, pass_search_point)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/systems/local.py", line 30, in run_pass
    output_model = the_pass.run(model, output_model_path, point)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/olive_pass.py", line 206, in run
    output_model = self._run_for_config(model, config, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 116, in _run_for_config
    output_model = self._run_for_config_internal(model, config, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 149, in _run_for_config_internal
    return self._convert_model_on_device(model, config, output_model_path, device, torch_dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 367, in _convert_model_on_device
    converted_onnx_model = OnnxConversion._export_pytorch_model(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 205, in _export_pytorch_model
    pytorch_model(*dummy_inputs)
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1104, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 915, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 655, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 542, in forward
    query_states = self.q_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 243, in forward
    out = WQLinearMMFunction.apply(
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 47, in forward
    out = awq_ext.gemm_forward_cuda(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/usr/share/miniconda3/envs/build/lib/python3.11/site-packages/torch/include/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch. 

So, next I tried the Model Builder option. This worked... BUT it is not clear what really happened - I had to set the --precision int4 option... does that re-quantize the model using RTN? The model output from Model Builder was 1.3GB, compared to 0.6MB for the safetensors file after the AWQ quantization.

@shaahji (Contributor, Author) commented Sep 17, 2024

Whack-a-mole package install: I had to install auto-awq, which was not easy because the package name differs from the module name (the "awq module not found" error leads a user to try pip install awq, which is not correct).

Ironically, I ran into the same problem initially. I have some thoughts about addressing this at a broader level. Olive already knows the dependencies for each Pass; they are defined in olive_config.json. We could potentially iterate over them and verify that those dependencies are present even before we start running the passes. That avoids a long wait before a workflow fails after running a few expensive passes. Thoughts?
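
For illustration, a minimal sketch of such a pre-flight check, assuming olive_config.json maps each pass to the pip package names it needs (the "module_dependencies" key and the pass names in the usage comment are assumptions, not the file's actual schema):

# Illustrative sketch only: fail fast if a pass's pip packages are missing.
import json
from importlib import metadata

def find_missing_dependencies(olive_config_path, pass_names):
    """Return the pip package names that are not installed for the given passes."""
    with open(olive_config_path, encoding="utf-8") as f:
        olive_config = json.load(f)

    missing = []
    for pass_name in pass_names:
        # "module_dependencies" is an assumed key; adjust to the real layout.
        for package in olive_config.get("module_dependencies", {}).get(pass_name, []):
            try:
                # Checks the *distribution* name (e.g. "autoawq"), so it does not
                # trip over package-vs-module name mismatches like awq/autoawq.
                metadata.distribution(package)
            except metadata.PackageNotFoundError:
                missing.append(package)
    return missing

# Usage sketch: abort before any expensive pass runs.
# missing = find_missing_dependencies("olive_config.json", ["AutoAWQQuantizer", "GptqQuantizer"])
# if missing:
#     raise SystemExit(f"Missing packages: {', '.join(missing)}")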

It is not really clear what is expected in the data_config YAML/JSON.

The data_config requirement is removed in a follow-up commit. The command-line arguments are now in line with the finetune command, i.e. data_name, train_subset, eval_subset, etc.

The help file should give some more information on the different algorithms. For example, it would be good to know that AWQ will output a 4bit model.

There can never be enough information in help. :) One thing could be argued to be more important than another. I propose providing a link to each algorithm's documentation.

The --providers_list option assumes the user knows about ORT, e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.

I will add the available options to the choices list.
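
For illustration, one way the choices could be enumerated so that --help is self-explanatory; the provider list below is an assumption about which ORT execution providers would be offered, not the exact set in this PR:

# Illustrative sketch only: common ONNX Runtime execution providers offered as
# argparse choices; the exact set exposed by the CLI may differ.
from argparse import ArgumentParser

EXECUTION_PROVIDERS = [
    "CPUExecutionProvider",
    "CUDAExecutionProvider",
    "TensorrtExecutionProvider",
    "DmlExecutionProvider",
    "QNNExecutionProvider",
    "OpenVINOExecutionProvider",
    "ROCMExecutionProvider",
]

parser = ArgumentParser("olive quantize")
parser.add_argument(
    "--providers_list",
    nargs="*",
    choices=EXECUTION_PROVIDERS,
    default=["CPUExecutionProvider"],
    help="ONNX Runtime execution providers to target.",
)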

How would a user evaluate the results for speed-up, memory utilization, and quality? Taking a step back, the motivation for quantization is to lower footprint and speed up execution without sacrificing efficacy on the task. The CLI command allows a user to try different algorithms (good), but it needs some evaluation information so that the user can decide on the "best" method.

As I understand the intent, CLI commands are meant to "do one job only". For evaluation, we might introduce a separate CLI command that users can chain with this one.

@shaahji force-pushed the shaahji/cliquant branch 4 times, most recently from 6ec5767 to 7cc7299 on September 17, 2024 at 21:35
@shaahji (Contributor, Author) commented Sep 17, 2024

All comments/inputs addressed.

This is to avoid hardcoding these parameters in config files for models (like phi3) that aren't yet officially supported by auto-gptq.
Usage:
  olive quantize                  \
    -m <model-name>               \
    --trust_remote_code           \
    --device <cpu|gpu|npu>        \
    --algorithms <awq,gptq>       \
    --data_name <data-name>       \
    --train_subset <subset-name>  \
    --batch_size <batch-size>     \
    --tempdir <temp-dir>          \
    -o <output-dir>