
Quantize: CLI command to quantize input model #1356

Open · shaahji wants to merge 2 commits into main from shaahji/cliquant

Conversation

@shaahji (Contributor) commented Sep 13, 2024

Quantize: CLI command to quantize input model

Usage:
olive quantize -m <model-name> --device <cpu|gpu> --algorithms <awq,gptq> --data_config_path <data-config-path> -o <output-dir>

A few other code improvements:

  • Moved the global functions in olive/cli/base.py to be static members of BaseOliveCLICommand to avoid repeated imports in each CLI command implementation. These functions are only usable in the context of a CLI command implementation anyway.
  • Added new helper functions (add_data_config_options, add_hf_dataset_options, and add_accelerator_options) to BaseOliveCLICommand to avoid code duplication and standardize options across the different CLI command implementations (a rough sketch of one such helper follows below).
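
For illustration, a minimal sketch of what one of these shared helpers could look like; the option names and defaults below are assumptions, not the exact code in this PR:

# Illustrative sketch only: a shared argparse helper of the kind described above.
from argparse import ArgumentParser

class BaseOliveCLICommand:
    @staticmethod
    def add_accelerator_options(parser: ArgumentParser) -> None:
        """Add accelerator-related options shared by several CLI commands."""
        group = parser.add_argument_group("accelerator options")
        group.add_argument(
            "--device",
            choices=["cpu", "gpu", "npu"],
            default="cpu",
            help="Target device for the quantized/optimized model.",
        )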

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
  • Is this PR including examples changes? If yes, please remember to update example documentation in a follow-up PR.

(Optional) Issue link

@xiaoyu-work (Contributor)

Can you also add a unit test for this?

@samuel100 (Contributor)

Some feedback:

  • Whack-a-mole package install: I had to install auto-awq, which was not easy because the package name differs from the module name (the "awq module not found" error leads a user to try pip install awq, which is not correct).
  • It is not really clear what is expected in the data_config YAML/JSON. In general, data_configs are hard: what data does a user need to use, and how would different datasets impact the results? What do they need to put into the data_config? If the dataset impacts results, we really need excellent documentation and guidance on what a user needs to provide (including how to generate [synthetic] data). If the data does not really impact the results, can we drop the option and provide either dummy data or a static default (e.g. wikitext)? (A rough sketch of such a config follows this list.)
  • The help file should give more information on the different algorithms. For example, it would be good to know that AWQ will output a 4-bit model.
  • The --providers_list option assumes the user knows about ORT, e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.
  • How would a user evaluate the results for speed-up, memory utilization, and quality? Taking a step back, the motivation for quantization is to lower footprint and speed up execution without sacrificing efficacy on the task. The CLI command allows a user to try different algorithms (good), but it needs some evaluation information so that the user can decide on the "best" method.
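
For illustration, here is the rough shape such a calibration data config might take if a static dataset like wikitext were used as the default. The field names below are assumptions for illustration only, not necessarily Olive's exact data_config schema:

# Illustrative sketch only: field names are assumptions, not Olive's exact schema.
calibration_data_config = {
    "name": "wikitext2_train",
    "type": "HuggingfaceContainer",  # assumed container type
    "load_dataset_config": {
        "data_name": "wikitext",
        "subset": "wikitext-2-raw-v1",
        "split": "train",
    },
    "pre_process_data_config": {
        "max_seq_len": 2048,  # assumed field
    },
}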

An E2E flow here would be to run Quantization -> Capture the ONNX Graph for use in ORT. I therefore tried to take the output of quantization and run it through capture-onnx-graph. Below is what happened:

Firstly, I tried to use Dynamo exporter option:

olive capture-onnx-graph \
    --model_name_or_path models/qwen-awq/awq/cpu-cpu_model/model \
    --use_dynamo_exporter True \
    --use_ort_genai True \
    --output_path models/qwen-awq/captured \
    --device cpu \
    --log_level 1

This hit the following error:

[2024-09-17 09:35:36,294] [INFO] [engine.py:874:_run_pass] Running pass c:OnnxConversion
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
[2024-09-17 09:35:40,494] [ERROR] [engine.py:972:_run_pass] Pass run failed.
Traceback (most recent call last):
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/engine/engine.py", line 960, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path, pass_search_point)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/systems/local.py", line 30, in run_pass
    output_model = the_pass.run(model, output_model_path, point)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/olive_pass.py", line 206, in run
    output_model = self._run_for_config(model, config, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 116, in _run_for_config
    output_model = self._run_for_config_internal(model, config, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 149, in _run_for_config_internal
    return self._convert_model_on_device(model, config, output_model_path, device, torch_dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 367, in _convert_model_on_device
    converted_onnx_model = OnnxConversion._export_pytorch_model(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 205, in _export_pytorch_model
    pytorch_model(*dummy_inputs)
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1104, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 915, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 655, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 542, in forward
    query_states = self.q_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 243, in forward
    out = WQLinearMMFunction.apply(
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 47, in forward
    out = awq_ext.gemm_forward_cuda(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/usr/share/miniconda3/envs/build/lib/python3.11/site-packages/torch/include/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch. 

So, next I tried the Model Builder option. This worked... BUT it is not clear what really happened - I had to set the --precision int4 option... does that re-quantize the model using RTN? The model output from Model Builder was 1.3GB, compared to 0.6MB for the safetensors file after the AWQ quantization.

@shaahji (Contributor, Author) commented Sep 17, 2024

Whack-a-mole package install: I had to install auto-awq, which was not easy because the package name differs from the module name (the "awq module not found" error leads a user to try pip install awq, which is not correct).

Ironically, I ran into the same problem initially. I have some thoughts about addressing this at a broader level. Olive already knows the dependencies for each Pass; they are defined in olive_config.json. We could potentially iterate over them and verify that those dependencies are present even before we start running the passes. That avoids a long wait before a workflow fails after running a few expensive passes. Thoughts?
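
For illustration, a minimal sketch of such a pre-flight check, assuming olive_config.json maps each pass to the pip package names it needs (the "module_dependencies" key and the pass names in the usage comment are assumptions, not the file's actual schema):

# Illustrative sketch only: fail fast if a pass's pip packages are missing.
import json
from importlib import metadata

def find_missing_dependencies(olive_config_path, pass_names):
    """Return the pip package names that are not installed for the given passes."""
    with open(olive_config_path, encoding="utf-8") as f:
        olive_config = json.load(f)

    missing = []
    for pass_name in pass_names:
        # "module_dependencies" is an assumed key; adjust to the real layout.
        for package in olive_config.get("module_dependencies", {}).get(pass_name, []):
            try:
                # Checks the *distribution* name (e.g. "autoawq"), so it does not
                # trip over package-vs-module name mismatches like awq/autoawq.
                metadata.distribution(package)
            except metadata.PackageNotFoundError:
                missing.append(package)
    return missing

# Usage sketch: abort before any expensive pass runs.
# missing = find_missing_dependencies("olive_config.json", ["AutoAWQQuantizer", "GptqQuantizer"])
# if missing:
#     raise SystemExit(f"Missing packages: {', '.join(missing)}")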

It is not really clear what is expected in the data_config YAML/JSON.

The data_config requirement is removed in a follow-up commit. The command-line arguments are now in line with the finetune command, i.e. data_name, train_subset, eval_subset, etc.

The help file should give some more information on the different algorithms. For example, it would be good to know that AWQ will output a 4bit model.

There can never be enough information in help. :) One thing could be argued to be more important than another. I propose providing a link to each algorithm's documentation.

The --providers_list option assumes the user knows about ORT, e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.

I will add the available options to the choices list.
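
For illustration, one way the choices could be enumerated so that --help is self-explanatory; the provider list below is an assumption about which ORT execution providers would be offered, not the exact set in this PR:

# Illustrative sketch only: common ONNX Runtime execution providers offered as
# argparse choices; the exact set exposed by the CLI may differ.
from argparse import ArgumentParser

EXECUTION_PROVIDERS = [
    "CPUExecutionProvider",
    "CUDAExecutionProvider",
    "TensorrtExecutionProvider",
    "DmlExecutionProvider",
    "QNNExecutionProvider",
    "OpenVINOExecutionProvider",
    "ROCMExecutionProvider",
]

parser = ArgumentParser("olive quantize")
parser.add_argument(
    "--providers_list",
    nargs="*",
    choices=EXECUTION_PROVIDERS,
    default=["CPUExecutionProvider"],
    help="ONNX Runtime execution providers to target.",
)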

How would a user evaluate the results for speed-up, memory utilization, and quality? Taking a step back, the motivation for quantization is to lower footprint and speed up execution without sacrificing efficacy on the task. The CLI command allows a user to try different algorithms (good), but it needs some evaluation information so that the user can decide on the "best" method.

As I understand the intent, CLI commands are meant to "do one job only". For evaluation, we might introduce a separate CLI command that users can chain with this one.

@shaahji force-pushed the shaahji/cliquant branch 4 times, most recently from 6ec5767 to 7cc7299 on September 17, 2024 at 21:35
@shaahji (Contributor, Author) commented Sep 17, 2024

All comments/inputs addressed.

This is to avoid hardcoding these parameters in config files for models (like phi3) that aren't yet officially supported by auto-gptq.
Usage:
  olive quantize                  \
    -m <model-name>               \
    --trust_remote_code           \
    --device <cpu|gpu|npu>        \
    --algorithms <awq,gptq>       \
    --data_name <data-name>       \
    --train_subset <subset-name>  \
    --batch_size <batch-size>     \
    --tempdir <temp-dir>          \
    -o <output-dir>