
Enable Ascend NPU support #1758

Open · wants to merge 2 commits into base: main
Conversation

MengqingCao

Description

Enable the Ascend NPU backend for finetuning, inference, and the Gradio web UI.
Main changes:

  • replace the hard-coded CUDA handling with a device abstraction (see the sketch after this list)
  • add NPU-related configuration constraints
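
A rough sketch of what such a device abstraction could look like is below; the helper names (`get_device`, `empty_cache`) are illustrative assumptions, not necessarily the exact functions added in this PR.

```python
# Hypothetical sketch of the device abstraction; names are illustrative.
import torch

try:
    # torch_npu registers the "npu" device with PyTorch on Ascend machines.
    import torch_npu  # noqa: F401
except ImportError:
    pass


def get_device() -> str:
    """Return the best available accelerator instead of hard-coding "cuda"."""
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "npu") and torch.npu.is_available():
        return "npu"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def empty_cache() -> None:
    """Device-agnostic replacement for torch.cuda.empty_cache()."""
    device = get_device()
    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "npu":
        torch.npu.empty_cache()
```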

Motivation and Context

There are two benefits:

  1. Abstracting the device makes it easier for more backends to plug in, and Ascend NPU is a good example.
  2. Allow Ascend NPU users to use axolotl for LLM finetuning and inference.

Example

# preprocess datasets - optional but recommended
ASCEND_RT_VISIBLE_DEVICES=0 python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

# finetune lora
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

# inference
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out"

# gradio
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out" --gradio

Screenshots

NPU-supported CLI inference: [screenshot: axolotl_cli_chat]

NPU-supported Gradio web UI inference: [screenshot: axolotl_cli_chat_gradio]

Config

lora.yml

base_model: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: true
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path:
val_set_size: 0.02
adapter: lora
lora_model_dir:
sequence_len: 1024
sample_packing: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./outputs/lora-out
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_torch
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
float32: true
bf16: false
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank: 0
logging_steps: 1
xformers_attention:
flash_attention: false
gptq_groupsize:
s2_attention:
gptq_model_v1:
warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

max_memory = None

model_kwargs["device_map"] = device_map
set_model_device(cfg, max_memory, model_config, model_kwargs, device_map)
Collaborator

The way Python passes these by reference and updates them inside the function feels a bit awkward here. I'm not sure right now what a good solution would be to make this more obvious.

Author

I think the simplest way is to make model_kwargs the return value of set_model_device. That is simple, but the effect would look much the same as the current code, and since model_kwargs is mutable anyway, this probably makes little difference.

A more involved solution would be to write a ModelKwargs class with member functions such as __init__, update_model_device, update_dtype, update_attention, update_quantization, and so on. These functions would be called from load_model, making load_model and the changes to model_kwargs clearer. However, this would bring a lot of changes into src/axolotl/utils/models.py and might introduce issues, so some time is needed to validate it.
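
To make that second option concrete, a minimal sketch of such a class is below; the method names come from this comment, but the bodies and signatures are illustrative assumptions, not actual axolotl code.

```python
# Minimal sketch of the ModelKwargs idea; bodies are illustrative assumptions.
from typing import Any, Dict


class ModelKwargs:
    """Builds the kwargs passed to from_pretrained through named update steps."""

    def __init__(self, cfg: Any) -> None:
        self.cfg = cfg
        self.kwargs: Dict[str, Any] = {}

    def update_model_device(self, device_map: Any, max_memory: Any = None) -> None:
        # Mutation is confined to clearly named methods instead of a free function.
        self.kwargs["device_map"] = device_map
        if max_memory is not None:
            self.kwargs["max_memory"] = max_memory

    def update_dtype(self, torch_dtype: Any) -> None:
        self.kwargs["torch_dtype"] = torch_dtype

    # update_attention, update_quantization, etc. would follow the same pattern.

    def build(self) -> Dict[str, Any]:
        return dict(self.kwargs)
```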

@MengqingCao
Author

Good day, @winglian! I tried to create a ModelKwargs class, but alongside the modification of model_kwargs there are many other operations, such as patching and creating models, and their conditional logic seems inseparable.

So, in the end, I refactored the whole load_model function into a ModelLoader class. All the operations of the original load_model have been placed into several member functions, following the original logical order.

This brings a lot of changes, but it makes the model-loading pipeline clearer. Moreover, changes to member variables such as model_kwargs are now more obvious. I am not sure, though, whether the current function naming and the way the pipeline is split are completely reasonable.
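
Roughly, the new structure looks like the sketch below; the method names and signatures here are illustrative assumptions rather than the exact code in this PR.

```python
# Rough sketch of the ModelLoader refactor; names are illustrative assumptions.
from typing import Any, Tuple


class ModelLoader:
    """Runs the steps of the original load_model in their original order."""

    def __init__(self, cfg: Any, tokenizer: Any, inference: bool = False) -> None:
        self.cfg = cfg
        self.tokenizer = tokenizer
        self.inference = inference
        self.model_kwargs: dict = {}  # mutations are now visible on self.model_kwargs

    def load(self) -> Tuple[Any, Any]:
        # Each step below corresponds to a block of the original load_model.
        self.apply_patches()
        self.set_device_map_config()
        self.set_quantization_config()
        self.set_attention_config()
        model = self.build_model()
        return model, getattr(model, "peft_config", None)

    # The member functions keep the original logic; bodies omitted in this sketch.
    def apply_patches(self) -> None: ...
    def set_device_map_config(self) -> None: ...
    def set_quantization_config(self) -> None: ...
    def set_attention_config(self) -> None: ...
    def build_model(self) -> Any: ...
```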

Please review the latest code and give me some suggestions. Thanks a lot!

@MengqingCao
Author

Hi @winglian, could you help review the latest code in this PR? Let me know if the breaking changes brought in by the refactoring of the original code are not what you want.

Just FYI, I accidentally deleted the original commit; it can be found in this branch.

  1. add Ascend NPU backend support
  2. refactor func load_model in src/axolotl/utils/models.py
  3. refactor load_in_8bit as a kwarg

@Yikun

Yikun commented Sep 12, 2024

It looks like these commits include two parts: the model loader refactor and the Ascend NPU support. Maybe we could split this into two PRs: the first would be the model loader refactor, and then we would rebase the Ascend NPU support PR on top of it.

Or do you have any other suggestions? @winglian, please feel free to let us know if you have any more concerns. Thanks!

@MengqingCao
Author

The ModelLoader refactoring has been split out into #1909, and the Ascend NPU support will be committed after #1909. Hopefully this makes it easier to review and test. cc @winglian
