
extract_lora.py improvements and fixes #333

Merged · 9 commits merged into arcee-ai:main from extract_lora.py-changes on Aug 5, 2024

Conversation

@jukofyork (Contributor) commented May 27, 2024

Changes made:

  • General tidy-up and the addition of much more upfront error checking of dimensions via validate_and_combine_details(), etc.
  • low_rank_decomposition() now distributes sqrt(diag(S)) between both L and R to improve numerical precision (see the sketch below this list).
  • Now uses a max_rank parameter, reduces the rank per-tensor as necessary, and records any tensors with rank below the maximum in rank_pattern and alpha_pattern as needed.
  • Added code to save the special/transposed lora_embedding_A and lora_embedding_B tensors.
  • Added code to save undecomposable modules (e.g. norms) in the LoRA (use --skip-undecomposable to turn this off).
  • Added transformers.pytorch_utils.Conv1D as a linear type that can be decomposed (if any LLMs use them?).
  • Truncates any tensors where finetuned_size[0] > base_size[0], indicating extra tokens have been added (this may or may not work depending on the model and the desired use for the LoRA). It would be best to limit this to just the embed and output tensors, but I can't see any reliable way to test for output tensors other than by name...
  • Added a --verbose option to print debugging information about tensor sizes, module types/names, etc.

See #328 for more details.
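A minimal sketch of the sqrt(S) split mentioned in the list above (illustrative only, not the PR's exact code; the name mirrors the script's low_rank_decomposition(), but the signature here is assumed):

    import torch

    # Sketch: split sqrt(S) across both factors so L and R end up at a similar
    # scale, instead of folding all of S into a single factor.
    def low_rank_decomposition(delta: torch.Tensor, rank: int):
        U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)
        rank = min(rank, S.numel())
        sqrt_S = torch.diag(torch.sqrt(S[:rank]))
        L = U[:, :rank] @ sqrt_S   # (out_features, rank), lora_B-like
        R = sqrt_S @ Vh[:rank, :]  # (rank, in_features), lora_A-like
        return L, R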


I have successfully used this to take and reapply a full-rank LoRA of Mistral-7B-Instruct-v0.2:

mergekit-extract-lora --verbose --device cuda --rank 4096 ./Mistral-7B-Instruct-v0.2 ./Mistral-7B-v0.1  Mistral-7B-Instruct-v0.2-LoRA
mergekit-yaml --verbose --lora-merge-cache lora-cache mistral-test.yaml mistral-test
cp ./Mistral-7B-Instruct-v0.2/config.json ./Mistral-7B-Instruct-v0.2/tokenizer_config.json ./mistral-test
llama.cpp/convert.py mistral-test --outfile mistral-test-f16.gguf --outtype f16

The cp ./Mistral-7B-Instruct-v0.2/config.json ./Mistral-7B-Instruct-v0.2/tokenizer_config.json ./mistral-test step was just to make sure the correct "rope_theta": 1000000.0 setting and the correct "[INST]" prompt template were used, for a fair test...

mistral-test.yaml:

merge_method: linear
parameters:
  weight: 1.0
models:
  - model: ./Mistral-7B-v0.1+./Mistral-7B-Instruct-v0.2-LoRA
dtype: float16
tokenizer_source: model:Mistral-7B-Instruct-v0.2

and it produced a few hundred identical tokens before very slightly diverging, due to either the SVD or the bfloat16 -> float16 conversion (see here).

I haven't tested the changes much beyond this, but since get_model_details() mainly tests against torch.nn.Embedding, torch.nn.Linear, etc., it should be fairly robust against tensor name changes in other models.

Feel free to make any changes you think are needed and hope it helps :)

@jukofyork (Contributor, Author)

I also removed the min_fraction_variance option as it wasn't really useful: any selective truncation really needs to be relative to the Frobenius (or other) norm of the original matrix, otherwise the raw number doesn't tell you anything about the effect on the final model.

@jukofyork (Contributor, Author) commented May 27, 2024

One thing that might work and be acceptable for archival purposes:

        if base_type != "to_save" and finetuned_size[0] > base_size[0]:
            assert base_size[1] == finetuned_size[1], f"Column dimension mismatch in layer '{base_name}': {base_size} != {finetuned_size}"
            base_type = "to_save"
        else:
            assert base_size == finetuned_size, f"Dimension mismatch in layer '{base_name}': {base_size} != {finetuned_size}"

        module_details.append((base_type, base_name))

This would then just save the tensors in full in the LoRA in the same way as the norms are saved. This sidesteps all the hassle with the added_tokens.json and new_embeddings.safetensors stuff, and ultimately has the same effect (so long as the + syntax of Mergekit handles this case properly - I haven't tested it).
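For reference, this is roughly the shape of adapter_config.json that PEFT understands, with full-size modules listed under modules_to_save and per-tensor rank overrides under rank_pattern / alpha_pattern (the values and module names here are made up for illustration, and the exact fields the script writes may differ):

    # Illustrative config dict, matching the json.dump(lora_config, ...) call
    # shown later in this thread; all values are placeholders.
    lora_config = {
        "peft_type": "LORA",
        "r": 128,
        "lora_alpha": 128,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "embed_tokens"],
        "rank_pattern": {"model.layers.0.self_attn.q_proj": 64},
        "alpha_pattern": {"model.layers.0.self_attn.q_proj": 64},
        "modules_to_save": ["model.norm"],
    }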


I also thought of a way this could be improved:

        # NOTE: Should check only embedding/output types, but no clear way to test for "output" type...
        if base_type != "to_save" and finetuned_size[0] > base_size[0]:

since it's possible to accurately detect an embedding layer without looking at the name:

        if isinstance(module, torch.nn.Embedding):
            module_details.append(("embedding", name, module.weight.size()))

and AFAIK, the output tensors are always the same dimensions.

So after the loop in get_model_details() finishes, we could look for any modules whose dimensions match, mark them with a new "output" type, and use that in the code above to restrict the row-growth check to "embedding" and "output" types only.
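A rough sketch of that idea (the type labels and helper name are illustrative, not the PR's actual code):

    # After get_model_details() has collected (type, name, size) tuples, retag
    # any linear module whose weight has the same shape as an embedding as
    # "output", so only "embedding"/"output" modules may gain extra rows.
    def mark_output_modules(module_details):
        embedding_sizes = {
            size for kind, name, size in module_details if kind == "embedding"
        }
        return [
            ("output", name, size)
            if kind == "linear" and size in embedding_sizes
            else (kind, name, size)
            for kind, name, size in module_details
        ]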

Alternatively this could use the logic you have in mergekit/_data/architectures to match via the name using the post_weights field, etc.

@thomasgauthier (Contributor)

LoRD author and feature owner here. Thank you @jukofyork this is great!

Couple of things I'm considering:

  • Truncating the embedding matrices works but is not ideal, as we lose some trained embeddings. I think we should make it possible to extend the base embedding matrices to match the size of the finetune instead of truncating the finetune. I'm confident we can get LoRA extraction on par with extended vocab LoRA training as demonstrated in this notebook. The only caveat is that resize_token_embeddings needs to be called on the base model before applying the LoRA for inference (a sketch is below this comment). I'm afraid this will not play well with frontends like ooba. I think we can mitigate this by adding a note in the generated model card. Ideally the user should be able to choose whether they want to truncate or extend embeddings using a flag.
  • Regarding the difficulty of reliably identifying input and output tensors, I believe get_input_embeddings and get_output_embeddings are what we are looking for.
  • Nitpicking: I would probably rename process_module_details to extract_lora or something.

If we can get those changes in I think this should be good to merge.
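A minimal sketch of the inference-time resize mentioned in the first point, assuming the tokenizer was saved alongside the extracted adapter (paths are placeholders):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("./Mistral-7B-v0.1")
    # Tokenizer assumed to be saved with the adapter; otherwise load it from the finetune.
    tokenizer = AutoTokenizer.from_pretrained("./Mistral-7B-Instruct-v0.2-LoRA")

    # Grow the embedding/output matrices to the finetune's vocab size so the
    # LoRA tensors line up row-for-row before the adapter is applied.
    base.resize_token_embeddings(len(tokenizer))
    model = PeftModel.from_pretrained(base, "./Mistral-7B-Instruct-v0.2-LoRA")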

@jukofyork (Contributor, Author)

> LoRD author and feature owner here. Thank you @jukofyork this is great!
>
> Couple of things I'm considering:
>
> * Truncating the embedding matrices works but is not ideal, as we lose some trained embeddings. I think we should make it possible to extend the base embedding matrices to match the size of the finetune instead of truncating the finetune. I'm confident we can get LoRA extraction on par with extended vocab LoRA training as demonstrated in this [notebook](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_with_additional_tokens.ipynb). The only caveat is that `resize_token_embeddings` [needs to be called](https://huggingface.co/docs/peft/developer_guides/troubleshooting#:~:text=for%20inference%2C%20load%20the%20base%20model%20first%20and%20resize%20it%20the%20same%20way%20you%20did%20before%20you%20trained%20the%20model.) on the base model before applying the LoRA for inference. I'm afraid this will not play well with frontends like ooba. I think we can mitigate this by adding a note in the generated model card. Ideally the user should be able to choose whether they want to truncate or extend embeddings using a flag.
>
> * Regarding the difficulty of reliably identifying input and output tensors, I believe [get_input_embeddings](https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/model#transformers.PreTrainedModel.get_input_embeddings) and [get_output_embeddings](https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/model#transformers.PreTrainedModel.get_output_embeddings) are what we are looking for.
>
> * Nitpicking: I would probably rename `process_module_details` to `extract_lora` or something.
>
> If we can get those changes in I think this should be good to merge.

Hey!

Yeah, I did actually have the code all working to save the extra safetensors file (or whatever it was called), but when I saw mistral-instruct-v0.3 had prepended their extra tokens I realised it wasn't going to be possible to assume the extra vocab is at the end, and removed it.

I still found it a bit odd that these extra (n_new_tokens, hidden_size) dense tensors were also covered by the low-rank embeddings - I would have expected the low-rank embeddings to keep the base model's dimensions and these dense pieces to be the extra part, rather than having both. There weren't really many examples of these files either, so I couldn't really see what was in the low-rank embeddings at these positions.

I'm still none the wiser why the embeddings have a different name and are transposed either - all I can assume is that this is for other types of non-language models that might have different embeddings, or possibly something to do with the way they are indexed into instead of multiplied?

> Nitpicking: I would probably rename `process_module_details` to `extract_lora` or something.

Feel free to make or propose any changes you see fit :)

@jukofyork (Contributor, Author) commented Jun 9, 2024

I forgot to add: I think to make this really bulletproof, the best option would be to try creating some real LoRAs for the especially gnarly cases like mistral-instruct-v0.3 to test against. They wouldn't really need to be trained for more than one step to get an idea of whether the extracted LoRAs match up with what they populate, use the right names, etc.
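For example, something along these lines (hyperparameters and paths are placeholders) would give a reference adapter whose tensor names and shapes could be compared against the extracted one:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("./Mistral-7B-Instruct-v0.3")
    config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj", "embed_tokens", "lm_head"],
    )
    peft_model = get_peft_model(model, config)
    # ...run a single optimizer step on any small batch, then:
    peft_model.save_pretrained("./reference-lora")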

@jukofyork (Contributor, Author) commented Jun 10, 2024

Just saw this on reddit:

https://www.reddit.com/r/LocalLLaMA/comments/1dc7cqi/why_dont_we_see_loras_repositories_and_why_arent/

It would be pretty cool if we could get this to the stage of being useful for archival purposes - I have 3x4TB external drives that I keep squirrelling away models on!

I did notice, back when the code to cut off based on the fraction of variance explained was still in, that it was very clear which models had been created by applying LoRAs, and even which tensors the LoRAs had targeted (dolphin-70b being the most obvious, with r=32).

If we assume these are going to eventually be quantized, then we can use the same MSE criterion used for quantization:

https://en.m.wikipedia.org/wiki/Rate%E2%80%93distortion_theory

for:

Q = base + lora - finetune

and then use a rank large enough so that the maximum error in Q is around 1/2 the expected maximum quantization error of, say, q8_0 (or another 8-bit quant format). There should then be minimal loss from archiving the LoRA instead of the full model:

https://en.m.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem

There are several LLM quantization papers that have concluded that outlier weights/activations are very important, so minimising the maximum error is definitely the criterion to use.

It would take some experimentation to figure out the exact link between (row-wise or block-wise) quantization noise and (tensor-wise) SVD noise in terms of maximum error, but I'm pretty sure that even with the threshold set well below the Nyquist rate for safety, the storage requirements would be an order of magnitude less for full-finetuned models and several orders of magnitude less for LoRA-trained models.
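As a very rough sketch of that rank-selection idea (not part of the PR; the function and error budget are illustrative):

    import torch

    # Grow the rank until the worst-case element of Q = base + lora - finetuned
    # drops below an error budget, e.g. some fraction of the expected q8_0
    # quantization step. (Unoptimised: recomputes the reconstruction per rank.)
    def choose_rank(base: torch.Tensor, finetuned: torch.Tensor, max_abs_err: float) -> int:
        delta = (finetuned - base).float()
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        for rank in range(1, S.numel() + 1):
            approx = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]
            if (delta - approx).abs().max().item() <= max_abs_err:
                return rank
        return S.numel()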

@thomasgauthier (Contributor)

@jukofyork can you add me to your repo? I have some changes I would like to push before closing this PR

@jukofyork (Contributor, Author)

> @jukofyork can you add me to your repo? I have some changes I would like to push before closing this PR

Done.

@jukofyork (Contributor, Author) left a comment

Just having a look through the changes and noticed this:

        if module == pretrained_model.get_input_embeddings():
            # if isinstance(module, torch.nn.Embedding):
            module_details.append(("embedding", name, module.weight.size()))
        elif module == pretrained_model.get_output_embeddings():
            # if isinstance(module, torch.nn.Embedding):
            module_details.append(("output", name, module.weight.size()))

Might be worth testing this on a model where there is no separate output, e.g.:

https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/model.safetensors.index.json

IIRC, it just uses the transpose of the model.embed_tokens.weight matrix instead.
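A hedged sketch of the kind of check that might be needed (variable names follow the snippet above; not tested):

    # On models with tied embeddings (e.g. Command R) there is no separate output
    # matrix: get_output_embeddings() may return None or share its weight with the
    # input embeddings, so an "output" entry should only be added when it is distinct.
    in_emb = pretrained_model.get_input_embeddings()
    out_emb = pretrained_model.get_output_embeddings()
    has_separate_output = (
        out_emb is not None
        and out_emb.weight is not in_emb.weight
        and not getattr(pretrained_model.config, "tie_word_embeddings", False)
    )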

@jukofyork (Contributor, Author) commented Jul 18, 2024

It's been a couple of months since I last looked at this, but this is the code I had originally to calculate and save the "new_embeddings.safetensors" file for when the vocab gets extended:

    lora_weights = {}
    new_embedding_weights = {}
    rank_pattern = {}

    # Embedding weights
    for layer_name in tqdm(embedding_names):
                
        base_weight = base_loader.get_tensor(f"{layer_name}.weight")
        finetuned_weight = finetuned_loader.get_tensor(f"{layer_name}.weight")
        
        padding_size = finetuned_weight.shape[0] - base_weight.shape[0]
        if padding_size > 0:
            base_weight = pad(base_weight, (0, 0, 0, padding_size), "constant", 0)
            #finetuned_weight = finetuned_weight[:base_weight.shape[0]]
            extra_weight = finetuned_weight[-padding_size:, :]
            new_embedding_weights["input_embeddings"] = extra_weight
            print(
                f"[P] input_dims=({base_weight.shape}), "
                f"output_dims=({finetuned_weight.shape}), "
                f"padded_dims=({extra_weight.shape})"
            )
            
        #padding_size = new_weight.shape[0] - base_weight.shape[0]
        #if padding_size > 0 :
        #    new_weight = new_weight[:base_weight.shape[0], :base_weight.shape[1]]
        #    new_weight = new_weight[:base_weight.shape[0]] ???
        
        lora_embedding_A, lora_embedding_B = decompose_delta_weight(
            finetuned_weight.T, base_weight.T, desired_rank, min_fraction_variance, device=device
        )
        
        lora_weights[f"base_model.model.{layer_name}.lora_embedding_A"] = lora_embedding_A.to(
            "cpu"
        ).contiguous()
        lora_weights[f"base_model.model.{layer_name}.lora_embedding_B"] = lora_embedding_B.to(
            "cpu"
        ).contiguous()
        
        # Add the final rank to the rank_pattern dictionary only if it is different from the desired_rank
        final_rank = lora_embedding_A.shape[0]
        if final_rank != desired_rank:
            rank_pattern[layer_name] = final_rank
        
        print(
            f"[E] {layer_name}: final_rank={final_rank}, "
            f"input_dims=({base_weight.shape}), "
            f"output_dims=({lora_embedding_A.shape}, {lora_embedding_B.shape})"
        )
.
.
.
    with open(os.path.join(out_path, "adapter_config.json"), "w") as f:
        json.dump(lora_config, f, indent=2)

    save_file(lora_weights, os.path.join(out_path, "adapter_model.safetensors"))
    
    if new_embedding_weights:
        save_file(new_embedding_weights, os.path.join(out_path, "new_embeddings.safetensors"))

but then when I saw mistral-instruct-v0.3 had prepended their new tokens, I scrapped the idea as I couldn't see any easy way to identify strange cases like this, and my padding_size deduction wasn't going to be easy to work out without actually loading up and reading the models' JSON files.

I also never really found out why the extra vocab weights seem to get added to both the low-rank weights and the dense weights in the "new_embeddings.safetensors" file (from examination of the few real LoRAs on Hugging Face that have this file), and as a result I wasn't 100% sure how they should be set. My best guess was to set the low-rank weights to whatever was found via truncated SVD, then take the deltas between the truncated low-rank reconstruction and the actual original weights, and use those deltas in the dense "new_embeddings.safetensors" file - but I never got as far as that, IIRC.

@thomasgauthier (Contributor)

So when finetuning extra vocab into a model and including "embed_tokens" in the LoRA target_modules, PEFT will save 3 tensors per embedding tensor. In the case of a Mistral v0.1 finetune we get the base embedding weights 'base_model.model.embed_tokens.base_layer.weight' and the two LoRA delta matrices to be applied on top: 'base_model.model.embed_tokens.lora_A.weight' and 'base_model.model.embed_tokens.lora_B.weight'.

Example

I'm not sure about new_embeddings.safetensors, it seems to be vLLM specific.

Regarding vocab order, I think in the case of a model like Mistral v0.3 where tokens are prepended, the SVD would simply produce an approximation of the transformation to get from the original (extended) embeddings to the finetuned ones, disregarding token order and simply operating on weight values. I have a feeling this is fine given a high enough rank. A more thorough solution would be to reorder the rows of the base embedding matrix before running SVD so that the token order between the finetuned model and the base is identical. But Mistral is an edge case in that regard and I'm not sure it's worth the extra complexity, as the current implementation works fine for non-esoteric cases.
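A hypothetical sketch of that row-reordering idea (the vocab dicts would come from the two tokenizers' get_vocab(); this is not something the current code does):

    import torch

    # Permute the base embedding matrix into the finetuned model's token order,
    # matching rows by token string; tokens absent from the base vocab are left
    # as zero rows here (any init scheme could be substituted).
    def align_base_embeddings(base_emb, base_vocab, finetuned_vocab):
        aligned = torch.zeros(len(finetuned_vocab), base_emb.shape[1], dtype=base_emb.dtype)
        for token, ft_id in finetuned_vocab.items():
            base_id = base_vocab.get(token)
            if base_id is not None:
                aligned[ft_id] = base_emb[base_id]
        return aligned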

@thomasgauthier (Contributor)

@jukofyork I'm merging this and closing the PR. If you have more thoughts / code related to archival and compression feel free to open up another issue or PR. Thank you for your contribution!

@thomasgauthier merged commit f086664 into arcee-ai:main on Aug 5, 2024 (2 of 4 checks passed)
@jukofyork (Contributor, Author)

> @jukofyork I'm merging this and closing the PR. If you have more thoughts / code related to archival and compression feel free to open up another issue or PR. Thank you for your contribution!

Thanks - I think the next thing I was going to look at is trying to set/infer the rank automatically.

@jukofyork deleted the extract_lora.py-changes branch on August 5, 2024.