Not sure about the details of arrow routing #107

Open
Weifan1226 opened this issue Sep 1, 2024 · 4 comments

Comments

@Weifan1226

Hi Team!

I have recently been working on implementing the Arrow Routing algorithm in my project, but I'm facing a challenge due to my limited expertise, particularly in understanding the concept of "token in layer l" within the algorithm. My current understanding is that the hidden state after the attention layer serves as h_l. However, the output of a transformer layer typically has shape [batch_size, tokens, hidden_size], and I am uncertain how to proceed from this point.

Additionally, I am seeking clarity on the phrase "it routes differently in every layer and token, increasing the overall model expressivity." Does this imply a per-token routing mechanism? My interpretation is that the output of each transformer layer determines which LoRA adjustments are applied to that layer.

[Screenshot: 2024-09-01 23:25:38]

I would appreciate any guidance or insights you could provide to help me better understand these aspects of the Arrow Routing algorithm.

Thank you!

Best regards,
Fanjunduo Wei

@Weifan1226
Author

I think I now understand the "per token" concept; it works like an MoE router. But my new question is: if the Arrow router selects LoRA A based on token a and LoRA B based on token b, then after processing token a with base model + A, do I need to remove LoRA A's weights, load LoRA B into the base model, and then process token b?

@pclucas14
Contributor

Hi!

Thanks for your interest in our work. Let me try to clarify Arrow routing.
Just like an MoE router, each token at each layer is routed individually. One difference, however, is that in a typical MoE, tokens are routed to MLP / FFN experts, whereas our experts are simple linear layers.

For example, on the Mistral model, say that we train LoRAs on the gate_proj, up_proj and down_proj of the MLP layer. Then, for each of gate_proj, up_proj and down_proj we have a LoRA adapter, meaning that within a given MLP block each token gets routed 3 times, once for each linear layer. In a standard MoE, each token would be routed only once for the whole MLP block.

I am not sure I fully understood the last point regarding the LoRA A and B matrices. Whenever a token is routed to a given LoRA expert, that token is processed by both the A and B projections of that adapter.
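To make the per-token picture concrete, here is a minimal numpy sketch of routing for a single linear layer. This is illustrative only, not the actual mttl implementation: the softmax gate, temperature, and all names are assumptions. The key point it demonstrates is that every expert is applied in parallel and mixed per token, so no adapter is ever loaded or unloaded between tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def arrow_linear(H, W, As, Bs, protos, temperature=1.0):
    """Per-token Arrow-style routing for one linear layer (illustrative sketch).

    H:      [n_tokens, d_in]   token hidden states entering this layer
    W:      [d_out, d_in]      frozen base weight
    As:     list of [r, d_in]  LoRA down-projections, one per expert
    Bs:     list of [d_out, r] LoRA up-projections, one per expert
    protos: [n_experts, d_in]  top right singular vector of each B_e @ A_e
    """
    # Score each token against each expert prototype. The absolute value is
    # used because singular vectors are only defined up to sign.
    scores = np.abs(H @ protos.T)                  # [n_tokens, n_experts]
    probs = softmax(scores / temperature, axis=-1)

    out = H @ W.T                                  # frozen base path
    for e, (A, B) in enumerate(zip(As, Bs)):
        # Every expert processes every token; each token then weights each
        # expert's output by its own routing probability.
        out += probs[:, e:e + 1] * (H @ A.T @ B.T)
    return out, probs
```

Because the mixing happens in the forward pass, tokens a and b in the same batch can favour different experts simultaneously; nothing is merged into or removed from the base weights.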

Hopefully that clarifies a few things!
Lucas

@herkerser

Hi Team!

I have recently been working on implementing the Arrow Routing algorithm in my project. However, I'm facing a challenge due to my limited expertise, particularly in understanding the concept of the "first right singular vector" within the algorithm. The algorithm uses the V matrix, but when I went through the code I found that the U matrix is used to compute the top vector. Why is that?

@pclucas14
Contributor

Hi,

Thanks for reaching out. This is because the notation in the paper does not quite match our implementation: in the code we use output = (W + BA)x, while in the paper we use output = (W + AB)x. So if you transpose the matrices, the Us become the Vs and vice versa.
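This transpose relationship is easy to verify numerically. The sketch below (illustrative, not the mttl code) builds a LoRA delta under the code's (W + BA)x convention and checks that its top right singular vector equals, up to sign, the top left singular vector (a column of U) of the transposed delta, which matches the paper's (W + AB)x convention:

```python
import numpy as np

# Right singular vectors of M are left singular vectors of M^T:
# M = U S V^T  implies  M^T = V S U^T. So the paper's V corresponds
# to the code's U once the factorization is transposed.
rng = np.random.default_rng(0)
d, r = 16, 4
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection

delta = B @ A                          # LoRA delta under output = (W + BA)x
_, _, Vh = np.linalg.svd(delta)        # rows of Vh: right singular vectors
U_t, _, _ = np.linalg.svd(delta.T)     # columns of U_t: left singular vectors of delta^T

v = Vh[0]        # top right singular vector of delta
u = U_t[:, 0]    # top left singular vector of delta^T
# Equal up to sign (singular vectors are sign-ambiguous).
assert min(np.linalg.norm(v - u), np.linalg.norm(v + u)) < 1e-6
```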
