Not sure about the details of arrow routing #107

Open
Weifan1226 opened this issue Sep 1, 2024 · 4 comments

Comments

@Weifan1226

Hi Team!

I have recently been working on implementing the Arrow Routing algorithm in my project, but I'm facing a challenge due to my limited expertise, particularly in understanding the concept of "token in layer l" within the algorithm. My current understanding is that the hidden state after the attention layer serves as h_l. However, the output of a transformer layer typically has shape [batch_size, tokens, hidden_size], and I am uncertain how to proceed from this point.

Additionally, I am seeking clarity on the phrase "it routes differently in every layer and token, increasing the overall model expressivity." Does this imply a per-token routing mechanism? My interpretation is that the output of each transformer layer determines which LoRA adjustments are applied to that layer.

[Screenshot: 2024-09-01 23:25:38]

I would appreciate any guidance or insights you could provide to help me better understand these aspects of the Arrow Routing algorithm.

Thank you!

Best regards,
Fanjunduo Wei

@Weifan1226
Author

I think I now understand the "per token" concept; it works like an MoE router. But my new question is: if the Arrow router selects LoRA A based on token a and LoRA B based on token b, then after processing token a with base model + A, do I need to remove LoRA A's weights, load LoRA B into the base model, and then process token b?

@pclucas14
Contributor

Hi!

Thanks for your interest in our work. Let me try to clarify Arrow routing.
Just like an MoE router, each token at each layer is routed individually. One difference, however, is that in a typical MoE, tokens are routed to MLP / FFN experts, whereas our experts are simple linear layers.

For example, on the Mistral model, say that we train LoRAs on the gate_proj, up_proj and down_proj of the MLP layer. Then, for each of gate_proj, up_proj and down_proj we have a LoRA adapter, meaning that within a given MLP block each token gets routed 3 times, once for each linear layer. In a standard MoE, each token would be routed only once for the whole MLP block.

I am not sure I fully understood the last point regarding the LoRA A and B matrices. Whenever a token is routed to a given LoRA expert, that token is processed by both the A and B projections of that adapter.
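To make the per-token picture concrete, here is a minimal numpy sketch of routing for a single linear layer. This is illustrative only, not the actual mttl implementation: the softmax gate, temperature, and all names are assumptions. The key point it demonstrates is that every expert is applied in parallel and mixed per token, so no adapter is ever loaded or unloaded between tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def arrow_linear(H, W, As, Bs, protos, temperature=1.0):
    """Per-token Arrow-style routing for one linear layer (illustrative sketch).

    H:      [n_tokens, d_in]   token hidden states entering this layer
    W:      [d_out, d_in]      frozen base weight
    As:     list of [r, d_in]  LoRA down-projections, one per expert
    Bs:     list of [d_out, r] LoRA up-projections, one per expert
    protos: [n_experts, d_in]  top right singular vector of each B_e @ A_e
    """
    # Score each token against each expert prototype. The absolute value is
    # used because singular vectors are only defined up to sign.
    scores = np.abs(H @ protos.T)                  # [n_tokens, n_experts]
    probs = softmax(scores / temperature, axis=-1)

    out = H @ W.T                                  # frozen base path
    for e, (A, B) in enumerate(zip(As, Bs)):
        # Every expert processes every token; each token then weights each
        # expert's output by its own routing probability.
        out += probs[:, e:e + 1] * (H @ A.T @ B.T)
    return out, probs
```

Because the mixing happens in the forward pass, tokens a and b in the same batch can favour different experts simultaneously; nothing is merged into or removed from the base weights.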

Hopefully that clarifies a few things!
Lucas

@herkerser

Hi Team!

I have recently been working on implementing the Arrow Routing algorithm in my project. However, I'm facing a challenge due to my limited expertise, particularly in understanding the concept of the "first right singular vector" within the algorithm. The algorithm uses the V matrix, but when I went through the code I found that the U matrix is used to compute the top vector. Why is that?

@pclucas14
Contributor

Hi,

Thanks for reaching out. This is because the notation in the paper does not quite match our implementation: in the code we use output = (W + BA)x, while in the paper we use output = (W + AB)x. So if you transpose the matrices, the Us become the Vs and vice versa.
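This transpose relationship is easy to verify numerically. The sketch below (illustrative, not the mttl code) builds a LoRA delta under the code's (W + BA)x convention and checks that its top right singular vector equals, up to sign, the top left singular vector (a column of U) of the transposed delta, which matches the paper's (W + AB)x convention:

```python
import numpy as np

# Right singular vectors of M are left singular vectors of M^T:
# M = U S V^T  implies  M^T = V S U^T. So the paper's V corresponds
# to the code's U once the factorization is transposed.
rng = np.random.default_rng(0)
d, r = 16, 4
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection

delta = B @ A                          # LoRA delta under output = (W + BA)x
_, _, Vh = np.linalg.svd(delta)        # rows of Vh: right singular vectors
U_t, _, _ = np.linalg.svd(delta.T)     # columns of U_t: left singular vectors of delta^T

v = Vh[0]        # top right singular vector of delta
u = U_t[:, 0]    # top left singular vector of delta^T
# Equal up to sign (singular vectors are sign-ambiguous).
assert min(np.linalg.norm(v - u), np.linalg.norm(v + u)) < 1e-6
```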
