Not sure about the details of arrow routing #107
Comments
I think I understand the "per token" concept, much like an MoE router. But my new question is: if the Arrow router selects LoRA A for token a and LoRA B for token b, then after processing token a with base model + A, do I need to remove LoRA A's weights and load LoRA B into the base model before processing token b?
Hi! Thanks for your interest in our work. Let me try and clarify Arrow routing. For example, on the Mistral model, say that we train LoRAs on the …

I am not sure I fully understood the last point regarding LoRA As and Bs. Whenever a token is routed to a given LoRA expert, that token will be processed by both the `A` and `B` matrices of that expert.

Hopefully that clarifies a few things!
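To make the "no reloading" point concrete, here is a minimal sketch of per-token mixture-of-LoRAs routing. All names (`lora_moe_forward`, shapes, softmax mixing over all experts) are hypothetical illustrations, not the repo's actual API: every expert processes every token in one batched forward pass, and the routing weights simply mix the expert outputs, so no adapter is ever merged into or removed from the base weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lora_moe_forward(h, W_base, loras, proto):
    """Per-token routing over LoRA experts (illustrative sketch).

    h:      [batch, tokens, d_in]   hidden states
    W_base: [d_in, d_out]           frozen base weight
    loras:  list of (A, B) pairs, A: [d_in, r], B: [r, d_out]
    proto:  [n_experts, d_in]       one routing prototype per expert
    """
    base_out = h @ W_base                                  # [b, t, d_out]
    # Routing score per token: |<h, prototype>| for each expert.
    logits = np.abs(np.einsum("btd,ed->bte", h, proto))    # [b, t, E]
    weights = softmax(logits)                              # [b, t, E]
    # Every expert sees every token; no adapter swapping needed.
    expert_out = np.stack([(h @ A) @ B for A, B in loras], axis=-1)  # [b, t, d_out, E]
    return base_out + (expert_out * weights[..., None, :]).sum(-1)
```

In practice a top-k mask over the weights would keep only the k best experts per token, but the key design point is the same: routing changes the mixing coefficients, never the loaded weights.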
Hi, thanks for reaching out. This is because our notation in the paper does not quite match our implementation. In the code we use …
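Since the paper/code notation mismatch comes up here, a sketch of one common reading of Arrow prototype extraction may help. Assumptions (mine, not confirmed by the maintainers): the update acts as h → h @ A @ B with A: [d_in, r], B: [r, d_out], and the prototype is the input direction most amplified by the update, i.e. the top left singular vector of A @ B. Because the update has rank r, this SVD could also be computed cheaply from the factors; the full SVD below is just for clarity.

```python
import numpy as np

def arrow_prototype(A, B):
    """Routing prototype for one LoRA expert (one possible convention).

    With the update h -> h @ A @ B, the unit input direction h that
    maximizes ||h @ A @ B|| is the top left singular vector of A @ B.
    """
    U, S, Vt = np.linalg.svd(A @ B, full_matrices=False)
    return U[:, 0]                                 # [d_in], unit norm

def arrow_scores(h, prototypes):
    """|<h, v_i>| per token; the sign of a singular vector is
    arbitrary, hence the abs()."""
    P = np.stack(prototypes)                       # [E, d_in]
    return np.abs(np.einsum("btd,ed->bte", h, P))  # [b, t, E]
```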
Hi Team!
I have recently been working on implementing the Arrow Routing algorithm in my project. However, I'm facing a challenge due to my limited expertise, particularly in understanding the concept of "token in layer l" within the algorithm. My current understanding is that the hidden state after the attention layer serves as h_l. However, the output of a transformer layer is typically structured as [batch_size, tokens, hidden_size], and I am uncertain how to proceed from this point.
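On the shape question, one way to read "token in layer l" is that routing is computed independently for every position of the [batch_size, tokens, hidden_size] tensor: each token's hidden vector is scored against each expert's prototype, and each token picks its own experts. A hypothetical sketch (names and top-k selection are my assumptions, not the reference implementation):

```python
import numpy as np

def per_token_topk(h, prototypes, k=2):
    """Select experts independently for each token.

    h:          [batch, tokens, hidden]  hidden states at layer l
    prototypes: [n_experts, hidden]      this layer's routing vectors
    returns:    [batch, tokens, k]       expert indices per token
    """
    scores = np.abs(h @ prototypes.T)            # [b, t, E]
    # argsort descending, keep the k highest-scoring experts per token.
    return np.argsort(-scores, axis=-1)[..., :k]
```

Because each layer has its own LoRAs and hence its own prototypes, the same token can be routed to different experts in different layers, which is what "routes differently in every layer and token" refers to.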
Additionally, I am seeking clarity on the phrase "it routes differently in every layer and token, increasing the overall model expressivity." Does this imply a "per token" routing mechanism? My interpretation is that the output of each transformer layer determines the subsequent LoRA adjustments to be made to the transformer layer.
I would appreciate any guidance or insights you could provide to help me better understand these aspects of the Arrow Routing algorithm.
Thank you!
Best regards,
Fanjunduo Wei