PEER MoE #93

Merged: oleksost merged 32 commits into main from 1MoE on Aug 23, 2024

Conversation

@oleksost (Collaborator) commented Aug 16, 2024

Implementation of PEER MoE (Parameter Efficient Expert Retrieval): https://arxiv.org/pdf/2407.04153.

This runs with 10K experts and a small GPT-Neo/Phi-2 model on a 40 GB GPU.

This is the config:

        {
            "name": "train_moe",
            "type": "python",
            "request": "launch",
            "program": "/train_moe.py",
            "args": [
                "-c", "modular_llm/configs/models/phi-2_moe_post-hoc.json",
                "-k",
                "output_dir=/tmp/mttl_out_tmp/",
                "include_task_source=*",
                "model_modifier=peer",
                "modify_modules=.*mlp",
                "model=EleutherAI/gpt-neo-125m",
                "modify_layers=",
                "trainable_param_names=.*mlp.*",
                "finetune_task_name=dream_read_the_following_conversation_and_answer_the_question", 
                "dataset=sordonia/flan-10k-flat",
                "moe_num_experts=10000",
                "train_batch_size=1",
                "subsample_dev=1",
                "top_k=2",
                "eval_before_training=False",
                "router_selector=moe_pk_router"
            ],
            "console": "integratedTerminal",
            "justMyCode": false
        },
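
For reference, here is a minimal single-head sketch of the PEER mechanism the PR implements: product-key retrieval over a large pool of single-neuron experts, following the paper linked above. This is illustrative only, not the code in this PR; the class name, shapes, and initialization are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Illustrative PEER-style layer (not the mttl implementation).

    num_experts must be a perfect square: each expert is addressed by a pair
    of sub-keys (product keys), so retrieval only scores 2 * sqrt(N) keys.
    Each expert is a single hidden neuron: one down vector and one up vector.
    """

    def __init__(self, d_model, num_experts=10_000, top_k=2):
        super().__init__()
        assert d_model % 2 == 0
        self.n = int(num_experts ** 0.5)               # sub-key grid size
        assert self.n * self.n == num_experts
        self.top_k = top_k
        self.query = nn.Linear(d_model, d_model)
        half = d_model // 2
        self.keys1 = nn.Parameter(torch.randn(self.n, half) * 0.02)
        self.keys2 = nn.Parameter(torch.randn(self.n, half) * 0.02)
        # expert "down" and "up" vectors stored as embedding tables
        self.w_down = nn.Embedding(num_experts, d_model)
        self.w_up = nn.Embedding(num_experts, d_model)

    def forward(self, x):                              # x: (batch, seq, d_model)
        q1, q2 = self.query(x).chunk(2, dim=-1)
        s1, i1 = (q1 @ self.keys1.T).topk(self.top_k, dim=-1)   # scores over sub-keys
        s2, i2 = (q2 @ self.keys2.T).topk(self.top_k, dim=-1)
        # combine the two top-k lists into k*k candidate experts, then re-select top-k
        scores = (s1.unsqueeze(-1) + s2.unsqueeze(-2)).flatten(-2)
        idx = (i1.unsqueeze(-1) * self.n + i2.unsqueeze(-2)).flatten(-2)
        scores, pos = scores.topk(self.top_k, dim=-1)
        idx = idx.gather(-1, pos)
        gates = F.softmax(scores, dim=-1)
        # each selected expert: scalar activation, then a gated up-projection
        h = torch.einsum("...d,...kd->...k", x, self.w_down(idx))
        h = F.gelu(h) * gates
        return torch.einsum("...k,...kd->...d", h, self.w_up(idx))
```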

@@ -587,6 +592,17 @@ def as_expert(self):
"This method is not implemented for MultiExpertModel."
)

def add_empty_experts(self):

Member: this func should be in the MoE model, I think.

Collaborator (author): yes, true.

expert,
action=action,
is_default=is_default,
if len(expert_config.modify_layers) == 0:

Member: why?

Collaborator (author): if modify_layers == "", we do not look into the children of the module; we modify the module directly instead of its layers.

This is useful when we want to replace the full MLP module directly.
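
A hypothetical sketch of that dispatch (illustrative only, not the actual mttl code; the config fields and the make_modifier helper are assumptions):

```python
import re
import torch.nn as nn

def apply_modifier(model: nn.Module, config, make_modifier):
    # Hypothetical sketch: with an empty modify_layers pattern, the matched
    # module (e.g. the whole MLP block) is replaced directly; otherwise only
    # matching child layers of the matched module are replaced.
    for name, module in list(model.named_modules()):
        if not re.fullmatch(config.modify_modules, name):
            continue
        if not config.modify_layers:
            parent_name, _, attr = name.rpartition(".")
            parent = model.get_submodule(parent_name) if parent_name else model
            setattr(parent, attr, make_modifier(module))
        else:
            for child_name, child in list(module.named_children()):
                if re.fullmatch(config.modify_layers, child_name):
                    setattr(module, child_name, make_modifier(child))
```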

Contributor: (nitpicking) I would check if == "" instead of len == 0.

Contributor: or maybe check if modify_layers == None?

"""
PEER layer from Mixture of A Million Experts (https://arxiv.org/pdf/2407.04153)

Right not it assumes that it receives a module -- an MLP block, that has attributes fc1 and fc2.

Contributor: typo (not / now)

Collaborator (author): merci (thanks)!

from mttl.models.modifiers.modify_model import get_modifier_name

# diff architectures name those layers differently
down_names = ["fc1", "c_fc"]

Contributor: can we make these into arguments passed from the config?

Collaborator (author): addressed.

@@ -693,3 +680,22 @@ def training_step(self, batch, _):
for i, pg in enumerate(self.optimizers().optimizer.param_groups):
self.log(f"train/lr_{i}", pg["lr"])
return total_loss

def as_expert(self):

Contributor: how does as_expert work for multiple experts?

Collaborator (author): oops, this should not be here. Addressed.

from mttl.models.modifiers.lora import LoRA


@pytest.fixture

Member: isn't this fixture in another file already?

expert,
action=action,
is_default=is_default,
if expert_config.modify_layers == "":

Member: just a question: do we really need to keep both modify_layers and modify_modules here? Isn't there something we can do to say:

modify_layers=X,

where X is all the layers/modules that match, but not their children?

E.g. if X=k_proj, we modify 1.block.k_proj and we also modify 2.block.k_proj, because they have different parents;

but if X=block, then 1.block.k_proj and 1.block.q_proj are modified only once (we keep track of whether we modified the parent of the current layer).

Doesn't this encompass both modify_layers and modify_modules?
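
A hypothetical sketch of the unified semantics suggested above (not mttl code; the function, pattern handling, and make_modifier helper are illustrative assumptions):

```python
import re
import torch.nn as nn

def modify_matching(model: nn.Module, pattern: str, make_modifier):
    # Replace every module whose name matches `pattern`, unless one of its
    # ancestors was already replaced: X="block" replaces whole blocks once,
    # while X="k_proj" replaces each k_proj under every parent.
    modified = []  # full names of modules we have already replaced
    for name, module in list(model.named_modules()):
        if any(name == m or name.startswith(m + ".") for m in modified):
            continue  # the module itself or an ancestor was already handled
        if re.fullmatch(f".*{pattern}", name):
            parent_name, _, attr = name.rpartition(".")
            parent = model.get_submodule(parent_name) if parent_name else model
            setattr(parent, attr, make_modifier(module))
            modified.append(name)
```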

Collaborator (author): will remove the "modify_layers" argument in a separate PR.

Collaborator (author): addressed.

expert,
action=action,
is_default=is_default,
if expert_config.modify_layers == "":

Member: if not modify_layers?

oleksost changed the title from "[Draft] PEER MoE" to "PEER MoE" on Aug 20, 2024
oleksost requested a review from sordonia on Aug 20, 2024 at 18:24

from mttl.models.modifiers.modify_model import get_modifier_name

# diff architectures name those layers differently
DOWN_NAMES = ["fc1", "c_fc"]

Contributor: can we make this an argument that is part of the config?

Collaborator (author): addressed now.
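
For illustration, a hypothetical shape of the requested change; the class and field names below are assumptions, not the actual mttl config:

```python
from dataclasses import dataclass

@dataclass
class PEERModifierConfig:
    # The MLP projection names become config fields instead of the
    # hard-coded DOWN_NAMES constant.
    moe_num_experts: int = 10_000
    top_k: int = 2
    down_proj_layer: str = "fc1"  # e.g. "c_fc" for GPT-2-style blocks
    up_proj_layer: str = "fc2"    # e.g. "c_proj" for GPT-2-style blocks
```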

oleksost merged commit 0be1a67 into main on Aug 23, 2024
6 checks passed
oleksost deleted the 1MoE branch on Aug 23, 2024 at 20:02