Sparse masks #108
Conversation
Nice man, very happy with this PR
@@ -0,0 +1,200 @@
# several options to compare for block sparse operations:
small typo in the file name (sparsity instead of sparcity)
addressed
@@ -19,6 +20,26 @@
from mttl.utils import generate_random_string, rank_zero_only_and_wait, remote_login


def setup_profiler(args: ExpertConfig):
Can we put this in utils? @matheper I know you want single-use code to stay out of utils, but I think this could be useful somewhere else in the future.
nltk
Where is this used?
addressed
It's used by the Rouge evaluators (not an automatically installed dependency).
Implements sparse masks in 3 different ways:

- `SparseLinearModule` (spops kernels)
- `ScatteredSparseLinearModule` (uses `torch.scatter_add` to only update the sparse weights; see the sketch below)
- `MaskedLinear`

Also implements mask updates. Currently, only the SNIP updater is implemented; SPieL is in the pipeline.
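For illustration, here is a minimal sketch of the `torch.scatter_add` idea, not the PR's actual code (the class name and argument names are hypothetical): the dense base weight stays frozen and only the values at the masked positions receive gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScatteredSparseLinearSketch(nn.Module):
    # Hypothetical sketch: the dense base weight is frozen; only the values
    # at a fixed set of sparse positions are trainable.
    def __init__(self, base_weight: torch.Tensor, flat_indices: torch.Tensor):
        super().__init__()
        self.register_buffer("base_weight", base_weight)    # frozen dense weight
        self.register_buffer("flat_indices", flat_indices)  # flattened mask positions (int64)
        self.sparse_values = nn.Parameter(torch.zeros(flat_indices.numel()))

    def forward(self, x):
        # scatter the trainable values into a flat copy of the base weight,
        # so gradients flow only to self.sparse_values
        w = torch.scatter_add(
            self.base_weight.flatten(), 0, self.flat_indices, self.sparse_values
        ).view_as(self.base_weight)
        return F.linear(x, w)
```

Usage would look like `ScatteredSparseLinearSketch(linear.weight.detach(), mask.flatten().nonzero().squeeze(1))` for a boolean `mask` of the same shape as the weight.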
TODOs:
Currently, the manual profiler gives me this (for GPT-neo 125M with 0.5% sparsity; a sketch of this kind of measurement follows the numbers):
- `SparseLinearModule` (spops) with regular sparsity - Runtime: 0.066590s, Allocated Memory: 4552.14MB, Reserved Memory: 4645.19MB
- `SparseLinearModule` (spops) with block sparsity - Runtime: 0.067642s, Allocated Memory: 4553.58MB, Reserved Memory: 4645.19MB
- `ScatteredSparseLinearModule` with block sparsity - Runtime: 0.052826s, Allocated Memory: 4734.14MB, Reserved Memory: 4817.16MB
- `ScatteredSparseLinearModule` with regular sparsity - Runtime: 0.052953s, Allocated Memory: 4734.66MB, Reserved Memory: 4817.16MB
- `MaskedLinear` with regular sparsity - Runtime: 0.056629s, Allocated Memory: 4892.71MB, Reserved Memory: 4970.25MB
- `MaskedLinear` with block sparsity - Runtime: 0.055440s, Allocated Memory: 4889.36MB, Reserved Memory: 4978.64MB

So `ScatteredSparseLinearModule` is the fastest for now, but the spops `SparseLinearModule` uses the least memory.
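For reference, a rough sketch of how numbers like these can be collected (this helper is illustrative, not the PR's profiler; it assumes a CUDA module and input):

```python
import time
import torch


def profile_forward_backward(module, x, steps: int = 20):
    # Illustrative probe: average the runtime of a forward+backward pass
    # and report peak CUDA memory, in the format of the numbers above.
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        module(x).sum().backward()
    torch.cuda.synchronize()
    runtime = (time.perf_counter() - start) / steps
    print(
        f"Runtime: {runtime:.6f}s, "
        f"Allocated Memory: {torch.cuda.max_memory_allocated() / 2**20:.2f}MB, "
        f"Reserved Memory: {torch.cuda.max_memory_reserved() / 2**20:.2f}MB"
    )
```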
Profiled block-sparse multiplication: stk and Triton block-sparse kernels outperform naive `torch.matmul` (see `profile_block_sparcity.py`).
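To make the comparison concrete, here is a hedged sketch of a block-sparse mask at roughly this density; the stk / Triton calls themselves are omitted since their exact APIs aren't shown here, and a naive masked `torch.matmul` stands in as the baseline (the helper name and sizes are assumptions, not the PR's code):

```python
import torch


def block_sparse_mask(rows, cols, block=16, keep=0.005, device="cuda"):
    # Illustrative helper (not from the PR): keep whole block x block tiles
    # with probability `keep`, giving ~0.5% density as in the numbers above.
    grid = torch.rand(rows // block, cols // block, device=device) < keep
    return grid.repeat_interleave(block, 0).repeat_interleave(block, 1)


mask = block_sparse_mask(768, 768)  # 768 = GPT-neo 125M hidden size
w = torch.randn(768, 768, device="cuda")
x = torch.randn(32, 768, device="cuda")
# naive baseline: dense matmul against the masked weight; stk / Triton
# block-sparse kernels would compute only on the kept tiles instead
y = x @ (w * mask)
```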