
Add Sample Sparsification Method #250

Open · wants to merge 26 commits into main
Conversation

MonsterAzi
Contributor

This is a new sparsification method that I have been thinking about. The trimming and dropping methods resemble the Top-P and Typical-P samplers used when sampling from LLMs, but by far the most popular sampler is the temperature sampler.

This sparsification method samples the tensor itself to create its mask. It is FAR more computationally expensive, but it should, theoretically, outperform the other methods.
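Roughly, the idea looks like the sketch below. This is illustrative only, not the exact code in this PR; the `sample_mask` name, the softmax-over-magnitudes step, and the `temperature` parameter are assumptions made for the example.

```python
import torch


def sample_mask(tensor: torch.Tensor, density: float, temperature: float = 1.0) -> torch.Tensor:
    """Illustrative sketch: draw a sparsification mask from the tensor itself.

    Magnitudes are turned into keep-probabilities with a temperature-scaled
    softmax, rescaled so the expected fraction of kept entries is `density`,
    and a Bernoulli draw produces the final 0/1 mask.
    """
    if density >= 1.0:
        return torch.ones_like(tensor, dtype=torch.bool)

    # Temperature-scaled "logits" from the parameter magnitudes
    logits = tensor.abs().flatten() / max(temperature, 1e-8)
    probs = torch.softmax(logits, dim=0)

    # Rescale so the expected number of kept entries is density * numel,
    # clamping individual probabilities to 1
    probs = (probs * density * probs.numel()).clamp(max=1.0)

    return torch.bernoulli(probs).bool().reshape(tensor.shape)
```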

@MonsterAzi
Contributor Author

I have now realized that this isn't the only possible merge of this kind. This one is quite similar to dropping, but there is a trimming-like counterpart that reaches a similar result. I might try implementing that as well.

@MonsterAzi
Contributor Author

MonsterAzi commented Apr 9, 2024

Here are the empirical test results of Sampling vs DARE and magnitude:

| Method | MMLU | ARC | EQ-Bench | ARC (Easy) | Winogrande | Weighted Sum |
|---|---|---|---|---|---|---|
| TIES (R -N) | 52.66% | 52.47% | 48.44 | 79.71% | 76.09% | 59.41% |
| DARE TIES | 51.92% | 51.37% | 54.09 | 79.50% | 76.72% | 60.24% |
| | 51.91% | 50.94% | 48.54 | 77.95% | 76.56% | 58.75% |
| | 51.69% | 50.60% | 36.14 | 79.67% | 75.45% | 56.02% |
| | 52.30% | 51.11% | 44.68 | 79.42% | 76.56% | 58.26% |
| Sample TIES | 52.72% | 51.71% | 52.15 | 79.76% | 75.53% | 59.93% |
| | 52.36% | 52.13% | 47.98 | 79.71% | 75.45% | 59.04% |
| | 52.30% | 52.56% | 50.58 | 80.05% | 75.93% | 59.80% |
| | 52.54% | 52.82% | 52.30 | 80.64% | 75.37% | 60.25% |

Several runs are shown to illustrate the randomness.

As can be seen, sampling shows less variance than DARE and does not appear to have the "bad runs" that DARE does. It is also nondeterministic, unlike magnitude.

We cannot conclude from this small, low-sample-size run that sampling is better than magnitude or DARE, but it does show promise and already has some good properties (notably the low variance compared to DARE).

@MonsterAzi MonsterAzi marked this pull request as ready for review April 13, 2024 03:16
@MonsterAzi
Contributor Author

Added another method as well as a parameter that affects both of them.

This new method is based on TopK, but instead of just restricting to a fixed percentage, we use the ranks of the tensor's entries to build a Bernoulli distribution (just like the sampling method). This should behave more like TopK and have even lower variance than sampling.
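As a rough sketch of what I mean (illustrative only, not necessarily the code in this PR; the `rank_sample_mask` name and the linear rank-to-probability mapping are assumptions):

```python
import torch


def rank_sample_mask(tensor: torch.Tensor, density: float) -> torch.Tensor:
    """Illustrative sketch: keep-probabilities from magnitude ranks.

    Each entry's probability of being kept depends on its rank among the
    magnitudes (largest magnitude -> highest probability), rescaled so the
    expected density matches `density`; a Bernoulli draw then builds the mask.
    """
    flat = tensor.abs().flatten()
    n = flat.numel()

    # rank 1 = smallest magnitude, rank n = largest
    ranks = torch.empty(n, device=tensor.device)
    ranks[flat.argsort()] = torch.arange(1, n + 1, dtype=ranks.dtype, device=tensor.device)

    # Linear rank -> probability mapping, rescaled to the target density
    probs = ranks / n
    probs = (probs * density * n / probs.sum()).clamp(max=1.0)

    return torch.bernoulli(probs).bool().reshape(tensor.shape)
```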

This new parameter is more experimental. It simply skips the Bernoulli step, producing a spread-out distribution of weights rather than a mask full of zeros. Since the value is still concentrated in part of the tensor, as with sparsification, this should still reduce conflicts. And because far more of the original values are maintained, this option should be more robust, allowing lower density values and more stable iterative merges. (Smooth selection is also deterministic, which may be preferred.)
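A minimal sketch of what skipping the Bernoulli step could look like (again illustrative; `smooth_rank_weights` is a made-up name): the rank-derived probabilities are returned directly as deterministic soft weights instead of a hard 0/1 mask.

```python
import torch


def smooth_rank_weights(tensor: torch.Tensor, density: float) -> torch.Tensor:
    """Illustrative sketch of the "smooth" option: the same rank-derived
    probabilities as above, but used directly as soft weights rather than
    being fed through a Bernoulli draw, so the result is deterministic and
    contains no hard zeros."""
    flat = tensor.abs().flatten()
    n = flat.numel()

    ranks = torch.empty(n, device=tensor.device)
    ranks[flat.argsort()] = torch.arange(1, n + 1, dtype=ranks.dtype, device=tensor.device)

    probs = ranks / n
    probs = (probs * density * n / probs.sum()).clamp(max=1.0)

    # No torch.bernoulli here: the weights themselves serve as the "mask"
    return probs.reshape(tensor.shape)
```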

Here are updated results:
[image: updated benchmark results]

@cg123
Collaborator

cg123 commented Apr 14, 2024

Looks like this breaks the sparsification unit tests - could you update them to pass in the new arguments (or give them default values)?

@MonsterAzi
Contributor Author

Oh, I think I know what the problem is. Sorry, I haven't really looked at the sparsification unit tests, so I'll try to fix them.
