Idea: Potential speed-up of awesome_cosine_similarity when comparing certain lists to themselves #65

Open
trichie opened this issue Jun 24, 2022 · 0 comments

trichie commented Jun 24, 2022

Another practical use case for the algorithm: you have a huge list of customer names and addresses and want to find out whether some customers accidentally got assigned more than one customer id over the years. From my practical experience with exactly such a dataset, a single-digit percentage of the customer numbers may indeed be potential duplicates, but for each individual customer one can safely assume that there are no more than 3 or 4 different numbers.

In those situations one doesn't have to search for potential matches between each element of list A and each element of a different list B, but only among the elements of list A itself. Because the similarity matrix A x A^T is symmetric and every element trivially matches itself, one could imho save roughly 50% of the scalar product computations by calculating only the upper or lower triangle of A x A^T and ignoring the main diagonal.
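To make the triangle idea concrete, here is a minimal pure-Python sketch. It is not how the library works internally; `normalize` and `self_matches` are hypothetical helper names, and dense lists stand in for the sparse matrices one would use in practice. It only visits index pairs with i < j, so the main diagonal and the mirrored half are never computed.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def self_matches(vectors, threshold):
    """Return (i, j, similarity) for i < j only, i.e. the strict upper triangle."""
    unit = [normalize(v) for v in vectors]
    matches = []
    # combinations() yields each unordered pair once: no i == j, no (j, i) repeats.
    for i, j in combinations(range(len(unit)), 2):
        sim = sum(a * b for a, b in zip(unit[i], unit[j]))
        if sim >= threshold:
            matches.append((i, j, sim))
    return matches


# Toy data (assumed for illustration): rows 0 and 1 point in the same direction.
vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = self_matches(vectors, threshold=0.9)

n = len(vectors)
full_products = n * n                 # every row against every row
triangle_products = n * (n - 1) // 2  # strict upper triangle only
```

For n rows the full product needs n^2 scalar products and the strict triangle needs n(n-1)/2, which approaches half of n^2 as n grows. Whether that translates into a 50% wall-clock saving in a sparse, top-n implementation is a separate question.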

Maybe you consider this too similar to issue #24, in which case you can just delete it or flag it accordingly. Imho it isn't exactly the same: at least for the above practical use case, one doesn't necessarily need to reconstruct the not-calculated lower triangle from the calculated upper triangle (or vice versa) if one uses a sufficiently high max_fits, which in the end might just turn this into a trade-off between using more memory and being faster.
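As a rough illustration of why the mirrored triangle may not be needed for the duplicate-customer use case: if every match is reported once as a pair (i, j) with i < j, the groups of records that belong together can be recovered from those pairs alone. The sketch below uses a small union-find; `group_duplicates` and the example pairs are hypothetical and not part of the library, and it assumes the top-n limit was high enough that no matching pair was cut off.

```python
def group_duplicates(n_records, upper_pairs):
    """Group record indices connected by matches given only as (i, j) with i < j."""
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in upper_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root so group order is stable.
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for record in range(n_records):
        groups.setdefault(find(record), []).append(record)
    # Only groups with more than one member are potential duplicates.
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical matches from the upper triangle only.
upper_pairs = [(0, 3), (1, 4), (3, 5)]
duplicate_groups = group_duplicates(6, upper_pairs)
```

Here records 0, 3 and 5 end up in one group and records 1 and 4 in another, without the pairs (3, 0), (4, 1) or (5, 3) ever being materialized.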

@trichie trichie changed the title Idea: Speed-up of awesome_cosine_similarity by hopefully approx. 50% when comparing a list to itself Idea: Potential speed-up of awesome_cosine_similarity when comparing certain lists to themselves Jun 24, 2022