Task: optimise Bloom filter lookup using memory prefetching #208

Open
dcoutts opened this issue May 9, 2024 · 0 comments

dcoutts (Collaborator) commented May 9, 2024
In one experiment, we obtained about a 30% speedup (on a benchmark of larger-than-cache Bloom filters) from using memory prefetching primitives.

This task is to do a production version, rather than just an experiment. This should include functional tests and should extend the existing macro-benchmark for the Bloom filter to measure any speedup.

In Database.LSMTree.Internal.Lookup there is a bloomQueriesDefault (which uses bloomQueries); this is the simple, non-batched version of what we want:

  • given a vector of Bloom filters, and
  • given a vector of keys,
  • each key is looked up in each filter (so N * M lookups), and
  • the result is the set of lookups where the key is (probably) present.
  • The result is represented by a vector of pairs of the filter index and key index into the input vectors. This is a sparse representation of the N * M bit array of results.

We use a sparse representation based on the assumption (for LSMs) that most keys are not present in most filters. The result vector size is not known in advance for certain, but in most cases we expect it to be proportional to the number of keys. We think that for typical use cases 2*N (for N keys) would be enough to avoid having to grow (and copy) the array. For correctness, however, the array needs to be able to grow.
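For reference, the intended semantics is roughly the following sketch. The names here (Bloom, FilterIx, KeyIx, bloomQueriesSpec) are illustrative stand-ins, not the real types and functions from Database.LSMTree.Internal.Lookup:

```haskell
import qualified Data.Vector         as V
import qualified Data.Vector.Unboxed as VU

-- Illustrative stand-in for the real Bloom filter type: all we need here
-- is a (probabilistic) membership test.
newtype Bloom k = Bloom { bloomElem :: k -> Bool }

type FilterIx = Int
type KeyIx    = Int

-- Reference semantics of the batched lookup: every key is tested against
-- every filter (N * M probes), and only the (probable) hits are kept, as a
-- sparse vector of (filter index, key index) pairs into the input vectors.
bloomQueriesSpec :: V.Vector (Bloom k) -> V.Vector k -> VU.Vector (FilterIx, KeyIx)
bloomQueriesSpec filters keys =
  VU.fromList
    [ (fi, ki)
    | fi <- [0 .. V.length filters - 1]
    , ki <- [0 .. V.length keys    - 1]
    , bloomElem (filters V.! fi) (keys V.! ki)
    ]
```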

The initial experiment using prefetching is in the branch dcoutts/bloomfilter-optimisation-hacking. It adds optimised versions of bulk lookup and extends the macro-benchmark. This experimental version uses a single key with a sequence of filters, rather than a sequence of keys with a sequence of filters (as in the current task).

The fastest version in the experiment (the one with the 30% speedup) works by iterating over the hash number in the outer loop and over the (remaining) keys in the inner loop. This version also uses a dense rather than sparse representation of the result. As a setup phase it calculates the probe index for hash 0 of each key and prefetches that location. On each iteration of the main loop it works with two arrays: it reads each entry of the current array of probe indexes, and builds the array of probe indexes for the next iteration.
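Concretely, "prefetching the location" can be done with GHC's prefetch primops (prefetchByteArray0# .. prefetchByteArray3#, which differ only in the temporal-locality hint). A thin IO wrapper along these lines is enough to issue a prefetch for the cache line holding a given byte offset of a filter's bit array; this is a sketch, not necessarily what the experiment branch does:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import Data.Primitive.ByteArray (ByteArray (..))
import GHC.Exts (Int (I#), prefetchByteArray0#)
import GHC.IO (IO (IO))

-- Ask the CPU to start loading the cache line that contains the given
-- byte offset of the array.  The '0' variant hints no temporal reuse;
-- the 1..3 variants keep the line in progressively more levels of cache.
-- Which hint is best here is a question for the macro-benchmark.
prefetchByteArray :: ByteArray -> Int -> IO ()
prefetchByteArray (ByteArray ba#) (I# off#) =
  IO (\s -> case prefetchByteArray0# ba# off# s of s' -> (# s', () #))
```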

A version that works with many keys and many filters could work in a similar way, with the hash number as the outer loop. The challenge is that for larger lookup batch sizes (e.g. 100+) it will run out of L1 cache, because the "distance" between the memory prefetch and the memory read becomes too large, e.g. 100 keys * 20 filters = 2,000 items "in flight" in the CPU caches. It would be better to find an algorithm whose "prefetch distance" is independent of the number of keys and can be tuned to typical CPUs, e.g. on the order of 10–100 items in flight.
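One shape such an algorithm could take is to flatten the batch into a stream of probes and keep a fixed distance d between issuing a prefetch and doing the read. This is only a sketch: Probe and probeAllWithDistance are hypothetical names, it reuses the prefetchByteArray wrapper sketched above, and it ignores how the hash-outer-loop structure and the sparse result are produced:

```haskell
import Control.Monad (forM_, when)
import qualified Data.Vector as V
import Data.Primitive.ByteArray (ByteArray)

-- Hypothetical probe descriptor: which filter's bit array to read and at
-- which byte offset.  The real code would also carry the hash so that the
-- right bit can be tested once the word has been read.
data Probe = Probe { probeArray  :: !ByteArray
                   , probeOffset :: !Int
                   }

-- Walk a batch of probes keeping a fixed prefetch distance d: just before
-- reading probe i, issue the prefetch for probe (i + d), so at most d
-- cache lines are in flight at any time, independent of the batch size.
-- d would be a tuning constant (on the order of 10-100), not a function
-- of the number of keys or filters.
probeAllWithDistance :: Int -> V.Vector Probe -> (Probe -> IO ()) -> IO ()
probeAllWithDistance d probes readAndTest = do
  let n = V.length probes
  -- Warm-up: prefetch the first d probes before reading any of them.
  forM_ [0 .. min d n - 1] $ \i ->
    let Probe ba off = probes V.! i in prefetchByteArray ba off
  forM_ [0 .. n - 1] $ \i -> do
    when (i + d < n) $
      let Probe ba off = probes V.! (i + d) in prefetchByteArray ba off
    readAndTest (probes V.! i)
```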

phadej self-assigned this Jun 4, 2024