Support limits on records loaded from Lucene index #10298

Open
wants to merge 2 commits into
base: 3.2.x

timw (Contributor) commented Sep 3, 2024

What does this PR do?

Allows the records (rids) retrieved from the Lucene search to be limited, where it is known that the remainder of the query does not require the entire set to be loaded. This is useful when the underlying Lucene query returns many results, but the query overall is only intended to return a small number of them (and in the ranked order from Lucene).

This mode is opt-in: provide a "limit" metadata element to the Lucene search function. A value of "select" uses the skip/limit of the SELECT statement to determine the max hits, and an integer value specifies an explicit max hits (e.g. as a safety margin where subsequent query filter/order operations are desired).
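
To make the two modes concrete, here is a minimal, self-contained sketch of how the max hits might be resolved from the metadata. It is not the code in this PR: the class name `LuceneLimitSketch`, the method `resolveMaxHits`, the `UNLIMITED` sentinel, and the choice to fall back to loading everything when the SELECT has no LIMIT are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical sketch: none of these names come from the PR or from OrientDB. */
final class LuceneLimitSketch {

    /** Sentinel meaning "load every hit", i.e. the behaviour without the opt-in. */
    static final int UNLIMITED = Integer.MAX_VALUE;

    /**
     * Resolves how many hits to request from Lucene.
     *
     * @param metadata metadata passed to the search function (may lack "limit")
     * @param skip     SKIP of the enclosing SELECT, or empty if absent
     * @param limit    LIMIT of the enclosing SELECT, or empty if absent
     */
    static int resolveMaxHits(Map<String, Object> metadata, OptionalInt skip, OptionalInt limit) {
        Object value = metadata.get("limit");
        if (value == null) {
            return UNLIMITED; // not opted in: keep loading the full result set
        }
        if ("select".equals(value)) {
            if (!limit.isPresent()) {
                return UNLIMITED; // assumption: no LIMIT means nothing to derive a bound from
            }
            // Skipped rows must still be fetched, so the bound is skip + limit.
            long needed = (long) skip.orElse(0) + limit.getAsInt();
            return (int) Math.min(needed, (long) UNLIMITED);
        }
        if (value instanceof Number) {
            int explicit = ((Number) value).intValue();
            if (explicit <= 0) {
                throw new IllegalArgumentException("limit must be positive: " + explicit);
            }
            return explicit; // explicit cap, e.g. a safety margin above the SELECT limit
        }
        throw new IllegalArgumentException("Unsupported limit value: " + value);
    }

    public static void main(String[] args) {
        Map<String, Object> select = Collections.<String, Object>singletonMap("limit", "select");
        Map<String, Object> explicit = Collections.<String, Object>singletonMap("limit", 500);

        // SKIP 20 LIMIT 100 with "select": 120 hits are enough.
        System.out.println(resolveMaxHits(select, OptionalInt.of(20), OptionalInt.of(100)));
        // Explicit cap of 500, independent of the SELECT's own skip/limit.
        System.out.println(resolveMaxHits(explicit, OptionalInt.empty(), OptionalInt.of(100)));
    }
}
```

With "select", the bound has to cover the skipped rows as well as the returned ones, which is why the sketch adds the two values. An explicit integer is the safer choice when later filtering or ordering in the query may discard some of the Lucene hits.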

Motivation

All of our Lucene index queries apply every filtering criterion inside the Lucene query itself, and in some pathological scenarios those criteria are ranking-only (non-mandatory) and very general.
In the worst case this resulted in millions of RIDs being loaded from the Lucene index, when we would only want the top 100.
This causes high memory pressure (and often out of memory errors), with some of the RID arrays loaded being 800MB.

Related issues

Neo4j has a similar capability: https://community.neo4j.com/t/full-text-search-skip-and-limit/58773

Additional Notes

Checklist
[x] I have run the build using mvn clean package command
[x] My unit tests cover both failure and success scenarios

Allows the records (rids) retrieved from the Lucene search to be limited, where it is known that the remainder of the query does not require the entire set to be loaded.
This is useful when the underlying Lucene query returns many results, but the query overall is only intended to return a small number of them (usually in the ranked order from Lucene).
This mode is opt-in, by providing a "limit" metadata element to the Lucene search function. A value of "select" uses the skip/limit in the SELECT statement to determine the max hits, and an integral value specifies an explicit max hits (e.g. for a safety margin).

tglman (Member) commented Sep 4, 2024

Hi,

This looks like a nice new feature; I will double-check it later. We do not add new features in patch releases, so it would be more appropriate to target the develop branch. I just need to read it more closely to see whether there are similar use cases around and whether the pattern used here matches the other cases, to keep the usage consistent.

Regards
