Support limits on records loaded from Lucene index #10298

timw · 2024-09-03T22:54:01Z

What does this PR do?

Allows the records (RIDs) retrieved from the Lucene search to be limited when it is known that the remainder of the query does not require the entire set to be loaded. This is useful when the underlying Lucene query returns many results, but the query overall is only intended to return a small number of them (and in the ranked order from Lucene).

This mode is opt-in, enabled by providing a "limit" metadata element to the Lucene search function. A value of "select" uses the skip/limit in the SELECT statement to determine the max hits, and an integral value specifies an explicit max hits (e.g. for a safety margin where subsequent query filter/order operations are desired).
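To make the two forms of the "limit" value concrete, here is a minimal sketch of how a max-hits figure could be derived from the metadata. It is not the code in this PR: the class `LuceneLimitSketch`, the method `resolveMaxHits`, and its parameters are hypothetical names, and two behaviours are assumptions rather than something stated above: that "select" resolves to skip + limit, and that a negative SELECT limit means "no LIMIT clause".

```java
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical illustration only; names and rules here are assumptions, not the PR's API.
final class LuceneLimitSketch {

  /**
   * Resolves the maximum number of hits to load from the index.
   * An empty result means "no cap", i.e. the existing load-everything behaviour.
   */
  static OptionalInt resolveMaxHits(Map<String, ?> metadata, int selectSkip, int selectLimit) {
    Object limit = metadata.get("limit");
    if (limit == null) {
      // The mode is opt-in: without a "limit" element nothing changes.
      return OptionalInt.empty();
    }
    if ("select".equals(limit)) {
      if (selectLimit < 0) {
        // Assumption: a negative limit means the SELECT has no LIMIT clause.
        return OptionalInt.empty();
      }
      // Assumption: skipped rows must still be loaded, so the cap is skip + limit.
      long total = (long) Math.max(selectSkip, 0) + selectLimit;
      return OptionalInt.of((int) Math.min(total, Integer.MAX_VALUE));
    }
    if (limit instanceof Number) {
      int explicit = ((Number) limit).intValue();
      if (explicit <= 0) {
        throw new IllegalArgumentException("limit must be positive: " + explicit);
      }
      // Explicit cap, e.g. a safety margin ahead of later filter/order steps.
      return OptionalInt.of(explicit);
    }
    throw new IllegalArgumentException("Unsupported limit value: " + limit);
  }
}
```

Under these assumptions, a query with SKIP 20 LIMIT 100 and a "limit" of "select" would load at most 120 hits, while a "limit" of 500 would load at most 500 regardless of the SELECT clause.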

Motivation

All of our Lucene index queries apply their full filtering criteria inside the Lucene query itself, and we have some pathological scenarios where those criteria are ranking-only (non-mandatory) and very general.
In the worst case this resulted in millions of RIDs being loaded from the Lucene index when we only wanted the top 100.
This causes high memory pressure (and often out-of-memory errors), with some of the loaded RID arrays reaching 800 MB.

Related issues

Neo4j has a similar capability: https://community.neo4j.com/t/full-text-search-skip-and-limit/58773

Additional Notes

Checklist
[x] I have run the build using mvn clean package command
[x] My unit tests cover both failure and success scenarios

Allows the records (RIDs) retrieved from the Lucene search to be limited when it is known that the remainder of the query does not require the entire set to be loaded. This is useful when the underlying Lucene query returns many results, but the query overall is only intended to return a small number of them (usually in the ranked order from Lucene). This mode is opt-in, enabled by providing a "limit" metadata element to the Lucene search function. A value of "select" uses the skip/limit in the SELECT statement to determine the max hits, and an integral value specifies an explicit max hits (e.g. for a safety margin).

tglman · 2024-09-04T11:53:46Z

Hi,

This looks like a nice new feature; I will double-check it later. We do not add new features in patch releases, so it would be more appropriate to target this at the develop branch. I just need to read it more closely to see whether there are similar use cases elsewhere and whether the pattern used here matches them, to keep the usage consistent.

Regards

timw force-pushed the lucene_limit branch from 6c2b025 to def5d60 · September 3, 2024 23:12

Avoid NPE if command context null for Lucene query.

6f32372
