refactor(storage): improve inverted index read fst file first to reduce load index #16385
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
This PR introduces a new search function to replace the `tantivy` index searcher. An inverted index query now follows the process below:

1. Read the `fst` (Finite State Transducer) first and check whether the term in the query matches; return if it does not match.
2. Read the `term dict` to get the `postings_range` in `idx` and the `positions_range` in `pos` for each term.
3. Read the `doc_ids` and `term_freqs` in `idx` for each term using `postings_range`.
4. Read the `position` of each term in `pos` using `positions_range`.

This gives two benefits:

- If the term does not match, only the `fst` file needs to be read. Since most blocks fall into this case, and the `fst` is usually only one-tenth the size of the entire index data, this can greatly speed up queries.
- If the term matches, only the `idx` and `pos` data of the related terms need to be read instead of all the `idx` and `pos` data. This data is small enough to be cached entirely in memory, which speeds up subsequent queries.

This PR mainly optimizes the query performance of the inverted index for the String type. Score calculation and searching the JSON type are not implemented in this PR, so the relevant tests have been temporarily modified. Those functions will be implemented in follow-up PRs.
The inverted index data is stored in a new file format that splits the data by column, so that only the required fields need to be read. The schema information is also stored in the footer for future expansion. Data written in the previous format can still be read and continues to work.
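To make the idea of a column-split file with a schema footer concrete, here is a minimal sketch. It does not reproduce the actual on-disk format of this PR: the JSON footer, the 4-byte length trailer, and the function names `write_index`, `read_footer`, and `read_column` are all assumptions chosen for illustration.

```python
import io
import json
import struct


def write_index(columns, schema):
    """Write columns back to back, then a JSON footer, then the footer length."""
    buf = io.BytesIO()
    offsets = {}
    for name, payload in columns.items():
        offsets[name] = (buf.tell(), len(payload))  # where this column lives
        buf.write(payload)
    footer = json.dumps({"schema": schema, "offsets": offsets}).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # fixed-size trailer
    return buf.getvalue()


def read_footer(data):
    """Locate the footer via the 4-byte trailer and decode it."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    return json.loads(data[-4 - footer_len:-4])


def read_column(data, name):
    """Return one column's bytes without touching the other columns."""
    offset, length = read_footer(data)["offsets"][name]
    return data[offset:offset + length]


blob = write_index(
    {"fst": b"FST", "term": b"TERMDICT", "idx": b"POSTINGS", "pos": b"POS"},
    {"fields": ["name", "body"]},
)
fst_bytes = read_column(blob, "fst")
```

With a layout like this, a reader that only needs the `fst` column reads the trailer, the footer, and that one byte range. Against object storage the same offsets would presumably translate into ranged reads.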
I created a table `pmc100` in my local environment and ran some SQL queries against the old version and the new version. The execution time is greatly reduced, by a factor of roughly 2 to 28 depending on the query:
0.535 sec -> 0.142 sec 'name:Crystallogr'
1.049 sec -> 0.038 sec 'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"'
0.374 sec -> 0.170 sec 'body:Benzaldehydehydrazone'
0.327 sec -> 0.047 sec 'body:Benzaldehydehydrazone Hadjoudis'
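The speedup per query can be derived directly from the timings above. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# (old seconds, new seconds) per query, copied from the measurements above
timings = {
    "name:Crystallogr": (0.535, 0.142),
    'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"': (1.049, 0.038),
    "body:Benzaldehydehydrazone": (0.374, 0.170),
    "body:Benzaldehydehydrazone Hadjoudis": (0.327, 0.047),
}

# Speedup factor = old time / new time, rounded to one decimal place
speedups = {query: round(old / new, 1) for query, (old, new) in timings.items()}

for query, factor in speedups.items():
    print(f"{factor:>5}x  {query}")
```

The largest gain is on the long `name` term and the smallest on the single-term `body` query.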
fixes: #[Link the issue here]
Tests
Type of change