refactor(storage): improve inverted index read fst file first to reduce load index #16385

Merged on Sep 18, 2024 (4 commits)


b41sh
Member

@b41sh b41sh commented Sep 4, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces a new search function that replaces the tantivy index searcher. An inverted index query now proceeds as follows:

  1. Read the fst (Finite State Transducer) first and check whether the terms in the query match; return immediately if they do not.
  2. Read the term dict to get the postings_range in idx and the positions_range in pos for each term.
  3. Read the doc_ids and term_freqs in idx for each term using postings_range.
  4. If it is a phrase query, read the positions of each term in pos using positions_range.
  5. Collect the matched doc ids using the term-related information.

If the term does not match, only the fst file needs to be read. Because this is the case for most blocks, and the fst is usually only about one-tenth the size of the entire index data, this can greatly speed up queries.

If the term matches, only the idx and pos data of the related terms need to be read, rather than all of the idx and pos data. This data is small enough to be cached entirely in memory, which speeds up subsequent queries.
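The read order described above can be sketched roughly as follows. This is only a toy in-memory model, not the code in this PR: `BlockIndex`, `TermInfo`, `search` and `make_index` are hypothetical names, a plain dict stands in for the fst, and the `reads` list only records which parts of the index a query had to touch.

```python
from dataclasses import dataclass, field


@dataclass
class TermInfo:
    postings_range: tuple   # (start, end) slice into idx
    positions_range: tuple  # (start, end) slice into pos


@dataclass
class BlockIndex:
    fst: dict        # term -> term ordinal (a dict stands in for the real fst)
    term_dict: list  # term ordinal -> TermInfo
    idx: list        # flat list of (doc_id, term_freq) postings
    pos: list        # flat list of token positions
    reads: list = field(default_factory=list)  # which parts were read


def search(index, terms):
    """Return sorted doc ids matching a single term or a phrase of terms."""
    # Step 1: read only the fst; stop early if any term is missing.
    index.reads.append("fst")
    ordinals = [index.fst.get(t) for t in terms]
    if any(o is None for o in ordinals):
        return []
    # Step 2: read the term dict to get the ranges for each term.
    index.reads.append("term")
    infos = [index.term_dict[o] for o in ordinals]
    # Step 3: read doc ids and term frequencies from idx.
    index.reads.append("idx")
    postings = [index.idx[i.postings_range[0]:i.postings_range[1]] for i in infos]
    candidates = set.intersection(*({doc for doc, _ in p} for p in postings))
    if len(terms) == 1:
        return sorted(candidates)
    # Step 4: phrase query, so read the positions of each term from pos.
    index.reads.append("pos")
    positions = []
    for info, plist in zip(infos, postings):
        cursor = info.positions_range[0]
        per_doc = {}
        for doc_id, term_freq in plist:
            per_doc[doc_id] = set(index.pos[cursor:cursor + term_freq])
            cursor += term_freq
        positions.append(per_doc)
    # Step 5: keep docs where the terms appear at consecutive positions.
    return sorted(
        doc for doc in candidates
        if any(
            all(p + k in positions[k][doc] for k in range(1, len(terms)))
            for p in positions[0][doc]
        )
    )


def make_index():
    # doc 0: "acta crystallogr", doc 1: "crystallogr acta", doc 2: "crystallogr"
    return BlockIndex(
        fst={"acta": 0, "crystallogr": 1},
        term_dict=[TermInfo((0, 2), (0, 2)), TermInfo((2, 5), (2, 5))],
        idx=[(0, 1), (1, 1), (0, 1), (1, 1), (2, 1)],
        pos=[0, 1, 1, 0, 0],
    )
```

With this model, a query for a term that is not in the block records only `["fst"]` in `reads`, while a phrase query records all four parts. That is the asymmetry the optimization relies on.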

This PR mainly optimizes inverted index query performance for the String type. Score calculation and search on the JSON type are not implemented in this PR, so the relevant tests have been temporarily modified; those functions will be implemented in follow-up PRs.

The inverted index data is stored in a new file format that splits the data by column, so that only the required fields need to be read. The schema information is also stored in the footer for future expansion. The previous data format can still be read, so existing index data continues to work.

┌─────┐ ┌──────┐ ┌─────┐ ┌─────┐ ┌───────────┐ ┌────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐
│ fst │ │ term │ │ idx │ │ pos │ │ fieldnorm │ │ schema │ │ offsets │ │ schema_len │ │ meta_len │
└─────┘ └──────┘ └─────┘ └─────┘ └───────────┘ └────────┘ └─────────┘ └────────────┘ └──────────┘
 \                                          /   \                                              /
  \                    ___________________ /     \             _______________________________/
   \                  /                           \           /
   index column data                               footer meta
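To illustrate how a reader could use such a footer to fetch a single column, here is a minimal sketch. The byte-level encoding is an assumption made for illustration and is not taken from this PR: offsets are assumed to be little-endian u32 end offsets (one per column), schema_len and meta_len are assumed to be little-endian u32 values, and meta_len is assumed to cover the whole footer. `write_index`, `read_footer` and `read_column` are hypothetical helpers.

```python
import struct

# Column order as shown in the layout diagram.
COLUMNS = ["fst", "term", "idx", "pos", "fieldnorm"]


def write_index(columns, schema):
    """Concatenate the column data, then append the footer meta."""
    body = b""
    ends = []
    for name in COLUMNS:
        body += columns[name]
        ends.append(len(body))  # end offset of each column
    offsets = struct.pack(f"<{len(ends)}I", *ends)
    # Assumption: meta_len counts schema + offsets + the two length fields.
    meta_len = len(schema) + len(offsets) + 8
    return body + schema + offsets + struct.pack("<II", len(schema), meta_len)


def read_footer(data):
    """Parse the footer and return (schema, {column: (start, end)})."""
    schema_len, meta_len = struct.unpack("<II", data[-8:])
    meta = data[len(data) - meta_len:len(data) - 8]
    schema = meta[:schema_len]
    offsets = meta[schema_len:]
    ends = struct.unpack(f"<{len(offsets) // 4}I", offsets)
    ranges = {}
    start = 0
    for name, end in zip(COLUMNS, ends):
        ranges[name] = (start, end)
        start = end
    return schema, ranges


def read_column(data, ranges, name):
    """Read one column only, e.g. just the fst for the early-exit check."""
    start, end = ranges[name]
    return data[start:end]
```

In a real object store the footer and each column would be fetched with separate range reads; here everything is sliced from one bytes object to keep the sketch short.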

I created a table pmc100 in my local environment and ran some SQL queries to test.

CREATE TABLE pmc100 (
  name VARCHAR NULL,
  journal VARCHAR NULL,
  date VARCHAR NULL,
  volume VARCHAR NULL,
  issue VARCHAR NULL,
  accession VARCHAR NULL,
  timestamp TIMESTAMP NULL,
  pmid VARCHAR NULL,
  body VARCHAR NULL
);

CREATE INVERTED INDEX idx1 on pmc100(name, body);

COPY INTO pmc100 FROM 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = NDJSON);

MySQL [(none)]> select count(*) from pmc100;
+----------+
| COUNT(*) |
+----------+
|   574199 |
+----------+
1 row in set (0.026 sec)

old version

MySQL [(none)]> select count(*) from pmc100 where query('name:Crystallogr');
+----------+
| COUNT(*) |
+----------+
|    25135 |
+----------+
1 row in set (0.535 sec)

MySQL [(none)]> select count(*) from pmc100 where query('name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"');
+----------+
| COUNT(*) |
+----------+
|       93 |
+----------+
1 row in set (1.049 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone');
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+
1 row in set (0.374 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone Hadjoudis');
+----------+
| COUNT(*) |
+----------+
|       82 |
+----------+
1 row in set (0.327 sec)

new version

MySQL [(none)]> select count(*) from pmc100 where query('name:Crystallogr');
+----------+
| COUNT(*) |
+----------+
|     3240 |
+----------+
1 row in set (0.142 sec)

MySQL [(none)]> select count(*) from pmc100 where query('name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"');
+----------+
| COUNT(*) |
+----------+
|       93 |
+----------+
1 row in set (0.038 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone');
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+
1 row in set (0.170 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone Hadjoudis');
+----------+
| COUNT(*) |
+----------+
|       82 |
+----------+
1 row in set (0.047 sec)

We can see that the execution time has been greatly reduced:

0.535 sec -> 0.142 sec 'name:Crystallogr'
1.049 sec -> 0.038 sec 'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"'
0.374 sec -> 0.170 sec 'body:Benzaldehydehydrazone'
0.327 sec -> 0.047 sec 'body:Benzaldehydehydrazone Hadjoudis'
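For reference, the speedup ratio for each query can be derived from the timings above with a few lines of Python. The numbers are the single-run timings reported in this description, not a repeated benchmark, and the `timings` and `speedups` names are only for illustration.

```python
# (old seconds, new seconds) for each query, copied from the runs above.
timings = {
    "name:Crystallogr": (0.535, 0.142),
    'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"': (1.049, 0.038),
    "body:Benzaldehydehydrazone": (0.374, 0.170),
    "body:Benzaldehydehydrazone Hadjoudis": (0.327, 0.047),
}

# Speedup is the old time divided by the new time.
speedups = {query: old / new for query, (old, new) in timings.items()}

for query, ratio in speedups.items():
    print(f"{ratio:5.1f}x  {query}")
```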

fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Sep 4, 2024
@b41sh b41sh added the ci-cloud Build docker image for cloud test label Sep 4, 2024
Contributor

github-actions bot commented Sep 4, 2024

Docker Image for PR

  • tag: pr-16385-2023755-1725425088

note: this image tag is only available for internal use,
please check the internal doc for more details.

@BohuTANG BohuTANG removed the ci-cloud Build docker image for cloud test label Sep 5, 2024
@b41sh b41sh requested a review from sundy-li September 17, 2024 21:34
@b41sh b41sh marked this pull request as ready for review September 17, 2024 21:35
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. A-storage Area: databend storage C-performance Category: Performance labels Sep 17, 2024
@BohuTANG
Member

This PR mainly optimizes the query performance of inverted index for String type.

How much has the query performance improved for the inverted index on String type in this PR?

@b41sh
Member Author

b41sh commented Sep 18, 2024

This PR mainly optimizes the query performance of inverted index for String type.

How much has the query performance improved for the inverted index on String type in this PR?

No tests yet, I will add performance test results later.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 18, 2024
@BohuTANG BohuTANG merged commit 8eaf57b into datafuselabs:main Sep 18, 2024
108 checks passed
Labels
A-storage Area: databend storage C-performance Category: Performance lgtm This PR has been approved by a maintainer pr-refactor this PR changes the code base without new features or bugfix size:XXL This PR changes 1000+ lines, ignoring generated files.

3 participants