refactor(storage): improve inverted index read fst file first to reduce load index #16385

b41sh · 2024-09-04T04:08:29Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces a new search function that replaces the tantivy index searcher. An inverted index query now follows the process below:

Read the fst (Finite State Transducer) first and check whether the terms in the query match; return immediately if they do not.
Read the term dict to get the postings_range in idx and the positions_range in pos for each term.
Read the doc_ids and term_freqs in idx for each term using postings_range.
If it is a phrase query, read the positions of each term in pos using positions_range.
Collect the matched doc ids using the term-related information.
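
The steps above can be sketched with toy in-memory structures. This is only an illustration, not the code in this PR: the names FST, TERM_DICT, IDX, POS, lookup, search_term and search_phrase are hypothetical, and the real columns are binary files rather than Python lists.

```python
# Toy stand-ins for the index columns (the real columns are binary files).
FST = {"hello": 0, "world": 1}  # term -> ordinal; stands in for the fst lookup
TERM_DICT = [  # ordinal -> (postings_range, positions_range)
    ((0, 2), (0, 2)),
    ((2, 4), (2, 4)),
]
IDX = [(0, 1), (2, 1), (0, 1), (1, 1)]  # (doc_id, term_freq) entries
POS = [[0], [3], [1], [0]]  # token positions, one list per IDX entry


def lookup(term, with_positions=False):
    """Return {doc_id: positions} for a term, or None if the fst has no match."""
    ordinal = FST.get(term)  # step 1: only the fst is consulted
    if ordinal is None:
        return None
    postings_range, positions_range = TERM_DICT[ordinal]  # step 2
    postings = IDX[slice(*postings_range)]  # step 3
    if with_positions:  # step 4: only phrase queries need pos
        positions = POS[slice(*positions_range)]
    else:
        positions = [[] for _ in postings]
    return {doc_id: pos for (doc_id, _freq), pos in zip(postings, positions)}


def search_term(term):
    docs = lookup(term)
    return sorted(docs) if docs else []


def search_phrase(terms):
    per_term = []
    for term in terms:
        docs = lookup(term, with_positions=True)
        if docs is None:
            return []  # one missing term rules out the whole phrase
        per_term.append(docs)
    # step 5: keep docs where the terms appear at consecutive positions
    candidates = set(per_term[0]).intersection(*per_term[1:])
    matched = []
    for doc_id in sorted(candidates):
        if any(
            all(start + i in per_term[i][doc_id] for i in range(len(terms)))
            for start in per_term[0][doc_id]
        ):
            matched.append(doc_id)
    return matched
```

In this toy data, a term that is absent from FST returns early without touching TERM_DICT, IDX or POS, which mirrors the case where only the fst needs to be read.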

If the term does not match, only the fst file needs to be read. Since this is the case for most blocks, and the fst is usually only about one-tenth the size of the entire index data, this can greatly speed up queries.

If the term matches, only the idx and pos data of the related terms need to be read, rather than all of the idx and pos data. This data is small enough to be cached entirely in memory, which speeds up subsequent queries.
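
As a rough illustration of why small range reads cache well, the sketch below memoizes reads by (column, start, end). The cache actually used by this PR is not described here; read_range, COLUMNS and physical_reads are hypothetical names, and functools.lru_cache merely stands in for whatever cache the engine uses.

```python
from functools import lru_cache

# Fake column files; in practice these would be byte ranges in object storage.
COLUMNS = {"idx": bytes(range(64)), "pos": bytes(range(64, 128))}
physical_reads = []  # records every read that misses the cache


@lru_cache(maxsize=1024)
def read_range(column: str, start: int, end: int) -> bytes:
    """Fetch bytes [start, end) of a column; repeated calls are served from memory."""
    physical_reads.append((column, start, end))
    return COLUMNS[column][start:end]
```

Calling read_range("idx", 8, 12) twice performs one physical read; the second call is answered from the cache.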

This PR mainly optimizes the query performance of inverted index for String type. Score calculation and searching the JSON type are not implemented in this PR, so the relevant tests have been temporarily modified; those functions will be implemented in follow-up PRs.

The inverted index data is stored in a new file format that splits the data by column, so that only the relevant fields are read as required. The schema information is also stored in the footer for future expansion. The previous data format can still be read, so existing indexes continue to work.

┌─────┐ ┌──────┐ ┌─────┐ ┌─────┐ ┌───────────┐ ┌────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐
│ fst │ │ term │ │ idx │ │ pos │ │ fieldnorm │ │ schema │ │ offsets │ │ schema_len │ │ meta_len │
└─────┘ └──────┘ └─────┘ └─────┘ └───────────┘ └────────┘ └─────────┘ └────────────┘ └──────────┘
 \                                          /   \                                              /
  \                    ___________________ /     \             _______________________________/
   \                  /                           \           /
   index columns datas                             footer meta
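
A minimal sketch of how such a layout can be written and read back from the end of the file. The byte-level encoding is an assumption made for illustration: here the schema is a JSON list of column names, offsets are little-endian u32 end offsets of each column, and schema_len and meta_len are little-endian u32 values, with meta_len covering schema plus offsets. The actual encoding used by this PR may differ, and write_index and read_column are hypothetical helpers.

```python
import json
import struct


def write_index(columns: dict) -> bytes:
    """Lay out column data first, then the footer: schema, offsets, schema_len, meta_len."""
    body = b"".join(columns.values())
    schema = json.dumps(list(columns)).encode()
    ends, total = [], 0
    for data in columns.values():
        total += len(data)
        ends.append(total)  # end offset of each column inside the body
    offsets = struct.pack(f"<{len(ends)}I", *ends)
    meta = schema + offsets
    return body + meta + struct.pack("<II", len(schema), len(meta))


def read_column(blob: bytes, name: str) -> bytes:
    """Read one column using only the footer and that column's byte range."""
    schema_len, meta_len = struct.unpack("<II", blob[-8:])
    meta = blob[-8 - meta_len:-8]
    names = json.loads(meta[:schema_len])
    ends = struct.unpack(f"<{len(names)}I", meta[schema_len:])
    i = names.index(name)
    start = ends[i - 1] if i else 0
    return blob[start:ends[i]]
```

With a layout like this, a reader fetches the fixed-size tail first, then the footer meta, and finally only the byte range of the column it needs (for example fst alone when no term matches).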

I created a table pmc100 in my local environment and ran some SQL queries to test:

CREATE TABLE pmc100 (
  name VARCHAR NULL,
  journal VARCHAR NULL,
  date VARCHAR NULL,
  volume VARCHAR NULL,
  issue VARCHAR NULL,
  accession VARCHAR NULL,
  timestamp TIMESTAMP NULL,
  pmid VARCHAR NULL,
  body VARCHAR NULL
);

CREATE INVERTED INDEX idx1 on pmc100(name, body);

COPY INTO pmc100 FROM 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = NDJSON);

MySQL [(none)]> select count(*) from pmc100;
+----------+
| COUNT(*) |
+----------+
|   574199 |
+----------+
1 row in set (0.026 sec)

old version

MySQL [(none)]> select count(*) from pmc100 where query('name:Crystallogr');
+----------+
| COUNT(*) |
+----------+
|    25135 |
+----------+
1 row in set (0.535 sec)

MySQL [(none)]> select count(*) from pmc100 where query('name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"');
+----------+
| COUNT(*) |
+----------+
|       93 |
+----------+
1 row in set (1.049 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone');
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+
1 row in set (0.374 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone Hadjoudis');
+----------+
| COUNT(*) |
+----------+
|       82 |
+----------+
1 row in set (0.327 sec)

new version

MySQL [(none)]> select count(*) from pmc100 where query('name:Crystallogr');
+----------+
| COUNT(*) |
+----------+
|     3240 |
+----------+
1 row in set (0.142 sec)

MySQL [(none)]> select count(*) from pmc100 where query('name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"');
+----------+
| COUNT(*) |
+----------+
|       93 |
+----------+
1 row in set (0.038 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone');
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+
1 row in set (0.170 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone Hadjoudis');
+----------+
| COUNT(*) |
+----------+
|       82 |
+----------+
1 row in set (0.047 sec)

We can see that the execution time has been greatly reduced:

0.535 sec -> 0.142 sec 'name:Crystallogr'
1.049 sec -> 0.038 sec 'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"'
0.374 sec -> 0.170 sec 'body:Benzaldehydehydrazone'
0.327 sec -> 0.047 sec 'body:Benzaldehydehydrazone Hadjoudis'
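
The speedup factors implied by these timings can be computed directly. The snippet only restates the numbers listed above; the variable names are illustrative. Note that the first query returned different counts in the two runs (25135 vs 3240), so its ratio may not be a like-for-like comparison.

```python
# (old seconds, new seconds) per query, copied from the timings above.
timings = {
    "name:Crystallogr": (0.535, 0.142),
    'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"': (1.049, 0.038),
    "body:Benzaldehydehydrazone": (0.374, 0.170),
    "body:Benzaldehydehydrazone Hadjoudis": (0.327, 0.047),
}

# Speedup = old time / new time, rounded to one decimal place.
speedups = {query: round(old / new, 1) for query, (old, new) in timings.items()}

for query, factor in speedups.items():
    print(f"{factor:>5}x  {query}")
```

These are single runs on one local machine, so the factors are indicative rather than a benchmark.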

fixes: #[Link the issue here]

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):


github-actions · 2024-09-04T04:46:27Z

Docker Image for PR

tag: pr-16385-2023755-1725425088

note: this image tag is only available for internal use,
please check the internal doc for more details.


BohuTANG · 2024-09-18T01:02:42Z

This PR mainly optimizes the query performance of inverted index for String type.

How much has the query performance improved for the inverted index on String type in this PR?

b41sh · 2024-09-18T01:58:52Z

This PR mainly optimizes the query performance of inverted index for String type.

How much has the query performance improved for the inverted index on String type in this PR?

No tests yet, I will add performance test results later.

tests/sqllogictests/suites/ee/04_ee_inverted_index/04_0000_inverted_index_base.test

github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) Sep 4, 2024

b41sh added the ci-cloud label (Build docker image for cloud test) Sep 4, 2024

BohuTANG removed the ci-cloud label (Build docker image for cloud test) Sep 5, 2024

refactor(storage): improve inverted index read fst file first to reduce load index

3dceccc

b41sh force-pushed the refactor-inverted-index-fst branch from 6212bc0 to 3dceccc September 16, 2024 21:39

b41sh added 3 commits September 17, 2024 05:48

fix typos

13ad1a7

use new inverted index file format

5d5825e

fix typos

ffb6ad5

b41sh requested a review from sundy-li September 17, 2024 21:34

b41sh marked this pull request as ready for review September 17, 2024 21:35

dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), A-storage (Area: databend storage) and C-performance (Category: Performance) labels Sep 17, 2024

sundy-li approved these changes Sep 18, 2024


dosubot bot added the lgtm label (This PR has been approved by a maintainer) Sep 18, 2024

BohuTANG reviewed Sep 18, 2024


tests/sqllogictests/suites/ee/04_ee_inverted_index/04_0000_inverted_index_base.test

BohuTANG merged commit 8eaf57b into datafuselabs:main Sep 18, 2024
108 checks passed
