[HUDI-6891] Fix RO queries with RLI and record key predicate #11975

codope · 2024-09-20T09:01:03Z

Change Logs

Suppose RLI (record index) is enabled on MOR table, do an insert and then delete a record. When table is not compacted, it is expected that RO (read-optimized) query shows the deleted record, while snapshot query will not. A query with no predicates works fine, but if the query has a predicate on record key field, it will use RLI and RLI always returns the latest snapshot. Hence, this PR fixes the issue by not using RLI for queries other than snapshot queries. The PR adds a new supportsQueryType(queryType: HoodieTableQueryType) API in the base class for index support. Different index can implement it differently as and when we add time tavel support for indexes.

Impact

Fix RO queries with RLI and record key predicate.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

danny0405 · 2024-09-21T09:34:15Z

A query with no predicates works fine, but if the query has a predicate on record key field, it will use RLI and RLI always returns the latest snapshot. Hence, this PR fixes the issue by not using RLI

Can we just skip the file filtering if we find the file in the index item is a log file when rdo RO query, so that we can eliminate the new API on the file index. And a RO query on full compacted table can also utilitize the RLI index.

codope · 2024-09-21T10:37:05Z

Can we just skip the file filtering if we find the file in the index item is a log file when rdo RO query, so that we can eliminate the new API on the file index. And a RO query on full compacted table can also utilitize the RLI index.

@danny0405 This is a good point. But, the index item in RLI only contains fileId and not the file name. Moreover, I still think it makes sense to have this new API because think about time travel query. Let's say there is an instant t in the past which is a compaction instant and table was fully compacted at that instant. If user runs a query with record key predicate and as of instant t then RLI would still return candidate files as per the latest snapshot. And RO queries are nothing but time travel as of the latest compaction time.

I think it's much cleaner to simply not use RLI for any type of queries other than Snapshot queries. Later, when we add support for time travel on metadata table, we can easily change the condition in the implementation of this API and use the index.

hudi-bot · 2024-09-21T10:40:06Z

CI report:

c16f9ec Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

danny0405 · 2024-09-22T03:19:28Z

Then can we just handle this logic in RLI index because the issue is specific to it? Just skip the data skipping inside RLI index should be working right? We can introduce the new API if more index type are involved I think.

codope · 2024-09-22T11:09:16Z

Then can we just handle this logic in RLI index because the issue is specific to it? Just skip the data skipping inside RLI index should be working right? We can introduce the new API if more index type are involved I think.

It's for RLI and Secondary index. I can remove the colstats override. The default implementation of API returns true which should be fine.

github-actions bot added the size:M PR with lines of changes in (100, 300] label Sep 20, 2024

codope force-pushed the HUDI-6891-rli-mor branch from 31836f8 to 0c0b22a Compare September 20, 2024 10:44

[HUDI-6891] Fix RO queries with RLI and record key predicate

c16f9ec

codope force-pushed the HUDI-6891-rli-mor branch from 0c0b22a to c16f9ec Compare September 21, 2024 06:20

danny0405 self-assigned this Sep 21, 2024

apache deleted a comment from hudi-bot Sep 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HUDI-6891] Fix RO queries with RLI and record key predicate #11975

[HUDI-6891] Fix RO queries with RLI and record key predicate #11975

codope commented Sep 20, 2024 •

edited

Loading

danny0405 commented Sep 21, 2024 •

edited

Loading

codope commented Sep 21, 2024

hudi-bot commented Sep 21, 2024

danny0405 commented Sep 22, 2024

codope commented Sep 22, 2024

[HUDI-6891] Fix RO queries with RLI and record key predicate #11975

Are you sure you want to change the base?

[HUDI-6891] Fix RO queries with RLI and record key predicate #11975

Conversation

codope commented Sep 20, 2024 • edited Loading

Change Logs

Impact

Risk level (write none, low medium or high below)

Documentation Update

Contributor's checklist

danny0405 commented Sep 21, 2024 • edited Loading

codope commented Sep 21, 2024

hudi-bot commented Sep 21, 2024

CI report:

danny0405 commented Sep 22, 2024

codope commented Sep 22, 2024

codope commented Sep 20, 2024 •

edited

Loading

danny0405 commented Sep 21, 2024 •

edited

Loading