Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-6891] Fix RO queries with RLI and record key predicate #11975

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

codope
Copy link
Member

@codope codope commented Sep 20, 2024

Change Logs

Suppose RLI (record index) is enabled on MOR table, do an insert and then delete a record. When table is not compacted, it is expected that RO (read-optimized) query shows the deleted record, while snapshot query will not. A query with no predicates works fine, but if the query has a predicate on record key field, it will use RLI and RLI always returns the latest snapshot. Hence, this PR fixes the issue by not using RLI for queries other than snapshot queries. The PR adds a new supportsQueryType(queryType: HoodieTableQueryType) API in the base class for index support. Different index can implement it differently as and when we add time tavel support for indexes.

Impact

Fix RO queries with RLI and record key predicate.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Sep 20, 2024
@danny0405
Copy link
Contributor

danny0405 commented Sep 21, 2024

A query with no predicates works fine, but if the query has a predicate on record key field, it will use RLI and RLI always returns the latest snapshot. Hence, this PR fixes the issue by not using RLI

Can we just skip the file filtering if we find the file in the index item is a log file when rdo RO query, so that we can eliminate the new API on the file index. And a RO query on full compacted table can also utilitize the RLI index.

@danny0405 danny0405 self-assigned this Sep 21, 2024
@codope
Copy link
Member Author

codope commented Sep 21, 2024

Can we just skip the file filtering if we find the file in the index item is a log file when rdo RO query, so that we can eliminate the new API on the file index. And a RO query on full compacted table can also utilitize the RLI index.

@danny0405 This is a good point. But, the index item in RLI only contains fileId and not the file name. Moreover, I still think it makes sense to have this new API because think about time travel query. Let's say there is an instant t in the past which is a compaction instant and table was fully compacted at that instant. If user runs a query with record key predicate and as of instant t then RLI would still return candidate files as per the latest snapshot. And RO queries are nothing but time travel as of the latest compaction time.

I think it's much cleaner to simply not use RLI for any type of queries other than Snapshot queries. Later, when we add support for time travel on metadata table, we can easily change the condition in the implementation of this API and use the index.

@apache apache deleted a comment from hudi-bot Sep 21, 2024
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405
Copy link
Contributor

Then can we just handle this logic in RLI index because the issue is specific to it? Just skip the data skipping inside RLI index should be working right? We can introduce the new API if more index type are involved I think.

@codope
Copy link
Member Author

codope commented Sep 22, 2024

Then can we just handle this logic in RLI index because the issue is specific to it? Just skip the data skipping inside RLI index should be working right? We can introduce the new API if more index type are involved I think.

It's for RLI and Secondary index. I can remove the colstats override. The default implementation of API returns true which should be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:M PR with lines of changes in (100, 300]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants