
[HUDI-7563] Add support to drop index using sql #11951

Open · wants to merge 3 commits into base: master
Conversation

@codope (Member) commented Sep 17, 2024

Change Logs

  • Support dropping index through SQL
  • Add tests to drop a secondary index and a functional index (the record level index, RLI, is already covered)
  • Some renames and refactoring

Impact

Users can now drop an index using SQL. A minimal usage sketch is shown below.
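For illustration only, here is a rough sketch of how the feature might be exercised from a Spark job. The table name hudi_table, the index name idx_city, and the session configuration are assumptions for the example rather than something defined in this PR, and the exact DROP INDEX syntax should be checked against the tests added here.

```java
import org.apache.spark.sql.SparkSession;

public class DropIndexExample {
  public static void main(String[] args) {
    // Local Spark session with the Hudi SQL extension enabled; additional
    // Hudi or catalog configs may be needed depending on the environment.
    SparkSession spark = SparkSession.builder()
        .appName("hudi-drop-index-example")
        .master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate();

    // Drop a previously created index on an existing Hudi table; IF EXISTS is
    // expected to make this a no-op when the index is not present.
    spark.sql("DROP INDEX IF EXISTS idx_city ON hudi_table");

    spark.stop();
  }
}
```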

Risk level (write none, low, medium, or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Sep 17, 2024
@nsivabalan nsivabalan self-assigned this Sep 20, 2024
@nsivabalan (Contributor) left a comment

Left a few minor comments.

@nsivabalan (Contributor) left a comment

A few minor nits.

Comment on lines +72 to +76
      public static HoodieSparkIndexClient getInstance(SparkSession sparkSession) {
        if (_instance == null) {
  -       synchronized (HoodieSparkFunctionalIndexClient.class) {
  +       synchronized (HoodieSparkIndexClient.class) {
            if (_instance == null) {
  -           _instance = new HoodieSparkFunctionalIndexClient(sparkSession);
  +           _instance = new HoodieSparkIndexClient(sparkSession);

Contributor comment:

This logic is a bit weird for concurrency control. We should revisit it later.

@codope (Member, Author) replied on Sep 21, 2024:

This is the usual double-checked locking pattern. You are probably concerned about memory visibility: without proper synchronization, another thread could observe a non-null _instance before the object is fully constructed. However, _instance is declared volatile, which ensures that writes to _instance are immediately visible to all threads and prevents reordering of instructions during construction.
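For reference, here is a minimal, self-contained sketch of the double-checked locking pattern being discussed; the class and field names are illustrative, not the actual Hudi code.

```java
public class LazySingleton {

  // volatile is the key to the pattern: it guarantees that a reference
  // published in _instance points to a fully constructed object and that the
  // write is visible to all threads.
  private static volatile LazySingleton _instance;

  private LazySingleton() {
  }

  public static LazySingleton getInstance() {
    if (_instance == null) {                 // first check, without locking
      synchronized (LazySingleton.class) {
        if (_instance == null) {             // second check, under the lock
          _instance = new LazySingleton();
        }
      }
    }
    return _instance;
  }
}
```

Without volatile, a second thread could observe a non-null reference to a partially constructed object, which is exactly the visibility issue raised above.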

Comment on lines +117 to +129
      if (!indexExists(metaClient, indexName)) {
        if (ignoreIfNotExists) {
          return;
        } else {
          throw new HoodieFunctionalIndexException("Index does not exist: " + indexName);
        }
      }

      LOG.info("Dropping index {}", indexName);
      HoodieIndexDefinition indexDefinition = metaClient.getIndexMetadata().get().getIndexDefinitions().get(indexName);
      try (SparkRDDWriteClient writeClient = HoodieCLIUtils.createHoodieWriteClient(
          sparkSession, metaClient.getBasePath().toString(), mapAsScalaImmutableMap(buildWriteConfig(metaClient, indexDefinition)), toScalaOption(Option.empty()))) {
        writeClient.dropIndex(Collections.singletonList(indexName));

Contributor comment:

For future reference, this logic is not really specific to the engine itself, so it could be abstracted into the index client by plugging in the engine-specific write client. A rough sketch of this idea follows.
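To make the suggestion concrete, here is a hypothetical sketch of that shape; BaseIndexClient, IndexWriteClient, and createWriteClient are made-up names for illustration and are not existing Hudi APIs.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical abstraction: the engine-agnostic drop flow lives in the base
// class, and only the write-client construction is delegated to the engine.
interface IndexWriteClient extends AutoCloseable {
  void dropIndex(List<String> indexNames);
}

abstract class BaseIndexClient {

  // Engine-specific hook: a Spark implementation would wrap SparkRDDWriteClient,
  // a Flink implementation its own write client, and so on.
  protected abstract IndexWriteClient createWriteClient(String basePath, String indexName);

  // Shared drop flow (existence check and logging omitted for brevity).
  public void drop(String basePath, String indexName) throws Exception {
    try (IndexWriteClient writeClient = createWriteClient(basePath, indexName)) {
      writeClient.dropIndex(Collections.singletonList(indexName));
    }
  }
}
```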

@codope (Member, Author) replied:

Don't we need the engine-specific write client to call the base API BaseHoodieWriteClient.dropIndex?

        HoodieSparkIndexClient.getInstance(sparkSession).drop(metaClient, indexName, ignoreIfNotExists)
      } catch {
        case _: IllegalArgumentException =>
          SecondaryIndexManager.getInstance().drop(metaClient, indexName, ignoreIfNotExists)

Contributor comment:

Why drop here again?

@codope (Member, Author) replied:

This is legacy code from the incomplete RFC-52. SecondaryIndexManager was introduced in RFC-52, but it is just wrapper code and does not really manage any index underneath. Per the RFC, it is supposed to support indexes built on third-party libraries such as Lucene, but we have not added any such support so far. In my opinion, we should remove all that code; I am keeping it here only to make some tests pass. If you agree, I can take the cleanup as a follow-up.

@hudi-bot

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build
