[BUG] Race condition on restoring from snapshot #396

Open · heemin32 opened this issue Jun 1, 2023 · 3 comments
Labels: bug (Something isn't working)

heemin32 (Contributor) commented Jun 1, 2023

What is the bug?
This is just my thought process. If a job scheduler extension with a short run interval acquires a lock, it will create the index .opendistro-job-scheduler-lock.

Suppose we take a snapshot and later restore it. The extension's own job index can be restored first, which triggers the scheduled task and recreates the .opendistro-job-scheduler-lock index. If the restore of .opendistro-job-scheduler-lock from the snapshot happens after that, it will fail due to an index name conflict.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create a job scheduler extension with a short run interval that also acquires a lock.
  2. Take a snapshot (a minimal sketch of steps 2 and 3 follows this list).
  3. Restore from the snapshot.
  4. The restore of .opendistro-job-scheduler-lock will fail.
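
As a rough illustration of steps 2 and 3, here is a minimal sketch using the opensearch-py client. The host, repository name ("my-repo"), and snapshot name ("snap-1") are assumptions for illustration and are not part of this report.

```python
# Illustrative sketch only; host, repository, and snapshot names are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Step 2: take a snapshot that includes .opendistro-job-scheduler-lock.
client.snapshot.create(
    repository="my-repo",
    snapshot="snap-1",
    body={"indices": "*", "include_global_state": False},
)

# Step 3: restore. Even if the other indices were closed or deleted first,
# the running extension can recreate .opendistro-job-scheduler-lock by
# acquiring a new lock before that index is restored, so the restore of
# that index fails with an index name conflict (step 4).
client.snapshot.restore(
    repository="my-repo",
    snapshot="snap-1",
    body={"indices": "*"},
)
```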

What is the expected behavior?
Perhaps .opendistro-job-scheduler-lock should not be included in the snapshot, or restoring it should be blocked.

What is your host/environment?
N/A

Do you have any screenshots?
N/A

Do you have any additional context?
opensearch-project/OpenSearch#7778

heemin32 added the bug (Something isn't working) and untriaged labels on Jun 1, 2023
andrross (Member) commented

If the data in .opendistro-job-scheduler-lock is truly ephemeral state that should never survive a snapshot->restore cycle, then it might be appropriate to block it from being snapshotted. However, it does raise the question of whether an index is the right place to store such ephemeral state.

If, however, there are cases where you would want the data in .opendistro-job-scheduler-lock to survive a snapshot->restore cycle, then the current behavior seems appropriate where either the snapshotter or the restorer can choose whether to exclude the index at snapshot or restore time. There may indeed be a race, but if a new .opendistro-job-scheduler-lock has been created and might have data in it, then the operator needs to explicitly make a choice as to whether to use the new data versus the data in the snapshot.
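
As a rough illustration of that operator-level choice (not a proposed change to job-scheduler itself), both the snapshot and restore APIs accept an indices list with exclusion syntax. This is only a hedged sketch with opensearch-py and assumed repository/snapshot names; whether a wildcard matches a hidden system index can depend on cluster settings.

```python
# Sketch of the existing operator-level options (all names are assumptions).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Option A: exclude the lock index when the snapshot is taken.
client.snapshot.create(
    repository="my-repo",
    snapshot="snap-1",
    body={"indices": "*,-.opendistro-job-scheduler-lock"},
)

# Option B: keep it in the snapshot but skip it at restore time,
# preferring whatever lock state the running cluster has created since.
client.snapshot.restore(
    repository="my-repo",
    snapshot="snap-1",
    body={"indices": "*,-.opendistro-job-scheduler-lock"},
)
```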

joshpalis self-assigned this Aug 11, 2023
prudhvigodithi (Collaborator) commented

[Triage]
Hey, just following up on this. I assume this issue still persists; adding @joshpalis @cwperks @dbwiddis to provide some insights. I agree with @andrross that there has to be a mechanism to choose whether .opendistro-job-scheduler-lock should be part of the snapshot->restore cycle.

We could also have the user first restore the index under a new name, delete the original index, and then reindex the data from the renamed index back into the original one (sketched below).
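
A hedged sketch of that rename-and-reindex workaround, again using opensearch-py with assumed names for the repository, snapshot, and temporary index. Note that system-index protections may restrict deleting or writing .opendistro-job-scheduler-lock directly, so this is only an outline of the suggestion, not a verified procedure.

```python
# Sketch of the rename-then-reindex workaround (all names are assumptions).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# 1. Restore the lock index from the snapshot under a temporary name,
#    so the restore does not collide with the live index.
client.snapshot.restore(
    repository="my-repo",
    snapshot="snap-1",
    body={
        "indices": ".opendistro-job-scheduler-lock",
        "rename_pattern": r"\.opendistro-job-scheduler-lock",
        "rename_replacement": "restored-job-scheduler-lock",
    },
)

# 2. Delete the live index that the running extension recreated.
client.indices.delete(index=".opendistro-job-scheduler-lock")

# 3. Reindex the snapshotted lock documents back into the original name.
client.reindex(
    body={
        "source": {"index": "restored-job-scheduler-lock"},
        "dest": {"index": ".opendistro-job-scheduler-lock"},
    }
)
```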

Adding @bbarani

andrross (Member) commented Apr 2, 2024

> We could also have the user first restore the index under a new name, delete the original index, and then reindex the data from the renamed index back into the original one.

We should really try to avoid user interaction here, I think. I'm not super familiar with the low-level details of job scheduler, but I suspect .opendistro-job-scheduler-lock is an implementation detail. It really raises the question of whether a system index is the right place for this data (versus, say, cluster state or some other mechanism).
