
Race condition causes inconsistencies in chart repository's index.yaml #18

Open
thedoctor opened this issue Dec 14, 2017 · 16 comments · May be fixed by #447
Comments

@thedoctor

helm s3 push CHART REPO has a race condition when multiple charts are pushed around the same time. It occurs in the following situation:

  1. UserA runs: helm s3 push CHART_A s3://REPO
  2. UserB runs: helm s3 push CHART_B s3://REPO
  3. UserA's process fetches s3://REPO/index.yaml
  4. UserB's process fetches s3://REPO/index.yaml
  5. UserA's process updates his fetched index.yaml with CHART_A's new version and replaces the remote s3://REPO/index.yaml with his updated version.
  6. UserB's process updates his fetched (and now out-of-date) index.yaml with CHART_B's new version and replaces the remote s3://REPO/index.yaml with his updated version, which does not contain CHART_A.

At the end of this process, both CHART_A and CHART_B are present in the repository, but CHART_A is missing from the index, so any downstream chart that requires it will fail when running helm dep update CHART_THAT_DEPENDS_ON_CHART_A.

A simple solution would be for the plugin to create a mutex, e.g. an index.yaml.lock object, before fetching index.yaml, and delete it after replacing index.yaml with the updated version.
If the lockfile is already present, the plugin should wait until it has been deleted and a new one can be created before fetching index.yaml and proceeding. In the worst case this could cause cascading delays if many charts are updated frequently, but slow is better than broken.
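
A rough Go sketch of that lockfile idea follows. It is purely illustrative: the lock key and retry interval are made up, and (as discussed further down in this thread) S3 has no atomic create-if-absent operation, so two clients can still both "acquire" the lock in a narrow window.

// Best-effort S3 "mutex" via a lock object. Illustrative only: the
// HeadObject/PutObject pair below is NOT atomic on S3, so this narrows
// the race window rather than eliminating it.
package main

import (
    "context"
    "errors"
    "strings"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    "github.com/aws/aws-sdk-go-v2/service/s3/types"
)

const lockKey = "index.yaml.lock" // hypothetical lock object name

func acquireLock(ctx context.Context, c *s3.Client, bucket string) error {
    for {
        _, err := c.HeadObject(ctx, &s3.HeadObjectInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(lockKey),
        })
        var notFound *types.NotFound
        if errors.As(err, &notFound) {
            // No lock object: try to create one. Another client may do the
            // same at this exact moment - that is the remaining race.
            _, err = c.PutObject(ctx, &s3.PutObjectInput{
                Bucket: aws.String(bucket),
                Key:    aws.String(lockKey),
                Body:   strings.NewReader("locked"),
            })
            return err
        }
        if err != nil {
            return err
        }
        time.Sleep(2 * time.Second) // lock held by someone else; wait and retry
    }
}

func releaseLock(ctx context.Context, c *s3.Client, bucket string) error {
    _, err := c.DeleteObject(ctx, &s3.DeleteObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(lockKey),
    })
    return err
}

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        panic(err)
    }
    client := s3.NewFromConfig(cfg)
    if err := acquireLock(ctx, client, "REPO"); err != nil {
        panic(err)
    }
    defer releaseLock(ctx, client, "REPO")
    // ... fetch index.yaml, add the new chart entry, upload index.yaml ...
}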

@markhu

markhu commented Dec 14, 2017

Until a full-on fix arrives, is there a command or script that can regenerate the index.yaml file by listing the files in the repo?

@hypnoglow
Owner

Hi!

@thedoctor Yes, there is an obvious race condition. But how would a lock file help? With AWS S3 it is not possible - you would run into the same issue; imagine both clients creating the lock file at the same time. There is no atomic "create the file if it does not exist" operation. See the consistency model of AWS S3. Index "transactions" can only be implemented with third-party mechanisms, such as a database used for locking. If you see a way to easily implement pluggable third-party consistency providers, please suggest it.

I think the best solution for this plugin at the moment is to implement a "reindex" action (as @markhu suggested) that regenerates the repository index on demand. This will partly solve broken indexes, but of course the race condition will still be there.

So, what do you think? I see many requests for a "reindex" operation from users of this plugin, so I will try to implement it this coming weekend.

@hypnoglow
Owner

Ref: #5

@markhu

markhu commented Dec 14, 2017

Granted, you can't 100% eliminate collisions if, as you say, both clients can create a lock file at the same time. But we can still try to narrow the window of opportunity for race conditions. One improvement would be to delay processing the index.yaml file until after uploading the chart.tgz - here's PR #19 for that.

@thedoctor
Author

thedoctor commented Dec 14, 2017

Ah, I (foolishly) didn't realize S3 bucket contents were eventually consistent.
In that case, the only comprehensive solution would have to involve a service outside of S3 that implements a real mutex, which is probably beyond the scope of this plugin (although a low-overhead approach that I imagine would work for a lot of people would be to use a git remote).

If the goal is just to reduce the rate of occurrence, I'm guessing S3's time-to-consistency is good enough, often enough, that the lockfile approach would have some impact - but I agree that it's not a complete enough solution to be worth doing.

Reindexing is definitely better than nothing, and probably sufficient for almost everyone. 👍

@hypnoglow
Owner

@thedoctor ,

although a low-overhead way to do it that I imagine would work for a lot of people would be with a git remote

I do not understand what you mean - can you explain it, and whether it could be used to provide a consistent lock on the index file?

As for reindexing, yes, I think it will solve almost all cases of a broken index.

@markhu

markhu commented Dec 16, 2017

Maybe he means storing the index.yaml file in git instead of S3.

hypnoglow added a commit that referenced this issue Dec 19, 2017
Narrow down the window of opportunity for race conditions

Reference issue: #18
The idea was presented in this PR: #19
@hypnoglow
Owner

So, I implemented the idea from #19 in #21.

I've also created an issue for reindexing: #22

I think I will keep this issue open for a while; maybe someone will come up with another idea to work around the problem.

@hypnoglow
Owner

Hey folks, I rolled out the reindex command in 0.5.0. Let me know whether or not it works well, and how it suits your workflow - e.g. would you automate it, and how would you automate reindexing (run it X times per day, or trigger it on a condition), or is an occasional manual reindex enough?
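
(For later readers: the reindex action scans the bucket and rebuilds index.yaml from the chart archives it finds. Assuming the repository was added to Helm under the name my-repo, it is invoked as: helm s3 reindex my-repo.)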

@josdotso

Related: https://www.terraform.io/docs/backends/types/s3.html

@hypnoglow
Owner

Hm, that's interesting - maybe we can implement pluggable third-party consistency providers as I suggested above, with DynamoDB being the first. I will try to find some time to investigate this.
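
A sketch of that Terraform-style DynamoDB lock (the table name, key schema, and lock id are hypothetical; it assumes a table whose partition key is the string attribute LockID). The conditional PutItem succeeds for exactly one client, which provides the atomic create-if-absent that S3 lacks:

// Acquire a repository-wide lock via a DynamoDB conditional write,
// similar to Terraform's S3 backend locking. Table name, key schema,
// and lock id are hypothetical.
package main

import (
    "context"
    "errors"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func lockIndex(ctx context.Context, db *dynamodb.Client, table, lockID string) (bool, error) {
    _, err := db.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String(table),
        Item: map[string]types.AttributeValue{
            "LockID": &types.AttributeValueMemberS{Value: lockID},
        },
        // Fails atomically if another client already holds the lock.
        ConditionExpression: aws.String("attribute_not_exists(LockID)"),
    })
    var ccf *types.ConditionalCheckFailedException
    if errors.As(err, &ccf) {
        return false, nil // lock is held by someone else
    }
    return err == nil, err
}

func unlockIndex(ctx context.Context, db *dynamodb.Client, table, lockID string) error {
    _, err := db.DeleteItem(ctx, &dynamodb.DeleteItemInput{
        TableName: aws.String(table),
        Key: map[string]types.AttributeValue{
            "LockID": &types.AttributeValueMemberS{Value: lockID},
        },
    })
    return err
}

func main() {
    ctx := context.Background()
    cfg, _ := config.LoadDefaultConfig(ctx)
    db := dynamodb.NewFromConfig(cfg)
    if ok, err := lockIndex(ctx, db, "helm-s3-locks", "s3://REPO/index.yaml"); err != nil || !ok {
        return // could not acquire the lock; retry with backoff
    }
    defer unlockIndex(ctx, db, "helm-s3-locks", "s3://REPO/index.yaml")
    // ... fetch, update, and upload index.yaml here ...
}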

@masterada

Another approach could be to make it eventually consistent. When pushing the index.yaml, the helm s3 plugin could also push an index.yaml diff file, like this:

apiVersion: v1
addedEntries:
  alpine:
    - created: 2016-10-06T16:23:20.499814565-06:00
      description: Deploy a basic Alpine Linux pod
      digest: 99c76e403d752c84ead610644d4b1c2f2b453a74b921f422b9dcb8a7c8b559cd
      home: https://k8s.io/helm
      name: alpine
      sources:
      - https://github.com/helm/helm
      urls:
      - https://technosophos.github.io/tscharts/alpine-0.2.0.tgz
      version: 0.2.0

Or for delete:

apiVersion: v1
removedEntries:
  alpine:
    - created: 2016-10-06T16:23:20.499814565-06:00
      description: Deploy a basic Alpine Linux pod
      digest: 99c76e403d752c84ead610644d4b1c2f2b453a74b921f422b9dcb8a7c8b559cd
      home: https://k8s.io/helm
      name: alpine
      sources:
      - https://github.com/helm/helm
      urls:
      - https://technosophos.github.io/tscharts/alpine-0.2.0.tgz
      version: 0.2.0

Whenever a helm s3 index update occurs, it should do the following steps:

  1. Upload the diff file for the current index modification
  2. Fetch the current index.yaml file
  3. Fetch all diff files from s3
  4. Apply all diff files (including the one uploaded in step 1) to index.yaml (use the chart digest for merging, and ignore already-applied changes)
  5. Upload the new index.yaml
  6. Delete old diff files (e.g. older than 5 minutes), because it's very unlikely that there will be any conflicting change older than that

Fetching the index.yaml file with the s3 plugin could repeat steps 2-4 for more consistency. Using the repo without the s3 plugin (as a static site) would still be a valid Helm repo, just with weaker consistency.

There could be a command that repeats steps 2-5, so a cron job could provide real eventual consistency.
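
A self-contained sketch of the merge in step 4 (the types are simplified stand-ins for illustration, not Helm's actual index structures; entries are matched by chart digest so replaying an already-applied diff is a no-op):

package main

import "fmt"

type ChartVersion struct {
    Name    string   `yaml:"name"`
    Version string   `yaml:"version"`
    Digest  string   `yaml:"digest"`
    URLs    []string `yaml:"urls"`
}

type Index struct {
    APIVersion string                    `yaml:"apiVersion"`
    Entries    map[string][]ChartVersion `yaml:"entries"`
}

type IndexDiff struct {
    APIVersion     string                    `yaml:"apiVersion"`
    AddedEntries   map[string][]ChartVersion `yaml:"addedEntries"`
    RemovedEntries map[string][]ChartVersion `yaml:"removedEntries"`
}

// applyDiff merges one diff into the index, using the chart digest to
// skip changes that have already been applied.
func applyDiff(idx *Index, diff IndexDiff) {
    if idx.Entries == nil {
        idx.Entries = map[string][]ChartVersion{}
    }
    for name, added := range diff.AddedEntries {
        for _, cv := range added {
            if !hasDigest(idx.Entries[name], cv.Digest) {
                idx.Entries[name] = append(idx.Entries[name], cv)
            }
        }
    }
    for name, removed := range diff.RemovedEntries {
        var kept []ChartVersion
        for _, cv := range idx.Entries[name] {
            if !hasDigest(removed, cv.Digest) {
                kept = append(kept, cv)
            }
        }
        idx.Entries[name] = kept
    }
}

func hasDigest(versions []ChartVersion, digest string) bool {
    for _, cv := range versions {
        if cv.Digest == digest {
            return true
        }
    }
    return false
}

func main() {
    idx := &Index{APIVersion: "v1"}
    diff := IndexDiff{
        APIVersion: "v1",
        AddedEntries: map[string][]ChartVersion{
            "alpine": {{Name: "alpine", Version: "0.2.0", Digest: "99c76e40"}},
        },
    }
    applyDiff(idx, diff)
    applyDiff(idx, diff) // replaying the same diff changes nothing
    fmt.Printf("alpine versions in index: %d\n", len(idx.Entries["alpine"]))
}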

@pete911

pete911 commented Jul 5, 2019

Hi, S3 objects have ETags; I'm checking whether there's an option to validate them on the S3 side. If there is, the process would look like this:

  • download the index along with its ETag
  • update the index
  • upload the new index with the expected previous ETag
    If the ETag does not match (meaning someone or something changed the index in the meantime), the upload is failed/rejected and the client retries (download, update, upload)
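
A sketch of that compare-and-swap loop in Go. Note the assumptions: it relies on S3 conditional writes (the IfMatch parameter on PutObject), which, to my knowledge, S3 did not offer when this comment was written, and updateIndex stands in for a hypothetical helper that inserts the new chart entry into the raw index bytes.

// Optimistic-concurrency update of index.yaml: re-upload only if the
// object's ETag is unchanged, otherwise retry. Assumes the SDK exposes
// S3's conditional-write IfMatch parameter on PutObject.
package main

import (
    "bytes"
    "context"
    "io"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func pushWithRetry(ctx context.Context, c *s3.Client, bucket string, updateIndex func([]byte) []byte, maxAttempts int) error {
    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        // 1. Download the index together with its ETag.
        get, err := c.GetObject(ctx, &s3.GetObjectInput{
            Bucket: aws.String(bucket),
            Key:    aws.String("index.yaml"),
        })
        if err != nil {
            return err
        }
        body, err := io.ReadAll(get.Body)
        get.Body.Close()
        if err != nil {
            return err
        }

        // 2. Update the index locally.
        updated := updateIndex(body)

        // 3. Upload, conditional on the ETag we downloaded. If someone
        // changed index.yaml in the meantime, S3 rejects the write and we
        // loop to retry (a robust version would retry only on
        // 412 Precondition Failed).
        _, err = c.PutObject(ctx, &s3.PutObjectInput{
            Bucket:  aws.String(bucket),
            Key:     aws.String("index.yaml"),
            Body:    bytes.NewReader(updated),
            IfMatch: get.ETag,
        })
        if err == nil {
            return nil
        }
        lastErr = err
    }
    return lastErr
}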

@demo-ecout

Just for the record, S3 now supports Object Lock.

@eserden

eserden commented Mar 17, 2021

Any updates?

@balazs-fark-epam

Hi, @hypnoglow

Is this issue still around?

Thanks,

Balazs

chasinglogic pushed a commit to chasinglogic/helm-s3 that referenced this issue Jul 24, 2024
…yaml generation

I have set this up so it is sufficiently abstracted that other lock
backends could be added in the future.

Fixes hypnoglow#18