🐛 Remove cache when catalog is deleted #1207

Open · m1kola wants to merge 2 commits into main from cleanup_cache

Conversation

@m1kola (Member) commented Sep 3, 2024

Description

We add a finaliser to ClusterCatalog, which enables us to remove the catalog cache from the filesystem when a catalog is deleted.

Fixes #948

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Sep 3, 2024
@openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Sep 3, 2024

netlify bot commented Sep 3, 2024

Deploy Preview for olmv1 ready!

🔨 Latest commit: b6ae265
🔍 Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/66ec170df89a010008b5a112
😎 Deploy Preview: https://deploy-preview-1207--olmv1.netlify.app

@openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Sep 3, 2024
@joelanford (Member) commented:

I like the direction of this. Maybe not something to do in this PR, but once we have a separate catalog controller, I think it would make sense to have that controller be the thing that also populates operator-controller's cache.

Right now, the operator-controller catalog cache is populated on-demand when a ClusterExtension is reconciled. The outcome is that some ClusterExtension reconciles take quite a bit longer because they have to pause to pull and extract catalog content from catalogd.

If we moved pull/extract logic to a separate ClusterCatalog controller, then operator-controller could do this work async. At that point the only ClusterExtensions that would still have to wait are the ones that are reconciled just after a ClusterCatalog has new data available.
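For illustration, the asynchronous population described here might look roughly like the following reconciler skeleton (a sketch only: `CatalogCachePopulator`, the `Cache.Populate` method, and the catalogd import path are assumptions for illustration, not code from this PR):

```go
package controllers

import (
	"context"

	catalogd "github.com/operator-framework/catalogd/api/core/v1alpha1" // assumed import path
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// CatalogCachePopulator is a hypothetical reconciler that pulls and extracts
// catalog content whenever a ClusterCatalog changes, instead of doing that
// work lazily during ClusterExtension reconciliation.
type CatalogCachePopulator struct {
	client.Client
	Cache interface {
		// Populate is a hypothetical method that pulls and extracts the
		// catalog content into the local filesystem cache.
		Populate(ctx context.Context, catalog *catalogd.ClusterCatalog) error
	}
}

func (r *CatalogCachePopulator) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var catalog catalogd.ClusterCatalog
	if err := r.Get(ctx, req.NamespacedName, &catalog); err != nil {
		// Deleted catalogs are handled by the cleanup path; nothing to populate.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Pull and extract catalog content ahead of time so ClusterExtension
	// reconciles rarely need to wait for it.
	return ctrl.Result{}, r.Cache.Populate(ctx, &catalog)
}
```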

@m1kola m1kola changed the title ✨ Remove cache when catalog is deleted 🐛 Remove cache when catalog is deleted Sep 13, 2024
@m1kola force-pushed the cleanup_cache branch 2 times, most recently from 0ed5135 to b607a08 on September 16, 2024 at 15:37

codecov bot commented Sep 16, 2024

Codecov Report

Attention: Patch coverage is 82.97872% with 8 lines in your changes missing coverage. Please review.

Project coverage is 76.53%. Comparing base (df0e848) to head (b6ae265).

Files with missing lines | Patch % | Lines
internal/controllers/clustercatalog_controller.go | 79.16% | 4 Missing and 1 partial ⚠️
cmd/manager/main.go | 66.66% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1207      +/-   ##
==========================================
+ Coverage   76.49%   76.53%   +0.04%     
==========================================
  Files          39       40       +1     
  Lines        2361     2404      +43     
==========================================
+ Hits         1806     1840      +34     
- Misses        389      396       +7     
- Partials      166      168       +2     
Flag | Coverage Δ
e2e | 58.11% <72.34%> (+0.21%) ⬆️
unit | 53.03% <42.55%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@m1kola force-pushed the cleanup_cache branch 4 times, most recently from ff58fe9 to f664d7d on September 17, 2024 at 14:32
@m1kola marked this pull request as ready for review on September 17, 2024 at 14:50
@m1kola requested a review from a team as a code owner on September 17, 2024 at 14:50
@openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Sep 17, 2024
cmd/manager/main.go (outdated review thread, resolved)

if err = (&controllers.ClusterCatalogReconciler{
Client: cl,
Finalizers: clusterCatalogFinalizers,
@joelanford (Member) commented:

I worry a little bit about using finalizers here. We'll end up with two different controllers that always race when catalogs are created and destroyed, which I'm guessing will add unnecessary reconcile churn.

WDYT about a best-effort garbage collector that runs the cleanup logic along the lines of:

if err := r.Get(req); errors.IsNotFound(err) {
    if err := cache.Cleanup(req); err != nil {
        metrics.IncrementCatalogCacheCleanupFailure()
        eventrecorder.Event("failed to clear catalog cache for %q", req)
    }
}

Another option is a garbage collector somewhat like catalogd has here: https://github.com/operator-framework/catalogd/blob/main/internal/garbagecollection/garbage_collector.go
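For illustration, a best-effort garbage collector along either of those lines boils down to comparing catalog names against cache directories (a sketch; the function name, its parameters, and the caller-supplied `listCatalogNames` are illustrative, not this PR's code):

```go
import (
	"context"
	"errors"
	"os"
	"path/filepath"
)

// gcCatalogCache is a hypothetical best-effort garbage collector: it removes
// cache directories that no longer correspond to an existing ClusterCatalog.
func gcCatalogCache(ctx context.Context, cacheDir string, listCatalogNames func(context.Context) (map[string]struct{}, error)) error {
	names, err := listCatalogNames(ctx)
	if err != nil {
		return err
	}
	entries, err := os.ReadDir(cacheDir)
	if err != nil {
		return err
	}
	var errs []error
	for _, entry := range entries {
		if _, ok := names[entry.Name()]; ok {
			continue
		}
		// No ClusterCatalog with this name exists any more: drop its cache.
		if err := os.RemoveAll(filepath.Join(cacheDir, entry.Name())); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```

Note that this kind of list-then-scan cleanup is exactly where the time-of-check/time-of-use concern raised below comes in: a catalog created after the API list but before the directory scan could have its freshly written cache removed.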

@m1kola (Member, Author) replied:

> I worry a little bit about using finalizers here. We'll end up with two different controllers that always race when catalogs are created and destroyed, which I'm guessing will add unnecessary reconcile churn.

@joelanford since we only add/remove the finaliser on the object in this controller, the worst thing that can happen here is a conflict. In case of a conflict, one of the controllers will re-queue.

Yes, there will be a bit of overhead even without conflicts: one of the controllers will have to do a no-op reconcile at some point. E.g.:

  • User creates ClusterCatalog
  • operator-controller reconciles it - adds the finaliser
  • catalogd reconciles it - adds finaliser, unpacks, etc
  • operator-controller reconciles it again - we just exit because the finaliser is already in place (we are already at the desired state)

It can be the other way around (catalogd reconciles first, operator-controller second). Similar on deletion.
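(For reference, the add/remove flow described above is the standard controller-runtime finaliser pattern; a sketch, with an illustrative finaliser name and assuming the `controllerutil` helpers, not necessarily this PR's exact code:)

```go
import (
	"context"

	catalogd "github.com/operator-framework/catalogd/api/core/v1alpha1" // assumed import path
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const catalogCacheFinalizer = "olm.operatorframework.io/cleanup-catalog-cache" // illustrative name

func (r *ClusterCatalogReconciler) reconcileFinalizer(ctx context.Context, catalog *catalogd.ClusterCatalog) error {
	if catalog.GetDeletionTimestamp().IsZero() {
		// Not being deleted: ensure our finaliser is present (no-op if it already is).
		if controllerutil.AddFinalizer(catalog, catalogCacheFinalizer) {
			return r.Client.Update(ctx, catalog)
		}
		return nil
	}
	if controllerutil.ContainsFinalizer(catalog, catalogCacheFinalizer) {
		// Being deleted: clean up the on-disk cache, then release the finaliser.
		if err := r.Cache.Remove(catalog.Name); err != nil {
			return err
		}
		controllerutil.RemoveFinalizer(catalog, catalogCacheFinalizer)
		return r.Client.Update(ctx, catalog)
	}
	return nil
}
```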

> WDYT about a best-effort garbage collector that runs the cleanup logic along the lines of

I think that without the finaliser our controller will only be able to reconcile on the edge, not on the level. With that, there is a chance that the controller will not see some events. These cases are probably going to be rare: I imagine there would have to be some issue (e.g. networking) lasting long enough for etcd to clean up the historical versions.

> Another option is a garbage collector somewhat like catalogd has here: https://github.com/operator-framework/catalogd/blob/main/internal/garbagecollection/garbage_collector.go

I think there might be a time-of-check to time-of-use issue, e.g. a new catalog is created after the cleanup routine lists catalogs, but before it lists the filesystem.

We could probably protect it with the mutex we already use to block reads. But that would mean protecting the etcd listing in addition to the filesystem IO, which will increase the blocking time for readers.

These two options look more brittle to me and can lead to issues that are likely to be rare, but hard to debug.

My preference is to have a bit of overhead in reconciliation, but be explicit with finalisers. That said, I'm happy to pivot in a different direction.

Let me know what you want me to do here given the above.

@everettraven (Contributor) commented:

Can we still go with a "controller" approach that just doesn't modify the ClusterCatalog resource?

If the pod crashes, is our cache still populated? If so, we could do an initial list + cleanup on startup and then rely on informers to react to events that trigger our cleanup logic without modification of ClusterCatalog resources. Once the informer is up I am pretty confident it won't miss any events. Any events we miss are likely because the operator-controller manager container crashed and is going to restart and do the initial list+cleanup.

Does this sound like it could be a good middle ground?
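For illustration, that initial list-and-cleanup pass could be registered as a one-shot manager runnable before the informer-driven controller takes over (a sketch; `gcCatalogCache` is the hypothetical helper sketched earlier in this thread, and the function name here is made up):

```go
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// addStartupCacheCleanup registers a one-shot, best-effort cleanup pass with
// the manager. It runs when the manager starts, so any catalogs deleted while
// the process was down get their cache directories removed before the
// informer-driven controller handles subsequent delete events.
func addStartupCacheCleanup(mgr ctrl.Manager, cacheDir string, listCatalogNames func(context.Context) (map[string]struct{}, error)) error {
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		return gcCatalogCache(ctx, cacheDir, listCatalogNames)
	}))
}
```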

@m1kola (Member, Author) commented Sep 19, 2024:

@everettraven I think this is the same as the best-effort option Joe suggested.

Anyway - I updated the controller to not change ClusterCatalog. Please take another look.

> If the pod crashes, is our cache still populated? If so, we could do an initial list + cleanup on startup and then rely on informers to react to events that trigger our cleanup logic without modification of ClusterCatalog resources. Once the informer is up I am pretty confident it won't miss any events. Any events we miss are likely because the operator-controller manager container crashed and is going to restart and do the initial list+cleanup.

On pod crash we will end up with a filesystem that still contains the cache, since we are currently using emptyDir, which persists between container crashes. But we will be rewriting the cache for each catalog anyway, since this map is not populated on restart. So using emptyDir looks pointless here.

But my point was not about crashes: I was talking about situations where the manager is still running but can't connect to the API server. In this case it will still think that the cache is present and valid, because it hasn't seen the delete event.

Perhaps getting rid of emptyDir would help? Edit: ah, we need emptyDir for the unpack cache because it is good to have it persist between pod restarts.

@m1kola force-pushed the cleanup_cache branch 2 times, most recently from f668757 to c59e1f4 on September 19, 2024 at 10:10
Enables us to delete the cache directory
for a given catalog from the filesystem.

Signed-off-by: Mikalai Radchuk <[email protected]>
New finaliser allows us to remove the catalog cache from the filesystem
on catalog deletion.

Signed-off-by: Mikalai Radchuk <[email protected]>
Comment on lines +47 to +54
if client.IgnoreNotFound(err) != nil {
    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}

return ctrl.Result{}, nil
(Member) commented:

Nit: it took me an extra second to parse the IgnoreNotFound + IsNotFound checks. We could simplify to the following.

Suggested change, from:

if client.IgnoreNotFound(err) != nil {
    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
return ctrl.Result{}, nil

to:

if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
return ctrl.Result{}, err

If you aren't a fan of return ctrl.Result{}, err at the end serving the dual purpose of returning nil or non-nil errors, we could also refactor like this:

Suggested change, from:

if client.IgnoreNotFound(err) != nil {
    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
return ctrl.Result{}, nil

to:

if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
if err != nil {
    return ctrl.Result{}, err
}
return ctrl.Result{}, nil

@everettraven (Contributor) commented Sep 19, 2024:

Do we need the Get() call? In our SetupWithManager() method below we filter events to only care about the delete events, so reconcile should only ever be called when a delete event is triggered for a ClusterCatalog resource.

This would make the Reconcile method essentially:

func (r *ClusterCatalogReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        return ctrl.Result{}, r.Cache.Remove(req.Name)
}
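For readers following along, the delete-only event filtering referred to here would look roughly like this in `SetupWithManager` (a sketch assuming controller-runtime predicates and the catalogd v1alpha1 API; not necessarily the exact code in this PR):

```go
import (
	catalogd "github.com/operator-framework/catalogd/api/core/v1alpha1" // assumed import path
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *ClusterCatalogReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&catalogd.ClusterCatalog{}).
		// Only delete events reach Reconcile, so the handler can assume the
		// catalog is gone and simply drop its cache entry.
		WithEventFilter(predicate.Funcs{
			CreateFunc:  func(event.CreateEvent) bool { return false },
			UpdateFunc:  func(event.UpdateEvent) bool { return false },
			DeleteFunc:  func(event.DeleteEvent) bool { return true },
			GenericFunc: func(event.GenericEvent) bool { return false },
		}).
		Complete(r)
}
```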

@joelanford (Member) commented:

Yeah, true. We could make it that simple for now.

I'm guessing we will follow up at some point to do what I mentioned here: #1207 (comment)

But the code is so simple as is that it would be easy to refactor later without accidentally introducing bugs or regressions.

    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
(Member) commented:

Should we create one or more new metrics related to this cache? For example:

  • gauge: number of known ClusterCatalogs
  • gauge: number of local caches

Or maybe just store the diff as a single gauge (that we would expect to always be 0)?
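For illustration, the single-gauge variant could be registered with controller-runtime's metrics registry along these lines (a sketch; the metric name and variable are made up, not something this PR adds):

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// catalogCacheDiff would track the number of known ClusterCatalogs minus the
// number of local cache entries; in a healthy state it should stay at 0.
var catalogCacheDiff = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "catalog_cache_count_diff", // hypothetical metric name
	Help: "Known ClusterCatalogs minus local catalog cache entries (expected to be 0).",
})

func init() {
	// controller-runtime serves this registry on the manager's metrics endpoint.
	metrics.Registry.MustRegister(catalogCacheDiff)
}
```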

(Contributor) commented:

I don't think this is unreasonable, but I would probably consider this outside the scope of this specific PR. I do think we should eventually have a discussion about instrumenting OLMv1 with meaningful metrics.

(Member) commented:

Fair!

Successfully merging this pull request may close these issues.

Bug: cached catalog contents are not cleaned up when catalog is deleted.
5 participants