🐛 Remove cache when catalog is deleted #1207

Open · m1kola wants to merge 2 commits into main from cleanup_cache

Conversation

@m1kola (Member) commented Sep 3, 2024

Description

We add a finaliser to ClusterCatalog, which enables us to remove the catalog cache from the filesystem when a catalog is deleted.

Fixes #948

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Sep 3, 2024
@openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Sep 3, 2024

netlify bot commented Sep 3, 2024

Deploy Preview for olmv1 ready!

🔨 Latest commit: b6ae265
🔍 Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/66ec170df89a010008b5a112
😎 Deploy Preview: https://deploy-preview-1207--olmv1.netlify.app

@openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Sep 3, 2024
@joelanford (Member) commented:

I like the direction of this. Maybe not something to do in this PR, but once we have a separate catalog controller, I think it would make sense to have that controller be the thing that also populates operator-controller's cache.

Right now, the operator-controller catalog cache is populated on-demand when a ClusterExtension is reconciled. The outcome is that some ClusterExtension reconciles take quite a bit longer because they have to pause to pull and extract catalog content from catalogd.

If we moved pull/extract logic to a separate ClusterCatalog controller, then operator-controller could do this work async. At that point the only ClusterExtensions that would still have to wait are the ones that are reconciled just after a ClusterCatalog has new data available.
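For illustration, the asynchronous population described here might look roughly like the following reconciler skeleton (a sketch only: `CatalogCachePopulator`, the `Cache.Populate` method, and the catalogd import path are assumptions for illustration, not code from this PR):

```go
package controllers

import (
	"context"

	catalogd "github.com/operator-framework/catalogd/api/core/v1alpha1" // assumed import path
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// CatalogCachePopulator is a hypothetical reconciler that pulls and extracts
// catalog content whenever a ClusterCatalog changes, instead of doing that
// work lazily during ClusterExtension reconciliation.
type CatalogCachePopulator struct {
	client.Client
	Cache interface {
		// Populate is a hypothetical method that pulls and extracts the
		// catalog content into the local filesystem cache.
		Populate(ctx context.Context, catalog *catalogd.ClusterCatalog) error
	}
}

func (r *CatalogCachePopulator) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var catalog catalogd.ClusterCatalog
	if err := r.Get(ctx, req.NamespacedName, &catalog); err != nil {
		// Deleted catalogs are handled by the cleanup path; nothing to populate.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Pull and extract catalog content ahead of time so ClusterExtension
	// reconciles rarely need to wait for it.
	return ctrl.Result{}, r.Cache.Populate(ctx, &catalog)
}
```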

@m1kola m1kola changed the title ✨ Remove cache when catalog is deleted 🐛 Remove cache when catalog is deleted Sep 13, 2024
@m1kola force-pushed the cleanup_cache branch 2 times, most recently from 0ed5135 to b607a08 on September 16, 2024 at 15:37

codecov bot commented Sep 16, 2024

Codecov Report

Attention: Patch coverage is 82.97872% with 8 lines in your changes missing coverage. Please review.

Project coverage is 76.53%. Comparing base (df0e848) to head (b6ae265).

Files with missing lines | Patch % | Lines
internal/controllers/clustercatalog_controller.go | 79.16% | 4 Missing and 1 partial ⚠️
cmd/manager/main.go | 66.66% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1207      +/-   ##
==========================================
+ Coverage   76.49%   76.53%   +0.04%     
==========================================
  Files          39       40       +1     
  Lines        2361     2404      +43     
==========================================
+ Hits         1806     1840      +34     
- Misses        389      396       +7     
- Partials      166      168       +2     
Flag | Coverage Δ
e2e | 58.11% <72.34%> (+0.21%) ⬆️
unit | 53.03% <42.55%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@m1kola force-pushed the cleanup_cache branch 4 times, most recently from ff58fe9 to f664d7d on September 17, 2024 at 14:32
@m1kola marked this pull request as ready for review on September 17, 2024 at 14:50
@m1kola requested a review from a team as a code owner on September 17, 2024 at 14:50
@openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Sep 17, 2024
cmd/manager/main.go (outdated review thread, resolved)

if err = (&controllers.ClusterCatalogReconciler{
Client: cl,
Finalizers: clusterCatalogFinalizers,
@joelanford (Member) commented:

I worry a little bit about using finalizers here. We'll end up with two different controllers that always race when catalogs are created and destroyed, which I'm guessing will add unnecessary reconcile churn.

WDYT about a best-effort garbage collector that runs the cleanup logic along the lines of:

if err := r.Get(req); errors.IsNotFound(err) {
    if err := cache.Cleanup(req); err != nil {
        metrics.IncrementCatalogCacheCleanupFailure()
        eventrecorder.Event("failed to clear catalog cache for %q", req)
    }
}

Another option is a garbage collector somewhat like catalogd has here: https://github.com/operator-framework/catalogd/blob/main/internal/garbagecollection/garbage_collector.go
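For illustration, a best-effort garbage collector along either of those lines boils down to comparing catalog names against cache directories (a sketch; the function name, its parameters, and the caller-supplied `listCatalogNames` are illustrative, not this PR's code):

```go
import (
	"context"
	"errors"
	"os"
	"path/filepath"
)

// gcCatalogCache is a hypothetical best-effort garbage collector: it removes
// cache directories that no longer correspond to an existing ClusterCatalog.
func gcCatalogCache(ctx context.Context, cacheDir string, listCatalogNames func(context.Context) (map[string]struct{}, error)) error {
	names, err := listCatalogNames(ctx)
	if err != nil {
		return err
	}
	entries, err := os.ReadDir(cacheDir)
	if err != nil {
		return err
	}
	var errs []error
	for _, entry := range entries {
		if _, ok := names[entry.Name()]; ok {
			continue
		}
		// No ClusterCatalog with this name exists any more: drop its cache.
		if err := os.RemoveAll(filepath.Join(cacheDir, entry.Name())); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```

Note that this kind of list-then-scan cleanup is exactly where the time-of-check/time-of-use concern raised below comes in: a catalog created after the API list but before the directory scan could have its freshly written cache removed.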

@m1kola (Member, Author) replied:

> I worry a little bit about using finalizers here. We'll end up with two different controllers that always race when catalogs are created and destroyed, which I'm guessing will add unnecessary reconcile churn.

@joelanford since we only add/remove the finaliser on the object in this controller, the worst thing that can happen here is a conflict. In case of a conflict, one of the controllers will re-queue.

Yes, there will be a bit of overhead even without conflicts: one of the controllers will have to do a no-op reconcile at some point. E.g.:

  • User creates ClusterCatalog
  • operator-controller reconciles it - adds the finaliser
  • catalogd reconciles it - adds finaliser, unpacks, etc
  • operator-controller reconciles it again - we just exit because the finaliser is already in place (we are already at the desired state)

It can be the other way around (catalogd reconciles first, operator-controller second). Similar on deletion.
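(For reference, the add/remove flow described above is the standard controller-runtime finaliser pattern; a sketch, with an illustrative finaliser name and assuming the `controllerutil` helpers, not necessarily this PR's exact code:)

```go
import (
	"context"

	catalogd "github.com/operator-framework/catalogd/api/core/v1alpha1" // assumed import path
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const catalogCacheFinalizer = "olm.operatorframework.io/cleanup-catalog-cache" // illustrative name

func (r *ClusterCatalogReconciler) reconcileFinalizer(ctx context.Context, catalog *catalogd.ClusterCatalog) error {
	if catalog.GetDeletionTimestamp().IsZero() {
		// Not being deleted: ensure our finaliser is present (no-op if it already is).
		if controllerutil.AddFinalizer(catalog, catalogCacheFinalizer) {
			return r.Client.Update(ctx, catalog)
		}
		return nil
	}
	if controllerutil.ContainsFinalizer(catalog, catalogCacheFinalizer) {
		// Being deleted: clean up the on-disk cache, then release the finaliser.
		if err := r.Cache.Remove(catalog.Name); err != nil {
			return err
		}
		controllerutil.RemoveFinalizer(catalog, catalogCacheFinalizer)
		return r.Client.Update(ctx, catalog)
	}
	return nil
}
```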

> WDYT about a best-effort garbage collector that runs the cleanup logic along the lines of

I think that without the finaliser our controller will only be able to reconcile on the edge, not on the level. With that, there is a chance that the controller will not see some events. These cases are probably going to be rare: I imagine there would have to be some issue (e.g. networking) lasting long enough for etcd to clean up the historical versions.

> Another option is a garbage collector somewhat like catalogd has here: https://github.com/operator-framework/catalogd/blob/main/internal/garbagecollection/garbage_collector.go

I think there might be a time-of-check to time-of-use issue, e.g. a new catalog is created after the cleanup routine lists catalogs, but before it lists the filesystem.

We could probably protect it with the mutex we already use to block reads. But that would mean protecting the etcd listing in addition to the filesystem IO, which will increase the blocking time for readers.

These two options look more brittle to me and can lead to issues that are likely to be rare, but hard to debug.

My preference is to have a bit of overhead in reconciliation, but be explicit with finalisers. That said, I'm happy to pivot in a different direction.

Let me know what you want me to do here given the above.

@everettraven (Contributor) commented:

Can we still go with a "controller" approach that just doesn't modify the ClusterCatalog resource?

If the pod crashes, is our cache still populated? If so, we could do an initial list + cleanup on startup and then rely on informers to react to events that trigger our cleanup logic without modification of ClusterCatalog resources. Once the informer is up I am pretty confident it won't miss any events. Any events we miss are likely because the operator-controller manager container crashed and is going to restart and do the initial list+cleanup.

Does this sound like it could be a good middle ground?
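For illustration, that initial list-and-cleanup pass could be registered as a one-shot manager runnable before the informer-driven controller takes over (a sketch; `gcCatalogCache` is the hypothetical helper sketched earlier in this thread, and the function name here is made up):

```go
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// addStartupCacheCleanup registers a one-shot, best-effort cleanup pass with
// the manager. It runs when the manager starts, so any catalogs deleted while
// the process was down get their cache directories removed before the
// informer-driven controller handles subsequent delete events.
func addStartupCacheCleanup(mgr ctrl.Manager, cacheDir string, listCatalogNames func(context.Context) (map[string]struct{}, error)) error {
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		return gcCatalogCache(ctx, cacheDir, listCatalogNames)
	}))
}
```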

@m1kola (Member, Author) commented Sep 19, 2024:

@everettraven I think this is the same as the best-effort option Joe suggested.

Anyway - I updated the controller to not change ClusterCatalog. Please take another look.

> If the pod crashes, is our cache still populated? If so, we could do an initial list + cleanup on startup and then rely on informers to react to events that trigger our cleanup logic without modification of ClusterCatalog resources. Once the informer is up I am pretty confident it won't miss any events. Any events we miss are likely because the operator-controller manager container crashed and is going to restart and do the initial list+cleanup.

On pod crash we will end up with a filesystem that still contains the cache, since we are currently using emptyDir, which persists between container crashes. But we will be rewriting the cache for each catalog anyway, since this map is not populated on restart. So using emptyDir looks pointless here.

But my point was not about crashes: I was talking about situations where the manager is still running but can't connect to the API server. In this case it will still think that the cache is present and valid, because it hasn't seen the delete event.

Perhaps getting rid of emptyDir would help? Edit: ah, we need emptyDir for the unpack cache because it is good to have it persist between pod restarts.

@m1kola force-pushed the cleanup_cache branch 2 times, most recently from f668757 to c59e1f4 on September 19, 2024 at 10:10
Enables us to delete the cache directory
for a given catalog from the filesystem.

Signed-off-by: Mikalai Radchuk <[email protected]>
New finaliser allows us to remove the catalog cache from the filesystem
on catalog deletion.

Signed-off-by: Mikalai Radchuk <[email protected]>
Comment on lines +47 to +54
if client.IgnoreNotFound(err) != nil {
    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}

return ctrl.Result{}, nil
(Member) commented:

Nit: it took me an extra second to parse the IgnoreNotFound + IsNotFound checks. We could simplify to the following.

Suggested change, from:

if client.IgnoreNotFound(err) != nil {
    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
return ctrl.Result{}, nil

to:

if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
return ctrl.Result{}, err

If you aren't a fan of return ctrl.Result{}, err at the end serving the dual purpose of returning nil or non-nil errors, we could also refactor like this:

Suggested change, from:

if client.IgnoreNotFound(err) != nil {
    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
return ctrl.Result{}, nil

to:

if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
}
if err != nil {
    return ctrl.Result{}, err
}
return ctrl.Result{}, nil

@everettraven (Contributor) commented Sep 19, 2024:

Do we need the Get() call? In our SetupWithManager() method below we filter events to only care about the delete events, so reconcile should only ever be called when a delete event is triggered for a ClusterCatalog resource.

This would make the Reconcile method essentially:

func (r *ClusterCatalogReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        return ctrl.Result{}, r.Cache.Remove(req.Name)
}
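For readers following along, the delete-only event filtering referred to here would look roughly like this in `SetupWithManager` (a sketch assuming controller-runtime predicates and the catalogd v1alpha1 API; not necessarily the exact code in this PR):

```go
import (
	catalogd "github.com/operator-framework/catalogd/api/core/v1alpha1" // assumed import path
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *ClusterCatalogReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&catalogd.ClusterCatalog{}).
		// Only delete events reach Reconcile, so the handler can assume the
		// catalog is gone and simply drop its cache entry.
		WithEventFilter(predicate.Funcs{
			CreateFunc:  func(event.CreateEvent) bool { return false },
			UpdateFunc:  func(event.UpdateEvent) bool { return false },
			DeleteFunc:  func(event.DeleteEvent) bool { return true },
			GenericFunc: func(event.GenericEvent) bool { return false },
		}).
		Complete(r)
}
```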

@joelanford (Member) commented:

Yeah, true. We could make it that simple for now.

I'm guessing we will follow up at some point to do what I mentioned here: #1207 (comment)

But the code is so simple as is that it would be easy to refactor later without accidentally introducing bugs or regressions.

    return ctrl.Result{}, err
}
if apierrors.IsNotFound(err) {
    return ctrl.Result{}, r.Cache.Remove(req.Name)
(Member) commented:

Should we create one or more new metrics related to this cache? For example:

  • gauge: number of known ClusterCatalogs
  • gauge: number of local caches

Or maybe just store the diff as a single gauge (that we would expect to always be 0)?
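For illustration, the single-gauge variant could be registered with controller-runtime's metrics registry along these lines (a sketch; the metric name and variable are made up, not something this PR adds):

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// catalogCacheDiff would track the number of known ClusterCatalogs minus the
// number of local cache entries; in a healthy state it should stay at 0.
var catalogCacheDiff = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "catalog_cache_count_diff", // hypothetical metric name
	Help: "Known ClusterCatalogs minus local catalog cache entries (expected to be 0).",
})

func init() {
	// controller-runtime serves this registry on the manager's metrics endpoint.
	metrics.Registry.MustRegister(catalogCacheDiff)
}
```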

(Contributor) commented:

I don't think this is unreasonable, but I would probably consider this outside the scope of this specific PR. I do think we should eventually have a discussion about instrumenting OLMv1 with meaningful metrics.

(Member) commented:

Fair!

Successfully merging this pull request may close these issues.

Bug: cached catalog contents are not cleaned up when catalog is deleted.
5 participants