Slow webhook event workers should not block other workers #771

Open
ezekg opened this issue Dec 18, 2023 · 1 comment


ezekg commented Dec 18, 2023

We had an incident where a customer sent upwards of 400 requests per second for a couple of hours, resulting in a huge backlog of webhook events. That alone would have been fine, but their webhook endpoint, which was subscribed to all events, started to fail. All jobs slowed to a crawl while workers attempted to deliver hundreds of thousands of webhook events, each with a 15 second timeout. This caused a severe backup in job queuing, which ultimately led to our Redis instance running out of memory until we were able to vertically scale it and clear the backlog.

Webhook events should not block other jobs from running. Ideally, we'd rate limit webhook events per account so that an account only ever delays its own events, not everyone's. To prevent the OOM issue, we should look into disabling webhook endpoints that have an error rate above a certain threshold, based on time and volume.
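
As a rough sketch of both ideas, assuming a simplified in-process model rather than the actual job system: a per-account sliding-window limiter that holds back only the offending account's deliveries, and an endpoint health tracker that flags an endpoint for disabling once its error rate over a recent window crosses a threshold with enough volume. All names (`AccountRateLimiter`, `EndpointHealth`) and all limits are hypothetical, and a real version would likely keep this state in a shared store rather than in memory.

```python
import time
from collections import defaultdict, deque


class AccountRateLimiter:
    """Sliding-window limiter keyed by account (hypothetical, in-memory only)."""

    def __init__(self, max_events, window_seconds, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock
        # account_id -> timestamps of recent deliveries
        self._sent = defaultdict(deque)

    def allow(self, account_id):
        now = self.clock()
        sent = self._sent[account_id]
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_events:
            # Over the limit: the caller would requeue this account's
            # event for later. Other accounts are unaffected.
            return False
        sent.append(now)
        return True


class EndpointHealth:
    """Tracks recent delivery outcomes per endpoint (hypothetical thresholds)."""

    def __init__(self, max_error_rate, min_attempts, window_seconds,
                 clock=time.monotonic):
        self.max_error_rate = max_error_rate
        self.min_attempts = min_attempts
        self.window_seconds = window_seconds
        self.clock = clock
        # endpoint_id -> (timestamp, ok) pairs for recent attempts
        self._outcomes = defaultdict(deque)

    def record(self, endpoint_id, ok):
        self._outcomes[endpoint_id].append((self.clock(), ok))

    def should_disable(self, endpoint_id):
        now = self.clock()
        outcomes = self._outcomes[endpoint_id]
        # Only consider attempts inside the time window.
        while outcomes and now - outcomes[0][0] >= self.window_seconds:
            outcomes.popleft()
        # Require enough volume before judging the endpoint.
        if len(outcomes) < self.min_attempts:
            return False
        failures = sum(1 for _, ok in outcomes if not ok)
        return failures / len(outcomes) > self.max_error_rate
```

A worker would call `allow(account_id)` before attempting delivery and `record(endpoint_id, ok)` afterwards. The `min_attempts` floor covers the "volume" half of the threshold, so a low-traffic endpoint is not disabled after a single failure.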


ezekg commented Dec 29, 2023

The main culprit here is the fact that we're storing webhook event payloads in the worker argument list. We should store the payload in the database instead of the args to reduce Redis memory usage.
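
A minimal sketch of the idea, using SQLite and a plain list as stand-ins for the database and the Redis-backed queue. The table name, job name, and function names are hypothetical and not taken from the codebase.

```python
import json
import sqlite3


def create_schema(db):
    db.execute(
        "CREATE TABLE IF NOT EXISTS webhook_events ("
        "id INTEGER PRIMARY KEY, "
        "endpoint_url TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )


def enqueue_webhook_event(db, queue, endpoint_url, payload):
    """Persist the payload, then enqueue a job carrying only the row id."""
    cursor = db.execute(
        "INSERT INTO webhook_events (endpoint_url, payload) VALUES (?, ?)",
        (endpoint_url, json.dumps(payload)),
    )
    db.commit()
    event_id = cursor.lastrowid
    # The job args stay small and fixed-size regardless of payload size.
    queue.append({"job": "deliver_webhook", "args": [event_id]})
    return event_id


def load_webhook_event(db, event_id):
    """What a worker would do first: fetch the endpoint and payload by id."""
    row = db.execute(
        "SELECT endpoint_url, payload FROM webhook_events WHERE id = ?",
        (event_id,),
    ).fetchone()
    if row is None:
        return None  # the event no longer exists, so the job can be dropped
    endpoint_url, payload = row
    return endpoint_url, json.loads(payload)
```

With this shape, a backlog of hundreds of thousands of queued jobs costs the queue only an id per job, and the payloads sit in the database where they can be pruned independently.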
