Slow webhook event workers should not block other workers #771

ezekg · 2023-12-18T15:29:44Z

We had an incident where a customer was sending upwards of 400 requests per second for a couple of hours, resulting in a huge backlog of webhook events. That alone was fine, but their webhook endpoint, which was subscribed to all events, started to fail. All jobs slowed to a crawl while workers attempted to deliver hundreds of thousands of webhook events, each with a 15 second timeout. This caused a severe backup in job queuing, which ultimately led to our Redis instance running out of memory until we were able to vertically scale it and clear the backlog.

Webhook events should not block other jobs from running. Ideally, we'd rate limit webhook events per account so that an account only ever delays its own events, not all events. To prevent the OOM issue, we should look into disabling webhook endpoints whose error rate exceeds a certain threshold, based on time and volume.
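
To make the "error rate above a certain threshold, based on time and volume" idea concrete, here is a minimal sketch of a sliding-window health tracker for a single endpoint. Everything in it is an assumption for illustration: the `EndpointHealth` name, the 5 minute window, the 50 attempt minimum, and the 90% failure cutoff are all hypothetical, and a real version would likely keep these counters in shared storage rather than in process memory.

```python
import time
from collections import deque


class EndpointHealth:
    """Sliding-window failure tracker for one webhook endpoint (hypothetical helper)."""

    def __init__(self, window_seconds=300.0, min_attempts=50,
                 max_error_rate=0.9, clock=time.monotonic):
        self.window_seconds = window_seconds  # "time": how far back we look
        self.min_attempts = min_attempts      # "volume": ignore low-traffic endpoints
        self.max_error_rate = max_error_rate  # failure ratio that triggers disabling
        self._clock = clock                   # injectable so the window can be tested
        self._attempts = deque()              # (timestamp, succeeded) pairs, oldest first

    def _prune(self, now):
        # Drop delivery attempts that have aged out of the window.
        while self._attempts and now - self._attempts[0][0] > self.window_seconds:
            self._attempts.popleft()

    def record(self, succeeded):
        # Call once per delivery attempt, after the request completes or times out.
        now = self._clock()
        self._attempts.append((now, succeeded))
        self._prune(now)

    def should_disable(self):
        # Disable only when there is enough recent volume AND most of it failed.
        self._prune(self._clock())
        total = len(self._attempts)
        if total < self.min_attempts:
            return False
        failures = sum(1 for _, ok in self._attempts if not ok)
        return failures / total >= self.max_error_rate
```

A worker could call `record(...)` after each delivery and check `should_disable()` before enqueuing more events for that endpoint. Requiring a minimum volume keeps a handful of one-off failures on a quiet endpoint from disabling it.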

ezekg · 2023-12-29T18:43:25Z

The main culprit here is that we're storing webhook event payloads in the worker argument list. We should store the payload in the database and pass only a reference in the args, to reduce Redis memory usage.
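
A rough sketch of what that could look like, assuming a `webhook_events` table and a queue that only ever holds the event ID. The table layout and the `enqueue_webhook_event` and `load_webhook_event` helpers are hypothetical, and a plain list and an in-memory SQLite database stand in for the real queue and database.

```python
import json
import sqlite3
import uuid


def create_store():
    # In-memory SQLite stands in for the application database (assumption).
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE webhook_events ("
        "id TEXT PRIMARY KEY, endpoint_url TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return db


def enqueue_webhook_event(db, queue, endpoint_url, payload):
    # Persist the full payload in the database first...
    event_id = str(uuid.uuid4())
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO webhook_events (id, endpoint_url, payload) VALUES (?, ?, ?)",
            (event_id, endpoint_url, json.dumps(payload)),
        )
    # ...then enqueue only the ID, so the job stays small no matter
    # how large the payload is.
    queue.append(json.dumps({"event_id": event_id}))
    return event_id


def load_webhook_event(db, job):
    # The worker resolves the ID back into the endpoint and payload at run time.
    event_id = json.loads(job)["event_id"]
    row = db.execute(
        "SELECT endpoint_url, payload FROM webhook_events WHERE id = ?",
        (event_id,),
    ).fetchone()
    if row is None:
        return None  # event was deleted; the job can be dropped
    endpoint_url, payload = row
    return endpoint_url, json.loads(payload)
```

With this shape, a backlog of hundreds of thousands of jobs costs the queue roughly one short ID per job, and the bulk of the data sits in the database where it is not competing for Redis memory.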
