We had an incident where a customer sent upwards of 400 requests per second for a couple of hours, resulting in a huge backlog of webhook events. That alone would have been fine, but their webhook endpoint, which was subscribed to all events, started failing. All jobs slowed to a crawl while we attempted to deliver hundreds of thousands of webhook events, each with a 15 second timeout. This caused a severe backup in job queuing, which ultimately led to our Redis instance running out of memory until we were able to vertically scale it and clear the backlog.
Webhook events should not block other jobs from running. Ideally, we would rate limit webhook events per account so that an account only ever delays its own events, not everyone's. To prevent the OOM issue, we should look into disabling webhook endpoints whose error rate exceeds a certain threshold, based on time and volume.
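To make the two proposals concrete, here is a rough sketch of what they could look like. This is not our implementation: `TokenBucket`, `EndpointHealth`, and every threshold below are hypothetical names and values chosen for illustration, and the state is kept in process memory, whereas a real version would need shared storage across workers.

```python
import time
from collections import deque


class TokenBucket:
    """Hypothetical per-account limiter: `rate` deliveries per second,
    with bursts of up to `capacity` deliveries."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.now = now
        self.updated = now()

    def try_acquire(self):
        # Refill in proportion to the time elapsed since the last call.
        current = self.now()
        elapsed = current - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Caller should requeue this delivery for later instead of
        # holding a worker, so other accounts' jobs keep moving.
        return False


class EndpointHealth:
    """Hypothetical tracker that disables an endpoint once its error rate
    over a sliding time window crosses a threshold."""

    def __init__(self, window_seconds=300, min_attempts=50,
                 max_error_rate=0.9, now=time.monotonic):
        self.window_seconds = window_seconds  # the "time" part
        self.min_attempts = min_attempts      # the "volume" part
        self.max_error_rate = max_error_rate
        self.now = now
        self.attempts = deque()  # (timestamp, succeeded) pairs
        self.disabled = False

    def record(self, succeeded):
        """Record one delivery attempt; return True if the endpoint is disabled."""
        current = self.now()
        self.attempts.append((current, succeeded))
        # Drop attempts that have fallen out of the window.
        while self.attempts and self.attempts[0][0] < current - self.window_seconds:
            self.attempts.popleft()
        total = len(self.attempts)
        failures = sum(1 for _, ok in self.attempts if not ok)
        # Require a minimum volume so a few early failures do not trip it.
        if total >= self.min_attempts and failures / total >= self.max_error_rate:
            self.disabled = True
        return self.disabled
```

A worker would keep one `TokenBucket` per account and one `EndpointHealth` per endpoint, check `try_acquire()` before each delivery, and skip enqueueing new deliveries for any endpoint where `disabled` is set. Re-enabling a disabled endpoint is left out of this sketch.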