
Alert Response

elisa lee edited this page Aug 20, 2024 · 24 revisions

SimpleReport is actively monitored by Azure's Application Insights. In the event that abnormal application behavior is detected, an alert will automatically be sent to on-call engineering personnel for resolution.

Are you on-call? Lucky you! Here are some common alerts, and how to respond to them.

Don't forget that the on-call engineer also needs to address any support requests during their shift. Here's more information on how to get set up and address support requests.

Are you getting a ton of alerts and not sure what to do? See our escalation policy.

Don't forget to put up the maintenance banner if needed.

10+ DB queries with durations over 1.25s in the past 5 minutes

What Went Wrong?

A number of known issues can cause slow DB query responses. If no issue is open for the slow DB query, please open a new one.

What Should You Do?

These alerts usually resolve themselves. However, we recommend checking the following:

  • check the logs (follow the second link in the PagerDuty alert, which starts with https://portal.azure.com, to explore the logs) to see if this is continuing
  • check Application Insights (e.g. for prod, prime-simple-report-prod-insights) and click the "Overview" tab to explore the graphs for server response time, etc.
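To judge whether the problem "is continuing", it can help to restate the alert rule from the heading as a small check. The following is a minimal sketch, not the actual Azure alert definition: it assumes you have exported query timestamps and durations from the logs into a list of tuples, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Values taken from the alert title; the names are illustrative only.
SLOW_THRESHOLD_SECONDS = 1.25
MIN_SLOW_QUERIES = 10
WINDOW = timedelta(minutes=5)


def alert_condition_met(queries, now):
    """Return True if 10+ queries slower than 1.25s fall in the last 5 minutes.

    `queries` is an iterable of (timestamp, duration_seconds) tuples,
    assumed to have been exported from the logs by hand.
    """
    window_start = now - WINDOW
    slow = [
        duration
        for timestamp, duration in queries
        if window_start <= timestamp <= now and duration > SLOW_THRESHOLD_SECONDS
    ]
    return len(slow) >= MIN_SLOW_QUERIES
```

If the check is still true several minutes after the page, the alert is probably not resolving on its own.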

GraphQL query validation failures

What Went Wrong?

This most often occurs in a "SimpleReport - Non-Prod" environment as a result of a SimpleReport engineer testing on one of the lower environments.

What Should You Do?

If this is occurring on a lower environment, it does not need the urgency of other production-related pages. However, please make sure to:

  • check the logs (follow the second link in the PagerDuty alert, which starts with https://portal.azure.com, to explore the logs) for anything noteworthy
  • check in with the engineer who is testing on the lower environment to confirm they are in fact testing a change, feature, etc.

Prod alert when an ExperianAuthException is seen

Affected Component: SimpleReport backend LiveExperianService

What Went Wrong? We use Experian to verify users' identity during user signup. Before we submit a request to Experian, we must first fetch an activation token from Experian using our credentials. If we see this alert, it means there was a problem fetching the token and the identity verification steps couldn't be completed for the user. More context on how we use Experian to verify identity can be found here.

What Should You Do?

  • View the alert in the Azure portal. Query the exceptions table for ExperianAuthExceptions in the time period of the alert to get the stack trace, which will include the response from Experian.
  • Possible Experian API responses when fetching a token are documented here.
    • Most exceptions have historically been caused by intermittent 500 responses from Experian, which are not actionable and resolve themselves.
  • Query requests in Azure to see whether this is a one-off (that is, we have since had successful requests to the /identity-verification/get-questions or /identity-verification/submit-answers endpoints) or whether all requests are failing.
  • This alert can be triggered if Experian doesn't recognize our credentials, which has happened in the past when they expired the application password without notifying us. If this appears to be the cause, first verify that we haven't made any changes to our credentials or the LiveExperianService code. If not, the resolution is to contact Experian for help.
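The two queries described above can be sketched as KQL strings to paste into the Application Insights Logs view. This is a hedged sketch: the `exceptions` and `requests` tables and the columns used are standard Application Insights schema, but that the exception class name appears in the `type` column, the 30-minute window, and the Python helper names are all assumptions.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumes alert_time is in UTC


def _time_filter(alert_time, window_minutes):
    """Build a KQL time filter centered on the alert time."""
    start = alert_time - timedelta(minutes=window_minutes)
    end = alert_time + timedelta(minutes=window_minutes)
    return (
        f"| where timestamp between (datetime({start.strftime(TIME_FORMAT)})"
        f" .. datetime({end.strftime(TIME_FORMAT)}))"
    )


def experian_exceptions_query(alert_time, window_minutes=30):
    """KQL for ExperianAuthException entries, including stack trace details."""
    return "\n".join([
        "exceptions",
        _time_filter(alert_time, window_minutes),
        # Assumption: the Java exception class name is recorded in `type`.
        '| where type endswith "ExperianAuthException"',
        "| project timestamp, type, outerMessage, details, operation_Id",
        "| order by timestamp desc",
    ])


def identity_verification_requests_query(alert_time, window_minutes=30):
    """KQL summarizing identity-verification requests by result code."""
    return "\n".join([
        "requests",
        _time_filter(alert_time, window_minutes),
        '| where url contains "/identity-verification/"',
        "| summarize count() by name, resultCode, success",
    ])


if __name__ == "__main__":
    alert_time = datetime(2024, 8, 20, 12, 0, tzinfo=timezone.utc)
    print(experian_exceptions_query(alert_time))
    print(identity_verification_requests_query(alert_time))
```

If the second query shows successful requests after the alert time, the failure was likely a one-off.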

Prod HTTP Server 5xx Errors >= 10

WIP What Went Wrong?

There are several reasons 500 errors can be thrown.

What Should You Do?

It is recommended to check the following:

  • check the logs (follow the second link in the PagerDuty alert, which starts with https://portal.azure.com, to explore the logs) to see if this is continuing
  • check Application Insights (e.g. for prod, prime-simple-report-prod-insights)
    • click the "Overview" tab to explore the graphs for failed requests, etc.
    • click the "Failures" tab to explore the 500 failures; the call stack of each error may be helpful

QueueBatchedReportStreamUploader failed to successfully complete

Affected Component: rs-batch-publisher-prod function app

What Went Wrong? The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. The function was successfully triggered, but failed either to pull messages from the queue or to perform an upload properly.

What Should You Do?

  • Check the function history. You can see at a glance what the most recent set of runs looks like.
  • For the failed run, take note of the Operation Id. You can cross-reference this value in Application Insights to get a better picture of what caused the failure.
  • If this alert is being fired off repeatedly within a short timeframe, reach out to the ReportStream team. We will need to confirm whether the issue is on the SimpleReport side, or whether it originates from ReportStream.

Follow up

  • If this failure caused a message to be added to either of the error queues, fhir-data-publishing-error or test-event-publishing-error, you will need to move these messages from the error queue to the appropriate queue for reprocessing. (For example, messages in test-event-publishing-error will need to be moved to test-event-publishing-queue.) Please refer to how to do this here. [LINK TO BE ADDED]

QueueBatchedReportStreamUploader is not triggering on schedule

Affected Component: rs-batch-publisher-prod function app

What Went Wrong? The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. If this alert fires, chances are high that the code for the function is missing or corrupt.

What Should You Do?

  • Check the function history. You can see at a glance what the most recent set of runs looks like. Runs should take place every two minutes; a gap longer than this confirms that the fired alert is valid.
  • Take a look at the Code + Test pane. Ensure that the files present here match what currently exists in the codebase.
  • If the files that are present differ from the files that should be present, redeploy the functions using the corresponding GitHub Action.
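The gap check in the first step above can be sketched as follows. This is an illustrative helper, assuming you have copied the run start times from the function history into a list of datetimes. The function name and the 30-second tolerance are assumptions, not part of the function app.

```python
from datetime import datetime, timedelta

# The two-minute schedule comes from the runbook; the tolerance is an assumption.
EXPECTED_INTERVAL = timedelta(minutes=2)


def find_schedule_gaps(run_times, tolerance=timedelta(seconds=30)):
    """Return (previous_run, next_run) pairs spaced further apart than expected."""
    ordered = sorted(run_times)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > EXPECTED_INTERVAL + tolerance:
            gaps.append((previous, current))
    return gaps
```

An empty result means the recorded runs were on schedule. Note that it says nothing about runs missing after the last recorded one, so also compare the most recent run time with the current time.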

Prod deploy health alert

Affected Component: We set up a post-prod deploy health check workflow that fires up a Selenium browser to ping a frontend page at /app/health/deploy-smoke-test. That page returns a success / failure status based on the status of /api/actuator/health/backend-and-db-smoke-test to verify the front and backend can talk to each other after a deploy.

What Should You Do?

  • Verify that the health pages load with the UP / success statuses. If they don't, check whether the deploy broke communication between the frontend and backend. The most likely cause is a change to the Terraform config in a recent commit or some sort of manual Azure change.
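The backend half of this verification can also be done without a browser. The sketch below assumes the endpoint returns a Spring Boot Actuator style JSON body such as {"status": "UP"}; the response shape, the function names, and the base URL you pass in are assumptions to confirm against the real environment.

```python
import json
import urllib.request

# Path taken from the runbook; the host must be supplied by the caller.
BACKEND_HEALTH_PATH = "/api/actuator/health/backend-and-db-smoke-test"


def is_up(body):
    """Return True if the response body is JSON with a top-level status of UP."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def check_backend_health(base_url, timeout=10):
    """Fetch the backend smoke-test endpoint and report whether it is UP."""
    url = base_url.rstrip("/") + BACKEND_HEALTH_PATH
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.read().decode("utf-8", errors="replace"))
    except OSError:
        # Covers connection failures and non-2xx responses (URLError, HTTPError).
        return False
```

Calling check_backend_health with the base URL of the affected environment returns False for any network error, non-2xx response, or non-UP status. A False result therefore means "look closer", not a diagnosis.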

CDC Redirect Alert

Affected Component: We have health checks set up in Azure for a number of URLs corresponding to SimpleReport. One of them is simplereport.cdc.gov (which redirects to simplereport.gov). The infrastructure for this endpoint/redirect sits outside of SimpleReport and is not something we can directly affect.

What Should You Do?

  • Most of these alerts are intermittent and will be automatically resolved by PagerDuty/Azure.
  • If the alert does not auto-resolve within a matter of minutes (most resolve within 5-10 minutes), escalate via email to the CDC team that owns this infrastructure. (TODO: get an email for escalation)

Twilio Alert

Impact: Twilio message sends

Issues: Twilio tracks errors created when trying to send messages. Twilio may be experiencing a high error rate related to, but not limited to, sending messages to landlines, unreachable carriers, HTTP errors, unknown handsets, or spam filtering of messages sent by SimpleReport.

Actions to take:

  • Check the Twilio error logs to see what the problems are. It is possible to filter the results and narrow or expand the displayed time frame.
  • Check the Twilio status page for an outage.
  • Check individual errors by clicking into an error, then clicking the RESOURCE SID link to get more information. Navigating to this view will allow us to see the number we tried to send to, the body of the message, and a complete historical record of that message within Twilio.
  • Possible corrective actions:
    • Update user records within SimpleReport.
    • Submit a Twilio support ticket. Twilio suggests that we do this if we have examples of three or more filtered messages that we believe were legitimate sends.
