2019-04-18 06:04 — By Erik van Eykelen

On Monitoring & Alerting

Definitions used in this article:

Logging: the process of writing information to a stream of log entries, each line containing diagnostic data including a timestamp.
Monitoring: the process of parsing the log contents, either by looking at individual lines or by analyzing a set of lines.
Alerting: the process of sending notifications (e.g. emails, SMS texts, or Slack messages) based on triggers generated by one or more monitoring processes.

This article assumes proper logging is in place. Remember that you seldom log too much: storage is cheap, so log generously as long as it’s done in a structured way.

Monitoring

This chapter is divided into three sections:

What to monitor?
Monitor what should (not) happen every time you check.
Monitor what you expect (not) to happen during a certain timeframe.

What to monitor?

The often-cited Murphy’s law applies here: Anything that can go wrong will go wrong. Any network, any router, any server, down to every networking cable, will (given enough time) fail. Luckily hardware has become exceptionally reliable over the past 20 years. Today most things go wrong due to human error.

To set up proper monitoring for an environment such as Umbrella you need to look at many different “parts”. The sheer number of parts makes this task difficult.

Let’s start with some examples:

A site goes down when the app server throws 4xx or 5xx errors.
A site becomes non-functional when its CSS/JS assets are inaccessible.
A site (i.e. “web app”) goes down when the domain name expires.
A site goes down when its SSL certificate expires (or is renewed in the wrong way).
A site becomes inaccessible when its DNS goes down or is misconfigured.
An interface to a 3rd party starts to fail when its servers are inaccessible.
A site goes down when its underlying database connection fails.
A site goes down if it can’t authenticate and authorize its users. This might occur even when the rest of the site continues to work properly.
A site remains functional albeit with potentially huge problems if the underlying data becomes corrupt. Example: batch deletes/adds/updates which should not have occurred.

The following sections revisit these examples to discuss different types of monitoring.

Monitor what should (not) happen every time you check

Looking at the examples, let’s discuss how you can monitor for conditions that should (not) happen every time you check:

Monitoring for 4xx/5xx responses is trivial. Tools such as Pingdom or WebMon can be used to set up tests which run every 10 minutes. These tests check a number of URLs, each test checking for a positive signal such as an expected string, or checking for a negative signal such as a 4xx or 5xx error.
Whether assets such as JS or CSS files are loadable can also be tested using Pingdom or similar tools.
Raising alerts about domain names and SSL certificates which are about to expire should happen well before the final date because extending the domain or certificate might involve manual work by our customer or by us. A tool like MXToolBox can be used to perform these checks.
Checking whether a 3rd party API or FTP server is accessible can be performed using e.g. Pingdom in case the server is publicly accessible. However, in many cases it’s necessary to add specific, public diagnostic endpoints to your app. See below for additional notes.
Detecting if the underlying database is inaccessible also requires adding a publicly accessible endpoint.
Checking if the authentication and authorization process works can be carried out with a tool such as Ghost Inspector, which performs periodic sign-in attempts using different accounts. A successful attempt lands on a post sign-in page. This page should be structured differently based on the user’s role or user rights (e.g. an admin user has more menu items than a regular user), so the test can verify both authentication and authorization.
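The first and third checks above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a hosted tool: the function names are hypothetical, and the caller is assumed to have already fetched the HTTP status, the response body, and the certificate’s notAfter string (for example from the dict returned by ssl.SSLSocket.getpeercert()).

```python
import ssl
from datetime import datetime, timezone


def evaluate_http_check(status_code, body, expected_text):
    """Return (ok, reason) for one monitored URL (hypothetical helper)."""
    # Negative signal: any 4xx or 5xx status means the check failed.
    if 400 <= status_code <= 599:
        return False, f"HTTP {status_code}"
    # Positive signal: the page must contain a string we expect to see.
    if expected_text not in body:
        return False, f"expected text {expected_text!r} not found"
    return True, "ok"


def days_until_cert_expiry(not_after, now):
    """Whole days left before a certificate expires.

    `not_after` uses the certificate date format,
    e.g. "Jun  1 12:00:00 2019 GMT"; `now` is a timezone-aware datetime.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days
```

A monitoring script could raise an alert when days_until_cert_expiry drops below a chosen threshold (say, 30 days), which leaves time for any manual renewal work.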

In some cases you might have to add an extra endpoint to your app to check the accessibility of a 3rd party endpoint that is not reachable via the public Internet. Since we can’t exert any control over 3rd party endpoints (different company, or not even up for debate according to our customer), we must channel tests through the only API we do control: the API of our own app.

Another reason why you might have to add an extra endpoint is to keep the API credentials away from the monitoring process. This is an important reason, sometimes even the primary reason for adding an extra endpoint.

A similar case is testing whether our app can actually reach its underlying database: our app needs an additional endpoint which provides diagnostic info to the caller about the status of the database connection.
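As a rough sketch of what such a diagnostic endpoint might compute, the Python below builds the response body only; wiring it into a web framework is left out. The names diagnostic_response and probe_database are hypothetical, the error code 1 is an arbitrary placeholder, and the trivial SELECT 1 query is an assumption that a real check could replace with a read from an actual table. Credentials stay inside the app because the monitoring tool only ever sees the JSON result.

```python
import sqlite3


def diagnostic_response(probe, errorcode):
    """Run `probe` and translate the outcome into a small status dict."""
    try:
        probe()
    except Exception:
        # Report a generic code only; error details belong in the logs,
        # not in a publicly reachable response.
        return {"status": "fail", "errorcode": errorcode}
    return {"status": "ok"}


def probe_database(connect):
    """Open a connection via `connect` and perform a trivial read.

    `connect` is any zero-argument callable returning a DB-API connection.
    """
    conn = connect()
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        cursor.fetchone()
    finally:
        conn.close()


def database_status(connect):
    """Body for a database diagnostic endpoint."""
    return diagnostic_response(lambda: probe_database(connect), errorcode=1)
```

The same diagnostic_response wrapper could be reused for a 3rd party check by passing a probe that makes a harmless request with credentials held by the app.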

Monitor what you expect (not) to happen during a certain timeframe

Checking for events that should happen (or not happen) every time is trivial in most cases. Much more difficult is detecting unexpected events happening during a longer period of time.

Examples of what you might be missing today:

Every 24 hours a certain synchronization process results in, say, 20 add, delete, and update actions on average, every working day of the week. Suddenly this spikes to 1000 add actions and nobody notices because the adds were successful. In this example it leads to database corruption because all 1000 employees are now listed twice in a user database.
Every time you run a synchronization the size of the response returned by the 3rd party service is about 100 kB of JSON. This JSON is parsed to extract the adds/deletes/updates. Suddenly the payload is 1 GB of JSON and the synchronization becomes very slow and memory hungry. Nobody notices until additional processes are added on the same server and it suddenly runs out of memory.
Every day a customer sends between 20 and 100 emails or SMS messages. Due to a bug 50 messages are sent twice that day. Nobody on our side notices because the volume is not unusual. What should have been monitored is whether any recipient received more than one message in rapid succession (e.g. less than 2 seconds between transmissions).
Every month a batch of SEPA direct debits (SDDs) is sent to the bank. Due to an error the same batch is sent on Feb 28th and March 1st. We forgot to add a safeguard preventing SDD batches from being transmitted with an interval less than 28 days. Our customer’s customers are charged twice, leading to angry phone calls.
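The examples above could be caught with fairly simple checks over a time window. The Python below is a minimal sketch under stated assumptions: the function names are hypothetical, the spike factor of 5 is an arbitrary threshold, and the inputs (historical counts, message timestamps, batch dates) are assumed to have been extracted from structured logs already.

```python
from datetime import date, datetime, timedelta
from statistics import mean


def is_spike(history, today, factor=5.0):
    """True when today's value exceeds `factor` times the historical mean.

    Works for action counts per sync run as well as payload sizes.
    """
    return bool(history) and today > factor * mean(history)


def rapid_repeats(sent, window=timedelta(seconds=2)):
    """Recipients who received two messages less than `window` apart.

    `sent` is an iterable of (recipient, timestamp) pairs.
    """
    last_seen = {}
    flagged = set()
    for recipient, ts in sorted(sent, key=lambda item: item[1]):
        previous = last_seen.get(recipient)
        if previous is not None and ts - previous < window:
            flagged.add(recipient)
        last_seen[recipient] = ts
    return flagged


def batch_too_soon(previous_batch, new_batch, min_days=28):
    """True when a new SDD batch follows the previous one too closely."""
    return (new_batch - previous_batch).days < min_days
```

A comparison against the mean is crude. It illustrates the idea that the monitor needs a notion of "normal" for the period, not only a pass/fail result per run.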

Alerting

Once logging and monitoring are in place, alerting becomes a simple task. Using tools like PagerDuty it’s easy to set up HTTP calls from our monitoring scripts (or a monitoring platform) to an alerting platform, which in turn sends SMSes or emails, makes robo-calls, or sends notifications to Teams or Slack. PagerDuty and similar tools also let you manage a rotating list of on-call engineers and define escalation paths.
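To give an idea of what such an HTTP call might look like, here is a sketch in Python that assumes the PagerDuty Events API v2 request format (a JSON body with routing_key, event_action, and a payload containing summary, source, and severity). The helper names are hypothetical and the routing key is a placeholder. Verify the field names against the current PagerDuty documentation before relying on them.

```python
import json
import urllib.request

# Assumed endpoint of the PagerDuty Events API v2.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the JSON-serializable body for a 'trigger' event."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event, url=PAGERDUTY_EVENTS_URL, timeout=10):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A monitoring script would call build_trigger_event when a check fails and pass the result to send_event. Keeping the two steps separate makes the payload easy to test without network access.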

Implementation

Implementing monitoring and alerting does not have to take a tremendous amount of time or effort, or be very costly. In fact, cost savings are possible when issues are detected before the customer notices them, because this might prevent the support department from getting involved.

A few practical steps that can be taken today:

Implement uptime monitoring & alerting using Pingdom, WebMon, Uptime, or Uptrends. Measure the home page and some deeper pages. Don’t forget to add checks for CSS/JS assets by testing specific URLs pointing to assets hosted on the app domain or CDN.
Implement domain/SSL expiry checks using MXToolBox.
Implement 3rd party accessibility testing by adding a custom endpoint to Embrace/Umbrella which simply returns something like { "status": "ok" } or { "status": "fail", "errorcode": 123 }. Internally such an endpoint attempts to connect to a 3rd party API by making a harmless GET request which “proves” the API is accessible.
Implement database accessibility testing, also by adding a custom endpoint. Internally this endpoint attempts to read a record from one of the tables, which “proves” the database is accessible.
Implement basic functionality checks using Ghost Inspector or Katalon. Test the sign-in flow, password reset flow, and basic features such as the dashboard and account pages.
Implement PagerDuty to set up alerts to on-call engineers.

Once these checks are in place it’s time to discuss setting up the checks that deal with thresholds, outliers, sudden changes, and historical anomalies. See https://docs.datadoghq.com/monitors/monitor_types/ and https://docs.signalfx.com/en/latest/detect-alert/index.html to prepare yourself for the topics discussed during the next phase.

