What the health!? Implementing health probes for highly available, self-healing, global services

Health probes are essential to running a highly available service, yet they are surprisingly tricky to implement without inadvertently making your uptime worse. Even popular frameworks like Spring Boot (until recently [1]) used unfortunate defaults that can lead a hurried developer into surprising traps that reduce their services' uptime.

Over time, my team has come up with a set of rules to avoid these traps. I'll lay them out first succinctly, and then explain them in detail along with the model we use that embodies them.

  1. Don't overload health endpoints – favor more over reuse
  2. Keep logic out of monitors – put in endpoints
  3. Be very conservative when determining if unhealthy
  4. Run health checks in background – not on demand

If you're in a hurry, you can skip straight to the summary.

A narrow, useful model of health

In any nontrivial service, health is really a multi-dimensional spectrum. A service degrades in immeasurable ways, and is probably degraded in several of them right now as you read this! This is why service level objectives (SLOs) are so useful. They distill the chaos into relatively few user experience objectives, and user experience is ultimately what matters.

However, many times we still need to make a point-in-time, yes-or-no decision that can't be a complex aggregation of metrics like SLOs typically require [2]. In these cases, like load balancer health checks or Kubernetes probes, we can focus instead on answering more specific questions about health. That is, we can use a useful model of health instead of a realistic one.

This model consists of the following 5 kinds of health resources (or "queries", or "remote procedures", or "endpoints") your service can provide for use by clients like Kubernetes, monitoring software, load balancers, yourself, peer services, etc.

Readiness
Is this server warmed up, fully loaded, and "ready" to serve traffic?
Liveness
Does this server need to be killed and recreated?
Health
Is this server able to serve ANY significant traffic successfully?
Diagnostics
What's the full state of the server's ability to serve requests and its dependencies' health?
Smoketests
How do some significant use cases work from a particular call site?

To answer most of these questions, we take a white-box approach, using explicitly defined checks which either pass or fail. This means we have to think about how our process works, and some of the known ways that it can degrade. We will need to note external dependencies in particular, both because we rely on them so heavily (e.g. your database), and because using them relies on a network, which is orders of magnitude less reliable than a syscall or interprocess call on the same machine.

Health checks

The checks that compose the resources can be differentiated in three dimensions:

  1. The criticality of the dependency it is checking
  2. The depth of health we will check
  3. The timing of the check

Criticality

The impact of a fault in a dependency depends on how critical the dependency is to our service's value. If the dependency is critical, then without it we know to take some kind of action (like serve an error page, or route to another data center). For this reason we break criticality down into two flavors:

Hard dependencies
Think: your database. Dependencies which are in at least some way essential for the entirety of your service's function. Without just one of these dependencies, your service is worthless. It might as well be completely down, and taking it down is likely preferable to even trying to do any work without one of these dependencies functioning in the ways you need it. If it's required for some requests but not others, and you care about those "others", then it's not a hard dependency.
Soft dependencies
Dependencies which are non-essential. They may still be important, but you can at least do something–provide some value–without these. A full outage would still be worse than losing one of these.

Depth

The depth of a check is how coupled the check is to the useful functionality of the dependency. There are two broad classes of health depths we may check:

Connectivity (low depth)
Do not check very far; only that your process can at least communicate (including TLS handshake and authentication) with this dependency. It is a property of the network, your application configuration, the dependency's configuration, and its basic availability. Example: you can establish a connection to your database, but we don't know if the database is serving queries as expected.
Transitive health (high depth)
This is the health of the dependency itself: can it serve traffic from your service successfully? It is a property of both connectivity and how well the dependency is functioning. Example: a test database query quickly returns expected results.

You may be able to connect to a dependency, even if it is not transitively healthy. This means transitive health of a dependency will always be equally or less available than its connectivity. Given we must be conservative about unhealth, sometimes we only want to concern ourselves with connectivity. We'll see how this plays out below.
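
To make the difference in depth concrete, here is a minimal sketch in Python. The `FakeDatabase` class and both check functions are hypothetical stand-ins I made up for illustration, not any real client library: the connectivity check only proves we can reach the dependency, while the transitive health check proves a test query returns what we expect.

```python
class FakeDatabase:
    """Hypothetical stand-in for a real database client."""

    def __init__(self, reachable: bool = True, serving: bool = True):
        self.reachable = reachable
        self.serving = serving

    def connect(self) -> None:
        if not self.reachable:
            raise ConnectionError("cannot reach database")

    def query(self, sql: str):
        self.connect()
        if not self.serving:
            raise RuntimeError("database is not serving queries")
        return [(1,)]


def check_connectivity(db: FakeDatabase) -> bool:
    """Low depth: can we communicate with the dependency at all?"""
    try:
        db.connect()
        return True
    except ConnectionError:
        return False


def check_transitive_health(db: FakeDatabase) -> bool:
    """High depth: does a test query return the expected result?"""
    try:
        return db.query("SELECT 1") == [(1,)]
    except (ConnectionError, RuntimeError):
        return False
```

A database that is reachable but not serving queries passes the first check and fails the second, which is exactly why transitive health can never be more available than connectivity.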

Timing

The last dimension is about when the check is run. We start with two broad categories:

Synchronous
A synchronous check is run when the endpoint is queried, blocking until a result is determined.
Asynchronous
Asynchronous checks are run in the background, and endpoints serve the last seen result immediately.

One of our rules was to run health checks in the background (asynchronously, such as by using the @Async annotation in Dropwizard). There may be many health clients checking your service (e.g. load balancers). If those clients collectively outnumber your service's replicas (whether now or as they scale), all those health checks start to add up to quite a bit of traffic. If these checks are not very cheap, this can quickly escalate to hammering your service, and transitively its dependencies, with health checks. Asynchronous checks combat this problem by decoupling the timing of the check from the timing of the query: health requests return immediately, incurring negligible load, serving cached results from the last time the checks actually ran. A load balancer may still check frequently (and as such will continue to rapidly detect problems talking to a replica) and scale out independently without worry of load.
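
As a rough sketch of that decoupling (not how any particular framework does it), the hypothetical `BackgroundCheck` class below runs a check on a timer and lets the endpoint read only the cached result. Reporting "not OK" until the first run completes is my own assumption; you may prefer a different starting value.

```python
import threading
from typing import Callable


class BackgroundCheck:
    """Runs a check on a timer and serves the last seen result."""

    def __init__(self, check: Callable[[], bool], interval_seconds: float):
        self._check = check
        self._interval = interval_seconds
        self._last_result = False  # assumption: not OK until the first run
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def refresh(self) -> None:
        """Run the underlying check once and cache the outcome."""
        try:
            result = bool(self._check())
        except Exception:
            result = False  # a check that raises counts as a failure
        with self._lock:
            self._last_result = result

    def start(self) -> threading.Thread:
        """Start a daemon thread that refreshes the result periodically."""
        def loop() -> None:
            while not self._stop.is_set():
                self.refresh()
                self._stop.wait(self._interval)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()

    def result(self) -> bool:
        """Called by the endpoint: returns immediately, never runs the check."""
        with self._lock:
            return self._last_result
```

However many load balancers call `result()`, the underlying check only runs once per interval on each replica.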

That said, you are still checking your dependencies eventually, and so even running them in the background you can still hammer them depending on the number of service replicas and frequency of the background checks running on each. Fortunately, due to the timing decoupling of asynchronous checks, we may tune these frequencies freely. And hopefully your checks are cheap enough it doesn't really matter all that much (though you'd be surprised). A hybrid approach–use a cached result while valid, otherwise check synchronously–also works well.
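
The hybrid approach can be sketched in a few lines. `CachedCheck`, its `ttl_seconds` parameter, and the injectable `clock` are all names I'm assuming for illustration: the result is reused while it is still valid, and the check runs synchronously only when the cached value is missing or has expired.

```python
import threading
import time
from typing import Callable, Optional


class CachedCheck:
    """Serves a cached result while valid; otherwise checks synchronously."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._result: Optional[bool] = None
        self._checked_at = 0.0
        self._lock = threading.Lock()

    def result(self) -> bool:
        with self._lock:
            now = self._clock()
            expired = now - self._checked_at >= self._ttl
            if self._result is None or expired:
                try:
                    self._result = bool(self._check())
                except Exception:
                    self._result = False  # a check that raises counts as a failure
                self._checked_at = now
            return self._result
```

The lock means concurrent callers wait for a single in-flight check instead of each starting their own, so the dependency sees at most one check per TTL per replica.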

Implementing health endpoints

Armed with this model of health checks, we go back and use it to describe the how and why behind our 5 health resources.

Recall the first rule is not to overload them. Don't use the same URL or RPC for readiness and liveness, or readiness and health, etc. Trying to cleverly reuse resources couples these distinct checks together. For what? Adding additional resources is trivial to do in most frameworks. Instead, optimize each for their singular intended purpose, giving each their own procedure that may be invoked separately. This better protects against traps that may hurt your users, and allows the logic of each to grow with your service without having to also reconfigure load balancers or monitors at the same time.

With that separation of concerns, we are also poised to follow the second rule: put logic in endpoints and out of clients. Rather than scripting complex rules or logic or behavior ("do X, then Z, if response looks like this, or this, or this, then treat as healthy, ...") inside generic tools like monitors, put the rules and behavior inside your code–the logic of procedures themselves. This makes the endpoints more reusable, particularly where tools are difficult or impossible (or cost money) to customize or script. Even if your fancy expensive load balancer can be scripted, coupling to those features makes it harder to use a different load balancer technology later.
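
Here is a minimal, framework-free sketch of both rules together. The paths (`/ready`, `/live`, `/health`), the `state` dictionary, and the `handle` function are hypothetical; a real service would wire the same idea into its own HTTP or RPC framework. Each resource gets its own route, and all a client has to understand is the status code.

```python
from typing import Callable, Dict, Tuple

# Hypothetical process state; a real service would derive this from its checks.
state = {"warmed_up": False, "deadlocked": False, "database_healthy": True}


def readiness() -> bool:
    return state["warmed_up"]


def liveness() -> bool:
    return not state["deadlocked"]


def health() -> bool:
    return state["database_healthy"]


# One route per resource: none of them is overloaded with another's purpose.
ROUTES: Dict[str, Callable[[], bool]] = {
    "/ready": readiness,
    "/live": liveness,
    "/health": health,
}


def handle(path: str) -> Tuple[int, str]:
    """Return (status code, body); clients only need to look at the status."""
    check = ROUTES.get(path)
    if check is None:
        return 404, "NOT FOUND"
    return (200, "OK") if check() else (503, "NOT OK")
```

Because the decision lives in the endpoint, changing what "healthy" means later is a code change only; the load balancer or monitor configuration stays a plain status check.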

Readiness

Readiness is a query particularly for Kubernetes-controlled starts of containers (and by extension, pods). Rather than throwing traffic at the pod immediately, Kubernetes waits until all of its containers are ready.

  • DO return OK once the process is "warmed up." You can wait for lazy loading, run some smoke tests to JIT compile hot code paths, warm in-memory caches, wait until hard dependency connectivity is established, and so on. Warming up helps prevent long tail latencies after startup, and waiting for hard dependency connectivity protects against a bad configuration taking down your service.
  • CONSIDER performing no checks, and always returning OK, once you have first returned OK. If you're using global load balancing, we have a different resource for taking a region out of rotation.
  • DO NOT check soft dependencies. This means the container may still be considered ready without them, even if the problem is misconfiguration on your end. Unfortunate, but you should allow Kubernetes to reschedule your pod at any time, and this will require your containers' readiness checks to pass. If you try to check soft dependencies, even just connectivity, you risk blocking start up for your entire service if one is down. Losing a soft dependency, as discussed above, is not fatal, but a full outage sure is. We'd like to catch misconfigurations on our end, but unfortunately it's difficult, if not impossible, to detect whether the failure is due to our configuration, a specific Kubernetes node, or external factors.
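
These rules can be sketched as a small latch. The `Readiness` class and its constructor arguments are hypothetical names for illustration: it waits for warm-up steps and hard dependency connectivity, never looks at soft dependencies, and stops checking once it has reported OK.

```python
from typing import Callable, Iterable


class Readiness:
    """Reports ready once warmed up and connected, then stays ready."""

    def __init__(
        self,
        warmup_steps: Iterable[Callable[[], bool]],
        hard_dependency_connectivity: Iterable[Callable[[], bool]],
    ):
        self._warmup_steps = list(warmup_steps)
        self._connectivity = list(hard_dependency_connectivity)
        self._ready = False

    def is_ready(self) -> bool:
        if self._ready:
            return True  # latched: no further checks after the first OK
        warmed_up = all(step() for step in self._warmup_steps)
        connected = all(check() for check in self._connectivity)
        if warmed_up and connected:
            self._ready = True
        return self._ready
```

Soft dependencies are deliberately absent from the constructor, so one of them being down cannot block startup for the whole service.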

Liveness

Liveness is a container self-healing mechanism. If a container is not alive, Kubernetes will restart it. Crucially, this means the scope of liveness is the container and only the container.

  • DO return NOT OK for illegal states local to the container or process: threads are deadlocked, memory is leaking/out, [container] disk is full, etc. These may all be cured by recreating the container.
  • DO NOT check any dependencies. A restart will not help you if your database is down, and such an outage would result in all of your containers restarting, which might make the problem worse, or cause new problems.
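
One way (among many) to detect a stuck or deadlocked worker without touching any dependency is a heartbeat. In this sketch, the `Heartbeat` class and the `max_silence_seconds` threshold are my own hypothetical names: the worker loop calls `beat()` on every iteration, and liveness fails only if it has been silent for too long.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks whether a worker loop is still making progress."""

    def __init__(
        self,
        max_silence_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._max_silence = max_silence_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the worker on every loop iteration."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """Local state only: no dependency is consulted here."""
        return self._clock() - self._last_beat <= self._max_silence
```

Pick the silence threshold generously; a restart is only worth it when the process is truly wedged, not merely slow.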

Health

The plain "health" resource is used by global load balancers, peer services, and uptime monitors. A repeated unhealthy (or timed out) response indicates the server is unable to serve any valuable requests. For a load balancer, this means it should not route requests to that server (which may be a virtual server representing, say, an entire region). For a peer service, it means the peer may be unhealthy itself (if this is used as a transitive health check). For an uptime monitor, it may alert someone, or track statistics for later reporting.

  • DO return NOT OK if any hard dependencies have failed transitive health checks.
  • If there are no hard dependencies, it is perfectly fine and often correct to simply do nothing and always return OK, indicating the service is likely at least running, resolvable, and reachable through the network.
  • DO NOT check any soft dependencies. It may be tempting, but any check that relies on a globally shared failure domain may then take all regions out of rotation; in other words, no requests served instead of some requests served. This is why you must be conservative when deciding a service is unhealthy.
  • DO use the health endpoint for a basic uptime monitor and alert.
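
A sketch of that decision, with hypothetical `Criticality`, `Check`, and `is_healthy` names: only hard dependencies are consulted, soft ones are ignored on purpose, and a service with no hard dependencies is always OK.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Criticality(Enum):
    HARD = "hard"
    SOFT = "soft"


@dataclass(frozen=True)
class Check:
    name: str
    criticality: Criticality
    run: Callable[[], bool]  # a transitive health check for the dependency


def is_healthy(checks: Iterable[Check]) -> bool:
    """NOT OK only if a hard dependency fails; soft dependencies are skipped."""
    return all(c.run() for c in checks if c.criticality is Criticality.HARD)
```

Note that `all()` over an empty selection is true, which matches the "no hard dependencies, always return OK" case without any special handling.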

Diagnostics

So far, we've looked at three resources which are surprisingly restricted, and may not really examine all that much. What if you want to look at the bigger picture: all of your dependencies, perhaps even some application configuration? Or, maybe you want to look at the state of a particular soft dependency?

Diagnostics fulfills this niche. Whereas the previous resources need only return an indication of pass or fail, diagnostics is just as much about rich content, intended for human operators. For example, if your monitoring shows some adverse symptoms, or when testing out a new environment, you may take a quick peek at your diagnostics endpoint to see if any dependency checks are failing. You may also use it to automatically alert on known causes. For example, you could set up some alert policies for some of the soft dependencies that aren't as urgent as your SLO-based alerts (see also: Symptoms Versus Causes).

  • DO include as much content as you'd like (such as all dependencies' health and connectivity) that is generally useful to operators.
  • DO authorize access to these details, which may be sensitive.
  • DO NOT include any secrets in content.
  • CONSIDER a parameter which allows filtering down to specific checks or set of checks.
  • CONSIDER alerting on diagnostics.
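
As an illustration of the content and filtering points, here is a small sketch. The `diagnostics` function, its `only` parameter, and the JSON shape are hypothetical, and authorization is left out entirely; it would sit in front of this in a real service.

```python
import json
from typing import Callable, Dict, Iterable, Optional


def diagnostics(
    checks: Dict[str, Callable[[], bool]],
    only: Optional[Iterable[str]] = None,
) -> str:
    """Run every (or only the selected) check and return a JSON report."""
    selected = set(checks) if only is None else set(only)
    report = {}
    for name, check in checks.items():
        if name not in selected:
            continue
        try:
            report[name] = "pass" if check() else "fail"
        except Exception as exc:
            # Report only the exception type: messages may contain secrets
            # such as connection strings.
            report[name] = "error: " + type(exc).__name__
    return json.dumps({"checks": report}, sort_keys=True)
```

An alert policy for one soft dependency can then poll the endpoint with the filter set to just that check and look for anything other than "pass".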

Smoketests

Lastly, we have smoketests, which warrant some special attention. I'm not just talking about making "smoketest" calls to your service. I'm talking about a specific endpoint that itself does smoketesting for you.

I use these sparingly, as I much prefer monitoring the service levels of actual users, rather than synthetic calls. However, because service levels rely on actual traffic, there are two cases where service level monitoring falls short. First, low traffic makes alerts noisy: if you alert on a 10% error rate over 5 minutes, but you only have 50 calls in that time, it takes just 5 failed calls in 5 minutes to trip your alarm. Adding in traffic from synthetic calls helps improve your signal-to-noise ratio. Second, sometimes you need to monitor something that isn't yet available to actual users, such as a new region or version. When you have no traffic to look at at all, you need to generate some. That's where these calls come in handy.
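
To put numbers on the low-traffic case, here is a tiny worked example. The `failures_to_trip` helper and the figure of 300 synthetic calls are assumptions for illustration, and it assumes synthetic calls are counted in the same error rate as real ones.

```python
def failures_to_trip(total_calls: int, threshold_percent: int) -> int:
    """Smallest number of failed calls that reaches the alert threshold."""
    # Integer ceiling division avoids floating point rounding surprises.
    return -(-total_calls * threshold_percent // 100)


real_calls = 50
synthetic_calls = 300  # hypothetical volume generated by a monitor

without_synthetic = failures_to_trip(real_calls, 10)
with_synthetic = failures_to_trip(real_calls + synthetic_calls, 10)
```

With only the 50 real calls, 5 failures trip the alarm; with the synthetic calls added, it takes 35, so a handful of unlucky requests no longer pages anyone.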

Now perhaps you can craft a call using only your terminal and jaunty hacker wits, but wouldn't it be easier if all you had to remember was /smoketest? Likewise, it makes monitors like New Relic Synthetics easier (and cheaper!) to set up to continuously generate such traffic because all you need is a simple ping check instead of scripts. We can also easily filter out test traffic from our access logs or metrics. Because our code knows it's running tests, it is poised to deal with pesky side effects that happen from normal calls, such as by inserting test data that gets cleaned up in the background, sending email to a test inbox, charging a fake credit card, etc. It even helps secure our API: instead of opening up actual calls to our monitoring tools (which may have widely accessible credential storage), we can restrict it to the health endpoints. This all falls right in line with our principle of keeping logic in endpoints and out of monitors. It's cohesive, and codifies domain and operational knowledge.

Finally, running such smoketests a few times at startup may be a simple and pragmatic way to warm up your server process (JIT, caches, connections, etc) for readiness.

  • DO add tests for high value use cases.
  • DO NOT try to be thorough. It's a lost cause. That's what service level monitoring is for.
  • DO authorize calls separately from the rest of your API.
  • CONSIDER running checks asynchronously if you cannot adequately authorize calls.
  • CONSIDER a parameter which allows filtering to specific checks or set of checks.
  • CONSIDER alerting off of smoketests.
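
A smoketest endpoint along these lines might look like the sketch below. Everything here is hypothetical (`FakeOrders`, `place_test_order`, `run_smoketests`, the `only` filter); the point is that the test knows it is a test, so it can mark its data and clean up its own side effects.

```python
from typing import Callable, Dict, Iterable, Optional


class FakeOrders:
    """Hypothetical in-memory stand-in for an order store."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def create(self, order_id: str, is_test: bool) -> None:
        self.rows[order_id] = {"is_test": is_test}

    def delete(self, order_id: str) -> None:
        self.rows.pop(order_id, None)


orders = FakeOrders()


def place_test_order() -> None:
    """A high value use case, run with test data and cleaned up afterwards."""
    orders.create("smoketest-order", is_test=True)
    try:
        if "smoketest-order" not in orders.rows:
            raise RuntimeError("order was not stored")
    finally:
        orders.delete("smoketest-order")  # control the side effect


def run_smoketests(
    tests: Dict[str, Callable[[], None]],
    only: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Run all (or only the selected) smoketests; report pass/fail by name."""
    selected = set(tests) if only is None else set(only)
    results = {}
    for name, test in tests.items():
        if name not in selected:
            continue
        try:
            test()
            results[name] = "pass"
        except Exception as exc:
            results[name] = "fail: " + type(exc).__name__
    return results
```

A monitor then needs nothing more than a ping against the endpoint that calls `run_smoketests`, and an alert on any result that is not "pass".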

Summary

We discussed four high level guidelines:

  1. Don't overload health endpoints – favor more over reuse
  2. Keep logic out of monitors – put in endpoints
  3. Be very conservative when determining if unhealthy
  4. Run health checks in background – not on demand

Then we described a model of health based on 5 resources defined by explicit, cause-oriented checks. These checks vary on the criticality of the dependency, the depth of health checked, and the timing of the check. The health resources are summarized in the table below.

Summary of health resources.

| Resource | Readiness | Liveness | Health | Diagnostics | Smoketests |
|---|---|---|---|---|---|
| Purpose | Is the server done loading? | Should the server be restarted? | Is the server able to serve some traffic? | What's the status of everything? | How are real use cases working for known callers? |
| Intended user | Kubernetes | Kubernetes | Load balancers, peer services, monitors | Operators, monitors | Operators, monitors |
| Checks: hard dependencies | Connectivity | - | Health | Health | - |
| Checks: soft dependencies | - | - | - | Health | - |
| Checks: other | "Warmed up" (caches warmed, lazy loading done, hot code paths run (JIT), ...) | Illegal states local to the container (OOM, deadlocked threads, ...) | - | Anything you find helpful! | Real use cases with test data / controlled side effects |

If you found this helpful, or have questions or concerns, let me know in the comments. Thanks for reading!


[1] Actuator includes all health checks in its health endpoint, which I've seen many developers use as a quick health check for load balancers or for Kubernetes probes, even though semantically it matches the diagnostics query described above, which is not appropriate for either. Recently Actuator gained explicit Kubernetes probe support which has better default behavior.

[2] You probably could conceive of a check which actually relied on, say, pre-aggregated metrics based on black-box, observed symptoms. This would be appropriate for weighted load balancing, and in fact some load balancers do just that using in-memory statistics from previous requests to each member. For layer 7 load balancers, this is not a bad approach, as they are seeing all of the traffic anyway, and monitoring actual calls captures much more subtlety than a binary up/down decision. That said, the two approaches are not mutually exclusive.
