Designing a Microservice Chassis: Health Checks

A microservice chassis provides two things to the services built on top of it: build logic and cross-cutting concerns. In this post we'll look at health checks.

Health checks are more than just answering whether a microservice is up and running. For a microservice to be considered healthy, we need to answer the following questions:

  1. Is the microservice up and running?
  2. Is the microservice ready to serve requests?
  3. Can it reach all of its external dependencies?
  4. Are all of its external dependencies healthy?
  5. Is it operating within its defined limits?

Service-level Agreement (SLA)

A service-level agreement is an agreement between the provider of the microservice and its consumers. It details what the consumers can expect, as a minimum, from the service in terms of uptime, transaction processing time, downtime, and what happens when the service is completely unavailable.

Service-level Objectives (SLO)

A service-level objective is a target that the service and its team aim to meet, such as the expected uptime or an acceptable transaction processing time. Not all objectives are necessarily tied to the SLA; they may include internal team objectives as well. These objectives should be informed by the non-functional requirements of the service.

Service-level Indicators (SLI)

A service-level indicator is the measurement that tells us whether we have achieved a service-level objective or not, for example, how long it took to process a transaction.

Error Budget

Based on the SLOs we can derive the error budget, which tells us how we are tracking against our objectives. If the service-level objective is to be up 99.9% of the time, the service is allowed to be down for roughly 8h 45m 56s per year (0.1% of an average year). If we exceed that budget, we are clearly not doing well and should focus our attention on improving the uptime of the service.
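To make the arithmetic concrete, here is a minimal Python sketch of the error budget calculation. The function names are made up for illustration, and the year length (an average Gregorian year of 365.2425 days) is an assumption; a 365-day year gives a slightly smaller budget.

```python
# Assumption: an average Gregorian year of 365.2425 days, expressed in seconds.
SECONDS_PER_YEAR = 31_556_952


def yearly_error_budget(slo_percent: float) -> tuple[int, int, int]:
    """Return the allowed downtime per year as (hours, minutes, seconds)."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # Whole seconds of allowed downtime; fractions of a second are dropped.
    budget_seconds = int(SECONDS_PER_YEAR * (100 - slo_percent) / 100)
    hours, remainder = divmod(budget_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


def budget_remaining(slo_percent: float, downtime_seconds: float) -> float:
    """Return the unspent fraction of the yearly budget (negative once exceeded)."""
    if not 0 <= slo_percent < 100:
        raise ValueError("slo_percent must be at least 0 and below 100")
    budget = SECONDS_PER_YEAR * (100 - slo_percent) / 100
    return 1 - downtime_seconds / budget


print(yearly_error_budget(99.9))  # (8, 45, 56)
```

Tracking the remaining fraction rather than just the total makes it easier to see early when the budget is being spent too fast.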

Health API

Every microservice should expose an API endpoint that allows an external monitoring and alerting system to determine whether the microservice is up and running. There are many tools capable of monitoring microservices, such as Elastic, Consul and InfluxData's Telegraf.

Side note: One of the things to consider when choosing a monitoring and alerting tool is its ability to prevent flapping health statuses. Flapping is when a service starts and reports healthy, only to fail two minutes later. Ideally, the monitoring tool should be able to wait for multiple consecutive success or failure statuses before reporting a change in status. The tool should also allow a specified start-up time before the service health is checked.
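Waiting for several consecutive results before changing the reported status could look something like the following Python sketch. The class name and the default threshold of three are assumptions made for illustration, not the behaviour of any particular monitoring tool, and the start-up grace period is left out.

```python
class DebouncedStatus:
    """Report a status change only after `threshold` consecutive identical observations."""

    def __init__(self, threshold: int = 3, initial: str = "unknown"):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.reported = initial      # the status shown to alerting
        self._candidate = initial    # the status we might switch to
        self._streak = 0             # consecutive observations of the candidate

    def observe(self, status: str) -> str:
        """Record one raw check result and return the (possibly unchanged) reported status."""
        if status == self.reported:
            # Back to the reported status: any pending change is cancelled.
            self._candidate, self._streak = status, 0
            return self.reported
        if status == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = status, 1
        if self._streak >= self.threshold:
            self.reported, self._streak = status, 0
        return self.reported
```

With a threshold of three, a single failed check between successful ones never reaches the alerting side; only three failures in a row flip the status.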

Ideally, an API Gateway that can determine the health of each instance should be deployed in front of the microservice, so that requests are routed only to healthy instances.
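As a rough illustration of routing only to healthy instances, the Python sketch below does a simple round-robin over the instances currently marked healthy. The class and the instance names are hypothetical; a real gateway would update the health flags from its own health checks.

```python
from typing import Dict, Iterable, Optional


class HealthAwareRouter:
    """Round-robin over the instances that are currently marked healthy."""

    def __init__(self, instances: Iterable[str]):
        # Every instance starts out unhealthy until a health check says otherwise.
        self._health: Dict[str, bool] = {name: False for name in instances}
        self._counter = 0

    def mark(self, instance: str, healthy: bool) -> None:
        """Record the latest health check result for an instance."""
        self._health[instance] = healthy

    def next_instance(self) -> Optional[str]:
        """Return the next healthy instance, or None when there is none."""
        healthy = [name for name, ok in self._health.items() if ok]
        if not healthy:
            return None
        instance = healthy[self._counter % len(healthy)]
        self._counter += 1
        return instance
```

Returning None when nothing is healthy leaves it to the caller to decide how to fail, for example by answering with a 503.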

The health endpoint should also state whether the service is available and ready to accept requests. If not, it should return a 503 Service Unavailable status.
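A minimal sketch of such an endpoint, using only Python's standard library, is shown below. The readiness flag, the JSON body and the port are assumptions made for illustration; the only part taken from the text above is that a service that is not ready answers with 503.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(ready: bool) -> tuple[int, bytes]:
    """Map readiness to an HTTP status code and a JSON body."""
    if ready:
        return 200, json.dumps({"status": "healthy"}).encode()
    return 503, json.dumps({"status": "unavailable"}).encode()


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical readiness flag; a real service would set it once start-up completes.
    ready = False

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(type(self).ready)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the server; not called here so the sketch can be imported safely."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping the status decision in a plain function, separate from the HTTP handler, makes it easy to test without starting a server.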

A call to the /health endpoint should determine whether all the dependencies are reachable and healthy. If an external dependency is not reachable or not healthy, the service is either unhealthy or degraded. Degraded means that the service is still capable of performing its functions, but there may be delays in the processing of transactions. Of course, failures can quickly propagate in a distributed environment until the system as a whole becomes unhealthy. To prevent this from happening we should make use of resiliency and fault-handling frameworks such as Polly. Also consider whether it's worthwhile for the service to cache the health status for a short period of time.
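The Python sketch below shows one way the dependency results could be combined into a single status and cached for a short time. The split into critical and optional dependencies is an assumption added here to decide between unhealthy and degraded, and the names and the five-second default are illustrative only.

```python
import time
from typing import Callable, Dict, Optional

HEALTHY, DEGRADED, UNHEALTHY = "healthy", "degraded", "unhealthy"


def aggregate(critical: Dict[str, bool], optional: Dict[str, bool]) -> str:
    """Combine per-dependency results (True means reachable and healthy)."""
    if not all(critical.values()):
        return UNHEALTHY   # assumption: a failed critical dependency stops the service
    if not all(optional.values()):
        return DEGRADED    # assumption: a failed optional dependency only slows it down
    return HEALTHY


class CachedHealth:
    """Re-run an expensive health check at most once per `ttl_seconds`."""

    def __init__(self, check: Callable[[], str], ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def status(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache is empty or stale: run the real check and remember the result.
            self._value = self._check()
            self._expires_at = now + self._ttl
        return self._value
```

The cache keeps frequent polling from hammering the dependencies, at the cost of the reported status being up to one TTL out of date.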

The monitoring tool should ideally instrument the microservice and measure the service-level indicators for analysis against the SLOs and SLA. Alternatively, the health endpoint may be extended, for example with /health/stats, to securely expose service-level indicator values, or the service may store the values in a database for analysis. If the service is not operating within its defined limits, such as CPU and memory, it should report a warning.
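Checking measured values against defined limits can be as simple as the Python sketch below. The metric names and limit values are invented for illustration; how the measurements are collected is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limit:
    name: str        # hypothetical metric name, e.g. "cpu_percent"
    maximum: float   # highest value still considered within limits


def check_limits(measurements: Dict[str, float], limits: List[Limit]) -> List[str]:
    """Return one warning per limit that is exceeded or has no measurement."""
    warnings = []
    for limit in limits:
        value = measurements.get(limit.name)
        if value is None:
            warnings.append(f"{limit.name}: no measurement available")
        elif value > limit.maximum:
            warnings.append(f"{limit.name}: {value} exceeds limit {limit.maximum}")
    return warnings


limits = [Limit("cpu_percent", 80.0), Limit("memory_mb", 512.0)]
print(check_limits({"cpu_percent": 91.0, "memory_mb": 300.0}, limits))
# ['cpu_percent: 91.0 exceeds limit 80.0']
```

A missing measurement is reported as a warning too, since a metric that silently disappears is often a problem in its own right.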

Conclusion

In this post we have seen that health checks can be much more than just answering whether the service is up and running. They can help us understand how the service is performing against pre-defined SLOs and inform future priorities for the service. Monitoring tools need to be considered carefully, as they can provide great insights into the performance of our services. We also briefly touched on the role of the API Gateway and why it's important in terms of resiliency.
