Observability

What is Observability?

Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.

With the shift from monoliths to microservices, debugging systems has become considerably more difficult: a single request may now cross many independently deployed services. We must now rely on the outputs of our systems to troubleshoot.

Metrics and monitoring are imperative for a functional product. Insight into performance data at every level of the application helps detect and prevent major issues down the line.

Metrics and monitoring can be integrated with automated systems that recycle services, add storage, clear caches, fail over, and more. This reduces the load on Level 1-3 support, who can instead tackle tougher issues.
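As a minimal sketch of how metrics could drive automated remediation, the rules below map a metric threshold to an action. The metric names, thresholds, and action functions are hypothetical placeholders, not part of any real tool; in practice each action would call your orchestration or cloud API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    metric: str                 # name of the metric to watch (hypothetical)
    threshold: float            # the action fires when the value exceeds this
    action: Callable[[], str]   # remediation step to run


# Placeholder actions: each only returns its own name for illustration.
def restart_service() -> str:
    return "restart_service"


def add_storage() -> str:
    return "add_storage"


def clear_cache() -> str:
    return "clear_cache"


# Assumed thresholds, chosen only for the example.
RULES = [
    Rule("error_rate", 0.05, restart_service),
    Rule("disk_used_ratio", 0.90, add_storage),
    Rule("cache_used_ratio", 0.95, clear_cache),
]


def remediate(snapshot: Dict[str, float], rules: List[Rule] = RULES) -> List[str]:
    """Run every rule whose metric exceeds its threshold; return the actions taken."""
    triggered = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is not None and value > rule.threshold:
            triggered.append(rule.action())
    return triggered
```

Calling `remediate({"error_rate": 0.10, "disk_used_ratio": 0.50})` would trigger only the restart action, because only the error rate is above its threshold.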

What is the goal with Observability?

"There are two fundamental goals with observability: gradually improving an SLI (potentially optimising this over days, weeks, months), and rapidly restoring an SLI (reacting immediately, in response to an incident)." - Ben Sigelman

Rapid Restoration - ability to quickly restore service when a problem arises
Gradual Improvement - ability to measure impacts, positive or negative, of a change

SLI... SLO... SLA... oh my!

"SLIs drive SLOs which inform SLAs" - Seth Vargo

SLI

An SLI (Service Level Indicator) is a carefully defined quantitative measure of some aspect of the level of service that is provided. Ideally an SLI measures an area of interest to the service consumer. Examples include:

Latency - how long it takes to return a response
Error Rate - how often responses are in error
Throughput - how many responses are made per second
Availability - what percentage of time is the service usable
Durability - how likely is it that the data will be retained over time
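The first three indicators above can be sketched from a list of request records. The `Request` record and the function names are assumptions made for this illustration, and the percentile uses the simple nearest-rank method; real monitoring systems often estimate percentiles from histograms instead.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float   # time taken to return a response
    ok: bool            # False when the response was an error


def latency_percentile(requests: List[Request], pct: int) -> float:
    """Latency SLI: nearest-rank percentile of response latency."""
    if not requests:
        raise ValueError("no requests in window")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def error_rate(requests: List[Request]) -> float:
    """Error rate SLI: fraction of responses that were errors."""
    if not requests:
        raise ValueError("no requests in window")
    return sum(1 for r in requests if not r.ok) / len(requests)


def throughput(requests: List[Request], window_seconds: float) -> float:
    """Throughput SLI: responses per second over the window."""
    return len(requests) / window_seconds
```

Availability and durability are left out of the sketch because they are usually measured over much longer periods, for example from health probes or storage audits.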

SLO

An SLO (Service Level Objective) is a precise numerical target for a service level that is measured by an SLI. Every service must have an SLO in order to make data-driven decisions about that service's reliability and how that relates to future development. Measuring SLOs can help you understand when it's okay to increase development velocity and take additional risks, and when it's necessary to slow down development and focus on improving reliability.

An SLO defines a goal based on SLIs, for example:

95th percentile latency of homepage requests over the past 5 minutes < 300 ms
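A minimal sketch of evaluating this example SLO against raw samples might look like the following. The sample format (timestamp, latency) and the choice to treat an empty window as compliant are assumptions for illustration.

```python
import math
from typing import List, Tuple

SLO_PERCENTILE = 95          # percentile named in the objective
SLO_THRESHOLD_MS = 300.0     # latency target in milliseconds
SLO_WINDOW_SECONDS = 5 * 60  # "past 5 minutes"


def p95_slo_met(samples: List[Tuple[float, float]], now: float) -> bool:
    """Check the SLO for samples given as (timestamp_seconds, latency_ms)."""
    # Keep only latencies observed inside the evaluation window.
    recent = sorted(
        latency for ts, latency in samples
        if now - SLO_WINDOW_SECONDS <= ts <= now
    )
    if not recent:
        return True  # assumption: no traffic counts as meeting the objective
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(SLO_PERCENTILE * len(recent) / 100)
    return recent[rank - 1] < SLO_THRESHOLD_MS
```

With 20 samples in the window, the objective tolerates one slow request: the 19th-ranked latency is the one compared against 300 ms.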

Warning: Don’t make your system overly reliable if you don’t intend to commit to it being that reliable. Services that far exceed their availability targets will eventually be expected to maintain that availability, regardless of that service's published SLOs.

SLA

An SLA (Service Level Agreement) builds on an SLO by offering a promise to meet the SLO over a period of time, or else incur a penalty.
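To make the relationship concrete, here is a small sketch of an SLA check over a billing period. The `Sla` record, the availability target, and the service-credit penalty are all hypothetical; actual agreements define their own measurement rules and penalty terms.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    target_availability: float   # promised SLO, e.g. 0.999 (hypothetical)
    credit_percent: float        # penalty owed if the promise is missed (hypothetical)


def availability(total_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the period during which the service was usable."""
    return (total_minutes - downtime_minutes) / total_minutes


def penalty_owed(sla: Sla, total_minutes: float, downtime_minutes: float) -> float:
    """Return the service credit (in percent) owed for the period, or 0.0 if the SLO was met."""
    measured = availability(total_minutes, downtime_minutes)
    return 0.0 if measured >= sla.target_availability else sla.credit_percent
```

For a 30-day period (43,200 minutes) and a 99.9% target, the promise allows about 43 minutes of downtime, so 30 minutes of downtime owes nothing while 60 minutes triggers the penalty.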

How do we improve our observability?

There are three primary areas to focus on in order to increase the observability of your application:

Metrics - a numeric representation of data measured over intervals of time
Logging - an immutable, timestamped record of discrete events that happened over time
Tracing - a series of related distributed events from an end-to-end flow through a distributed system
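The three areas can be illustrated together with a standard-library-only sketch. The `handle_request` function, the in-memory `request_count` counter, and the `spans` list are stand-ins invented for this example; a real service would typically export this data to a metrics backend, a log aggregator, and a tracing system.

```python
import logging
import time
import uuid
from collections import Counter
from typing import Dict, List, Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("example-service")   # hypothetical service name

request_count: Counter = Counter()   # metric: requests handled per route
spans: List[Dict] = []               # trace: one span per unit of work


def handle_request(route: str, trace_id: Optional[str] = None) -> str:
    """Handle one request and emit a metric, a log line, and a trace span."""
    # Reuse the caller's trace ID so related events can be stitched together.
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()

    request_count[route] += 1                                      # metric
    log.info("handled route=%s trace_id=%s", route, trace_id)      # log event

    spans.append({                                                 # trace span
        "trace_id": trace_id,
        "name": route,
        "duration_s": time.perf_counter() - start,
    })
    return trace_id
```

Passing the returned trace ID into a downstream call, for example `handle_request("/pay", trace_id)`, is what links the separate spans into a single end-to-end trace.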