Explain Like I'm 5
Observability Glossary

Observability terms and concepts you need to know

Welcome to the ELI5 Observability Glossary! Whether you're new to observability or a seasoned DevOps engineer, you occasionally come across observability concepts and terms that leave you puzzled: "What does this even mean?"

We've all been there.

Observability is full of expressions that we need to understand to effectively communicate, from common terms like "distributed tracing" to concepts like "cardinality" and "dimensionality".

We believe that by breaking down the technical jargon, we can make observability more accessible and understandable for everyone. This glossary explains every concept "like I'm 5".

Aggregation

Aggregation is the process of taking a large dataset and simplifying it into a set of pre-defined parameters such as average, sum or count. It combines and summarizes data into a coherent view and is mostly used to reduce the cost of processing and storing the entire dataset.
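As a rough illustration, the sketch below reduces a list of request durations to a count, a sum and an average. The function name `aggregate` and the sample values are made up for this example.

```python
from statistics import mean


def aggregate(durations_ms):
    """Summarize raw request durations into a few pre-defined parameters."""
    return {
        "count": len(durations_ms),
        "sum": sum(durations_ms),
        "avg": mean(durations_ms),
    }


# Four raw data points become one small summary.
summary = aggregate([120, 80, 100, 300])
print(summary)
```

Storing only the summary is cheaper than keeping every data point, but the individual values can no longer be recovered from it.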

Alerting

Alerting is the process of sending notifications to developers when predefined conditions are met in their application, such as when an error occurs. Alerting is essential for maintaining the health and performance of an application.

Anomaly Detection

Anomaly detection is the process of identifying unexpected or irregular patterns in telemetry data that are far from the norm, which usually means something is off and requires human attention.
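One simple way to flag values that are far from the norm is to measure how many standard deviations each value sits from the mean (a z-score check). This is only a sketch of one basic technique. The function name, the threshold of 2 and the sample latencies are assumptions, and real anomaly detection is usually more sophisticated.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values that sit more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
anomalies = find_anomalies(latencies_ms)
print(anomalies)
```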

Application Performance Monitoring

Application Performance Monitoring (APM) is the practice of monitoring the performance, availability, and end-user experience of applications. It involves tracking metrics such as response time, error rates, and resource utilization to identify performance issues.

Audit Trail

An audit trail is a record of all the actions and changes made within an application, which enables traceability.

Autoscaling

Autoscaling is the ability of an application to automatically adjust its resources based on real-time demand, ensuring optimal performance and cost-efficiency. This enables applications to dynamically scale up or down in response to changes in traffic, without manual intervention.
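To make "adjust resources based on demand" concrete, here is a minimal sketch of a proportional scaling rule, similar in spirit to the one Kubernetes documents for its Horizontal Pod Autoscaler. The function name, the CPU figures and the replica limits are assumptions for illustration. Real autoscalers add cooldowns, tolerances and other safeguards.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to how far usage is from the target."""
    wanted = math.ceil(current_replicas * current_cpu / target_cpu)
    # Never go below the minimum or above the maximum.
    return max(min_replicas, min(max_replicas, wanted))


scale_up = desired_replicas(4, current_cpu=90, target_cpu=60)
scale_down = desired_replicas(4, current_cpu=15, target_cpu=60)
print(scale_up, scale_down)
```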

Availability

Availability is the measure of how consistently an application is operational and accessible to users without interruptions or downtime.

AWS Lambda

AWS Lambda is a serverless computing service provided by Amazon Web Services that lets you run code without provisioning or managing servers. It automatically scales and manages the infrastructure required to run your code, enabling you to focus on writing the code itself.

Azure Functions

Azure Functions are a serverless computing service provided by Microsoft Azure, enabling developers to run event-triggered code without the need to manage infrastructure. This service enables you to focus on writing the code necessary to handle specific events or triggers, such as HTTP requests, database changes, or file uploads.

Blob Storage

A type of cloud storage that is designed to store large amounts of unstructured data, such as images, videos, and telemetry data. Blob storage is often used for long-term retention and archival purposes.

Blue Green Deployment

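A deployment strategy that enables you to release a new version of your software without downtime or disruption to users. It runs two identical environments, "blue" (the current version) and "green" (the new version), and switches traffic from one to the other once the new version is ready.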

Capacity Planning

Capacity planning is the practice of predicting and managing the amount of computing resources needed to support an application workload, and ensuring that it can handle current and future demands without performance degradation.

Canonical logs

Canonical logs are logs you add at the end of every request or transaction. Each of these logs contains all the important parameters about the request, for instance user ID, duration, and response status code. Adding all these details in a single log line enables you to debug faster and perform more complex aggregations without the need to join multiple log lines together.
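A minimal sketch of the idea: handle the request, then emit one log line at the very end that carries everything important about it. The handler, the field names and the hard-coded status are hypothetical.

```python
import json
import time


def handle_request(user_id, path):
    """Process a request and emit a single canonical log line when it ends."""
    start = time.perf_counter()
    status = 200  # placeholder for the real request handling
    duration_ms = round((time.perf_counter() - start) * 1000, 2)

    # One line with all the key parameters of the request.
    line = json.dumps({
        "user_id": user_id,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
    })
    print(line)
    return line


log_line = handle_request("user-123", "/checkout")
```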

Change Failure Rate

The Change Failure Rate measures the percentage of changes to an application that result in failure. This metric helps you understand the impact of changes on application stability and reliability, enabling you to identify areas for improvement and reduce the risk of future failures.
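The calculation itself is a simple ratio. The sketch below assumes each deployment is recorded with a hypothetical `caused_failure` flag.

```python
def change_failure_rate(deployments):
    """Percentage of deployments that resulted in a failure."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d["caused_failure"])
    return failures / len(deployments) * 100


# Eight deployments, two of which (3 and 7) caused a failure.
deployments = [{"id": n, "caused_failure": n in (3, 7)} for n in range(1, 9)]
rate = change_failure_rate(deployments)
print(f"{rate}%")
```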

Cardinality

Cardinality is the number of unique values in a dataset. For example, the number of unique values a specific field in a set of logs can take.
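For example, a status code field usually has low cardinality, while a user ID field has high cardinality because it grows with the number of users. The log records and the helper name below are invented for illustration.

```python
def cardinality(logs, field):
    """Count the distinct values a field takes across a set of logs."""
    return len({log[field] for log in logs if field in log})


logs = [
    {"status": 200, "user_id": "u1"},
    {"status": 200, "user_id": "u2"},
    {"status": 500, "user_id": "u3"},
    {"status": 200, "user_id": "u1"},
]

status_cardinality = cardinality(logs, "status")
user_cardinality = cardinality(logs, "user_id")
print(status_cardinality, user_cardinality)
```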

Cold Start

When a serverless function is invoked for the first time, or after a period of inactivity, it experiences a cold start. A cold start means it takes longer to respond to the invocation. The cold start comes from the underlying architecture of serverless functions. It's usually due to the initialization of the micro virtual machine where the serverless function code will run.

Container

A container is a lightweight, standalone, and executable package that includes everything needed to run a program, including the code, runtime, system tools, system libraries, dependencies, and settings. It enables developers to run applications consistently across different environments.

Continuous Deployment

The practice of automatically deploying code changes to production as soon as they are ready, without the need for manual intervention.

Continuous Profiling

The process of analyzing the performance of a program with metrics such as CPU utilization or the frequency and duration of function calls, to identify bottlenecks, memory leaks, and other areas for optimization.

Continuous Integration

The practice of frequently integrating code changes into a repository, where automated tests are run to validate these changes. This enables early detection of issues and ensures that the codebase is always in a working state.

CPU Utilization

The measure of how much of a computer's central processing unit (CPU) is being used. It indicates the amount of work the CPU is doing and can help identify performance bottlenecks.

Canary Release

A deployment strategy that enables rolling out new features to a small subset of users before releasing them to the entire user base. This method helps to mitigate risks by testing the new functionality in production, enabling you to monitor its performance and stability before a full release.

Data Replication

The process of copying and storing data in multiple locations to ensure redundancy and fault tolerance.

Data Visualization

Data Visualization is the graphical representation of data to help developers understand the significance and correlation between data points. In observability, it's critical to present data with the highest signal-to-noise ratio possible.

Deployment Frequency

The rate at which new code changes are deployed to production. It measures how often a team releases updates to their application.

Distributed Tracing

Distributed Tracing is a method for tracking the flow of requests through your application. It enables you to follow the journey of a request as it travels across multiple services, so you can see where things might be going wrong.

Dimensionality

Dimensionality is the number of attributes or features in a dataset. It is a measure of the complexity and size of the data, which can impact the performance of algorithms and the ability to visualize and interpret the data.

DORA Metrics

A set of metrics used to measure the performance of software processes and teams. These metrics are: deployment frequency, lead time for changes, change failure rate, and time to restore service.

Downsampling

Downsampling is the process of reducing the number of data points in a dataset to a lower resolution or granularity, typically for storage optimization. It involves aggregating or averaging data over a specific time interval or by a specific factor.
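Here is a small sketch of downsampling by averaging over fixed time intervals. The data points are `(timestamp_in_seconds, value)` pairs invented for the example, and averaging is just one possible way to combine the values in a bucket.

```python
from collections import defaultdict


def downsample(points, interval_s):
    """Average (timestamp, value) points into buckets of `interval_s` seconds."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % interval_s
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Five points at 20-second resolution become two points at 60-second resolution.
points = [(0, 10), (20, 20), (40, 30), (60, 40), (80, 60)]
per_minute = downsample(points, 60)
print(per_minute)
```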

Edge Computing

Edge Computing is the practice of handling requests and processing data closer to the client, at the "edge" of the network. This typically enables lower latency, making it ideal for applications that require real-time or near-real-time processing.

Elasticity

The ability of a system to adapt to changes in workload by automatically provisioning and de-provisioning resources in response to demand.

ELK Stack

The ELK Stack is a set of three open-source tools: Elasticsearch, Logstash, and Kibana, used for log management and data visualization. It enables developers to collect, parse, store, and analyze large volumes of log data for troubleshooting and monitoring purposes.

Error Rate

The error rate is the frequency at which errors occur in an application. It is a critical metric for understanding the overall health and performance of an app, as it directly impacts user experience.

Error Tracking

Error tracking is the process of identifying, recording, and monitoring errors or bugs in applications. It involves capturing metadata about errors, such as the type of error, when it occurred, and the circumstances surrounding it. This data is important to diagnose and resolve the issues.

Event Correlation

The process of identifying and associating related events across multiple datasets or data types. This association enables you to understand each event in the context of the overall system.

Event-Driven Architecture

A software design pattern that promotes the production, detection, consumption, and reaction to events.

Event

An event is simply "something happening" in an application. It could be a user action, an error, or a state change.

FinOps

FinOps is the practice of managing and optimizing the financial aspects of cloud-based software development and operations.

Fluentd

Fluentd is an open-source data collector that unifies the data collection and consumption for better use and understanding of data. It is typically used to collect logs from applications.

Function as a Service (FaaS)

A cloud computing service that enables developers to run individual functions in response to events without having to manage the infrastructure. Function as a Service offerings are typically the compute solution for serverless architectures.

Google Cloud Functions

Google Cloud Functions is a serverless execution environment provided by Google Cloud Platform. It enables you to write single-purpose functions that are triggered by various events, and it automatically scales in response to the load.

Grafana

Grafana is a data visualization tool that enables you to create dashboards and graphs for monitoring and analyzing your data. It integrates with various data sources, making it easy to visualize and understand complex data sets.

Health Check

A health check is a diagnostic process used to assess the status of a system or application. It verifies that the system is operational and functioning as expected, providing insights into its overall health.

Histogram

A histogram is a bar chart that shows the distribution of a set of data. It's commonly used in observability to visualize the frequency of values within a dataset.
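Under the hood, a histogram is just a count of how many values fall into each bucket. The sketch below uses fixed-width buckets and draws the bars as text. The bucket width and the latencies are assumptions for the example.

```python
from collections import Counter


def histogram(values, bucket_width):
    """Count how many values fall into each fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))


latencies_ms = [12, 48, 51, 75, 99, 101, 130, 250]
buckets = histogram(latencies_ms, 50)

# Draw one text bar per bucket.
for start, count in buckets.items():
    print(f"{start:>4} ms and up: {'#' * count}")
```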

Idempotence

Idempotence is the property of an operation that can be applied multiple times without changing the result beyond the initial application. No matter how many times you do an idempotent operation, you’ll still get the same result.
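A quick way to see the difference is to compare an operation that sets a value with one that appends to a list. The order structure and function names are hypothetical.

```python
def set_status(order, status):
    """Idempotent: calling it again with the same status changes nothing."""
    order["status"] = status
    return order


def add_item(order, item):
    """Not idempotent: every call appends another copy of the item."""
    order["items"].append(item)
    return order


order = {"status": "pending", "items": []}

# Applying the idempotent operation twice gives the same result as applying it once.
set_status(order, "paid")
set_status(order, "paid")

# Applying the non-idempotent operation twice leaves two items.
add_item(order, "book")
add_item(order, "book")

print(order)
```

This is why retries are safe for idempotent operations: a duplicated request does no extra harm.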

Incident Analysis

Incident Analysis is the process of examining and evaluating events or issues that have caused disruptions or failures in an application. It involves identifying the root cause of incidents and understanding the impact they have on the system's performance and reliability.

Incident Management

Incident Management is the process of identifying, responding to, and resolving incidents that occur within an application. It involves detecting when something goes wrong, alerting the appropriate teams, working to restore normal operations as quickly as possible, and implementing measures to prevent the same issue from occurring in the future.

Incident

An incident is an unexpected disruption in the normal operation of an application. Incidents can range from complete service outages to performance degradation, and they often require immediate attention to restore normal functionality.

Incident Response

The process of reacting to and resolving unexpected issues or problems within a software system, ensuring that the system returns to a stable and functional state as quickly as possible.

Infrastructure as Code (IaC)

Infrastructure as Code is the practice of managing and provisioning cloud infrastructure through machine-readable configuration files, rather than using manual interactive configuration tools (commonly called ClickOps).

Instrumentation

The act of adding code to a software application to collect telemetry data about its behavior and performance.
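In its simplest form, instrumentation can be a wrapper that records how long a function took. The sketch below is hand-rolled for illustration. The `instrument` decorator and the in-memory `recorded` list are assumptions, and in practice you would more likely rely on an instrumentation library such as an OpenTelemetry SDK.

```python
import time
from functools import wraps

recorded = []  # stand-in for a real telemetry backend


def instrument(func):
    """Record the name and duration of every call to the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call succeeded or raised.
            recorded.append({
                "name": func.__name__,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper


@instrument
def add(a, b):
    return a + b


result = add(2, 3)
print(result, recorded)
```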

IO Operations

IO Operations refers to input/output operations in a computer system, such as reading from or writing to a file or a network socket.

Jaeger

Jaeger is an open source distributed tracing system that helps developers troubleshoot transactions in complex microservices architectures.

Key-Value Store

A key-value store is a type of NoSQL database that uses a simple key-value method to store data. Each data record is stored as a key paired with its value, making it easy to retrieve and update specific information.

Kubernetes (k8s)

Kubernetes is an open-source platform that automates container operations, enabling developers to deploy, scale, and manage containerized applications easily.

Latency

Latency is the delay between a request being made and a response being received. It is a key metric as it directly impacts user experience and system performance.

Lead Time for Changes

The Lead Time for Changes refers to the time it takes for a code change to be implemented, reviewed and deployed into production. This metric measures the time from when the code is committed to when it is running in a production environment, providing insight into the efficiency of the development, review and deployment processes.

Load Balancing

Load Balancing is the process of distributing incoming network traffic across multiple servers to ensure no single server is overworked. This optimizes resource utilization and prevents server overload.
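Round robin is one of the simplest ways to spread requests: each new request goes to the next server in the list, and the list wraps around at the end. The class and server names below are invented for this sketch, and real load balancers also account for server health and current load.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in turn so that no single one takes all the traffic."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def next_server(self):
        return next(self._servers)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assigned = [balancer.next_server() for _ in range(6)]
print(assigned)
```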

Log Aggregation

The process of collecting and centralizing log data from various sources to make it easier to search, analyze, and visualize.

Log Management

Log Management is the process of collecting, storing, and analyzing log data generated by applications, servers, and other systems. It involves organizing and indexing logs to make them easily searchable and filterable, enabling developers to troubleshoot issues and gain insights into the behavior of their application.

Logging Levels

The different levels at which log messages can be generated, each indicating the severity or importance of the message.
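The sketch below uses Python's standard `logging` module to show how a level acts as a filter: with the logger set to WARNING, the less severe DEBUG and INFO messages are dropped. The logger name and the messages are made up for the example.

```python
import io
import logging

# Write log output to an in-memory buffer so it is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical logger name
logger.setLevel(logging.WARNING)        # only WARNING and above get through
logger.addHandler(handler)
logger.propagate = False

logger.debug("cart contents loaded")
logger.info("payment started")
logger.warning("payment retry 1 of 3")
logger.error("payment failed")

output = stream.getvalue()
print(output)
```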

Logging

Logging is the process of recording events, actions, or messages that occur within an application, typically for troubleshooting or analysis purposes.

Loki

Loki is an open-source log aggregation system that enables developers to collect, store, and query log data in a microservices architecture. It provides a cost-effective way to monitor and troubleshoot applications by indexing only a small set of labels for each log stream, rather than the full content of every log line.

Mean Time Between Incidents (MTBI)

The Mean Time Between Incidents (MTBI) is a metric that measures the average time between two consecutive incidents or failures in an application.

Mean Time to Detect (MTTD)

The Mean Time to Detect (MTTD) is the average time it takes to identify and acknowledge an issue or incident within an application. It measures the effectiveness of monitoring and alerting systems in quickly highlighting issues.

Mean Time to Repair (MTTR)

The Mean Time to Repair (MTTR) is the average time it takes to fix a problem after it has been identified. It measures the efficiency of the repair process and is a key metric in assessing system reliability and performance.

Memory Utilization

Memory Utilization is the measure of how much of a computer's memory is being used at a given time.

Monitoring

Monitoring is the act of tracking the performance and behavior of a system, application, or service in real-time, allowing for the detection of issues, anomalies, and performance bottlenecks. It involves collecting and analyzing metrics to ensure that everything is running smoothly and to identify potential areas for improvement.

Metrics

Metrics are a way to measure and track the performance and behavior of a system, application, or service. They provide valuable data on things like response time, error rates, and resource utilization.

Network Latency

The time it takes for data to travel from one point to another in a network. It can be affected by various factors such as network congestion, distance, and the quality of the connection.

o11y

o11y stands for Observability :)

Object Storage

Object storage is a method of storing data as objects rather than files or blocks, typically using a unique identifier to access each object. This typically enables more scalable and cheaper storage solutions than traditional block storage.

Observability as Code

Observability as Code is the practice of incorporating observability principles and practices directly into the codebase of an application, enabling developers to proactively monitor and troubleshoot their software throughout the development lifecycle.

Observability Platform

An observability platform is a set of solutions that enable you to understand the inner workings of your application through telemetry data (logs, metrics, traces, wide events, etc.). It gives you a complete view of your app's health, enabling you to detect, diagnose and resolve issues quickly.

Observability

Observability is the ability to understand how your application is working and behaving in production through telemetry data (logs, metrics, traces, wide events, etc.). It enables you to detect, diagnose and resolve issues in your app before they impact your users and become problems.

OpenTelemetry

OpenTelemetry is a set of tools and APIs that enable you to collect telemetry data from your applications. It enables you to understand what's happening inside your app, from performance metrics to traces.

Percentile

A percentile is a measure used in statistics to indicate the value below which a given percentage of observations in a group of observations fall. In the context of observability, percentiles are often used to measure the distribution of response times and latencies.
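There are several ways to compute a percentile. The sketch below uses the nearest-rank method, which is one common convention: sort the values and pick the smallest one that has at least the given percentage of observations at or below it. The function name and the latencies are assumptions for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p100 = percentile(latencies_ms, 100)
print(p50, p90, p100)
```

Note how the single slow request (1000 ms) would drag an average up to 145 ms, while the 50th percentile stays at a value most users actually experienced.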

Playbook

A playbook is a set of predefined steps and actions to take when a particular incident or issue arises, helping to guide the response and resolution process.

Prometheus

Prometheus is an open-source monitoring and alerting toolkit that is widely used for collecting and storing time series data. It enables developers to gain insight into their applications' performance.

P90

P90 is a statistical measure that represents the value below which 90% of the data falls. In observability, P90 is often used to describe the latency of the slowest requests: 10% of requests take longer than the P90 value.

Query Language

A query language is a specialized programming language designed to retrieve and manipulate data from a database. It enables developers to interact with databases using a syntax that is optimized for data retrieval and manipulation.

Query Performance

Query performance is the speed and efficiency with which a database or other data storage system processes queries from users or applications. The performance is measured both in latency and resource utilization.

Real-time Alerting

Real-time alerting is the process of receiving immediate notifications about critical issues or events in your application, enabling you to quickly respond.

Real User Experience Monitoring (RUM)

Real User Experience Monitoring (RUM) is the process of tracking and analyzing how real users interact with an application or website in real-time.

Recovery

Recovery is the process of restoring an application to a stable and functional state after a failure or incident.

Resiliency

Resiliency is an app's ability to gracefully handle and recover from failures, ensuring minimal impact on the overall functionality. Building resilient applications means designing apps and infrastructure that can be fault-tolerant and responsive to unexpected issues.

Reliability

Reliability is the ability of an application to consistently "do what it says on the tin". It's the ability of the application to perform as expected, even when conditions are not optimal. It involves minimizing the occurrence of failures and ensuring that the system can recover quickly when failures happen.

Response Time

The time it takes for a system to respond to a request, typically measured from the moment the request is sent until the response is received.

Root Cause Analysis

Root Cause Analysis is the process of identifying the underlying reason for an error or performance issue in an application. It often involves investigating logs, metrics, traces and wide events to identify the cause of the issue, instead of fixing just its symptoms.

Sampling

Sampling is the process of collecting and analyzing only a subset of a larger dataset, in order to make inferences or observations about the larger dataset. Sampling is about selecting a representative sample that captures the key features of the entire dataset without the cost associated with processing and storing the entire dataset.
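One way to sketch sampling is to hash each trace ID and keep only the IDs whose hash falls under a chosen rate. Hashing (rather than rolling a die) means the same ID always gets the same decision. The function name, the ID format and the 10% rate are assumptions for the example.

```python
import hashlib


def keep(trace_id, rate):
    """Deterministically keep roughly `rate` (0.0 to 1.0) of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Interpret the first 8 bytes as a number between 0 and 2**64 - 1.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if keep(t, 0.1)]
print(f"kept {len(sampled)} of {len(trace_ids)} traces")
```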

Scalability

Scalability is the ability of an application to handle an increasing load, such as a growing number of users or requests, without experiencing downtime or performance degradation.

Serverless

Serverless is a cloud computing model that enables developers to build and run applications without managing the infrastructure. It enables developers to focus on writing code and delivering value without worrying about servers, virtual machines, or containers.

Service Level Agreement (SLA)

A Service Level Agreement (SLA) is a commitment between a service provider and a customer that outlines the level of service that the customer can expect. This includes details such as uptime, response time, and other performance metrics.

Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a measure of the performance of a service, such as response time or error rate, that is used to assess its quality and reliability. SLIs are essential for understanding how well a service is meeting its objectives and can be used to set Service Level Objectives (SLO) and Service Level Agreements (SLA).

Sidecar

A sidecar is a pattern where a secondary process or container is attached to the main application to provide additional functionality or support. Sidecars are usually used for logging, security, or networking, without directly impacting the core functionality of the main application.

Span

A span is the primary building block of a distributed trace. It is a specific unit of work within a distributed system, typically representing a single operation.

Service Level Objective (SLO)

A Service Level Objective (SLO) is a specific target for a service or application, often expressed as a percentage of uptime or availability. It is used to measure and monitor the performance of the application. It usually provides a better developer experience than alerts.
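As a worked example, suppose the target is that 99.9% of requests succeed. The sketch below checks whether the target was met and how much of the allowed failures (often called the error budget) has been used. The function name and the request counts are made up.

```python
def slo_report(target, total_requests, failed_requests):
    """Compare measured availability against an SLO target (assumes target < 1)."""
    availability = (total_requests - failed_requests) / total_requests
    allowed_failures = (1 - target) * total_requests
    return {
        "availability": availability,
        "slo_met": availability >= target,
        "error_budget_used": failed_requests / allowed_failures,
    }


# A 99.9% target over 1,000,000 requests allows about 1,000 failures.
report = slo_report(target=0.999, total_requests=1_000_000, failed_requests=800)
print(report)
```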

Structured Logging

A method of logging that organizes log messages into a format that is easily searchable, filterable, and queriable. This enables you to extract valuable insights and troubleshoot issues faster.
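JSON is a common choice of format. Once every log line is a JSON object, filtering becomes a matter of checking fields instead of searching free text. The log lines and field names below are invented for the example.

```python
import json

raw_lines = [
    '{"level": "info", "message": "request completed", "status": 200, "duration_ms": 35}',
    '{"level": "error", "message": "payment failed", "status": 502, "duration_ms": 1250}',
    '{"level": "info", "message": "request completed", "status": 200, "duration_ms": 41}',
]

# Each line parses into a dictionary with named fields.
events = [json.loads(line) for line in raw_lines]

errors = [e for e in events if e["level"] == "error"]
slow_requests = [e for e in events if e["duration_ms"] > 1000]
print(errors)
```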

Tail Sampling

Tail Sampling is the practice of deciding whether to keep a trace after it has completed, based on what happened in it. This enables you to selectively capture the most extreme or rare traces in an application, such as errors, timeouts, or high-latency requests, while dropping most of the routine ones.
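A minimal sketch of the idea: wait until all the spans of a trace have arrived, then keep the trace only if it contains an error or took too long. The span fields and the 1000 ms threshold are assumptions for illustration.

```python
def keep_trace(spans, latency_threshold_ms=1000):
    """Decide whether to keep a finished trace based on its errors and total duration."""
    has_error = any(span["error"] for span in spans)
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration_ms > latency_threshold_ms


fast_ok = [{"start_ms": 0, "end_ms": 40, "error": False}]
failed = [{"start_ms": 0, "end_ms": 30, "error": True}]
slow = [
    {"start_ms": 0, "end_ms": 200, "error": False},
    {"start_ms": 200, "end_ms": 1500, "error": False},
]

decisions = [keep_trace(t) for t in (fast_ok, failed, slow)]
print(decisions)
```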

Telemetry Data

Telemetry data is all the data collected from your application and transmitted to your observability platform. It includes metrics, logs, traces, wide events and everything in between.

Time Series Database

A time series database is a type of database optimized for handling time-stamped or time-series data. It is designed to efficiently store and retrieve data points indexed by time, making it ideal for monitoring and analyzing time-based metrics and events.

Trace

A trace is a single request or transaction and all the subsequent events as it moves through a distributed system. This data typically includes information about the components involved in the transaction, as well as the timing and outcome of each interaction.
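One way to picture a trace is as a tree of timed operations (spans) that share the same trace ID, where each operation points to the one that triggered it. The sketch below is a simplified, hand-rolled model, not the data format of any particular tracing tool. All names and timings are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the first operation in the trace
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def render(spans, parent_id=None, depth=0):
    """Return the trace as indented lines, children listed under their parent."""
    lines = []
    children = sorted((s for s in spans if s.parent_id == parent_id),
                      key=lambda s: s.start_ms)
    for span in children:
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


trace = [
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "b", "a", "auth-service", 5, 25),
    Span("t1", "c", "a", "payment-service", 30, 110),
    Span("t1", "d", "c", "db query", 40, 90),
]
tree = render(trace)
print("\n".join(tree))
```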

Uptime SLA

An agreement between a service provider and its users that specifies the amount of time a system is expected to be up and running without issues.

Uptime

The amount of time a system, service, or application is available and operational, without any interruptions or downtime.

User Sessions

A user session is the period of time during which a user interacts with an application, from the moment they log in until they log out or they leave the application.

Wide event

A wide event is a comprehensive data structure that contains all relevant details of a particular request or transaction within a single event, rather than distributing them across multiple events or logs.

Zero-Downtime Deployment

Zero-Downtime Deployment is the process of updating an application without causing any disruption to the end users. This is achieved by gradually shifting user traffic from the old version to the new version, ensuring a transition without disrupting the live service.

Zero Downtime

The ability of a system to undergo updates or changes without any interruption to its operation or service. This ensures that users can continue to access and use the system while updates are being implemented.

Put your observability knowledge to work with Baselime

Now that we've covered the fundamentals of observability, it's time to put your knowledge to work to improve the reliability of your applications.

Whether you're just getting started with observability or a seasoned veteran, lean on this glossary when instrumenting your app, communicating with your team, or selecting an observability provider.

At Baselime we have taken our experience building internal observability tools at fast-growing startups to build the modern observability solution we've always needed. If you're ready to put your knowledge to work, give us a try for free :).

PS: If you want to add a term, further details or provide some feedback, you can reach us at [email protected].

