Explain Like I'm 5
Observability Glossary

Observability terms and concepts you need to know

Welcome to the ELI5 Observability Glossary! Whether you're new to observability or a seasoned DevOps engineer, you occasionally come across observability concepts and terms that leave you puzzled: "What does this even mean?"

We've all been there.

Observability is full of expressions that we need to understand to effectively communicate, from common terms like "distributed tracing" to concepts like "cardinality" and "dimensionality".

We believe that by breaking down the technical jargon, we can make observability more accessible and understandable for everyone. This glossary explains every concept "like I'm 5".

Application Performance Monitoring

Application Performance Monitoring (APM) is the practice of monitoring the performance, availability, and end-user experience of applications. It involves tracking metrics such as response time, error rates, and resource utilization to identify performance issues.

Aggregation

Aggregation is the process of taking a large dataset and simplifying it into a set of pre-defined parameters such as average, sum or count. It combines and summarizes data into a coherent view and is mostly used to reduce the cost of processing and storing the entire dataset.
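
As a minimal sketch, assuming a handful of made-up request durations in milliseconds, the Python snippet below reduces raw values to a count, a sum, and an average. The `aggregate` function and the sample values are illustrative and not tied to any specific tool.

```python
def aggregate(durations_ms):
    """Summarize raw values into a few pre-defined parameters."""
    count = len(durations_ms)
    total = sum(durations_ms)
    average = total / count if count else 0.0
    return {"count": count, "sum": total, "average": average}


# Hypothetical request durations, in milliseconds.
durations = [120, 80, 100, 300]
summary = aggregate(durations)
print(summary)
```

Storing only the summary is cheaper than storing every value, at the price of losing the individual data points.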

Audit Trail

An audit trail is a record of all the actions and changes made within an application, which enables traceability.

Alerting

Alerting is the process of sending notifications to developers when predefined conditions are met in their application, such as when an error occurs. Alerting is essential for maintaining the health and performance of an application.

Autoscaling

Autoscaling is the ability of an application to automatically adjust its resources based on real-time demand, ensuring optimal performance and cost-efficiency. This enables applications to dynamically scale up or down in response to changes in traffic, without manual intervention.
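
One simple way to picture this is a proportional rule: if each instance is carrying more load than you want, add instances; if less, remove some. The sketch below assumes a single load number per instance (for example requests per second) and hypothetical minimum and maximum bounds. Real autoscalers use richer signals and cooldown periods.

```python
import math


def desired_instances(current_instances, current_load, target_load,
                      min_instances=1, max_instances=10):
    """Scale the instance count in proportion to observed vs. target load."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_instances * current_load / target_load)
    # Never go below the minimum or above the maximum.
    return max(min_instances, min(max_instances, wanted))


# 4 instances each seeing 90 req/s, while we want about 60 req/s each.
print(desired_instances(4, 90, 60))
```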

Availability

Availability is the measure of how consistently an application is operational and accessible to users without interruptions or downtime.

Anomaly Detection

Anomaly detection is the process of identifying unexpected or irregular patterns in telemetry data that are far from the norm, which usually means something is off and requires human attention.
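
A very basic way to spot values that are "far from the norm" is to check how many standard deviations each value sits from the average. The sketch below assumes made-up latency values and a hypothetical threshold of 2. Production systems usually rely on more robust techniques that account for trends and seasonality.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values are identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical latencies in milliseconds; the last one is suspicious.
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 500]
print(find_anomalies(latencies_ms))
```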

Azure Functions

Azure Functions is a serverless computing service provided by Microsoft Azure, enabling developers to run event-triggered code without the need to manage infrastructure. This service enables you to focus on writing the code necessary to handle specific events or triggers, such as HTTP requests, database changes, or file uploads.

AWS Lambda

AWS Lambda is a serverless computing service provided by Amazon Web Services that lets you run code without provisioning or managing servers. It automatically scales and manages the infrastructure required to run your code, enabling you to focus on writing the code itself.

Cardinality

Cardinality is the number of unique values in a dataset. For example, it is the number of unique values that a specific field can take in a set of logs.
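
To make this concrete, the sketch below counts the unique values of a field across a few made-up log records. The `cardinality` helper and the field names are hypothetical.

```python
def cardinality(records, field):
    """Count the unique values that `field` takes across the records."""
    return len({record[field] for record in records if field in record})


# Hypothetical log records.
logs = [
    {"user_id": "u1", "status": 200},
    {"user_id": "u2", "status": 200},
    {"user_id": "u3", "status": 500},
    {"user_id": "u1", "status": 200},
]

print(cardinality(logs, "user_id"))  # high-cardinality field: grows with users
print(cardinality(logs, "status"))   # low-cardinality field: few possible values
```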

Cold Start

When a serverless function is invoked for the first time, or after it has been idle for a while, it experiences a cold start: the invocation takes longer than usual to respond. The cold start comes from the underlying architecture of serverless functions. It's usually due to the initialisation of the micro virtual machine where the serverless function code will run.

Change Failure Rate

The Change Failure Rate measures the percentage of changes to an application that result in failure. This metric helps you understand the impact of changes on application stability and reliability, enabling you to identify areas for improvement and reduce the risk of future failures.
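
The calculation itself is a simple ratio. The sketch below assumes you already know how many changes were deployed and how many of them failed; the function name and numbers are illustrative.

```python
def change_failure_rate(total_changes, failed_changes):
    """Percentage of changes that resulted in a failure."""
    if total_changes == 0:
        return 0.0
    return 100 * failed_changes / total_changes


# Hypothetical month: 40 deployments, 3 of which caused a failure.
print(change_failure_rate(40, 3))
```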

Container

A container is a lightweight, standalone, and executable package that includes everything needed to run a program, including the code, runtime, system tools, system libraries, dependencies, and settings. It enables developers to run applications consistently across different environments.

Capacity Planning

Capacity planning is the practice of predicting and managing the amount of computing resources needed to support an application workload, and ensuring that it can handle current and future demands without performance degradation.

Continuous Deployment

The practice of automatically deploying code changes to production as soon as they are ready, without the need for manual intervention.

Canonical logs

Canonical logs are logs you add at the end of every request or transaction. Each of these logs contains all the important parameters about the request, for instance, user ID, duration, response status code, etc. Adding all these details in a single log line enables you to debug faster and perform more complex aggregations without the need to join multiple log lines together.
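
Here is a minimal sketch of the idea, assuming a hypothetical `handle_request` function: details are collected while the request is handled, then written as one JSON log line at the end. The field names are examples, not a required schema.

```python
import json
import time


def handle_request(user_id, path):
    """Handle a request and emit one canonical log line at the end."""
    start = time.perf_counter()
    context = {"user_id": user_id, "path": path}

    # ... real request handling would happen here and enrich `context` ...
    context["status_code"] = 200
    context["items_returned"] = 3

    context["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
    line = json.dumps(context)
    print(line)  # one log line with everything about the request
    return line


log_line = handle_request("user-123", "/orders")
```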

Continuous Integration

The practice of frequently integrating code changes into a repository, where automated tests are run to validate these changes. This enables early detection of issues and ensures that the codebase is always in a working state.

CPU Utilization

The measure of how much of a computer's central processing unit (CPU) is being used. It indicates the amount of work the CPU is doing and can help identify performance bottlenecks.

Continuous Profiling

The process of continuously analyzing the performance of a program with metrics such as CPU utilization or the frequency and duration of function calls, to identify bottlenecks, memory leaks, and other areas for optimization.

Canary Release

A deployment strategy that enables rolling out new features to a small subset of users before releasing them to the entire user base. This method helps to mitigate risks by testing the new functionality in production, enabling you to monitor its performance and stability before a full release.

Elasticity

The ability of a system to adapt to changes in workload by automatically provisioning and de-provisioning resources in response to demand.

Edge Computing

Edge Computing is the practice of handling requests and processing data closer to the client, typically at the "edge" of the network. This usually results in lower latency, making it ideal for applications that require real-time or near-real-time processing.

Error Rate

The error rate is the frequency at which errors occur in an application. It is a critical metric for understanding the overall health and performance of an app, as it directly impacts user experience.

ELK Stack

The ELK Stack is a set of three open-source tools: Elasticsearch, Logstash, and Kibana, used for log management and data visualization. It enables developers to collect, parse, store, and analyze large volumes of log data for troubleshooting and monitoring purposes.

Event Correlation

The process of identifying and associating related events across multiple datasets or data types. This association enables you to understand each event in the context of the overall system.
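
A common way to associate related events is to group them by a shared identifier. The sketch below assumes every event carries a hypothetical `request_id` field; the events and the `correlate` helper are made up for illustration.

```python
from collections import defaultdict


def correlate(events, key="request_id"):
    """Group events that share the same value for `key`."""
    grouped = defaultdict(list)
    for event in events:
        if key in event:
            grouped[event[key]].append(event)
    return dict(grouped)


# Hypothetical events coming from two different sources.
events = [
    {"request_id": "r1", "source": "api", "message": "request received"},
    {"request_id": "r2", "source": "api", "message": "request received"},
    {"request_id": "r1", "source": "database", "message": "query timed out"},
    {"request_id": "r1", "source": "api", "message": "responded with 500"},
]

by_request = correlate(events)
for event in by_request["r1"]:
    print(event["source"], "-", event["message"])
```

Seen alone, the 500 response says little. Next to the database timeout from the same request, the story becomes clear.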

Error Tracking

Error tracking is the process of identifying, recording, and monitoring errors or bugs in applications. It involves capturing metadata about errors, such as the type of error, when it occurred, and the circumstances surrounding it. This data is important to diagnose and resolve the issues.

Event

An event is simply "something happening" in an application. It could be a user action, an error, or a state change.

Event-Driven Architecture

A software design pattern that promotes the production, detection, consumption, and reaction to events.

Idempotence

Idempotence is the property of an operation that can be applied multiple times without changing the result beyond the initial application. No matter how many times you do an idempotent operation, you'll still get the same result.
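
The sketch below contrasts two hypothetical helpers: adding a tag to a set is idempotent, while appending it to a list is not.

```python
def add_tag(tags, tag):
    """Idempotent: adding the same tag again leaves the set unchanged."""
    return tags | {tag}


def append_tag(tags, tag):
    """Not idempotent: every call makes the list longer."""
    return tags + [tag]


once = add_tag(set(), "urgent")
twice = add_tag(once, "urgent")
print(once == twice)  # same result after the second application

print(append_tag(append_tag([], "urgent"), "urgent"))  # the tag appears twice
```

This matters when a request is retried after a timeout: an idempotent operation can be safely repeated.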

Incident Analysis

Incident Analysis is the process of examining and evaluating events or issues that have caused disruptions or failures in an application. It involves identifying the root cause of incidents and understanding the impact they have on the system's performance and reliability.

Incident Response

The process of reacting to and resolving unexpected issues or problems within a software system, ensuring that the system returns to a stable and functional state as quickly as possible.

Incident

An incident is an unexpected disruption in the normal operation of an application. Incidents can range from complete service outages to performance degradation, and they often require immediate attention to restore normal functionality.

Infrastructure as Code (IaC)

Infrastructure as Code is the practice of managing and provisioning cloud infrastructure through machine-readable configuration files, rather than using manual interactive configuration tools (commonly called ClickOps).

Incident Management

Incident Management is the process of identifying, responding to, and resolving incidents that occur within an application. It involves detecting when something goes wrong, alerting the appropriate teams, working to restore normal operations as quickly as possible, and implementing measures to prevent the same issue from occurring in the future.

IO Operations

IO Operations refers to input/output operations in a computer system, such as reading from or writing to a file or a network socket.

Instrumentation

The act of adding code to a software application to collect telemetry data about its behavior and performance.
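
As a rough sketch, instrumentation can be as small as a decorator that records how long a function took and whether it succeeded. The `instrumented` decorator, the in-memory `recorded` list, and `fetch_orders` are all hypothetical; in practice you would typically use an instrumentation library and send the data to an observability backend.

```python
import functools
import time

recorded = []  # hypothetical in-memory store for telemetry


def instrumented(func):
    """Record the duration and outcome of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        success = False
        try:
            result = func(*args, **kwargs)
            success = True
            return result
        finally:
            recorded.append({
                "name": func.__name__,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "success": success,
            })
    return wrapper


@instrumented
def fetch_orders(user_id):
    return [f"order-for-{user_id}"]


fetch_orders("user-123")
print(recorded[-1])
```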

Latency

Latency is the delay between a request being made and a response being received. It is a key metric as it directly impacts user experience and system performance.

Lead Time for Changes

The Lead Time for Changes refers to the time it takes for a committed code change to be reviewed and deployed into production. This metric measures the time from when the code is committed to when it is running in a production environment, providing insight into the efficiency of the development, review and deployment processes.

Load Balancing

Load Balancing is the process of distributing incoming network traffic across multiple servers to ensure no single server is overworked. The goal is to optimize resource utilization and prevent server overload.
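
One of the simplest distribution strategies is round-robin: send each new request to the next server in the list, then start over. The sketch below is illustrative only; the class name and server names are made up, and real load balancers also consider server health and current load.

```python
import itertools


class RoundRobinBalancer:
    """Hand out servers one after the other, looping forever."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = itertools.cycle(list(servers))

    def next_server(self):
        return next(self._servers)


balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
picks = [balancer.next_server() for _ in range(5)]
print(picks)
```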

Logging Levels

The different levels at which log messages can be generated, each indicating the severity or importance of the message.
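
For instance, Python's standard `logging` module defines the levels DEBUG, INFO, WARNING, ERROR, and CRITICAL, in increasing order of severity. In the sketch below, a logger with the hypothetical name `checkout` is set to WARNING, so less severe messages are dropped.

```python
import logging

logger = logging.getLogger("checkout")  # hypothetical logger name
logger.setLevel(logging.WARNING)        # ignore anything below WARNING
logger.addHandler(logging.StreamHandler())

logger.debug("cart contents loaded")        # dropped
logger.info("user started checkout")        # dropped
logger.warning("payment took 4 seconds")    # emitted
logger.error("payment provider timed out")  # emitted
```

Other languages and libraries use slightly different level names, but the idea is the same.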

Log Aggregation

The process of collecting and centralizing log data from various sources to make it easier to search, analyze, and visualize.

Logging

Logging is the process of recording events, actions, or messages that occur within an application, typically for troubleshooting or analysis purposes.

Log Management

Log Management is the process of collecting, storing, and analyzing log data generated by applications, servers, and other systems. It involves organizing and indexing logs to make them easily searchable and filterable, enabling developers to troubleshoot issues and gain insights into the behavior of their application.

Loki

Loki is an open-source log aggregation system that enables developers to easily collect, store, and query log data in a microservices architecture. It provides a cost-effective and efficient way to monitor and troubleshoot applications by indexing only a small set of labels for each log stream, rather than the full log text, while keeping the logs searchable.

Real-time Alerting

Real-time alerting is the process of receiving immediate notifications about critical issues or events in your application, enabling you to quickly respond.

Reliability

Reliability is the ability of an application to consistently 'do what it says on the tin'. It's the ability of the application to perform as expected, even when conditions are not optimal. It involves minimizing the occurrence of failures and ensuring that the system can recover quickly when failures happen.

Recovery

Recovery is the process of restoring an application to a stable and functional state after a failure or incident.

Resiliency

Resiliency is an app's ability to gracefully handle and recover from failures, ensuring minimal impact on the overall functionality. Building resilient applications means designing apps and infrastructure that can be fault-tolerant and responsive to unexpected issues.

Root Cause Analysis

Root Cause Analysis is the process of identifying the underlying reason for an error or performance issue in an application. It often involves investigating logs, metrics, traces and wide events to identify the cause of the issue, instead of fixing just its symptoms.

Response Time

The time it takes for a system to respond to a request, typically measured from the moment the request is sent until the response is received.

Real User Experience Monitoring (RUM)

Real User Experience Monitoring (RUM) is the process of tracking and analyzing how real users interact with an application or website in real-time.

Sampling

Sampling is the process of collecting and analyzing only a subset of a larger dataset, in order to make inferences or observations about the larger dataset. Sampling is about selecting a sample that represents the key features of the entire dataset without the cost associated with processing and storing the entire dataset.
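
One possible approach, sketched below, is to decide based on a hash of an identifier, so the same ID always gets the same decision. The `keep_trace` function, the `trace-N` IDs, and the 10% rate are assumptions for illustration; this is not how any particular vendor implements sampling.

```python
import hashlib


def keep_trace(trace_id, rate):
    """Keep roughly `rate` (between 0.0 and 1.0) of all trace IDs.

    The decision depends only on the ID, so every event that belongs
    to the same trace is either kept or dropped together.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```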

Serverless

Serverless is a cloud computing model that enables developers to build and run applications without managing the infrastructure. It enables developers to focus on writing code and delivering value without worrying about servers, virtual machines, or containers.

Scalability

Scalability is the ability of an application to handle an increasing load, such as a growing number of users or requests, without experiencing downtime or performance degradation.

Service Level Agreement (SLA)

A Service Level Agreement (SLA) is a commitment between a service provider and a customer that outlines the level of service that the customer can expect. This includes details such as uptime, response time, and other performance metrics.

Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a measure of the performance of a service, such as response time or error rate, that is used to assess its quality and reliability. SLIs are essential for understanding how well a service is meeting its objectives and can be used to set Service Level Objectives (SLO) and Service Level Agreements (SLA).

Service Level Objective (SLO)

A Service Level Objective (SLO) is a specific target for a service or application, often expressed as a percentage of uptime or availability. It is used to measure and monitor the performance of the application. Alerting on SLOs usually provides a better developer experience than alerting on individual errors.
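
A target expressed as a percentage translates directly into how much downtime you can afford, often called the error budget. The sketch below assumes a 30-day window; the function name is made up.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of downtime allowed by an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
print(f"{error_budget_minutes(99.9):.1f} minutes")
```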

Span

A span is the primary building block of a distributed trace. It is a specific unit of work within a distributed system, typically representing a single operation.
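
To make the idea tangible, here is a simplified model of a span as a small data structure. The field names are assumptions for illustration and do not mirror any specific tracing library. Spans in the same trace share a `trace_id`, and a child span points to its parent through `parent_id`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None  # None means this is the root span

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


# A hypothetical trace: an HTTP request that runs one database query.
root = Span("t1", "s1", "GET /orders", start_ms=0, end_ms=120)
child = Span("t1", "s2", "SELECT orders", start_ms=10, end_ms=90, parent_id="s1")

for span in (root, child):
    print(span.name, span.duration_ms, "ms")
```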

Sidecar

A sidecar is a pattern where a secondary process or container is attached to the main application to provide additional functionality or support. Sidecars are usually used for logging, security, or networking, without directly impacting the core functionality of the main application.

Structured Logging

A method of logging that organizes log messages into a format that is easily searchable, filterable, and queryable. This enables you to extract valuable insights and troubleshoot issues faster.
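
The sketch below compares a free-text log line with a structured one written as JSON. The function and field names are hypothetical. With the structured version, finding slow checkouts becomes a simple filter on a field instead of text parsing.

```python
import json


def unstructured(user_id, duration_ms):
    return f"User {user_id} checked out in {duration_ms}ms"


def structured(user_id, duration_ms):
    return json.dumps(
        {"event": "checkout", "user_id": user_id, "duration_ms": duration_ms}
    )


print(unstructured("u1", 120))
print(structured("u1", 120))

# Filtering structured logs: which checkouts took longer than 500 ms?
lines = [structured("u1", 120), structured("u2", 950), structured("u3", 80)]
records = [json.loads(line) for line in lines]
slow = [record for record in records if record["duration_ms"] > 500]
print(slow)
```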

Put your observability knowledge to work with Baselime

Now that we've covered the fundamentals of observability, it's time to put your knowledge to work to improve the reliability of your applications.

Whether you're just getting started with observability or a seasoned veteran, lean on this glossary when instrumenting your app, communicating with your team, or selecting an observability provider.

At Baselime, we have taken our experience building internal observability tools at fast-growing startups to build the modern observability solution we've always needed. If you're ready to put your knowledge to work, give us a try for free :).

PS: If you want to add a term, further details or provide some feedback, you can reach us at [email protected].
