Why is logging so damn hard?

05 September, 2023

Boris Tane

Founder @Baselime

I have spent most of my career managing logging services at various scales, and I can tell you that almost everybody gets this wrong, myself included. It's a pain at every stage: the logging format, the transport pipelines, short- and long-term storage, and querying.

In this post I want to share what I’ve seen not work at all (like, absolutely not) so you can avoid the pitfalls I fell into in the past, and what I hope to see more of in application logging in the future.

Logs are a pain

It’s counter-intuitive. Logs are essentially little post-it notes you sprinkle around your codebase such that you can come back in the future and know what happened at runtime. It’s the first thing we learn when we learn programming: how to log.

console.log('Hello, World!');

Managing logging for small applications is straightforward, but as the application grows, it quickly becomes an endless time sink, where small failures accumulate until the logging platform becomes unusable. I’ve seen it countless times:

Logs are dropped for a few hours
The query engine stops working
A new log line breaks the parser
The ingestion pipelines get clogged
The disk is full
Someone logged in a loop with 2000 iterations per request
etc.

What usually starts as a “Hey, let’s get our logs in order so we can debug production issues” quickly becomes a significant engineering investment, with maintenance costs that keep going up and up.

What are logs for, actually?

Before we all decided to build microservices and distributed systems, logs were printed to standard output or to a file, and system administrators used them to diagnose server malfunctions.

But today, logs are, for many teams, the only way of knowing what is actually going on during the constant choreography of internal requests, message queues, load balancers, etc.

It’s common to see teams add a unique ID to a request, manually propagate it through the system and add it to every log in the path. It becomes the only way to accurately know what a user did, if a background process was completed, or even the number of requests processed.
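The request ID pattern described above can be sketched in Python with the standard library's `contextvars` and `logging` modules. This is a minimal illustration, not a production setup: the names `request_id_var`, `RequestIdFilter`, `build_logger` and `handle_request` are hypothetical, and it only covers a single process. Passing the ID between services (for example in a header) is still up to you.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical context variable holding the ID of the request being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("request_id_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    # Reuse the caller's ID when one was propagated, otherwise mint a new one.
    request_id = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("request received")
        logger.info("request completed")
    finally:
        # Restore the previous value so IDs never leak between requests.
        request_id_var.reset(token)
    return request_id


# An in-memory stream stands in for standard output.
output = io.StringIO()
demo_logger = build_logger(output)
handle_request(demo_logger, incoming_id="req-123")
```

Every log line written while the request is being handled carries the same `request_id`, without each call site having to pass it explicitly.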

This usually starts as a tool that developers use to debug failures in the system, but everyone else notices that the engineering team can know with relative accuracy what is going on and they also want the same powers.

Logging is now used by developers to debug failed processes, and by everyone and anyone else to understand anything and everything about the system.

Customer support needs logs to appropriately do their job, operations need logs for auditing, business analytics need logs for data aggregation and reporting, etc. It’s not about having a stream of print statements to standard output anymore, it becomes a significantly bigger, mission-critical project, but without adequate resources.

Most teams turn to the ELK stack at this point. Spin up an Elasticsearch cluster for storage, Logstash for aggregation and transport, and Kibana for visualisation. I’ve done this more times than I’m willing to admit. It always sounds simple, but in reality, you’ve just signed yourself up for a lot of pain.

Managing an ELK stack is no easy feat, and because it’s nobody’s actual job, it’s bound to fail. There are always a couple of people passionate about observability who end up looking after this mission-critical piece of architecture alongside their regular duties.

Now, the entire company relies on logs, for business intelligence, customer support, compliance and auditing, and actually debugging the application. But nobody is actually tasked with looking after it, improving it, and ensuring it serves its purpose.

It’s a mess. After multiple instances of dropped logs, downtime and a broken parser, it’s time to sign up for a SaaS and make it someone else’s problem.

The perils of choosing a SaaS vendor

I’ve been through all of this multiple times, and now I’m building a SaaS observability product myself. There are a few things you need to know about logs and observability vendors.

It’s expensive. It doesn’t matter how you look at it, logs in the cloud are expensive. A lot of vendors (including myself) will sometimes try to convince you that it’s not, but it is. Transporting petabytes of data in real-time, indexing them and storing them has a cost. If your application is chatty and has a lot of requests daily, it can become a problem.

You need to be prepared for this. Either you dedicate a lot of engineering resources to managing your own logging platform, or you spend on a vendor.

Ok, so now you’ve decided to go with a vendor. The question is, which one? There are quite a few options, from the established players to the hundreds of startups looking for their own piece of the pie.

The most important criteria beyond cost should be:

1 - High Cardinality and Dimensionality

These concepts are foreign to most teams so it’s probably worth explaining with a detailed example. Imagine you log an HTTP request that makes a bunch of database calls.

High cardinality means that in your logs, you can have a unique userId or requestId (which can take over a million distinct values). Those are high cardinality fields. Your vendor should enable you to query against any specific value of a high cardinality field so that you can narrow down your search to a specific user or request.
High dimensionality means that in your logs, you can have thousands of possible fields. You can record each HTTP header of your request and the details of each database call. Any log can be a key-value object with thousands of individual keys. Your vendor should index every single one of the keys and enable you to query against every one of them.
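To make the two terms concrete, here is a small Python sketch of a log event for one HTTP request and a naive filter over such events. All field names and values are made up for illustration, and the `matches` helper is hypothetical. A real vendor would answer these queries from an index rather than by scanning a list.

```python
# Hypothetical log event for one HTTP request.
event = {
    "message": "GET /orders completed",
    "requestId": "req-8f14e45f",   # high cardinality: unique per request
    "userId": "user-1048576",      # high cardinality: unique per user
    "http.method": "GET",
    "http.status": 200,
    "http.header.user-agent": "curl/8.1.2",
    "db.calls": 3,
    "db.duration_ms": 42,
}


def matches(log_event: dict, criteria: dict) -> bool:
    """Return True when every requested field has the requested value."""
    return all(log_event.get(key) == value for key, value in criteria.items())


# A second event for a different user whose request failed.
events = [
    event,
    {**event, "requestId": "req-c9f0f895", "userId": "user-7", "http.status": 500},
]

# High cardinality: narrow the search to one specific user.
by_user = [e for e in events if matches(e, {"userId": "user-7"})]

# High dimensionality: any key can be used as a filter, not only a fixed few.
failures = [e for e in events if matches(e, {"http.status": 500})]
```

The more keys each event carries, the more questions you can ask later without redeploying to add another log line.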

2 - Query speed

Your vendor should offer an SLA of under 5 seconds for a query over 500 million log lines. When debugging real production outages, you’re typically already under immense pressure and you want the ability to ask questions and get answers fast.

3 - Real-time data ingestion

When logs are produced in your systems, they must be available in your SaaS vendor within seconds. You definitely do not want to be looking at stale data during an incident. You don’t want your alerts to be 10 minutes late because your vendor has a 10-minute delay in ingesting data. You do not want to make decisions on data that is older than the last time you checked your phone.

4 - Metadata annotations

When your system produces logs, you should be responsible for adding only the things specific to your application to your logs: userId, requestId, inventoryCount, etc. Your provider should automatically add all the metadata necessary to contextualise your logs: the hostname, host ID, serverless function name, service name, cloud account and region, etc. This will enable you to efficiently identify patterns based on those parameters, without making it the responsibility of your team to manually add each and every one of these fields.
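As a rough sketch of that split in responsibilities, the Python below keeps application fields and platform metadata apart. The `collect_platform_metadata` and `enrich` functions, and the `SERVICE_NAME` and `CLOUD_REGION` environment variables, are assumptions made for the example. With a real provider this step would happen in their agent or ingestion pipeline, not in your code.

```python
import os
import socket


def collect_platform_metadata() -> dict:
    """Fields a provider's agent could attach; the names are hypothetical."""
    return {
        "hostname": socket.gethostname(),
        "service": os.environ.get("SERVICE_NAME", "unknown"),
        "region": os.environ.get("CLOUD_REGION", "unknown"),
    }


def enrich(app_event: dict, metadata: dict) -> dict:
    # Application fields win on key collisions, so nothing the developer
    # logged is overwritten by the automatic metadata.
    return {**metadata, **app_event}


# The application only logs what is specific to it.
app_event = {"message": "stock reserved", "userId": "user-42", "inventoryCount": 17}
enriched = enrich(app_event, collect_platform_metadata())
```

The application code never mentions hostnames or regions, yet every stored event can be grouped by them.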

What can you do now?

Whether built in-house or using a SaaS vendor, there are a few things you need to know and do with your logs to keep your sanity.

Simplify your log levels

There are way too many log levels, and it’s confusing.

Every language and framework has a set of log levels that are slightly different from each other.

JavaScript / Node.js

console.debug("This is a debug log");
console.log("This is a normal log, whatever that means");
console.info("This is an info log, not to be confused with the normal log");
console.warn("This is a warning log");
console.error("This is an error log");

Syslog

emerg or panic: Emergency, the system is unusable
alert: Alert, action must be taken immediately
crit: Critical, critical conditions
err or error: Error, error conditions
warning or warn: Warning, warning conditions
notice: Notice, normal but significant conditions
info: Informational, informational messages
debug: Debug, debug-level messages

Python

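import logging

logger = logging.getLogger(__name__)

logger.info("This is an info log")
logger.warning("This is a warning log")
logger.error("This is an error log")
logger.critical("This is a critical log")
logger.exception("This is an exception log")  # ERROR level plus traceback; meant for except blocks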

What’s the difference between an error and a critical log?

What is a debug log? Really, how do I know I want this to be printed during development and not printed in production? Why is this information important only during development and useless in production? If it truly is, why don’t I delete it before deploying it to production?

My approach with logs has been to limit myself to using only info and error logs. All errors should be looked at, and all info logs should be relevant and meaningful.
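One way to hold yourself to two levels is a thin wrapper that simply does not offer the others. The sketch below builds on Python's standard `logging` module. `TwoLevelLogger` is a hypothetical name, and whether a wrapper or a team convention is the better tool depends on your codebase.

```python
import io
import logging


class TwoLevelLogger:
    """Thin wrapper that exposes only info() and error()."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger

    def info(self, message: str, *args) -> None:
        self._logger.info(message, *args)

    def error(self, message: str, *args) -> None:
        self._logger.error(message, *args)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
base = logging.getLogger("two_level_demo")
base.setLevel(logging.INFO)
base.propagate = False
base.handlers.clear()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
base.addHandler(handler)

log = TwoLevelLogger(base)
log.info("order %s placed", 42)
log.error("payment failed for order %s", 42)
```

Calling `log.debug(...)` or `log.warning(...)` raises an `AttributeError`, which turns the level debate into a code review question rather than a style preference.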

Once you’ve conquered which log levels to use, you have to agree on a ✨ logging format ✨

Pick a logging format and stick to it

I’ll say it outright, pick a format and stick to it. That’s it.

I have seen it countless times, a team decides they need to get their logs in order. They set aside some time to decide on the logging format. They create a document where everyone can give their opinion.

Six weeks later, the document is a mess that nobody understands anymore. The pros and cons of each and every logging format on the planet are listed: some quite obvious, like JSON or key-value pairs, and some more obscure and language-specific, like why the Elixir logger metadata format is the best in the world.

It doesn’t matter. Pick one that enables you to log high cardinality data with as much metadata as possible, is flexible enough to work across your codebase without custom shenanigans and, more importantly, works with your stack. My preferred logging format is JSON. It has downsides: you can’t enforce a standard schema without writing custom loggers. But it works across all languages, and developers are familiar with how JSON works.
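For reference, here is roughly what JSON logging can look like with Python's standard `logging` module. It is a minimal sketch: the `JsonFormatter` class and the convention of passing structured data under a `fields` key in `extra` are assumptions for this example, not a standard, and they illustrate the custom-logger downside mentioned above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: structured data arrives via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps non-serialisable values from breaking the log line.
        return json.dumps(payload, default=str)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
logger = logging.getLogger("json_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers.clear()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.info("order placed", extra={"fields": {"userId": "user-42", "orderId": 1001}})

# Any JSON parser can read the line back, whatever language wrote it.
parsed = json.loads(stream.getvalue())
```

Each line is one self-describing object, so adding a new field never breaks a downstream parser.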

Set the right expectation

If you go down the path of building your own logging platform without the appropriate engineering resources, you must set the right expectations. It cannot be a mission-critical system with 99.999% availability. It’s not reasonable to expect a system that is not maintained full-time by your team to have better availability than your main applications.

If you need higher availability, subscribe to a SaaS vendor and make it their problem.

Compliance logs and application logs should not live in the same system

If you have logs that need to be stored for audit and compliance, mixing them with your application logs will give you a lot of issues. The value of logs decreases exponentially as they age, whereas the value of compliance and audit traces remains the same despite their age.

It’s usually no big deal to drop application logs that are 12 months old, whereas audit logs usually need to be kept for years. It’s a pain to miss logs for a few minutes a month, but you must have all your audit logs all the time.

Process them separately, and store them separately, with different retention periods.
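In code, the separation can start with two loggers that never share a handler. The Python sketch below uses in-memory streams as stand-ins for two independent storage systems. The retention values are hypothetical placeholders (the post only says roughly 12 months versus years), and the `make_logger` helper is made up for the example.

```python
import io
import logging

# Hypothetical retention periods; pick values that match your own obligations.
APP_RETENTION_DAYS = 365
AUDIT_RETENTION_DAYS = 365 * 7


def make_logger(name: str, sink) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Keep the two streams from mixing through the root logger.
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(sink)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Each sink stands in for a separate pipeline and storage system.
app_sink, audit_sink = io.StringIO(), io.StringIO()
app_log = make_logger("app", app_sink)
audit_log = make_logger("audit", audit_sink)

app_log.info("cache miss for product 17")
audit_log.info("user-42 exported customer records")
```

Because the audit logger has its own handler, an outage or a dropped batch in the application pipeline does not touch the audit trail.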

Why you shouldn't be logging anyways

Logs have a terrible signal-to-noise ratio. 99.99% of logs will never be looked at. Observability without action is just storage. You’ll be paying a vendor or maintaining a high availability system essentially “just” to store logs.
Logs don’t always capture all you want in the first instance. It’s extremely common to encounter a defect in production, but the logs don’t have the full context because this specific failure mode wasn’t anticipated. You have to go back and add more logs, redeploy and hope the same issue happens again.
Logs don’t have default mechanisms to capture requests across multiple microservices. In your modern microservices application, you will need to manually pass a unique request ID throughout all the microservices to get the ability to reconstruct the full flow of a transaction. It’s easier said than done. I’ve seen these custom ID propagation systems fail way too often.
Logs don’t have default sampling mechanisms. At scale, you might not want to store every info log, given that the system is working as expected. You might want to sample your logs to capture only the ones for the requests that actually resulted in unexpected behaviour. It’s difficult to do without a lot of custom code.
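To give a sense of the custom code the last point refers to, here is one possible Python sketch: buffer a request's log lines in memory and store them only if the request logged an error. `RequestLogBuffer` is a hypothetical class, and the approach assumes a request's logs fit comfortably in memory until the request finishes.

```python
class RequestLogBuffer:
    """Hold a request's log lines and emit them only if the request failed."""

    def __init__(self, emit) -> None:
        self._emit = emit      # callable that stores one formatted line
        self._lines = []
        self._failed = False

    def info(self, message: str) -> None:
        self._lines.append(("INFO", message))

    def error(self, message: str) -> None:
        self._failed = True
        self._lines.append(("ERROR", message))

    def finish(self) -> int:
        """Flush on failure, discard on success; return the number of lines kept."""
        kept = self._lines if self._failed else []
        for level, message in kept:
            self._emit(f"{level} {message}")
        self._lines = []
        return len(kept)


# A plain list stands in for the logging backend.
stored = []

ok = RequestLogBuffer(stored.append)
ok.info("request received")
ok.info("request completed")
ok.finish()   # healthy request: nothing is stored

bad = RequestLogBuffer(stored.append)
bad.info("request received")
bad.error("database timeout")
bad.finish()  # failed request: the full context is stored
```

The healthy request costs nothing to store, while the failed one keeps its info lines as context for the error.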

There’s a better way: distributed tracing. Distributed traces capture not only what happens inside a single microservice, but also what happens across services. This gives you the best of both worlds: a high-level view of your application, as well as the details of how each service works internally. You also usually get sampling, request propagation and automatic capture of all request payloads out of the box. I’ll write about this another day.

Conclusion

These are my thoughts on logging. It’s always difficult, it always takes a lot of time, and it always costs a lot. But there are things you can do today to improve your logging experience. Distributed tracing is the current industry answer, and you should definitely try it out if you haven’t yet.