Logging Best Practices: Ship Reliable Software Faster
Master logging best practices: structured logs, levels, security, and retention. Ship reliable software and debug faster.

You're probably here because your logs already betrayed you once.
A bug hits production. Users complain. You open the logs and get a wall of noise: random print statements, stack traces with no request context, half the messages duplicated, and the one line you need buried under startup chatter. The system is running, but you can't explain what it just did.
That's the moment when logging stops being a boring implementation detail. It becomes a product feature for your future self.
Good logs don't exist to satisfy a checklist. They exist so that the version of you handling an outage, a customer complaint, or a failed deploy can answer basic questions fast. What happened? Who did it affect? Which code path ran? Did the system recover, retry, or fail unnoticed? If your logs can't answer those questions, they're expensive decoration.
Why Your Logs Fail You When It Matters Most
Most bad logging starts with good intentions. A developer adds a few console.log calls during setup. Another adds some exception output before a launch. A third copies a pattern from an old service. Months later, production has become a stream of disconnected thoughts.
At that point, the problem isn't that you have no logs. It's that you have logs with no design.
When teams say “we have observability,” they often mean they bought a tool. That's not the same thing. Logging best practices have changed because software changed. What used to be simple error recording now sits inside a broader observability and governance model, with centralized collection, correlation, structured fields, log-level discipline, retention limits, and masking of sensitive data, as described in Rapid7's guidance on log management and analytics.
The usual failure mode
A production failure rarely looks dramatic in the logs. It looks vague.
You see messages like:
- “starting request”
- “processing user”
- “error occurred”
- “done”
None of that helps. It doesn't tell you which user, which request, which service instance, or what “done” even means. If you've ever stepped through a midnight incident and wished your past self had left a breadcrumb trail, you already understand the core purpose of logging.
Good logs shorten the distance between “something is wrong” and “I know exactly where to look.”
Logging is a debugging gift you ship ahead of time
That's why I treat logs as part of the product, not as debris from development. They're part of reliability. They support debugging, customer-support escalations, and post-incident learning. They also work best when they complement your other signals. Metrics tell you that something changed. Traces tell you where latency or failure moved through the system. Logs tell you what happened in the code and data path.
If your debugging process still depends on reproducing every bug locally, you're making life harder than it needs to be. A cleaner logging setup pairs well with stronger debugging techniques for real production issues, especially when the bug only appears under real traffic and real state.
The Core Principle of Structured Logging
Structured logging fixes a common production mistake. Teams write logs for the moment they type them, not for the 2 a.m. incident when someone needs to filter, correlate, and decide what to do next.
A plain text log line can be readable and still be a bad operational tool. The problem is not readability. The problem is that freeform text breaks the moment you need consistent queries across services, deploys, and incident responders. Good logs act like records with predictable fields, so your future self can ask precise questions and get answers fast.

What bad looks like
This is common and weak:
console.log(`Payment failed for user ${userId} because Stripe timed out`);
It tells a human roughly what happened. It does not give your tooling stable fields to query. If someone later changes the wording from “timed out” to “timeout” or adds another clause, your saved searches and alerts get worse.
Now compare that to this:
logger.warn({
  event: "payment_failed",
  user_id: userId,
  provider: "stripe",
  reason: "timeout",
  order_id: orderId
});
That line is useful under pressure. You can filter by event=payment_failed, group failures by provider, pull every event for one order_id, or isolate a single user's failed attempts without parsing text.
That is the core principle. Keep the event shape consistent enough that machines can do the sorting for you.
Start with a minimum schema
Do not design a giant logging taxonomy before you ship. That usually ends in bikeshedding and half-adopted conventions. Start with the smallest field set that makes incident response faster, then keep it consistent.
For most apps, that minimum looks like this:
- Event name such as user_login_failed, payment_succeeded, db_retry_started
- Log level to separate expected activity from issues that need attention
- Timestamp in a standard format
- Service name if more than one service can emit the event
- Environment such as development, staging, production
- Request or correlation ID so related events can be found together
- Business identifiers like user_id, order_id, tenant_id when they help answer support or incident questions
- Error details when an operation fails
This is enough for most early-stage teams. You do not need every field on every line. You need the same key fields to show up in the same way every time an event matters.
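The minimum field set above can be sketched as one small helper that every call site goes through. This is only an illustration: `make_event`, `SERVICE_NAME`, and `ENVIRONMENT` are hypothetical names, and a real service would likely read the last two from configuration.

```python
from datetime import datetime, timezone
from typing import Any, Optional

SERVICE_NAME = "checkout-api"  # hypothetical; normally read from config
ENVIRONMENT = "production"     # hypothetical; normally read from config


def make_event(
    event: str,
    level: str,
    request_id: Optional[str] = None,
    error: Optional[BaseException] = None,
    **identifiers: Any,
) -> dict:
    """Build one log record with the same key fields every time."""
    record = {
        "event": event,
        "level": level,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": SERVICE_NAME,
        "environment": ENVIRONMENT,
        "request_id": request_id,
    }
    # Business identifiers (user_id, order_id, ...) appear only when supplied.
    record.update({k: v for k, v in identifiers.items() if v is not None})
    # Error details appear only when an operation failed.
    if error is not None:
        record["error"] = {"type": type(error).__name__, "message": str(error)}
    return record
```

Funnelling every event through one helper keeps the key names identical across call sites, which is what makes later queries predictable.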
Keep the message readable
Some teams swing too far and produce logs that are perfect for machines and miserable for humans. That is a mistake too.
Use a clear message for the person scanning the stream. Put the searchable facts in fields.
logger.info(
    "Order submitted successfully",
    extra={
        "event": "order_submitted",
        "order_id": order_id,
        "user_id": user_id,
        "payment_method": payment_method,
    },
)
The message helps during triage. The fields make the log dependable in dashboards, alerts, and incident searches.
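One caveat if the snippet above is Python's standard `logging` module: fields passed through `extra` are attached to the record but are not printed by the default formatter. A minimal sketch of a formatter that writes them out as JSON might look like this. `JsonFormatter` is a hypothetical name, not a standard-library class.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord has by default. Anything else on a record
# was supplied by the caller through `extra`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy caller-supplied fields (event, order_id, ...) to the top level.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps unusual values from breaking the log call.
        return json.dumps(payload, default=str)
```

Attach it with `handler.setFormatter(JsonFormatter())` on whichever handler ships logs out of the process. The human-readable message and the queryable fields then travel in the same line.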
A simple test works well here. If routine production questions require regex, grep gymnastics, or guessing which phrase an engineer used last month, your logs need more structure.
Design events, not sentences
This is the shift that improves logging fastest.
Instead of asking, “What should this line say?” ask, “What event happened, and which fields will matter later?” That pushes you toward stable event names and stable keys. It also reduces the habit of stuffing useful context into one long string that nobody can query cleanly.
I like event names that describe a business or system action clearly: invoice_payment_failed, cache_miss, password_reset_requested. Keep them boring. Boring names age well.
Centralize once the structure exists
Structured logs pay off when they land in one searchable place. If half the story is in container output, another piece is in a worker's local file, and the rest is in a hosted function console, you still have an incident response problem.
Centralization should follow structure, not replace it. Dumping inconsistent text logs into one tool gives you one bigger mess. Sending consistent event records to one place gives your team something they can use.
The standard to aim for is simple. An engineer should be able to ask, “Show me every payment_failed event for this order across services,” and get the answer in one query. That is the point of structured logging as a product feature for your future self.
Using Log Levels Effectively
Teams often misuse log levels in one of two ways. They either log everything at INFO, or they treat DEBUG as a junk drawer for everything they didn't want to think about.
A log level is useful when it answers a question.
The question each level should answer
DEBUG answers: What was the code doing in detail?
Use it for developer-facing execution details, temporary diagnostics, payload shape summaries, decision branches, and retry state. It's great in development. It's dangerous in production if you use it lazily.
INFO answers: What normal business-relevant thing just happened?
Use it for startup events, successful state transitions, completed jobs, user-visible operations, and expected external interactions.
WARN answers: What went wrong, but the system recovered or continued?
Use it when something is off and someone should know, even if no immediate action is needed. A retry that succeeds on the second attempt belongs here. So does a fallback path.
ERROR answers: What failed and now needs attention?
Use it for failed requests, broken jobs, unhandled exceptions, and operations that didn't complete as intended.
Log Level Cheat Sheet
| Level | Purpose | Example Use Case |
|---|---|---|
| DEBUG | Detailed execution info for developers | Logging retry state, branch selection, parsed input shape during local debugging |
| INFO | Confirm normal, meaningful operation | User completed checkout, worker started, webhook processed successfully |
| WARN | Flag unexpected but recoverable behavior | Third-party API timed out once and retry succeeded |
| ERROR | Record failures that need investigation | Payment failed permanently, job crashed, database write failed |
A web app example
If a user submits a payment:
- Successful charge recorded and order marked paid: INFO
- Payment provider is slow, first request times out, second attempt works: WARN
- Charge fails and order remains unpaid: ERROR
- You're diagnosing a weird duplicate-submit bug in staging and want to inspect idempotency flow: DEBUG
That's a healthy pattern because each level has a job.
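The payment example above can be sketched with Python's standard `logging` module, which spells WARN as `WARNING`. Here `charge_with_retry` and the `charge` callable are hypothetical stand-ins for a real payment client, assumed to raise `TimeoutError` when the provider is slow.

```python
import logging

logger = logging.getLogger("payments")


def charge_with_retry(charge, order_id: str, max_attempts: int = 2) -> bool:
    """Try a charge, logging each outcome at the level that matches its meaning."""
    for attempt in range(1, max_attempts + 1):
        # DEBUG: execution detail, only useful while diagnosing.
        logger.debug(
            "charge attempt",
            extra={"event": "charge_attempt", "order_id": order_id, "attempt": attempt},
        )
        try:
            charge(order_id)
        except TimeoutError:
            if attempt < max_attempts:
                # WARNING: something went wrong, but a retry is still available.
                logger.warning(
                    "charge timed out, retrying",
                    extra={"event": "charge_timeout", "order_id": order_id, "attempt": attempt},
                )
                continue
            # ERROR: retries are exhausted and the order stays unpaid.
            logger.error(
                "charge failed",
                extra={"event": "charge_failed", "order_id": order_id, "attempt": attempt},
            )
            return False
        # INFO: the normal, business-relevant outcome.
        logger.info(
            "order paid",
            extra={"event": "order_paid", "order_id": order_id, "attempt": attempt},
        )
        return True
    return False
```

With production set to INFO, a clean payment produces one line, a recovered timeout produces a warning plus the success line, and only a permanent failure reaches ERROR.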
What doesn't work
This doesn't work:
- Logging every request body at INFO
- Logging successful health checks at INFO
- Logging every exception twice, once where it happens and once where it bubbles up
- Marking recoverable fallback behavior as ERROR
- Leaving DEBUG enabled in production because “it might help later”
Those habits make alerting worse and incident review slower. They also train engineers to ignore logs because the signal-to-noise ratio collapses.
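The double-logging habit in particular has a simple alternative: add context where the failure happens, and log once at the boundary that decides the outcome. A rough sketch, where `OrderError`, `save_order`, `handle_request`, and the `write` callable are all hypothetical:

```python
import logging

logger = logging.getLogger("orders")


class OrderError(Exception):
    """Raised when an order cannot be persisted."""


def save_order(order_id: str, write) -> None:
    try:
        write(order_id)
    except OSError as exc:
        # No log call here. Add context and let the caller decide what to record.
        raise OrderError(f"could not persist order {order_id}") from exc


def handle_request(order_id: str, write) -> int:
    try:
        save_order(order_id, write)
    except OrderError:
        # The single place this failure is logged. logger.exception records
        # at ERROR and attaches the traceback, including the chained cause.
        logger.exception(
            "order_save_failed",
            extra={"event": "order_save_failed", "order_id": order_id},
        )
        return 500
    return 200
```

Because the original exception is chained with `from exc`, the one ERROR entry still carries the low-level cause.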
Change verbosity by environment
This is one of the easiest wins in logging best practices. In development, be generous. In production, be selective.
A practical baseline:
- Development: DEBUG
- Staging: mostly INFO, turn up specific services when needed
- Production: default to INFO, emit WARN and ERROR carefully, enable temporary deeper logs only for targeted diagnosis
That split keeps local debugging rich without turning production into a landfill.
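One way to wire that baseline is a small lookup with an explicit override for targeted diagnosis. This is a sketch under assumptions: the `APP_ENV` and `LOG_LEVEL` environment variable names and the `resolve_level` helper are hypothetical, not a standard convention.

```python
import logging
import os
from typing import Optional

# Default verbosity per environment, following the baseline above.
DEFAULT_LEVELS = {
    "development": logging.DEBUG,
    "staging": logging.INFO,
    "production": logging.INFO,
}

_LEVEL_NAMES = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def resolve_level(environment: str, override: Optional[str] = None) -> int:
    """Pick a log level: a valid override wins, otherwise the environment default."""
    if override and override.upper() in _LEVEL_NAMES:
        return _LEVEL_NAMES[override.upper()]
    # Unknown environments fall back to INFO rather than DEBUG.
    return DEFAULT_LEVELS.get(environment, logging.INFO)


def configure_logging() -> None:
    # APP_ENV and LOG_LEVEL are assumed variable names for this sketch.
    level = resolve_level(os.environ.get("APP_ENV", "development"), os.environ.get("LOG_LEVEL"))
    logging.basicConfig(level=level)
```

The override makes "temporary deeper logs" a configuration change for one service instead of a code change, and a typo in the override falls back to the default.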
Your production logs should help you operate the system, not narrate every line of code.
Connecting the Dots with Context and Tracing
A single good log line helps. A connected story solves incidents.
Most production failures don't happen in one function. A request enters an API, touches authentication, hits a database, calls a third-party service, triggers a queue consumer, and finally returns something broken to the user. If each component logs in isolation, you don't have observability. You have fragments.
The fix starts with context.

Treat the request ID like a tracking number
A correlation ID, often called a request ID or trace ID, is the tracking number for a user request. Create it when the request starts. Pass it through everything. Include it in every log line that belongs to that journey.
Without it, your logs tell you what happened somewhere. With it, your logs tell you what happened for this exact request.
logger.info({
  event: "checkout_started",
  request_id: req.id,
  user_id: user.id,
  service: "api",
});
Later in another service:
logger.warn({
  event: "inventory_service_timeout",
  request_id: req.id,
  order_id: order.id,
  service: "inventory",
});
And later in a worker:
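logger.error({
  event: "order_confirmation_failed",
  // A worker has no HTTP request object, so the ID has to travel in the
  // job payload. The job.request_id field name is a hypothetical example.
  request_id: job.request_id,
  order_id: order.id,
  service: "mailer",
  reason: "template_missing"
});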
Now you can search one request_id and replay the whole narrative.
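Passing the ID by hand to every log call gets tedious. In Python, one common approach is to hold it in a context variable and let a logging filter stamp it on each record. The sketch below assumes that pattern. `start_request` and `RequestIdFilter` are hypothetical names, and how the incoming ID arrives (a header, a job payload) is left to the caller.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the current request's ID for the running thread or async task.
request_id_var: contextvars.ContextVar = contextvars.ContextVar("request_id", default=None)


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drops a record, only annotates it


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID when one was passed along, otherwise create one."""
    request_id = incoming_id or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id
```

Add the filter to the handler with `handler.addFilter(RequestIdFilter())`, call `start_request(...)` at the top of each request or job, and include `request_id` in the formatter output. Downstream services reuse the ID they receive instead of minting a new one.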
Add business and system context sparingly
You don't need every field on every log. You need the fields that help answer the next question.
Useful context often includes:
- Who triggered it: user_id, tenant_id, session_id
- What object it involved: order_id, job_id, invoice_id
- Where it ran: service, environment, version
- How it behaved: attempt, duration_ms, outcome
That context turns “database error” into “checkout failed for tenant X in production on version Y after retry attempt Z.”
Don't confuse context with clutter
A common mistake is dumping giant objects into logs and calling that context. That isn't context. That's laziness.
Good context is selective. It captures identifiers and state transitions. It doesn't serialize the entire user record, request body, or ORM object by default.
Here's a short walkthrough if you want a visual on how richer context builds toward observability:
<iframe width="100%" style="aspect-ratio: 16 / 9;" src="https://www.youtube.com/embed/Oa-zqv-EBpw" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
Tracing starts with logging discipline
You don't need a full distributed tracing setup on day one. If you consistently attach correlation IDs and service metadata, you've already laid the groundwork. Later, if you adopt tracing tools, your system will be ready because the habit is already there.
That's the practical version of observability. Start with logs that carry enough context to reconstruct a request. Add deeper tracing when the architecture needs it.
Managing Volume, Storage, and Costs
It is 2:13 a.m. A customer says checkout is failing, the queue is backing up, and the log search times out because your system has been dumping millions of low-value events all week.
That is what "log everything" buys you in production.
Logs are a product feature for your future self. If they are too noisy, too expensive to retain, or too slow to search under pressure, they fail at the exact moment you need them. The goal is not more logs. The goal is enough signal to diagnose real problems fast, without paying to index junk forever.
Keep logs that answer expensive questions
Use one filter for noisy categories: did this event help diagnose an incident, explain user impact, or support a security review?
If not, cut it, sample it, or lower its retention.
Common reduction targets:
- Health check success logs
- Routine polling logs
- Repeated startup chatter
- Per-item loop logs in high-volume workers
- Verbose third-party SDK output
Logs worth keeping with higher confidence:
- Authentication failures
- Authorization failures
- Application errors
- Data import and export events
- High-risk state changes
This is a trade-off, not a purity test. A per-record worker log may help during one migration, then become dead weight six weeks later. Review noisy categories after incidents and after major launches. Keep what earns its cost.
Retention should match value
One retention setting for every log type is lazy configuration.
Verbose diagnostic logs have short-term value. Security and audit-relevant events often need a longer window. Standard application logs usually sit somewhere in the middle. Set those classes on purpose, then document them so the team knows what will still exist during an incident review.
Teams that never define retention often end up paying to keep low-value history that nobody queries.
Sample repetitive success, keep failures
High-frequency success events distort both cost and search results. Sampling fixes that without blinding the team.
A practical rule works well:
- Keep all errors
- Keep all unusual states
- Keep all security-relevant events
- Sample high-volume successful events
- Drop events that are only useful during a temporary investigation
Successful health checks are a classic example. Full retention rarely helps. The same goes for repetitive cache hits or routine background task confirmations. If a system is healthy and noisy, reduce the confirmation logs and preserve the exceptions.
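The sampling rule above can be sketched as a logging filter: everything at WARNING or higher passes, and only named high-volume success events are thinned. `SuccessSampler` is a hypothetical class, the `event` attribute is assumed to arrive through `extra`, and the counter-based approach (keep the first record, then every Nth) is one simple choice among several.

```python
import logging
from typing import Dict, Iterable


class SuccessSampler(logging.Filter):
    """Keep all warnings and errors; keep 1 in `keep_every` of listed success events."""

    def __init__(self, sampled_events: Iterable[str], keep_every: int = 100) -> None:
        super().__init__()
        self.sampled_events = set(sampled_events)
        self.keep_every = keep_every
        # Per-event counters. Not strictly thread-safe, which is acceptable
        # here because sampling only needs to be approximately even.
        self._counters: Dict[str, int] = {}

    def filter(self, record: logging.LogRecord) -> bool:
        # Failures and unusual states are never sampled away.
        if record.levelno >= logging.WARNING:
            return True
        event = getattr(record, "event", None)
        # Events not explicitly listed are kept in full.
        if event not in self.sampled_events:
            return True
        count = self._counters.get(event, 0)
        self._counters[event] = count + 1
        return count % self.keep_every == 0
```

Listing sampled events explicitly means a new event type is kept in full until someone decides it is noisy, which is a safer default than sampling everything at INFO.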
If your logging bill grows faster than your ability to debug incidents, you are storing too much noise.
Treat volume control as shipping discipline
Founders and small teams usually postpone this until a cloud bill spikes or a real outage exposes the mess. That is late. Retention rules, sampling decisions, and noisy event review belong in the same operating habits as deploy checks and rollback plans. A good production readiness checklist for software teams should call out which logs you keep, how long you keep them, and which events the team expects to rely on during an outage.
The minimum effective setup is usually enough:
- Send logs to one central system.
- Review high-volume categories monthly or after incidents.
- Sample or remove repetitive low-value events.
- Set retention by log type.
- Make sure the team knows which logs matter when production breaks.
That keeps logs searchable, useful, and affordable.
Security and Privacy in Your Logs
Production incident. Everyone is rushing. Someone opens the logs to find the failure path, and now the team is staring at a live session token, a customer email, and a secret pulled from config. The original bug is still bad. The log leak is worse.
Logs are not a private scratchpad. They get forwarded to vendors, indexed in search tools, copied into tickets, and reviewed by people far outside the engineer who wrote the line. Treat them like a product feature for your future self. Useful under pressure, safe by default, and boring to audit.

What should never land in logs
Some values create more risk than debugging value:
- Passwords
- API keys
- Session tokens
- Secret values from environment configuration
- Personally identifiable information unless there is a clear operational reason
- Full payment or sensitive account details
AWS recommends excluding sensitive data such as passwords, API keys, and unnecessary personal data from logs, and keeping error logging focused on what is needed for diagnosis in its logging best practices for application owners.
This gets messy fast in real code. A developer logs a full request body during a checkout failure. Another dumps a config object during startup. If the team has a weak grasp of environment file setup and secret management flow, those mistakes end up exposing the exact values that were supposed to stay out of source control, dashboards, and support threads.
Redact before storage, not after
Post-processing is not a safety plan. By the time a value hits your log pipeline, it may already be indexed, exported, or copied.
The practical fix is simple. Decide what is allowed, then log only that. In practice, that usually means logging that a token existed, not the token itself, logging a user ID instead of a full profile object, and masking email addresses or account references unless the full value is required to resolve the incident.
Whitelist-based logging beats blacklist-based logging in production systems. Blacklists fail when a new field shows up inside a nested payload and nobody updates the filter. Whitelists force intent.
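A whitelist can be as small as a set of permitted keys applied before anything reaches the logger. In this sketch the field names in `ALLOWED_FIELDS`, and the `safe_fields` and `mask_email` helpers, are hypothetical examples rather than a recommended schema.

```python
from typing import Any, Mapping

# Only these keys may be copied into a log record as-is.
ALLOWED_FIELDS = {"event", "request_id", "user_id", "order_id", "reason", "attempt"}


def mask_email(value: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, sep, domain = value.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"


def safe_fields(raw: Mapping[str, Any]) -> dict:
    """Return only the fields that are allowed to be logged."""
    out = {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}
    # Record a masked email instead of the full address.
    if "email" in raw:
        out["email_masked"] = mask_email(str(raw["email"]))
    # Record that a token existed, never the token itself.
    if "session_token" in raw:
        out["has_session_token"] = bool(raw["session_token"])
    return out
```

A new field in the payload, nested or not, stays out of the logs until someone adds it to the set on purpose.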
Protect the logs themselves
Security is not only about what goes into logs. It is also about who can change them, disable them, or reduce them when the system is under stress.
Limit access. Restrict who can change log settings in production. Route configuration changes through the same review path you use for other operational changes. If your platform supports integrity checks or tamper evidence, turn them on. Logs only help during an investigation if the team trusts that they were captured consistently and not edited after the fact.
Logs should help you explain an incident without creating a second incident.
Logging Is a Team Discipline, Not a Tool
You can buy Elastic, Datadog, Loki, CloudWatch, or any other capable platform and still end up with terrible logs. The tool helps. The discipline decides the outcome.
Teams get value from logging when they agree on a few essential principles and keep them consistent across services. That means code review standards, shared field names, sane defaults, and regular cleanup when logs become noisy or misleading.

The minimum effective checklist
If you want the short version, start here:
- Use structured logs: Prefer JSON or consistent key-value output over free-form text.
- Apply log levels on purpose: DEBUG for development detail, INFO for normal operations, WARN for recoverable issues, ERROR for failures.
- Attach context: Include request IDs, service names, and the business identifiers that matter.
- Control volume: Sample repetitive success noise and define retention instead of keeping everything forever.
- Protect sensitive data: Exclude secrets, credentials, tokens, and unnecessary personal data.
- Protect the logging system itself: Don't let random users disable or alter logging undetected.
What good teams do differently
Good teams review logs the same way they review tests and error handling. They ask whether a message will help during an incident. They rename vague events. They delete noisy output. They standardize fields across services so search works the same everywhere.
That's the payoff. Better logging reduces stress. It speeds up debugging. It makes on-call less chaotic. It helps a small team ship faster because fewer hours disappear into guessing.
The best logging setup isn't the most elaborate one. It's the one your team maintains.
If you want hands-on help tightening your logging, debugging ugly production issues, or building a more reliable shipping workflow, Jean-Baptiste Bolh works directly with founders and developers to get real software out the door without the usual fluff.