Structured Logging for Distributed Systems


It’s 3 AM. A payment is stuck somewhere between your API gateway, order service, and payment processor. You start searching logs.

You try grep 'userId'. Nothing. Maybe it’s grep 'user_id'? A few hits, but not from the payment service. grep 'user.id'? Different results again. Five queries later, you’ve pieced together most of the request path, but you’re still not sure if you’ve found everything.

Now imagine a different scenario: user.id:12345 AND event.action:payment_initiated. One query. Every service. Every log. Complete picture in seconds.

The difference isn’t better tooling. It’s discipline in how you emit logs. Structured logging with consistent schemas and correlation IDs transforms distributed debugging from archaeology into routine work. Here’s how to implement it.

Why Schema Matters

One service logs userId, another logs user_id, a third logs user.id. Without standards, every developer who adds logging invents their own conventions. Queries become guesswork, and the worst part is you don’t know what you’re missing—false negatives are invisible.

The fix isn’t documentation that nobody reads. It’s adopting a schema that makes the right choice obvious.

Adopt ECS, Don’t Invent

The Elastic Common Schema (ECS) provides 800+ pre-defined fields covering users, errors, HTTP requests, network events, and more. It works with any log backend—Elasticsearch, DataDog, Splunk, CloudWatch Logs—because the schema is about field naming, not storage format.

Here’s what an ECS-compliant log entry looks like:

{
  "@timestamp": "2024-01-15T14:32:01.234Z",
  "log.level": "error",
  "message": "Payment processing failed",
  "service.name": "payment-service",
  "trace.id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "user.id": "usr_12345",
  "error.type": "PaymentDeclinedException",
  "error.message": "Card declined: insufficient funds"
}
ECS-compliant log entry. Field names like user.id and error.type are standardized, making cross-service queries predictable.

The key field groups to adopt immediately:

  • service.* — which service emitted the log (service.name, service.version)
  • trace.* — correlation context (trace.id, span.id)
  • event.* — what happened (event.action, event.outcome)
  • error.* — failure details (error.type, error.message, error.stack_trace)

You don’t need to adopt all 800 fields. Start with these four groups and expand as needed.
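To make the field groups concrete, here's a minimal sketch of a logger wrapper that stamps every entry with the standardized names. The `ecsLog` function, the hard-coded service fields, and returning the JSON string (rather than writing to a transport) are illustrative assumptions, not a specific library's API:

```typescript
// Minimal sketch of an ECS-style logger wrapper. The emit shape and
// hard-coded service fields are assumptions, not a specific library's API.
const serviceFields = {
  'service.name': 'payment-service',
  'service.version': '1.4.2',
}

function ecsLog(level: string, message: string, fields: Record<string, unknown> = {}): string {
  const entry = {
    '@timestamp': new Date().toISOString(),
    'log.level': level,
    message,
    ...serviceFields,
    ...fields,
  }
  // In production this would go to stdout or a log transport
  return JSON.stringify(entry)
}

// Every call site uses the same standardized names:
ecsLog('error', 'Payment processing failed', {
  'user.id': 'usr_12345',
  'error.type': 'PaymentDeclinedException',
})
```

Because the wrapper owns the `service.*` fields and the envelope shape, individual developers can't drift back into `userId` vs. `user_id`.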

Info callout:

If you’re already using an observability platform, check whether it has ECS mappings. DataDog, Splunk, and others can normalize ECS fields automatically, giving you consistent querying across tools.

Correlation IDs That Actually Work

A consistent schema lets you query individual services reliably. Correlation IDs let you follow a request across services. Without them, you’re reduced to timestamp proximity searches—“this log happened around the same time as that log, so maybe they’re related.”

The ID Hierarchy

Not all correlation IDs serve the same purpose. Here’s the hierarchy you need:

ID Type        | Scope               | Purpose                                      | Example
Trace ID       | Entire request tree | Connect all services in a request            | 4bf92f3577b34da6a...
Span ID        | Single service hop  | Distinguish parent/child operations          | 00f067aa0ba902b7
Request ID     | Single HTTP request | Correlate with load balancer logs            | req_abc123
Transaction ID | Business operation  | Group related requests (e.g., checkout flow) | order-12345

Table: Correlation ID types and their scopes. Trace ID is the most important—it stays constant across all services.

The key insight: trace ID stays constant across every service that handles the request. When you query trace.id:4bf92f3577b34da6a, you get logs from the API gateway, order service, payment service, inventory service, and notification service—everything involved in that single user action.


Propagation Is Everything

Correlation IDs are useless if they don’t travel with requests. Every service-to-service call must carry the correlation context forward.

For HTTP, use the W3C Trace Context standard. The traceparent header carries trace and span IDs in a single string:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The format is version-traceId-spanId-flags. Most APM tools (OpenTelemetry, Jaeger, Zipkin) support this natively.
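Hand-rolling the header is only a few lines. Here's a sketch of parse and format helpers for the `00` version of the spec; in a real service you'd lean on an instrumentation library like OpenTelemetry instead, and the interface name here is an assumption:

```typescript
// Minimal W3C Trace Context helpers (version 00 only); prefer an
// instrumentation library such as OpenTelemetry over hand-rolling.
interface TraceContext { traceId: string; spanId: string }

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceParent(header: string | undefined): TraceContext | undefined {
  if (!header) return undefined
  const match = TRACEPARENT_RE.exec(header.trim())
  if (!match) return undefined
  const [, , traceId, spanId] = match
  // All-zero IDs are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined
  return { traceId, spanId }
}

function formatTraceParent(traceId: string, spanId: string, sampled = true): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`
}
```

Rejecting malformed or all-zero headers (rather than throwing) lets the middleware fall back to generating fresh IDs, which keeps a bad upstream header from failing the request.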

Async messaging requires a different approach since there are no HTTP headers. Instead, embed correlation context directly in the message envelope. Include a causationId to track which message or request triggered this one—essential for debugging event-driven architectures:

{
  "messageId": "msg_xyz789",
  "correlation": {
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "causationId": "msg_abc456",
    "originRequestId": "req_abc123"
  },
  "payload": { ... }
}

Code: Message envelope with correlation context. The causationId creates an audit trail of what triggered what.
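A small helper can build that envelope at publish time. This sketch assumes the envelope shape above; `wrapMessage` and the `Correlation` type are illustrative names, not a specific broker's API:

```typescript
// Sketch of building a correlation envelope for an outbound message.
// wrapMessage and the type names are illustrative, not a broker's API.
import { randomUUID } from 'crypto'

interface Correlation {
  traceId: string
  causationId?: string
  originRequestId?: string
}

interface Envelope<T> {
  messageId: string
  correlation: Correlation
  payload: T
}

function wrapMessage<T>(payload: T, correlation: Correlation): Envelope<T> {
  return { messageId: `msg_${randomUUID()}`, correlation, payload }
}

// A consumer reacting to msg_abc456 publishes a follow-up whose
// causationId is the message that triggered it:
const next = wrapMessage(
  { orderId: 'order-12345' },
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', causationId: 'msg_abc456' },
)
```

The consumer copies the incoming `traceId` forward unchanged and sets `causationId` to the incoming `messageId`, which is what turns a pile of events into a traceable chain.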

AsyncLocalStorage for Automatic Context

The challenge with correlation is making context available everywhere without threading IDs through every function signature. Node.js’s AsyncLocalStorage solves this elegantly:

import { AsyncLocalStorage } from 'async_hooks'
import type { Request, Response, NextFunction } from 'express'

interface RequestContext { traceId: string; spanId: string; requestId: string }

const contextStorage = new AsyncLocalStorage<RequestContext>()

// Middleware: extract IDs from incoming headers, run the request in context
function correlationMiddleware(req: Request, res: Response, next: NextFunction) {
  const { traceId, spanId } = parseTraceParent(req.headers['traceparent'] as string | undefined)
    || { traceId: generateTraceId(), spanId: generateSpanId() }

  const requestId = (req.headers['x-request-id'] as string | undefined) || generateRequestId()

  contextStorage.run({ traceId, spanId, requestId }, () => next())
}

// Anywhere in your code: get current context
function getCorrelationContext() {
  return contextStorage.getStore()
}

// Outbound requests: inject context automatically
async function fetchWithCorrelation(url: string, options: RequestInit = {}) {
  const context = getCorrelationContext()
  const headers = new Headers(options.headers)

  if (context) {
    headers.set('traceparent', formatTraceParent(context.traceId, generateSpanId()))
    headers.set('x-request-id', context.requestId)
  }

  return fetch(url, { ...options, headers })
}

Code: AsyncLocalStorage pattern for correlation context. Extract in middleware, access anywhere via getCorrelationContext(), inject automatically in outbound calls.

With this pattern, your logger can call getCorrelationContext() and automatically include trace IDs in every log—no explicit ID passing required.
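Here's what that logger integration might look like as a self-contained sketch. The `logWithContext` function is an assumption built on the same `AsyncLocalStorage` store as the middleware; field names follow ECS:

```typescript
// Sketch of a context-aware logger: it reads the AsyncLocalStorage store
// set up by the middleware and merges trace fields into every entry.
import { AsyncLocalStorage } from 'async_hooks'

interface RequestContext { traceId: string; spanId: string; requestId: string }
const contextStorage = new AsyncLocalStorage<RequestContext>()

function logWithContext(level: string, message: string, fields: Record<string, unknown> = {}): string {
  const ctx = contextStorage.getStore()
  const entry = {
    '@timestamp': new Date().toISOString(),
    'log.level': level,
    message,
    // Trace fields appear only when we're inside a request scope
    ...(ctx && { 'trace.id': ctx.traceId, 'span.id': ctx.spanId }),
    ...fields,
  }
  return JSON.stringify(entry) // in practice, written to stdout
}

// Inside a request scope, trace IDs appear with no explicit passing:
contextStorage.run(
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7', requestId: 'req_abc123' },
  () => logWithContext('info', 'Charging card', { 'event.action': 'payment_initiated' }),
)
```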

Warning callout:

Every HTTP client in your codebase must propagate correlation headers. A single direct fetch() call breaks the trace chain. Wrap your HTTP clients or use OpenTelemetry auto-instrumentation to ensure consistency.

Common Correlation Breaks

These patterns break correlation in production. Each creates gaps in your trace data:

  • Generating new trace IDs per service — Logs can’t be joined. Extract the incoming traceparent; only generate if missing.
  • Direct HTTP client usage — Outbound calls don’t propagate headers. Use fetchWithCorrelation() or instrumented clients.
  • Fire-and-forget async — Background work loses context. Capture context before spawning; restore in the async handler.
  • Batched message processing — All messages share one correlation. Process each message in its own contextStorage.run() scope.
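For the last two breaks, the fix is the same move: capture the store when work is scheduled, re-enter it when the work runs. A minimal sketch with an in-process job queue (the queue and function names are illustrative):

```typescript
// Sketch: capture context at enqueue time, re-enter it at execution time,
// so work that runs outside the request's async chain stays correlated.
import { AsyncLocalStorage } from 'async_hooks'

interface Ctx { traceId: string }
const contextStorage = new AsyncLocalStorage<Ctx>()

type Job = { run: () => void; ctx?: Ctx }
const queue: Job[] = []

// Capture the current context alongside the work...
function enqueue(run: () => void) {
  queue.push({ run, ctx: contextStorage.getStore() })
}

// ...and restore it when the job finally executes, one run() scope per job.
function drain() {
  for (const job of queue.splice(0)) {
    if (job.ctx) contextStorage.run(job.ctx, job.run)
    else job.run()
  }
}
```

The same per-item `contextStorage.run()` scope is what fixes batched message processing: each message gets its own context instead of inheriting the batch's.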

Making It Stick

Schema decisions and correlation patterns only help if they’re followed consistently. Code review catches maybe 60% of violations; CI automation catches 99%.

Enforce in CI

Use JSON Schema with Ajv to validate log output in unit tests:

it('emits valid ECS-compliant logs', () => {
  logger.info('Test message', { 'user.id': 'usr_123' })
  const entry = JSON.parse(logOutput[0])

  expect(validate(entry)).toBe(true)  // Ajv schema validation
  expect(entry['trace.id']).toMatch(/^[a-f0-9]{32}$/)
})

Code: Schema compliance test. Ajv validates the log structure; the regex assertion ensures trace IDs are present and correctly formatted.
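The `validate` function comes from compiling a JSON Schema with Ajv. A minimal schema covering the core fields might look like this; the exact `required` set is a team choice, not an ECS mandate:

```json
{
  "type": "object",
  "required": ["@timestamp", "log.level", "message", "service.name", "trace.id"],
  "properties": {
    "@timestamp": { "type": "string", "format": "date-time" },
    "log.level": { "enum": ["debug", "info", "warn", "error"] },
    "message": { "type": "string" },
    "service.name": { "type": "string" },
    "trace.id": { "type": "string", "pattern": "^[a-f0-9]{32}$" }
  }
}
```

Start permissive and tighten over time: adding a new required field to the schema is a one-line change that immediately fails CI for every service that hasn't adopted it.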

In production, Prometheus metrics can track correlation coverage. Set a target—99% of logs should have trace IDs—and alert when it drops. Missing correlation usually indicates logging from outside request context: background jobs, startup code, or uninstrumented libraries.
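The coverage ratio itself is simple to compute. This sketch uses plain counters to stay dependency-free; in production these would be Prometheus counters (e.g. via a client library like prom-client), and the function names are illustrative:

```typescript
// Hand-rolled sketch of a correlation-coverage metric. In production,
// use Prometheus counters; plain numbers keep the example dependency-free.
const coverage = { total: 0, withTrace: 0 }

function recordLogEmitted(entry: Record<string, unknown>): void {
  coverage.total++
  if (typeof entry['trace.id'] === 'string' && entry['trace.id'].length > 0) {
    coverage.withTrace++
  }
}

// The ratio to alert on: fire when it drops below the target (e.g. 0.99)
function correlationCoverage(): number {
  return coverage.total === 0 ? 1 : coverage.withTrace / coverage.total
}
```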

Success callout:

A 99% correlation target is achievable within weeks. The remaining 1% is typically startup logs and health checks—events that don’t need request context anyway.

Start Small

You don’t need to fix every service at once. A practical rollout:

  1. Week 1: Adopt ECS field names for new logs in one service
  2. Week 2: Add correlation middleware to that service
  3. Week 3: Query for logs missing trace.id and fix the gaps—usually background jobs or uninstrumented libraries
  4. Week 4: Expand to the next service in your critical path

Within a month, your most important request paths will be fully correlated. The rest can follow incrementally.


The Payoff

The investment is upfront: schema decisions, correlation middleware, CI tests. But the payoff compounds with every incident. Instead of five queries and an hour of uncertainty, you type one query and see the complete picture in seconds. Every service’s logs connect. Debugging takes minutes instead of hours.

The next time it’s 3 AM and you’re chasing a stuck payment, you won’t be guessing at field names. You’ll type user.id:12345 AND event.action:payment_initiated—one query, every service, complete picture—and know exactly what happened.
