Why Your Service Catalog Is Failing (And How to Fix It)

Service catalogs follow a depressingly predictable arc. At launch, you’ve got 90% coverage and 95% accuracy—teams are entering their services because the initiative has leadership attention. By year two, coverage has dropped to 45%, accuracy to 30%, and the catalog has become a punchline in onboarding jokes. Engineers ask in Slack instead of checking the catalog because they’ve learned they can’t trust it.

Here’s what makes this worse: a catalog with 80% accurate data is more dangerous than no catalog at all. It gives you false confidence. You page the listed owner at 3 AM, confident you’ve got the right team, and waste twenty minutes before discovering they handed off the service six months ago. Every minute spent paging the wrong team is a minute your users are affected.

The fix isn’t discipline or better training—it’s ownership modeling that captures how teams actually work, combined with enforcement automation that keeps data accurate without relying on anyone remembering to update it.

The Ownership Problem

If I had to pick one field that determines whether a catalog succeeds or fails, it’s ownership. Every use case—incident routing, cost attribution, security scanning—depends on knowing who’s responsible. Get ownership wrong and you’ve built an expensive spreadsheet.

But “ownership” is deceptively simple. A service might have a development team that writes the code, an SRE team that handles production incidents, a security contact for vulnerability disclosures, and a cost owner for budget decisions. Flattening all of that into a single owner field creates ambiguity during incidents—when a critical CVE drops, do you page the dev team listed as owner, or does someone else handle security?

Primary Owner Plus Role Contacts

The ownership model that works distinguishes between different types of responsibility: a primary owner who’s responsible for the service’s existence and development, plus role-specific contacts for specialized functions.

| Role | Responsibility | When contacted |
| --- | --- | --- |
| Primary owner | Development, roadmap, architecture | Default for anything not covered by a specific role |
| Oncall | Incident response, production issues | Automated alerts, P1/P2 incidents |
| Security contact | Vulnerability remediation, security reviews | CVE notifications, penetration test findings |
| Change approver | Production deployment approval | Release management workflows |
| Cost owner | Budget decisions, optimization | FinOps alerts, capacity planning |
Ownership roles and their triggers.

For most services, the primary owner handles all roles. You only need role-specific contacts when there’s a reason to route differently—a shared SRE oncall rotation, a dedicated security champion, or a manager who approves production changes.

Here’s how this looks in a Backstage catalog entry:

# Service ownership with role-specific contacts
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-processor
  annotations:
    pagerduty.com/service-id: P123ABC
spec:
  type: service
  lifecycle: production
  owner: team-payments
  # Role-specific contacts via custom extension
  x-contacts:
    oncall: payments-oncall-schedule
    security: alice.chen
    change-approver: team-payments-leads
Backstage entry with primary owner and role-specific contacts.
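
The fallback rule described above (use a role-specific contact when one is declared, otherwise the primary owner) can be sketched as a small shell helper. This is a minimal illustration, not part of Backstage: the function name resolve_contact and the role=contact argument format are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: resolve who to contact for a given role.
# Usage: resolve_contact ROLE PRIMARY_OWNER [role=contact ...]
# Prints the role-specific contact when one is declared,
# otherwise falls back to the primary owner.
resolve_contact() {
  role=$1
  primary=$2
  shift 2
  for pair in "$@"; do
    case $pair in
      "$role"=*)
        # Strip the "role=" prefix to get the contact
        printf '%s\n' "${pair#*=}"
        return 0
        ;;
    esac
  done
  # No role-specific contact declared: default to the primary owner
  printf '%s\n' "$primary"
}

# Example using the values from the catalog entry above
resolve_contact security team-payments \
  oncall=payments-oncall-schedule security=alice.chen
resolve_contact cost team-payments \
  oncall=payments-oncall-schedule security=alice.chen
```

The first call prints alice.chen because a security contact is declared. The second prints team-payments because no cost owner is listed, so the primary owner takes the role.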

Services Change Hands

Services become orphaned when their owning team dissolves, empties out, or goes inactive. This happens more than you’d think—reorgs, layoffs, and attrition all create orphans. A tier-1 service with no valid owner is a ticking time bomb.

Run orphan detection weekly. The rules should catch dissolved teams, empty teams, inactive teams, and missing oncall schedules:

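# Weekly orphan detection for service ownership
name: Orphan Detection
on:
  schedule:
    - cron: '0 9 * * 1'  # Every Monday at 09:00 UTC
jobs:
  detect-orphans:
    runs-on: ubuntu-latest
    env:
      # Assumes both endpoints are stored as repository secrets
      CATALOG_API: ${{ secrets.CATALOG_API }}
      IDP_API: ${{ secrets.IDP_API }}
    steps:
      - name: Check for orphaned services
        run: |
          # Fetch all tier-1 and tier-2 services (-f fails on HTTP errors)
          SERVICES=$(curl -sf "$CATALOG_API/services?tier=1,2")
          NAMES=$(echo "$SERVICES" | jq -r '
            .[].metadata.name
          ')
          ORPHANS=0

          for svc in $NAMES; do
            OWNER=$(echo "$SERVICES" | jq -r --arg SVC "$svc" '
              .[] | select(.metadata.name == $SVC) | .spec.owner
            ')

            # Check if team exists and has members. With -f, curl prints
            # nothing for a missing team, so an empty result counts as 0.
            TEAM_URL="$IDP_API/teams/$OWNER/members"
            TEAM_SIZE=$(curl -sf "$TEAM_URL" | jq 'length' || true)

            if [ "${TEAM_SIZE:-0}" -eq 0 ]; then
              echo "::error::Service $svc orphaned: $OWNER (0 members)"
              ORPHANS=$((ORPHANS + 1))
            fi
          done

          # Fail the job so orphans show up as a failed run
          if [ "$ORPHANS" -gt 0 ]; then
            exit 1
          fi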
GitHub Action for weekly orphan detection.
Danger: A tier-1 service without a valid owner means the next incident has no one to page. Treat orphan detection failures for critical services as a P1 issue.

For critical services, consider auto-escalation. When a tier-1 service is orphaned, automatically assign it to the domain owner or a catch-all platform team until proper ownership is established. This ensures someone gets paged even when the original team no longer exists.
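
One way to express that escalation rule is a function that returns the team to page. This is a sketch under stated assumptions: the catch-all team name platform-team, the function name effective_owner, and the argument order are all hypothetical, and the team size would come from whatever identity provider lookup you already run.

```shell
#!/bin/sh
# Hypothetical catch-all team; replace with your own.
CATCH_ALL_TEAM="platform-team"

# Usage: effective_owner TIER OWNER TEAM_SIZE [DOMAIN_OWNER]
# Prints the team that should be paged for the service.
# Returns 1 when the service is orphaned and not auto-escalated.
effective_owner() {
  tier=$1
  owner=$2
  team_size=$3
  domain_owner=${4:-}

  # The owner is valid if it is set and its team still has members
  if [ -n "$owner" ] && [ "$team_size" -gt 0 ]; then
    printf '%s\n' "$owner"
    return 0
  fi

  # Only tier-1 orphans are reassigned automatically:
  # domain owner first, catch-all team as the last resort
  if [ "$tier" = "1" ]; then
    printf '%s\n' "${domain_owner:-$CATCH_ALL_TEAM}"
    return 0
  fi

  # Lower tiers stay flagged as orphaned for manual follow-up
  return 1
}

# Example: a tier-1 service whose team has no members left
effective_owner 1 team-payments 0 team-commerce
```

The example prints team-commerce. Without a domain owner argument, the same call would print the catch-all team instead.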

Making Catalogs Self-Sustaining

Good ownership modeling gives you the right fields to capture. But even perfect schema design fails if engineers have to remember to update it. The catalogs that survive are the ones where accuracy is enforced automatically. If updating the catalog is a manual step that happens after deployment, it won’t happen consistently. The only way to maintain accuracy is to make the catalog part of the deployment path.

Catalog-as-Code

Store service definitions in the same repository as the service code. This gives you version control, code review, and the ability to enforce changes through CI/CD. When the catalog entry lives next to the code, updating it becomes part of the normal development workflow rather than a separate chore.

The Backstage catalog-info.yaml pattern works well here. Each service repository contains its own catalog entry, and a central catalog aggregates entries from all repositories. Changes to catalog metadata go through the same PR process as code changes.

The benefits add up: Git history shows who changed what and when, catalog changes get the same scrutiny as code changes, code and catalog changes ship together atomically, and teams maintain their own entries rather than relying on a central team.

CI/CD Validation

Validate catalog entries on every PR. This catches issues before they reach production: missing required fields, references to teams that don’t exist, dependencies on services that aren’t in the catalog.

# GitHub Action for catalog validation
# Requires CATALOG_API secret configured in repository settings
name: Catalog Validation
on:
  pull_request:
    paths:
      - 'catalog-info.yaml'
jobs:
  validate:
    runs-on: ubuntu-latest
    env:
      # Secrets are not exposed to steps unless mapped explicitly
      CATALOG_API: ${{ secrets.CATALOG_API }}
    steps:
      - uses: actions/checkout@v4
      - name: Validate schema
        # Assumes your Backstage CLI version provides this subcommand;
        # swap in your entity validator otherwise
        run: npx @backstage/cli catalog:validate catalog-info.yaml
      - name: Check owner exists
        run: |
          OWNER=$(yq '.spec.owner // ""' catalog-info.yaml)

          if [ -z "$OWNER" ]; then
            echo "Error: .spec.owner is missing in catalog-info.yaml"
            exit 1
          fi

          # Discard the response body; only the HTTP status matters
          curl -sf "$CATALOG_API/teams/$OWNER" > /dev/null || {
            echo "Owner team '$OWNER' not found"
            exit 1
          }
GitHub Action for catalog validation on PRs.
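
The workflow above covers the owner check. The dependency check mentioned earlier (dependencies on services that aren't in the catalog) could follow the same shape. Here is a minimal sketch that compares two newline-separated lists; the function name missing_dependencies is hypothetical, and how you obtain the two lists (for example from spec.dependsOn in the entry and from your catalog API) is left as an assumption.

```shell
#!/bin/sh
# Usage: missing_dependencies "DECLARED" "KNOWN"
# Both arguments are newline-separated lists of service names.
# Prints each declared dependency that is absent from the known list.
missing_dependencies() {
  declared=$1
  known=$2
  printf '%s\n' "$declared" | while IFS= read -r dep; do
    # Skip blank lines in the declared list
    [ -n "$dep" ] || continue
    # -F: fixed string, -x: whole-line match, -q: quiet
    if ! printf '%s\n' "$known" | grep -Fxq -- "$dep"; then
      printf '%s\n' "$dep"
    fi
  done
}

# Example with made-up service names
missing_dependencies "ledger
fraud-check" "ledger
payment-processor"
```

The example prints fraud-check, the only declared dependency with no catalog entry. In CI, a non-empty result would feed the block-or-warn decision for the service's tier.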

Scale enforcement with service tier. For tier-1 services, every validation failure blocks the PR. For tier-2 and tier-3, block on the checks that matter most for incident response and warn on the rest. For tier-4, warn only. You want to reduce friction for less critical services while maintaining strict standards for critical ones.

| Service tier | Missing owner | Missing runbook | Missing dependencies |
| --- | --- | --- | --- |
| tier-1 | Block | Block | Block |
| tier-2 | Block | Block | Warn |
| tier-3 | Block | Warn | Warn |
| tier-4 | Warn | None | None |
Validation enforcement by tier.
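
The table translates directly into a lookup that a CI step can call. The sketch below is one possible encoding, assuming hypothetical check names (owner, runbook, dependencies) and treating the empty tier-4 cells as "no enforcement".

```shell
#!/bin/sh
# Usage: enforcement_action TIER CHECK
# TIER is 1-4; CHECK is one of: owner, runbook, dependencies.
# Prints "block", "warn", or "none", mirroring the table above.
enforcement_action() {
  case "$1:$2" in
    1:owner|1:runbook|1:dependencies) echo block ;;
    2:owner|2:runbook)                echo block ;;
    2:dependencies)                   echo warn ;;
    3:owner)                          echo block ;;
    3:runbook|3:dependencies)         echo warn ;;
    4:owner)                          echo warn ;;
    4:runbook|4:dependencies)         echo none ;;
    *) echo "unknown tier/check: $1:$2" >&2; return 2 ;;
  esac
}

# Usage: enforce TIER CHECK MESSAGE
# Returns non-zero only when the failure should block the PR.
enforce() {
  action=$(enforcement_action "$1" "$2") || return 2
  case $action in
    block) echo "ERROR: $3"; return 1 ;;
    warn)  echo "WARNING: $3"; return 0 ;;
    none)  return 0 ;;
  esac
}

# Example: a missing runbook on a tier-3 service only warns
enforce 3 runbook "runbook link is missing"
```

Keeping the policy in a single case statement means changing a tier's strictness is a one-line diff that goes through review like any other change.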

Drift Detection

Even with CI validation, catalog entries go stale. Drift detection compares catalog declarations against actual system state. It catches discrepancies that validation can’t: a service running in Kubernetes but not in the catalog, a team that was deleted from the identity provider, an oncall schedule that was removed from PagerDuty.

A typical drift check queries your runtime environment and compares against catalog records:

#!/bin/bash
set -euo pipefail

# Fail fast if the catalog endpoint is not configured
: "${CATALOG_API:?CATALOG_API must be set}"

# Compare running Kubernetes deployments against catalog.
# A single call with -A lists deployments across all namespaces.
kubectl get deployments -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
  while read -r ns deploy; do
    # Check if deployment exists in catalog
    if ! curl -sf "$CATALOG_API/components/$deploy" > /dev/null; then
      echo "DRIFT: Deployment $ns/$deploy not in catalog"
    fi
  done
Basic drift detection comparing Kubernetes deployments to catalog entries.

Run drift detection daily. For critical drift (an invalid owner on a tier-1 service, a missing oncall schedule), alert immediately. For less critical drift, like undeclared dependencies on tier-3 services, batch into a weekly report.
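
That routing decision can be kept in one small function so every drift check reports through the same path. This is a sketch with assumed inputs: the drift labels are hypothetical, and it treats ownership and oncall drift as critical only on tier-1 services, which is one reading of the rule above.

```shell
#!/bin/sh
# Usage: drift_route TIER KIND
# KIND is a hypothetical drift label: invalid-owner, missing-oncall,
# undeclared-dependency, or not-in-catalog.
# Prints "alert" for immediate notification, "weekly-report" for batching.
drift_route() {
  tier=$1
  kind=$2
  case $kind in
    invalid-owner|missing-oncall)
      # Ownership and oncall gaps on tier-1 services break incident routing
      if [ "$tier" = "1" ]; then
        echo alert
        return 0
      fi
      ;;
  esac
  # Everything else is batched
  echo weekly-report
}

# Example: the two cases named in the text
drift_route 1 invalid-owner
drift_route 3 undeclared-dependency
```

The two example calls print alert and weekly-report respectively.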

Warning: Drift detection requires API access to your identity provider, oncall system, and runtime infrastructure. Expect 2-4 weeks of integration work per system. PagerDuty and Okta have good APIs, but some legacy CMDBs will fight you.

Measuring Success

A catalog without health metrics will silently decay. Measuring catalog health requires tracking three dimensions: coverage (what percentage is cataloged), accuracy (does the data reflect reality), and freshness (when was it last updated).

| Metric | Formula | Target |
| --- | --- | --- |
| Coverage ratio | Cataloged / Total | > 95% |
| Owner completeness | With valid owner / Cataloged | 100% |
| Oncall completeness (tier-1) | With oncall / tier-1 services | 100% |
| Freshness (30-day) | Updated within last 30 days / Cataloged | > 70% |
Core catalog health metrics.
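
The formulas are simple ratios, so a scheduled job can compute them from four or five counts. The sketch below uses integer arithmetic and made-up counts. The function names are hypothetical, and check_metric treats each target as a minimum, which is slightly looser than the strict ">" targets in the table.

```shell
#!/bin/sh
# Usage: pct NUMERATOR DENOMINATOR
# Prints the ratio as a whole-number percentage (rounded down).
# A zero denominator yields 0 rather than a division error.
pct() {
  if [ "$2" -eq 0 ]; then
    echo 0
  else
    echo $(( $1 * 100 / $2 ))
  fi
}

# Usage: check_metric NAME VALUE TARGET
# Prints a PASS/FAIL line; returns non-zero when VALUE is below TARGET.
check_metric() {
  if [ "$2" -ge "$3" ]; then
    echo "PASS $1: $2% (target $3%)"
  else
    echo "FAIL $1: $2% (target $3%)"
    return 1
  fi
}

# Example with hypothetical counts
TOTAL=200          # services discovered in the runtime environment
CATALOGED=184      # services with a catalog entry
VALID_OWNER=180    # cataloged services whose owner team exists

echo "coverage: $(pct "$CATALOGED" "$TOTAL")%"
echo "owner completeness: $(pct "$VALID_OWNER" "$CATALOGED")%"
```

With these counts, coverage comes out at 92% and owner completeness at 97%, both below the targets in the table.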

Don’t set coverage targets at 100% on day one. A realistic progression: tier-1 services at 100% coverage and accuracy in month one, tier-2 at 95% by month three, full estate at 90% by month six. Start with what matters most for incident response, then expand.

When catalog health becomes visible (tracked on a dashboard, reviewed weekly by the platform team, summarized monthly for leadership), it gets attention. Treat catalog health like any other SLO: if you'd alert when availability drops below 99.9%, alert when catalog accuracy drops below target too.