On Call Is a Product of Architecture

On call is often treated like a staffing problem.

Who is rotating. How often pages fire. Which alerts wake people up.

Those details matter, but they are not the root cause.

On call quality is largely a product of architecture.

Architecture Decides Who Gets Woken Up

Every architectural decision carries operational weight.

Synchronous dependencies increase blast radius. Tight coupling turns small failures into outages. Hidden retries create noisy cascades. Poor isolation spreads pain across services.

These decisions show up at 2 a.m., not during design reviews.

When you build a service that depends on three other services synchronously, you inherit their failure modes. If any one of them degrades, your service pages. If all three have issues simultaneously, your on-call engineer is troubleshooting four systems at once.
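A rough way to see the inherited failure modes is to multiply availabilities. The sketch below assumes hypothetical numbers (99.9% for the service and each dependency) and independent failures, which is optimistic. `serial_availability` is an illustrative helper, not part of any library.

```python
from math import prod


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Availability of a service that needs every synchronous dependency to succeed.

    Assumes independent failures, so this is a best-case estimate.
    """
    return own * prod(dependencies)


# Hypothetical numbers: the service and each of its three dependencies at 99.9%.
alone = serial_availability(0.999, [])
combined = serial_availability(0.999, [0.999, 0.999, 0.999])

minutes_per_30_days = 30 * 24 * 60
for label, value in (("alone", alone), ("combined", combined)):
    failing_minutes = (1 - value) * minutes_per_30_days
    print(f"{label}: {value:.4%}, about {failing_minutes:.0f} failing minutes per 30 days")
```

Under these assumptions, three synchronous dependencies roughly quadruple the time your service spends failing, before you have written a single bug of your own.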

Tight coupling makes debugging harder. A failure in service A looks like it is coming from service B because they share state or communicate through side effects. The person on call spends hours tracing symptoms instead of fixing root causes.

Hidden retries are worse. A client library retries five times with exponential backoff. Under normal conditions, this is invisible. During an incident, those retries amplify load, trigger rate limits, and create alert storms that obscure the real problem.
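The amplification is easy to underestimate. Here is a back-of-the-envelope sketch, assuming each layer makes one initial attempt plus its configured retries and that every attempt fails during the incident. The three-layer call chain is a hypothetical example.

```python
from math import prod


def worst_case_attempts(retries_per_layer: list[int]) -> int:
    """Upper bound on calls reaching the deepest service for one user request.

    Each layer makes 1 initial attempt plus its retries, and the attempts
    multiply across layers when every call fails.
    """
    return prod(1 + retries for retries in retries_per_layer)


# One client library retrying five times: 6 attempts per request.
single_layer = worst_case_attempts([5])

# Hypothetical chain of three layers, each retrying five times.
three_layers = worst_case_attempts([5, 5, 5])

print(single_layer, three_layers)
```

With one layer, a failing dependency can see up to 6 times its normal traffic. With three stacked layers, the bound is 216 times. Backoff spreads those attempts over time, but it does not reduce how many there are.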

Poor isolation means one team’s mistake becomes everyone’s problem. A memory leak in service X exhausts shared resources. Now three teams are on a bridge call because their services are all affected, even though none of them own the failing component.
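One common way to limit this kind of spread is a bulkhead: a hard cap on how much of a shared resource any one dependency can hold. This is a minimal sketch, assuming a thread-based service. `Bulkhead` and `BulkheadFull` are hypothetical names, and a cap on concurrent calls will not stop a memory leak by itself. It only illustrates the idea of giving each consumer a fixed share.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency has used up its share of concurrency."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve the others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Fail fast instead of queueing: a slow dependency should shed load,
        # not tie up every worker thread in the process.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} is at its concurrency limit")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical usage: one bulkhead per downstream dependency.
reports = Bulkhead("reports-service", max_concurrent=2)
result = reports.run(lambda: "ok")
```

The design choice that matters is rejecting work once the cap is reached. The rejected caller gets a clear, local error, and the page goes to the team that owns the saturated dependency.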

Ownership Follows Failure

When failures are clear, ownership is clear.

When failures are ambiguous, on call suffers.

Architectures that make it hard to answer what failed, why it failed, and who owns it also make it hard to support humans.

People get paged not because systems are broken, but because systems are confusing.

Clear ownership starts with clear boundaries. If every service has a well-defined purpose, a single team responsible for it, and isolated failure modes, pages go to the right people. The engineer who gets woken up actually has the context and access to fix the problem.
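One way to keep this honest is to make ownership machine-readable, so a page can only go to a single, named team. Everything in the sketch below is hypothetical: the service names, the team names, and the `route_page` helper are illustrative, not a real paging API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    purpose: str
    owner: str  # exactly one team; shared ownership is not representable


# Hypothetical catalog entries.
CATALOG = {
    service.name: service
    for service in [
        Service("checkout-api", "Accepts and validates orders", "payments-team"),
        Service("email-worker", "Sends transactional email", "messaging-team"),
    ]
}


def route_page(service_name: str) -> str:
    """Return the team to page, refusing to guess when ownership is unknown."""
    service = CATALOG.get(service_name)
    if service is None:
        raise LookupError(f"no owner registered for {service_name!r}")
    return service.owner


print(route_page("checkout-api"))
```

Raising on an unregistered service turns missing ownership into an explicit error that can be caught in a routine check, instead of a guess made during an incident.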

Ambiguous ownership happens when services share responsibilities, when “platform” owns infrastructure but applications own behavior, or when technical debt has blurred the lines so much that no one is sure who should respond.

The worst on-call experiences happen when the person getting paged is not empowered to fix the issue. They can acknowledge the alert, but then they have to escalate, wait, and coordinate. That frustrates the responder and delays resolution.

Scaling Teams Without Scaling Pages

As systems grow, teams often scale on call by adding people.

That rarely works.

If architecture remains brittle, adding people just means:

more context switching
more handoffs
more confusion
less accountability

Better architecture reduces pages without changing headcount.

Adding people to a broken rotation does not fix the underlying problem. It just spreads the pain across more individuals. Instead of one person being burned out, now three people are partially burned out.

Context switching gets worse. If you expand the rotation to include more engineers, each person on call is less familiar with recent changes. They spend more time getting oriented and more time asking for help from people who are not on call.

Handoffs multiply. A larger rotation means more shift changes between more people. Every handoff is an opportunity for context loss, miscommunication, or dropped follow-ups.

Accountability dilutes. When everyone is occasionally on call but no one is deeply invested, issues linger. Temporary fixes pile up. Root causes never get addressed because no one feels responsible long enough to care.

Better architecture fixes this. Services with clear interfaces, predictable failure modes, and good observability generate fewer pages. The pages that do fire are actionable. On-call engineers can respond quickly and confidently without needing to wake up half the team.

Boring Systems Create Better On Call

The best on-call rotations we had were boring.

Failures were:

contained
predictable
well understood
easy to mitigate

That did not happen by accident. It came from architectural restraint.

Boring does not mean simple. It means understandable. It means that when something fails, the failure is localized, the symptoms are clear, and the mitigation is obvious.

Contained failures do not cascade. If a background job crashes, it does not take down the API. If a cache fails, requests get slower but they still succeed. Graceful degradation is an architectural property, not an operational tactic.
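The cache case can be sketched as follows. This is a minimal illustration, assuming a cache client that raises `ConnectionError` when it is unavailable. `get_profile`, `load_from_db`, and the cache interface are hypothetical.

```python
def get_profile(user_id, cache, load_from_db):
    """Read through the cache, but treat a cache outage as a miss."""
    try:
        cached = cache.get(user_id)
    except ConnectionError:
        cached = None  # cache is down: fall through to the slow path

    if cached is not None:
        return cached

    profile = load_from_db(user_id)
    try:
        cache.set(user_id, profile)
    except ConnectionError:
        pass  # failing to repopulate the cache must not fail the request
    return profile


class DownCache:
    """Stand-in for an unreachable cache, used to exercise the fallback."""

    def get(self, key):
        raise ConnectionError("cache unreachable")

    def set(self, key, value):
        raise ConnectionError("cache unreachable")


# With the cache down, the request is slower but still succeeds.
profile = get_profile(42, DownCache(), lambda user_id: {"id": user_id})
```

The fallback lives in the request path itself, which is what makes it an architectural property: nobody has to be awake for it to work. In a real system the slow path also needs enough capacity to absorb the extra load while the cache is down.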

Predictable failures are easier to handle. If a service always fails the same way under load, on-call engineers recognize the pattern immediately. They do not waste time diagnosing. They apply the known fix and move on.

Well-understood systems have documentation that matches reality, consistent naming, and observable state. The person on call does not need to reverse-engineer the system during an incident. They can look at dashboards, logs, and traces and quickly form a mental model.

Easy mitigation comes from design. If restarting a service fixes most issues, make restarts safe and fast. If scaling up resolves load problems, make scaling automatic or trivial. Remove the need for heroics.

Final Thought

You cannot fix on call with process alone.

If people are burning out, look at the system they are supporting. On call is telling you something about your architecture.

Ignoring that signal is how good engineers leave.
