Systems are often designed for correctness and performance.
Supportability comes later. If at all.
That ordering is backwards.
Supportability Is a Feature
If a system cannot be understood under pressure, it is incomplete.
Supportable systems have:
- clear boundaries
- predictable behavior
- obvious ownership
- simple failure modes
- visible state
These are design choices.
Clear boundaries mean you know where one service ends and another begins. You can reason about dependencies. You can isolate failures. You can answer “is this my problem or someone else’s?” quickly.
Predictable behavior means the system acts consistently. It does not have hidden modes or special cases. If you scale it, it scales. If you restart it, it comes back healthy. If you send it bad input, it rejects it gracefully.
Obvious ownership means it is clear who is responsible for what. Every service has a team. Every alert has an owner. Every configuration change is reviewed by someone who understands the impact.
Simple failure modes mean failures are easy to diagnose. The service crashes loudly instead of limping along in a degraded state. Errors are logged clearly. Symptoms point directly to causes.
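As a rough sketch of what crashing loudly can look like, here is a fail-fast startup check in Python. The variable names are hypothetical; the point is that missing configuration stops the process immediately with a clear message instead of letting it limp along.

```python
import os
import sys

# Hypothetical required settings for this example.
REQUIRED_VARS = ["DATABASE_URL", "QUEUE_URL", "SERVICE_NAME"]

def check_startup_config() -> None:
    """Fail fast and loudly: refuse to start with incomplete configuration
    rather than limping along in a degraded state."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        print(f"FATAL: missing required configuration: {', '.join(missing)}",
              file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    check_startup_config()
    # Only start serving traffic once configuration is known to be complete.
```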
Visible state means you can observe what the system is doing. Metrics show load and latency. Logs show decisions and errors. Traces show request flow. You do not have to guess or attach a debugger.
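A minimal sketch of visible state, using the Python prometheus_client library as one possible choice. The metric names, service name, and port are made up; what matters is that load, latency, and errors are exposed where anyone can read them.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for an "orders" service.
REQUESTS = Counter("orders_requests_total", "Requests handled, by outcome", ["outcome"])
LATENCY = Histogram("orders_request_seconds", "Request latency in seconds")

def handle_request() -> None:
    with LATENCY.time():                        # record how long the request took
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
    outcome = "ok" if random.random() > 0.05 else "error"
    REQUESTS.labels(outcome=outcome).inc()

if __name__ == "__main__":
    start_http_server(9100)                     # metrics served at :9100/metrics
    while True:
        handle_request()
```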
All of these properties must be designed in from the start. Retrofitting supportability is expensive and often incomplete.
Complexity Is a Tax on Humans
Every layer of abstraction adds cognitive load.
That cost is invisible during development. It is very visible during incidents.
Complex systems require heroic support. Heroics do not scale.
Developers build abstractions to manage complexity. That is good. But every abstraction is a layer someone has to understand during an incident.
A service calls a library that calls a framework that calls a platform API. When something goes wrong, the person debugging has to understand all four layers. If any one of those layers behaves unexpectedly, diagnosis becomes a research project.
During development, complexity is manageable. You can take your time. You can read code. You can ask the person who wrote it. You have context.
During incidents, none of that is true. You are under pressure. The person who wrote it might be asleep. Context is limited to logs and metrics. Complexity that seemed reasonable during development becomes a barrier to resolution.
Heroic support means one or two people are the only ones who truly understand the system. They are the ones who get called during every major incident. They are the bottleneck for every difficult debugging session. This does not scale. It burns people out and creates key-person risk.
Reducing complexity is not about making systems simplistic. It is about making them understandable. Fewer moving parts. Clearer contracts. Less magic. More explicitness.
Consistency Reduces Cognitive Load
One of the most effective changes we made was boring.
We standardized:
- naming
- deployment patterns
- logging structure
- alert semantics
- environment layout
Nothing got faster. Everything got easier.
Consistency means patterns transfer. If you understand one service, you understand most services. You know where to find logs. You know how to check recent deployments. You know what environment variables matter.
Naming conventions help enormously. If every service follows team-service-env naming, you can identify ownership and environment instantly. If every resource group follows a standard pattern, you can navigate Azure subscriptions without asking for help.
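For example, if the convention is team-service-env (the exact pattern here is hypothetical), ownership and environment can be read straight out of a name:

```python
import re
from typing import NamedTuple

# Hypothetical convention: <team>-<service>-<env>, e.g. "payments-ledger-prod".
NAME_PATTERN = re.compile(r"^(?P<team>[a-z0-9]+)-(?P<service>[a-z0-9]+)-(?P<env>dev|staging|prod)$")

class ResourceName(NamedTuple):
    team: str
    service: str
    env: str

def parse_resource_name(name: str) -> ResourceName:
    """Parse a standard resource name so ownership and environment
    are obvious without asking anyone."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the team-service-env convention")
    return ResourceName(**match.groupdict())

print(parse_resource_name("payments-ledger-prod"))
# ResourceName(team='payments', service='ledger', env='prod')
```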
Deployment patterns matter. If every service uses the same CI/CD pipeline, troubleshooting deployment issues is faster. If every service handles configuration the same way, you do not have to relearn how secrets are loaded for each one.
Logging structure is critical. If every service logs in JSON with consistent field names, you can build queries that work across all services. You can correlate events. You can automate analysis.
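A sketch of what that can look like with the Python standard library. The field names (timestamp, level, service, env, message) are one possible set; the value is that every service uses the same ones.

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with consistent field names,
    so a query written for one service works for all of them."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "ledger",   # hypothetical service name
            "env": "prod",         # hypothetical environment
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("ledger")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment accepted")
# {"timestamp": "...", "level": "INFO", "service": "ledger", "env": "prod", "message": "payment accepted"}
```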
Alert semantics reduce confusion. If every alert includes service name, environment, and severity in a standard format, you know immediately how urgent it is and who should respond.
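One way to make that concrete, with hypothetical field and severity names:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Alert:
    service: str    # which service fired
    env: str        # dev, staging, or prod
    severity: str   # e.g. "page" (wake someone) vs "ticket" (next business day)
    summary: str    # one line a responder can act on

    def title(self) -> str:
        return f"[{self.env}][{self.severity}] {self.service}: {self.summary}"

alert = Alert(service="ledger", env="prod", severity="page",
              summary="error rate above 5% for 10 minutes")
print(alert.title())            # [prod][page] ledger: error rate above 5% for 10 minutes
print(json.dumps(asdict(alert)))
```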
Environment layout helps with muscle memory. If development, staging, and production are set up identically except for scale and data, promoting changes is predictable. Debugging production does not require learning a different configuration.
None of these changes made systems faster or cheaper. But they made them easier to operate. That is a multiplier on everything else.
Make the Happy Path Obvious
Well-designed systems make the normal case easy to recognize.
When something is off, it stands out.
That reduces time to understanding and prevents overreaction.
If dashboards show green most of the time, red is meaningful. If logs are usually quiet except for a few key events, a spike in error logs is obvious. If latency is normally flat, a sudden increase is immediately visible.
This requires designing for observability. Do not log everything. Log what matters. Do not create metrics for every internal variable. Create metrics that represent user experience and system health.
Making the happy path obvious also means reducing noise. If your system constantly produces warnings that do not matter, people learn to ignore them. When a real problem occurs, it blends into the background.
Good defaults help. If a service starts healthy by default and stays healthy unless something is wrong, you know that unhealthy means investigate. If a service requires manual intervention to stay healthy, you never know whether unhealthy is normal or a problem.
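A minimal sketch of that idea: a health endpoint that reports healthy unless a real dependency check fails. The /healthz path and port are assumptions, not a prescription.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok() -> bool:
    """Replace with real checks (database reachable, queue reachable, ...)."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        healthy = dependencies_ok()
        self.send_response(200 if healthy else 503)
        self.end_headers()
        # Unhealthy means "go investigate", not "this is normal".
        self.wfile.write(b"ok" if healthy else b"unhealthy")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```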
Design for the Worst Day
The true test of a system is not how it performs when everything works.
It is how it behaves when:
- people are tired
- context is missing
- time is limited
- stakes are high
Design for that day.
Incidents happen at the worst times. Middle of the night. During a holiday. When key people are unavailable. When multiple things are broken at once.
Designing for the worst day means assuming the person responding has minimal context. They might be new to the team. They might not have worked on this service before. They might be half asleep.
Can they still figure out what is wrong and how to fix it? Or does the system require deep expertise and perfect clarity?
Designing for the worst day also means building in safeguards. Rollback should be easy and safe. Scaling should be quick. Restarts should be non-disruptive. You do not want high-risk operations during high-stress moments.
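As one sketch of a non-disruptive restart, here is a worker that drains in-flight work when it receives SIGTERM instead of dying mid-task. The work loop is a placeholder.

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    """Stop taking new work; let in-flight work finish before exiting."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main() -> None:
    while not shutting_down:
        # Placeholder for pulling and processing one unit of work.
        time.sleep(0.1)
    # Drain: flush buffers, close connections, then exit cleanly.
    sys.exit(0)

if __name__ == "__main__":
    main()
```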
It means having multiple ways to get information. If the primary monitoring system is down, can you still see what is happening? If logs are not searchable, can you still access raw files? Redundancy in observability tools is not wasteful. It is essential.
It means designing systems that fail safely. If a dependency is unavailable, degrade gracefully instead of crashing. If load spikes, shed non-critical work instead of falling over.
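A small sketch of graceful degradation. The product page and recommendation call are hypothetical; the point is that a failure in a non-critical dependency is absorbed instead of taking the whole request down.

```python
def get_recommendations(user_id: str) -> list[str]:
    """Hypothetical call to a non-critical dependency."""
    raise TimeoutError("recommendation service unavailable")

def render_product_page(user_id: str) -> dict:
    """Serve the core page even when the optional dependency fails."""
    page = {"product": "example-product", "recommendations": []}
    try:
        page["recommendations"] = get_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        # Non-critical work we can shed; log it and move on.
        pass
    return page

print(render_product_page("user-123"))
# {'product': 'example-product', 'recommendations': []}
```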
All of this comes from empathy. Put yourself in the shoes of the person on call at 2 a.m. What information do they need? What actions should be easy? What mistakes should the system prevent?
Final Thought
Systems are supported by people, not diagrams.
If a system requires constant explanation, it will fail socially before it fails technically.
Designing systems people can actually support is not optional. It is responsible engineering.
Related reading: