Operations

AI Usage Scales Differently Than Compute

Why usage patterns surprise teams.

Teams often assume AI scales like infrastructure. More users. More requests. More cost. That intuition breaks down quickly.

Compute Scales With Load

Traditional compute scales with demand. Requests per second. Concurrent users. CPU utilization. You can model that. You can predict it. You can cap it. AI does not behave that way.

Infrastructure costs are relatively predictable. You know how much a server costs per hour. You can estimate how many requests a service handles per second. You can project cost based on expected traffic growth. ...
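The infrastructure projection described above is simple arithmetic, and a minimal sketch makes that concrete. Every name and number here (`project_costs`, the 730-hour month, the per-server capacity and hourly price) is an assumption for illustration, not a figure from the article.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def servers_needed(peak_rps, rps_per_server):
    """Servers required to carry the peak load, rounded up, never below one."""
    return max(1, math.ceil(peak_rps / rps_per_server))


def monthly_cost(peak_rps, rps_per_server, cost_per_server_hour):
    """Monthly cost of running enough servers for the given peak load."""
    count = servers_needed(peak_rps, rps_per_server)
    return count * cost_per_server_hour * HOURS_PER_MONTH


def project_costs(start_rps, monthly_growth, months, rps_per_server, cost_per_server_hour):
    """Project monthly cost under compound traffic growth.

    Returns a list of (month_index, cost) pairs, starting at month 0.
    """
    projection = []
    for month in range(months + 1):
        rps = start_rps * (1 + monthly_growth) ** month
        projection.append((month, monthly_cost(rps, rps_per_server, cost_per_server_hour)))
    return projection


if __name__ == "__main__":
    # Hypothetical inputs: 1000 req/s today, 10% monthly growth,
    # 250 req/s per server, 0.50 per server-hour.
    for month, cost in project_costs(1000, 0.10, 2, 250, 0.50):
        print(month, cost)
```

The point of the sketch is that every input is something a team can measure or look up, which is what makes traditional compute cost easy to model and cap.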

Designing Systems People Can Actually Support

Systems are often designed for correctness and performance. Supportability comes later. If at all. That ordering is backwards.

Supportability Is a Feature

If a system cannot be understood under pressure, it is incomplete. Supportable systems have clear boundaries, predictable behavior, obvious ownership, simple failure modes, and visible state. These are design choices.

Clear boundaries mean you know where one service ends and another begins. You can reason about dependencies. You can isolate failures. You can answer “is this my problem or someone else’s” quickly. ...

Alert Fatigue Is an Organizational Problem

Alert fatigue is often blamed on tooling. Too many alerts. Bad thresholds. Noisy systems. Those things matter, but they are symptoms. Alert fatigue is an organizational problem.

Alerts Reflect What the Org Cares About

Every alert encodes a value judgment. This matters. This is urgent. Someone should wake up. When everything is urgent, nothing is.

Organizations that cannot agree on priorities produce alert storms. The system is only reflecting that confusion. ...

Why Runbooks Failed Us

We invested heavily in runbooks. We documented scenarios. We listed steps. We added screenshots. And during incidents, people still asked for help.

Runbooks Assume the Wrong Thing

Most runbooks assume that the problem is known. “If X happens, do Y.” Real incidents rarely look like that. Instead, people are asking what changed, what is failing, whether this is new or known, and who else is affected. Runbooks start too late.

The typical runbook structure is procedural. It starts with a symptom and walks through resolution steps. “If CPU is high, restart the service. If that does not work, check for memory leaks. If memory is fine, scale horizontally.” ...
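Written as code, the quoted runbook shows where the procedural structure runs out. This is a rough sketch only: the `Observation` fields and the `next_step` function are hypothetical names, and the branch for a confirmed memory problem is an assumption, since the quoted runbook does not cover it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What the responder already knows (hypothetical fields)."""
    cpu_high: bool
    restart_attempted: bool = False
    memory_ok: Optional[bool] = None  # None means memory has not been checked yet


def next_step(obs: Observation) -> Optional[str]:
    """Return the runbook's next instruction, or None if it has nothing to say."""
    if not obs.cpu_high:
        # The runbook is keyed on a known symptom; without it there is no entry point.
        return None
    if not obs.restart_attempted:
        return "restart the service"
    if obs.memory_ok is None:
        return "check for memory leaks"
    if obs.memory_ok:
        return "scale horizontally"
    # Assumption: the quoted runbook does not say what to do here.
    return "investigate the memory leak"


if __name__ == "__main__":
    print(next_step(Observation(cpu_high=True)))
    print(next_step(Observation(cpu_high=False)))
```

Every branch depends on the symptom already being identified. None of the questions people ask first (what changed, what is failing, who else is affected) appears anywhere in the decision chain.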

On Call Is a Product of Architecture

On call is often treated like a staffing problem. Who is rotating. How often pages fire. Which alerts wake people up. Those details matter, but they are not the root cause. On call quality is largely a product of architecture.

Architecture Decides Who Gets Woken Up

Every architectural decision carries operational weight. Synchronous dependencies increase blast radius. Tight coupling turns small failures into outages. Hidden retries create noisy cascades. Poor isolation spreads pain across services. ...
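One way to make the blast-radius claim concrete is to treat synchronous calls as edges in a graph and ask which services transitively depend on a failed one. The sketch below assumes a simple model in which any synchronous caller of a failed service also fails. The service names and the `blast_radius` function are hypothetical.

```python
from collections import deque


def blast_radius(sync_deps, failed):
    """Return the services that transitively call `failed` synchronously.

    `sync_deps` maps each service to the set of services it calls synchronously.
    """
    # Reverse the edges: for each dependency, record who calls it.
    callers = {}
    for service, deps in sync_deps.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller != failed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    # Hypothetical topology for illustration only.
    deps = {
        "checkout": {"payments", "inventory"},
        "payments": {"ledger"},
        "inventory": set(),
        "ledger": set(),
        "search": set(),
    }
    print(sorted(blast_radius(deps, "ledger")))
```

In this toy topology a `ledger` failure reaches both `payments` and `checkout`, so three teams may be paged for one fault. Making the `payments` to `ledger` call asynchronous would remove that edge and shrink the set.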