On-Call at Five Engineers

Most five-engineer teams either pretend they have no on-call or copy a fifty-person playbook. Both fail, in different ways. The team that pretends has a silent panic the first time something breaks at 2 AM. The team that copies the playbook spends six weeks building infrastructure they will throw away two quarters later. There is a version of on-call sized for five engineers, and almost nobody tells you what it is.

And in 2026, “five engineers” often means two humans and three AI agents shipping code on their behalf. That does not make the on-call problem smaller. It makes it bigger, in a specific way I will get to.

At three engineers, on-call is informal by default. “Whoever last touched it, Slack them.” At fifty, you have a real rotation with runbooks, escalation tiers, and a paging service. At five, you are in the gap where the small-team version stops working and the big-team version is too heavy. The signal that you are in the gap is when something breaks on a weekend and nobody knows who is responsible for looking at it.

Why On-Call Breaks Differently at Five

Three engineers can hold the whole system in their heads. They know what each piece does, what tends to break, and what to do when it does. The shared mental model is the runbook. You do not need to write it down.

By the time you have five engineers, the shared mental model breaks. There is now code that one person wrote and nobody else has read in detail. The person who wrote it remembers, kind of, but they are not always available. The other four engineers will get paged about systems they have never debugged.

In 2026 this gets worse. When much of your code is written with AI assistance, the original author often did not write it line by line either. They reviewed it, accepted it, shipped it. Three months later, when a part of it pages someone, even the author cannot reliably explain it from memory. The “I wrote it so I remember it” fallback you used to lean on at three engineers is mostly gone by five.

And in the version where some of your “five engineers” are agents shipping pull requests under human review, the problem doubles down. The agents do not carry pagers. They cannot be Slack-pinged at 2 AM. The human who reviewed the agent’s PR gets paged for code they did not write, written by something that is not available to help debug it.

This is the moment to add structure. Not a lot. Just enough that any one of the humans on your team can handle a 2 AM page without having to remember what the code does, or who wrote it, or whether it was a human or an agent.

The Five Tiers I See in Practice

Five-engineer teams I work with usually sit at one of these five tiers, and the difference between the bottom and the top is not the tooling. It is the discipline around what triggers a page.

Tier 1: Hope nothing breaks. No rotation, no alerts, no plan. The team will tell you they are too early for on-call. They are usually one outage away from a real conversation.

Tier 2: Slack the engineer who wrote it. Informal. Whoever last touched the system gets the message. Works at three engineers. At five, it means one or two people carry the whole burden and the others coast. The ones carrying it will burn out first and quit first.

Tier 3: Rotation, no runbook. Real progress. One person is on-call each week, everyone knows who, you have a paging tool. But when they get paged about a system they have never seen, they have to wake up the person who wrote it anyway. The rotation is theatre.

Tier 4: Rotation, runbook. Now the runbook tells the on-call engineer what to actually do, for the top five things that actually break. Single file, kept current. The on-call engineer can handle most pages without escalation. The system is functional.

Tier 5: Rotation, runbook, and alerts that actually mean something. The final tier. All of the above, plus alert hygiene. Every alert means “act on this in the next hour, or a customer notices.” Nothing else pages. Dashboards exist for the rest. This is the version that does not destroy your team.
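
The Tier 5 rule is small enough to write down as code. A minimal sketch in Python, assuming each alert is hand-labeled with two fields. The Alert type, its field names, and the route function are hypothetical and not taken from any paging tool.

```python
from dataclasses import dataclass
from datetime import timedelta


# Hypothetical alert shape; a real paging tool will have its own schema.
@dataclass(frozen=True)
class Alert:
    name: str
    customer_facing: bool        # would a customer notice the degradation?
    time_to_impact: timedelta    # rough estimate of when they would notice


PAGE_WINDOW = timedelta(hours=1)  # "act on this in the next hour"


def route(alert: Alert) -> str:
    """Page only for customer-facing problems that bite within the hour."""
    if alert.customer_facing and alert.time_to_impact <= PAGE_WINDOW:
        return "page"
    return "dashboard"


print(route(Alert("checkout-5xx", True, timedelta(minutes=5))))   # page
print(route(Alert("disk-70-percent", False, timedelta(days=3))))  # dashboard
```

The point of the sketch is that there are only two outcomes. Anything that is not a page goes to a dashboard, not to a quieter kind of page.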

Where the Real Breakage Happens

The thing that kills five-engineer teams is not the breakage. It is the false alarms. The disk-space alert that triggers every Tuesday at 6 AM. The latency spike that always recovers in two minutes. The integration test that flakes once a week. Each of those, in isolation, is a small thing. In aggregate, they teach your on-call engineer that pages do not matter.

Once that lesson is learned, the real outage gets ignored too. The alert that signals a customer is actually down looks the same on their phone as the alert that fires every Tuesday.

Five engineers, ten false pages a week, three of them at 2 AM, and you will lose somebody within a quarter. Either to burnout or to a competing offer from a place that has its on-call sorted.
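
One way to find the false alarms is to tally a week of page history. The sketch below assumes a hypothetical export where each page records the alert name and whether the on-call engineer had to do anything. The field names and the 50% threshold are illustrative, not a standard.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record of one page; adapt to whatever your paging tool exports.
@dataclass(frozen=True)
class Page:
    alert: str
    action_taken: bool  # False if it recovered on its own or was a flake


def noisy_alerts(pages: list[Page], max_false_share: float = 0.5) -> list[str]:
    """Return alert names whose share of no-action pages exceeds the threshold."""
    total = Counter(p.alert for p in pages)
    false = Counter(p.alert for p in pages if not p.action_taken)
    return sorted(
        name for name, count in total.items()
        if false[name] / count > max_false_share
    )


history = [
    Page("disk-space", False),
    Page("disk-space", False),
    Page("latency-spike", False),
    Page("latency-spike", True),
    Page("checkout-down", True),
]
print(noisy_alerts(history))  # ['disk-space']
```

Anything this flags is a candidate to move off the pager and onto a dashboard.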

The Five Rules I Use With Founders

1. One person is on-call each week. Real rotation, named in a calendar or a Slack channel topic. Everyone knows who it is right now.

2. One file is the runbook. Not a wiki. Not a folder. One markdown file with the top five things that break, the symptoms, and what to do. Updated after every page.

3. Pages mean act within the hour. Everything else is a dashboard or a daily summary email. If it is not customer-impacting in the next hour, it is not a page.

4. Customer impact only. Internal noise stays in a channel that nobody is on-call for. Only customer-facing degradation pages.

5. One post-mortem template, one page, every page. Every time someone gets paged, fill in the template. What happened, what we did, what would have prevented it. One page. Not a Confluence epic.
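
Rule 1 can be as small as one function that a person or a Slack bot can call. A sketch assuming a fixed weekly rotation that hands over on Mondays. The names and the start date are placeholders, and the list holds humans only, since agents do not carry pagers.

```python
from datetime import date

# Placeholder names; list only the humans who can actually be paged.
ROTATION = ["alice", "bob", "carol", "dan", "erin"]
# Placeholder start date; 2026-01-05 is a Monday, so handover happens on Mondays.
ROTATION_START = date(2026, 1, 5)


def on_call(today: date) -> str:
    """Return who is on-call for the week containing `today`."""
    weeks_elapsed = (today - ROTATION_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]


print(on_call(date(2026, 1, 5)))   # alice
print(on_call(date(2026, 1, 12)))  # bob
```

Whatever the mechanism, the output should end up somewhere visible, such as the calendar or the Slack channel topic, so nobody has to run anything at 2 AM to find out who is on.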

The Diagnostic Question

Before you decide whether your on-call is working, ask one question.

If our most senior engineer gets paged at 2 AM tonight for a piece of code nobody on the team wrote line by line, do they have one file they can open that tells them what to do?

If yes, you have a runbook. If no, fix the runbook before you fix the rotation. A rotation without a runbook is theatre, because the on-call engineer cannot actually act on most pages. They are just the person who gets woken up first, before they wake somebody else up.

The version where some of your team is human and some is agents only makes this sharper. The runbook is the file that turns “code somebody reviewed once” into “code somebody can fix at 2 AM.” Without it, you have a rotation that depends on memory the team no longer has.
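
The diagnostic can also be checked mechanically. A sketch, assuming the runbook is one markdown file with a "## " heading per alert and that you can list the names of the alerts that page. Both conventions are assumptions for illustration.

```python
def uncovered_alerts(paging_alerts: list[str], runbook_text: str) -> list[str]:
    """Return paging alerts with no matching '## <alert>' heading in the runbook."""
    headings = {
        line[3:].strip().lower()
        for line in runbook_text.splitlines()
        if line.startswith("## ")
    }
    return [a for a in paging_alerts if a.lower() not in headings]


# Hypothetical runbook content, kept short for the example.
runbook = """# Runbook

## checkout-down
Symptoms: 5xx on /checkout. Do: roll back the last deploy.

## db-connections-exhausted
Symptoms: timeouts across the API. Do: restart the pooler.
"""

print(uncovered_alerts(["checkout-down", "queue-backlog"], runbook))  # ['queue-backlog']
```

An empty result does not prove the runbook is good. It only proves that every alert that can wake someone has a section to open.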

When to Add the Heavier Process

Around eight to ten engineers, the lightweight version starts to break. You will need escalation tiers, because not every page should wake the senior engineer. You will need multi-channel paging, because Slack is no longer enough. You will need a formal blameless review process, because the team is large enough that informal trust does not carry every conversation.

Add those then. Not before. Process you preemptively install almost never gets removed, and it has overhead that a five-engineer team cannot afford.

Let’s Talk

If you are running a small engineering team and trying to figure out which pieces of on-call structure are worth installing at your stage, especially as more of your code starts to be written with agents in the loop, that is the kind of question I work through with founders. I take on senior async architecture and operations work for teams making exactly these decisions. If that sounds like your situation, reach out.
