You set up human-in-the-loop for your AI agents. Reviewers get a notification, they approve or deny commands, the agent proceeds. Congratulations — you have oversight.

Three weeks later, your senior engineer confesses: "I've just been approving everything. There's like forty of them a day."

You have oversight in theory. In practice, you have a rubber stamp.

The Anatomy of Approval Fatigue

Approval fatigue isn't laziness. It's a predictable outcome of poor system design. When the volume of approval requests exceeds a person's capacity for careful evaluation, cognitive shortcuts kick in. The approval becomes automatic. The safety control becomes noise.

It follows a familiar arc:

Week 1: The reviewer reads every command carefully. Context is fresh, stakes feel real, attention is high.

Week 2: Patterns emerge. Most commands are routine. The reviewer starts scanning rather than reading. Approval time drops from 30 seconds to 5.

Week 3: The reviewer has learned that "routine" means "safe." They start approving before reading. The queue clears faster. Everyone is happy.

Week 4: A non-routine command arrives. The reviewer approves it on autopilot. The deployment breaks. Nobody can explain how it got through.

This isn't hypothetical — it's how alert fatigue works in security operations, how alarm fatigue works in intensive care units, and how it works in AI oversight. The mechanism is identical.

Why Naive Human-in-the-Loop Fails

The naive design is: every command gets a human review. It sounds thorough. It's actually a recipe for fatigue-driven rubber-stamping.

The problem is signal-to-noise ratio. If 95% of commands are routine and safe, the human's attention calibrates to "routine." When the 5% of genuinely risky commands arrive, they look like routine commands to a fatigued reviewer.

There's also a queue pressure problem. Commands pile up. The agent is waiting. Downstream work is blocked. The reviewer feels urgency to clear the queue. Urgency and careful review are incompatible.

Finally, there's context collapse. A reviewer seeing command #38 of the day has lost track of what commands #1-37 were doing. They can't see whether the current command makes sense in context. Each approval is evaluated in isolation, which means compound risk — a series of individually-reasonable commands that together produce a dangerous outcome — is invisible.

What Good Oversight Actually Looks Like

The goal isn't maximum human involvement. It's effective human judgment at the moments it matters. That requires a different design.

Risk-stratified review

Not all commands need human review. git status does not need a reviewer. DROP TABLE users absolutely does. The system should route commands based on assessed risk, and the risk assessment should be explicit and auditable.

A well-calibrated risk engine handles the routine automatically and surfaces the anomalies. Human attention is a finite resource — it should be reserved for decisions that actually benefit from human judgment.

The risk factors that matter:

  • Reversibility — Can this be undone? Write operations are higher risk than reads. Destructive operations are highest.
  • Blast radius — What's the scope of potential damage? A command affecting a single row is different from one affecting all rows.
  • Context anomaly — Does this command fit the pattern of what the agent has been doing? A shell command that doesn't match the current task is suspicious.
  • Novelty — Has this command been seen before? A command that merely pattern-matches an existing whitelist entry deserves more scrutiny than an exact repeat of a previously approved one.
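These factors can be combined into an explicit, auditable score. The sketch below is illustrative only: the regex patterns, weights, and 0-1 scale are assumptions, not a production risk model, and blast radius is omitted because it usually requires knowledge of the target system.

```python
# Illustrative risk scoring sketch. Patterns and weights are assumptions.
import re

DESTRUCTIVE = [r"\brm\s+-rf\b", r"\bDROP\s+TABLE\b", r"\bTRUNCATE\b"]
WRITE_HINTS = [r"\bUPDATE\b", r"\bDELETE\b", r"\bmv\b", r">\s*\S"]

def risk_score(command: str, task_keywords: set[str], seen_before: bool) -> float:
    """Combine reversibility, context anomaly, and novelty into a 0-1 score."""
    score = 0.0
    # Reversibility: destructive > write > read.
    if any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE):
        score += 0.5
    elif any(re.search(p, command, re.IGNORECASE) for p in WRITE_HINTS):
        score += 0.25
    # Context anomaly: no overlap with the stated task is suspicious.
    tokens = set(re.findall(r"\w+", command.lower()))
    if not tokens & task_keywords:
        score += 0.25
    # Novelty: first-seen commands get extra scrutiny.
    if not seen_before:
        score += 0.25
    return min(score, 1.0)
```

Under this sketch, a familiar read-only command in context scores 0.0 and is auto-allowed, while a novel destructive command outside the task's context scores 1.0 and is routed to a human.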

Whitelist the routine, review the novel

The fastest path to reducing review burden without reducing safety is an intelligent whitelist. Commands that have been reviewed, approved, and verified safe for a given context can be automatically allowed. The reviewer's past judgment is preserved and reused.

The key word is "intelligent." A whitelist that's too broad defeats the purpose. A whitelist that requires exact-match strings is too narrow to be useful. The sweet spot is pattern-based matching with context awareness — git commit -m <anything> is safe; rm -rf <anything> is not.
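A minimal sketch of that sweet spot, assuming a regex-per-entry design with per-context approval (the entry patterns and context names here are invented for illustration):

```python
# Hypothetical pattern-based whitelist with context awareness.
import re

# Each entry: (compiled pattern, set of contexts it is approved for).
WHITELIST = [
    (re.compile(r"^git (status|diff|log)\b"), {"any"}),
    (re.compile(r"^git commit -m .+"), {"feature-work"}),
    (re.compile(r"^pytest\b"), {"feature-work", "ci"}),
]

def is_whitelisted(command: str, context: str) -> bool:
    """Allow only commands matching an approved pattern in an approved context."""
    return any(
        pattern.match(command) and ("any" in contexts or context in contexts)
        for pattern, contexts in WHITELIST
    )
```

Note that rm -rf never appears in the list at all: destructive patterns belong in the risk engine, not the whitelist, no matter how often they are approved.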

Batch context for reviewers

Reviewers lose context across a long queue. Fix this by surfacing it explicitly. Show the reviewer what the agent has been doing in the last N minutes, what the stated task is, and how the current command relates to that task. A command that looks strange in isolation looks less strange — or more strange — in context.

Also: surface anomaly detection results to the reviewer. "This command doesn't match the agent's current task" is information the reviewer can act on. "Command #38 of 40" is not.
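One way to package that context, sketched with hypothetical field names: the review request carries the stated task, recent activity, and anomaly flags, and the card renders all of it next to the command rather than the command alone.

```python
# Sketch of a reviewer-facing "review card"; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class ReviewRequest:
    command: str
    stated_task: str
    recent_commands: list[str]            # agent activity in the last N minutes
    anomaly_flags: list[str] = field(default_factory=list)

def render_review_card(req: ReviewRequest) -> str:
    """Show the reviewer the context they need, not just the command."""
    lines = [
        f"TASK: {req.stated_task}",
        f"COMMAND: {req.command}",
        "RECENT ACTIVITY:",
        *[f"  - {c}" for c in req.recent_commands[-5:]],  # last five only
    ]
    lines.extend(f"WARNING: {flag}" for flag in req.anomaly_flags)
    return "\n".join(lines)
```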

Slow down the queue deliberately

Queue pressure drives rubber-stamping. One intervention: for high-risk commands, add an intentional pause — a configurable delay that forces the reviewer to wait before approving. This breaks the "click to clear" reflex. Even 10 seconds is enough to interrupt the autopilot pattern.

This sounds counterproductive. In practice, it's one of the most effective behavioral interventions we know of, borrowed from adjacent domains (securities trading, clinical decision support).
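One possible shape for that pause, as a sketch (the delay and the 0.5 risk threshold are arbitrary assumptions): high-risk requests record when they were presented, and an approval that arrives before the delay expires is simply refused.

```python
# Sketch of a cooling-off gate for high-risk approvals.
import time

class DelayedApprovalGate:
    """High-risk requests cannot be approved until a forced pause elapses."""

    def __init__(self, delay_seconds: float = 10.0, threshold: float = 0.5):
        self.delay = delay_seconds
        self.threshold = threshold
        self._presented_at: dict[str, float] = {}

    def present(self, request_id: str, risk: float) -> None:
        """Record when a high-risk request was first shown to the reviewer."""
        if risk >= self.threshold:
            self._presented_at[request_id] = time.monotonic()

    def can_approve(self, request_id: str) -> bool:
        shown = self._presented_at.get(request_id)
        if shown is None:          # low-risk: no forced pause
            return True
        return time.monotonic() - shown >= self.delay
```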

Rotate reviewers

A single reviewer will fatigue. With two reviewers rotating shifts, attention stays fresher. Multi-party approval — requiring two reviewers to approve before execution — is appropriate for the highest-risk operations, and it has the side benefit that both reviewers know the other one is watching.

Audit the approvals themselves

If you want to know whether your oversight is working, measure it. Track approval time distributions. A reviewer who approves 40 commands in 4 minutes is not reviewing carefully. Track approval rates — a 99.8% approval rate suggests the risk filter upstream isn't doing its job. Track anomalies that got approved — flag them in retrospective reviews.

The audit trail for your AI agent commands should include the time each approval took, who approved, and what the risk score was. This is the data that tells you whether your oversight is real or theatrical.
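A sketch of what that audit might compute, assuming each log entry records review time, decision, and risk score (the 5-second "rubber stamp" heuristic is an assumption, not a standard):

```python
# Sketch of an approval-audit summary over a list of log entries.
from statistics import median

def audit_report(approvals: list[dict]) -> dict:
    """approvals: [{"seconds": float, "approved": bool, "risk": float}, ...]"""
    times = [a["seconds"] for a in approvals]
    approved = sum(a["approved"] for a in approvals)
    return {
        "median_review_seconds": median(times),
        "approval_rate": approved / len(approvals),
        # Sub-5-second approvals of risky commands are likely rubber stamps.
        "suspect_approvals": sum(
            1 for a in approvals
            if a["approved"] and a["seconds"] < 5 and a["risk"] >= 0.5
        ),
    }
```

A median review time near zero, an approval rate near 100%, or a nonzero suspect count are each signals worth a retrospective look.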

The Cognitive Load Budget

A useful mental model: a reviewer has a cognitive load budget. Each careful review costs some of that budget. When the budget runs out, reviews stop being careful.

Your job, as a system designer, is to stay well within that budget. That means:

  • Auto-approving everything below a risk threshold (so the budget is spent on what matters)
  • Making each review as fast as possible by surfacing the right context
  • Not routing commands to human review that shouldn't be routed to human review
  • Spreading load across multiple reviewers

A system that asks humans to review 50 commands a day will produce worse outcomes than one that asks humans to review 5 commands a day — even if the 5-command system lets more through automatically. Because the attention paid to those 5 commands will be real.

Automation That Helps vs. Automation That Replaces

There's an important distinction between automation that supports human judgment and automation that replaces it.

AI-assisted approval suggestions ("this command looks safe based on pattern analysis") can reduce cognitive load without removing human agency. The human still decides — they just have better information. The automation is a tool, not a rubber stamp.

Auto-approval of entire risk categories is different. It removes human judgment from those decisions entirely. That's appropriate for genuinely low-risk operations. It's not appropriate for operations where something can go wrong.

The question to ask: if something bad happens via an auto-approved command, are you comfortable with that? If the answer is no, it shouldn't be auto-approved.

Oversight That Scales

The good news: well-designed oversight gets better over time, not worse. As the whitelist grows with reviewed commands, the volume of novel commands requiring review shrinks. The reviewer's attention is increasingly concentrated on genuinely new or high-risk operations. Fatigue decreases as the system learns.

This is the right direction of travel: start with high review volume, gradually automate the routine, maintain vigilance on the novel. Not the other way around.

Human-in-the-loop is only as good as the humans in the loop. If your design burns through their attention budget on routine operations, the loop is broken. Design for attention efficiency first, and the safety properties follow.


Expacti uses risk scoring, intelligent whitelisting, and multi-party approval to route only commands that need human judgment to human reviewers — keeping approval volume manageable and oversight real. See the demo or start a trial.