Comprehension Debt: When Code Ships Without Theory

Why a two-day debug session on a one-month-old AI-generated bug is not a debugging problem but a theory-building problem you skipped, and the operating discipline that makes the missing theory recoverable.

By Jovani Pink | May 2, 2026 | 13 min | Platform & AI Engineering

Outcome focus: Reader has a working definition of comprehension debt distinct from technical debt, three questions to test whether a theory exists for an AI-generated component, a PR comprehension scoring rubric, and a deliberate-practice tactic set that prevents the doom loop.

Two days debugging a bug in code I shipped a month ago. The code itself was not broken. The business logic was wrong, and I could not see it because I had no mental model of what the code was supposed to do — only what it was doing. I wrote it with an AI assistant. I committed it when the tests passed. I never built the theory.

That is comprehension debt, and it does not show up on any tracking board until the bug arrives.

Comprehension debt is distinct from technical debt. Technical debt is a deliberate shortcut you took knowing you would have to fix it later — a TODO with intent. Comprehension debt is code that works perfectly and that you could not change confidently if your job depended on it, because nobody on the team holds the theory the code was supposed to express. Technical debt has a payoff plan. Comprehension debt has compound interest.

This post is about why AI coding assistants make comprehension debt the default failure mode, and the operating discipline that brings it back under control. The tactics are old. The environment that used to enforce them is not.

Code is the shadow, not the program

Peter Naur made this argument in his 1985 essay "Programming as Theory Building": the actual program is the mental model in the programmer's head. The code is a shadow of that theory, useful for execution and useful for reminding the programmer of the theory, but not a substitute for it. Two programmers can hold identical code and different theories of what it does, and one of them will be wrong when the code changes.

AI coding assistants generate the shadow without the theory. The agent builds a temporary understanding to produce the artifact, then discards it the moment the context window rolls or the session ends. When you commit the AI's output, the theory dies with the agent context. The code is now legacy on day one because nobody owns the mental model it shadows.

The human-authored loop runs in both directions: the theory shapes the code, and reading the code reinforces the theory. The AI loop is one-way; the agent's theory is gone the moment the code merges, leaving the artifact without the model that produced it.

This is the core mechanism. Everything that follows — the doom loop, the lost forge, the cognitive overload pattern — is a consequence of producing shadows without keeping the theories that cast them.

AI is not a compiler

The most common counter-argument: AI is just the next layer of abstraction, like Assembly to C to Python. The argument fails on a single distinction.

A traditional compiler is, for practical purposes, deterministic. The same source, compiler version, and flags produce the same output every time. You do not need to read the assembly that gcc emits because its behavior is specified and repeatable. The abstraction is trustworthy precisely because it removes randomness from the translation.

LLM code generation is stochastic. Same prompt, same context, same model, and you can still get a different output. There is no guarantee against race conditions. No guarantee against security holes. No guarantee that the business logic the agent inferred matches the business logic you intended. The output is a probability distribution sampled once.
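The distinction can be made concrete with a toy sketch. Both functions below are hypothetical stand-ins, not a real compiler or model API: `compile_like` is a pure function of its input, while `generate_like` draws one candidate from a weighted list whose entries and probabilities are invented for illustration.

```python
import random


def compile_like(source: str) -> str:
    """Deterministic translation: the output depends only on the input."""
    return source.upper()


# Hypothetical candidate outputs with made-up probabilities.
CANDIDATES = ["retry 3 times", "retry 5 times", "no retry"]
WEIGHTS = [0.6, 0.3, 0.1]


def generate_like(prompt: str, rng: random.Random) -> str:
    """Stochastic generation: one sample from a distribution over outputs."""
    return rng.choices(CANDIDATES, weights=WEIGHTS, k=1)[0]


rng = random.Random()
compiled = {compile_like("add retry logic") for _ in range(100)}
generated = {generate_like("add retry logic", rng) for _ in range(100)}

print(len(compiled))   # 1: same input, same output
print(len(generated))  # typically more than 1: same prompt, varying output
```

The set of generated outputs is where the review burden comes from: any one of them might be the version that gets committed.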

You cannot trust an abstraction layer that gives you probability instead of guarantee. You can collaborate with one. The collaboration requires a theory you maintain yourself, because the layer below you is not stable enough to be a foundation. Treating an LLM as a compiler is the framing error that produces the rest of the failures.

Three questions that prove a theory exists

A useful test for whether you have a theory of a system, AI-generated or not: can you answer these three questions without opening the code?

Where does state live? If two components both believe they own the truth, a bug exists already; it has not been triggered yet. State that lives in two places is state that will diverge under any concurrency, any retry, any cache invalidation. The question forces you to name the canonical store. If you cannot, you have a Ouija board, not a system. (Related: agent memory as an operating boundary.)
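A minimal sketch of the divergence, assuming a hypothetical inventory example (none of these class names come from a real system): one checkout component keeps its own snapshot of the stock counts, the other reads through to the canonical store.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryStore:
    """The one canonical owner of stock counts (hypothetical example)."""
    _stock: dict[str, int] = field(default_factory=dict)

    def set_stock(self, sku: str, count: int) -> None:
        self._stock[sku] = count

    def get_stock(self, sku: str) -> int:
        return self._stock.get(sku, 0)


class CheckoutWithCopy:
    """Anti-pattern: keeps a private snapshot, so it also 'owns' the truth."""

    def __init__(self, store: InventoryStore) -> None:
        self._snapshot = dict(store._stock)  # second copy of the state

    def available(self, sku: str) -> int:
        return self._snapshot.get(sku, 0)


class CheckoutWithReference:
    """Reads through to the canonical store on every call."""

    def __init__(self, store: InventoryStore) -> None:
        self._store = store

    def available(self, sku: str) -> int:
        return self._store.get_stock(sku)


store = InventoryStore()
store.set_stock("widget", 5)
stale = CheckoutWithCopy(store)
fresh = CheckoutWithReference(store)

store.set_stock("widget", 0)  # the canonical value changes

print(stale.available("widget"))  # 5: the copy has diverged
print(fresh.available("widget"))  # 0: single source of truth
```

Nothing fails in the copy-based version until the canonical value changes, which is the "bug that has not been triggered yet" in miniature.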

Where does feedback live? A system that fails silently is indistinguishable from a system that works. Logs, metrics, traces, error handling — these are the surface that lets a theory be tested against reality. The question forces you to name the channel through which the system tells you it is wrong. If there is no channel, you will hear about the failure from a customer, not from the system. (Related: LLM observability with OpenTelemetry GenAI conventions.)
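As a small illustration, the two hypothetical parsers below handle the same bad input. The function names and the "billing" logger name are assumptions for the example; only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger("billing")


def parse_amount_silent(raw: str) -> float:
    """Fails silently: a bad input is indistinguishable from a zero charge."""
    try:
        return float(raw)
    except ValueError:
        return 0.0


def parse_amount_observable(raw: str) -> float:
    """Fails loudly: the error is logged with context, then re-raised."""
    try:
        return float(raw)
    except ValueError:
        logger.error("could not parse amount: %r", raw)
        raise


print(parse_amount_silent("12,50"))  # 0.0, and nothing records that it was wrong
```

The silent version keeps the system "working"; the observable version gives the theory a channel through which reality can contradict it.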

What is the blast radius? If you delete or modify a single component, what cascades? If you cannot trace the consequences in your head, you do not have a theory; you have a guess. The question forces you to map dependencies, ownership, and retry semantics. (Related: coding assistant blast radius.)
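One way to check the answer you gave from memory is to compute it. The sketch below assumes a hand-written dependency map (the component names are hypothetical) and walks it in reverse to list everything that directly or transitively depends on a component.

```python
from collections import deque

# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": ["database"],
    "auth": ["database"],
    "reports": ["database"],
    "database": [],
}


def blast_radius(target: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that directly or transitively depends on target."""
    # Invert the edges: component -> components that depend on it.
    dependents: dict[str, set[str]] = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk over the inverted graph.
    affected: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, set()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(blast_radius("auth", DEPENDS_ON)))  # ['checkout', 'payments']
print(sorted(blast_radius("database", DEPENDS_ON)))
```

If the computed set surprises you, the surprise is the gap in the theory.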

The audit story circulating in this discourse — a 7,000-line monolithic file, paying customers, no logging, no rate limiting — fails all three. The product appears to work. It will stop appearing to work the first time anything goes wrong, because the founder has nothing to debug with except the code itself, and the code is what they did not understand. The customers will be the feedback channel, which is the most expensive feedback channel that exists.

The doom loop

Comprehension debt compounds because the response to a bug in code you do not understand is to ask the AI to fix it. The AI patches the symptom. You commit the patch without grasping the change. You now understand the codebase less than you did before the bug arrived.

Each iteration removes another layer of mental model. By the third or fourth pass, the team has no shared theory of the system at all — only a working artifact and a queue of pending mysteries.

This is architectural amnesia, and it spreads across teams faster than across individuals because nobody on the team can answer the three questions for the parts they did not author. The codebase becomes something the team operates rather than something the team understands. Outputs appear, nobody can explain why, and the only way to change behavior is to ask the spirit world (the AI) to please do something different.

The break-out from the loop is not in the loop. You cannot patch your way back to a theory. You have to stop, read, and rebuild the model — manually, with intent, in a way that does not produce code as the deliverable. The deliverable is the theory.

The forge that is missing

An Anthropic study circulated in this discourse: experienced engineers who used AI coding tools scored measurably lower on comprehension tests afterward, with the largest drop in debugging performance. Debugging is the forge. It is where you are forced to build a theory because the code in front of you is demonstrably not what you thought it was, and you have to find the gap. Skip the forge and the theory does not form.

The market did the rest of the damage. When generative AI hit production maturity, junior engineering roles were cut on the assumption that AI replaced entry-level output. What the cuts actually removed was the apprenticeship pipeline: the years of suffering through poorly designed architecture in production that produced senior engineers in the first place. The pipeline does not recover on the same timeline as the cuts. Hiring trends have started to reverse; the cohort that should have been mid-level by now will not exist for several years.

The industry-level consequence: senior judgment is now scarce, and it is not being manufactured at the pre-AI rate. The individual-level consequence is the more useful one: you cannot rely on the environment to forge the theory for you anymore. Production used to do it whether you wanted it or not. AI removes enough of the wrestle that the theory has to be a thing you choose to build.

Cognitive moves under load

Three traps from a recent framing by Justin Sung apply directly to reviewing AI output under deadline pressure. The trap names are his; the AI-coding application is the synthesis.

Thinking harder when you are stuck. Working memory is the bottleneck, not effort. Staring at a 200-line AI-generated diff and trying to "understand all of it at once" is a cognitive overload pattern, not a discipline. The fix is separation of concerns: review state changes first, then control flow, then error handling, then naming. One axis at a time. The diff that defeats you when read whole becomes tractable when sliced.

Treating overwhelm as a capacity problem. If your PR queue has thirty unreviewed AI-generated changes, the question is not "how do I get through all this." The question is "which one of these has the largest blast radius if it ships unread." Theory of Constraints — find the bottleneck artifact, review that one with full attention, let the rest move with lighter scrutiny. Equal attention is a synonym for inadequate attention.
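The triage can be sketched as a sort. Everything here is an assumption for illustration: the `PendingPR` fields, the example titles, and especially the weights in `risk_score`, which are not calibrated against anything.

```python
from dataclasses import dataclass


@dataclass
class PendingPR:
    title: str
    touches_state: bool
    touches_security: bool
    dependents: int  # how many components rely on the changed code


def risk_score(pr: PendingPR) -> int:
    """Hypothetical scoring: the weights are illustrative, not calibrated."""
    return 5 * pr.touches_state + 5 * pr.touches_security + pr.dependents


queue = [
    PendingPR("Rename helper variables", False, False, 0),
    PendingPR("Change session expiry logic", True, True, 4),
    PendingPR("Add cache to product listing", True, False, 2),
]

# Largest blast radius first; that one gets full attention.
ranked = sorted(queue, key=risk_score, reverse=True)
bottleneck, rest = ranked[0], ranked[1:]

print(bottleneck.title)  # Change session expiry logic
```

The exact weights matter less than the act of ranking: the queue stops being thirty equal items and becomes one item that must be read and a tail that can move with lighter scrutiny.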

Chasing precise confidence in a stochastic system. Demanding 100% line-level certainty before merge is analysis paralysis. Demanding zero review is comprehension debt. The middle move is confidence intervals: high precision for the parts of the diff that touch state, security, or external interfaces; broader margin on the parts that are clearly within an established pattern. Aim for accuracy with a known margin, not absolute certainty.
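A rough sketch of depth allocation per file, assuming the risky areas of a repository can be recognized by path. The patterns and file names are hypothetical and would need to match the repository's real layout.

```python
from fnmatch import fnmatch

# Hypothetical path patterns for code that touches state, security,
# or external interfaces; adjust to the repository's own layout.
DEEP_REVIEW_PATTERNS = ["*/migrations/*", "*/auth/*", "*/api/*", "*/models/*"]


def review_depth(path: str) -> str:
    """Return 'line-by-line' for high-risk paths, 'pattern-check' otherwise."""
    if any(fnmatch(path, pattern) for pattern in DEEP_REVIEW_PATTERNS):
        return "line-by-line"
    return "pattern-check"


changed = ["src/auth/session.py", "src/ui/button_styles.py", "src/models/order.py"]
for path in changed:
    print(path, "->", review_depth(path))
```

Path matching is a crude proxy. A change under a "safe" path can still touch state, so the classification is a starting allocation, not a verdict.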

These map directly onto AI code review: separation of concerns is the diff-reading tactic, triage is the queue discipline, and confidence intervals are the depth allocation per file. Each one fails when you treat AI output as if it were a deterministic compiler whose every line must be either trusted or audited.

The operating discipline

The methods cluster into a workflow. None of them are new. The AI environment makes them non-optional rather than nice-to-have. (Process discipline for the same problem covers the docs-first version of this argument; the methods below are the theory-layer complement.)

Design before prompting. If you cannot draw component boxes, data flows, state locations, and failure surfaces on paper before opening the agent, the AI will fill the conceptual void with whatever shape its training distribution suggests. The drawing is the theory. The prompt is a request for the shadow of the theory. Without the drawing, you are asking the AI to invent both, which is the design path most likely to produce code you cannot debug.

A memory.md at the repo root. Not a README. A running record of the why behind major architectural decisions, security constraints, rejected alternatives, and the conditions under which a decision should be revisited. The file exists so that the next person — including future-you, who has lost the theory — can rebuild the model without re-deriving it from the code. The agent reads it too; that is a side benefit, not the point.
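What an entry might contain can be sketched as a small formatter. The layout is an assumption, not a prescribed format; the fields mirror the ones named above (the why, rejected alternatives, and the conditions for revisiting), and the example decision is invented.

```python
from datetime import date


def format_decision(title: str, why: str, rejected: list[str],
                    revisit_when: str, decided_on: date) -> str:
    """Render one decision record as text to append to memory.md."""
    lines = [
        f"## {decided_on.isoformat()}: {title}",
        f"Why: {why}",
        "Rejected alternatives:",
        *[f"- {alt}" for alt in rejected],
        f"Revisit when: {revisit_when}",
        "",
    ]
    return "\n".join(lines)


# Hypothetical example entry.
entry = format_decision(
    title="Store session tokens server-side",
    why="Tokens must be revocable immediately after a password reset.",
    rejected=["Stateless tokens in browser storage"],
    revisit_when="Session lookups become a measurable latency cost.",
    decided_on=date(2026, 5, 2),
)
print(entry)
```

The tooling is optional; a hand-written entry with the same four fields serves the same purpose.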

Make the AI defend its choices. Before merging an AI-generated PR of any consequence, ask the agent to walk through the logic, the tradeoffs considered, and the alternatives it rejected. The exercise is for you, not the agent. If the explanation reveals a choice you did not realize was made — a library swapped, a retry budget tightened, a session boundary moved — you just found the comprehension gap before it became a bug.

The PR comprehension score. Rate every PR you authored or reviewed from 1 to 5 on how well you understand it. A PR you cannot explain in your own words is a 1. Do not merge a 1. The discipline is rare because the rating is honest only when nobody is watching, and the honest rating is uncomfortable.

| Score | Meaning |
| --- | --- |
| 5 | Could rewrite the change from memory if the branch were lost |
| 4 | Could explain every block to a peer with no notes |
| 3 | Understand intent and structure; some details fuzzy |
| 2 | Understand the intent; structure is opaque |
| 1 | Tests pass and I have no theory of what changed |
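The rubric's one hard rule, do not merge a 1, can be written down as a gate. This is a minimal sketch with a hypothetical function name; the score is still self-reported, so the check is only as honest as the person entering it.

```python
RUBRIC = {
    5: "Could rewrite the change from memory if the branch were lost",
    4: "Could explain every block to a peer with no notes",
    3: "Understand intent and structure; some details fuzzy",
    2: "Understand the intent; structure is opaque",
    1: "Tests pass and I have no theory of what changed",
}


def may_merge(score: int) -> bool:
    """Apply the rule from the rubric: a PR scored 1 must not merge."""
    if score not in RUBRIC:
        raise ValueError(f"score must be one of {sorted(RUBRIC)}, got {score!r}")
    return score > 1


for score in sorted(RUBRIC, reverse=True):
    verdict = "merge allowed" if may_merge(score) else "do not merge"
    print(score, verdict)
```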

Refactor by hand, on a schedule. Pick one file the AI wrote last month and rewrite it yourself, without AI assistance. Most of what you produce will look similar. The artifact is not the deliverable. The theory you rebuild as a side effect is the deliverable. A weekly cadence works for most teams; the cost is one developer-hour, the return is one developer's restored model of the codebase.

The deletion test. For any component you have not touched in two weeks, ask: if I delete this, what breaks, and how badly? If the answer is "I would have to read the code to find out," the theory has decayed and a refactor pass is the cheap version of the audit you will otherwise do during an incident.
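Picking the candidates can be automated even though the test itself cannot. The sketch below assumes you already have a last-touched date per component (for instance from version-control history); the component names and dates are hypothetical.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=14)  # "two weeks" from the text


def due_for_deletion_test(last_touched: dict[str, date], today: date) -> list[str]:
    """List components untouched for at least two weeks, in name order."""
    return sorted(
        name for name, touched in last_touched.items()
        if today - touched >= STALE_AFTER
    )


# Hypothetical last-touched dates per component.
last_touched = {
    "billing": date(2026, 4, 1),
    "search": date(2026, 4, 28),
    "notifications": date(2026, 4, 18),
}

print(due_for_deletion_test(last_touched, today=date(2026, 5, 2)))
# ['billing', 'notifications']
```

The list only tells you where to ask the question. Answering "what breaks, and how badly?" without opening the code is still the part that measures the theory.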

The comprehension buffer. When AI saves you two hours, invest thirty minutes back into review and theory-building. The savings are still real. The accumulated theory is what makes the next two-hour saving safe. Teams that skip the buffer report the highest velocity for about two months, then stall in a way that looks like declining productivity but is actually compounding comprehension debt.
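The arithmetic is simple enough to state directly: thirty minutes back for every two hours saved is a 25% reinvestment rate. A small sketch, with a hypothetical function name:

```python
# Thirty minutes reinvested for every two hours (120 minutes) saved.
BUFFER_RATIO = 30 / 120


def comprehension_buffer_minutes(minutes_saved: float) -> float:
    """Minutes to reinvest in review and theory-building."""
    return minutes_saved * BUFFER_RATIO


print(comprehension_buffer_minutes(120))  # 30.0
print(comprehension_buffer_minutes(480))  # 120.0
```

On a day where AI saves a full eight hours, the buffer is two hours, which still leaves six hours of net saving.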

The tradeoff

The honest tradeoff is between speed of generation and depth of theory. AI gives you speed unconditionally. Theory takes time you could otherwise spend shipping. The crossover is the point where the time you lose debugging without theory exceeds the time you saved generating without theory. That crossover is invisible while it accumulates and obvious when it lands — usually as an incident, occasionally as a two-day debug session for a one-month-old bug.
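The crossover can be expressed as a comparison of two running totals. The weekly figures below are invented purely to show the shape of the curve; they are not measurements, and the function name is hypothetical.

```python
from itertools import accumulate
from typing import Optional, Sequence


def crossover_week(saved_per_week: Sequence[float],
                   lost_per_week: Sequence[float]) -> Optional[int]:
    """First 1-based week where cumulative hours lost exceed cumulative hours saved."""
    totals = zip(accumulate(saved_per_week), accumulate(lost_per_week))
    for week, (saved, lost) in enumerate(totals, start=1):
        if lost > saved:
            return week
    return None  # no crossover within the observed weeks


# Invented numbers: steady generation savings, accelerating debugging losses.
saved = [6, 6, 6, 6, 6, 6]
lost = [0, 1, 2, 6, 12, 20]

print(crossover_week(saved, lost))  # 6
```

In this made-up series the team is ahead for five weeks and behind in the sixth, even though the weekly loss first exceeded the weekly saving a week earlier. That lag is why the crossover stays invisible while it accumulates.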

The tactics above do not eliminate the tradeoff. They move the crossover further out. A team that runs the comprehension discipline can ship faster with AI than a team that does not, because the team without the discipline pays in incidents and unmaintainable code, a cost the disciplined team largely avoids. Speed-without-theory is real, and it is also the thing that produces the codebase nobody wants to touch.

I have made this trade in both directions. The two-day debug session at the top of this post is what the wrong direction costs once. The compounding cost — the codebase nobody can confidently change — is what it costs over time. The first cost is loud. The second is the one that ends careers and companies, and it is silent until it is not.


Code is the shadow of a theory. AI lets you produce shadows without ever building the theories that cast them. The shadow looks like a program until something changes; then it is a problem you cannot debug, because you never had the model the code was supposed to instantiate.

The tactics are old: design before prompting, score your understanding, refactor by hand, write the why down. The environment used to enforce them by making development slow and painful enough that the theory accumulated as a byproduct. The environment no longer does that. You choose the discipline on purpose, or you accept that the speed is a loan you will repay during your next incident.

The developers who get the most out of AI are not the ones generating the most code. They are the ones who still, deliberately, build the theory.
