Outcome focus: Reader can write a decision memo that chooses modular monolith, service extraction, or boundary repair based on team ownership, delivery metrics, scaling pressure, and observable blast radius.
software architecture, microservices, modular monolith, product engineering, engineering leadership
The architecture argument started after the third billing incident.
A six-engineer product team had one deployable application. Claims intake, customer notifications, billing adjustments, reporting, and admin review all lived in the same codebase. The app was not huge. It was not glamorous. It was also becoming hard to change because every feature could reach into every other feature's tables.
Billing wanted to split into a service.
That sounded mature. It also sounded like a way to make the next incident more expensive.
The failure was not "monolith." The failure was weak ownership. Billing had no public module API, claims code queried billing tables directly, reports joined whatever was convenient, and every incident looked like an argument about whether the team had picked the wrong architecture style.
I have seen this argument go sideways when the team treats architecture style as identity. Monolith becomes "legacy." Microservices become "serious engineering." Then nobody measures whether the split would actually improve delivery, recovery, or blast radius.
The useful decision is less romantic:
Should this product pay the overhead of a network boundary now, or should it enforce the module boundary inside one deployable app until the product and team shape prove the split is worth it?
The Plain-English Difference
A modular monolith is one deployable application organized around feature or domain modules. The important part is not that the process is single. The important part is that the code has explicit boundaries.

    app/
      claims/
        public-api.ts
        commands/
        events/
        persistence/
      billing/
        public-api.ts
        commands/
        events/
        persistence/
      notifications/
        public-api.ts
        commands/
        events/
        persistence/

Claims can call billing through billing/public-api.ts. Claims cannot import billing/persistence/invoices.ts. Reporting cannot update claims tables because it wants a shortcut. The database schema still may live in one database, but ownership is clear.
Microservices move those boundaries out of process. Each service has its own deployable unit, runtime, release pipeline, logs, dashboards, data ownership model, and failure behavior. Calls cross a network. State becomes harder to coordinate. On-call becomes more precise and more demanding.
Martin Fowler's Monolith First captures the warning I would put near the top of every decision memo: successful microservice stories often start with a monolith that became too large and was decomposed, while from-scratch microservice systems can get into trouble because early boundaries are unstable. Fowler and James Lewis's Microservices article describes the useful side too: independently deployable services, business-capability boundaries, and decentralized data management.
Both ideas can be true.
A monolith can be clean or rotten. A service architecture can be independent or just distributed coupling with invoices.
The Tradeoff to Make Visible
The wrong framing is:

    monolith = simple but bad later
    microservices = complex but good later

The better framing is:
| Choice | You buy | You pay |
|---|---|---|
| Modular monolith | Fast local development, simple deploys, easy refactors, one operational surface | Strong discipline required to preserve boundaries |
| Microservices | Independent deployability, isolated scaling, clearer runtime ownership, smaller failure domains | Network failures, distributed tracing, schema/version drift, CI/CD fan-out, more on-call surface |
A modular monolith is usually the right default while product discovery is still under way. The team is still learning what the product is, which workflows matter, which fields survive contact with users, which modules are stable, and which teams will own which capabilities.
Microservices become attractive when the product and organization have already created pressure that one deployable unit cannot absorb cleanly: separate teams need independent release lanes, one capability has a distinct scaling profile, compliance requires stronger isolation, or incidents need a smaller failure domain than the monolith can provide.
The cost is not only infrastructure.
Microservices charge interest through every workflow:
- local setup,
- contracts,
- test environments,
- deployment sequencing,
- tracing,
- alert ownership,
- schema evolution,
- incident coordination,
- backward compatibility,
- data backfills,
- and migration code that should eventually be removed but somehow makes itself comfortable.
That overhead can be the right price. It should not be an accidental subscription.
The Decision Rule I Would Start With
I would choose a modular monolith first when all or most of these are true:
- The team is fewer than about eight engineers.
- Requirements are changing weekly.
- The product has not stabilized around clear domain ownership.
- One team owns build, deploy, support, and incident response.
- The main bottleneck is lead time, not independent scaling.
- Compliance boundaries can be handled through code, database permissions, logging, and review controls.
- Most incidents are caused by unclear code/data ownership, not by runtime coupling.
I would consider microservices sooner when all or most of these are true:
- Multiple teams already own distinct business capabilities end to end.
- Those teams regularly block each other because they share one deploy/release surface.
- One capability needs a different scaling profile, runtime, data store, or deployment cadence.
- Compliance or customer isolation requires a harder boundary around PHI, PII, payment data, or tenant data.
- The failure mode of one capability should not take down the rest of the product.
- The team already has production-grade CI/CD, tracing, alerting, secrets, infrastructure automation, and on-call practice.
The last bullet matters. A team that cannot reliably deploy one app will not become more mature by deploying nine.
A Decision Tree That Does Not Lie to You
The box most teams skip is "expected metric improvement named."
If nobody can say which metric should improve, the service split is probably architecture theater. The team may still need better boundaries, but a network boundary is a costly way to discover that billing needed an owner and a public API.
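One way to keep the metric question from being skipped is to encode the decision as a function that will not recommend extraction until a metric is named. The sketch below is an illustration only: `BoundaryInputs`, its field names, and `recommend` are hypothetical, and the logic is one reading of the criteria in this article rather than a reconstruction of any particular diagram.

```typescript
// Hypothetical inputs; each field mirrors a criterion discussed in this article.
interface BoundaryInputs {
  hasEnforcedModuleApi: boolean;       // e.g. billing/public-api is the only way in
  hasEndToEndOwner: boolean;           // one team builds, deploys, observes, restores
  hasDistinctRuntimePressure: boolean; // scaling, compliance, data, or failure profile
  platformIsProductionGrade: boolean;  // CI/CD, tracing, alerting, on-call practice
  expectedMetric?: string;             // e.g. "blast-radius index"
}

type Recommendation = "boundary-repair" | "modular-monolith" | "service-extraction";

function recommend(input: BoundaryInputs): Recommendation {
  // No enforced module API yet: repair ownership before debating deployables.
  if (!input.hasEnforcedModuleApi) return "boundary-repair";

  const splitJustified =
    input.hasEndToEndOwner &&
    input.hasDistinctRuntimePressure &&
    input.platformIsProductionGrade;

  // A split without a named metric falls back to the monolith.
  const metricNamed = (input.expectedMetric ?? "").trim().length > 0;

  return splitJustified && metricNamed ? "service-extraction" : "modular-monolith";
}

// The billing situation from the opening: no public API, no separate owner.
const billingToday = recommend({
  hasEnforcedModuleApi: false,
  hasEndToEndOwner: false,
  hasDistinctRuntimePressure: false,
  platformIsProductionGrade: false,
});
console.log(billingToday); // boundary-repair
```

The ordering is the design choice worth arguing about: boundary repair is checked first, so a team cannot reach the extraction branch while the module still has no enforced API.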
The Metrics
DORA's current guidance describes software delivery performance with five metrics: change lead time, deployment frequency, failed deployment recovery time, change fail rate, and deployment rework rate. The older "four keys" framing is still the version many teams remember from Accelerate: lead time for changes, deployment frequency, MTTR, and change failure rate. The DORA guide is useful here because it warns against treating one metric as the whole story and stresses measuring at the application or service level.
For the monolith-versus-microservices decision, I would track those delivery metrics plus two architecture-specific checks.
| Metric | What it tells you | Warning sign |
|---|---|---|
| Lead time for changes | How long code takes to reach production | Services increased review, deploy, or environment wait time |
| Deployment frequency | Whether teams can ship smaller changes | The split created more ceremony but fewer releases |
| Failed deployment recovery time / MTTR | How quickly the team recovers | Incidents now require cross-service investigation before rollback |
| Change failure rate | How often deploys cause immediate remediation | Service contracts break more often than module calls did |
| Deployment rework rate | How often unplanned deploys follow incidents | The team is shipping fixes for coordination mistakes |
| Blast-radius index | Average downstream components affected per deploy or incident | One deploy still wakes every team |
| Per-service MTTR | Recovery time by service | The new service has no owner who can restore it fast enough |
The extra two need local definitions. Keep them simple at first.

    blast_radius_index =
      count(unique downstream apps, services, queues, jobs, or teams touched by incidents)
      / count(production incidents or risky deploys)

    per_service_mttr =
      minutes from alert acknowledgement to customer-impacting recovery,
      grouped by owning module or service

Do not use these numbers to punish teams. Use them to see whether an architecture change moved the system in the direction the memo promised.
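As a rough sketch of how the two local definitions could be computed from an incident log, assuming a hypothetical `Incident` record shape (the field names are mine, not from any tracking tool). It reads the blast-radius index as the average number of unique downstream components per incident, which is one reasonable interpretation of the definition above.

```typescript
// Hypothetical incident record; adapt the fields to whatever the team already logs.
interface Incident {
  owner: string;          // owning module or service
  touched: string[];      // downstream apps, services, queues, jobs, or teams affected
  acknowledgedAt: string; // ISO 8601 timestamp of alert acknowledgement
  recoveredAt: string;    // ISO 8601 timestamp of customer-impacting recovery
}

// Average count of unique downstream components per incident.
function blastRadiusIndex(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalTouched = incidents.reduce((sum, incident) => {
    const unique = incident.touched.filter(
      (name, index) => incident.touched.indexOf(name) === index,
    );
    return sum + unique.length;
  }, 0);
  return totalTouched / incidents.length;
}

// Mean minutes from acknowledgement to recovery, grouped by owner.
function perOwnerMttrMinutes(incidents: Incident[]): Record<string, number> {
  const totals: Record<string, { minutes: number; count: number }> = {};
  for (const incident of incidents) {
    const minutes =
      (Date.parse(incident.recoveredAt) - Date.parse(incident.acknowledgedAt)) / 60000;
    const bucket = totals[incident.owner] ?? { minutes: 0, count: 0 };
    totals[incident.owner] = {
      minutes: bucket.minutes + minutes,
      count: bucket.count + 1,
    };
  }
  const result: Record<string, number> = {};
  for (const owner of Object.keys(totals)) {
    result[owner] = totals[owner].minutes / totals[owner].count;
  }
  return result;
}
```

Both functions take the same input on purpose, so a before/after comparison only needs two slices of the same log.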
If service extraction does not improve deployment frequency, recovery time, or blast radius after a few sprints, stop splitting. Fix the actual bottleneck.
How Monoliths Rot
Monoliths usually rot quietly.
The first shortcut looks harmless. A reporting job needs billing status, so it reads the billing table directly. A claims workflow needs to decide whether a customer is eligible for a payment, so it imports a helper from billing/internal. A migration adds a nullable column because nobody wants to coordinate the real contract. A test fixture creates objects across four modules because the app has no public construction paths.
None of those choices is dramatic. Together, they erase the boundary.
The monolith still deploys as one app, but now every change feels global. Engineers become afraid of touching "simple" fields because the system has no honest ownership. Product managers hear "billing is risky" and think billing needs to be extracted. Sometimes it does. More often, billing needs to become a real module first.
A modular monolith has to be designed as if extraction could happen later, even if it never does.
That means package by feature:

    src/
      orders/
      billing/
      claims/
      notifications/
      reporting/

Not by technical layer:

    src/
      controllers/
      services/
      models/
      repositories/

Layer folders encourage every feature to spread across every layer. Feature folders make ownership visible. The code should answer "who owns this behavior?" before it answers "is this a controller?"
The Boundary Contract
The boundary should be boring enough to review in a pull request.

    // billing/public-api.ts
    export interface BillingAccountSnapshot {
      accountId: string;
      paymentStatus: "current" | "past_due" | "blocked";
      creditLimitCents: number;
      updatedAt: string;
    }

    export interface BillingPort {
      getAccountSnapshot(accountId: string): Promise<BillingAccountSnapshot>;
      requestAdjustment(input: {
        accountId: string;
        claimId: string;
        amountCents: number;
        reasonCode: "loss_review" | "manual_correction";
        requestedBy: string;
      }): Promise<{ adjustmentId: string; status: "queued" | "rejected" }>;
    }

Claims code can depend on BillingPort. It cannot import billing repositories, billing database models, or billing migration helpers.
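If this later becomes a service, the caller contract survives:

    // billing/http-client.ts
    import type { BillingAccountSnapshot, BillingPort } from "./public-api";
    // Assumption: getRequestId is a request-context helper; this import path is illustrative.
    import { getRequestId } from "../shared/observability";

    type AdjustmentInput = Parameters<BillingPort["requestAdjustment"]>[0];
    type AdjustmentResult = Awaited<ReturnType<BillingPort["requestAdjustment"]>>;

    export class BillingHttpClient implements BillingPort {
      // Server-side fetch needs an absolute URL, so the service origin is injected.
      constructor(private readonly baseUrl: string) {}

      async getAccountSnapshot(accountId: string): Promise<BillingAccountSnapshot> {
        const url = `${this.baseUrl}/internal/billing/accounts/${encodeURIComponent(accountId)}`;
        const response = await fetch(url, {
          headers: { "x-request-id": getRequestId() },
        });
        if (!response.ok) {
          throw new Error(`billing_snapshot_failed:${response.status}`);
        }
        return (await response.json()) as BillingAccountSnapshot;
      }

      async requestAdjustment(input: AdjustmentInput): Promise<AdjustmentResult> {
        const response = await fetch(`${this.baseUrl}/internal/billing/adjustments`, {
          method: "POST",
          headers: {
            "content-type": "application/json",
            "x-request-id": getRequestId(),
          },
          body: JSON.stringify(input),
        });
        if (!response.ok) {
          throw new Error(`billing_adjustment_failed:${response.status}`);
        }
        return (await response.json()) as AdjustmentResult;
      }
    }

The adapter changed. The caller did not.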
That is the practical value of a modular monolith. It lets the team learn boundaries in-process before it pays for out-of-process failure modes.
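To see the same idea from the in-process side, here is a minimal sketch of an adapter and a claims-side caller. It trims the port to the read method to stay short. `BillingReadPort`, `InProcessBilling`, `canPayClaim`, and the eligibility rule inside it are all hypothetical names and logic chosen for illustration.

```typescript
interface BillingAccountSnapshot {
  accountId: string;
  paymentStatus: "current" | "past_due" | "blocked";
  creditLimitCents: number;
  updatedAt: string;
}

// Read-only slice of the billing port, trimmed for this example.
interface BillingReadPort {
  getAccountSnapshot(accountId: string): Promise<BillingAccountSnapshot>;
}

// Hypothetical in-process adapter backed by an in-memory record.
class InProcessBilling implements BillingReadPort {
  private accounts: Record<string, BillingAccountSnapshot>;

  constructor(accounts: Record<string, BillingAccountSnapshot>) {
    this.accounts = accounts;
  }

  async getAccountSnapshot(accountId: string): Promise<BillingAccountSnapshot> {
    const snapshot = this.accounts[accountId];
    if (!snapshot) throw new Error(`billing_snapshot_not_found:${accountId}`);
    return snapshot;
  }
}

// Claims-side caller: it sees only the port, never billing internals.
// The eligibility rule is made up for the example.
async function canPayClaim(
  billing: BillingReadPort,
  accountId: string,
  amountCents: number,
): Promise<boolean> {
  const snapshot = await billing.getAccountSnapshot(accountId);
  return snapshot.paymentStatus === "current" && amountCents <= snapshot.creditLimitCents;
}
```

Because `canPayClaim` accepts any `BillingReadPort`, swapping the in-process adapter for an HTTP one later is a wiring change, not a change to claims code.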
Enforce the Boundary
Discipline is not a plan. Add enforcement.
For TypeScript, the first pass can be eslint-plugin-boundaries, dependency-cruiser, Nx module boundaries, or a small custom lint rule. For Java, ArchUnit can enforce package rules. For .NET, NetArchTest can do the same kind of work. The tool matters less than the rule:

    claims may import:
      claims/**
      billing/public-api
      shared/observability
      shared/types

    claims may not import:
      billing/persistence/**
      billing/internal/**
      billing/migrations/**

The database needs a similar rule. One module owns a table. Other modules ask through the owning module's API, subscribe to its events, or read from a published projection.
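The import allow/deny list for claims can be expressed in a few lines. This is a stand-in to show the shape of the rule, not the configuration format of any of the tools named above. `ImportRule`, `claimsRule`, and `isImportAllowed` are hypothetical, and matching is by simple path prefix rather than real glob semantics.

```typescript
// Hypothetical rule shape: path prefixes, with deny taking precedence.
interface ImportRule {
  allow: string[];
  deny: string[];
}

const claimsRule: ImportRule = {
  allow: ["claims/", "billing/public-api", "shared/observability", "shared/types"],
  deny: ["billing/persistence/", "billing/internal/", "billing/migrations/"],
};

function isImportAllowed(rule: ImportRule, importPath: string): boolean {
  const matches = (prefix: string): boolean => importPath.indexOf(prefix) === 0;
  // Deny wins over allow; anything not explicitly allowed is rejected.
  if (rule.deny.some(matches)) return false;
  return rule.allow.some(matches);
}

console.log(isImportAllowed(claimsRule, "billing/public-api"));           // true
console.log(isImportAllowed(claimsRule, "billing/persistence/invoices")); // false
```

Defaulting to "not allowed" is deliberate: a new module or folder stays off limits until someone adds it to the allow list in a reviewed change.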
If reporting needs a cross-domain view, create a reporting projection intentionally:

    billing owns:
      billing_accounts
      invoices
      payment_adjustments

    claims owns:
      claims
      claim_photos
      claim_decisions

    reporting may read:
      reporting_claim_financials_view
      reporting_daily_adjustments_view

That projection is allowed to exist because it is explicitly a read model. It is not a secret backdoor for claims to update invoices.
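The ownership list can also be checked mechanically, for example in a test that scans query code or in a migration review script. A minimal sketch, assuming the table and view names above; `tableOwners`, `readProjections`, and `canAccess` are hypothetical names.

```typescript
// One owner per table; owners may read and write their own tables.
const tableOwners: Record<string, string> = {
  billing_accounts: "billing",
  invoices: "billing",
  payment_adjustments: "billing",
  claims: "claims",
  claim_photos: "claims",
  claim_decisions: "claims",
};

// Published read models, listed per consuming module. Read-only by definition.
const readProjections: Record<string, string[]> = {
  reporting: ["reporting_claim_financials_view", "reporting_daily_adjustments_view"],
};

type Access = "read" | "write";

function canAccess(moduleName: string, table: string, access: Access): boolean {
  if (tableOwners[table] === moduleName) return true;
  if (access === "write") return false; // only owners write
  return (readProjections[moduleName] ?? []).indexOf(table) !== -1;
}

console.log(canAccess("claims", "invoices", "write"));                           // false
console.log(canAccess("reporting", "reporting_daily_adjustments_view", "read")); // true
```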
Observability Before Extraction
A monolith can hide bad observability because stack traces are local and database joins are easy. Then the team extracts a service and discovers it cannot answer basic incident questions:
- Which request started this operation?
- Which module emitted this event?
- Which customer or tenant was affected?
- Which deploy introduced the failure?
- Which downstream dependency timed out?
- Which module owns the alert?
Add those answers before extraction.
At minimum:

    request_id
    trace_id
    module_name
    operation_name
    domain_event_name
    tenant_id_hash
    release_sha
    owner_team

Those fields carry over when a module becomes a service. Without them, service extraction turns a local debugging problem into a distributed guessing problem.
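One low-cost way to make those fields non-optional is to put them in a type that every log call must supply. The sketch below assumes JSON-lines logging; `BoundaryLogContext` and `formatLogLine` are hypothetical names, and the sample values are invented.

```typescript
// The minimum field set, as a type the compiler can enforce.
interface BoundaryLogContext {
  request_id: string;
  trace_id: string;
  module_name: string;
  operation_name: string;
  domain_event_name?: string; // only present when the operation emits an event
  tenant_id_hash: string;     // a hash, never the raw tenant id
  release_sha: string;
  owner_team: string;
}

// Returns one JSON line; the caller decides where it is written.
function formatLogLine(
  ctx: BoundaryLogContext,
  level: "info" | "error",
  message: string,
): string {
  return JSON.stringify({ level, message, ...ctx });
}

console.log(
  formatLogLine(
    {
      request_id: "req-123",
      trace_id: "trace-456",
      module_name: "billing",
      operation_name: "requestAdjustment",
      tenant_id_hash: "t-9f2c",
      release_sha: "abc1234",
      owner_team: "billing-claims-pilot",
    },
    "error",
    "adjustment_rejected",
  ),
);
```

Because the context travels as a plain object, the same type can be reused unchanged when a module moves behind a network boundary.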
When to Split
I would not split a module because the folder is large.
I would split when the candidate has all three of these:
- Stable, narrow API.
- Distinct scaling, compliance, data, or failure profile.
- Clear owning team that can build, deploy, observe, and restore it.
One out of three is a refactoring prompt. Two out of three is a serious conversation. Three out of three is a service candidate.
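The three-criteria rule is simple enough to write down as a function, which can be handy in a review script or an ADR template. `SplitCandidate`, `SplitSignal`, and `splitSignal` are hypothetical names. The zero-criteria label is my assumption, since the rule above does not name that case.

```typescript
interface SplitCandidate {
  stableNarrowApi: boolean;
  distinctProfile: boolean; // scaling, compliance, data, or failure profile
  clearOwningTeam: boolean; // can build, deploy, observe, and restore it
}

type SplitSignal =
  | "leave-in-place" // assumption: zero criteria met is not covered by the rule
  | "refactoring-prompt"
  | "serious-conversation"
  | "service-candidate";

function splitSignal(candidate: SplitCandidate): SplitSignal {
  const met = [
    candidate.stableNarrowApi,
    candidate.distinctProfile,
    candidate.clearOwningTeam,
  ].filter(Boolean).length;

  if (met === 3) return "service-candidate";
  if (met === 2) return "serious-conversation";
  if (met === 1) return "refactoring-prompt";
  return "leave-in-place";
}
```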
Good candidates often look like:
- document rendering that has heavy CPU and queue-based throughput,
- image/video processing that needs separate workers and retry behavior,
- payment handling that needs stronger compliance controls,
- notification delivery that can degrade without blocking the core product,
- search indexing that can lag behind source-of-truth writes,
- machine-learning inference that has separate latency, cost, and model-release concerns.
Bad candidates often look like:
- a module with many callers and no stable contract,
- business logic still changing every sprint,
- code extracted because one team dislikes the monolith,
- shared tables nobody is willing to decompose,
- a service whose owning team will not carry production on-call.
A Safer Migration Path
Sam Newman's Strangler Fig Pattern and Fowler's Strangler Fig Application are useful because they push teams away from big-bang replacement. Move behavior gradually. Keep old and new paths observable. Prove value before expanding the migration.
For a module extraction, I would usually start with reads.

    phase 1: make module ownership explicit inside the monolith
    phase 2: publish read projection or event stream
    phase 3: move read endpoint behind service boundary
    phase 4: shadow traffic and compare responses
    phase 5: move write path for one operation
    phase 6: migrate remaining writes and retire transitional code

The monolith remains useful during the migration. It can act as the integration test harness while the new service proves behavior. The old path and new path should run side by side for enough traffic to expose drift.
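Phase 4 is the step teams most often hand-wave, so here is a minimal sketch of a shadow-read comparison. The old path stays authoritative, and the new path is called only to detect drift. `shadowRead`, `ShadowResult`, and the drift labels are hypothetical. A production version would more likely run the shadow call off the request path and sample traffic.

```typescript
interface ShadowResult<T> {
  value: T;       // always the old path's answer
  drift: boolean; // true when the new path disagreed or failed
}

async function shadowRead<T>(
  primary: () => Promise<T>, // old path inside the monolith
  shadow: () => Promise<T>,  // new path behind the service boundary
  onDrift: (detail: string) => void,
): Promise<ShadowResult<T>> {
  // Errors from the old path propagate unchanged; callers see no new behavior.
  const value = await primary();
  try {
    const candidate = await shadow();
    // Naive structural comparison; JSON.stringify is sensitive to key order.
    if (JSON.stringify(candidate) !== JSON.stringify(value)) {
      onDrift("response_mismatch");
      return { value, drift: true };
    }
    return { value, drift: false };
  } catch (error) {
    // A failing shadow must never fail the request.
    onDrift("shadow_failed");
    return { value, drift: true };
  }
}
```

The drift callback is the place to emit a counter or log line, so the comparison results can feed the acceptance review instead of living in someone's terminal.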
The acceptance test is not "service deployed."
The acceptance test is:

    For the extracted capability:
    - deployment frequency did not fall,
    - failed deployment recovery time improved or stayed acceptable,
    - change failure rate did not rise,
    - blast-radius index dropped,
    - the owning team can diagnose and restore without borrowing the monolith team.

If those are not true, the architecture is not done. It only moved.
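The acceptance list translates directly into a before/after comparison. The sketch below returns the checks that failed, so an empty list means the extraction passed. `CapabilityMetrics`, `failedAcceptanceChecks`, and the `acceptableRecoveryMinutes` threshold are hypothetical, and the sample numbers are invented.

```typescript
interface CapabilityMetrics {
  deploysPerWeek: number;
  recoveryMinutes: number;     // failed deployment recovery time
  changeFailureRate: number;   // fraction between 0 and 1
  blastRadiusIndex: number;
  ownerRestoresAlone: boolean; // no help needed from the monolith team
}

function failedAcceptanceChecks(
  before: CapabilityMetrics,
  after: CapabilityMetrics,
  acceptableRecoveryMinutes: number,
): string[] {
  const failures: string[] = [];
  if (after.deploysPerWeek < before.deploysPerWeek) {
    failures.push("deployment frequency fell");
  }
  const recoveryOk =
    after.recoveryMinutes < before.recoveryMinutes ||
    after.recoveryMinutes <= acceptableRecoveryMinutes;
  if (!recoveryOk) failures.push("recovery time neither improved nor acceptable");
  if (after.changeFailureRate > before.changeFailureRate) {
    failures.push("change failure rate rose");
  }
  if (after.blastRadiusIndex >= before.blastRadiusIndex) {
    failures.push("blast-radius index did not drop");
  }
  if (!after.ownerRestoresAlone) {
    failures.push("owning team cannot restore alone");
  }
  return failures;
}

const beforeSplit: CapabilityMetrics = {
  deploysPerWeek: 5, recoveryMinutes: 60, changeFailureRate: 0.1,
  blastRadiusIndex: 3, ownerRestoresAlone: false,
};
const afterSplit: CapabilityMetrics = {
  deploysPerWeek: 5, recoveryMinutes: 45, changeFailureRate: 0.1,
  blastRadiusIndex: 1.5, ownerRestoresAlone: true,
};
console.log(failedAcceptanceChecks(beforeSplit, afterSplit, 60)); // []
```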
The Monthly Fitness Check
Run this monthly while the product is changing.
| Question | Healthy signal | If the answer is no twice |
|---|---|---|
| Can the team ship daily without coordination meetings? | Changes flow through one review/deploy path | Reduce batch size or repair ownership |
| Do most changes touch one module? | Domain boundaries match product work | Move code/data to the owning module |
| Are incidents contained to one module or service? | Alerts name the owner and affected boundary | Add isolation, bulkheads, or service extraction |
| Can a new engineer find the public API for a module? | The facade is obvious and tested | Create/rename the public boundary |
| Are cross-module imports blocked by tooling? | CI fails on boundary violations | Add lint or architecture tests |
| Does each table have one owner? | Write paths are clear | Stop shared writes before adding services |
| Are dashboards owned by teams? | Alerts page the team that can fix the issue | Move alert ownership or service ownership |
| Did the last extraction improve a metric? | One expected metric moved | Stop extracting and fix the bottleneck |
The cadence matters because architecture decay is incremental. By the time everyone agrees the system is tangled, the cheap fixes are usually gone.
A Decision Memo Template
This is the artifact I would put in the repo as docs/adr/0007-boundary-billing.md.

    # Decision: Billing Boundary

    Date: 2026-05-16
    Status: Proposed
    Owner: Billing/Claims pilot team

    ## Product Context

    Billing adjustments are now involved in 38% of claims-review changes. The team
    has had three production incidents in six weeks where claims, billing, and
    reporting code touched the same invoice state.

    ## Current Boundary

    - Deployable unit: monolith
    - Module owner: unclear
    - Tables owned by billing: invoices, payment_adjustments
    - Known violations:
      - claims imports billing/internal/eligibility.ts
      - reporting reads invoices directly
      - admin updates payment_adjustments without billing API

    ## Options

    1. Repair billing as a module inside the monolith.
    2. Extract billing read APIs first, writes later.
    3. Extract full billing service now.

    ## Decision

    Choose option 1 for the next two sprints.

    ## Why

    - Product rules are still changing weekly.
    - One six-engineer team owns the full product.
    - The incidents were caused by ownership violations, not independent scaling.
    - We do not have enough service-level observability to split safely.

    ## Boundary Rules

    - All billing calls go through billing/public-api.
    - Only billing writes invoices and payment_adjustments.
    - Reporting reads billing-owned projections only.
    - CI blocks imports from billing/internal and billing/persistence.

    ## Metrics We Expect to Move

    - Reduce billing-related cross-module PR touches from 4 modules/change to <= 2.
    - Reduce billing incident blast-radius index from 3 affected modules to <= 1.5.
    - Keep deployment frequency at daily or better.

    ## Revisit Trigger

    Reconsider service extraction when billing has:
    - stable public API for two consecutive sprints,
    - separate on-call owner,
    - distinct scaling or compliance pressure,
    - passing shadow-read comparison for extracted read side.

The exact numbers are illustrative. The structure is not.
The memo forces the team to name the current pain, the tradeoff, the decision, the boundary rules, the metric expected to move, and the revisit trigger. Without those, "microservices" is too easy to use as a mood.
The Checklist I Would Drop Into a Repo

    # Architecture Boundary Checklist

    ## Choose modular monolith when

    - [ ] Team is < 8 engineers.
    - [ ] Requirements are changing weekly.
    - [ ] One team owns deployment and on-call.
    - [ ] Scaling pressure is not isolated to one capability.
    - [ ] Incidents are caused by unclear ownership more than runtime isolation.
    - [ ] Module APIs and table ownership can be enforced in CI.

    ## Consider service extraction when

    - [ ] A stable, narrow API exists.
    - [ ] One team owns the capability end to end.
    - [ ] The capability has distinct scaling, compliance, data, or failure behavior.
    - [ ] Observability is already good enough to debug without the monolith.
    - [ ] The expected metric improvement is written down.
    - [ ] The rollback and coexistence plan is written down.

    ## Monthly fitness check

    - [ ] Most changes touch one module/service.
    - [ ] The team can ship daily without coordination overhead.
    - [ ] Incidents name one owning module/service.
    - [ ] Dashboards and alerts map to owning teams.
    - [ ] Boundary violations fail CI.
    - [ ] Every service extraction has a before/after metric review.

Use it as a forcing function, not a ritual. If the team checks "service extraction" but cannot name the metric that should move, it is not ready.
A Practical Ending
Start with a modular monolith when the product is still learning and the team is small enough to coordinate through one deployable system. Make it a real modular monolith: package by feature, enforce imports, own tables, use public facades, publish events intentionally, and add observability before the split.
Move to microservices when the product and organization have earned the overhead: stable contracts, separate teams, distinct runtime pressure, and a metric that the split should improve.
Do not split because the diagram looks embarrassing.
Split when the boundary is already real, the owner is already real, and the current architecture is measurably making delivery, recovery, or blast radius worse.