System Design Papers: A Reading Map from GFS to AI Infrastructure

A de-duplicated taxonomy for system design papers: what to read first, what each paper teaches, and how to move from classic distributed systems into modern databases, observability, serverless, and AI infrastructure.

By Jovani Pink | May 15, 2026 | 14 min | Systems & Complexity Notes

Outcome focus: Reader can turn scattered system-design paper lists into a practical reading path, identify duplicates and mixed source types, and choose papers by the design question they answer instead of by prestige.

The worst way to read system design papers is alphabetically.

The second-worst way is to treat every famous paper like a trophy. GFS, MapReduce, Dynamo, Bigtable, Chubby, Paxos, Raft, Spanner, Kafka, Cassandra, Borg, Dapper, ZooKeeper, and Tail at Scale are all worth knowing. But reading them as a flat list creates a weird kind of literacy: you can name the papers and still freeze when an interviewer asks where the state lives, what happens during a partition, why the tail latency explodes, or which guarantee the design is buying with coordination.

I have made that mistake. I read systems papers by company cluster first: Google papers, Amazon papers, Facebook papers, Berkeley papers. It felt coherent because the brands were coherent. It was not coherent enough. GFS and Dapper are both Google papers, but one teaches storage layout and master metadata; the other teaches distributed tracing. Dynamo and DynamoDB share lineage, but the first teaches availability-first key-value design under partition, while the later DynamoDB paper teaches managed-service operations and predictable performance.

The right unit is not the company. The right unit is the design question.

The pasted "40 system design papers" list is a useful seed list, but it has three problems:

  • It repeats several entries: GFS, MapReduce, Dynamo, Chubby, Cassandra, and Raft show up more than once.
  • It mixes paper types: research papers, vendor white papers, docs pages, books, and surveys.
  • It does not tell you what to do after the classics.

This post is the reading map I wish those lists used. It keeps the classics, removes the duplicates, and then adds a link taxonomy for what to read next: theory, replication, databases, storage engines, stream processing, scheduling, observability, serverless, and AI infrastructure.

The artifact is simple: read by failure mode.

The De-Dupe Pass

Start by cleaning the seed list.

Seed-list item | Keep? | Why
Google File System | yes | It teaches large-file distributed storage, master metadata, chunkservers, and design for commodity failure.
MapReduce | yes | It teaches batch computation as a programming model, not just a product.
Dynamo | yes | It teaches availability-first design, sloppy quorum, hinted handoff, vector clocks, and conflict handling.
Bigtable | yes | It teaches sparse sorted maps, tablets, compaction, and structured data on top of distributed storage.
Chubby | yes | It teaches coordination as a service, coarse locks, and why clients want a small consistent core.
Paxos and Raft | keep both, but do not read three Raft links | Paxos teaches the hard consensus model; Raft teaches how API and explanation shape adoption.
Spanner | yes | It teaches global transactions, TrueTime, externally consistent reads, and the cost of clocks.
LSM-tree | yes | It teaches write-optimized storage and compaction tradeoffs.
Kafka | yes | It teaches logs, partitions, consumer groups, and durable messaging as a data backbone.
Cassandra | yes, once | It combines Dynamo-style availability with Bigtable-style data modeling.
CAP theorem | yes, but read carefully | It is not a design menu. It is a constraint under partition.
Kubernetes docs or books | useful, but not a paper | Treat as operational context, not as a research paper replacement.
Consul white paper | useful, but vendor-shaped | Read after Chubby/ZooKeeper so you can separate principle from product packaging.

That pass turns a trivia list into a curriculum. The first question is no longer "have I read all 40?" The first question is "which design problem am I trying to understand?"

The First 12 Papers

If someone wants the smallest high-ROI set, I would start here.

Order | Paper | Design question it teaches
1 | Time, Clocks, and the Ordering of Events in a Distributed System | What does "before" mean when machines disagree?
2 | The Google File System | How do you build large storage when failure is normal?
3 | MapReduce | How do you make distributed batch computation feel like a simple function?
4 | Dynamo | What do you sacrifice to stay available during failure?
5 | Bigtable | How do you model structured data at massive scale?
6 | Chubby | Why does a distributed system still need a small coordination core?
7 | Paxos Made Simple | Why is agreement hard?
8 | In Search of an Understandable Consensus Algorithm | How does explanation change implementability?
9 | Spanner | What does a global database buy from time?
10 | The Tail at Scale | Why does the 99th percentile become the product?
11 | Dapper | How do you debug a request that crosses dozens of services?
12 | ZooKeeper | What primitives should coordination expose?

This set is not complete. It is balanced. It gives you time, storage, computation, availability, structured storage, coordination, consensus, global transactions, tail latency, tracing, and operational coordination.

That is enough vocabulary to stop reading papers as trivia and start reading them as design moves.

The Link Taxonomy

The map below adds forty more papers and white papers without repeating the obvious duplicates. I would not read all forty in one sprint. I would pick the lane that matches the design pressure in front of me.

1. Theory, Time, and Correctness

These papers teach the mental model underneath almost every distributed system bug. Read them when a design depends on ordering, agreement, "latest," or "exactly once."

Paper | What it unlocks
Impossibility of Distributed Consensus with One Faulty Process | Why asynchronous consensus cannot be solved deterministically with even one crash fault.
Linearizability | A precise way to ask whether a distributed object behaves like one correct object.
Distributed Snapshots | How to reason about a consistent global state without stopping the world.
Unreliable Failure Detectors for Reliable Distributed Systems | Why "is this node dead?" is a guess, and how systems still make progress.
Conflict-free Replicated Data Types | How to design data structures that converge without central coordination.

The interview mistake is to recite CAP and stop. The useful move is to say which correctness condition the design needs. A shopping cart, a bank transfer, and a collaborative note editor do not need the same guarantee.
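
To make "converge without central coordination" concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. The class name `GCounter` and the replica ids are my own illustration, not code from the CRDT paper. The point is the merge rule: element-wise max, which gives the same answer no matter how often or in what order replicas exchange state.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""
    replica_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever bumps its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge however merges are ordered or repeated.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
b.merge(a)  # a repeated merge changes nothing
print(a.value(), b.value())  # 5 5
```

A counter like this fits a "likes" tally. It does not fit a bank balance, which needs decrements and an invariant that no merge rule can enforce on its own.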

2. Consensus, Replication, and Coordination

Read these after Paxos, Raft, Chubby, and ZooKeeper. They deepen the replication story.

Paper | What it unlocks
Viewstamped Replication Revisited | Another path through state-machine replication, useful for comparing with Paxos and Raft.
Chain Replication | A practical high-throughput replication pattern with a clean read/write path.
Practical Byzantine Fault Tolerance | What changes when faults may be malicious or arbitrary instead of crash-only.
IronFleet | How formal verification can apply to practical distributed systems.
Verdi | A framework for implementing and verifying distributed systems under explicit fault models.

The practical takeaway: replication is not one thing. There is leader-based replication, quorum replication, chain replication, state-machine replication, Byzantine replication, and verified replication. Each one answers a different failure model.
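
For the quorum flavor specifically, the core arithmetic is small enough to check by brute force. The sketch below assumes strict quorums over a fixed replica set (N replicas, read set of size R, write set of size W); the function name is hypothetical. It confirms that every read set overlaps every write set exactly when R + W > N. Sloppy quorums, as in Dynamo, deliberately relax this, which is part of why Dynamo needs conflict handling.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Brute-force check: does every read set of size r share at least one
    replica with every write set of size w, out of n replicas?"""
    replicas = range(n)
    return all(
        set(reads) & set(writes)  # an empty intersection is falsy
        for reads in combinations(replicas, r)
        for writes in combinations(replicas, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 1, 5), (5, 2, 3)]:
    # The brute-force answer should match the rule of thumb r + w > n.
    print(n, r, w, quorums_always_overlap(n, r, w), r + w > n)
```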

3. Distributed Databases and Transactions

Read these when the design asks for SQL-like behavior, high availability, geo-replication, transactions, or predictable cloud service behavior.

Paper | What it unlocks
Megastore | The middle ground between Bigtable and Spanner: interactive services with entity-group transactions.
F1 | A distributed relational database built on Spanner for a high-value business workload.
Calvin | Deterministic transaction ordering as a way to simplify distributed transaction execution.
FoundationDB | Strict serializability, simulation testing, and an unbundled transactional key-value core.
Amazon DynamoDB | The managed-service evolution of the Dynamo lineage: predictable performance as an operating contract.

The failure I have seen here is design-by-product-name. A team says "we need Spanner" when it means "we need relational transactions," or says "we need Dynamo" when it means "we need regional availability." The paper path helps separate the product from the guarantee.

4. Storage Engines, Caches, and Serving Data

Read these when your bottleneck is write amplification, read amplification, cache invalidation, social-graph serving, or memory pressure.

Paper | What it unlocks
TAO | Graph-aware read serving and cache consistency at massive social scale.
Scaling Memcache at Facebook | Cache deployment, invalidation, regional scale, and the messy production edge of "just add cache."
MyRocks | Production LSM-tree lessons inside a MySQL-compatible serving layer.
WiscKey | Separating keys from values to reduce write amplification on SSDs.
SILT | Memory-efficient key-value indexing for flash-based storage.

This lane is useful because interviews often wave at "cache" or "storage engine" as if those are single boxes. These papers open the boxes.
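
As one way of opening the "storage engine" box, here is a toy LSM-tree. It is a sketch under heavy assumptions: everything lives in memory, there is no write-ahead log, no deletes, and `TinyLSM` is a hypothetical name, not any real engine's API. It still shows the tradeoff: writes are cheap because they only touch the memtable, reads get more expensive with every extra sorted run, and compaction pays write work to buy the read cost back.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable: dict[str, str] = {}
        self.runs: list[list[tuple[str, str]]] = []  # newest run last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value  # writes only touch memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Freeze the memtable into a sorted, immutable run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        # Newest run wins; every extra run is extra read work.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self) -> None:
        # Merge all runs into one, keeping the newest value per key.
        merged: dict[str, str] = {}
        for run in self.runs:  # oldest first, so later runs overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for k, v in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
    db.put(k, v)
print(db.get("a"), len(db.runs))  # 3 2
db.compact()
print(db.get("a"), len(db.runs))  # 3 1
```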

5. Batch, Streaming, and Dataflow

Read these when the system design involves analytics, pipelines, materialized views, out-of-order events, or recomputation after failure.

Paper | What it unlocks
Resilient Distributed Datasets | Spark's lineage-based recovery model and in-memory cluster computing.
DryadLINQ | Data-parallel computation as a language-integrated DAG.
FlumeJava | A higher-level pipeline abstraction that moves beyond hand-written MapReduce jobs.
MillWheel | Low-latency stream processing with persistent state and exactly-once-ish operational semantics.
The Dataflow Model | Event time, watermarks, triggers, and the batch/stream unification model behind modern streaming systems.

The main idea: data processing systems are state machines with clocks. If you do not know which clock your system believes, late data will eventually embarrass you.
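
A small sketch of what "which clock your system believes" means in practice. This is a simplified event-time tumbling window with an externally supplied watermark. The class and field names are hypothetical, and real systems such as those described in the Dataflow Model paper have much richer triggers and lateness handling. Here a window is emitted once the watermark passes its end, and anything arriving after that is counted as dropped.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per event-time window; closes windows by watermark."""

    def __init__(self, window_size: int, allowed_lateness: int = 0):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.open_windows: dict[int, int] = defaultdict(int)
        self.watermark = 0
        self.emitted: list[tuple[int, int]] = []  # (window_start, count)
        self.dropped_late = 0

    def on_event(self, event_time: int) -> None:
        start = event_time - event_time % self.window_size
        window_end = start + self.window_size
        if window_end + self.allowed_lateness <= self.watermark:
            self.dropped_late += 1  # the window already closed
            return
        self.open_windows[start] += 1

    def on_watermark(self, watermark: int) -> None:
        # The watermark is a claim: "no events older than this are expected."
        self.watermark = max(self.watermark, watermark)
        for start in sorted(self.open_windows):
            if start + self.window_size + self.allowed_lateness <= self.watermark:
                self.emitted.append((start, self.open_windows.pop(start)))


w = TumblingWindowCounter(window_size=10)
for t in (3, 12, 7):       # 7 arrives out of order but still in time
    w.on_event(t)
w.on_watermark(10)          # closes window [0, 10)
w.on_event(5)               # too late: its window is already emitted
w.on_watermark(20)          # closes window [10, 20)
print(w.emitted, w.dropped_late)  # [(0, 2), (10, 1)] 1
```

Raising `allowed_lateness` trades latency for completeness: windows stay open longer, so fewer events are dropped and results arrive later.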

6. Cluster Management, Scheduling, and Networking

Read these when the design question shifts from one service to the fleet: placement, fairness, utilization, load balancing, and failure domains.

Paper | What it unlocks
Large-scale Cluster Management at Google with Borg | The production predecessor ideas behind modern cluster orchestration.
Omega | Shared-state scheduling and the tradeoff between centralized and decentralized control.
Dominant Resource Fairness | Fair allocation when jobs consume CPU, memory, disk, network, and now GPUs.
Sparrow | Low-latency scheduling for short tasks without a giant centralized bottleneck.
Maglev | Software load balancing with consistent hashing, fast failover, and operational simplicity.

This lane changes how you draw architecture diagrams. A box labeled "Kubernetes" is not a design. Scheduling policy, failure isolation, admission control, load balancing, and resource fairness are the design.
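
Resource fairness is a good example of a policy that fits in a few lines once you have read the paper. The sketch below is a simplified take on Dominant Resource Fairness: repeatedly hand one task to the user whose dominant share is lowest. The function name and the choice to keep going after one user stops fitting are my simplifications. The numbers reproduce the paper's two-user example of 9 CPUs and 18 GB of memory.

```python
def drf_allocate(
    capacity: dict[str, float],
    demands: dict[str, dict[str, float]],
) -> dict[str, int]:
    """Progressive filling: give one task at a time to the user with the
    lowest dominant share, until no remaining user's next task fits."""
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in demands}
    active = set(demands)

    def dominant_share(user: str) -> float:
        # A user's dominant share is their largest fractional use of any resource.
        return max(tasks[user] * demands[user][r] / capacity[r] for r in capacity)

    while active:
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            active.discard(user)  # this user's next task no longer fits
    return tasks


capacity = {"cpu": 9, "mem_gb": 18}
demands = {
    "A": {"cpu": 1, "mem_gb": 4},  # memory-heavy tasks
    "B": {"cpu": 3, "mem_gb": 1},  # CPU-heavy tasks
}
print(drf_allocate(capacity, demands))  # {'A': 3, 'B': 2}
```

Both users end at a dominant share of two thirds: A holds 12 of 18 GB and B holds 6 of 9 CPUs. That equalization is the fairness property.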

7. Observability, Authorization, and Operations

Read these when the system is too large for local reasoning. These papers are the difference between "it should work" and "we can see why it did not."

Paper | What it unlocks
Pivot Tracing | Dynamic causal monitoring across distributed systems.
Canopy | End-to-end tracing from clients through backend services.
Monarch | Planet-scale metrics storage and query serving.
Zanzibar | Global relationship-based authorization with consistency requirements.
Autopilot | Rightsizing and autoscaling as a production system, not a dashboard slider.

This is the lane most interview prep underweights. Real systems fail in the gaps between services, alerts, permissions, quotas, and ownership. A design that cannot be observed is still a prototype.
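
The data model behind tracing is smaller than it sounds. Here is a minimal in-process sketch: every span carries a trace id shared by the whole request and a parent id pointing at the span that caused it. The `Tracer` and `Span` names are hypothetical and this is not any tracing library's API. Across real services the same two ids would travel in request headers instead of a Python stack.

```python
import itertools
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional

_ids = itertools.count(1)  # stand-in for random ids


@dataclass
class Span:
    trace_id: int
    span_id: int
    parent_id: Optional[int]
    name: str


class Tracer:
    def __init__(self):
        self.finished: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        s = Span(
            # Children inherit the trace id; only a root span mints a new one.
            trace_id=parent.trace_id if parent else next(_ids),
            span_id=next(_ids),
            parent_id=parent.span_id if parent else None,
            name=name,
        )
        self._stack.append(s)
        try:
            yield s
        finally:
            self._stack.pop()
            self.finished.append(s)


tracer = Tracer()
with tracer.span("checkout") as root:
    with tracer.span("inventory") as inv:
        pass
    with tracer.span("payment") as pay:
        pass

for s in tracer.finished:
    print(s.name, "trace", s.trace_id, "parent", s.parent_id)
```

With parent links recorded, "which downstream call made this request slow?" becomes a tree walk instead of a guess.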

8. Serverless and AI Infrastructure

Read these when the design moves into functions, stateful compute, ML orchestration, transformer serving, or GPU memory.

Paper | What it unlocks
Cloud Programming Simplified | A Berkeley view of what serverless makes easier and what it still cannot hide.
Cloudburst | Stateful serverless functions and low-latency composition.
Ray | Tasks, actors, and distributed execution for emerging AI applications.
ORCA | Scheduling and batching for transformer-based generative model serving.
vLLM / PagedAttention | GPU memory management for LLM serving, especially KV-cache pressure.

This lane is now part of system design, not a separate ML footnote. If the product calls an LLM in production, the architecture includes batching, queueing, memory pressure, rate limits, model placement, observability, and fallback behavior. The model is not outside the system. It is one of the most expensive parts of it.
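
To show why memory pressure is an architecture concern, here is a toy block allocator in the spirit of paged KV-cache management. It is a sketch of the bookkeeping idea only: the class and method names are hypothetical, there are no tensors or GPU calls, and it is not vLLM's API. Each sequence grows one fixed-size block at a time, and when the pool is empty the caller has to queue or preempt.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; False means the pool is exhausted."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full
            if not self.free_blocks:
                return False  # caller must queue or preempt
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=3, block_size=4)
for _ in range(5):
    cache.append_token("req-1")         # 5 tokens -> 2 blocks
for _ in range(4):
    cache.append_token("req-2")         # 4 tokens -> 1 block
print(cache.append_token("req-2"))      # False: no free block left
cache.release("req-1")
print(cache.append_token("req-2"))      # True: blocks were reclaimed
```

Allocating on demand in small blocks means a sequence only holds memory for tokens it has actually produced, so more requests can share the same pool.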

Where to Keep Mining

Use these sources as recurring indexes, not one-time lists.

Source | Why it is useful
MIT 6.5840 Distributed Systems schedule | Best "read and implement" backbone. The labs force the papers into code.
Dan Creswell's distributed systems reading list | Good topical grouping: latency, Google systems, Amazon systems, consistency, and theory.
Heidi Howard's distributed consensus reading list | Deep consensus path: clocks, failure detectors, quorums, consensus, BFT, and verification.
Papers We Love: distributed systems | Broad discovery list with many classics and adjacent systems papers.
The Morning Paper | Useful when you want a readable guided summary before or after the original paper.

The mistake is to mine these sources for "more titles" only. Mine them for adjacency. If you read Dynamo and the next bug you see is a conflict-resolution bug, branch into vector clocks, CRDTs, and read-repair papers. If you read Spanner and get stuck on time, branch into Lamport clocks, TrueTime, external consistency, and uncertainty windows. If you read Tail at Scale and your service has fan-out latency, branch into hedged requests, load shedding, queueing, and tracing.
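
The Tail at Scale branch is easier to remember with the arithmetic written down. The sketch below assumes each downstream call is independently slow with the same probability, which is a simplification, and the function name is mine. With a 1% chance per call, a request that fans out to 30 calls hits at least one slow call roughly 26% of the time, and at 100 calls it is roughly 63%.

```python
def p_any_slow(fan_out: int, p_slow: float = 0.01) -> float:
    """Chance that at least one of fan_out independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out


for n in (1, 10, 30, 100):
    print(f"fan-out {n:>3}: {p_any_slow(n):.1%} of requests see a slow call")
```

This is why the per-service p99 stops being a rare event once a request touches many services.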

The Reading Order I Would Actually Use

Do not read all of this linearly. Read it in passes.

Pass | Goal | Read
1. Foundation | Learn the vocabulary of failure, storage, batch, availability, and coordination. | Time/Clocks, GFS, MapReduce, Dynamo, Bigtable, Chubby.
2. Correctness | Learn what "correct" means before you design replicas. | Linearizability, FLP, Paxos, Raft, Viewstamped Replication, Chain Replication.
3. Databases | Learn how real databases compose storage, transactions, time, and operations. | Spanner, Megastore, F1, Calvin, FoundationDB, DynamoDB.
4. Data systems | Learn how computation moves to data and how streams handle time. | Kafka, RDDs, FlumeJava, MillWheel, Dataflow.
5. Fleet systems | Learn the datacenter as a scheduler, not just a set of hosts. | Borg, Omega, DRF, Sparrow, Maglev.
6. Operations | Learn how large systems explain themselves. | Tail at Scale, Dapper, Pivot Tracing, Monarch, Zanzibar, Autopilot.
7. Modern infra | Learn what changes when compute is serverless and model serving is on the critical path. | Cloud Programming Simplified, Cloudburst, Ray, ORCA, vLLM.

If you are preparing for interviews, do not turn each pass into notes only. Turn it into a design drill:

  1. Draw the system the paper describes.
  2. Mark the unit of partitioning.
  3. Mark the source of truth.
  4. Mark the coordination point.
  5. Mark the retry path.
  6. Mark the failure the paper optimizes for.
  7. Mark the failure the paper accepts.

That last line matters. Every good system paper has a shape like "we are willing to pay this cost to avoid that failure." GFS accepts relaxed append semantics because the workload is large sequential data processing. Dynamo accepts reconciliation complexity because availability matters more for its shopping-cart-adjacent workloads. Spanner pays for clock infrastructure because global consistency matters. Dapper pays instrumentation overhead because untraceable distributed systems are not operable.
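
Dynamo's reconciliation cost starts with detecting that two versions actually conflict. A minimal sketch of that check with vector clocks follows. Clocks are plain dicts from node name to counter, and the function name and return labels are my own choices for illustration. Two versions are concurrent when neither clock dominates the other, and that is the case the application has to reconcile.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for clocks a and b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b; b can safely replace a
    if b_le_a:
        return "after"
    return "concurrent"      # neither saw the other: a real conflict


print(compare({"x": 1}, {"x": 2}))                  # before
print(compare({"x": 2, "y": 1}, {"x": 2}))          # after
print(compare({"x": 1, "y": 1}, {"x": 2}))          # concurrent
```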

How to Use These in a System Design Interview

A paper should give you a move, not a monologue.

Bad interview answer:

"I would use Dynamo because it is highly available."

Better answer:

"If this product values write availability during regional failure more than immediate conflict-free reads, I would use a Dynamo-style quorum design with explicit conflict resolution. But if the business cannot tolerate divergent writes, I would move toward leader-based replication or a transactional database and accept lower availability during partitions."

Bad answer:

"I would use Kafka for scalability."

Better answer:

"The stream is the durable boundary. Producers append events by partition key, consumers process by group, and replay is part of the recovery model. The tradeoff is that event ordering is per partition, not global, so the key choice is part of the correctness design."

Bad answer:

"I would add tracing."

Better answer:

"The request fans out across thirty services, so average latency is misleading. I would propagate trace context across the critical path, record spans around downstream calls, and alert on p95/p99 plus fan-out error contribution. Dapper and Pivot Tracing are the mental models."

The paper is not the credential. The design move is the credential.
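
As one illustration, the Kafka answer above can be backed by a few lines on a whiteboard. This sketch is a toy partitioned log, not Kafka's client API: `PartitionedLog` is a hypothetical class and `crc32` is a stand-in hash I picked for the example. It shows the two claims in the answer: events with the same key stay in order because they share a partition, and replay is just reading from an offset.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash: the same key always lands on the same partition.
    # crc32 is a stand-in chosen for this sketch, not Kafka's partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedLog:
    """Toy append-only log split into partitions (hypothetical class)."""

    def __init__(self, num_partitions: int):
        self.partitions: list[list[tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def append(self, key: str, event: str) -> tuple[int, int]:
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, event))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def replay(self, partition: int, from_offset: int = 0) -> list[tuple[str, str]]:
        # Replay is re-reading from an offset; nothing is consumed destructively.
        return self.partitions[partition][from_offset:]


log = PartitionedLog(num_partitions=4)
for key, event in [("order-1", "created"), ("order-2", "created"),
                   ("order-1", "paid"), ("order-1", "shipped")]:
    log.append(key, event)

p = partition_for("order-1", 4)
order_1_events = [e for k, e in log.replay(p) if k == "order-1"]
print(order_1_events)  # ['created', 'paid', 'shipped']
```

Nothing here orders "order-1" against "order-2". If the design needs that, the key choice has to change, which is the point of the answer.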

What This Map Leaves Out

No reading list is neutral. This one tilts toward production distributed systems, data infrastructure, and the systems side of AI. It leaves out many excellent papers in operating systems, compilers, programming languages, databases, networks, security, formal methods, and human-computer interaction.

It also leaves out a lot of products. That is deliberate. Products change faster than design constraints. Papers give you durable vocabulary for constraints: time, state, coordination, failure, locality, latency, throughput, memory, fairness, consistency, and observability.

Read product docs after you have the vocabulary. Otherwise every product page sounds like a solution.

The Operating Rule

When a system design list feels overwhelming, stop counting papers and start naming lanes.

If you are weak on correctness, read time, linearizability, consensus, and snapshots. If you are weak on databases, read Spanner, F1, Calvin, FoundationDB, and DynamoDB. If you are weak on operations, read Tail at Scale, Dapper, Pivot Tracing, Monarch, and Autopilot. If you are working near AI infrastructure, read Ray, ORCA, vLLM, and the serverless papers next to them.

The reader who wins is not the one who has skimmed the most PDFs. It is the one who can look at a design and ask:

  • What is the source of truth?
  • What is the failure model?
  • What guarantee is the system promising?
  • What is the coordination cost?
  • What happens at p99?
  • How does the system recover?
  • How would we know it is broken?

That is what these papers are for. Not prestige. Not trivia. They are a vocabulary for making better tradeoffs when the system stops fitting in one process, one database, one region, or one diagram.
