From Algorithms to AI Systems

Outcome focus: Connected classical algorithm analysis to production software, ML pipelines, RAG systems, model serving, and the trade-offs behind modern AI research.

Algorithms sound academic until a system gets slow.

Then they become practical very quickly.

A dashboard times out. A data pipeline misses its batch window. A model scoring job takes too long to finish. A tag audit scans the same event history repeatedly. A recommendation endpoint sorts more data than it needs. A retrieval system stuffs too much into the context window and still misses the document that mattered.

The failure rarely announces itself as "bad algorithmic complexity."

It shows up as latency, cost, memory pressure, missed SLAs, brittle ML pipelines, and systems that only work while the data is small.

That is why I keep coming back to algorithms as a foundation. Not because every engineer needs to re-implement quicksort. Most do not. The value is learning to see shape. Where is the loop? Where is the lookup? Where is the repeated work? Where is the hidden cross join? Where does memory grow with the input? Where does the system do expensive work at request time that could have been indexed, cached, streamed, or precomputed?

Once you learn to ask those questions, the same thinking carries into software architecture, data engineering, machine learning, and LLM systems.

This post is a map of that connection.

It builds on my earlier post, Algorithm Complexity as Engineering Judgment, but goes further: from classical complexity to ML production and modern AI research.

An algorithm is not just code#

An algorithm is a finite procedure for turning inputs into outputs.

That sounds simple, but the useful part is not the word procedure. The useful part is that an algorithm has a claim.

It claims to produce a result.

It claims to finish.

It claims to use some amount of time and memory.

It may claim to be deterministic, or it may intentionally use randomness. It may assume all data fits in memory, or it may be designed for streams, disks, distributed workers, or parallel machines. It may be exact, approximate, online, batched, probabilistic, or learned.

That is why "algorithm" is a larger idea than "the code I wrote."

The code is one implementation. The algorithm is the shape of the work.

When we analyze algorithms, we usually care about correctness and resources. Correctness asks whether the procedure produces the intended output. Complexity asks how time and memory grow as the input grows. Those two concerns are inseparable in real systems. A correct system that cannot finish inside the product's constraints is not useful.

MIT's Introduction to Algorithms course puts analysis of algorithms, asymptotic notation, recurrences, and the Master Method at the beginning of the subject for a reason. You need a language for growth before you can make good design decisions at scale.

The first lesson is that input size changes the story.

Code that feels instant at 1,000 rows can become embarrassing at 10 million rows. Code that fits in memory during a demo can crash during a customer import. Code that makes one network call per row can turn a batch job into an accidental denial-of-service against your own dependencies.

Algorithms teach you to ask what happens next.

Complexity is a vocabulary for growth#

Big O is an upper-bound language for growth. It tells you how a runtime or memory requirement scales as input size increases, ignoring constants and lower-order terms.

That does not mean constants are irrelevant.

Constants matter in production. So do cache locality, vectorization, serialization, database query plans, network hops, disk IO, and cold starts. But asymptotic thinking still matters because the largest mistakes often come from the wrong growth class.

The usual ladder looks like this:

O(1) for constant-time work
O(log n) for work that shrinks the search space repeatedly
O(n) for one pass over the input
O(n log n) for many sorting and divide-and-conquer operations
O(n^2) for many pairwise comparisons and nested scans
O(2^n) and worse for combinatorial explosions

The point is not to memorize the ladder.

The point is to notice when a product feature quietly moves from one rung to another.

A single scan is often fine. A scan inside a loop may not be. Sorting once may be fine. Sorting every request may not be. Building a map once may be cheaper than scanning repeatedly. Streaming a file may be safer than loading it all. Partitioning a data job may change an impossible operation into a set of manageable ones.

This is why I do not think of complexity as a classroom exercise.

I think of it as a design review habit.

Time complexity and space complexity show up differently#

Time complexity gets the attention because users feel latency.

Space complexity is the one that often knocks systems over.

A function that reads all data into memory can be simple and fast for a small file. It can also fail catastrophically when the file grows. An analytics job that materializes every intermediate table may be easier to debug, but it may become expensive and slow. A model serving process that loads too many models, tokenizers, indexes, and feature artifacts can run out of memory before the first request does anything useful.

Space is also where streaming becomes important.

In ingestion systems, the goal is often to keep the work close to linear in the number of records while keeping memory bounded. Read a chunk. Validate it. Transform it. Write it. Track enough state to continue. Avoid materializing more than the step needs.

That is an algorithmic decision, even when nobody calls it one.

It is also a product decision. A user does not care that the total work is O(n) if the system crashes because n did not fit in memory.

Master Method thinking outside the textbook#

The Master Method is usually taught for divide-and-conquer recurrences of the form:

T(n) = aT(n / b) + f(n)

That notation belongs to algorithms courses, but the intuition is useful in engineering.

When you split work into shards, batches, partitions, map tasks, workers, or subproblems, you are making a claim about how work decomposes. You are asking whether each smaller unit can be processed independently and how much merge work remains afterward.

That shows up in data engineering constantly.

If you partition a table by date and customer, can each partition be processed independently? What happens at the boundary? Is there a global sort? Is there a skewed key that sends half the data to one worker? Does the reduce step become the real bottleneck? Does a join multiply records across partitions?

The algebra may not be written on the whiteboard.

The recurrence is still there.

Amortization is how systems pay for repeated work#

Amortized analysis asks about the cost of a sequence of operations, not only one operation in isolation.

Dynamic arrays are the classic example. Most appends are cheap. Occasionally the array resizes and copies data. The expensive operation is acceptable because it is rare enough that the average cost across many appends stays small.

That same idea appears all over production systems.

Build an index once, then reuse it.

Precompute a rollup, then answer many queries cheaply.

Generate surrogate keys once, then join faster later.

Cache embeddings, then avoid recomputing them for every RAG query.

Materialize feature windows, then score many users without rebuilding the entire training set.

The danger is pretending amortization is free. Precomputation creates freshness questions. Caches create invalidation questions. Indexes create storage and update costs. Rollups create lineage and reconciliation work.

Still, amortization is one of the main ways engineering teams turn repeated work into sustainable systems.

The trick is to make the trade explicit.

Software engineering is algorithm selection under constraints#

Most day-to-day software work is not about inventing new algorithms.

It is about choosing the right existing shape.

An array is good when you need ordered iteration.

A set is good when you need membership.

A map is good when you need keyed lookup.

A trie is good when you need prefix search.

A heap is good when you need repeated access to the smallest or largest item.

An index is good when a database should not scan the world.

A queue is good when work should not block the request.

A stream is good when the input is too large to hold.

A sketch is good when approximate answers are acceptable and exact counting is too expensive.

The earlier post on Algorithm Complexity as Engineering Judgment walks through this with a product recommendation example. The key idea is simple: the first implementation proves the behavior, but the second implementation should respect growth.

The same rule applies to APIs, dashboards, data products, and ML systems.

If a feature will live on a hot path, repeated scans and per-row external calls deserve suspicion. If a feature processes large files, memory needs to be part of the design. If a feature joins large datasets, the join strategy matters more than the elegance of the code around it.

The shape of work becomes the shape of the system.

Data pipelines are algorithms with invoices#

Data engineering makes complexity visible because every inefficient decision eventually becomes a bill.

In metadata-driven ingestion frameworks, the goal is usually boring and strict:

one pass over source records when possible
partition-aware writes
explicit schema validation
idempotent retries
keyed joins rather than accidental all-to-all work
bounded memory
enough metadata to resume, audit, and backfill

That is not just "pipeline design."

It is algorithm design with operational constraints.

The red flags are familiar:

cross joins that nobody meant to write
window functions without partition discipline
repeated full-table scans for small lookups
per-row API calls
unbounded deduplication state
transformations that materialize huge intermediates
backfills that cannot be chunked or resumed

The good patterns are also familiar:

partition first
index or cluster on the access path
push filters down
precompute stable dimensions
use incremental models when the business question allows it
make freshness and lineage visible
separate validation from transformation from publication

This is where software engineering and data engineering share a spine.

Both are about making the work fit the boundary.

Propensity modeling is a systems problem before it is a model problem#

In a propensity model, the model usually gets too much credit and too much blame.

The hard part is often upstream.

Can we build features consistently? Are time windows correct? Are labels defined without leakage? Are joins stable? Are nulls meaningful or accidental? Are customer identifiers stitched correctly? Are feature values fresh enough to support the action the business wants to take?

That is algorithmic thinking again.

A weekly propensity pipeline should not recompute every expensive feature from scratch if the underlying data allows incremental windows. It should not use a join pattern that grows quadratically with customer and event history. It should not treat time as decoration. It should not build features in one way during training and another way during scoring.

The ML metric matters. I wrote more about that in Plain-Language Machine Learning Metrics for Real Decisions.

But before the metric, there is a pipeline.

And before the pipeline, there is a claim about complexity, data shape, and time.

Tag governance work has its own version of this.

A page session is an event stream. Consent state changes over time. Tags fire, fail, duplicate, race, or fire before they are allowed to. The audit problem is not only "did this tag appear." It is "did this tag appear at the right time, under the right state, with the right payload, and without violating the rules."

That wants stream thinking.

For each session, you can process events in order. Maintain a small amount of state: current consent status, tags seen, events expected, duplicates detected, and violations. Use maps and sets instead of repeated scans. Keep per-session work close to linear in the number of events.

That is much better than repeatedly asking global questions over the whole event log.

The same pattern applies to GTM audits, consent mode checks, and analytics validation. The right data structure makes the rule legible. The wrong one makes the audit slow and fragile.

Machine learning adds statistical complexity#

Classical algorithms ask about computation.

Machine learning adds generalization.

A model can be fast and wrong. It can be accurate on training data and useless on future data. It can optimize the metric and miss the business outcome. It can work until the data distribution moves.

That is why ML introduces another layer of trade-offs:

bias and variance
underfitting and overfitting
training cost and inference cost
data quantity and data quality
model capacity and regularization
offline metrics and online impact
feature freshness and label delay
monitoring and rollback

The bias-variance trade-off is the first conceptual bridge. A model that is too simple may miss the signal. A model that is too flexible may memorize noise. IBM's explainer frames this as two sources of prediction error: bias from overly simple assumptions and variance from sensitivity to training data.

In practical terms, this is why I still like simple baselines.

A logistic regression, decision tree, or gradient boosted model can tell you a lot before you reach for a deep model. If the simple model fails, it may expose a data problem. If it works, it gives you a benchmark. If a complex model beats it by a small amount while becoming much harder to operate, the trade may not be worth it.

Machine learning does not remove engineering judgment.

It demands more of it.

ML production is where hidden complexity compounds#

The paper Hidden Technical Debt in Machine Learning Systems remains useful because it says the quiet part clearly: the model code is a small part of the system.

The expensive parts are around it.

Data dependencies. Feature pipelines. Configuration. Training-serving skew. Monitoring. Feedback loops. Undeclared consumers. Glue code. Dead experiment paths. Changing external behavior.

This is why ML projects can look easy in notebooks and become painful in production.

The notebook hides the system.

Google's Rules of Machine Learning makes a related point from a practical angle: do not rush into complex ML before metrics, infrastructure, and simple baselines exist. It even starts with the uncomfortable rule that you should not be afraid to launch without machine learning if heuristics can get the product moving.

That is good engineering advice.

The ML Test Score pushes the same lesson into readiness checks: test data, features, model quality, infrastructure, monitoring, reproducibility, rollback, and operational behavior. It is a reminder that production ML quality is not a single AUC value.

It is a system.

This connects directly to the artifact contract in The Preprocessing Boundary Between scikit-learn and PyTorch. A model is not only weights. It is preprocessing, feature order, metadata, package versions, evaluation evidence, and serving behavior.

That is algorithmic thinking wearing an MLOps jacket.

Transformers made sequence modeling scalable, but not free#

Modern AI systems changed the surface area, but not the underlying need for complexity thinking.

The Transformer paper, Attention Is All You Need, replaced recurrence and convolution for many sequence tasks with an attention-based architecture that is highly parallelizable. That parallelism is one reason Transformers became the dominant architecture across language and beyond.

But attention has a cost.

Standard self-attention compares tokens to tokens. In the simplest form, that means sequence length has a quadratic cost. Double the context length, and the attention work can grow by roughly four times, before implementation tricks and architecture variants enter the picture.

This matters for application builders.

Long context is not just "more room."

It is latency, cost, retrieval quality, distraction, and evaluation complexity. I wrote about that in Context Engineering Keeps Long Context Useful. The practical lesson is that context should be engineered, not dumped.

Classical complexity thinking did not disappear.

It moved into token budgets, retrieval design, batching, caching, routing, and model serving.

Scaling laws changed budget conversations#

The OpenAI paper Scaling Laws for Neural Language Models showed smooth relationships between language model loss and model size, dataset size, and compute across many orders of magnitude.

The practical impact was enormous.

It gave teams a way to reason about compute allocation instead of treating model scaling as pure guesswork. Bigger models, more data, and more compute could be discussed as budget trade-offs with empirical curves behind them.

Then Chinchilla complicated the story.

Training Compute-Optimal Large Language Models argued that many large models were undertrained relative to their size. For a fixed compute budget, the authors found that model size and training tokens should scale together. In practice, this meant smaller models trained on more data could outperform much larger models trained on fewer tokens, while also being cheaper at inference time.

That lesson matters outside frontier model training.

For product teams, "bigger" is not automatically better.

A smaller model with better data, better retrieval, better evals, and better serving behavior may beat a larger model that is expensive, slow, and poorly grounded.

That is the same engineering lesson again:

Do not optimize the impressive variable.

Optimize the system.

Mixture-of-Experts is capacity with routing problems#

Mixture-of-Experts models are one answer to the scaling problem.

Instead of activating the whole model for every token, an MoE architecture routes work to a subset of expert networks. The appeal is clear: increase total model capacity without increasing active computation at the same rate.

The 2024 survey A Survey on Mixture of Experts frames MoE as a way to scale large models with limited computational overhead, while also emphasizing the engineering concerns that come with routing, load balancing, communication, and stability.

This is not magic.

It is a trade.

You gain sparse activation and capacity. You inherit routing behavior, expert utilization questions, distributed communication overhead, and new failure modes. Expert collapse, imbalance, or serving complexity can eat the benefit.

For most application teams, the lesson is not "train your own MoE."

The lesson is to understand the serving profile of the models you choose. Dense, sparse, quantized, distilled, hosted, local, GPU-backed, CPU-friendly, long-context, tool-using, multimodal: every model choice changes latency, cost, reliability, and operational complexity.

The architecture is part of the product.

RAG is retrieval plus generation, not prompt decoration#

Retrieval-Augmented Generation is often described too casually.

The original RAG paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, combined a parametric model with non-parametric memory retrieved from an external corpus. The core idea is still useful: do not expect the model's weights to be the only source of knowledge.

For enterprise systems, RAG is often the right first move.

Not because it solves everything.

Because it moves changing knowledge out of model weights and into an index that can be updated, inspected, permissioned, evaluated, and cited.

But RAG has its own complexity.

Chunking changes recall. Embeddings change neighborhoods. BM25 and dense retrieval catch different signals. Metadata filters can help or accidentally hide the right document. Rerankers improve precision but add latency. Larger top-k values may improve recall while flooding the model with distractions. Context windows can carry more text but also more noise.

The 2024 survey Evaluation of Retrieval-Augmented Generation is useful because it separates retrieval and generation evaluation. That separation matters. You cannot fix bad recall with a better prompt. If the right passage never reaches the model, generation is already boxed in.

This connects to ADK Agent Memory Is an Operating Boundary and Context Engineering Keeps Long Context Useful. Retrieval, memory, and context are not decorations around an LLM. They are the information architecture of the system.

Healthy skepticism about scaling alone#

Scaling works.

Scaling alone is not a complete product strategy.

Some researchers, including Yann LeCun, have argued that larger language models and more data are not enough for robust intelligence involving planning, grounding, persistent memory, and common sense. Business Insider reported LeCun's 2025 criticism of overreliance on scaling and his preference for systems that model the world more directly.

You do not have to pick a side in the public debate to get the practical point.

Application builders should expect hybrid systems to remain useful:

retrieval for grounded knowledge
tools for actions
databases for state
rules for hard constraints
evaluators for regression checks
workflows for accountability
models for language, reasoning, classification, extraction, and planning assistance

The future may produce better models.

The present still needs systems.

The same red flags keep returning#

Across software, data, ML, and LLM systems, the warning signs rhyme.

Repeated scans.

Nested loops over large inputs.

Cross joins without intent.

Per-row network calls.

Full sorts for top-k questions.

Unbounded memory.

Indexes that are missing or stale.

Pipelines that cannot resume.

Training code that does not match serving code.

Features without lineage.

Models without rollback.

Retrieval without recall measurement.

Prompts without versioning.

Context windows filled with whatever was nearby.

Agents with too many tools and no permission boundary.

These are not separate categories of failure.

They are variations of the same engineering problem: the system's shape does not match the work it is being asked to do.

A practical checklist#

When I look at a software or data system, I ask:

What is the main input size?
What happens if it grows by 10 times?
What happens if it grows by 100 times?
Where are we scanning repeatedly?
Where are we sorting repeatedly?
Where are we making calls inside loops?
What state grows with the input?
What can be streamed?
What can be indexed?
What can be cached safely?
What can be precomputed?
What needs to stay fresh?
What needs to be exact?
What can be approximate?

For ML systems, I add:

Are training and serving using the same feature logic?
Are labels defined without leakage?
Are feature windows time-correct?
Are data dependencies declared?
Are model artifacts versioned?
Are preprocessing artifacts versioned?
Are there golden batches?
Are there drift checks?
Is there a canary path?
Is there a rollback path?
Does the metric match the decision?

For LLM systems, I add:

Is retrieval evaluated separately from generation?
What is recall at k?
What is the reranker doing?
What documents are excluded by metadata filters?
What context is being inserted and why?
Are prompts, tools, indexes, and eval sets versioned?
What is the latency budget?
What is the cost per successful task?
What happens when the model does not know?
What permissions does the agent have?

This is not bureaucracy.

It is how you keep powerful systems from becoming expensive guesses.

Practice reps that actually help#

The best way to learn this is to run small experiments.

First, implement the same task three ways. Count unique users with sort plus unique, a hash set, and a probabilistic sketch like HyperLogLog. Measure runtime and memory on increasingly large inputs. You will feel the difference between exactness, memory, and speed.

Second, build a tiny RAG pipeline over your own notes. Evaluate retrieval before you evaluate answers. Use a fixed question set. Track recall at k. Track whether generated answers cite the right source. Change chunk size and watch what happens.

Third, take an existing batch job and draw the shape of work. Where is it linear? Where is it sorting? Where is it joining? Where does it materialize data? Where could it resume after failure?

Fourth, take a model pipeline and apply the ML Test Score mindset. You do not need to adopt every item at once. Start with data validation, feature reproducibility, model artifact versioning, monitoring, and rollback.

Fifth, take an agent workflow and reduce its context. Give it fewer tools. Add clearer tool descriptions. Externalize durable state. Measure whether it gets more reliable.

Small reps build taste.

Taste is what lets you see the failure before the incident.

The point#

Algorithms are not a separate world from software engineering.

They are the grammar underneath it.

Complexity teaches you how work grows. Data engineering teaches you how that growth becomes cost. Machine learning teaches you that correctness also depends on generalization. LLM systems teach you that information selection, context, retrieval, and tool boundaries are now part of application architecture.

The details change.

The discipline is steady:

Know the shape of the work.

Choose the right structure.

Measure the real behavior.

Keep the system honest as it grows.

That is the bridge from algorithms to AI systems.

From Algorithms to AI Systems

An algorithm is not just code#

Complexity is a vocabulary for growth#

Time complexity and space complexity show up differently#

Master Method thinking outside the textbook#

Amortization is how systems pay for repeated work#

Software engineering is algorithm selection under constraints#

Data pipelines are algorithms with invoices#

Propensity modeling is a systems problem before it is a model problem#

Tag audits and consent systems are stream problems#

Machine learning adds statistical complexity#

ML production is where hidden complexity compounds#

Transformers made sequence modeling scalable, but not free#

Scaling laws changed budget conversations#

Mixture-of-Experts is capacity with routing problems#

RAG is retrieval plus generation, not prompt decoration#

Healthy skepticism about scaling alone#

The same red flags keep returning#

A practical checklist#

Practice reps that actually help#

The point#

Related notes#

Sources#