AI Strategy Starts Before the Model

Outcome focus: Defined an end-to-end AI strategy playbook and worked example that ties data readiness, use-case selection, model development, governance, deployment, and operating ownership to measurable business outcomes.

The worst AI strategy starts with a model.

Not because models are unimportant. They matter. But the model is rarely the first constraint. The first constraint is usually that the business has not decided which decision should change, what metric proves the change mattered, what data is trusted enough to support the decision, and who will own the workflow after the demo ends.

That is where most AI efforts get blurry.

The company says it wants AI. A few use cases appear. Someone builds a prototype. The demo looks impressive. Then the work slows down. The data is incomplete. The workflow is unclear. Legal has questions. The model cannot be monitored. The users do not trust the output. The metric never moves. Six months later, people say the organization is "not ready for AI," which is sometimes true, but not precise enough to be useful.

The better approach is to treat AI like a serious transformation effort.

Start from business levers.

Prove value small.

Industrialize only what works.

Govern the risk before it becomes an incident.

That is the whole strategy in plain language.

Start with business levers, not use cases

"Use case" is one of those phrases that sounds practical while hiding a lot of work.

A use case is not enough.

The better first question is:

If this works, what changes in the business?

Revenue might change through conversion, cross-sell, retention, pricing, basket size, win rate, or customer lifetime value.

Cost might change through automation, fewer manual reviews, lower rework, faster handling time, reduced waste, improved inventory decisions, or fewer escalations.

Experience might change through NPS, complaint volume, response time, personalization, availability, or quality of service.

Risk might change through fewer compliance breaches, earlier fraud detection, safer workflows, better auditability, or reduced operational exposure.

An AI idea that does not touch one of those levers is not ready.

It might be research. It might be learning. It might be a useful technical exploration. But it is not yet a business strategy.

Define success in numbers before the model exists:

Reduce average handle time by 10 percent.
Increase conversion in a defined segment by 2 points.
Cut manual review volume by 30 percent without increasing error rate.
Improve first-contact resolution by 5 points.
Reduce high-risk false negatives by 20 percent.
Automate a workflow step with human review and a measured override rate.
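
One way to keep targets like these honest is to write each one down as a small record before any modeling starts. The sketch below is illustrative only: the `MetricTarget` name, its fields, and the sample numbers are assumptions, and it handles relative targets ("by 10 percent"). A target stated in points would need an absolute variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTarget:
    """A success criterion recorded before the model exists (hypothetical structure)."""
    name: str
    baseline: float       # measured value before the pilot
    target_change: float  # relative change wanted, e.g. -0.10 for "reduce by 10 percent"

    def target_value(self) -> float:
        return self.baseline * (1 + self.target_change)

    def is_met(self, observed: float) -> bool:
        # A negative target_change means lower is better; otherwise higher is better.
        if self.target_change < 0:
            return observed <= self.target_value()
        return observed >= self.target_value()


# Assumed example: average handle time of 400 seconds, target is a 10 percent reduction.
handle_time = MetricTarget("average_handle_time_seconds", baseline=400.0, target_change=-0.10)
```

Calling `handle_time.is_met(350.0)` after the pilot answers the only question the target was written to answer.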

The number can be wrong at first.

The absence of a number is worse.

The decision is the unit of strategy

AI work becomes clearer when the team stops asking "What can AI do here?" and starts asking "What decision are we improving?"

Call these customers this week, not those.

Route this case to a specialist, not the general queue.

Approve this low-risk claim automatically, but send that one to review.

Suggest these three knowledge articles to the support agent.

Flag this data pipeline run because the distribution changed.

Summarize this contract for review, but do not approve it automatically.

That is the level where AI becomes operational.

Models generate scores, classifications, text, recommendations, embeddings, summaries, and tool calls. Businesses change through decisions and workflows.

If nobody can name the decision, the AI work will drift toward demo logic.

This is the same argument I made in "What a Data Strategist Actually Does": data strategy is not about producing more artifacts. It is about improving decisions.

AI strategy is the same discipline with higher stakes.

Data readiness is not optional

Before modeling, ask the brutal questions.

Do we have the data?

Can we legally use it?

Do we know what the fields mean?

Can we access it without heroics?

Is it fresh enough?

Does it represent the decision point?

Does it include the outcome we want to predict or improve?

Are identifiers stable enough to join across systems?

Can we reproduce the training set later?

Do we have a way to monitor drift?

Data readiness has three layers.

The first is availability. The core tables, events, documents, interactions, outcomes, and operational records need to exist somewhere. If the use case depends on customer behavior, support interactions, transactions, survey responses, claims, orders, contracts, or product usage, the team should be able to list the sources and owners.

The second is semantics. "Active customer," "churn," "high value," "resolved case," "on time," "complaint," "fraud," and "NPS" need definitions. If different teams define the same term differently, the model will inherit the conflict.

The third is platform readiness. The organization needs enough warehouse, lakehouse, orchestration, access control, lineage, and deployment capability to run the work repeatedly. That might be BigQuery and Dataform, Snowflake and dbt, Databricks and workflows, Airflow, Dagster, or another stack. The specific tools matter less than whether the data path is governed and repeatable.

If those capabilities do not exist, the first AI project may be "modernize for AI."

That is not a failure.

It is honesty.

Readiness does not mean perfection

Teams sometimes use data readiness as a way to delay forever.

That is not the goal.

You do not need perfect data to begin. You need enough data to test a narrow claim, and you need to know where the weak spots are.

For a first pilot, document:

sources available now
sources missing
fields with high missingness
joins that are unstable
target definition risks
privacy or consent limits
freshness gaps
manual workarounds
data fixes required before scale

This turns data quality into a backlog instead of a vague objection.

The phrase I like is "good enough for a first model, explicit enough for a roadmap."
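
Part of that documentation can be generated rather than debated. As a minimal sketch, assuming records arrive as plain dictionaries, the function below lists the fields whose missingness exceeds a threshold. The field names, the sample rows, and the 20 percent threshold are all made up for illustration.

```python
from typing import Any


def missingness_report(rows: list[dict[str, Any]], threshold: float = 0.2) -> dict[str, float]:
    """Return fields whose share of missing values exceeds the threshold."""
    if not rows:
        return {}
    fields = {key for row in rows for key in row}
    flagged: dict[str, float] = {}
    for field in sorted(fields):
        # Treat absent keys, None, and empty strings as missing.
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        share = missing / len(rows)
        if share > threshold:
            flagged[field] = round(share, 3)
    return flagged


# Hypothetical ticket records.
tickets = [
    {"ticket_id": 1, "issue_class": "billing", "resolution_code": None},
    {"ticket_id": 2, "issue_class": "billing", "resolution_code": "refund"},
    {"ticket_id": 3, "issue_class": "", "resolution_code": None},
    {"ticket_id": 4, "issue_class": "login", "resolution_code": None},
]
backlog = missingness_report(tickets, threshold=0.2)
```

Each flagged field becomes a backlog item with an owner, which is more useful than a general complaint that "the data is bad."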

Prioritize use cases like a portfolio

Do not let the loudest stakeholder win.

Score use cases by:

business value
data feasibility
workflow feasibility
risk
time to impact
change load
measurement clarity

High-value, low-feasibility ideas belong on the roadmap, not necessarily in the first sprint.

Low-value, high-feasibility ideas may be useful for learning, but they should not consume the strategy.
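
A scoring pass does not need to be elaborate. The sketch below combines the criteria above into one weighted score. The weights, the 1 to 5 rating scale, and the two candidate use cases are assumptions chosen for illustration; a real portfolio review would set its own.

```python
# Assumed weights: negative weights mean a higher rating lowers the score.
WEIGHTS = {
    "business_value": 3,
    "data_feasibility": 2,
    "workflow_feasibility": 2,
    "measurement_clarity": 2,
    "time_to_impact": 1,
    "risk": -2,
    "change_load": -1,
}


def score_use_case(ratings: dict[str, int]) -> int:
    """Combine 1-5 ratings on every criterion into a single weighted score."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)


# Hypothetical candidates rated 1 (low) to 5 (high) on each criterion.
candidates = {
    "call_center_triage": {
        "business_value": 4, "data_feasibility": 4, "workflow_feasibility": 4,
        "measurement_clarity": 5, "time_to_impact": 4, "risk": 2, "change_load": 2,
    },
    "enterprise_wide_copilot": {
        "business_value": 5, "data_feasibility": 2, "workflow_feasibility": 2,
        "measurement_clarity": 2, "time_to_impact": 1, "risk": 4, "change_load": 5,
    },
}
ranked = sorted(candidates, key=lambda name: score_use_case(candidates[name]), reverse=True)
```

The score is a conversation aid, not a verdict. Its main job is to force every stakeholder to rate the same criteria in the open.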

The first 1 to 3 use cases should be thin slices:

narrow scope
clear user
visible decision
measurable outcome
real workflow integration
pilotable in 3 to 6 months

Examples:

A call center triage assistant for one queue.
A churn risk score for one high-value segment.
A document summarization workflow for one legal review pattern.
A data quality anomaly detector for one critical pipeline.
A next-best-action model for one retention campaign.

The point is not to solve the whole enterprise in one move.

The point is to prove that the organization can turn AI into changed behavior.

Work backward from the workflow

Architecture should follow the decision.

Ask:

Who uses the output?

Where do they use it?

How often?

What action changes?

What happens if the model is wrong?

Can the user override it?

How will feedback return?

Does the decision need batch scoring, online inference, search, retrieval, a copilot, an API, an agent, or a plain dashboard?

A propensity model might only need weekly batch scoring joined into a CRM or marketing platform.

A fraud or eligibility decision might need online scoring with strict latency and audit requirements.

A support copilot might need retrieval-augmented generation, source citations, prompt logging, human feedback, and safety filters.

An agent workflow might need MCP tools, API boundaries, user approval, and context management.

The wrong architecture can make a good model useless.

This is why I connect AI strategy to "API Design for MCP Server Boundaries." If AI systems will act through tools, APIs, and enterprise systems, the integration contracts become part of the strategy.

Start with baselines

Baseline first.

Always.

For predictive use cases, start with rules, logistic regression, decision trees, gradient boosting, or another simple model that can be explained and evaluated quickly.

For GenAI use cases, start with prompting, retrieval, policy rules, and human review before fine-tuning.

For agentic workflows, start with one or two narrow tools before building a complex multi-agent system.

Sophistication should be earned.

The first baseline tells you whether the data contains signal, whether the workflow can use the output, and whether the metric can move. If a simple model cannot be operationalized, a complex model will usually make the failure more expensive.

This does not mean advanced AI is unnecessary.

It means advanced AI should solve the problem that remains after the simple version teaches you something.
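
To make "baseline first" concrete, here is a minimal sketch of a rule baseline for a churn-style problem. The data, the 30-day inactivity threshold, and the variable names are hypothetical. Whatever model comes later has to beat these numbers to justify its complexity.

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision and recall for the positive class; returns 0.0 when undefined."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    pred_pos = sum(predicted)
    actual_pos = sum(actual)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical data: days since last login and whether the customer later churned.
days_inactive = [2, 45, 10, 60, 33, 5, 90, 1]
churned = [False, True, False, True, False, False, True, True]

# Rule baseline (assumed threshold): flag anyone inactive for 30 days or more.
rule_flags = [days >= 30 for days in days_inactive]
rule_precision, rule_recall = precision_recall(rule_flags, churned)
```

If a rule this simple already gets most of the way there, the remaining gap is what a more sophisticated model would have to earn.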

For predictive AI, define the label carefully

The label is where many ML projects quietly break.

Churn is not just "customer left." It needs a time window and a decision point.

NPS is not just a score. It has survey timing, response bias, customer segment, and operational context.

Fraud is not just a flag. It has investigation lag and false positive cost.

SLA breach is not just an event. It depends on start time, stop time, exclusions, and policy.

The model should only train on data available at the time the decision would have been made.

No future leakage.

No labels that depend on post-action behavior unless that is explicitly modeled.

No target definitions that change halfway through history.

No silent exclusions that make the training data easier than reality.
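
The decision-point rule is easier to enforce in code than in review meetings. The sketch below builds one training row as of a decision date: features read only events on or before that date, and the label reads only the window after it. The churn definition (no order in the following 90 days), the feature names, and the sample dates are assumptions for illustration.

```python
from datetime import date, timedelta


def build_example(events: list[date], decision_date: date, window_days: int = 90) -> dict:
    """Build one training row as of decision_date.

    Features use only events on or before the decision date.
    The label looks only at the window strictly after it.
    """
    window_end = decision_date + timedelta(days=window_days)
    history = [d for d in events if d <= decision_date]
    future = [d for d in events if decision_date < d <= window_end]
    recent_cutoff = decision_date - timedelta(days=30)
    return {
        "orders_last_30d": sum(1 for d in history if d > recent_cutoff),
        "days_since_last_order": (decision_date - max(history)).days if history else None,
        "churned": len(future) == 0,  # assumed definition: no order within window_days
    }


# Hypothetical order dates for one customer.
orders = [date(2024, 1, 5), date(2024, 2, 20), date(2024, 3, 1), date(2024, 7, 15)]
row = build_example(orders, decision_date=date(2024, 3, 10), window_days=90)
```

The July order falls outside the 90-day window, so this customer counts as churned under the assumed definition, and no feature has seen anything after March 10.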

The posts "Plain-Language Machine Learning Metrics for Real Decisions" and "A scikit-learn Pipeline for Calibrated Decisions" go deeper on the modeling side. The strategy point is simpler:

If the target is wrong, the model can be impressive and still useless.

For GenAI, RAG is often the first serious move

Fine-tuning is tempting because it sounds like ownership.

For many enterprise knowledge use cases, retrieval-augmented generation is the better first move.

Use RAG when the model needs access to changing documents, policies, runbooks, catalogs, contracts, tickets, or product knowledge. Keep that knowledge in an index that can be updated, permissioned, evaluated, and cited.

Do not treat RAG as "dump documents into a vector store."

The strategy needs:

source inventory
permission model
chunking plan
metadata strategy
retrieval evaluation
answer evaluation
citation requirements
freshness expectations
fallback behavior
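
Retrieval evaluation is the item on that list most often skipped, so here is a minimal sketch of one common measure, recall at k, computed over a small golden set. The questions, document ids, and retriever outputs are hypothetical; in practice the ranked lists would come from whatever retriever is under test.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        raise ValueError("each golden question needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical golden set: question -> (retriever output in rank order, relevant doc ids).
golden_set = {
    "What is the refund window?": (["policy-12", "faq-3", "policy-7"], {"policy-12"}),
    "Who approves account closures?": (["faq-9", "runbook-2", "policy-4"], {"policy-4", "runbook-5"}),
}

scores = {
    question: recall_at_k(ranked_docs, relevant, k=3)
    for question, (ranked_docs, relevant) in golden_set.items()
}
mean_recall = sum(scores.values()) / len(scores)
```

Answer evaluation and citation checks sit on top of this. If the right document never reaches the prompt, no amount of prompt work will fix the answer.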

The post "Context Engineering Keeps Long Context Useful" covers why context quality matters. Long windows do not remove the need to select information carefully. They make information hygiene more important.

For agents, tools are operating permissions

Agent strategy should be even more cautious.

An agent is not only a chat interface. It can call tools, read resources, write records, trigger workflows, and change state.

That means the strategy must define:

what the agent can read
what the agent can write
what requires user approval
what is logged
what is reversible
what is forbidden
which APIs enforce the boundary
which errors are safe to expose
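
A list like that can be expressed as a policy the system enforces rather than a document people remember. The sketch below is one possible shape, not a reference to any specific agent framework: the tool names, the three decision levels, and the in-memory `audit_log` are all assumptions. Anything not listed is denied by default.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical tool names and policy.
POLICY = {
    "read_ticket": Decision.ALLOW,
    "draft_reply": Decision.ALLOW,
    "issue_refund": Decision.NEEDS_APPROVAL,
    "delete_account": Decision.DENY,
}

audit_log: list[dict] = []


def authorize(tool: str, user_approved: bool = False) -> bool:
    """Return True if the tool call may proceed, and record the outcome either way."""
    decision = POLICY.get(tool, Decision.DENY)  # unknown tools are forbidden
    allowed = decision is Decision.ALLOW or (
        decision is Decision.NEEDS_APPROVAL and user_approved
    )
    audit_log.append({"tool": tool, "policy": decision.value, "allowed": allowed})
    return allowed
```

The important design choice is that the check lives outside the model. The agent can ask for anything; the boundary decides what happens.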

I wrote about this in "Codex Plugins Extend Agents, Not Interfaces" and "ADK Agent Memory Is an Operating Boundary." The lesson is the same: agent capability is system access. Treat it that way.

An AI strategy that ignores permissions is not a strategy.

It is a risk backlog.

Prove value with real users

A pilot is not successful because the model runs.

It is successful if it changes behavior and improves the metric named at the beginning.

That requires a controlled test.

Use a holdout, A/B test, phased rollout, shadow mode, or champion-challenger comparison depending on the use case.

Track:

business metric
operational metric
quality metric
adoption metric
override rate
user feedback
risk events
cost

For example:

A support copilot should not only be judged by answer quality in a sandbox. It should be judged by handle time, agent adoption, edit rate, customer satisfaction, escalation rate, policy violations, and cost per conversation.

A churn model should not only be judged by AUC. It should be judged by lift in the contacted segment, incremental retention, outreach cost, false positive burden, calibration, and whether the team actually used the score.

If the pilot does not change the workflow, model quality is mostly academic.
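
For the churn example, incremental retention is just the difference in retention rate between the contacted group and a comparable holdout. The sketch below shows the arithmetic with hypothetical counts. It deliberately leaves out significance testing, which a real pilot readout would need before anyone acts on the number.

```python
def retention_rate(retained: int, total: int) -> float:
    if total <= 0:
        raise ValueError("group must not be empty")
    return retained / total


def incremental_retention(treated: tuple[int, int], holdout: tuple[int, int]) -> float:
    """Difference in retention rate between the contacted group and the holdout.

    Each argument is (customers retained, customers in group).
    """
    return retention_rate(*treated) - retention_rate(*holdout)


# Hypothetical pilot counts: both groups drawn at random from the same high-risk segment.
uplift = incremental_retention(treated=(430, 500), holdout=(400, 500))
```

Here the contacted group retains 86 percent against 80 percent in the holdout, so the pilot claims roughly 6 points of incremental retention. Without the holdout, the 86 percent alone says nothing.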

Production means MLOps and LLMOps

Once a pilot proves value, the work becomes industrial.

Minimum production capabilities:

versioned code
versioned data schemas
versioned prompts or model configs
reproducible training or evaluation
artifact registry
deployment pipeline
staging and production separation
monitoring
rollback
access control
incident response

For predictive ML, monitor:

feature drift
label drift
prediction distribution
calibration
model performance when labels arrive
business KPI movement
latency and failures
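
Feature drift can be tracked with something as simple as the population stability index (PSI), which compares the binned distribution of a feature at training time with its current distribution. The sketch below is a minimal version; the tenure buckets and shares are hypothetical, and the small `eps` floor for empty bins is my own assumption rather than part of the definition.

```python
import math


def population_stability_index(expected: list[float], actual: list[float]) -> float:
    """PSI between two binned distributions given as proportions that each sum to 1."""
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    eps = 1e-6  # avoids log(0) and division by zero for empty bins
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical share of customers per tenure bucket at training time vs. this week.
training_share = [0.25, 0.25, 0.25, 0.25]
current_share = [0.10, 0.20, 0.30, 0.40]
psi = population_stability_index(training_share, current_share)
```

A commonly cited rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift, but those cutoffs are conventions, not guarantees. The alert threshold should be tuned per feature.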

For LLM systems, monitor:

retrieval quality
answer quality
citation quality
hallucination rate
refusal behavior
policy violations
latency
cost per task
tool errors
user feedback

For agent systems, monitor:

tool calls
tool failure rate
permission denials
approval rate
unsafe attempts
task completion
human intervention
state changes
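
Several of the agent signals fall straight out of a structured tool-call log. As a rough sketch, assuming each call is logged with a `status` of `ok`, `error`, or `denied` (my own naming, not a standard), failure and denial rates are a few lines of counting:

```python
from collections import Counter

# Hypothetical tool-call log entries.
tool_log = [
    {"tool": "read_ticket", "status": "ok"},
    {"tool": "read_ticket", "status": "ok"},
    {"tool": "issue_refund", "status": "denied"},
    {"tool": "draft_reply", "status": "error"},
    {"tool": "draft_reply", "status": "ok"},
]


def summarize(log: list[dict]) -> dict[str, float]:
    """Tool failure rate and permission denial rate over all recorded calls."""
    if not log:
        return {"calls": 0, "failure_rate": 0.0, "denial_rate": 0.0}
    counts = Counter(entry["status"] for entry in log)
    total = len(log)
    return {
        "calls": total,
        "failure_rate": counts["error"] / total,
        "denial_rate": counts["denied"] / total,
    }


summary = summarize(tool_log)
```

The harder part is not the arithmetic. It is making sure every tool call is logged in the first place, including the ones that were blocked.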

At that point, the AI strategy becomes an operating model.

Without it, pilots rot.

Governance should be designed early

Governance is not where AI goes to die.

Bad governance can do that. Good governance keeps success from turning into a regulatory, ethical, or operational mess.

Every organization needs an AI use-case inventory:

owner
purpose
users
affected population
data sources
model type
vendor dependencies
risk level
monitoring plan
approval status
review date
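
An inventory like this works best as structured data rather than a slide. The sketch below mirrors the fields above in a small record type. The `name` identifier, the allowed values noted in comments, and the sample entry are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class UseCaseRecord:
    """One row in an AI use-case inventory; fields mirror the list above."""
    name: str                # assumed identifier, not in the list above
    owner: str
    purpose: str
    users: str
    affected_population: str
    data_sources: list[str]
    model_type: str
    vendor_dependencies: list[str]
    risk_level: str          # assumed scale: "low", "medium", "high"
    monitoring_plan: str
    approval_status: str     # assumed values: "draft", "approved", "retired"
    review_date: date

    def is_overdue(self, today: date) -> bool:
        return today > self.review_date


# Hypothetical entry.
record = UseCaseRecord(
    name="support-escalation-triage",
    owner="Support operations",
    purpose="Draft routing recommendations for escalated tickets",
    users="Tier 2 support reviewers",
    affected_population="Customers with escalated tickets",
    data_sources=["ticket history", "policy documents"],
    model_type="LLM with retrieval",
    vendor_dependencies=["hosted LLM provider"],
    risk_level="medium",
    monitoring_plan="Weekly review of edit rate and policy violations",
    approval_status="approved",
    review_date=date(2025, 6, 30),
)
```

Once the inventory is data, overdue reviews and unowned systems become a query instead of an audit finding.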

Higher-risk use cases need more scrutiny:

credit
hiring
healthcare
insurance
legal decisions
adverse customer actions
surveillance
identity
safety-critical workflows

Define what is allowed, what requires review, and what is forbidden.

Define human-in-the-loop requirements.

Define data retention.

Define vendor controls.

Define appeal and override paths where decisions affect people materially.

This connects to "Principle Stacks Make Trade-offs Explicit." AI governance is much easier when the organization knows what outranks what. Safety, trust, customer value, speed, and cost discipline will collide. The stack should decide the default.

Ownership cannot be vague

AI strategy fails when everyone owns it.

That means nobody owns it.

Each serious AI use case needs a durable product team, even if small:

business owner for the KPI
product owner for the workflow
data engineering owner for pipelines
ML or AI engineer for model/evaluation
platform owner for deployment and observability
risk/compliance partner for higher-risk cases
change management owner for adoption

Committees can govern a portfolio.

They cannot operate a product.

The team closest to the workflow must own whether the AI system changes behavior. The technical team must own whether the system is reliable and measurable. Leadership must own the decision to scale, pause, or kill the effort.

A worked example: support escalation triage

Here is the kind of example I would use to test whether the strategy is real.

A support organization wants AI because escalation queues are slow. The first request is a copilot that summarizes tickets and recommends the next action. That sounds reasonable, but it is still too broad.

The business lever is not "use AI in support." The lever is reducing avoidable escalation time without increasing policy violations or customer frustration.

Strategy layer	Concrete decision
Business lever	Reduce avoidable escalation time for repeat issue classes
Workflow owner	Support operations owns the triage queue
Data readiness	Ticket history, policy docs, escalation labels, and resolution outcomes must be joined
Pilot slice	Three issue classes with enough historical volume and clear policy boundaries
Evaluation	Correct route, cited evidence, abstention on missing policy, handle-time impact
Governance	Human approval required for refund, compliance, and account-risk actions
Deployment	Agent drafts recommendation inside existing support tool
Operating metric	p50 triage time, escalation precision, edit rate, policy violation rate, cost per resolved case

The first version I would reject is the demo-first version: summarize any ticket, recommend anything, and let the model impress the room. That proves almost nothing.

The thin-slice version is narrower and more useful. Pick three issue classes where historical decisions are recoverable. Build a golden set. Measure whether the system routes correctly, cites the right policy, and abstains when evidence is missing. Put the draft recommendation in front of reviewers instead of customers. Watch edit rate and override reasons.

A real AI strategy loop starts with the business lever and returns to operating metrics after deployment.

The tradeoff is slower scope expansion. The narrow pilot may feel less impressive than a broad support copilot. I would accept that tradeoff because a narrow pilot can answer the only question that matters: did the system change support behavior without increasing risk?

The metric that would change my mind is edit rate. If reviewers rewrite most recommendations, the model may still be useful as a retrieval aid, but it is not ready to own recommendation quality. If edit rate is low, escalation precision holds, and policy violations stay at zero in the pilot, then the next issue class becomes a rational expansion.
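
That expansion rule can be written down before the pilot starts, so nobody renegotiates it afterward. The sketch below is one way to do it. The 20 percent edit-rate ceiling is an assumed threshold, the reviewer log is hypothetical, and whether escalation precision "holds" is passed in as a flag because the post does not define a cutoff for it.

```python
def edit_rate(reviews: list[dict]) -> float:
    """Share of drafted recommendations that reviewers changed before use."""
    if not reviews:
        raise ValueError("no reviewed recommendations yet")
    return sum(1 for r in reviews if r["edited"]) / len(reviews)


def ready_to_expand(
    reviews: list[dict],
    policy_violations: int,
    precision_holds: bool,
    max_edit_rate: float = 0.2,  # assumed ceiling, agreed before the pilot
) -> bool:
    # Zero policy violations mirrors the pilot bar described above.
    return (
        precision_holds
        and policy_violations == 0
        and edit_rate(reviews) <= max_edit_rate
    )


# Hypothetical reviewer log for one issue class: 5 of 50 drafts were edited.
pilot_reviews = [{"ticket_id": i, "edited": i % 10 == 0} for i in range(1, 51)]
decision = ready_to_expand(pilot_reviews, policy_violations=0, precision_holds=True)
```

A single policy violation or a drop in precision flips the answer to no, regardless of how good the edit rate looks.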

The repeatable loop

I would run AI strategy as a loop:

Diagnose
Prioritize
Pilot
Scale
Govern and improve

Diagnose:

Name the business levers, data readiness, platform gaps, workflow constraints, and governance risks.

Prioritize:

Score use cases by value, feasibility, risk, time to impact, and measurement clarity.

Pilot:

Build a thin slice with a real user, real data, real workflow, and real metric.

Scale:

Industrialize the winners with MLOps, LLMOps, integration, monitoring, training, and ownership.

Govern and improve:

Monitor drift, cost, quality, adoption, risk, and business impact. Retrain, revise, or retire systems that stop earning their keep.

Each cycle ends with one question:

Did we move the business metric we named at the start?

If yes, scale carefully.

If no, fix the data, fix the workflow, revise the use case, or stop.

Stopping is strategy too.

The executive version

For leadership, I would compress the whole thing into five statements.

AI strategy starts with measurable business levers, not model selection.

Data readiness determines which use cases are real now and which are roadmap items.

Thin-slice pilots prove value by changing workflow behavior, not by impressing a demo room.

Production AI requires MLOps, LLMOps, monitoring, governance, and clear ownership.

The only AI systems worth scaling are the ones that move the metric they were built to move.

That is the standard.

It is simple enough to say in a meeting.

It is hard enough to keep teams honest.
