Vertex AI Makes More Sense as an MLOps Map

A Vertex AI architecture map for teams that need to decide which Google Cloud AI services belong in the ML lifecycle, where ownership changes hands, and which older assumptions are now unsafe.

By Jovani Pink | December 26, 2025 | 10 min | Platform & AI Engineering

Outcome focus: Gave teams an operating contract for using Vertex AI across data, features, training, deployment, monitoring, and generative AI without confusing a product menu for a production ML system.

The first Vertex AI architecture I distrust is the one that lists every service in order.

Workbench, BigQuery, Dataflow, Feature Store, AutoML, custom training, hyperparameter tuning, Experiments, Model Registry, Endpoints, batch prediction, Explainable AI, Model Monitoring, Pipelines, Model Garden. The diagram looks complete because the boxes are real.

Then the first production question arrives.

Which table is the training truth? Which feature definition is allowed to serve online? Which model version is the current champion? Which metric blocks promotion? Which endpoint gets canary traffic? Which alert means retrain? Which team owns the decision when the model is technically healthy and operationally wrong?

The service list does not answer those questions.

I have seen teams learn Vertex AI as a catalog and still fail to design a production ML system. They knew where training jobs ran. They knew an endpoint could autoscale. They knew Feature Store existed. The mistake was assuming a managed platform would supply the operating contract. It does not. Google Cloud gives you strong managed primitives. You still have to decide where responsibility changes hands.

Google describes Vertex AI as a unified platform for building, deploying, and scaling generative AI and machine learning systems. That is a useful frame if you read "unified" as "one control plane for many lifecycle boundaries," not "one product that removes lifecycle design."

Use the Lifecycle Map

The Vertex AI map I find useful starts with the lifecycle, not the product names.

Figure: A useful Vertex AI diagram shows ownership and promotion boundaries, not only service adjacency.

That map changes how I read the product set.

Vertex AI Workbench belongs near exploration and development. It is a JupyterLab-based environment with Google Cloud integrations, useful for analysis, notebook execution, and early model work.

BigQuery, Cloud Storage, Pub/Sub, Firestore, Dataflow, and Dataproc sit upstream of Vertex AI in many systems. They are not side quests. Most ML failures start in data contracts, timestamps, null behavior, late events, identity stitching, or feature freshness.

Training has two broad paths. AutoML is for fast managed modeling when the data type and customization needs fit. Custom training is for control over architecture, dependencies, distributed training, and serving behavior. Vertex AI provides prebuilt training containers for common frameworks such as TensorFlow, PyTorch, scikit-learn, and XGBoost, and custom containers when you need your own runtime.

Vertex AI Experiments belongs at the comparison boundary. It tracks parameters, metrics, artifacts, and lineage so a model choice can be inspected later instead of reconstructed from notebook memory.

Vertex AI Model Registry belongs at the packaging and promotion boundary. It manages model versions, aliases, deployment handoff, batch inference entry points, and production visibility.
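
To make the alias idea concrete, here is a minimal sketch in plain Python. It is not the Vertex AI SDK: `AliasRegistry`, `register`, `promote`, and `rollback` are hypothetical names, and the model is deliberately tiny. The only thing it borrows from the registry concept is that versions are immutable while aliases such as `champion` and `candidate` are movable pointers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AliasRegistry:
    """Toy in-memory registry. Hypothetical; this is not the Vertex AI SDK."""

    versions: List[str] = field(default_factory=list)      # immutable version ids
    aliases: Dict[str, str] = field(default_factory=dict)  # alias -> version id
    previous_champion: Optional[str] = None

    def register(self, version: str) -> None:
        # A new version always enters as the candidate, never as the champion.
        self.versions.append(version)
        self.aliases["candidate"] = version

    def promote(self) -> None:
        # Promotion moves the champion pointer and remembers the old target.
        if "candidate" not in self.aliases:
            raise RuntimeError("no candidate to promote")
        self.previous_champion = self.aliases.get("champion")
        self.aliases["champion"] = self.aliases["candidate"]

    def rollback(self) -> None:
        # Rollback is only possible if a previous champion was recorded.
        if self.previous_champion is None:
            raise RuntimeError("no previous champion to roll back to")
        self.aliases["champion"] = self.previous_champion


registry = AliasRegistry()
registry.register("v1")
registry.promote()   # champion -> v1
registry.register("v2")
registry.promote()   # champion -> v2, previous champion v1 remembered
registry.rollback()  # champion -> v1 again
print(registry.aliases)
```

The point of the sketch is the failure case: a registry that never recorded a previous champion cannot answer the rollback question, no matter how many versions it holds.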

Vertex AI Endpoints belong at low-latency serving. Batch prediction belongs where the workload is asynchronous and can read inputs from Cloud Storage or BigQuery. These are different operating modes, not interchangeable buttons.

Vertex AI Pipelines belongs around repeated workflows. It can run pipelines defined with Kubeflow Pipelines or TFX, automate training and evaluation, and connect lineage to repeatable retraining.

The managed services are strong. They do not decide the lifecycle for you.

Correct the Older Assumptions

Some Vertex AI notes age poorly because Google Cloud's AI surface moves quickly.

The first correction: do not design new work around Python 2. Current Vertex AI training guidance centers on Python training applications and containerized runtimes, and the current prebuilt containers are Python 3-based with framework support schedules. If an older dependency list mentions Python 2 packages inside an image, treat that as a historical artifact, not a green light for new platform design.

The second correction: be precise about Feature Store. The current Vertex AI Feature Store model keeps feature data in BigQuery tables or views. BigQuery is the offline store. Feature Store acts as a managed metadata and online-serving layer that connects BigQuery feature sources to online serving resources. Older explanations that sound like you always ingest features into a separate Vertex-owned offline store can lead to the wrong architecture.

The third correction: be careful with explainability as a platform dependency. Vertex Explainable AI is deprecated as of March 16, 2026, with access scheduled to end on or after March 16, 2027. Explainability still matters. But a new design should not depend on that deprecated service as the only way to satisfy debugging, governance, or model-risk review.

The fourth correction: batch prediction is not just "online prediction later." For custom-trained models, batch inference reads from Cloud Storage or BigQuery and writes results back to a target location. It also does not autoscale the way online inference does, because the input set is known when the job starts. That affects cost, capacity planning, and delivery time.
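
Because the input set is fixed at job start, delivery time becomes back-of-envelope arithmetic rather than something autoscaling absorbs. The sketch below is illustrative only: the function names, the throughput figure, and the fixed startup overhead are assumptions, not measured Vertex AI numbers.

```python
import math


def estimate_batch_minutes(rows: int, rows_per_replica_per_min: float,
                           replicas: int, startup_min: float = 0.0) -> float:
    """Rough wall-clock estimate for a fixed-size batch job (illustrative)."""
    if rows < 0 or rows_per_replica_per_min <= 0 or replicas <= 0:
        raise ValueError("rows must be >= 0; throughput and replicas must be > 0")
    return startup_min + rows / (rows_per_replica_per_min * replicas)


def replicas_for_deadline(rows: int, rows_per_replica_per_min: float,
                          deadline_min: float, startup_min: float = 0.0) -> int:
    """Smallest replica count that finishes inside the deadline."""
    budget = deadline_min - startup_min  # minutes left for actual scoring
    if budget <= 0:
        raise ValueError("deadline is shorter than the startup overhead")
    return math.ceil(rows / (rows_per_replica_per_min * budget))


# Hypothetical workload: 12M rows, 20k rows per replica per minute.
minutes = estimate_batch_minutes(12_000_000, 20_000, replicas=4, startup_min=10)
needed = replicas_for_deadline(12_000_000, 20_000, deadline_min=60, startup_min=10)
print(minutes, needed)
```

With these assumed numbers, four replicas take 160 minutes, and hitting a one-hour delivery window needs twelve. That replica count has to be chosen before the job starts, which is why it belongs in the contract rather than in a dashboard.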

The Tradeoff

Vertex AI gives you managed infrastructure. The tradeoff is that managed infrastructure can make weak contracts look acceptable for longer.

A local training script fails loudly when paths, credentials, or dependencies are wrong. A managed training job can hide those details behind a clean job history. A notebook can make feature engineering feel reproducible because the cell ran once. A registry can make a model look production-ready because it has a version. An endpoint can make serving look complete because predictions return 200.

Those are infrastructure facts. They are not operating facts.

The operating facts are stricter:

| Boundary | Platform primitive | Contract the team still owns |
| --- | --- | --- |
| Data source | BigQuery, Cloud Storage, Pub/Sub, Firestore | Producer, schema change rule, timestamp semantics, freshness SLA |
| Feature definition | BigQuery SQL, Dataflow, Feature Store | Point-in-time correctness, null policy, online/offline parity |
| Training | AutoML or custom training | Target metric, split strategy, baseline, reproducible runtime |
| Tuning | Hyperparameter tuning | Search budget, objective metric, stop condition |
| Experimentation | Experiments, TensorBoard | Which run is promotable and why |
| Promotion | Model Registry | Champion/candidate aliases, approval, rollback |
| Serving | Endpoints or batch prediction | Latency budget, traffic split, capacity, auth, failure behavior |
| Monitoring | Model Monitoring, Cloud Monitoring, custom checks | Drift threshold, business metric, alert owner, retraining rule |

That table is the part a product catalog cannot supply.
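
To show what owning one of those cells looks like, take the drift threshold in the Monitoring row. The sketch below uses the population stability index (PSI) over pre-binned counts. PSI is a substitution chosen for illustration, not necessarily the statistic Model Monitoring computes, and the 0.2 alert threshold is a common rule of thumb that I am assuming here, not a platform default.

```python
import math
from typing import Sequence


def psi(expected_counts: Sequence[float], actual_counts: Sequence[float],
        eps: float = 1e-6) -> float:
    """Population stability index between two binned distributions."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    if e_total <= 0 or a_total <= 0:
        raise ValueError("each distribution needs at least one observation")
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp empty bins so the logarithm stays defined.
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score


DRIFT_THRESHOLD = 0.2  # assumed rule of thumb; the team owns the real number

training_bins = [50, 50]  # hypothetical feature histogram at training time
serving_bins = [90, 10]   # hypothetical histogram from recent traffic
score = psi(training_bins, serving_bins)
print(round(score, 3), score > DRIFT_THRESHOLD)
```

The calculation is the easy part. The contract is who gets paged when the score crosses the threshold and whether that page means "retrain" or "investigate the upstream producer."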

A Production Contract

For a real system, I would rather start with a small contract than a large reference architecture.

The contract below is illustrative, but the shape is the useful part. It names the decision path, owner, data source, feature behavior, promotion gate, serving mode, and rollback rule.

```yaml
# vertex-ai-mlops-contract.yaml
decision_path: "same-day churn intervention"
owner: "growth data product"

source_data:
  warehouse: "bigquery://analytics.customer_events"
  producer: "product events"
  freshness_sla_minutes: 60
  breaking_change_notice_days: 7

features:
  offline_source: "BigQuery table or view"
  online_source: "Vertex AI Feature Store feature view"
  entity_key: "customer_id"
  point_in_time_join: "required"
  null_policy: "fail training if required features exceed 0.5% null"

training:
  method: "custom_training"
  runtime: "prebuilt or custom container pinned by digest"
  experiment_tracking: "Vertex AI Experiments"
  split_strategy: "time-based holdout"

promotion:
  registry_alias: "candidate"
  required_metrics:
    auc: ">= 0.82"
    calibration_ece: "<= 0.04"
    recall_at_review_budget: ">= 0.65"
  approval: "model owner and decision-path owner"

serving:
  mode: "online_endpoint"
  traffic_split: "5/95 canary"
  min_replicas: 1
  fallback: "do not suppress intervention; route to rules baseline"

monitoring:
  feature_drift: "enabled"
  prediction_drift: "enabled"
  business_metric: "save rate after intervention"
  alert_owner: "ml platform on-call"

rollback:
  registry_alias: "champion"
  max_minutes: 15
```

The specific thresholds would change by use case. Fraud, churn, content ranking, demand forecasting, medical triage, and ad bidding should not share the same numbers. The contract format matters because it prevents the team from calling the system production-ready before the promotion and rollback questions are answerable.
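
A contract like this only prevents premature promotion if something actually evaluates it. The sketch below checks observed metrics against the `required_metrics` rules from the contract. The thresholds are copied from the illustrative contract; `check_gate` and the observed values are hypothetical, and a real gate would presumably load the YAML file instead of hard-coding the rules.

```python
import operator
from typing import Dict, List

# Comparison operators the rule strings are allowed to use.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

# Copied from the promotion.required_metrics block of the contract.
REQUIRED_METRICS = {
    "auc": ">= 0.82",
    "calibration_ece": "<= 0.04",
    "recall_at_review_budget": ">= 0.65",
}


def check_gate(required: Dict[str, str], observed: Dict[str, float]) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, rule in required.items():
        op_symbol, threshold_text = rule.split()
        threshold = float(threshold_text)
        if name not in observed:
            # A metric nobody computed blocks promotion just like a bad one.
            failures.append(f"{name}: missing")
        elif not OPS[op_symbol](observed[name], threshold):
            failures.append(f"{name}: {observed[name]} fails {rule}")
    return failures


good_run = {"auc": 0.85, "calibration_ece": 0.03, "recall_at_review_budget": 0.70}
weak_run = {"auc": 0.80, "calibration_ece": 0.03}

print(check_gate(REQUIRED_METRICS, good_run))
print(check_gate(REQUIRED_METRICS, weak_run))
```

Treating a missing metric as a failure is a design choice worth keeping: it stops a run from being promoted simply because nobody measured calibration.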

Feature Store Is a Serving Boundary

Feature Store is easiest to misuse when it is treated as a magic consistency layer.

The current BigQuery-backed design is a good thing. It keeps the historical record in the warehouse where teams already govern data, audit access, and run batch jobs. But it also means the feature contract has to be explicit.

For each feature group, I want to know:

  • Which BigQuery table or view is the source?
  • Which column is the entity key?
  • Which timestamp defines feature freshness?
  • Which transformation created the value?
  • Which training job consumed the historical version?
  • Which online feature view serves the latest value?
  • What happens when the online feature is missing?

If those questions are unanswered, Feature Store can still serve values quickly. It just cannot prove they are the right values for the decision.
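
One of those answers can be enforced mechanically. The contract above says to fail training if required features exceed 0.5% null, and a check like the following would do that before a training job starts. This is a sketch over in-memory rows; `null_policy_violations` and the sample feature names are hypothetical, and a real check would more likely run as SQL against the BigQuery source.

```python
from typing import Dict, Iterable, List, Mapping, Optional


def null_rate(rows: List[Mapping[str, Optional[float]]], feature: str) -> float:
    """Fraction of rows where the feature is absent or None."""
    if not rows:
        raise ValueError("no rows to check")
    missing = sum(1 for row in rows if row.get(feature) is None)
    return missing / len(rows)


def null_policy_violations(rows: List[Mapping[str, Optional[float]]],
                           required_features: Iterable[str],
                           max_null_rate: float = 0.005) -> Dict[str, float]:
    """Return {feature: null_rate} for every feature above the allowed rate."""
    violations = {}
    for feature in required_features:
        rate = null_rate(rows, feature)
        if rate > max_null_rate:
            violations[feature] = rate
    return violations


# Hypothetical sample: 1,000 rows, 4 missing tenure_days, 10 missing plan_score.
rows = [{"tenure_days": 30.0, "plan_score": 1.0} for _ in range(1000)]
for row in rows[:4]:
    row["tenure_days"] = None
for row in rows[:10]:
    del row["plan_score"]

violations = null_policy_violations(rows, ["tenure_days", "plan_score"])
print(violations)
```

Here `tenure_days` sits at 0.4% and passes, while `plan_score` sits at 1% and blocks the job. The same policy still needs an online counterpart: what the serving path does when a single lookup returns nothing.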

The most common failure is training-serving skew disguised as platform adoption. The team trains from one SQL query, serves from a slightly different transformation, and then wonders why the online model behaves worse than validation suggested. Feature Store helps when the feature source, online serving path, and point-in-time training export are governed together. It does not fix a feature definition that was never owned.
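
The point-in-time rule is easy to state and easy to get wrong, so a small worked version helps. For each training label, the join may only use the latest feature value observed at or before the label's timestamp. The sketch below does this in memory with integer timestamps; the function name and the sample data are hypothetical, and production joins would normally run in the warehouse.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import List, Optional, Tuple

FeatureRow = Tuple[str, int, float]  # (entity_id, feature_ts, value)
LabelRow = Tuple[str, int]           # (entity_id, label_ts)


def point_in_time_join(labels: List[LabelRow], features: List[FeatureRow]
                       ) -> List[Tuple[str, int, Optional[float]]]:
    """Attach to each label the latest feature value with feature_ts <= label_ts."""
    history = defaultdict(list)
    for entity_id, feature_ts, value in features:
        history[entity_id].append((feature_ts, value))
    for entity_rows in history.values():
        entity_rows.sort(key=lambda r: r[0])  # oldest first, per entity

    joined = []
    for entity_id, label_ts in labels:
        entity_rows = history.get(entity_id, [])
        timestamps = [ts for ts, _ in entity_rows]
        # Index of the first feature strictly after the label time.
        idx = bisect_right(timestamps, label_ts)
        value = entity_rows[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, label_ts, value))
    return joined


features = [("c1", 1, 10.0), ("c1", 5, 20.0), ("c1", 9, 30.0), ("c2", 4, 7.0)]
labels = [("c1", 6), ("c1", 0), ("c2", 4), ("c3", 3)]
print(point_in_time_join(labels, features))
```

The `("c1", 6)` label gets 20.0, not the later 30.0. A naive "latest value" join would have leaked the future into training, which is one way validation ends up looking better than production.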

Generative AI Uses the Same Operating Map

Vertex AI is also the home for a large part of Google Cloud's generative AI surface. Model Garden provides a catalog of Google, partner, and open models. Some are managed APIs. Some can be self-deployed. Prompt work can move through Vertex AI Studio and prompt management, where prompts can be saved and versioned.

The surface changes, but the operating questions stay familiar.

For a Gemini, Claude, Llama, Gemma, or other foundation-model workflow, the "model artifact" may be a prompt template, retrieval configuration, tool policy, safety setting, evaluation set, or model routing rule. That does not make the workflow less operational. It makes the artifact easier to underestimate.

I would still ask:

  • Which user decision or workflow changes?
  • Which evaluation set blocks release?
  • Which prompt or tool policy version is deployed?
  • Which data can the model retrieve?
  • Which action requires human approval?
  • Which safety failure pages someone?
  • Which cost or latency threshold stops rollout?

A generative AI prototype can skip those questions for a demo. A production system cannot.
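
Three of those questions (evaluation set, latency, cost) reduce to thresholds that a rollout step can check. The sketch below is one hypothetical shape for that check. `ReleaseGate`, `rollout_decision`, and every number in it are assumptions for illustration, and it does not call any Vertex AI API.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ReleaseGate:
    """Hypothetical thresholds for one prompt or tool-policy version."""
    min_eval_pass_rate: float
    max_p95_latency_ms: float
    max_mean_cost_usd: float


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def rollout_decision(gate: ReleaseGate, eval_passed: Sequence[bool],
                     latencies_ms: Sequence[float],
                     costs_usd: Sequence[float]) -> Tuple[str, List[str]]:
    """Return ("continue" or "halt", reasons) for a canary stage."""
    if not eval_passed or not costs_usd:
        raise ValueError("evaluation results and cost samples are required")
    reasons = []
    pass_rate = sum(eval_passed) / len(eval_passed)
    if pass_rate < gate.min_eval_pass_rate:
        reasons.append(f"eval pass rate {pass_rate:.2f} is below the gate")
    if p95(latencies_ms) > gate.max_p95_latency_ms:
        reasons.append("p95 latency is above the gate")
    if sum(costs_usd) / len(costs_usd) > gate.max_mean_cost_usd:
        reasons.append("mean cost per request is above the gate")
    return ("halt" if reasons else "continue"), reasons


gate = ReleaseGate(min_eval_pass_rate=0.9, max_p95_latency_ms=2000.0,
                   max_mean_cost_usd=0.02)
healthy = rollout_decision(gate, [True] * 18 + [False] * 2,
                           [800.0] * 19 + [5000.0], [0.01] * 20)
unhealthy = rollout_decision(gate, [True] * 8 + [False] * 2,
                             [800.0] * 10 + [5000.0] * 10, [0.05] * 20)
print(healthy[0], unhealthy[0], unhealthy[1])
```

The remaining questions, such as which action needs human approval and which safety failure pages someone, do not reduce to a number. They need a named owner in the contract.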

What I Would Do First

I would not start by enabling every Vertex AI service.

I would choose one decision path and map it end to end. For example: "Should this customer receive a same-day retention offer?" Then I would write the contract before choosing every implementation detail.

The first version might use BigQuery for the dataset, a simple custom training job, Experiments for run tracking, Model Registry for versioning, an endpoint for online serving, and Model Monitoring for drift. Feature Store would enter only if online feature lookup and reuse justify the extra boundary. Pipelines would enter when the workflow is repeated enough that manual promotion is the risk. Hyperparameter tuning would enter when the baseline model and metric are stable enough to make search meaningful.

That order keeps the platform honest.

Vertex AI is valuable because it gives teams managed infrastructure for the full ML and AI lifecycle: notebooks, training, tuning, registry, serving, monitoring, pipelines, foundation models, prompts, and enterprise controls such as VPC Service Controls and CMEK.

But the architecture becomes production-grade only when the team supplies the missing operating layer: decision ownership, feature contracts, promotion gates, monitoring thresholds, and rollback.

Learn the product names. Then draw the lifecycle boundary they are responsible for. That second drawing is the one that prevents a managed ML platform from becoming a managed collection of unresolved decisions.
