Compliant GCP Platform Playbook for Analytics and ML

Outcome focus: In this sanitized pattern, governed dataset onboarding dropped from weeks to days while auditability, cost visibility, and promotion rules stayed intact for analytics and ML use cases.

The first platform design looked clean until the compliance review.

Analytics wanted faster access to governed BigQuery datasets. ML wanted stable feature tables with known lineage. Governance wanted policy tags, approval evidence, and auditability. Platform wanted a small number of reusable patterns instead of a custom exception for every team.

The trap was sequencing those needs. Build the platform first, add controls later. Let analysts move fast, then make it compliant. Give ML a sandbox, then figure out promotion. Every version of that plan created the same risk: the system would work by bypassing the rules it was supposed to enforce.

The better design treated compliance as a platform feature.

This is a sanitized case study. The dataset names, volumes, and timings are rounded or illustrative, but the architecture tradeoffs are the ones that matter.

Before state

The starting point had three lanes, but only one of them was explicit.

Governed reporting. Slow, reviewed, and trusted once it shipped.
Ad hoc analytics. Fast, useful, and often hard to trace later.
ML experimentation. Productive in notebooks, brittle at promotion time.

Nobody intended to create shadow paths. They emerged because the official path was expensive.

The symptoms were familiar:

Sensitive columns were documented, but policy enforcement depended on humans remembering the rule.
Dataset promotion meant opening tickets and attaching screenshots.
Feature tables were copied from analytics models without a clear owner for drift, freshness, or schema change.
Cost reviews happened after a query pattern was already expensive.
Access reviews were periodic, not tied to the moment a data product changed.

The constraint was not "be compliant." That is too broad to design against.

The actual constraint was sharper: make the compliant path faster than the workaround.

Design decision

The platform used two lanes with shared controls:

Lane	Purpose	Promotion rule
Experimental	Exploration, notebook work, prototype features	Time-limited access, no executive reporting, no production ML dependency
Governed	Reporting, published marts, ML feature tables	Tests, policy tags, owner, release approval, cost guardrail

The important part is that both lanes used the same primitives: BigQuery datasets, service accounts, IAM groups, Dataform release workflows, policy tags, and platform telemetry. The lanes did not differ in tooling. They differed in the promotion contract.

The platform used one promotion path for reporting and ML instead of separate compliance exceptions.
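To make the lane rules concrete, here is a minimal sketch of how the experimental lane's constraints from the table could be checked in code. It is not the platform's actual implementation: the `AccessGrant` type, the `violations` helper, and the consumer category names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import List, Optional


class Lane(Enum):
    EXPERIMENTAL = "experimental"
    GOVERNED = "governed"


# Hypothetical consumer categories that count as production use.
PRODUCTION_USES = {"executive_reporting", "production_ml"}


@dataclass(frozen=True)
class AccessGrant:
    dataset: str
    lane: Lane
    intended_use: str
    granted_on: date
    ttl_days: Optional[int] = None  # None means the grant never expires


def violations(grant: AccessGrant, today: date) -> List[str]:
    """Return the lane rules this grant breaks (empty list means allowed)."""
    problems: List[str] = []
    if grant.lane is Lane.EXPERIMENTAL:
        # Experimental access must be time-limited.
        if grant.ttl_days is None:
            problems.append("experimental access must be time-limited")
        elif today > grant.granted_on + timedelta(days=grant.ttl_days):
            problems.append("experimental access has expired")
        # Experimental data may not back reporting or production ML.
        if grant.intended_use in PRODUCTION_USES:
            problems.append("experimental data cannot feed " + grant.intended_use)
    return problems
```

The governed lane has no checks in this sketch because its rules live in the promotion contract rather than in the access grant.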

The tradeoff

The tradeoff was accepting stricter promotion in exchange for faster approved reuse.

The loose alternative was to let teams copy data into their own datasets and ask governance to review later. That is faster for the first request and slower for every request after it. It multiplies policy surfaces, duplicates definitions, and makes access review harder.

The strict alternative was to force every exploratory question through the governed lane. That protects the platform, but it destroys learning speed.

The two-lane design gave both sides a place:

experiments could happen without pretending to be production,
production assets had explicit ownership and tests,
ML feature tables did not get special exemptions,
access reviews attached to data products instead of one-off tickets.

Operating artifact: promotion checklist

The promotion checklist was deliberately short enough to use in a pull request.

Check	Required evidence
Owner	Named data product owner and Slack/escalation route
Classification	Policy tags on sensitive columns and documented data class
Contract	Grain, primary keys, null expectations, and accepted value ranges
Tests	Dataform assertions for breaking schema and quality changes
Access	IAM group or service account reviewed for least privilege
Cost	Expected query pattern and monthly cost risk noted
Freshness	SLA and stale-data behavior documented
Downstream use	Reporting, ML, or operational consumers listed
Rollback	Last known good release or disable path available

This checklist did not replace governance. It made governance executable.
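One way to read "executable" is a gate that refuses promotion until every checklist row has evidence attached. The sketch below assumes a hypothetical `PromotionRequest` record whose evidence is keyed by the nine checks in the table. The key names and the helper are illustrative, not the actual release tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The nine checks from the checklist table; key names are illustrative.
REQUIRED_CHECKS = [
    "owner", "classification", "contract", "tests", "access",
    "cost", "freshness", "downstream_use", "rollback",
]


@dataclass
class PromotionRequest:
    data_product: str
    # Maps a check name to a pointer to its evidence (link, file path, note).
    evidence: Dict[str, str] = field(default_factory=dict)


def missing_evidence(request: PromotionRequest) -> List[str]:
    """Return checks with no evidence or blank evidence, in checklist order."""
    return [
        check for check in REQUIRED_CHECKS
        if not request.evidence.get(check, "").strip()
    ]
```

A pull request check could call `missing_evidence` and fail the build when the returned list is non-empty, which keeps the review conversation focused on the content of the evidence rather than on whether it exists.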

Measured outcome

The sanitized target was not "move faster" in the abstract. It was:

reduce governed dataset onboarding from multiple weeks to a few business days for standard patterns,
remove recurring manual review for low-risk schema changes that passed policy and test gates,
make feature-table promotion use the same path as reporting marts,
expose query cost by data product owner,
give auditors evidence from the release path instead of reconstructing it from tickets.

The most useful metric was time from "dataset is ready for promotion" to "approved governed asset." That number dropped because the approval conversation moved from "please trust this dataset" to "here is the contract, tests, owner, policy tags, cost expectation, and downstream consumer list."
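As a rough sketch of how that metric could be computed, assume each promotion leaves two timestamps in the release path: when the dataset was marked ready and when it was approved. The function name and the event shape are assumptions made for illustration.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, Tuple


def median_lead_time_days(events: Iterable[Tuple[datetime, datetime]]) -> float:
    """Median days between 'ready for promotion' and 'approved governed asset'.

    Each event is a (ready_at, approved_at) pair.
    """
    durations = [
        (approved_at - ready_at).total_seconds() / 86400
        for ready_at, approved_at in events
    ]
    if not durations:
        raise ValueError("no promotion events to measure")
    return median(durations)
```

The median is used here rather than the mean so that a single stalled approval does not hide the improvement for standard patterns. That is a design choice of this sketch, not something the case study prescribes.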

Not every outcome could be measured cleanly. The risk reduction from fewer shadow datasets is partly inferred. But the operational signals were visible: fewer unclear access tickets, fewer promotion surprises, faster standard approvals, and better cost attribution.

What I would change next time

I would add cost guardrails earlier.

Security and access usually get the first governance attention, but runaway query cost is also a governance problem. A table that is technically compliant but financially invisible can still damage trust in the platform.

The next version would make these fields part of the first promotion request:

Field	Example
Expected consumer	Weekly executive dashboard, daily feature pipeline
Query pattern	Scheduled aggregate, analyst exploration, batch scoring
Cost owner	Platform, product analytics, ML team
Alert threshold	Job cost or scanned bytes above agreed limit
Remediation	Partitioning, materialized table, semantic-layer change, or access review
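The fields above can be captured as a small record and checked against each job. This is a minimal sketch with hypothetical names (`CostGuardrail`, `check_job`). It assumes the scanned-bytes figure is already available, for example from a BigQuery dry run, which reports `total_bytes_processed` without running the query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostGuardrail:
    expected_consumer: str
    query_pattern: str
    cost_owner: str
    max_scanned_bytes: int  # agreed per-job alert threshold
    remediation: str


def check_job(guardrail: CostGuardrail, scanned_bytes: int) -> Optional[str]:
    """Return an alert message if the job exceeds the agreed limit, else None."""
    if scanned_bytes <= guardrail.max_scanned_bytes:
        return None
    return (
        f"{guardrail.cost_owner}: job scanned {scanned_bytes} bytes, "
        f"limit is {guardrail.max_scanned_bytes}; "
        f"suggested remediation: {guardrail.remediation}"
    )
```

Routing the alert to the cost owner named in the promotion request, rather than to a shared platform channel, is what ties the guardrail back to the data product.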

Compliance platforms work when the safe path is also the easy path. On GCP, that means using BigQuery, IAM, policy tags, Dataform, and telemetry as one operating system, not as disconnected controls.