Local MCP and Private Open Model Infrastructure

Outcome focus: Separated local agent tool access from private model serving, then defined a safer setup for MCP clients, local servers, and Cloud Run GPU sidecars.

There are two different privacy questions hiding inside one setup.

The first is local tool access. How do I run MCP servers on my own machine so an agent can safely reach files, databases, browsers, docs, logs, and workflows without exposing everything to a remote service?

The second is private model serving. How do I run an open model behind a private interface so my prompts and outputs do not have to go through a hosted model provider for every interaction?

Those are related, but they are not the same.

MCP is about capabilities. It gives an agent a standard way to connect to tools and context. Ollama plus Open WebUI on Cloud Run is about inference. It gives users a private chat or model interface backed by open weights running on managed infrastructure.

If I were building this for myself or a small technical team, I would not try to solve both in one giant platform. I would build two lanes:

A local MCP lane for agent tool access on my development machine.
A private open-model lane for inference with Cloud Run GPUs, Ollama, and Open WebUI.

The local lane should be boring, explicit, and permissioned. The model lane should be reproducible, protected, and clear about where model files live.

That sounds simple. The details are where people get cut.

What MCP is for#

The Model Context Protocol is an open standard for connecting AI applications to external systems. The official docs describe it as a standard way for AI applications to connect to data sources, tools, and workflows.

That framing is useful because MCP is not a model runtime.

It is not where the model thinks. It is where the client and server agree on how tools, resources, prompts, and context are exposed. A client such as Claude Code, Cursor, VS Code, or Windsurf can connect to an MCP server. The server exposes capabilities. The model can then use those capabilities through the client.

The practical examples are straightforward:

A filesystem MCP server exposes controlled file reads and writes.
A Postgres MCP server exposes schema inspection and safe queries.
A Playwright MCP server exposes browser automation.
A GitHub MCP server exposes repository and issue workflows.
A docs MCP server exposes internal documentation search.
A custom MCP server exposes one business workflow, such as "get deployment health" or "query customer support taxonomy."

That last example is the one I care about most.

MCP is strongest when it exposes a bounded workflow, not when it gives the agent broad access to everything. A tool named get_recent_deployments is easier to reason about than generic shell access. A tool named search_runbooks is safer than dumping a whole wiki into the prompt.

The protocol gives you a standard interface. It does not give you good boundaries automatically.

You still have to design the boundary.

The best local MCP setup#

For one person running MCP on a local machine, I would start with stdio.

In the MCP transport spec, stdio means the client launches the server as a subprocess. The server reads JSON-RPC messages from standard input and writes JSON-RPC messages to standard output. This is the simplest setup for local, single-user development.

It has several advantages:

No open network port.
No OAuth server to configure.
Simple process lifecycle.
Works well with local clients that manage MCP servers directly.
Easy to keep secrets in local environment variables.

For a local developer machine, that is usually the right first move.
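
To make the stdio shape concrete, here is a minimal sketch of the message loop, written with only the Python standard library rather than an official MCP SDK. It assumes newline-delimited JSON-RPC 2.0 messages and skips the initialize handshake, so a real client would not accept it as-is. The tool description and schema are illustrative assumptions; only the tool name comes from this article.

```python
import json
from typing import Optional, TextIO

# Hypothetical tool catalog. The tool name comes from the article;
# the description and schema are illustrative assumptions.
TOOLS = [
    {
        "name": "get_recent_deployments",
        "description": "List recent deployments for one named service.",
        "inputSchema": {
            "type": "object",
            "properties": {"service": {"type": "string"}},
            "required": ["service"],
        },
    }
]


def handle_message(line: str) -> Optional[str]:
    """Map one JSON-RPC request line to one response line, or None for notifications."""
    request = json.loads(line)
    if "id" not in request:
        return None  # notifications carry no id and must not be answered
    method = request.get("method")
    params = request.get("params") or {}
    if method == "tools/list":
        body = {"result": {"tools": TOOLS}}
    elif method == "tools/call" and params.get("name") == "get_recent_deployments":
        service = (params.get("arguments") or {}).get("service", "unknown")
        # Placeholder payload; a real tool would query a deployment system here.
        text = f"no deployment data wired up for {service}"
        body = {"result": {"content": [{"type": "text", "text": text}], "isError": False}}
    elif method == "tools/call":
        body = {"error": {"code": -32602, "message": "Unknown tool"}}
    else:
        body = {"error": {"code": -32601, "message": "Method not found"}}
    return json.dumps({"jsonrpc": "2.0", "id": request["id"], **body})


def serve(stdin: TextIO, stdout: TextIO) -> None:
    """Read newline-delimited requests from stdin and write replies to stdout."""
    for line in stdin:
        if not line.strip():
            continue
        reply = handle_message(line)
        if reply is not None:
            stdout.write(reply + "\n")
            stdout.flush()
```

Calling serve(sys.stdin, sys.stdout) under a __main__ guard is enough to try it by hand. Anything the server wants to log has to go to stderr, because stdout carries protocol messages.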

I would use Streamable HTTP only when the server needs to be shared across clients, run independently, or serve more than one user. HTTP is the right shape for remote MCP servers, team services, and servers that need OAuth or central deployment. But if HTTP is run locally, the MCP spec is explicit about the security posture: bind to localhost rather than all interfaces, validate the Origin header, and implement proper authentication when needed.

That warning matters.

A local MCP server can be a local code execution surface. If it can read files, run commands, query databases, or call internal APIs, then a malicious page, bad tool description, poisoned package, or prompt-injection path can become more than a bad answer. It can become an action on your machine.
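
The spec's advice for local HTTP (bind to localhost, validate Origin) can be sketched with the standard library. This is an illustration of the two checks, not a working MCP server: the port number, the handler name, and the choice to accept requests that carry no Origin header (typical for non-browser clients) are my assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional
from urllib.parse import urlparse

# Assumption: only pages served from this machine may call the server.
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}


def is_allowed_origin(origin: Optional[str]) -> bool:
    """Accept a missing Origin (non-browser client) or a loopback Origin."""
    if origin is None:
        return True
    parsed = urlparse(origin)
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS


class LocalOnlyHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if not is_allowed_origin(self.headers.get("Origin")):
            self.send_error(403, "Origin not allowed")
            return
        # Drain the body; a real server would parse the JSON-RPC message here.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(202)
        self.end_headers()


def make_server(port: int = 8765) -> HTTPServer:
    # Bind to the loopback interface only, never "0.0.0.0".
    return HTTPServer(("127.0.0.1", port), LocalOnlyHandler)
```

The Origin check is what stops a web page on another site from driving the server through the browser. It does not replace authentication when more than one user can reach the port.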

So my default local pattern is:

client
  -> stdio MCP server
  -> narrow local capability
  -> scoped environment variables
  -> explicit approvals for writes

Not:

client
  -> giant MCP marketplace bundle
  -> all local files
  -> broad tokens
  -> unreviewed shell access

The second one may feel powerful. It is also how a demo turns into a bad morning.

Local MCP operating rules#

The best local MCP setup is mostly not about the config file.

It is about rules.

First, one server should have one reason to exist. A filesystem server, browser server, database server, and GitHub server should not be blended into a custom everything server unless there is a good reason. Separate servers make permissions and failures easier to understand.

Second, start read-only. If the server can query a database, begin with schema inspection and limited SELECT queries. If it can reach files, begin with a project root and no writes. If it can call GitHub, begin with read access before issue creation or PR mutation.

Third, pin what you run. If you use npx, uvx, Docker images, or downloaded server packages, pin versions where practical. A local MCP server is code that runs on your machine. Treat it like a dependency, not a browser bookmark.

Fourth, keep secrets out of prompts. Put tokens in environment variables, keychains, or local secret tooling. Do not paste API keys into chat and ask the agent to remember them.

Fifth, prefer project-local config for project-specific tools. A repo-specific MCP config makes sense for a test database or project docs server. A global config makes sense for personal tools you use everywhere.

Sixth, keep the tool loadout small. Too many tools degrade agent behavior and make reviews harder. If a coding task needs Git, tests, and docs search, it probably does not need your calendar, CRM, and cloud billing tools in the same session.

Seventh, inspect before trusting. Use client tooling, logs, or MCP Inspector-style workflows to see which tools a server exposes. If the tool descriptions are vague, fix them before giving them to an agent.

Eighth, treat shell and browser tools as high risk. They are useful. They also cross boundaries quickly.

These rules are dull on purpose.

Good agent infrastructure should feel a little boring at the boundary.
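
The "start read-only" rule is easy to enforce twice for a SQLite-backed tool: once in the connection, once in the query text. This is a minimal sketch under my own assumptions (the row cap and function names are hypothetical, and the SELECT check is deliberately conservative, so it also rejects CTEs and any statement containing a semicolon).

```python
import sqlite3
from typing import List, Sequence, Tuple

MAX_ROWS = 200  # hypothetical cap so one query cannot flood the agent's context


def open_read_only(path: str) -> sqlite3.Connection:
    # mode=ro makes SQLite itself refuse writes, whatever SQL arrives later.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def run_select(conn: sqlite3.Connection, sql: str, params: Sequence = ()) -> List[Tuple]:
    """Run one SELECT statement and return at most MAX_ROWS rows."""
    statement = sql.strip().rstrip(";").strip()
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    cursor = conn.execute(statement, params)
    return cursor.fetchmany(MAX_ROWS)
```

The text check alone is not a security boundary. The read-only connection is the part that holds when the check is bypassed, which is why I would keep both.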

Affordable MCP clients#

Prices and packaging change quickly, so I would not build a long-term plan around today's exact subscription page. But as of April 27, 2026, a few client choices are reasonable.

The cheapest serious path is VS Code with GitHub Copilot. GitHub's current Copilot page lists a free plan with limited requests, Pro at 10 dollars per month, and Pro+ at 39 dollars per month. It also lists MCP server integration across the current plans. VS Code's current agent docs describe local agents that can use built-in tools, extension tools, and MCP tools. If you already live in VS Code and want a cost-controlled starting point, this is the first option I would test.

Cursor is the strongest "agentic IDE first" choice for many developers. Cursor's pricing page currently lists a free Hobby plan, Pro at 20 dollars per month, Pro+ at 60 dollars per month, and Ultra at 200 dollars per month. The Pro plan explicitly includes MCPs, skills, hooks, and cloud agents. Cursor's MCP docs describe support for stdio, SSE, and Streamable HTTP, and its CLI can list servers and server tools. If I were doing daily code work with local MCP servers, Cursor Pro would be a reasonable middle ground.

Claude Code and Claude Desktop are the most natural MCP-first experience if you want to stay close to Anthropic's ecosystem. Claude's current pricing page lists Free, Pro at 20 dollars monthly or 200 dollars annually, Max from 100 dollars monthly, and Team tiers. The Claude Code MCP docs are deep: local stdio, remote HTTP, SSE, install scopes, environment expansion, OAuth flows, output limits, tool search, prompts, resources, and managed MCP configuration. If I wanted a strong terminal agent with MCP and did not mind the subscription limits, I would test Claude Pro first and move up only if usage demands it.

Windsurf is a reasonable alternative if you like Cascade. Windsurf's docs show native MCP integration with stdio, HTTP, and SSE, plus marketplace and manual configuration. Its pricing has changed into self-serve plans with Free, Pro, Max, Teams, and Enterprise surfaces, so I would check the current pricing page before committing. I would consider it if I preferred its editor workflow over Cursor or VS Code.

My practical recommendation:

Lowest cost: VS Code plus GitHub Copilot Free or Pro.
Best agentic coding IDE value: Cursor Pro.
Best MCP-native terminal workflow: Claude Code on the Claude Pro plan, with Max only for heavy use.
Best if you already like Cascade: Windsurf Pro after checking current limits.

I would not start by paying for everything.

Pick one main client. Add one or two MCP servers. Run real tasks. Watch which limits you hit. Most people buy too much before they know whether their bottleneck is model quality, context, tool design, or local server safety.

What I would run locally first#

The first local MCP servers I would run are boring:

Filesystem limited to one workspace.
Git or GitHub read access.
Playwright for local browser testing when building frontend work.
Postgres or SQLite with read-only query permissions.
A docs search server for local or internal docs.

I would avoid broad personal-data servers at the beginning. Calendar, email, Slack, cloud admin, payments, and production databases are useful, but they deserve more careful permissions and audit rules.

The first custom MCP server I would write would not be generic. It would expose one workflow I repeat often.

For example:

get_local_project_status
run_safe_quality_gate
search_architecture_notes
list_recent_failed_jobs
summarize_open_release_risks

That shape is better than giving the agent a shell and hoping it discovers the workflow. A narrow MCP tool turns tribal process into an explicit contract.
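
Here is roughly what I mean by an explicit contract, using run_safe_quality_gate as the example. The gate names and the choice of pytest and ruff are assumptions about a hypothetical project. The point is that the agent picks from a fixed menu and never supplies a command string.

```python
import subprocess
import sys
from typing import Dict, List

# Hypothetical gates mapped to fixed argument vectors.
# Nothing in these commands is built from agent-supplied text.
QUALITY_GATES: Dict[str, List[str]] = {
    "tests": [sys.executable, "-m", "pytest", "-q"],
    "lint": [sys.executable, "-m", "ruff", "check", "."],
}


def resolve_gate(gate: str) -> List[str]:
    """Return the fixed command for a known gate, or fail loudly."""
    if gate not in QUALITY_GATES:
        raise ValueError(f"unknown gate {gate!r}; allowed: {sorted(QUALITY_GATES)}")
    return list(QUALITY_GATES[gate])


def run_safe_quality_gate(gate: str, project_root: str, timeout_s: int = 600) -> dict:
    """Run one allowlisted gate inside the project root, without a shell."""
    completed = subprocess.run(
        resolve_gate(gate),
        cwd=project_root,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    return {
        "gate": gate,
        "passed": completed.returncode == 0,
        "output_tail": completed.stdout[-2000:],  # keep the reply small
    }
```

Adding a new gate is a code change someone reviews, which is the boundary I want.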

The local machine is a good place to learn this because the feedback loop is short.

Once the workflow is useful and safe locally, then I would consider whether it should become a remote MCP server for a team.

The private open model lane#

The second half of the setup is model serving.

If local MCP gives agents access to capabilities, a private open model stack gives users access to inference that is not tied to a hosted model provider for every request.

The pattern in the Google codelab is straightforward:

browser
  -> Open WebUI ingress container
  -> localhost:11434
  -> Ollama sidecar
  -> Gemma model files on GCS Fuse
  -> Cloud Run GPU instance

Open WebUI is the frontend. Ollama is the model server. Both run in the same Cloud Run service as separate containers. Open WebUI talks to Ollama over http://localhost:11434. The GPU is attached to the Ollama sidecar only.
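
For a sense of what travels over that localhost hop, this is a minimal sketch of a direct call to Ollama's generate endpoint using only the standard library. It assumes the gemma2:2b model from the codelab is available, and it only works from inside the same instance (or against a local Ollama), because the sidecar port is not exposed through Cloud Run ingress. The function names are mine.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"


def build_generate_request(prompt: str, model: str = "gemma2:2b") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

Open WebUI makes calls of this kind on the user's behalf, which is why OLLAMA_BASE_URL is the one setting that has to be right.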

This is a nice pattern for small private inference because Cloud Run removes a lot of cluster management.

But the draft setup instructions need cleanup.

The current Google codelab requires only these APIs:

gcloud services enable run.googleapis.com \
  cloudbuild.googleapis.com \
  storage.googleapis.com \
  artifactregistry.googleapis.com

Cloud Functions is not needed. Vertex AI is not needed for this specific codelab. If another architecture later uses Vertex AI or Cloud Functions, enable those APIs then. Do not enable them just because the workload feels AI-adjacent.

The service spec should use Knative service YAML:

apiVersion: serving.knative.dev/v1
kind: Service

And if you apply the YAML service spec, use:

gcloud run services replace service.yaml

The Cloud Run docs still show this command for applying YAML changes. The codelab uses gcloud beta run services replace, but the stable gcloud run services replace path is the one I would use now.

Do not use gcloud run deploy --source cloudrun.yaml for this. That path is for source deployment. A service YAML is a service spec.

The important correction: GCS-backed versus baked model#

The draft contains a common confusion.

There are two valid model storage modes:

GCS-backed model files.
Baked-model image.

The current Google codelab is GCS-backed.

It installs Ollama locally, pulls gemma2:2b, copies the local Ollama models directory into a Cloud Storage bucket, then mounts that bucket into the Ollama sidecar with GCS Fuse. The Dockerfile configures the Ollama container, but it does not copy model files into the image and it does not pull the model during build.

That matters.

The codelab Dockerfile is small:

FROM --platform=linux/amd64 ollama/ollama
 
ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_MODELS=/models
ENV OLLAMA_DEBUG=false
ENV OLLAMA_KEEP_ALIVE=-1
 
ENTRYPOINT ["ollama", "serve"]

Then the service YAML overrides the model path for the Ollama sidecar:

env:
- name: OLLAMA_MODELS
  value: /root/.ollama/models
volumeMounts:
- name: gcs-1
  mountPath: /root/.ollama

So the runtime path is:

/root/.ollama/models

That path is backed by the Cloud Storage bucket.

A baked model image is a legitimate variant. In that variant, the Dockerfile starts Ollama during build and runs ollama pull. But that is not the current codelab path. If you bake the model, remove the GCS model mount. If you use GCS, do not pretend the image contains the model.

Pick one.

Mixing both is how teams lose an afternoon.

GCS-backed Dockerfile#

For the GCS-backed variant, I would use the small image.

FROM --platform=linux/amd64 ollama/ollama
 
ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_DEBUG=false
ENV OLLAMA_KEEP_ALIVE=-1
 
EXPOSE 11434
ENTRYPOINT ["ollama", "serve"]

I intentionally do not set OLLAMA_MODELS here. The service spec sets it to match the GCS mount:

- name: OLLAMA_MODELS
  value: /root/.ollama/models

That keeps the runtime contract in one place.

If the service spec mounts the bucket at /root/.ollama, then the bucket should contain a models directory copied from the local Ollama install.

The setup looks like this:

export PROJECT_ID="your-gcp-project-id"
export REGION="us-central1"
 
gcloud config set project "$PROJECT_ID"
 
gcloud services enable run.googleapis.com \
  cloudbuild.googleapis.com \
  storage.googleapis.com \
  artifactregistry.googleapis.com
 
gcloud storage buckets create "gs://$PROJECT_ID-gemma2-2b-codelab"

Then prepare the model:

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

In another terminal:

ollama pull gemma2:2b
gsutil cp -r "/home/$USER/.ollama/models" "gs://$PROJECT_ID-gemma2-2b-codelab"

That /home/$USER path is Linux-oriented because the codelab assumes that environment. On macOS or another host, check the real Ollama model directory before copying.
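
One way to check before copying is a small lookup like the one below. The candidate paths are my understanding of Ollama's defaults (the OLLAMA_MODELS override, the per-user directory, and the directory used when Ollama runs as a Linux service), and the manifests and blobs subdirectories are what I expect a populated models directory to contain. Treat all of that as assumptions to verify against your own install.

```python
import os
from pathlib import Path
from typing import List, Optional


def candidate_model_dirs() -> List[Path]:
    """List likely model directories, most specific first."""
    candidates: List[Path] = []
    override = os.environ.get("OLLAMA_MODELS")
    if override:
        candidates.append(Path(override))
    # Per-user location, used when "ollama serve" runs as the current user.
    candidates.append(Path.home() / ".ollama" / "models")
    # Assumed location when Ollama runs as a Linux system service.
    candidates.append(Path("/usr/share/ollama/.ollama/models"))
    return candidates


def find_model_dir(candidates: Optional[List[Path]] = None) -> Optional[Path]:
    """Return the first candidate that looks like a populated models directory."""
    for candidate in candidates if candidates is not None else candidate_model_dirs():
        if (candidate / "manifests").is_dir() and (candidate / "blobs").is_dir():
            return candidate
    return None
```

If find_model_dir() returns None after a pull, the model landed somewhere else, and copying /home/$USER/.ollama/models would upload an empty or missing directory.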

Build and publish the images#

Create the Artifact Registry repository:

gcloud artifacts repositories create ollama-sidecar-codelab-repo \
  --repository-format=docker \
  --location="$REGION" \
  --description="Ollama and Open WebUI"

Build the Ollama sidecar image:

gcloud builds submit \
  --tag "$REGION-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gcs" \
  --machine-type e2-highcpu-32

Then mirror Open WebUI into Artifact Registry:

docker pull ghcr.io/open-webui/open-webui:main
 
gcloud auth configure-docker "$REGION-docker.pkg.dev"
 
docker tag ghcr.io/open-webui/open-webui:main \
  "$REGION-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui"
 
docker push "$REGION-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui"

For a tutorial, main is acceptable. For production, pin an image version or digest. Floating tags are convenient until a rebuild changes behavior.

Corrected GCS-backed service YAML#

This is the GCS-backed service shape I would start from.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-sidecar-codelab
  labels:
    cloud.googleapis.com/location: us-central1
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '5'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/startup-cpu-boost: 'true'
        run.googleapis.com/container-dependencies: '{"openwebui":["ollama-sidecar"]}'
    spec:
      containerConcurrency: 80
      timeoutSeconds: 300
      containers:
      - name: openwebui
        image: us-central1-docker.pkg.dev/PROJECT_ID/ollama-sidecar-codelab-repo/openwebui
        ports:
        - name: http1
          containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: http://localhost:11434
        resources:
          limits:
            cpu: 2000m
            memory: 1Gi
        volumeMounts:
        - name: openwebui-data
          mountPath: /app/backend/data
        startupProbe:
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
          periodSeconds: 240
          failureThreshold: 1
 
      - name: ollama-sidecar
        image: us-central1-docker.pkg.dev/PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gcs
        env:
        - name: OLLAMA_MODELS
          value: /root/.ollama/models
        resources:
          limits:
            cpu: '6'
            memory: 16Gi
            nvidia.com/gpu: '1'
        volumeMounts:
        - name: gcs-models
          mountPath: /root/.ollama
        startupProbe:
          tcpSocket:
            port: 11434
          timeoutSeconds: 1
          periodSeconds: 10
          failureThreshold: 3
 
      volumes:
      - name: gcs-models
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: PROJECT_ID-gemma2-2b-codelab
      - name: openwebui-data
        emptyDir:
          medium: Memory
          sizeLimit: 10Gi
 
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

There are several deliberate choices here.

The GPU is attached only to ollama-sidecar:

nvidia.com/gpu: '1'

The L4 selector is at the template level:

nodeSelector:
  run.googleapis.com/accelerator: nvidia-l4

Current Cloud Run GPU docs describe one L4 GPU per Cloud Run instance, and note that if sidecars are used the GPU can only be attached to one container. The same docs require at least 4 CPU and 16 GiB memory for GPU services.

Open WebUI is the ingress container. It listens on port 8080 and talks to Ollama on localhost:

OLLAMA_BASE_URL=http://localhost:11434

Open WebUI's environment variable docs now list OLLAMA_BASE_URL as the current variable and describe OLLAMA_API_BASE_URL as deprecated. Use the current name.

The container-dependencies annotation makes Open WebUI wait for the Ollama sidecar. Cloud Run's container docs also say startup probes are necessary for this feature to work successfully. That is why both containers have probes.
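
A tcpSocket probe is conceptually just a connection attempt, and the probe settings imply a rough startup budget. The sketch below illustrates both. It is not Cloud Run's implementation, and the budget formula (initial delay plus period times failure threshold) is an approximation I am assuming from the usual probe semantics.

```python
import socket


def tcp_probe(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def startup_budget_seconds(period_s: int, failure_threshold: int, initial_delay_s: int = 0) -> int:
    """Approximate time a container has to open its port before the probe gives up."""
    return initial_delay_s + period_s * failure_threshold


# Values from the service YAML above:
#   ollama-sidecar: periodSeconds 10, failureThreshold 3  -> about 30 seconds
#   openwebui:      periodSeconds 240, failureThreshold 1 -> about 240 seconds
```

Thirty seconds is enough for ollama serve to open its port because the model itself loads on first request. If the sidecar ever did heavy work before listening, that budget would be the first thing I would revisit.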

Deploy from the service spec#

Replace the placeholder and apply the YAML:

sed -i "s/PROJECT_ID/${PROJECT_ID}/g" service.yaml
gcloud run services replace service.yaml --region "$REGION"

If updating an existing service, export first:

gcloud run services describe ollama-sidecar-codelab \
  --region "$REGION" \
  --format export > service.yaml

Then edit and replace. This keeps the configuration close to what Cloud Run actually accepts.

Auth is the release gate#

The codelab disables Open WebUI auth because it is a tutorial.

That should not survive into production.

Open WebUI supports OAuth, OIDC, and trusted-header patterns. Its current SSO docs also call out WEBUI_URL as required for OAuth setup and describe server-side OAuth session handling. For a Cloud Run deployment, I would treat auth as a required part of the service, not a later improvement.

For production:

Do not set WEBUI_AUTH=false.
Configure OAuth or OIDC.
Set WEBUI_URL to the service URL or custom domain.
Store WEBUI_SECRET_KEY in Secret Manager.
Disable open signup unless the service is intentionally self-service.
Be careful with trusted headers unless the only route to Open WebUI is through the authenticating proxy.
Consider Cloud Run ingress and IAM settings, not only app-level auth.

An unauthenticated model UI on a public URL is not private.

It is just obscure until someone finds it.
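
A release gate can be as small as a function that refuses to pass while tutorial settings are still present. This sketch checks only the three variables named in the list above. The function name and the https requirement for WEBUI_URL are my assumptions, and it does not cover ingress, IAM, or signup settings.

```python
from typing import Dict, List


def auth_release_blockers(env: Dict[str, str]) -> List[str]:
    """Return the reasons this Open WebUI environment is not ready for production."""
    blockers: List[str] = []
    if env.get("WEBUI_AUTH", "true").strip().lower() == "false":
        blockers.append("WEBUI_AUTH is false (tutorial setting)")
    if not env.get("WEBUI_URL", "").startswith("https://"):
        blockers.append("WEBUI_URL is missing or not an https URL")
    if not env.get("WEBUI_SECRET_KEY"):
        blockers.append("WEBUI_SECRET_KEY is not set")
    return blockers
```

I would run a check like this against the exported service spec in CI and fail the deploy on any non-empty result.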

Cost and capacity notes#

Cloud Run GPU services use instance-based billing. GPU is billed for the instance lifecycle. Minimum instances cost money while idle. GPU quota also caps maximum instances.

That means containerConcurrency: 80 is not a universal recommendation.

It is a tutorial value. The right value depends on model size, prompt length, latency target, and how many simultaneous users the UI needs to support. A small model may tolerate more concurrency. A larger model may produce unacceptable tail latency or memory pressure.

Start lower. Load test. Watch latency and errors. Then tune.
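
Before load testing, a little arithmetic helps frame the question. The sketch below uses the maxScale and containerConcurrency values from the service YAML and applies Little's law (requests in flight equal arrival rate times time per request). The workload numbers in the comments are made up for illustration, not measurements.

```python
import math


def instance_ceiling(max_instances: int, container_concurrency: int) -> int:
    """Upper bound on requests in flight across the whole service."""
    return max_instances * container_concurrency


def required_concurrency(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average requests in flight for a given load."""
    return requests_per_second * seconds_per_request


def instances_needed(requests_per_second: float, seconds_per_request: float,
                     container_concurrency: int) -> int:
    """Instances required to hold that many requests at the chosen concurrency."""
    in_flight = required_concurrency(requests_per_second, seconds_per_request)
    return math.ceil(in_flight / container_concurrency)


# With the YAML values: instance_ceiling(5, 80) is 400 requests in flight.
# Hypothetical load of 2 requests per second at 15 seconds each:
#   required_concurrency(2.0, 15.0) is 30.0
#   instances_needed(2.0, 15.0, 4) is 8, which exceeds maxScale 5
#   instances_needed(2.0, 15.0, 80) is 1, if one GPU can really serve 30 at once
```

The last line is the trap. A high concurrency setting makes the instance count look fine on paper while a single GPU queues everything, so the measured latency at each setting is what decides the value.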

Also decide whether you need GPU zonal redundancy. It changes cost and availability behavior. For experiments, you may not want to pay for the stronger availability posture. For production, the availability requirement should be explicit.

Baked model variant#

If you prefer a baked model image, use a separate Dockerfile and service file.

FROM --platform=linux/amd64 ollama/ollama
 
ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_MODELS=/models
ENV OLLAMA_DEBUG=false
ENV OLLAMA_KEEP_ALIVE=-1
 
ENV MODEL=gemma2:2b
RUN ollama serve & sleep 5 && ollama pull "$MODEL"
 
EXPOSE 11434
ENTRYPOINT ["ollama", "serve"]

Then remove the GCS volume and do not override OLLAMA_MODELS in the service YAML.

Baked images can make revisions more self-contained. They also produce larger images and require rebuilds for model updates. GCS-backed models keep the image smaller and allow model updates without rebuilding, but the service depends on the mounted bucket and its permissions.

There is no universal winner.

There is only a correct match to the operating model.

How the two lanes fit together#

Local MCP and Cloud Run open models solve different parts of the private agent stack.

Local MCP gives a local client safe access to local capabilities:

Claude Code, Cursor, VS Code, or Windsurf
  -> local MCP servers
  -> files, databases, browser, docs, project tools

Cloud Run gives users a private inference surface:

Open WebUI
  -> Ollama sidecar
  -> open model on Cloud Run GPU

You can combine them, but I would not rush.

The first milestone is a useful local MCP workflow with a hosted or client-provided model. The second milestone is a private model service for chat and inference. The third milestone is deciding whether agent clients should call the private model directly, use Open WebUI as a UI only, or connect to a separate OpenAI-compatible endpoint.

Do not make the first version a maze.

Build one reliable capability at a time.

Failure modes#

The first failure mode is confusing MCP with model serving. MCP gives tools to an agent. Ollama serves models. They can work together, but one does not replace the other.

The second is exposing too many MCP tools locally. The agent gets confused, and the user loses track of what the agent can do.

The third is running local HTTP MCP servers on 0.0.0.0 without auth. The MCP spec warns against this for a reason.

The fourth is broad tokens. A GitHub token, cloud credential, or database credential in an MCP server should be scoped to the job.

The fifth is using gcloud run deploy --source for a service YAML. Apply service specs with gcloud run services replace.

The sixth is mixing GCS-backed and baked-model designs. If the model is in GCS, mount the bucket and set OLLAMA_MODELS accordingly. If the model is baked, remove the mount.

The seventh is attaching the GPU to the wrong container. Open WebUI does not need it. Ollama does.

The eighth is no startup dependency. Open WebUI can race Ollama unless startup order and probes are set.

The ninth is demo auth in production. WEBUI_AUTH=false is a tutorial setting.

The tenth is ignoring billing. GPU inference on Cloud Run can scale to zero, but every live GPU instance still costs money.

My recommended path#

For local MCP, start with VS Code plus Copilot Pro or Cursor Pro. If you prefer terminal-first agent work and the Claude ecosystem, start with Claude Code on the Claude Pro plan. Add only two MCP servers: a project filesystem server and one workflow-specific server. Keep everything else out until the pattern proves itself.

For private open models, start with the GCS-backed Cloud Run codelab variant. It matches the current Google tutorial, keeps the image small, and makes model updates easier. Use the Knative YAML. Use services replace. Keep the GPU on Ollama. Use OLLAMA_BASE_URL. Do not disable auth in production.

Once both lanes work separately, then decide how much to connect.

The right architecture is not the one with the most integrations.

It is the one where every capability has a clear boundary, every credential has a reason, every model endpoint has an owner, and every expensive resource can be explained.

That is how a private agent stack stays useful instead of becoming another pile of powerful parts.
