Cloud Run GPU Sidecars Need Deployment Discipline

Outcome focus: Clarified Cloud Run GPU sidecar deployment choices so model storage, service YAML, startup ordering, authentication, and billing constraints are explicit before launch.

Cloud Run makes the Ollama plus Open WebUI pattern look simple.

That is both the appeal and the trap.

The appealing part is real. You can run Open WebUI as the ingress container, run Ollama as a sidecar, attach a GPU to the Ollama container, and serve a local model without managing a Kubernetes cluster. Cloud Run handles the service boundary, scaling, revisions, traffic, and a lot of operational surface that I do not want to rebuild for a small inference interface.

The trap is that this deployment has several details that look interchangeable but are not. A Dockerfile that bakes a model into the image is not the same design as a service that mounts models from Cloud Storage. A Cloud Run service YAML is not the same thing as a source deploy command. A sidecar that starts eventually is not the same thing as a sidecar that is healthy before the ingress container connects to it. A public Open WebUI endpoint with WEBUI_AUTH=false is not a harmless demo setting once it is reachable by anyone.

This is the kind of setup where the first mistake often works just well enough to hide the second mistake.

I like Cloud Run for this shape of workload because it keeps the platform small. But the service spec has to be deliberate. The GPU should be attached to exactly one container. The model storage path should match the chosen build strategy. The deployment command should apply a service spec, not try to infer one from source. Open WebUI should know that Ollama is at http://localhost:11434. Authentication should be treated as part of the architecture, not a note at the end.

The goal is not to memorize every flag. The goal is to keep the boundaries clear.

What changed since older notes

The first thing to check is the GPU guidance.

Older Cloud Run GPU examples often talk as if L4 is the only option. The current Cloud Run GPU documentation lists two GPU types for services: NVIDIA L4 with 24 GB of VRAM and NVIDIA RTX PRO 6000 Blackwell with 96 GB of VRAM. The resource floors are different. L4 requires at least 4 CPU and 16 GiB of memory. RTX PRO 6000 requires at least 20 CPU and 80 GiB of memory.

For the Ollama plus Gemma 2 2B codelab path, L4 is still the natural choice. It is the example shape, it is smaller, and the model fits the intent of the walkthrough. But I would no longer write "L4 only" as a general Cloud Run rule. The precise version is: this deployment uses L4, while Cloud Run GPU services now also support RTX PRO 6000 for larger workloads.

The second thing to check is the codelab's model storage pattern.

The current Google codelab prepares the Gemma 2 2B model locally, copies the Ollama models directory into a Cloud Storage bucket, and mounts that bucket into the Ollama sidecar with GCS Fuse. The Dockerfile in the codelab configures Ollama, but the service YAML overrides OLLAMA_MODELS to /root/.ollama/models, a path inside the Cloud Storage volume mounted at /root/.ollama.

That is different from a baked-model image.

A baked image can be a good design. It can reduce runtime dependency on a Cloud Storage mount and make the revision more self-contained. But it should be treated as a different variant, not half-mixed with the GCS-backed variant.

The third thing to check is deployment.

If I have a service.yaml, I use gcloud run services replace service.yaml. That command applies the service configuration. I do not pass the YAML to gcloud run deploy --source, because --source expects a directory of source code to build, not a Knative service spec.

Those three checks remove most of the confusion.

The architecture

The shape is a multi-container Cloud Run service.

Open WebUI is the ingress container. It exposes the HTTP port that receives browser traffic. Ollama is the sidecar. It listens on port 11434 inside the same Cloud Run instance. Open WebUI talks to it over localhost with OLLAMA_BASE_URL=http://localhost:11434.

The GPU belongs to the Ollama sidecar.

That point matters. Cloud Run supports one GPU per service instance, and if the service uses sidecars, the GPU can be attached to only one container. Open WebUI does not need it. Ollama does.

The service also needs a startup dependency. Open WebUI should wait for Ollama, not race it. Cloud Run supports this through the run.googleapis.com/container-dependencies annotation, but that feature is only useful when the container being depended on (Ollama here) has a startup probe. Without that probe, startup order can still mean "started" rather than "ready."

So the minimum architecture looks like this:

Open WebUI receives external traffic on port 8080.
Open WebUI uses OLLAMA_BASE_URL=http://localhost:11434.
Ollama listens on 0.0.0.0:11434.
The GPU is attached only to the Ollama sidecar.
Startup order makes Open WebUI depend on Ollama.
Model storage is either GCS-backed or baked into the Ollama image, not both by accident.
Auth is enabled for anything public.

That is enough structure to keep the deployment honest.
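The checklist above can be turned into a rough preflight that runs before anything is applied. This is a minimal sketch, not a validator: preflight_service_yaml is a name I made up, and it uses grep and awk on the text of the file instead of parsing YAML, so it assumes the layout used in this article, with each env name and its value on adjacent lines.

```shell
#!/bin/sh
# Hypothetical preflight for a Cloud Run service YAML.
# Text matching only; assumes "name:" and "value:" sit on adjacent lines.
preflight_service_yaml() (
  file="$1"
  failures=0

  fail() { echo "FAIL: $1" >&2; failures=$((failures + 1)); }

  # The GPU limit should appear under exactly one container.
  gpu_count=$(grep -cF 'nvidia.com/gpu' "$file")
  [ "$gpu_count" -eq 1 ] || fail "expected one nvidia.com/gpu limit, found $gpu_count"

  # Open WebUI should reach Ollama over localhost.
  grep -qF 'http://localhost:11434' "$file" || fail "OLLAMA_BASE_URL is not localhost:11434"

  # Startup ordering needs the dependency annotation and a startup probe.
  grep -qF 'run.googleapis.com/container-dependencies' "$file" || fail "no container-dependencies annotation"
  grep -qF 'startupProbe' "$file" || fail "no startupProbe"

  # Demo auth must not ship: WEBUI_AUTH followed by a false value.
  if awk '/name: WEBUI_AUTH[ \t]*$/ { getline; if (tolower($0) ~ /false/) bad = 1 } END { exit !bad }' "$file"; then
    fail "WEBUI_AUTH is set to false"
  fi

  [ "$failures" -eq 0 ]
)

# Usage (not run here):
#   preflight_service_yaml service.yaml && echo "ok to apply"
```

It does not check model storage, because that choice spans the Dockerfile and the service spec. It is only meant to catch the mistakes that are visible in the YAML alone.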

APIs I would enable

For this pattern, I keep the API surface small.

The codelab enables Cloud Run, Cloud Build, Cloud Storage, and Artifact Registry:

gcloud services enable run.googleapis.com \
  cloudbuild.googleapis.com \
  storage.googleapis.com \
  artifactregistry.googleapis.com

Cloud Functions is not needed. Vertex AI is not needed. If the service later calls another Google Cloud API, I would enable that intentionally when the dependency appears. I would not front-load extra APIs just because the workload is AI-adjacent.

This matters for platform hygiene. The first version of an internal AI service is often copied into the next one. A loose bootstrap becomes a loose standard.

Choose one model storage path

The main design choice is where the model lives.

The GCS-backed path stores Ollama model files in a Cloud Storage bucket and mounts that bucket into the Ollama container with GCS Fuse. The image stays smaller. Model updates can happen without rebuilding the image. The tradeoff is that startup and runtime behavior now depend on the mounted bucket, IAM, and GCS Fuse behavior.

The baked-image path pulls the model during the image build and stores it in the image. The service revision becomes more self-contained. Cold start can be more predictable once the image is available. The tradeoff is a larger image and a rebuild whenever the model changes.

Both patterns are valid. Mixing them is where things get weird.

If the service mounts Cloud Storage at /root/.ollama and sets OLLAMA_MODELS=/root/.ollama/models, the model comes from the bucket. If the Dockerfile sets OLLAMA_MODELS=/models and pulls the model during build, the model is in the image. If both are present, the environment variable set in the service YAML overrides the ENV from the Dockerfile, so Ollama reads the bucket and the model baked into the image goes unused.
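One way to catch the mixed state early is to classify the Dockerfile and service YAML as a pair before building. This is a minimal sketch under two assumptions of mine: the baked variant is recognizable by a RUN line that calls ollama pull, and the GCS variant by the gcsfuse.run.googleapis.com driver in the service spec. model_storage_mode is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print gcs, baked, mixed, or none for a
# Dockerfile and service YAML pair. Text matching only.
model_storage_mode() (
  dockerfile="$1"
  service_yaml="$2"
  baked=no
  mounted=no

  # Baked variant: the image build pulls the model.
  grep -Eq '^[[:space:]]*RUN .*ollama pull' "$dockerfile" && baked=yes

  # GCS variant: the service mounts a bucket through the gcsfuse driver.
  grep -qF 'gcsfuse.run.googleapis.com' "$service_yaml" && mounted=yes

  case "$baked:$mounted" in
    yes:yes) echo mixed ;;
    yes:no)  echo baked ;;
    no:yes)  echo gcs ;;
    *)       echo none ;;
  esac
)

# Usage (not run here):
#   [ "$(model_storage_mode Dockerfile service.yaml)" != mixed ] || exit 1
```

A result of mixed is not always wrong, but it should be a decision someone made on purpose.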

I prefer to make that choice visible in the file names:

Dockerfile.ollama-gcs for the small image that expects mounted models.
Dockerfile.ollama-baked for the image that pulls the model during build.
service.gcs.yaml for the Cloud Storage mount path.
service.baked.yaml for the baked-image path.

That small bit of naming prevents future confusion.

GCS-backed Ollama image

This is closest to the current codelab shape. The Dockerfile configures the Ollama process, but it does not pull the model during build.

FROM --platform=linux/amd64 ollama/ollama
 
ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_DEBUG=false
ENV OLLAMA_KEEP_ALIVE=-1
 
ENTRYPOINT ["ollama", "serve"]

In this mode, the service YAML should set the model path to the mounted bucket location:

- name: OLLAMA_MODELS
  value: /root/.ollama/models

Then the volume mount should put the Cloud Storage bucket at /root/.ollama:

volumeMounts:
- name: gcs-models
  mountPath: /root/.ollama

And the volume should use the Cloud Storage FUSE CSI driver:

volumes:
- name: gcs-models
  csi:
    driver: gcsfuse.run.googleapis.com
    volumeAttributes:
      bucketName: PROJECT_ID-gemma2-2b-codelab

This path is useful when the model files are prepared ahead of time and placed in the bucket.

The setup commands look like this:

export PROJECT_ID="your-gcp-project-id"
export REGION="us-central1"
 
gcloud config set project "$PROJECT_ID"
 
gcloud services enable run.googleapis.com \
  cloudbuild.googleapis.com \
  storage.googleapis.com \
  artifactregistry.googleapis.com
 
gcloud storage buckets create "gs://$PROJECT_ID-gemma2-2b-codelab"
 
curl -fsSL https://ollama.com/install.sh | sh
ollama serve

In another terminal:

ollama pull gemma2:2b
gsutil cp -r "/home/$USER/.ollama/models" "gs://$PROJECT_ID-gemma2-2b-codelab"

For a workstation that does not use /home/$USER, I would inspect where Ollama actually stores models before copying. The codelab assumes that Linux path. The principle is the important part: copy the Ollama models directory into the bucket that the sidecar mounts.
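A small check before the copy makes that inspection concrete. This is a sketch under stated assumptions: the helper names are mine, the default of $HOME/.ollama/models applies when OLLAMA_MODELS is unset, and a populated model store is recognized by its blobs and manifests subdirectories. On Linux installs that run Ollama as a system service, the store may live under the service user's home instead, so I would treat the printed path as a starting point.

```shell
#!/bin/sh
# Hypothetical helpers for locating the local Ollama model store.

# Print the directory Ollama is expected to use on this machine.
# OLLAMA_MODELS wins when set; otherwise fall back to ~/.ollama/models.
ollama_models_dir() {
  printf '%s\n' "${OLLAMA_MODELS:-$HOME/.ollama/models}"
}

# Succeed only if the directory has the blobs/ and manifests/
# subdirectories of a populated model store.
looks_like_model_store() {
  [ -d "$1/blobs" ] && [ -d "$1/manifests" ]
}

# Usage (not run here):
#   dir=$(ollama_models_dir)
#   looks_like_model_store "$dir" || { echo "no model store at $dir" >&2; exit 1; }
#   gsutil cp -r "$dir" "gs://$PROJECT_ID-gemma2-2b-codelab"
```

The copied directory needs to be named models for the bucket layout to line up with OLLAMA_MODELS=/root/.ollama/models once the bucket is mounted at /root/.ollama.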

Baked-model Ollama image

The baked path pulls the model during the image build.

FROM --platform=linux/amd64 ollama/ollama
 
ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_MODELS=/models
ENV OLLAMA_DEBUG=false
ENV OLLAMA_KEEP_ALIVE=-1
 
ENV MODEL=gemma2:2b
RUN ollama serve & sleep 5 && ollama pull "$MODEL"
 
ENTRYPOINT ["ollama", "serve"]

In this mode, I would not mount the Cloud Storage bucket at /root/.ollama for models. I would also avoid overriding OLLAMA_MODELS in the service YAML unless I am intentionally changing the image contract.

The advantage is simplicity at runtime. The service revision already contains the model. The disadvantage is image size and build time. For a small model, that may be acceptable. For larger models, the GCS-backed path or a different serving stack may be better.
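The sleep 5 in that RUN line is a fixed guess about how long ollama serve takes to come up on the build machine. A hedged alternative is to poll until the server answers and only then pull. The sketch below is not from the codelab: retry_until_ready is a hypothetical helper, and I am assuming ollama list is a reasonable readiness check because it only succeeds once the server responds.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or the
# attempt budget runs out.
#   $1 = max attempts, $2 = delay in seconds, rest = command to run
retry_until_ready() (
  attempts="$1"
  delay="$2"
  shift 2

  i=1
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the probe command succeeds.
    "$@" >/dev/null 2>&1 && exit 0
    sleep "$delay"
    i=$((i + 1))
  done

  # Budget exhausted without a successful probe.
  exit 1
)

# Usage inside a build script (not run here):
#   ollama serve &
#   retry_until_ready 30 1 ollama list && ollama pull "$MODEL"
```

If I used this, I would copy the script into the image and call it from the RUN step, so that a slow build machine fails loudly instead of pulling against a server that is not up yet.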

Build and publish images

The Artifact Registry repository needs to exist before the images are pushed:

gcloud artifacts repositories create ollama-sidecar-codelab-repo \
  --repository-format=docker \
  --location="$REGION" \
  --description="Ollama and Open WebUI"

Build the Ollama sidecar image:

gcloud builds submit \
  --tag "$REGION-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gemma-2b" \
  --machine-type e2-highcpu-32

For Open WebUI, the codelab pulls the public image and pushes it into Artifact Registry:

docker pull ghcr.io/open-webui/open-webui:main
 
gcloud auth configure-docker "$REGION-docker.pkg.dev"
 
docker tag ghcr.io/open-webui/open-webui:main \
  "$REGION-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui"
 
docker push "$REGION-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui"

For production, I would pin image tags or digests instead of using main. Floating tags make examples convenient and operations unpleasant.
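A pinned reference is easy to recognize mechanically, which makes it a cheap review gate. This is a minimal sketch: is_digest_pinned is a hypothetical helper that only checks the shape of the reference (a name followed by @sha256: and 64 hex characters), not whether the digest exists in any registry.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an image reference is pinned by
# digest, meaning it ends in @sha256:<64 lowercase hex characters>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Usage (not run here). After a pull or push, docker can print the
# digest form of a local image:
#   docker inspect --format '{{index .RepoDigests 0}}' \
#     ghcr.io/open-webui/open-webui:main
#
#   is_digest_pinned "$IMAGE_REF" || echo "floating reference: $IMAGE_REF" >&2
```

A version tag is better than main, but only a digest guarantees that two deployments of the same YAML run the same bytes.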

Corrected GCS-backed service YAML

This is the service shape I would use for the GCS-backed codelab variant.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-sidecar-codelab
  labels:
    cloud.googleapis.com/location: us-central1
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '5'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/startup-cpu-boost: 'true'
        run.googleapis.com/container-dependencies: '{"openwebui":["ollama-sidecar"]}'
    spec:
      containerConcurrency: 80
      timeoutSeconds: 300
      containers:
      - name: openwebui
        image: us-central1-docker.pkg.dev/PROJECT_ID/ollama-sidecar-codelab-repo/openwebui
        ports:
        - name: http1
          containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: http://localhost:11434
        resources:
          limits:
            cpu: 2000m
            memory: 1Gi
        volumeMounts:
        - name: openwebui-data
          mountPath: /app/backend/data
        startupProbe:
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
          periodSeconds: 240
          failureThreshold: 1
 
      - name: ollama-sidecar
        image: us-central1-docker.pkg.dev/PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gemma-2b
        env:
        - name: OLLAMA_MODELS
          value: /root/.ollama/models
        resources:
          limits:
            cpu: '6'
            memory: 16Gi
            nvidia.com/gpu: '1'
        volumeMounts:
        - name: gcs-models
          mountPath: /root/.ollama
        startupProbe:
          tcpSocket:
            port: 11434
          timeoutSeconds: 1
          periodSeconds: 10
          failureThreshold: 3
 
      volumes:
      - name: gcs-models
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: PROJECT_ID-gemma2-2b-codelab
      - name: openwebui-data
        emptyDir:
          medium: Memory
          sizeLimit: 10Gi
 
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

There are a few deliberate choices here.

The file uses Knative service YAML: apiVersion: serving.knative.dev/v1 and kind: Service. That is the YAML shape used by Cloud Run service export and replace flows.

The GPU limit appears only under the Ollama sidecar:

nvidia.com/gpu: '1'

The L4 selector appears at the template level:

nodeSelector:
  run.googleapis.com/accelerator: nvidia-l4

The Open WebUI container does not disable auth in this version. The codelab sets WEBUI_AUTH=false because it is a demo. For anything internet-facing, I would remove that and configure authentication intentionally.

Baked-image service differences

For the baked model variant, I would remove the gcs-models volume and the /root/.ollama mount, and I would not override OLLAMA_MODELS in the sidecar.

The Ollama sidecar becomes simpler:

- name: ollama-sidecar
  image: us-central1-docker.pkg.dev/PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gemma-2b
  resources:
    limits:
      cpu: '6'
      memory: 16Gi
      nvidia.com/gpu: '1'
  startupProbe:
    tcpSocket:
      port: 11434
    timeoutSeconds: 1
    periodSeconds: 10
    failureThreshold: 3

That is the whole point of baking the model. The image contract says where the model lives. The service spec does not need a model bucket.

Apply the service spec

Once the images exist and the YAML is ready, replace the placeholder and apply the service spec:

sed -i "s/PROJECT_ID/${PROJECT_ID}/g" service.yaml
gcloud run services replace service.yaml --region "$REGION"

If the service does not exist yet, replace creates it from the YAML. If it already exists, I usually export the current service first, make a focused edit, and then replace:

gcloud run services describe ollama-sidecar-codelab \
  --region "$REGION" \
  --format export > service.yaml

That gives me the actual service shape Cloud Run understands, without generated status fields.
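The sed -i step earlier in this section edits the file in place, so the placeholder is gone after the first run, and an unset PROJECT_ID silently produces broken image paths. A hedged alternative is to render into a separate file and refuse obviously bad input. In this sketch, render_service_yaml and the output path are my own names, and the ID check assumes the usual project ID shape (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter), which will reject older domain-scoped IDs.

```shell
#!/bin/sh
# Hypothetical helper: render a service YAML template into a separate
# file, leaving the template and its PROJECT_ID placeholder intact.
#   $1 = template path, $2 = output path, $3 = project ID
render_service_yaml() (
  template="$1"
  output="$2"
  project_id="$3"

  # Reject empty or oddly shaped IDs before they reach sed.
  if ! printf '%s\n' "$project_id" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'; then
    echo "refusing to render: bad project ID '$project_id'" >&2
    exit 1
  fi

  sed "s/PROJECT_ID/$project_id/g" "$template" > "$output"
)

# Usage (not run here):
#   render_service_yaml service.yaml service.rendered.yaml "$PROJECT_ID" \
#     && gcloud run services replace service.rendered.yaml --region "$REGION"
```

Keeping the template untouched also means the same file can be rendered for a second project without restoring it from version control.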

Auth is not optional in production

The codelab intentionally disables Open WebUI authentication:

- name: WEBUI_AUTH
  value: 'false'

That is fine for a controlled tutorial. It is not fine for a public URL.

Open WebUI's current docs say WEBUI_AUTH defaults to True, and setting it to False disables authentication for a fresh install. The docs also support OAuth, OIDC, and trusted-header patterns. For OAuth and SSO, WEBUI_URL needs to be set before use. For production, WEBUI_SECRET_KEY needs to be stable and secure, especially across multiple replicas.

On Cloud Run, I would treat auth as a launch blocker:

Keep Open WebUI auth enabled.
Configure OAuth or OIDC through the documented environment variables.
Set WEBUI_URL to the public service URL or custom domain before relying on OAuth.
Set WEBUI_SECRET_KEY from Secret Manager, not a literal YAML value.
Disable open signups unless the instance is meant to be self-service.
Be careful with trusted headers unless the only path to Open WebUI is through the authenticating proxy.

The trusted-header pattern is useful, but it is easy to misconfigure. If untrusted clients can reach Open WebUI directly and set the trusted header themselves, they can impersonate users. That is not a Cloud Run problem. That is an auth boundary problem.
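For the WEBUI_SECRET_KEY item in that list, the flow I have in mind is: generate a random value once, store it in Secret Manager, and reference it from the service YAML. The sketch below is an assumption-heavy outline, not the codelab's method: generate_secret_key and the secret name webui-secret-key are hypothetical, and the service account still needs permission to read the secret.

```shell
#!/bin/sh
# Hypothetical helper: print 32 random bytes as 64 hex characters.
generate_secret_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Usage (not run here). Store the value once, so every replica and
# every revision reads the same key:
#   generate_secret_key | gcloud secrets create webui-secret-key \
#     --replication-policy=automatic --data-file=-
#
# Then reference it from the Open WebUI container in the service YAML
# instead of writing a literal value:
#   - name: WEBUI_SECRET_KEY
#     valueFrom:
#       secretKeyRef:
#         name: webui-secret-key
#         key: latest
```

Generating the key at deploy time would defeat the purpose, because a new key on each revision invalidates existing sessions.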

GPU cost and scaling notes

Cloud Run GPU services use instance-based billing. They can still scale to zero, but minimum instances cost money while idle. GPU is billed for the instance lifecycle, not only for the moments the model is generating tokens.

That changes how I think about concurrency.

containerConcurrency: 80 is a convenient example value, not a universal answer. A small model may tolerate higher concurrency. A larger model may need a lower number to avoid bad tail latency or memory pressure. The right value depends on model size, context length, request pattern, acceptable latency, and whether the UI is for a small internal group or broad usage.

I would start lower, load test with representative prompts, and tune based on actual latency and GPU utilization. For production, I would also set maximum instances below available GPU quota. The docs are clear that max instances cannot exceed the GPU quota allocated for the project and region.
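The quota rule is simple arithmetic, and I find it worth encoding so it is not left to memory. In this sketch, cap_max_instances is a hypothetical helper, and the quota number is an input the operator supplies after checking the project's GPU quota for the region; nothing here queries Google Cloud.

```shell
#!/bin/sh
# Hypothetical helper: print the max-instances value to use, which is
# the requested value capped at the GPU quota.
#   $1 = requested max instances, $2 = GPU quota for the region
cap_max_instances() {
  if [ "$1" -gt "$2" ]; then
    echo "$2"
  else
    echo "$1"
  fi
}

# Usage (not run here): feed the result into the maxScale annotation.
#   MAX_SCALE=$(cap_max_instances 5 3)
#   echo "autoscaling.knative.dev/maxScale: '$MAX_SCALE'"
```

The same habit applies to containerConcurrency: I would keep the load-tested number next to the YAML so the example value of 80 does not survive by default.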

One more small but important detail: region support differs by GPU type. Do not assume the region from an old tutorial is valid for the GPU you choose. Check the current supported regions before the deployment becomes a calendar event.

Failure modes I watch for

The first failure mode is the wrong API surface. Cloud Functions or Vertex AI gets enabled even though the service only needs Cloud Run, Cloud Build, Cloud Storage, and Artifact Registry. That is not catastrophic, but it is sloppy platform hygiene.

The second is using the wrong deployment command. gcloud run deploy --source is for source deployment. A Cloud Run service YAML should go through gcloud run services replace.

The third is mixing model storage modes. If the model is baked into /models, do not also mount a bucket and point OLLAMA_MODELS at it, because Ollama will then read the bucket and ignore the model in the image. If the model is in GCS, do not pretend the Dockerfile fully defines the runtime.

The fourth is startup race. Open WebUI may start before Ollama is actually reachable. Use container dependencies and startup probes.

The fifth is attaching the GPU to the wrong container or trying to attach it to both. The GPU belongs to the Ollama sidecar in this pattern.

The sixth is demo auth escaping into production. WEBUI_AUTH=false should make a reviewer stop the release.

The seventh is using floating image tags. ghcr.io/open-webui/open-webui:main is acceptable for a walkthrough. For production, pin the version or digest.

The eighth is treating the in-memory Open WebUI data mount as durable. An emptyDir with medium: Memory is ephemeral. That can be fine for a demo. It is not a user data strategy.

The ninth is ignoring billing. Minimum instances, zonal redundancy, concurrency, and GPU type all change cost. The service may scale to zero, but that does not mean every configuration is cheap.

The production version of the pattern

The production version of this architecture is not much larger than the codelab. It is just more explicit.

I would use a pinned Open WebUI image. I would choose either GCS-backed models or baked images and name the files accordingly. I would keep the GPU on Ollama only. I would set startup probes and container dependencies. I would use Secret Manager for auth secrets. I would set a real WEBUI_URL, configure OAuth or OIDC, and keep signups locked down. I would tune concurrency with load testing instead of copying the example value. I would apply changes through service YAML and review the revision diff before sending traffic.

None of that makes the service complicated. It makes the service honest.

Cloud Run is a good fit for this kind of small inference surface when the team wants managed deployment, simple scaling, and fewer platform moving parts. But managed does not mean decision-free. The model has to live somewhere. The sidecar has to be ready. The GPU has to be attached to the right container. The UI has to be protected. The deployment command has to match the artifact.

That is the discipline that keeps a clean demo from becoming a fragile public service.
