Tag: gpu

8 entries tagged "gpu" — 2 posts, 6 links.

Posts

Oct 31, 2025 — 14 min — Platform & AI

Cloud Run GPU Sidecars Need Deployment Discipline

A practical deployment guide for running Ollama behind Open WebUI on Cloud Run GPUs without mixing service specs, model storage modes, sidecar startup order, or auth assumptions.

Outcome: Clarified Cloud Run GPU sidecar deployment choices so model storage, service YAML, startup ordering, authentication, and billing constraints are explicit before launch.

gcp cloud run gpu ollama open webui

Sep 29, 2025 — 7 min — Platform & AI

PyTorch Training Throughput: The Patterns That Actually Move the Number

torch.compile, mixed precision, gradient accumulation, DDP vs FSDP, and the profiler — the five levers I reach for before rethinking the model architecture.

Outcome: Cut training wall-clock time and GPU memory pressure by applying compile, AMP, and accumulation patterns in sequence before ever touching model architecture.

pytorch ml training performance gpu distributed-training
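A minimal sketch of why gradient accumulation is safe to apply, assuming equal-size micro-batches and a mean-reduced loss. It uses a toy one-parameter model in plain Python rather than PyTorch, and the data, the weight, and the function names are made up for illustration. Scaling each micro-batch gradient by 1 / num_steps and summing reproduces the full-batch gradient, which is why accumulation trades memory for steps without changing the update.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1 / num_steps."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal-size micro-batches"
    num_steps = len(xs) // micro_batch_size
    total = 0.0
    for step in range(num_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Dividing by num_steps mirrors scaling the loss before backward().
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_steps
    return total


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 1.5

full = mse_grad(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)
print(abs(full - accum) < 1e-9)
```

The equivalence breaks if micro-batches differ in size or the loss is sum-reduced, so the scaling factor is the part worth checking in a real training loop.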

Links

Article · huggingface.co · Apr 21, 2026

Tricks from OpenAI gpt-oss You Can Use with Transformers

Hugging Face

This is the runtime stack link for the gpt-oss moment: kernels from the Hub, MXFP4, tensor and expert parallelism, dynamic KV cache behavior, continuous batching, and faster loading.

Worth keeping because it connects model release excitement to the boring but decisive parts of deployment: memory, cache shape, batching, and what hardware the trick actually runs on.
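To make the memory and cache-shape point concrete, here is a rough sizing sketch for a key/value cache. It assumes plain full attention with a dense cache, so it ignores sliding windows, paging, and quantized caches, and the model shape is hypothetical rather than taken from gpt-oss.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_element=2):
    """Approximate dense KV cache size in bytes.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer. bytes_per_element=2 corresponds to a 16-bit cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical model shape, chosen only for illustration.
one_sequence = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                              seq_len=4096, batch_size=1)
eight_sequences = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                                 seq_len=4096, batch_size=8)

print(f"{one_sequence / 2**20:.0f} MiB")     # 512 MiB
print(f"{eight_sequences / 2**30:.0f} GiB")  # 4 GiB
```

The cache grows linearly with both sequence length and batch size, which is why batching strategy and cache behavior end up deciding how many requests fit on a given card.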

Article · gpuopen.com · Apr 16, 2026

A Beginner's Guide to Deploying LLMs with AMD on Windows using PyTorch

Warren Eng, Sheen Lam, and Alexander Blake-Davies, AMD GPUOpen

This is worth keeping because local and Windows-side inference are no longer edge cases for product teams. Hardware choices now shape who can prototype, test, and deploy models without waiting for a cloud GPU lane.

The guide is beginner-friendly, so it is not the deepest systems reference. Its value is showing where AMD's PyTorch story is becoming practical enough to track.

pytorch amd gpu local inference

Article · svana.name · Apr 15, 2026

How I Solved PyTorch's Cross-Platform Nightmare

Milos Svana

This belongs next to any local inference or training work because cross-platform ML is usually where clean notebooks meet real machines.

The useful lesson extends beyond PyTorch. Runtime packaging, hardware backends, and dependency friction are product risks when your users are expected to run models outside your exact environment.

pytorch gpu software engineering ai engineering

Article · gimletlabs.ai · Apr 14, 2026

Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

Taras Sereda, Gimlet Blog

This is a strong local-inference link because it shows the gap between framework support and hardware-specific performance work. Apple Silicon can be serious, but the path often goes through kernels, not just model.to("mps").

The AI-generated-kernel angle is interesting, but the bigger point is operational: when the backend is the bottleneck, moving the model to the device is only the first step, and the remaining speedup has to come from kernel-level work.

pytorch apple silicon gpu local inference

Article · jax-ml.github.io · Apr 13, 2026

How to Think About GPUs

How To Scale Your Model

This is reference-quality material for anyone trying to reason about accelerator performance without hand-waving. It explains GPUs as machines with memory, bandwidth, communication, and parallelism constraints.

Useful far beyond JAX. The same mental model helps with PyTorch, Transformers, NeMo, and any conversation where "just add GPUs" is hiding the actual bottleneck.

gpu jax training ai engineering
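One way to put the memory-and-bandwidth framing into numbers is a roofline-style estimate. The sketch below is not taken from the linked guide; the peak throughput and bandwidth figures describe a hypothetical accelerator, and the byte count assumes each matrix is read or written exactly once.

```python
def matmul_intensity(m, k, n, bytes_per_element=2):
    """Arithmetic intensity (FLOPs per byte) of an (m, k) x (k, n) matmul."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: limited by compute or by memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical accelerator numbers, for illustration only.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
MEM_BANDWIDTH = 1e12     # 1 TB/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH  # intensity needed to saturate compute

for m in (1, 4096):
    intensity = matmul_intensity(m, 4096, 4096)
    regime = "memory-bound" if intensity < RIDGE else "compute-bound"
    bound = attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"m={m}: {intensity:.1f} FLOP/byte, {regime}, "
          f"at most {bound / 1e12:.1f} TFLOP/s")
```

With a single row (m=1, roughly the shape of one decoding step) the matmul sits far below the ridge point, so adding compute does nothing; with m=4096 the same weights are reused enough to reach the compute ceiling.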

Article · developer.nvidia.com · Mar 26, 2026

NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit

NVIDIA Technical Blog

This belongs in the queue because fine-tuning and inference decisions increasingly depend on numerical formats, not only model architecture. NVFP4 is a reminder that performance work moves through the whole stack: math representation, kernels, hardware, training stability, and serving economics.

The tradeoff is hardware specificity. A format can be a breakthrough and still be irrelevant to a Mac-local or non-NVIDIA deployment path.

gpu model training quantization nvidia
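To show what a numerical format changes in practice, here is a simplified block-scaled 4-bit round trip. It assumes an E2M1-style value grid and 16-element blocks, and it keeps each block scale as a plain Python float, so it is an illustration of the idea and not NVIDIA's NVFP4 implementation. All names are hypothetical.

```python
# Magnitudes representable with 2 exponent bits and 1 mantissa bit (assumed grid).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block length for this sketch


def quantize_block(block):
    """Return (scale, codes) where each code is (is_negative, grid_index)."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - target))
        codes.append((x < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Rebuild approximate values from the block scale and 4-bit codes."""
    return [(-1.0 if neg else 1.0) * E2M1_GRID[idx] * scale for neg, idx in codes]


def roundtrip(values):
    """Quantize and dequantize a flat list block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


values = [0.1 * i - 0.8 for i in range(32)]
restored = roundtrip(values)
worst = max(abs(a - b) for a, b in zip(values, restored))
print(f"worst absolute error: {worst:.4f}")
```

Because every block carries its own scale, one outlier only coarsens the 16 values around it rather than the whole tensor, which is the basic reason block-scaled formats hold up better than a single per-tensor scale.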
