The Three-Run Lab: How I Triage Slow PyTorch Training

Outcome focus: Identified and resolved training bottlenecks in under an hour by running the three-run baseline and reading profiler signatures before changing any model code.

GPU utilization at 95% is not a green light.

A training loop that pegs the GPU can still be badly underperforming. The GPU might be thrashing through hundreds of tiny kernels with launch overhead between each one, stalling every other batch while the DataLoader scrambles to keep up, or running in fp32 on hardware with Tensor Cores sitting idle. The utilization metric does not tell you which of those is happening. It just tells you the GPU is busy.

The instinct when training is slow is to look at the model. Swap an architecture, fiddle with the optimizer, adjust the batch size. That is almost never where the time is going. The real bottlenecks are a short list and they are diagnosable in under an hour — but you have to run the right checks before you start changing things.

Establish a number before you change anything

I always start with the three-run lab. Three configurations, same seed, same fixed data slice, back to back: fp32 baseline, AMP, AMP with compile. Before optimizing anything, I want to know where the ceiling is and how far the current loop is from it.

three_run_lab.py

import time
import torch
 
def run_lab(model_fn, dataloader, amp_dtype=None, compile_model=False, steps=55, warmup=5):
    torch.manual_seed(0)  # same seed for every configuration so the runs are comparable
    model = model_fn().cuda().train()
    if compile_model:
        model = torch.compile(model)
 
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    # Only fp16 needs a scaler. bfloat16 has float32 exponent range.
    scaler = torch.amp.GradScaler() if amp_dtype == torch.float16 else None
 
    torch.cuda.reset_peak_memory_stats()
    t0, n = None, 0
 
    for step, (xb, yb) in enumerate(dataloader):
        if step >= steps:
            break
        xb = xb.cuda(non_blocking=True)
        yb = yb.cuda(non_blocking=True)
 
        if amp_dtype:
            with torch.autocast("cuda", dtype=amp_dtype):
                loss = loss_fn(model(xb), yb)
        else:
            loss = loss_fn(model(xb), yb)
 
        if scaler:
            scaler.scale(loss).backward()
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt)
            scaler.update()
        else:
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
 
        opt.zero_grad(set_to_none=True)  # frees memory immediately vs zeroing in place
 
        if step == warmup - 1:
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            n = 0
        elif step >= warmup:
            n += xb.size(0)
 
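    if t0 is None:
        raise RuntimeError("dataloader yielded fewer batches than `warmup`; nothing was timed")
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - t0
    return {"samples_per_sec": n / elapsed, "peak_gb": torch.cuda.max_memory_allocated() / 1e9}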
 
 
results = {
    "fp32":              run_lab(MyModel, loader),
    "bfloat16":          run_lab(MyModel, loader, amp_dtype=torch.bfloat16),
    "bfloat16 + compile": run_lab(MyModel, loader, amp_dtype=torch.bfloat16, compile_model=True),
}
 
for name, r in results.items():
    print(f"{name:<22} {r['samples_per_sec']:7.0f} samples/s   {r['peak_gb']:.2f} GB peak")

What the numbers tell you: if fp32 → bfloat16 is a small jump (less than 20%), the bottleneck is not precision — the GPU is waiting on something else, usually the DataLoader. If bfloat16 → compile is a small jump, the kernels are already well-fused for your model shape. If both transitions are large, you had room in both and you should take it.
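
One way to make that reading mechanical is a small helper that turns the lab output into relative gains and a first hypothesis. This is a sketch, not part of the lab itself: interpret_lab and its small_jump parameter are hypothetical names, reusing the 20% cutoff for the compile step is an assumption (the rule of thumb above only states it for the AMP step), and the numbers in the example are made up for illustration rather than measured.

```python
def interpret_lab(results, small_jump=0.20):
    """Turn three-run-lab throughput numbers into a first hypothesis.

    `results` uses the same keys as the lab output; `small_jump` is the
    20% rule of thumb expressed as a fraction (an assumption for compile).
    """
    fp32 = results["fp32"]["samples_per_sec"]
    amp = results["bfloat16"]["samples_per_sec"]
    compiled = results["bfloat16 + compile"]["samples_per_sec"]

    amp_gain = amp / fp32 - 1.0          # relative speedup from fp32 to AMP
    compile_gain = compiled / amp - 1.0  # relative speedup from AMP to AMP + compile

    hints = []
    if amp_gain < small_jump:
        hints.append("AMP barely helped: suspect the input pipeline before the model")
    if compile_gain < small_jump:
        hints.append("compile barely helped: kernels are probably already well fused")
    if not hints:
        hints.append("both steps helped: keep AMP and compile enabled")
    return {"amp_gain": amp_gain, "compile_gain": compile_gain, "hints": hints}


# Illustrative, made-up numbers (not a measurement).
example = {
    "fp32": {"samples_per_sec": 1000.0},
    "bfloat16": {"samples_per_sec": 1100.0},
    "bfloat16 + compile": {"samples_per_sec": 1650.0},
}
report = interpret_lab(example)
for hint in report["hints"]:
    print(hint)
```

Treat the hints as a hypothesis to confirm in the profiler, not as a verdict.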

Run this before touching anything else. It takes fifteen minutes and rules out the most common wrong explanations.

The DataLoader is the first suspect

If AMP gives less than 20% improvement over fp32, stop looking at the model. The GPU is starving.

The DataLoader ships with several knobs that most people leave at defaults. On a machine with 8 physical cores and fast storage, I start here and sweep from there:

dataloader_tuning.py

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=BS,
    shuffle=True,
    num_workers=4,            # start at num_physical_cores // 2 (4 on this 8-core box), sweep up
    pin_memory=True,          # avoids an extra copy through pageable memory
    prefetch_factor=4,        # batches pre-staged per worker
    persistent_workers=True,  # workers stay alive between epochs; avoids respawn overhead
)
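
The sweep itself can be a few lines: time how long each worker count takes to deliver a fixed number of batches with no model attached, then pick the fastest. A minimal sketch under stated assumptions: sweep_workers, time_loader and the make_loader factory are hypothetical names, and the timing includes worker startup, so use enough batches for that cost to wash out.

```python
import time


def time_loader(loader, max_batches=50):
    """Seconds needed to pull up to `max_batches` batches, with no model attached."""
    start = time.perf_counter()
    for i, _ in enumerate(loader):
        if i + 1 >= max_batches:
            break
    return time.perf_counter() - start


def sweep_workers(make_loader, candidates=(2, 4, 8, 16), max_batches=50):
    """Time each worker count and return (best_count, {count: seconds}).

    `make_loader` is a hypothetical factory: it takes num_workers and returns
    a fresh iterable of batches, for example a DataLoader built with that value.
    """
    timings = {n: time_loader(make_loader(n), max_batches) for n in candidates}
    best = min(timings, key=timings.get)
    return best, timings


# Usage with a real DataLoader (dataset and BS come from your own setup):
# best, timings = sweep_workers(
#     lambda n: torch.utils.data.DataLoader(
#         dataset, batch_size=BS, shuffle=True, num_workers=n, pin_memory=True
#     )
# )
```

If the timings stop improving past some worker count, the pipeline is limited by something other than worker parallelism, typically storage or a single slow transform.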

non_blocking=True on the .cuda() calls is the matching half of pin_memory. Without it, pin_memory allocates fast transfer memory but the transfer itself still blocks the CPU:

xb = xb.cuda(non_blocking=True)
yb = yb.cuda(non_blocking=True)

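With both set, the host-to-device copy is asynchronous with respect to the CPU: the training loop can keep queueing work while the copy runs, and the workers keep preparing the next batches in parallel. The GPU still has to wait for a batch's copy before running the kernels that read it, but the CPU no longer stalls on every transfer.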

The signs that the DataLoader is the bottleneck: the GPU utilization trace sawtooths with regular idle gaps, the CPU is fully pegged while the GPU is not, or the profiler shows long DataLoader or transform ops in the CPU timeline. Heavy Python-based augmentation transforms are a common source. If the transforms cannot be moved to GPU or replaced with vectorized ops, precompute and cache them.
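
A direct way to confirm the diagnosis is to split each step's wall time into "waiting for a batch" and "running the step". The sketch below assumes you can wrap your step in a function; split_step_time and its sync parameter are hypothetical names. On GPU, pass torch.cuda.synchronize as sync, otherwise queued kernels are not counted and the compute time reads too low.

```python
import time


def split_step_time(loader, step_fn, steps=50, sync=None):
    """Split wall time into time spent waiting for data and time spent in the step.

    `step_fn(batch)` runs one training step. `sync` is an optional callable
    (e.g. torch.cuda.synchronize) invoked after each step so that GPU work
    is included in the compute time.
    """
    data_s = 0.0
    compute_s = 0.0
    it = iter(loader)
    for _ in range(steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks here if the workers cannot keep up
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        if sync is not None:
            sync()
        t2 = time.perf_counter()
        data_s += t1 - t0
        compute_s += t2 - t1
    total = data_s + compute_s
    return {
        "data_s": data_s,
        "compute_s": compute_s,
        "data_fraction": data_s / total if total else 0.0,
    }
```

A data_fraction near zero means the workers are keeping up. Synchronizing every step removes some CPU/GPU overlap, so use this as a diagnostic, not as a throughput benchmark.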

Reading the profiler

After the three-run lab, a single profiler run gives you the actual time breakdown. Five patterns cover most situations:

profiler_run.py

from torch.profiler import profile, record_function, ProfilerActivity, schedule
 
prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
    record_shapes=True,
    profile_memory=True,
    with_stack=False,
)
 
prof.start()
for step, (xb, yb) in enumerate(dataloader):
    if step >= 8:
        break
    xb = xb.cuda(non_blocking=True)
    yb = yb.cuda(non_blocking=True)
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
    prof.step()
prof.stop()
 
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
prof.export_chrome_trace("trace.json")

Open trace.json in chrome://tracing or Perfetto. Then look for:

A few ops dominate cuda_time_total. Usually one or two layers are eating most of the GPU time. Check the tensor shapes going into those layers: with 16-bit inputs, matmul and convolution dimensions that are not multiples of 8 can keep the fast Tensor Core kernels from being selected, or leave them underused, even with AMP enabled. Pad or reshape the problematic dimensions and rerun.
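
Checking and fixing alignment is simple arithmetic. The sketch below assumes the multiple-of-8 guideline NVIDIA gives for 16-bit Tensor Core matmuls; round_up and misaligned_dims are hypothetical helpers, and 50257 is only an illustrative vocabulary size.

```python
def round_up(n: int, multiple: int = 8) -> int:
    """Smallest value >= n that is divisible by `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def misaligned_dims(shape, multiple: int = 8):
    """Indices of dimensions in `shape` that are not divisible by `multiple`."""
    return [i for i, d in enumerate(shape) if d % multiple != 0]


print(round_up(50257))                # 50264
print(round_up(50257, 64))            # 50304
print(misaligned_dims((50257, 768)))  # [0]
```

Padding an embedding or output dimension this way changes the parameter shape, so the extra rows or columns have to be ignored or masked downstream.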

Hundreds of tiny short-duration kernels. The GPU is spending more time launching kernels than executing them. torch.compile is the fix here — it fuses those launches into fewer, larger kernels. I have seen compile drop kernel counts from several hundred per batch to under fifty with no model changes.
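
To put a number on "hundreds of tiny kernels", the exported trace.json can be summarized directly. This sketch assumes the Chrome trace layout the PyTorch profiler writes, with GPU kernels as complete events (ph "X") in category "kernel" and durations in microseconds; summarize_kernels, load_trace and the 10 µs cutoff are hypothetical choices.

```python
import json


def load_trace(path):
    """Load a Chrome trace file such as the trace.json exported by the profiler."""
    with open(path) as f:
        return json.load(f)


def summarize_kernels(trace, short_us=10.0):
    """Count GPU kernel events and how many run for less than `short_us` microseconds."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    durations = [
        e["dur"]
        for e in events
        if e.get("cat") == "kernel" and e.get("ph") == "X" and "dur" in e
    ]
    short = sum(1 for d in durations if d < short_us)
    return {
        "kernels": len(durations),
        "short": short,
        "short_fraction": short / len(durations) if durations else 0.0,
        "total_us": sum(durations),
    }


# Usage: summary = summarize_kernels(load_trace("trace.json"))
```

The counts cover the whole active profiling window, so divide by the number of active steps to get a per-batch figure. Comparing the summary before and after torch.compile shows whether fusion actually reduced the launch count.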

Regular GPU idle gaps with CPU pegged. DataLoader problem. Raise num_workers, enable pin_memory and non_blocking, increase prefetch_factor. If transforms are the CPU bottleneck, move them to GPU-side augmentation libraries or cache preprocessed tensors.

Memory growing per step or OOM mid-training. Check whether there is a .item() or .detach() missing on the loss before it goes into a running average. A loss accumulated over steps without .item() keeps the entire computation graph alive. Also check for any Python objects in the loop that hold references to tensors — that one shows up less often but is harder to find.

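NaNs or gradient explosions. Gradient clipping bounds an exploding gradient norm, but it cannot repair a NaN and it does not tell you where the NaN originated. To find the source, temporarily wrap the forward and backward pass with detect_anomaly:

nan_debug.py

# Slow. Use only when tracking down the source of a NaN.
# Run the forward pass inside the context as well: autograd records the
# forward tracebacks there and uses them to name the offending op.
with torch.autograd.detect_anomaly():
    loss = loss_fn(model(xb), yb)
    loss.backward()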

detect_anomaly records a traceback for every forward operation and raises as soon as a backward function returns NaN, pointing at the forward op that produced it. That bookkeeping slows every step noticeably, so pull it out once you have found the culprit. After that, check learning rate magnitude, weight initialization, and whether any layer is producing unbounded activations before the loss.
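
A cheaper first check, before paying for detect_anomaly, is to scan for the first parameter whose gradient is non-finite right after backward. A sketch assuming a standard nn.Module: first_nonfinite_grad is a hypothetical helper, and it only says which parameter received the bad gradient, not which op produced it.

```python
import torch


def first_nonfinite_grad(model: torch.nn.Module):
    """Return the name of the first parameter whose gradient contains NaN or Inf, else None."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            return name
    return None


# Usage inside the loop, after loss.backward() and before opt.step():
# bad = first_nonfinite_grad(model)
# if bad is not None:
#     print(f"non-finite gradient first seen in {bad}")
```

Parameters are visited in registration order, so the name is a hint about where to look, not proof of where the NaN started.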

Memory triage

Two numbers tell you most of what you need to know:

# After each phase or batch size sweep
peak = torch.cuda.max_memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"peak allocated: {peak:.2f} GB   reserved: {reserved:.2f} GB")
torch.cuda.reset_peak_memory_stats()

If reserved is much larger than peak allocated, the allocator is holding onto freed blocks. This is normal behavior — PyTorch caches blocks to avoid re-requesting them from CUDA — but if you are switching between phases with significantly different memory footprints (training then evaluation on large batches), a manual torch.cuda.empty_cache() between phases returns those blocks to CUDA. Do not call it inside the training loop; the overhead of re-acquiring blocks every step costs more than it saves.

If peak allocated is growing step over step, something is retaining the computation graph. Add .item() when storing scalar losses:

# Wrong: keeps the graph alive
total_loss += loss
 
# Right: extracts the Python float, graph is freed
total_loss += loss.item()
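
"Growing step over step" is easier to judge from a number than from eyeballing logs. The sketch below fits a least-squares slope to per-step memory readings; growth_per_step is a hypothetical helper, and the readings are assumed to come from torch.cuda.memory_allocated() sampled at the same point in every step.

```python
def growth_per_step(samples):
    """Least-squares slope of memory readings, in bytes per step."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Usage inside the loop, sampled after opt.zero_grad():
# samples.append(torch.cuda.memory_allocated())
# ...then, after a few hundred steps:
# print(f"{growth_per_step(samples) / 1e6:.3f} MB per step")
```

A slope that stays clearly above zero after the first few steps suggests something is holding references. A flat line with occasional spikes is usually just allocator behavior.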

The starting scaffold

Once the triage is done, I work from a minimal scaffold that has all the knobs wired. Easier to prune back than to remember what to add:

train_loop.py

import os
import torch
from torch.profiler import profile, ProfilerActivity, schedule
 
model = MyModel().cuda()
model = torch.compile(model)
 
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)  # fused=True on CUDA: batched update kernels instead of one launch per parameter tensor
loss_fn = torch.nn.CrossEntropyLoss()
 
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=BS,
    shuffle=True,
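    # os.cpu_count() can return None; keep at least 1 worker because
    # prefetch_factor and persistent_workers both require num_workers > 0.
    num_workers=max(1, min(8, (os.cpu_count() or 2) // 2)),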
    pin_memory=True,
    prefetch_factor=4,
    persistent_workers=True,
)
 
def step(xb, yb):
    xb = xb.cuda(non_blocking=True)
    yb = yb.cuda(non_blocking=True)
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(xb), yb)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad(set_to_none=True)
    return loss.item()
 
# Warmup: a few unprofiled steps so torch.compile and cuDNN autotuning finish before profiling
for i, (xb, yb) in enumerate(loader):
    step(xb, yb)
    if i == 2:
        break
 
torch.cuda.synchronize()
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=0, warmup=1, active=3),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for i, (xb, yb) in enumerate(loader):
        if i >= 4:
            break
        step(xb, yb)
        prof.step()
 
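print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
prof.export_chrome_trace("baseline_trace.json")  # keep the trace so later runs can be compared against it
print(f"peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")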

AdamW(fused=True) is a small one worth knowing: on CUDA it batches the optimizer update across parameter tensors into a few fused kernel launches instead of a separate set of launches per tensor. On models with many parameter tensors the saved launches add up.

Because the scaffold profiles a few steps right after warmup, the first serious training run also produces a baseline profile. If something changes in a future run and performance drops, comparing the two is a ten-minute exercise rather than a debugging session.

Run the three-run lab first. Read the profiler second. Change the architecture last.