Fine-Tuning LLMs Is an Operating Loop, Not a Training Command

Why LLM fine-tuning projects fail when teams jump to NeMo or Hugging Face training commands before deciding the model, data, evaluation, serving, and governance loop.

By Jovani Pink · March 8, 2026 · 10 min · Platform & AI Engineering

Outcome focus: Defined a fine-tuning operating loop that connects base-model choice, data curation, PEFT, evaluation, distributed training, serving, and governance into one repeatable release path.

Part 1 of 4 in the LLM fine-tuning series.

The fastest way to waste a GPU budget is to treat fine-tuning like a command.

I have seen teams do the familiar move: pick a fashionable base model, throw a folder of examples at a trainer, watch loss go down, and celebrate before they compare the tuned model against the base model on the tasks that matter. The run looks real. The dashboard moves. The checkpoint exists.

Then the first production-like prompt exposes the problem.

The model learned the team's formatting mistakes. The eval set leaked into training. The tokenizer template was wrong. The base model license was never checked. The LoRA adapter worked in the notebook but behaved differently after export. The tuned model sounded more confident and became less correct.

Fine-tuning did not fail because the framework was bad.

It failed because the operating loop was missing.

Frameworks like NVIDIA NeMo and Hugging Face Transformers are powerful. They do not decide what should be trained, what data should be trusted, what metric should stop the run, or which deployment path can safely serve the result. That remains engineering and product judgment.

The Fine-Tuning Loop

The useful unit is not SFT, PEFT, QLoRA, DPO, or ZeRO.

The useful unit is the loop that turns a model change into a measured product decision.

A fine-tuning run only matters when it belongs to a repeatable evaluation and release loop.

The first failure mode is starting in the wrong box. A team asks, "Should we use NeMo or Hugging Face?" before it knows whether the model needs domain adaptation, format control, style control, refusal calibration, tool-use behavior, or preference alignment.

Those are different jobs.

If the base model already knows the domain and only needs output shape, a small instruction-tuning set may be enough. If the model needs to behave differently for one customer workflow, a LoRA adapter may be the right first move. If the task demands a large domain shift, a small adapter can hit its ceiling quickly. If the product needs users to prefer one response over another, DPO may be useful after SFT. If the team lacks a trusted eval set, every training decision is floating.

The tradeoff is simple and painful: do less training until you have better evidence.

That feels slow. It is usually cheaper than training the wrong behavior into a large model.

Start With Base Model and License

Do not try to "fix" a poor base model with data.

Choose a base that is already strong near the target task. If the system needs long-context behavior, start with a model designed for long context. If it needs multilingual output, start with a model that has multilingual strength. If it needs tool calling, start with a model that already understands tool structure.

Retrofitting deep capability through a small fine-tune is fragile. Fine-tuning is better at steering, formatting, specialization, and domain adaptation than at creating a missing foundation.

The license check belongs here, not after the experiment works.

Create a small model-selection record:

# Base Model Decision
 
model: openai/gpt-oss-20b
license: Apache-2.0
intended_use: internal domain assistant
redistribution: adapter only, no public redistribution
context_need: <= 4096 training tokens
task_fit: strong reasoning and tool-use baseline
known_risks:
  - harmony formatting must be preserved
  - quantized training path depends on runtime
  - safety behavior must be regression-tested

The artifact is boring on purpose. It keeps the team from discovering commercial, redistribution, privacy, or prompt-format constraints after the run is already useful.

Data Is the Moat

Small, clean, on-task data beats big messy data.

That sounds obvious until a team starts padding the training set with synthetic examples because the initial dataset feels too small. Synthetic data can help, but only if it is reviewed, deduplicated, and tied to the behavior being trained. Otherwise, it becomes a noise amplifier.

The minimum curation gate:

  • Remove exact and near duplicates.
  • Remove PII and sensitive content the model should not memorize.
  • Remove examples that leak the answer or instruction pattern unrealistically.
  • Split train, validation, and test data before repeated experiments.
  • Keep a frozen test set untouched until the end.
  • Keep a small human-graded "gold" set for sanity checks.
  • Preserve the chat template or prompt format the model expects.
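Two of those gates are easy to automate early. The sketch below is a minimal illustration, not a full curation pipeline: it drops exact and trivially reworded duplicates by hashing normalized text, and it assigns splits from a hash of the example ID so an example never migrates between train and test across reruns. The `prompt`, `response`, and `id` field names and the 10/10 split percentages are assumptions for this example. True near-duplicate detection (MinHash, embeddings) is a separate step this sketch does not attempt.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before hashing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def dedupe(examples: list[dict]) -> list[dict]:
    """Keep the first example for each normalized prompt/response pair."""
    seen: set[str] = set()
    kept = []
    for ex in examples:
        text = normalize(ex["prompt"] + " " + ex["response"])
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministic split from a hash of the ID, so reruns never move an example."""
    bucket = int(hashlib.sha256(example_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


# Hypothetical records; ex_2 is a reworded copy of ex_1.
examples = [
    {"id": "ex_1", "prompt": "Table is stale?", "response": "Check the freshness monitor."},
    {"id": "ex_2", "prompt": "table is  STALE", "response": "check the freshness monitor"},
    {"id": "ex_3", "prompt": "Job failed?", "response": "Inspect the latest run logs."},
]
clean = dedupe(examples)
splits = {ex["id"]: assign_split(ex["id"]) for ex in clean}
```

Deduplicate before splitting. If the order is reversed, a reworded copy of a training example can land in the test set and quietly inflate the score.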

For instruction tuning, I like JSONL examples that make the intended behavior inspectable:

train.jsonl
{"messages":[{"role":"system","content":"You are a concise platform support assistant."},{"role":"user","content":"A governed table is stale. What should I check first?"},{"role":"assistant","content":"Check the freshness monitor, then inspect the latest pipeline run logs before escalating to the dataset owner."}]}

That example is not enough by itself. It is a shape. The actual record needs provenance, review status, and split assignment somewhere in the pipeline.

metadata.jsonl
{"id":"ex_0142","source":"internal_runbook_sanitized","reviewed_by":"sme","contains_pii":false,"split":"train","task":"incident_triage"}
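A metadata record like that only earns its keep if something enforces it. As a rough sketch, assuming the field names from the record above, a gate can refuse unreviewed or PII-flagged rows and flag any ID that shows up in more than one split. The second and third records here are invented to show the failure cases.

```python
import json


def training_ready(meta: dict) -> bool:
    """A record may enter training only if it is reviewed, PII-free, and in the train split."""
    return (
        bool(meta.get("reviewed_by"))
        and meta.get("contains_pii") is False
        and meta.get("split") == "train"
    )


def leaked_ids(records: list[dict]) -> set[str]:
    """IDs assigned to more than one split, which would contaminate the frozen test set."""
    splits_by_id: dict[str, set[str]] = {}
    for rec in records:
        splits_by_id.setdefault(rec["id"], set()).add(rec["split"])
    return {rec_id for rec_id, splits in splits_by_id.items() if len(splits) > 1}


# First line matches the record above; the other two are hypothetical bad rows.
lines = [
    '{"id":"ex_0142","source":"internal_runbook_sanitized","reviewed_by":"sme","contains_pii":false,"split":"train","task":"incident_triage"}',
    '{"id":"ex_0143","source":"ticket_export","reviewed_by":"","contains_pii":true,"split":"train","task":"incident_triage"}',
    '{"id":"ex_0142","source":"internal_runbook_sanitized","reviewed_by":"sme","contains_pii":false,"split":"test","task":"incident_triage"}',
]
records = [json.loads(line) for line in lines]
ready = [rec["id"] for rec in records if training_ready(rec)]
leaks = leaked_ids(records)
```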

NeMo's data tooling is worth knowing here. The existing post "Fine-Tuning Open Source LLMs With NVIDIA NeMo" goes deeper on NeMo Curator and the broader NeMo lifecycle. For this series, the principle is enough: data curation is a model decision, not a preprocessing chore.

Prefer PEFT Before Full SFT

Start with parameter-efficient fine-tuning unless the evals prove it is not enough.

LoRA in Hugging Face PEFT keeps the base weights frozen and trains low-rank update matrices. The practical advantages are strong: fewer trainable parameters, smaller checkpoints, easier rollback, multiple task adapters on one base, and often comparable performance to full fine-tuning for targeted adaptation.

The default Hugging Face shape is familiar:

lora_config.py
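from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
 
# Load the frozen base model. The ID is the one from the model-selection record; swap in your own.
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")
 
lora_config = LoraConfig(
    r=16,  # adapter rank
    lora_alpha=32,  # the low-rank update is scaled by lora_alpha / r
    lora_dropout=0.05,
    target_modules="all-linear",  # every linear layer except the output head
    task_type="CAUSAL_LM",
)
 
# Wrap the base so only the LoRA matrices train, then report the parameter counts.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()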
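The "fewer trainable parameters" claim is simple arithmetic. For a frozen weight of shape d_out x d_in, LoRA trains two matrices, A (r x d_in) and B (d_out x r), so the trainable count is r * (d_in + d_out). The sketch below works that out for one projection; the 4096 dimension is an assumed size for illustration, not a property of any particular model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


d_in = d_out = 4096  # assumed projection size, for illustration only
full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, r=16)
print(f"full: {full:,}  lora: {lora:,}  share: {lora / full:.2%}")
```

At rank 16 the adapter for this layer trains under one percent of what full fine-tuning would. That is also why rank is worth sweeping: doubling r doubles adapter size, and the evals should say whether the extra capacity buys anything.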

For memory-constrained NVIDIA GPUs, QLoRA is often the sweet spot: the base model is loaded in frozen 4-bit form, and gradients are backpropagated through it into the LoRA adapters, which are the only weights that train. The QLoRA paper introduced NF4, double quantization, and paged optimizers, reducing memory enough to fine-tune much larger models on a single GPU while preserving strong quality.
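A back-of-envelope estimate shows where the savings come from. This sketch counts weight memory only and ignores activations, adapters, optimizer state, and framework overhead, so treat it as a lower bound. The 7B parameter count is a hypothetical model size. The 0.127 bits per parameter of quantization-constant overhead is the figure the QLoRA paper reports for block size 64 with double quantization.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory for model weights only; excludes activations, adapters, and optimizer state."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # hypothetical 7B-parameter base model
BF16_BITS = 16.0
# NF4 stores 4 bits per weight; double quantization adds roughly 0.127 bits
# per parameter for the quantization constants (block size 64).
NF4_DQ_BITS = 4.0 + 0.127

bf16_gib = weight_memory_gib(N_PARAMS, BF16_BITS)
nf4_gib = weight_memory_gib(N_PARAMS, NF4_DQ_BITS)
print(f"bf16 weights: {bf16_gib:.1f} GiB, NF4 weights: {nf4_gib:.1f} GiB")
```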

The important qualifier is "NVIDIA." Most common Hugging Face QLoRA recipes rely on bitsandbytes, which was built around CUDA GPUs, and that path is not the same as quantized LoRA in MLX on Apple Silicon. That distinction is the subject of part 3 of this series.

For NeMo, PEFT is also first-class. NVIDIA's current NeMo PEFT documentation describes LoRA and DoRA support in Megatron Bridge, including target modules such as fused QKV projections and MLP layers. The names differ from Hugging Face because the model internals and fused kernels differ.

An illustrative NeMo-style LoRA config should expose the same decisions:

nemo_lora_shape.yaml
model:
  name: meta-llama/Llama-3.1-8B-Instruct
  max_seq_length: 4096
 
peft:
  method: lora
  target_modules:
    - linear_qkv
    - linear_proj
    - linear_fc1
    - linear_fc2
  dim: 16
  alpha: 32
  dropout: 0.05
 
training:
  precision: bf16
  micro_batch_size: 1
  global_batch_size: 64
  learning_rate: 0.0001
  warmup_ratio: 0.03
  epochs: 2

Treat exact field names as version-specific. Treat the visible choices as non-negotiable: model, context length, adapter rank, target modules, precision, batch strategy, learning rate, and output path (left out of the sketch above, but it belongs in the real config).
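"Batch strategy" hides one piece of arithmetic worth making explicit: global batch size equals micro batch size times gradient accumulation steps times the number of data-parallel replicas. A small helper, sketched here with hypothetical names, keeps the numbers honest when the GPU count changes. When tensor or pipeline parallelism is in play, the data-parallel size is the world size divided by the product of those two degrees, not the raw GPU count.

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, data_parallel_size: int) -> int:
    """Accumulation steps needed so micro * accum * data_parallel equals the global batch."""
    samples_per_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by "
            f"micro batch x data-parallel size ({samples_per_step})"
        )
    return global_batch_size // samples_per_step


# Global batch 64 and micro batch 1, as in the config above; 4 replicas is an assumption.
steps_on_4_gpus = grad_accum_steps(64, 1, 4)
steps_on_1_gpu = grad_accum_steps(64, 1, 1)
```

Moving the same config from one GPU to four should change accumulation, not the effective batch. If it changes the effective batch, the learning rate was tuned for a different run.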

Keep Training Conservative

Most early fine-tuning runs should be diagnostic.

The first serious run should answer:

  • Does the data format match the model template?
  • Does loss decrease without divergence?
  • Does validation performance improve, or is training loss the only number still moving?
  • Does the model improve on task-specific examples?
  • Does safety or refusal behavior regress?
  • Can the adapter load in the intended serving path?

Good first-pass defaults:

starter_training_defaults.yaml
optimizer: adamw
learning_rate:
  peft: 0.0001
  full_sft: 0.000005
warmup_ratio: 0.03
epochs: 1-3
micro_batch_size: 1
gradient_accumulation_steps: 16
max_seq_length: 2048
precision:
  nvidia_datacenter: bf16
  consumer_cuda: fp16_with_scaler
early_stop:
  metric: task_eval_score
  patience: 2

Do not chase loss alone. Loss can improve while the product gets worse. A support assistant can become more verbose. A classifier can become overconfident. A reasoning model can learn a format and lose a refusal boundary. A summarizer can learn house style and hallucinate more.
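Stopping on a task metric instead of loss is a few lines of state. This is a minimal sketch of patience-based early stopping on a task eval score, matching the `patience: 2` default above. The class name and the scores are made up for illustration, and a real trainer would call this from its evaluation callback.

```python
class EarlyStopper:
    """Stop when the task eval score has not improved for `patience` consecutive evaluations."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0

    def should_stop(self, score: float) -> bool:
        if score > self.best + self.min_delta:
            self.best = score  # new best: reset the counter
            self.stale = 0
        else:
            self.stale += 1  # no improvement at this evaluation
        return self.stale >= self.patience


stopper = EarlyStopper(patience=2)
scores = [0.71, 0.78, 0.77, 0.76, 0.80]  # made-up task eval scores, one per checkpoint
stopped_at = next((i for i, s in enumerate(scores) if stopper.should_stop(s)), None)
```

Here the run stops at the fourth evaluation and never sees the later 0.80. That is the cost of a short patience window, and a reason to keep the best checkpoint rather than the last one.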

The mistake I watch for is over-training on a dataset that is too narrow. The model starts sounding perfect on the team's examples and brittle everywhere else. Keep a small slice of base-distribution or general instruction data when forgetting becomes visible, and always compare against the base model.

Scale Only When the Loop Earns It

If the model fits and PEFT answers the product question, do not jump to distributed training for sport.

When full fine-tuning or larger models are truly required, memory strategy becomes infrastructure. DeepSpeed ZeRO partitions model state across devices in stages: Stage 1 shards optimizer states, Stage 2 adds gradients, and Stage 3 adds the parameters themselves. Stage 3 is the common inflection point for very large models because no device has to hold a full copy of the parameters. Offload can save GPU memory, but CPU or NVMe offload can make I/O the bottleneck.
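The stage-by-stage effect can be estimated before booking a cluster. The sketch below follows the accounting in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. It covers model state only, so activations, buffers, and fragmentation come on top. The 7.5B parameters on 64 GPUs is an example configuration, not a recommendation.

```python
def zero_model_state_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in GB (1e9 bytes) for mixed-precision Adam under ZeRO.

    Per parameter: 2 bytes fp16 weights, 2 bytes fp16 gradients, and 12 bytes of
    fp32 optimizer state (master weights, momentum, variance).
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= n_gpus  # Stage 1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus  # Stage 2 also shards gradients
    if stage >= 3:
        params /= n_gpus  # Stage 3 also shards parameters
    return n_params * (params + grads + optim) / 1e9


for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

LoRA changes this picture more than any ZeRO stage does, because the 12 bytes of optimizer state apply only to trainable parameters. That is the arithmetic behind trying PEFT first.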

Hugging Face users should usually reach for Accelerate plus a DeepSpeed or FSDP config instead of rolling custom distributed data parallel logic.

deepspeed_zero3_minimal.json
{
  "zero_optimization": {
    "stage": 3
  },
  "bf16": {
    "enabled": true
  },
  "gradient_accumulation_steps": 16,
  "train_micro_batch_size_per_gpu": 1
}

NeMo users should lean on the NeMo stack when they need Megatron-Core scale. NeMo 2 supports pretraining, SFT, and PEFT in its LLM collection and uses NeMo Lightning to train Megatron Core-based models in a modular way. That is valuable when tensor parallelism, pipeline parallelism, distributed checkpoints, and cluster execution are part of the actual requirement.

The tradeoff is complexity. Advanced parallelism buys scale and throughput. It also expands the failure surface: NCCL, topology, checkpoint conversion, fused modules, rank mapping, precision settings, and container drift.

Scale after the small run proves the behavior is worth scaling.

DPO Before PPO for Most Teams

Supervised fine-tuning teaches the model to imitate.

Preference optimization teaches it to choose.

When the team has reliable chosen/rejected response pairs, TRL's DPO Trainer is often the practical next step. DPO directly optimizes a policy model against preference pairs and a frozen reference model, without requiring a separate reward model or an online PPO loop.

The input shape is conceptually simple:

preference.jsonl
{"prompt":[{"role":"user","content":"Summarize this incident for an executive."}],"chosen":[{"role":"assistant","content":"A pipeline delay affected one governed table for 42 minutes. No downstream report was published during the stale window."}],"rejected":[{"role":"assistant","content":"There was a pipeline issue but it is fixed now."}]}
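What DPO does with a pair like that fits in one function. For each pair it compares how much more the policy likes the chosen response than the reference model does against the same quantity for the rejected response, then applies a logistic loss scaled by beta. The sketch below computes that loss for a single pair from summed log-probabilities. The numbers and the beta of 0.1 are illustrative, and a real trainer computes the log-probabilities from model outputs in batches.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet, so the loss is log(2).
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved toward the chosen response and away from the rejected one.
after_learning = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The reference model is what keeps the policy anchored. The loss rewards widening the gap relative to the reference, not maximizing the chosen response's probability outright.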

Use PPO only when the product genuinely needs an online loop, with a reward model scoring fresh samples from the policy, or a more complex RLHF system. If the team cannot label preference pairs consistently, it is not ready for PPO. It may not even be ready for DPO.

Preference data quality is the release gate.

Evaluation Is the Product Contract

Every fine-tuning project needs a test plan before the run.

I want an eval contract file that lives with the experiment:

eval_contract.yaml
gold_set: data/evals/gold.jsonl
frozen_test_set: data/splits/test.jsonl
compare:
  - base_model
  - tuned_adapter
  - merged_model
metrics:
  task_success_min: 0.82
  schema_validity_min: 0.98
  refusal_regression_max: 0.02
  toxicity_regression_max: 0.00
human_review:
  sample_size: 200
  blind_compare: true
release_rule: "tuned must beat base on task success without safety regression"

The tuned model should beat the base model on the reason it exists. It should not regress on safety, refusal calibration, or formatting. It should survive adapter loading and merged-weight behavior if both are supported.
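A contract is only useful if something mechanical enforces it. The sketch below applies the thresholds from `eval_contract.yaml` plus the tuned-beats-base rule and returns the reasons for any failure. The result keys (`task_success`, `schema_validity`, `refusal_error_rate`, `toxicity_error_rate`) and all the scores are assumptions for illustration; your eval harness will have its own names.

```python
def release_decision(contract: dict, base: dict, tuned: dict) -> tuple[bool, list[str]]:
    """Apply the contract thresholds and the tuned-vs-base rule; return (ok, failures)."""
    limits = contract["metrics"]
    failures = []
    if tuned["task_success"] < limits["task_success_min"]:
        failures.append("task_success below minimum")
    if tuned["task_success"] <= base["task_success"]:
        failures.append("tuned does not beat base on task_success")
    if tuned["schema_validity"] < limits["schema_validity_min"]:
        failures.append("schema_validity below minimum")
    for name in ("refusal", "toxicity"):
        regression = tuned[f"{name}_error_rate"] - base[f"{name}_error_rate"]
        if regression > limits[f"{name}_regression_max"]:
            failures.append(f"{name} regression {regression:.3f} exceeds limit")
    return (not failures, failures)


contract = {
    "metrics": {
        "task_success_min": 0.82,
        "schema_validity_min": 0.98,
        "refusal_regression_max": 0.02,
        "toxicity_regression_max": 0.00,
    }
}
# Made-up eval results for the base model and the tuned adapter.
base = {"task_success": 0.74, "schema_validity": 0.91, "refusal_error_rate": 0.03, "toxicity_error_rate": 0.01}
tuned = {"task_success": 0.86, "schema_validity": 0.99, "refusal_error_rate": 0.04, "toxicity_error_rate": 0.01}
ok, failures = release_decision(contract, base, tuned)
```

Run the same function once for the adapter and once for the merged model. If they disagree, that disagreement is the finding.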

Measure adapter serving and merged serving separately. Sometimes merging LoRA back into the base is operationally convenient. Sometimes adapter hot-swap is the better release pattern. Either can shift behavior enough to deserve golden prompt checks.

Serving and Handoff

Serving is not an afterthought.

For fast iteration, serve adapters on top of the base when your runtime supports it. You can compare adapter versions, roll back quickly, and avoid producing a new full checkpoint for every task. Merge weights only when a single artifact simplifies deployment enough to justify the loss of adapter flexibility.

Set decoding per task:

decoding_policy.yaml
classification:
  temperature: 0.0
  top_p: 1.0
  max_new_tokens: 32
summarization:
  temperature: 0.2
  top_p: 0.9
  max_new_tokens: 512
creative_draft:
  temperature: 0.8
  top_p: 0.95
  max_new_tokens: 1200

Deterministic or near-deterministic decoding is usually right for classification-like tasks. Creative tasks can tolerate more sampling. Do not let the serving defaults become hidden product behavior.
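One way to keep serving defaults from becoming hidden behavior is to make the lookup fail loudly. This sketch mirrors `decoding_policy.yaml` as a dictionary and refuses unknown tasks instead of falling back silently. The returned keys follow Hugging Face `generate()` argument names, where greedy decoding is expressed as `do_sample=False` rather than a zero temperature; other runtimes may spell this differently, so treat the mapping as an assumption.

```python
DECODING_POLICY = {
    "classification": {"temperature": 0.0, "top_p": 1.0, "max_new_tokens": 32},
    "summarization": {"temperature": 0.2, "top_p": 0.9, "max_new_tokens": 512},
    "creative_draft": {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 1200},
}


def generation_kwargs(task: str) -> dict:
    """Return explicit decoding settings for a task; unknown tasks are an error, not a default."""
    if task not in DECODING_POLICY:
        raise ValueError(f"no decoding policy defined for task {task!r}")
    policy = DECODING_POLICY[task]
    if policy["temperature"] == 0.0:
        # Greedy decoding: drop the sampling knobs instead of passing a zero temperature.
        return {"do_sample": False, "max_new_tokens": policy["max_new_tokens"]}
    return {"do_sample": True, **policy}


classification_kwargs = generation_kwargs("classification")
summarization_kwargs = generation_kwargs("summarization")
```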

Security, Privacy, and Reproducibility

The run should leave a paper trail:

  • Package versions.
  • Base model revision.
  • Dataset hashes.
  • Train, validation, and test split IDs.
  • Adapter config.
  • Random seeds.
  • Evaluation results.
  • Known failure modes.
  • License and data-provenance notes.
  • Red-team prompts and outcomes.

Security and privacy are part of that trail. Strip PII. Track data provenance. Maintain do-not-train lists. Validate commercial fine-tuning and redistribution rights. Red-team the tuned model before release, especially if it touches sensitive prompts, regulated workflows, code execution, tool calls, or customer-facing advice.
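Most of that paper trail can be captured by a small manifest written at the start of every run. The sketch below records package versions, the base model revision, dataset hashes, and the seed using only the standard library. The package list, the manifest keys, and the `<pinned-commit-sha>` placeholder are assumptions; the temporary file stands in for a real dataset path.

```python
import hashlib
import json
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name: str) -> str:
    """Installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_manifest(base_model: str, revision: str, datasets: dict[str, Path], seed: int) -> dict:
    return {
        "base_model": base_model,
        "base_model_revision": revision,
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in ("torch", "transformers", "peft")},
        "datasets": {split: sha256_file(path) for split, path in datasets.items()},
        "seed": seed,
    }


with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "train.jsonl"  # stand-in for the real training file
    train_path.write_text('{"messages": []}\n', encoding="utf-8")
    manifest = build_manifest("openai/gpt-oss-20b", "<pinned-commit-sha>", {"train": train_path}, seed=42)

manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Store the manifest next to the adapter. An adapter without a dataset hash and a base revision is a checkpoint nobody can reproduce or audit.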

Fine-tuning makes a model more yours.

It also makes the risks more yours.

Part 1 of 4. Next: The Faster Transformers Stack Behind GPT-OSS.
