Outcome focus: Defined a local 64GB MacBook Pro fine-tuning path for GPT-OSS 20B that prioritizes Harmony formatting, MLX quantized LoRA, small evals, and a clear fallback to NVIDIA when scale is required.
Tags: gpt-oss, mlx, apple silicon, llm fine-tuning, local ai
Part 4 of 4 in the LLM fine-tuning series.
The concrete machine is a 64GB Apple Silicon MacBook Pro.
The concrete model is openai/gpt-oss-20b.
The honest answer: yes, this is a plausible local fine-tuning setup for LoRA-style experimentation. No, it is not the same setup as the Hugging Face H100 tutorial, the MXFP4 Triton path, or a NeMo multi-GPU training job.
That difference is the whole recipe.
The Mac path should be MLX-first, evaluation-first, and Harmony-aware. Use it to prove the data, prompt format, adapter behavior, and evaluation loop. Move to NVIDIA only when the local loop proves the experiment deserves a bigger run.
The Hardware Reality
OpenAI's gpt-oss announcement says gpt-oss-20b is designed for local or specialized use cases and can run on edge devices with 16 GB of memory in the intended optimized paths. The Hugging Face model page says the smaller model can be fine-tuned on consumer hardware and shows Transformers, vLLM, Ollama, LM Studio, PyTorch/Triton, and other routes.
The memory numbers depend on the route.
On NVIDIA with MXFP4 support, GPT-OSS 20B can have a much smaller footprint. On a Mac, the NVIDIA/ROCm-specific Triton/MXFP4 assumptions from part 2 do not apply. You should think in terms of MLX quantized checkpoints, unified memory, small batches, short training sequences, and LoRA adapters.
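As a back-of-the-envelope sketch of why the route matters, raw weight storage scales with bits per parameter. The 21B total parameter count below is the figure given on the Hugging Face model page; the bit widths are illustrative. The estimate deliberately ignores quantization scales, activations, KV cache, adapter weights, and optimizer state, so read the results as lower bounds, not measurements.

```python
# Weights-only memory estimate. Assumption: about 21e9 total parameters,
# as stated on the Hugging Face model page for openai/gpt-oss-20b.
TOTAL_PARAMS = 21e9


def weights_gib(total_params: float, bits_per_param: float) -> float:
    """Return the size of the raw weights in GiB (2**30 bytes)."""
    return total_params * bits_per_param / 8 / 2**30


# Illustrative bit widths; real checkpoints add per-group scales on top.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weights_gib(TOTAL_PARAMS, bits):.1f} GiB")
```

On a 64GB machine the 16-bit figure leaves little headroom once the OS and training state are counted, which is why the quantized MLX route comes first.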
The practical working assumption:
64GB Apple Silicon Mac:
good for:
- local inference
- dataset and template debugging
- small LoRA or quantized LoRA experiments
- eval harness development
not ideal for:
- full SFT of 20B weights
- DPO at serious scale
- CUDA-only QLoRA/bitsandbytes recipes
- NeMo Megatron-Core training
- production throughput benchmarking
The tradeoff is worth it. Local iteration is private, cheap, and fast to start. Cloud GPUs are expensive, but they let you use the CUDA stack when the model behavior is worth scaling.
The Release Gate: Harmony Formatting
GPT-OSS is not a generic ChatML model.
The models were trained on OpenAI's Harmony response format. The OpenAI Harmony docs explain the conversation structure, channels, reasoning, tool calls, and renderer library. The OpenAI gpt-oss repository says the models should only be used with the Harmony format or they will not work correctly.
That means prompt formatting is a release gate.
If you fine-tune examples in the wrong message format, you are training protocol drift into the adapter. The model can look broken even when the weights are not the issue.
At minimum, keep data in message format and let the tokenizer or MLX path apply the model template:
{"messages":[{"role":"system","content":"Reasoning: low\nYou are a concise platform support assistant."},{"role":"user","content":"A governed table is stale. What should I check first?"},{"role":"assistant","content":"Check the freshness monitor and latest pipeline logs, then notify the dataset owner if a governed downstream asset is affected."}]}
For direct Transformers generation, the model card says the Transformers chat template automatically applies Harmony. If you use model.generate directly, apply the chat template or use the openai-harmony package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "Reasoning: low\nYou are concise."},
    {"role": "user", "content": "Explain LoRA in one sentence."},
]
rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered[:500])
Do this before training. A ten-second template check can save a useless overnight run.
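The template check covers rendering; a structural check on the JSONL itself catches problems even earlier. The sketch below assumes the `messages` record shape shown above. The allowed role set and the rule that every record ends on an assistant turn are assumptions for a supervised chat dataset, and the function names are hypothetical, so adjust them to your data.

```python
import json

# Assumption: the role names described in the Harmony docs. Trim this set
# if your dataset only uses system, user, and assistant.
ALLOWED_ROLES = {"system", "developer", "user", "assistant", "tool"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems for one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role, content = msg.get("role"), msg.get("content")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not an assistant turn")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems for every bad record in a file."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, 1):
            if line.strip():
                problems = validate_record(line)
                if problems:
                    report[number] = problems
    return report
```

Calling `validate_file` on each of the train, validation, and test files and requiring an empty report before any training command is one cheap way to make the release gate mechanical.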
The Recommended Path: MLX 8-bit LoRA
Use mlx-lm first. Its README describes generation, quantization, Hugging Face Hub integration, low-rank fine-tuning, full fine-tuning, and support for quantized models on Apple Silicon. The MLX LM LoRA guide documents mlx_lm.lora, dataset formats, --mask-prompt, evaluation, generation, fusing, and memory controls.
There are MLX-community GPT-OSS checkpoints, including mlx-community/gpt-oss-20b-mlx-q8, which is an 8-bit quantized GPT-OSS 20B build optimized for Apple Silicon.
Install:
python -m venv .venv
source .venv/bin/activate
pip install -U "mlx-lm[train]" datasets huggingface_hub
Create the smallest useful dataset:
gpt-oss-local-tune/
  data/
    gpt_oss/
      train.jsonl
      valid.jsonl
      test.jsonl
  adapters/
  reports/
    eval-notes.md
Start with a toy run before touching the full dataset:
mlx_lm.lora \
--model mlx-community/gpt-oss-20b-mlx-q8 \
--train \
--data data/gpt_oss \
--iters 50 \
--batch-size 1 \
--grad-accumulation-steps 16 \
--num-layers 4 \
--adapter-path adapters/smoke
The smoke run should prove only four things:
- The model loads.
- The data format is accepted.
- Loss reports without crashing.
- An adapter is written.
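The last item in that list is easy to script. This sketch assumes the adapter weights land as one or more `.safetensors` files directly inside the `--adapter-path` directory; it does not assume specific file names, and `adapter_files` is a hypothetical helper, not part of mlx-lm.

```python
from pathlib import Path


def adapter_files(adapter_dir: str) -> list[Path]:
    """Return non-empty .safetensors files found directly in adapter_dir."""
    root = Path(adapter_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob("*.safetensors") if p.stat().st_size > 0)
```

After the smoke run, `adapter_files("adapters/smoke")` should return at least one path. An empty list means the run finished without writing anything usable, or wrote it somewhere else.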
Then run a small real adapter:
mlx_lm.lora \
--model mlx-community/gpt-oss-20b-mlx-q8 \
--train \
--data data/gpt_oss \
--iters 600 \
--batch-size 1 \
--grad-accumulation-steps 32 \
--num-layers 8 \
--learning-rate 1e-4 \
--adapter-path adapters/domain-v1 \
--mask-prompt \
--grad-checkpoint
The exact flags should be verified against your installed mlx_lm.lora --help, because MLX evolves. The intent should not change: small batch, accumulated gradients, limited layer count at first, prompt masking for completion-style training, and gradient checkpointing if memory gets tight.
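It helps to translate the flags into how much data the run actually sees. The arithmetic below rests on an assumption worth confirming against your installed version: each iteration processes one micro-batch of `--batch-size` examples, and the optimizer updates once every `--grad-accumulation-steps` iterations. `run_summary` is a hypothetical helper.

```python
def run_summary(iters: int, batch_size: int, grad_accum_steps: int) -> dict:
    """Summarize a run, ASSUMING one micro-batch per iteration and one
    optimizer update every grad_accum_steps iterations."""
    return {
        "examples_seen": iters * batch_size,
        "effective_batch": batch_size * grad_accum_steps,
        "optimizer_updates": iters // grad_accum_steps,
    }


# The two commands from this section.
print("smoke:", run_summary(iters=50, batch_size=1, grad_accum_steps=16))
print("domain-v1:", run_summary(iters=600, batch_size=1, grad_accum_steps=32))
```

If that assumption holds, the 600-iteration run performs only a handful of optimizer updates, so check the iteration count against the dataset size before judging the adapter.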
Evaluate Before You Generate for Fun
Run the test set:
mlx_lm.lora \
--model mlx-community/gpt-oss-20b-mlx-q8 \
--adapter-path adapters/domain-v1 \
--data data/gpt_oss \
--test
Then compare base and adapter generations on the same prompts:
mlx_lm.generate \
--model mlx-community/gpt-oss-20b-mlx-q8 \
--prompt "A governed table is stale. What should I do first?" \
--max-tokens 300
mlx_lm.generate \
--model mlx-community/gpt-oss-20b-mlx-q8 \
--adapter-path adapters/domain-v1 \
--prompt "A governed table is stale. What should I do first?" \
--max-tokens 300
Keep an eval sheet:
model: mlx-community/gpt-oss-20b-mlx-q8
adapter: adapters/domain-v1
hardware: 64GB Apple Silicon MacBook Pro
task: platform support triage
gold_prompts: data/gpt_oss/test.jsonl
pass_rules:
  task_accuracy: "adapter beats base on human blind review"
  format: "no broken Harmony-style turns"
  safety: "no new unsafe operational advice"
  verbosity: "answers stay under 8 sentences unless asked"
rollback:
  condition: "adapter sounds better but loses required escalation steps"
The failure mode I expect is not a crash. It is a tuned adapter that sounds more helpful and quietly drops the policy-critical step. Human review catches that better than loss.
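Two of those rules can be pre-screened mechanically before a human looks at anything. The sketch below flags answers that exceed a sentence budget or omit a required phrase. The phrase lists are hypothetical and task-specific, the sentence splitter is deliberately crude, and a passing result is a filter for reviewers, not a substitute for them.

```python
import re


def count_sentences(text: str) -> int:
    """Crude count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def check_answer(answer: str, required_phrases: list[str],
                 max_sentences: int = 8) -> dict:
    """Flag missing required phrases (case-insensitive) and over-long answers."""
    lowered = answer.lower()
    missing = [p for p in required_phrases if p.lower() not in lowered]
    sentences = count_sentences(answer)
    return {
        "missing_phrases": missing,
        "sentences": sentences,
        "passed": not missing and sentences <= max_sentences,
    }


# Hypothetical required steps for the stale-table prompt.
answer = ("Check the freshness monitor and latest pipeline logs, then notify "
          "the dataset owner if a governed downstream asset is affected.")
print(check_answer(answer, ["freshness monitor", "notify the dataset owner"]))
```

Running the same check over base and adapter outputs makes the rollback condition concrete: an adapter answer that newly lands in `missing_phrases` is exactly the regression the sheet warns about.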
Memory Levers on 64GB
Use these in order:
- Lower sequence length by shortening examples.
- Use --batch-size 1.
- Increase --grad-accumulation-steps.
- Reduce --num-layers.
- Turn on --grad-checkpoint.
- Use a smaller or more aggressively quantized MLX checkpoint for iteration.
- Move to cloud GPUs if the task still needs more.
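The first lever is easier to pull when you know which examples are long. This sketch ranks JSONL records by approximate length. The default whitespace counter is only a proxy; passing a real tokenizer-based counter such as `lambda text: len(tokenizer.encode(text))` gives better numbers, and neither counts the tokens the chat template adds. The function names are hypothetical.

```python
import json
from typing import Callable, Iterable


def whitespace_tokens(text: str) -> int:
    """Cheap stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def longest_examples(lines: Iterable[str],
                     count_tokens: Callable[[str], int] = whitespace_tokens,
                     top: int = 5) -> list[tuple[int, int]]:
    """Return (line_number, approx_tokens) pairs, longest first."""
    sized = []
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        total = sum(count_tokens(m["content"]) for m in messages)
        sized.append((number, total))
    sized.sort(key=lambda item: item[1], reverse=True)
    return sized[:top]
```

Feeding it an open file handle for `data/gpt_oss/train.jsonl` lists the records to shorten or drop first.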
MLX LM also documents large-model behavior on macOS 15 or later, including wired memory. If the model fits in RAM but is slow because memory cannot stay wired, the README documents:
sudo sysctl iogpu.wired_limit_mb=49152
Do not paste that blindly. Choose a value appropriate for the machine, leave room for the OS and other applications, and understand that this is a system-level local tuning knob.
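The example value is simple arithmetic: 49152 MB is 48 GiB, which on a 64GB machine leaves 16 GiB for everything else. A small sketch of that calculation follows; the reserve size is an assumption you should choose yourself, and `wired_limit_mb` is a hypothetical helper.

```python
def wired_limit_mb(total_ram_gb: int, reserve_gb: int) -> int:
    """Return a wired-memory limit in MB, leaving reserve_gb for the OS and apps."""
    if reserve_gb <= 0 or reserve_gb >= total_ram_gb:
        raise ValueError("reserve_gb must be between 0 and total_ram_gb")
    return (total_ram_gb - reserve_gb) * 1024


print(wired_limit_mb(total_ram_gb=64, reserve_gb=16))  # 49152
```

A larger reserve is the safer starting point if a browser, an IDE, or other memory-hungry tools stay open during training.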
HF Transformers + MPS Fallback
Use Hugging Face + PyTorch MPS when you need compatibility with the broader Transformers/PEFT/TRL ecosystem more than native Apple Silicon throughput.
Install:
pip install -U torch transformers accelerate peft datasets
export PYTORCH_ENABLE_MPS_FALLBACK=1
Skeleton:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "openai/gpt-oss-20b"
device = "mps" if torch.backends.mps.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model.to(device)

lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.gradient_checkpointing_enable()
model.config.use_cache = False
This path may be tight or slow for the 20B model depending on exact macOS, PyTorch, model dtype, memory pressure, and unsupported operations. The point of including it is not to pretend it is the best Mac route. It is the compatibility fallback.
For GPT-OSS specifically, keep checking the official OpenAI and Hugging Face guidance. The OpenAI cookbook has a GPT-OSS Transformers guide and a GPT-OSS fine-tuning with Hugging Face Transformers guide. The fine-tuning guide is written around an H100-style environment, so do not copy its hardware assumptions onto your Mac.
The Local Workflow I Would Use
The first useful outcome is not "I trained GPT-OSS 20B on my laptop."
The first useful outcome is:
I can reproduce a local adapter run,
compare it against the base model,
show which examples improved,
show which examples regressed,
and explain whether the next dollar should go to data, evals, or NVIDIA compute.
That is the standard.
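The "improved" and "regressed" lists fall out of a per-prompt comparison once each prompt has a score for both models. This sketch assumes you already have such scores, for example from blind human review, keyed by a prompt ID. The scoring scale, the IDs, and `compare_runs` are all hypothetical.

```python
def compare_runs(base: dict[str, float], adapter: dict[str, float],
                 tolerance: float = 0.0) -> dict[str, list[str]]:
    """Bucket prompt IDs scored in both runs into improved/regressed/unchanged."""
    shared = sorted(base.keys() & adapter.keys())
    improved = [k for k in shared if adapter[k] - base[k] > tolerance]
    regressed = [k for k in shared if base[k] - adapter[k] > tolerance]
    unchanged = [k for k in shared if k not in improved and k not in regressed]
    return {"improved": improved, "regressed": regressed, "unchanged": unchanged}


# Hypothetical review scores on a 1-to-3 scale.
base_scores = {"stale-table": 1, "access-request": 3, "schema-change": 2}
adapter_scores = {"stale-table": 2, "access-request": 1, "schema-change": 2}
print(compare_runs(base_scores, adapter_scores))
```

Keeping the regressed list next to the improved one in reports/eval-notes.md is what makes the "next dollar" decision defensible.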
When to Leave the Mac
Leave the Mac when:
- The dataset and eval loop are stable.
- The adapter improves the product task.
- The run is blocked by memory even after reducing sequence length, batch, and layer count.
- You need CUDA-only QLoRA or MXFP4 training behavior.
- You need DPO at meaningful scale.
- You need NeMo/Megatron-Core, DeepSpeed, or multi-node training.
- You need production-serving benchmarks on vLLM, TensorRT-LLM, Triton, or NIM.
Do not leave because local training feels less glamorous.
Leave when the evidence says scale is the next bottleneck.
Related Notes
- Fine-Tuning LLMs on a MacBook Pro With MPS and MLX
- Fine-Tuning LLMs Is an Operating Loop, Not a Training Command
- The Faster Transformers Stack Behind GPT-OSS
Part 4 of 4. Previous: Fine-Tuning LLMs on a MacBook Pro With MPS and MLX.