Outcome focus: Separated Mac-local MPS and MLX fine-tuning paths from NVIDIA-only training features so local experiments can start with realistic hardware expectations.
Tags: apple silicon, mlx, pytorch, mps, llm fine-tuning
Part 3 of 4 in the LLM fine-tuning series.
A MacBook Pro can be a serious LLM prototyping machine.
It is not a small NVIDIA cluster hiding under a keyboard.
That distinction is where a lot of local fine-tuning advice breaks. A developer sees a Hugging Face recipe using QLoRA, bitsandbytes, Triton kernels, Flash Attention, MXFP4, or NeMo multi-GPU training and assumes the same switches can be moved to Apple Silicon. Then the install works halfway, the model loads slowly, an op falls back to CPU, or the quantized training path does not exist for MPS.
The Mac did not fail.
The CUDA recipe was imported into the wrong runtime.
The right question is not "can I fine-tune on a Mac?" Yes, often. The better question is which stack fits the job: PyTorch MPS, Apple MLX, or a cloud NVIDIA machine.
The Decision Tree
The concrete scenario: you have an Apple Silicon MacBook Pro and want to adapt an open model for a domain task. You do not want to rent an H100 before you know whether the data and evals are any good. That is a healthy instinct.
The real tradeoff is local iteration speed versus runtime capability. Local runs reduce cost, latency, and data movement. NVIDIA cloud runs unlock CUDA-only quantization, large distributed training, and production-like serving benchmarks.
Use the Mac to learn cheaply. Move to NVIDIA when the experiment earns scale.
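The choice between the three stacks can be written down as a small helper. This is only a sketch of the reasoning in this post: `Job`, its fields, and `choose_stack` are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a fine-tuning job; field names are illustrative."""
    needs_cuda_only_features: bool = False  # e.g. bitsandbytes QLoRA, NeMo, multi-GPU
    needs_hf_training_api: bool = False     # must stay on Transformers/PEFT/TRL code


def choose_stack(job: Job) -> str:
    """Return the stack this post would try first for the given job."""
    if job.needs_cuda_only_features:
        return "cloud NVIDIA"
    if job.needs_hf_training_api:
        return "PyTorch MPS"
    return "MLX"


print(choose_stack(Job()))                               # MLX
print(choose_stack(Job(needs_hf_training_api=True)))     # PyTorch MPS
print(choose_stack(Job(needs_cuda_only_features=True)))  # cloud NVIDIA
```

The order of the checks is the design choice: a CUDA-only requirement overrides everything else, and MLX is the default when nothing forces a different runtime.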
PyTorch MPS
Apple's Accelerated PyTorch training on Mac page explains the core path: PyTorch uses the Metal Performance Shaders backend for GPU acceleration on Apple Silicon. PyTorch exposes that as the mps device, and the PyTorch MPS backend docs show the basic pattern.
    import torch

    if torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    print(device)

Move tensors and modules the same way you would for CUDA:
    model.to("mps")
    batch = {k: v.to("mps") for k, v in batch.items()}

Hugging Face Accelerate on MPS makes this easier for normal training scripts and explicitly frames Apple Silicon as useful for prototyping and fine-tuning locally. It also names the sharp limitation: the gloo and nccl distributed backends do not work with the mps device in that Accelerate path, so treat MPS as single-device.
In practice, I would set the CPU fallback environment variable during early experiments:
    export PYTORCH_ENABLE_MPS_FALLBACK=1

Fallback is not a performance strategy. It is a debugging convenience. If important operations fall back to CPU, the run can become unexpectedly slow.
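The same flag can be set from inside a script instead of the shell. A minimal sketch, assuming the variable has to be in the environment before `torch` is imported; `pick_device` is a hypothetical helper name, and it returns "cpu" when PyTorch is not installed so the snippet runs anywhere.

```python
import os

# Assumption: set the flag before torch is imported so the MPS backend sees it.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device() -> str:
    """Return "mps" when the backend is available, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch in this environment; nothing to accelerate.
        return "cpu"
    return "mps" if torch.backends.mps.is_available() else "cpu"


print(pick_device())
```

Remove the flag once the run is stable so that any unsupported operation fails loudly again rather than silently moving to the CPU.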
A Minimal HF LoRA-on-MPS Skeleton
Hugging Face plus MPS is useful when you want to stay close to the standard Transformers, PEFT, TRL, and Datasets ecosystem.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-1.5B-Instruct"
    device = "mps" if torch.backends.mps.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )
    model.to(device)

    # Trade compute for memory; the KV cache is not useful during training.
    model.gradient_checkpointing_enable()
    model.config.use_cache = False

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )
    # Wrap after enabling checkpointing so PEFT can prepare the frozen base model for it.
    model = get_peft_model(model, lora_config)

This is a compatibility path, not the fastest Mac path for every workload. MPS support has improved, but CUDA remains the main target for many Hugging Face performance features. If the model is too large, the dtype is unsupported, or an op falls back, your Mac may feel slower than the specs suggest.
The mistake is interpreting that as "Macs cannot do ML." The better reading is "this framework path is not the native Mac path for this run."
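Whether a model is "too large" can be checked with rough arithmetic before loading anything. The sketch below counts weight memory only (no activations, gradients, or optimizer state) and the extra parameters LoRA adds to one linear layer. The helper names and the 2048 x 2048 layer shape are hypothetical, chosen for illustration rather than taken from any specific model.

```python
def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone; float16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to one d_out x d_in layer."""
    return r * (d_in + d_out)


# A nominal 1.5B-parameter model in float16.
print(weight_bytes(1_500_000_000) / 1e9)  # 3.0 (GB)

# Hypothetical 2048 x 2048 projection with rank 8.
print(lora_params(2048, 2048, r=8))       # 32768
print(2048 * 2048)                        # 4194304 parameters in the frozen layer
```

The adapter is under 1% of the layer it sits beside, which is why LoRA fits on a laptop: the frozen base weights dominate memory, not the trainable parameters.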
MLX and mlx-lm
Apple's MLX ecosystem is the more native path for Apple Silicon LLM work. The mlx-lm README describes it as a package for generating text and fine-tuning large language models on Apple Silicon with MLX. It integrates with the Hugging Face Hub, supports quantization, and supports low-rank and full-model fine-tuning, including quantized models.
Install:
    pip install -U "mlx-lm[train]" datasets

Generate:
    mlx_lm.generate \
      --model mlx-community/Llama-3.2-3B-Instruct-4bit \
      --prompt "Explain LoRA in three sentences."

Fine-tune with LoRA:
    mlx_lm.lora \
      --model mlx-community/Llama-3.2-3B-Instruct-4bit \
      --train \
      --data data/local_lora \
      --iters 600 \
      --batch-size 1 \
      --grad-accumulation-steps 16 \
      --adapter-path adapters/domain-v1

The MLX LM LoRA guide is the page I would keep open. It documents mlx_lm.lora, LoRA, DoRA, full fine-tuning, quantized LoRA behavior, dataset formats, prompt masking, evaluation, generation, fusing, and memory-reduction levers.
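The batch flags in the LoRA command trade memory for update quality: a batch size of 1 keeps peak memory low, and gradient accumulation restores a larger effective batch. A small worked example of that arithmetic, with a hypothetical helper name:

```python
def effective_batch_size(batch_size: int, grad_accumulation_steps: int) -> int:
    """Examples contributing to each optimizer update."""
    return batch_size * grad_accumulation_steps


# Values from the command: --batch-size 1 --grad-accumulation-steps 16
print(effective_batch_size(1, 16))  # 16
```

How `--iters` interacts with accumulation (whether it counts micro-batches or optimizer updates) is worth confirming in the guide before comparing runs that use different settings.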
The most important local data shape is simple:
    {"messages":[{"role":"system","content":"You are a concise internal support assistant."},{"role":"user","content":"A pipeline failed. What should I inspect first?"},{"role":"assistant","content":"Start with the latest run logs, then check freshness and owner metadata for the affected dataset."}]}

Then evaluate:
    mlx_lm.lora \
      --model mlx-community/Llama-3.2-3B-Instruct-4bit \
      --adapter-path adapters/domain-v1 \
      --data data/local_lora \
      --test

And generate:
    mlx_lm.generate \
      --model mlx-community/Llama-3.2-3B-Instruct-4bit \
      --adapter-path adapters/domain-v1 \
      --prompt "A governed table is stale. What should I do?"

MLX also has Mac-specific memory behavior worth knowing. The README notes that large models relative to total RAM can be slow, and on macOS 15 or later mlx-lm can wire memory occupied by the model and cache. It also documents increasing iogpu.wired_limit_mb when a model fits in RAM but needs more wired memory headroom.
That is not a casual setting. Treat it as a deliberate local performance adjustment, not a default.
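The chat-format record shown earlier in this section is one line of a JSONL file. Below is a minimal sketch for producing those files, assuming the `--data` directory holds `train.jsonl`, `valid.jsonl`, and `test.jsonl`. The helper names are hypothetical, and writing the same example to every split is a placeholder only; real splits should not overlap.

```python
import json
from pathlib import Path


def chat_record(system: str, user: str, assistant: str) -> dict:
    """Build one training example in the chat "messages" shape."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}


def write_jsonl(path: Path, records: list) -> None:
    """Write one JSON object per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


examples = [chat_record(
    "You are a concise internal support assistant.",
    "A pipeline failed. What should I inspect first?",
    "Start with the latest run logs, then check freshness and owner "
    "metadata for the affected dataset.",
)]

# Assumption: split file names expected inside the --data directory.
data_dir = Path("data/local_lora")
for split in ("train", "valid", "test"):
    write_jsonl(data_dir / f"{split}.jsonl", examples)
```

Building records through `json.dumps` rather than string formatting avoids the most common local failure: a stray quote or newline inside a field that breaks a single line of the file.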
What Does Not Fit on macOS
The Hugging Face bitsandbytes docs are clear that bitsandbytes provides quantized layers and 8-bit optimizers, including the 4-bit quantization that QLoRA relies on, with current backend support centered on NVIDIA CUDA plus Intel XPU, Intel Gaudi, and CPU paths. The CUDA path is the one most LLM QLoRA tutorials assume.
So do not expect this to be the Mac path:
    from transformers import BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(load_in_4bit=True)

For Apple Silicon quantized LoRA, use MLX unless you have a specific reason not to.
Likewise, the GPT-OSS faster Transformers features from part 2 have hardware boundaries:
- MXFP4 Triton kernels are not the Apple Silicon path.
- Flash Attention 3 kernels discussed there are not the Apple Silicon path.
- Most NeMo training workflows are built for NVIDIA GPUs and distributed CUDA infrastructure.
- DeepSpeed ZeRO and NCCL-style distributed training are not MPS features.
- vLLM-style CUDA serving optimizations do not become Metal features by changing one device string.
The Mac has its own advantages: unified memory, local development, low setup friction, private data experiments, and native MLX tooling. Use those advantages instead of trying to cosplay a CUDA box.
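A shared training script can make that boundary explicit instead of failing halfway through an install. A minimal sketch using only the standard library; the function names and returned labels are hypothetical, and a non-Mac result still requires a backend that bitsandbytes supports.

```python
import platform
from typing import Optional


def is_apple_silicon(system: Optional[str] = None, machine: Optional[str] = None) -> bool:
    """True on macOS with an arm64 CPU; arguments allow overriding for tests."""
    system = system or platform.system()
    machine = machine or platform.machine()
    # Assumption: a Python build running under Rosetta may report x86_64 here.
    return system == "Darwin" and machine == "arm64"


def quantized_lora_path(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Pick the quantized LoRA toolchain this post recommends for the platform."""
    if is_apple_silicon(system, machine):
        return "mlx-lm"
    return "bitsandbytes"


print(quantized_lora_path("Darwin", "arm64"))  # mlx-lm
print(quantized_lora_path("Linux", "x86_64"))  # bitsandbytes
```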
llama.cpp, Ollama, and GGUF
For inference-only work, GGUF runners remain useful. llama.cpp, Ollama, and LM Studio are often the fastest way to test a local model interactively, especially with quantized weights and Metal acceleration.
But inference convenience is not the same as fine-tuning control.
Use them for:
- local qualitative evals,
- prompt debugging,
- comparison against base behavior,
- lightweight demos,
- checking whether a model family is worth deeper tuning.
Use MLX or Hugging Face training paths for adapter work.
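For the qualitative evals in the list above, a local runner can be scripted as well as used interactively. The sketch below targets Ollama's local HTTP API with the standard library only, assuming a server on the default port and a model that has already been pulled. The model name in the usage comment is a placeholder, and `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


PROMPTS = [
    "A pipeline failed. What should I inspect first?",
    "A governed table is stale. What should I do?",
]

# Usage, with a server running ("llama3.2" is a placeholder model name):
# for prompt in PROMPTS:
#     print(generate("llama3.2", prompt))
```

Keeping the prompt list fixed gives a baseline of base-model answers to compare against once an adapter exists.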
What I Would Do First
If I had a MacBook Pro and wanted to fine-tune an LLM, I would start with this path:
- Build the dataset and eval harness locally.
- Run base-model inference with MLX or a GGUF runner.
- Run a tiny MLX LoRA experiment on a small model first.
- Confirm the adapter saves, evaluates, and loads.
- Move to the target model only after the loop works.
- Use HF + MPS only when the standard Hugging Face API path matters.
- Move to NVIDIA when the task requires CUDA-only quantization, full SFT, DPO at scale, NeMo, or production-serving benchmarks.
That sequence keeps the Mac where it shines: fast learning before expensive training.
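The "confirm the adapter saves" step in the list above can be a two-line check before any evaluation runs. A minimal sketch: the file names are an assumption about what mlx-lm writes into `--adapter-path` and should be verified against an actual run, and `missing_adapter_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumption: files written into --adapter-path; adjust to match your run.
EXPECTED_FILES = ("adapters.safetensors", "adapter_config.json")


def missing_adapter_files(adapter_dir: str, expected=EXPECTED_FILES) -> list:
    """Return the expected file names that are absent from adapter_dir."""
    root = Path(adapter_dir)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means every expected file is present.
print(missing_adapter_files("adapters/domain-v1"))
```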
Related Notes
- Fine-Tuning LLMs Is an Operating Loop, Not a Training Command
- The Faster Transformers Stack Behind GPT-OSS
Part 3 of 4. Previous: The Faster Transformers Stack Behind GPT-OSS. Next: Fine-Tuning GPT-OSS 20B on a 64GB MacBook Pro.