A practical, no-fluff guide to choosing a large language model on Hugging Face, setting up your local environment, downloading a model, preparing data, and fine-tuning it for your specific use case, written for people who are new to the whole pipeline.
This guide walks the complete path from zero knowledge to running and retraining a real LLM locally. We cover the questions to ask before picking any model, how to create your Hugging Face account, download a model with the CLI, set up your Python environment, understand what data you can use, and walk through concrete retraining use cases.
Read sequentially if you are completely new. Experienced practitioners can jump to any chapter; each is self-contained. All code is tested on Python 3.10+ and works on both CUDA (NVIDIA) and Apple Silicon (MPS).
Picking the wrong model wastes weeks. Before opening Hugging Face, answer these questions honestly. There are no wrong answers, only answers that point you to the right model family.
Hugging Face hosts hundreds of thousands of models, but most derive from a small set of foundation families. Knowing the families shortcuts the search dramatically.
| Family | Best For | Sizes Available | Licence |
|---|---|---|---|
| Llama 3 / 3.1 / 3.2 (Meta) | General-purpose, instruction following, coding, reasoning | 1B, 3B, 8B, 70B, 405B | Llama Community (commercial use allowed under 700M MAU) |
| Mistral / Mixtral | Fast inference, European language support, MoE efficiency | 7B, 8×7B, 8×22B | Apache 2.0 (fully open) |
| Phi-3 / Phi-4 (Microsoft) | Tiny but capable: on-device, constrained hardware | 3.8B, 7B, 14B | MIT |
| Qwen 2.5 (Alibaba) | Multilingual, math, code, long-context | 0.5B–72B | Apache 2.0 |
| Gemma 2 (Google) | Safe, efficient, strong benchmarks at small sizes | 2B, 9B, 27B | Gemma Terms (custom open) |
| DeepSeek-R1 / V3 | Reasoning, math, code | 7B–671B | MIT |
| BERT / RoBERTa / DeBERTa | Classification, NER, embeddings (encoder-only) | 110M–435M | Apache 2.0 |
| Whisper (OpenAI) | Audio → text transcription | tiny to large-v3 | MIT |
**Base models.** Pre-trained on raw text. They complete text; they don't follow instructions. Use these as the starting point for your own fine-tuning pipeline. Examples: meta-llama/Llama-3.1-8B, mistralai/Mistral-7B-v0.1.

**Instruct models.** Fine-tuned to follow prompts and chat. Use these for immediate deployment or as a starting point for task-specific fine-tuning. Examples: meta-llama/Llama-3.1-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.3.
| Model Size | 4-bit (VRAM) | 16-bit (VRAM) | Minimum Hardware |
|---|---|---|---|
| 1–3B params | ~1–2 GB | ~3–6 GB | Any modern laptop, M1 8GB |
| 7–8B params | ~5–6 GB | ~14–16 GB | M1 Pro 16GB, RTX 3060 |
| 13B params | ~9 GB | ~26 GB | M2 Max 32GB, RTX 3090 |
| 34–40B params | ~22 GB | ~70 GB | M2 Ultra 64GB, A6000 |
| 70B params | ~40 GB | ~140 GB | Multi-GPU, A100 80GB |
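The table's numbers follow from a simple rule of thumb: weights alone take roughly `params × bytes_per_param`, plus some margin for activations and the KV cache. A back-of-envelope helper (my own sketch, not from any library — the 20% overhead figure is an assumption):

```python
def estimate_vram_gb(n_params_billion: float, bits: int, overhead: float = 0.2) -> float:
    """Rough VRAM needed to load a model: weights plus a safety margin.

    n_params_billion: parameter count in billions (e.g. 8 for Llama-3.1-8B)
    bits: weight precision (16 for fp16/bf16, 8 or 4 for quantised)
    overhead: extra fraction for activations, KV cache, framework buffers
    """
    weight_gb = n_params_billion * 1e9 * (bits / 8) / 1e9
    return weight_gb * (1 + overhead)

# An 8B model in 16-bit: ~19 GB with overhead (the table's ~14-16 GB is weights only)
print(f"{estimate_vram_gb(8, 16):.1f} GB")  # 19.2 GB
# The same model in 4-bit: ~5 GB, matching the table
print(f"{estimate_vram_gb(8, 4):.1f} GB")   # 4.8 GB
```

If the estimate exceeds your VRAM (or unified memory) even at 4-bit, drop to a smaller model family rather than fighting out-of-memory errors.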
```bash
# Install the CLI (comes with the huggingface_hub package)
pip install huggingface_hub

# Log in - this saves your token to ~/.cache/huggingface/token
huggingface-cli login
# Paste your token when prompted.
# Your token is now saved - no need to pass it to every API call.

# Verify you're logged in
huggingface-cli whoami
```
Downloaded models are stored in ~/.cache/huggingface/hub/ by default. Each model is stored once and shared across all scripts that reference it. You can change the location with the environment variable:
```bash
# Change the cache directory (add to your .zshrc or .bashrc)
export HF_HOME=/path/to/your/drive/huggingface_cache
# For large models (70B+), point this to an external SSD
```
Method 1: CLI snapshot download (recommended for first-timers)
```bash
# Download an entire model to a local folder
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
  --local-dir ./models/llama-3.2-3b-instruct

# Download a specific file only (e.g. the config)
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct config.json
```
Method 2: Python API (most flexible)
```python
from huggingface_hub import snapshot_download

# Downloads to ~/.cache/huggingface/hub/ automatically
# Subsequent calls use the cache - no re-download
model_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    ignore_patterns=["*.pt", "*.ot"],  # skip legacy weights if present
)
print(f"Model cached at: {model_path}")
```
Method 3: via transformers (laziest, good for prototyping)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # use bfloat16 on newer hardware
    device_map="auto",          # auto-detects CUDA, MPS, or CPU
)

# Quick test
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
If your hardware can't fit the full model, use a quantised version. The easiest way is via bitsandbytes for CUDA or pre-quantised GGUF files via llama-cpp-python for Apple Silicon.
```python
# 4-bit quantisation on load (CUDA / bitsandbytes)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NF4 is the best-quality 4-bit type
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```
```bash
# Apple Silicon: use GGUF via Ollama (easiest local setup)
# Install: https://ollama.com
ollama pull llama3.2:3b
ollama run llama3.2:3b
```
```python
from transformers import AutoConfig

# Check the config loads without errors
config = AutoConfig.from_pretrained("./models/llama-3.2-3b-instruct")
print(f"Architecture: {config.architectures}")
print(f"Context length: {config.max_position_embeddings}")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")
```
| Data Type | Used For | Format | Minimum Quantity |
|---|---|---|---|
| Instruction pairs | Supervised fine-tuning (SFT) | {"prompt": "...", "response": "..."} | 500–2,000 high quality |
| Conversation threads | Chat fine-tuning | ShareGPT / ChatML format | 1,000+ turns |
| Raw domain text | Continued pre-training | Plain text, JSONL | Millions of tokens |
| Preference pairs | DPO / RLHF alignment | {"chosen": "...", "rejected": "..."} | 1,000+ pairs |
| Classification labels | Fine-tuning classifiers | {"text": "...", "label": 0} | 500+ per class |
Every format in the table is supported by the `datasets` library, which can load any of them in one line.

What "quality" means for instruction data: clear, unambiguous instructions; responses that are complete and correct; diversity across topics and styles; no contradictions within the dataset; no degenerate examples (empty strings, encoding errors, repetitive loops).
```python
from datasets import load_dataset

# Load a popular instruction dataset from the Hub
ds = load_dataset("HuggingFaceH4/ultrachat_200k")

# Load your own JSONL file
ds = load_dataset("json", data_files={"train": "my_data.jsonl"})

# Basic quality checks
print(f"Total examples: {len(ds['train'])}")
print(f"Columns: {ds['train'].column_names}")
print(f"First example: {ds['train'][0]}")

# Filter out short responses (likely low quality)
ds = ds.filter(lambda x: len(x["response"]) > 100)
```
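Beyond length filtering, exact duplicates are worth catching before training: a repeated example is effectively a higher-weighted one. A stdlib-only sketch using normalised hashing (`dedupe` is my own helper; the `prompt`/`response` field names match the JSONL format above):

```python
import hashlib

def dedupe(examples: list[dict]) -> list[dict]:
    """Drop exact duplicates after light normalisation (case, whitespace)."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            " ".join((ex["prompt"] + ex["response"]).lower().split()).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"prompt": "What is LoRA?", "response": "A low-rank adapter method."},
    {"prompt": "what is lora?", "response": "A  low-rank adapter method."},  # case/space variant
    {"prompt": "Define SFT.", "response": "Supervised fine-tuning."},
]
print(len(dedupe(data)))  # 2 - the normalised duplicate is dropped
```

For near-duplicates that differ by more than casing and whitespace, MinHash or embedding similarity is the heavier-weight follow-up.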
```bash
# Python 3.10 or 3.11 recommended (3.12 has some library gaps)
python --version

# Create a dedicated environment for LLM work
python -m venv llm_env
source llm_env/bin/activate   # Linux / macOS
# llm_env\Scripts\activate    # Windows

# Or with conda (easier for CUDA management)
conda create -n llm_env python=3.11
conda activate llm_env
```
```bash
# Core Hugging Face stack
pip install transformers datasets accelerate peft trl

# Weights & Biases for experiment tracking (optional but recommended)
pip install wandb

# For NVIDIA GPU (CUDA 12.1 - match your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For Apple Silicon (MPS backend is built into PyTorch)
pip install torch torchvision torchaudio

# For 4-bit quantisation on CUDA (bitsandbytes)
pip install bitsandbytes

# For Flash Attention 2 (speeds up training significantly on CUDA)
pip install flash-attn --no-build-isolation
```
```python
import torch
import transformers

# Check PyTorch version and device availability
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

# Should print device info - not "cpu" if you have a GPU
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using device: {device}")

# Quick GPU memory check (CUDA only)
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```
```
my_llm_project/
├── data/
│   ├── raw/             # original source files
│   ├── processed/       # cleaned JSONL ready for training
│   └── splits/          # train / val / test splits
├── models/
│   └── base/            # downloaded base model (or symlink to HF cache)
├── checkpoints/         # saved during training
├── outputs/             # final merged model weights
├── scripts/
│   ├── prepare_data.py
│   ├── train.py
│   └── evaluate.py
├── configs/
│   └── lora_config.yaml
└── requirements.txt
```
Add `.env` to your `.gitignore` and store your HF token there. Load it with python-dotenv. Never commit access tokens to version control, even in private repos.

LoRA (Low-Rank Adaptation) is the standard fine-tuning technique for LLMs on consumer hardware. Instead of updating all billions of parameters, LoRA injects small trainable weight matrices at key layers. This reduces memory by 10–100× with minimal quality loss.
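For the curious, python-dotenv's `load_dotenv()` does essentially the following; a dependency-free sketch of the mechanics (`load_env_file` and the `HF_TOKEN` variable name are my own conventions, not requirements):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): reads KEY=value lines.

    setdefault() means a variable already present in the environment wins,
    matching dotenv's default behaviour.
    """
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# .env contains e.g.:  HF_TOKEN=hf_xxxxxxxx
load_env_file()
token = os.environ.get("HF_TOKEN")  # pass via token=... to from_pretrained() if needed
```

In practice just `pip install python-dotenv` and call `load_dotenv()`; the point is that the token never appears in your source files.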
```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # or SEQ_CLS for classification
    r=16,               # rank: higher = more capacity, more memory (try 8-64)
    lora_alpha=32,      # scaling factor - rule of thumb: 2x rank
    lora_dropout=0.05,
    # Which layers to target - these are standard for most LLaMA-family models
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)
```
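The memory win is easy to quantify. Each targeted weight matrix of shape `d_in × d_out` gains two low-rank factors, `A (d_in × r)` and `B (r × d_out)`, so the trainable parameters per matrix are `r × (d_in + d_out)`. A quick calculation (the dimensions below are illustrative 3B-class assumptions, not exact figures for any specific checkpoint):

```python
def lora_params(r: int, layer_shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_block * n_layers

# Hypothetical dims: hidden 3072, grouped-KV at 1024, MLP 8192, 28 blocks
shapes = [
    (3072, 3072), (3072, 1024), (3072, 1024), (3072, 3072),  # q, k, v, o
    (3072, 8192), (3072, 8192), (8192, 3072),                # gate, up, down
]
total = lora_params(r=16, layer_shapes=shapes, n_layers=28)
print(f"{total:,} trainable params")  # tens of millions, vs ~3 billion total
```

Doubling `r` doubles the adapter size linearly, which is why `r=8` to `r=16` is the usual starting range rather than `r=256`.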
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
from trl import SFTTrainer
from datasets import load_dataset
import torch

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
DATA_PATH = "./data/processed/train.jsonl"
OUTPUT_DIR = "./checkpoints"

# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # required for batching

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Apply LoRA
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints: trainable params: 41,943,040 || all params: 3,254,878,208 || 1.29%

# 3. Load dataset
dataset = load_dataset("json", data_files={"train": DATA_PATH})

def format_prompt(example):
    return {"text": f"### Instruction:\n{example['prompt']}\n\n### Response:\n{example['response']}"}

dataset = dataset.map(format_prompt)

# 4. Training arguments
args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 16
    warmup_ratio=0.03,
    learning_rate=2e-4,
    bf16=True,                      # bfloat16 on modern hardware
    logging_steps=10,
    save_strategy="epoch",
    report_to="wandb",              # or "none" to disable
)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()

# 6. Save the LoRA adapter (small - only a few hundred MB)
model.save_pretrained("./outputs/lora_adapter")
tokenizer.save_pretrained("./outputs/lora_adapter")
```
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model + adapter, then merge
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.float16,
    device_map="cpu",  # merge on CPU to avoid a VRAM spike
)
model = PeftModel.from_pretrained(base_model, "./outputs/lora_adapter")
merged = model.merge_and_unload()  # fuses adapter weights into the base

# Save the complete merged model
merged.save_pretrained("./outputs/merged_model")
AutoTokenizer.from_pretrained("./outputs/lora_adapter").save_pretrained(
    "./outputs/merged_model"
)
```
| Parameter | Recommended Start | Effect of Increasing |
|---|---|---|
| `r` (LoRA rank) | 8–16 | More capacity, more memory, slower training |
| `learning_rate` | 2e-4 | Too high → loss diverges. Too low → slow convergence |
| `num_train_epochs` | 2–5 | Too many → overfitting (model memorises training data) |
| `max_seq_length` | 1024–2048 | More context, but quadratic memory cost |
| `gradient_accumulation_steps` | 4–8 | Simulates a larger batch without extra VRAM |
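Two of these knobs interact: the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps × n_gpus`, and together with your dataset size it fixes the number of optimiser steps. Worth computing before launch (a small helper of my own, not a library function):

```python
import math

def training_steps(n_examples: int, per_device_batch: int,
                   grad_accum: int, n_gpus: int = 1, epochs: int = 3) -> tuple[int, int]:
    """Return (effective_batch_size, total_optimiser_steps)."""
    effective = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective)
    return effective, steps_per_epoch * epochs

# 2,000 examples with the chapter's defaults: batch 2, accumulation 8, 3 epochs
eff, steps = training_steps(2000, per_device_batch=2, grad_accum=8)
print(eff, steps)  # 16 effective batch, 375 optimiser steps total
```

If the total comes out under a hundred steps or so, the warmup ratio and logging cadence both need rethinking; at several thousand steps, consider fewer epochs.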
Set `load_best_model_at_end=True` in `TrainingArguments` to automatically keep the best checkpoint.

Fine-tuning is not always the right answer. This chapter maps common use cases to the appropriate model choice and training strategy.
- **Medical and healthcare**: summarising clinical notes, ICD coding, drug interaction QA. Use BioMedLM or fine-tune Mistral-7B on PubMed + MIMIC. Domain vocabulary drifts heavily from general pre-training. Verdict: fine-tune recommended.
- **Legal**: contract clause extraction, case summarisation, precedent search. Fine-tune on LexGLUE or proprietary contracts. Instruction-tuned general models struggle with precise legal phrasing. Verdict: fine-tune + RAG.
- **Coding assistants**: auto-complete, bug detection, test generation. DeepSeek-Coder and CodeLlama are already specialised. Fine-tune on your codebase for proprietary APIs and internal conventions. Verdict: specialised model first.
- **Customer support**: answer FAQs, route tickets, escalate appropriately. Fine-tune on historical ticket/resolution pairs. A 3B model fine-tuned on your data beats GPT-4 on your specific ticket vocabulary. Verdict: small model + fine-tune.
- **Document summarisation**: summarise reports, emails, research papers. Long-context models (8K+) work out of the box. Only fine-tune if you need a specific summary format or domain style. Verdict: base instruct model.
- **Low-resource languages**: models trained primarily on English underperform on Swahili, Welsh, Tagalog, etc. Continued pre-training on target-language text dramatically improves quality. Verdict: continued pre-training.
- **Named entity recognition**: extracting people, organisations, locations, and custom entities from text. Fine-tune a BERT/DeBERTa encoder, which is much smaller and faster than a generative model for pure classification. Verdict: encoder model (BERT).
- **Speech pipelines**: transcribe audio with Whisper, then feed it into an LLM for Q&A or summarisation. Fine-tune Whisper on domain-specific speech (accents, jargon, background noise). Verdict: Whisper + LLM pipeline.
- **Knowledge retrieval**: embed documents and retrieve relevant chunks at query time. Use sentence-transformers (BAAI/bge, E5) for the embedding model. Often more effective than fine-tuning for knowledge retrieval. Verdict: embedding model + RAG.

```python
import json
import torch
from transformers import pipeline

# Load your merged model
pipe = pipeline(
    "text-generation",
    model="./outputs/merged_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load a held-out test set (never seen during training)
with open("./data/splits/test.jsonl") as f:
    test_examples = [json.loads(l) for l in f]

# Run inference and compare
results = []
for ex in test_examples[:100]:
    output = pipe(ex["prompt"], max_new_tokens=200, do_sample=False)[0]
    results.append({
        "prompt": ex["prompt"],
        "expected": ex["response"],
        "generated": output["generated_text"],
    })

# Save for human review or automated scoring
with open("eval_results.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")
```
| Metric | Use When | Library |
|---|---|---|
| ROUGE-L | Summarisation tasks | evaluate (HF) |
| BLEU | Translation | sacrebleu |
| Exact Match / F1 | QA with ground truth | evaluate |
| Perplexity | Language model quality | transformers |
| Human eval / LLM-as-judge | Open-ended generation | Custom or GPT-4 scoring |
| Task-specific accuracy | Classification | sklearn |
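For QA-style evaluation, exact match and token-level F1 need no library at all; the sketch below follows the style popularised by SQuAD scoring (a simplified version, not the `evaluate` package's exact normalisation rules):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if prediction equals gold after trimming and lowercasing."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                                       # 1.0
print(token_f1("the capital is Paris", "Paris is the capital"))            # 1.0
print(round(token_f1("the capital is Paris", "Paris"), 2))                 # 0.4
```

Averaged over the `eval_results.jsonl` file from the previous step, these two numbers give a fast first read before any human review.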
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./outputs/merged_model")
tokenizer = AutoTokenizer.from_pretrained("./outputs/merged_model")

# Push to your personal hub (creates the repo automatically)
model.push_to_hub("your-username/my-finetuned-llama-3b")
tokenizer.push_to_hub("your-username/my-finetuned-llama-3b")

# Or push just the LoRA adapter (much smaller - recommended)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
adapter_model = PeftModel.from_pretrained(base_model, "./outputs/lora_adapter")
adapter_model.push_to_hub("your-username/my-lora-adapter")
```
```bash
# Option 1: vLLM (fastest, CUDA only)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/merged_model \
  --port 8000

# Option 2: Ollama (easiest, supports Apple Silicon)
# Convert to GGUF first, then:
ollama create my-model -f ./Modelfile
ollama run my-model

# Option 3: Hugging Face Text Generation Inference (TGI)
# Mount the local weights into the container so TGI can read them
docker run --gpus all -v $(pwd)/outputs:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id /data/merged_model \
  --port 8080
```
All three options expose an OpenAI-compatible API: point your existing client's `base_url` at the local server (e.g. `http://localhost:8000/v1` for vLLM). Zero application code changes.

Base models are not instruction-following. Always use an instruct-tuned checkpoint as your starting point for SFT unless you are doing continued pre-training.
Hold out at least 10% of data for validation and 10% for final test. Never evaluate on training data; it will always look better than it actually is.
2e-4 is a safe default for LoRA. Going to 1e-3 or higher will often cause the loss to spike. Start conservative and tune upward if training is slow.
3 epochs is often enough for SFT. After 5+, many models start memorising training examples verbatim rather than generalising. Monitor validation loss.
Each instruct model has a specific chat template (ChatML, Llama-3, etc.). Using the wrong format at inference will cause poor outputs even from a well-trained model. Use tokenizer.apply_chat_template().
One bad example in 10 is not a problem. One bad example in 100 starts to matter. One bad pattern repeated 50 times will dominate your model's outputs. Curate aggressively.
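That last point is checkable before training: a repeated bad pattern surfaces immediately in a frequency count over response openings. A quick heuristic sketch (`common_openings` is my own helper, not a standard tool):

```python
from collections import Counter

def common_openings(responses: list[str], n_words: int = 5, top: int = 3):
    """Most frequent first-n-words of responses - degenerate patterns surface here."""
    openings = Counter(" ".join(r.split()[:n_words]) for r in responses)
    return openings.most_common(top)

responses = ["As an AI model, I cannot help with that request."] * 50 + \
            ["Sure, here is the contract clause you asked about."] * 5
print(common_openings(responses))
# [('As an AI model, I', 50), ('Sure, here is the contract', 5)]
```

If one opening dominates and it is a refusal or a boilerplate apology, cut those examples before they become your model's favourite sentence.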
Apple Silicon (M1 through M4) is uniquely suited for on-device LLM work. Unified memory means the GPU and CPU share the same physical pool: a 64 GB M2 Max can hold models that exceed a 24 GB NVIDIA card's limit. But MPS and MLX have different strengths, and knowing which to reach for saves hours of debugging.

The MPS backend plugs into the Hugging Face ecosystem you already know: transformers, PEFT, and SFTTrainer all work out of the box. It is fastest for batch training workloads. But some ops silently fall back to CPU, and bitsandbytes 4-bit quantisation is not supported on MPS.

MLX is purpose-built by Apple for Apple Silicon. No device management is needed: unified memory means tensors just exist. It is faster than MPS for inference and single-item prediction, and the go-to for running quantised models and LoRA fine-tuning when bitsandbytes is not an option.
```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Tell accelerate to use MPS when device_map="auto"
os.environ["ACCELERATE_USE_MPS_DEVICE"] = "True"
# Allow unsupported ops to fall back to CPU instead of crashing
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# Verify MPS is active before loading anything
assert torch.backends.mps.is_available(), "MPS not available - check macOS and PyTorch versions"
assert torch.backends.mps.is_built(), "PyTorch was not built with MPS support"

# Quick sanity check
x = torch.ones(3, device="mps")
print(f"Tensor lives on: {x.device}")  # -> mps:0

# Load a model - use device_map="auto" (not "mps", which is not a valid map value)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.float16,  # float16 works on MPS; bfloat16 support varies
    device_map="auto",          # accelerate will route to MPS via the env var above
)
# Or for small models, manually place on the MPS device:
# model = model.to(torch.device("mps"))

# Confirm layers are actually on MPS
first_layer_device = next(model.parameters()).device
print(f"Model on: {first_layer_device}")  # should show mps:0, not cpu
```
MLX is Apple's own array framework. The mlx-lm package provides a complete pipeline for running and fine-tuning LLMs. There is no device management: all tensors share unified memory automatically.
```bash
# Install MLX and the LLM wrapper (Apple Silicon only - pip will reject on Intel/CUDA)
pip install mlx mlx-lm

# Run a quantised model - downloads from mlx-community on the Hugging Face Hub
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain attention mechanisms in one paragraph" \
  --max-tokens 300

# Fine-tune with LoRA - replaces bitsandbytes QLoRA on Apple Silicon
mlx_lm.lora \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --train \
  --data ./data/processed \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 16 \
  --learning-rate 1e-5

# Convert any Hugging Face model to MLX format yourself
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.2-3B-Instruct \
  --mlx-path ./models/llama-3.2-3b-mlx \
  --quantize \
  --q-bits 4
```
```python
from mlx_lm import load, generate

# Load - no device placement needed, unified memory handles it
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate - verbose=True prints live tokens/sec
response = generate(
    model,
    tokenizer,
    prompt="What is the difference between SFT and DPO?",
    max_tokens=300,
    verbose=True,  # prints token speed as output streams
)
print(response)
```
If your goal is just running models locally without configuring Python, Ollama is the fastest path. It uses llama.cpp with Metal acceleration and exposes an OpenAI-compatible API out of the box.
```bash
# Install via Homebrew (or download from https://ollama.com)
brew install ollama

# Start the background server
ollama serve

# Pull models - fully Metal-accelerated on Apple Silicon
ollama pull llama3.2:3b   # ~2 GB on disk
ollama pull mistral:7b    # ~4.1 GB
ollama pull phi4:14b      # ~8.5 GB

# Chat interactively
ollama run llama3.2:3b

# Or call the OpenAI-compatible REST API - works with any OpenAI SDK app
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hello!"}]}'
```
| Mac Config | Unified Memory | Max Model (4-bit) | Best Path |
|---|---|---|---|
| M1 / M2 base | 8 GB | 1–3B params | Ollama or MLX with Phi-3.5-mini-instruct |
| M1 Pro / M2 Pro | 16 GB | 7–8B params | MLX with Mistral-7B or Llama-3.2-3B comfortably |
| M1 Max / M2 Max | 32–38 GB | 13–14B params | Phi-4-14B or Llama-3.1-8B with headroom for fine-tuning |
| M2 Ultra / M3 Max | 64–96 GB | 34–40B params | Llama-3.3-70B in 4-bit, QwQ-32B |
| M2 Ultra (192 GB) | 192 GB | 70B+ params | Llama-3.1-70B in 8-bit, full fine-tuning of 13B models |
MLX LoRA expects your data in a specific JSONL format inside a folder. Both `train.jsonl` and `valid.jsonl` must exist.
```python
import json, os
from mlx_lm import load

os.makedirs("./data/processed", exist_ok=True)

# Each line must be a JSON object with a "text" field.
# Apply your model's chat template before saving.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

examples = [
    {"prompt": "What is LoRA?", "response": "LoRA is Low-Rank Adaptation..."},
    {"prompt": "Define fine-tuning.", "response": "Fine-tuning is..."},
    # ... your data
]

def to_mlx_format(ex):
    msgs = [
        {"role": "user", "content": ex["prompt"]},
        {"role": "assistant", "content": ex["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

formatted = [to_mlx_format(e) for e in examples]

# 90/10 train/val split
split = int(len(formatted) * 0.9)
for fname, rows in [("train.jsonl", formatted[:split]), ("valid.jsonl", formatted[split:])]:
    with open(f"./data/processed/{fname}", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

print(f"Train: {split} examples | Val: {len(formatted)-split} examples")
```
Whether you're on NVIDIA or Apple Silicon, these techniques are additive: stack them in order of effort for the highest return. None of them require changing your model or your data.
| Method | Bits | Quality Loss | Hardware | Best For |
|---|---|---|---|---|
| NF4 (bitsandbytes) | 4-bit | ~1–2% | CUDA only | QLoRA training, CUDA inference |
| GPTQ | 4-bit | ~1–2% | CUDA | Fast CUDA inference, pre-quantised Hub models |
| AWQ | 4-bit | ~0.5% | CUDA | Best-quality 4-bit on CUDA |
| GGUF Q4_K_M | 4-bit | ~1% | CPU + MPS + CUDA | Apple Silicon via Ollama/llama.cpp |
| GGUF Q8_0 | 8-bit | <0.5% | CPU + MPS + CUDA | Higher quality, 2× the size of Q4 |
| MLX 4-bit | 4-bit | ~1% | Apple Silicon only | MLX inference and LoRA fine-tuning on Mac |
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    # --- Memory ---
    gradient_checkpointing=True,     # recompute activations, saves 30-40% VRAM
    gradient_accumulation_steps=8,   # effective batch = per_device x 8
    per_device_train_batch_size=2,   # raise until you hit OOM, then lower by 1
    # --- Precision ---
    bf16=True,                       # bfloat16: safer than fp16, same speed
    # fp16=True,                     # use instead on older GPUs without bf16
    # --- Speed (CUDA) ---
    # attn_implementation="flash_attention_2"  # set in model.from_pretrained instead
    dataloader_num_workers=4,        # parallel data loading (set 0 on Windows)
    dataloader_pin_memory=True,      # faster CPU->GPU transfers (CUDA only)
    # --- Eval memory safety ---
    per_device_eval_batch_size=1,
    eval_accumulation_steps=4,
    # --- Logging and saving ---
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,     # auto-keeps the best checkpoint by val loss
    metric_for_best_model="eval_loss",
    report_to="wandb",               # or "none" to disable tracking
)
```
```python
# pip install flash-attn --no-build-isolation (CUDA only, not MPS)
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # <- the entire change
)

# Verify it took effect
print(model.config._attn_implementation)  # -> flash_attention_2
```
```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,  # packs multiple short examples into one sequence;
                   # set False if your examples are already near max_seq_length
)
# Rule of thumb: if the average example is under 30% of max_seq_length, packing helps
```
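That rule of thumb can be checked against your data before training. The sketch below uses whitespace tokens as a cheap proxy for real tokenizer counts (an approximation of my own; swap in `len(tokenizer(t)["input_ids"])` for exact numbers):

```python
def packing_worthwhile(texts: list[str], max_seq_length: int = 2048,
                       threshold: float = 0.3) -> bool:
    """True if the average example fills under `threshold` of the context window."""
    avg_tokens = sum(len(t.split()) for t in texts) / len(texts)
    return avg_tokens / max_seq_length < threshold

short_examples = ["short instruction response pair"] * 100  # ~4 tokens each
print(packing_worthwhile(short_examples))  # True - packing will help a lot
```

For typical instruction datasets with responses of a few hundred tokens against a 2048-token window, this almost always comes out True.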
```python
import torch

# The first call compiles (slow - up to several minutes). All subsequent calls are fast.
model = torch.compile(model, mode="reduce-overhead")

# mode options:
#   "default"         - safe, ~10-20% speedup, wide compatibility
#   "reduce-overhead" - reduces Python overhead, good for LLMs, ~20-40% faster
#   "max-autotune"    - aggressive compile, fastest runtime, CUDA only
# fullgraph=False     - safer if the model has dynamic control flow
```
```bash
# NVIDIA - live GPU stats every 0.5 seconds
watch -n 0.5 nvidia-smi

# Better NVIDIA monitor (pip install nvitop)
nvitop

# Apple Silicon - command-line GPU power and utilisation
sudo powermetrics --samplers gpu_power -i 1000 -n 10
```

```python
# CUDA memory snapshot from Python during training
import torch

print(f"Allocated : {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved  : {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Peak alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# Reset peak tracking between runs
torch.cuda.reset_peak_memory_stats()
```
Tokenise your dataset once and save the result with `dataset.save_to_disk()`. Then load the pre-tokenised version at training time; never re-tokenise on each epoch. Also increase `dataloader_num_workers` to 4.

Fine-tuning looks simple from the outside: load a model, run a trainer, save weights. The reality has a dozen invisible traps that silently produce a worse model while the loss curve looks fine. This chapter is everything you need internalised before running a single training step.
A base model has already learned how language works (grammar, reasoning, world knowledge) from hundreds of billions of tokens. Fine-tuning does not inject new knowledge. It teaches the model how to behave: which format to use, which tone to adopt, which vocabulary to prioritise. If you want it to know new facts, use RAG. If you want it to act differently, fine-tune.
| Mode | What It Teaches | Data Needed | When to Use It |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Follow a specific instruction format and style | Prompt → response pairs (500–10K) | First step for almost every use case |
| Continued Pre-Training | New domain vocabulary and concepts | Raw domain text (millions of tokens) | Deeply specialised domains: medical, legal, novel languages |
| DPO (Direct Preference Optimisation) | Prefer one response style over another | Chosen / rejected pairs (1K+) | After SFT, to align tone, safety, or output quality |
| LoRA / QLoRA | Same as SFT, but via efficient adapter layers | Same as SFT | Consumer hardware; almost always your first choice |
```python
import json
from datasets import Dataset

# STEP 1: Define one format and apply it 100% consistently
def format_example(prompt: str, response: str) -> dict:
    return {
        "text": (
            f"### Instruction:\n{prompt}\n\n"
            f"### Response:\n{response}"
        )
    }

# STEP 2: Validate every example - the model learns your bugs too
def validate(ex: dict) -> list:
    issues = []
    if not ex.get("prompt") or len(ex["prompt"].strip()) < 10:
        issues.append("prompt too short or empty")
    if not ex.get("response") or len(ex["response"].strip()) < 20:
        issues.append("response too short or empty")
    if ex.get("prompt") == ex.get("response"):
        issues.append("prompt equals response")
    if len(ex.get("response", "")) > 6000:
        issues.append("response suspiciously long - possible duplicate or corruption")
    return issues

# STEP 3: Load, audit, fix
raw = [json.loads(l) for l in open("my_data.jsonl")]
bad = [(i, validate(ex)) for i, ex in enumerate(raw) if validate(ex)]
print(f"Total: {len(raw)} | Bad: {len(bad)}")
for idx, issues in bad[:10]:
    print(f"  Row {idx}: {issues}")

# STEP 4: Filter and format clean examples only
clean = [format_example(ex["prompt"], ex["response"])
         for ex in raw if not validate(ex)]
dataset = Dataset.from_list(clean)
```
Every instruct model was trained with a specific prompt format. Using the wrong one is one of the most common silent failures: the model degrades with no error message.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# WRONG: a hand-rolled format almost certainly mismatches what the model expects
wrong = "### Instruction:\nWhat is LoRA?\n\n### Response:\n"

# RIGHT: apply_chat_template always produces the exact format the model expects
messages = [
    {"role": "system", "content": "You are a helpful ML assistant."},
    {"role": "user", "content": "What is LoRA?"},
]
correct = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # adds the assistant turn opener
)
print(correct)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are...

# Apply at data-prep time - not at inference time inside the training loop
def format_with_template(example):
    msgs = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

dataset = dataset.map(format_with_template)
```
```python
import random
from datasets import Dataset

random.seed(42)
data = list(dataset)
random.shuffle(data)

n = len(data)
train = data[:int(n * 0.80)]
val   = data[int(n * 0.80):int(n * 0.90)]
test  = data[int(n * 0.90):]

# Save to disk before any training - never mix splits
Dataset.from_list(train).save_to_disk("./data/splits/train")
Dataset.from_list(val).save_to_disk("./data/splits/val")
Dataset.from_list(test).save_to_disk("./data/splits/test")
print(f"Train: {len(train)} | Val: {len(val)} | Test: {len(test)}")
```
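After splitting, it is worth asserting that no example leaked across splits: duplicates that straddle train and test silently inflate your metrics. A sketch (`assert_no_leakage` is my own helper; it assumes each example dict has a `text` field, as produced above):

```python
def assert_no_leakage(train: list[dict], val: list[dict], test: list[dict]) -> None:
    """Raise if any identical `text` appears in more than one split."""
    t = {ex["text"] for ex in train}
    v = {ex["text"] for ex in val}
    s = {ex["text"] for ex in test}
    assert not (t & v), f"{len(t & v)} examples shared between train and val"
    assert not (t & s), f"{len(t & s)} examples shared between train and test"
    assert not (v & s), f"{len(v & s)} examples shared between val and test"

assert_no_leakage(
    [{"text": "a"}, {"text": "b"}],
    [{"text": "c"}],
    [{"text": "d"}],
)  # passes silently; any cross-split duplicate raises AssertionError
```

Run it once right after saving the splits, and again before any evaluation run that mixes data sources.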
- Healthy training: the model is generalising. Continue, or stop when val loss plateaus. Action: keep going.
- Overfitting: the model is memorising. Stop here and use the checkpoint where val loss was lowest. Action: stop early.
- Learning rate too high, or a corrupted batch. Halve the learning rate and relaunch. Action: lower the LR.
- LR too low, frozen layers, or wrong LoRA `target_modules`. Check your config. Action: check the config.
- Data issue: an empty batch, a bad label, or a zero-length sequence. Add pre-training validation logging. Action: debug the data.
- The model has learned everything in your data. More epochs won't help; get more diverse data. Action: evaluate now.
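The "stop where val loss was lowest" advice can be automated; `transformers` ships an `EarlyStoppingCallback` for exactly this, but the logic is simple enough to show standalone (a toy sketch of my own, not the library implementation):

```python
def best_checkpoint(val_losses: list[float], patience: int = 2) -> int:
    """Index of the epoch to keep: lowest val loss so far, stopping once it
    has failed to improve for `patience` consecutive epochs."""
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` epochs - stop early
    return best_idx

# Val loss falls, then rises: classic overfitting after epoch 2
print(best_checkpoint([1.9, 1.4, 1.2, 1.3, 1.5, 1.8]))  # 2
```

Pair this with `save_strategy="epoch"` and `load_best_model_at_end=True` from earlier, and the best checkpoint is kept without any manual curve-watching.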