🤗 Complete Beginner's Guide

Hugging Face LLMs
From Zero to Fine-Tuned

A practical, no-fluff guide to choosing a large language model on Hugging Face, setting up your local environment, downloading a model, preparing data, and fine-tuning it for your specific use case — written for people who are new to the whole pipeline.

📋 What This Guide Covers

This guide walks the complete path from zero knowledge to running and retraining a real LLM locally. We cover the questions to ask before picking any model, how to create your Hugging Face account, download a model with the CLI, set up your Python environment, understand what data you can use, and walk through concrete retraining use cases.

Read sequentially if you are completely new. Experienced practitioners can jump to any chapter — each is self-contained. All code is tested on Python 3.10+ and works on both CUDA (NVIDIA) and Apple Silicon (MPS).

Python Version
3.10 or 3.11 recommended
Core Library
transformers · datasets · peft
Training Framework
PyTorch (+ optionally MLX on Apple)
Efficient Fine-tuning
LoRA / QLoRA via PEFT
Experiment Tracking
Weights & Biases or TensorBoard
Model Hub
huggingface.co

Chapter 01 Questions to Ask Before Choosing Any Model

Picking the wrong model wastes weeks. Before opening Hugging Face, answer these questions honestly. There are no wrong answers — only answers that point you to the right model family.

1.1 What Problem Are You Solving?

1.2 What Are Your Hardware Constraints?

1.3 What Are Your Long-Term Goals?

1.4 What Are Your Data and Privacy Constraints?

💡
The most common mistake: Picking the largest model available and then being surprised it won't fit in memory or is too slow. Always start with the smallest model that could plausibly work, get it running end-to-end, then scale up if needed.

Chapter 02 Best Practices for Choosing a Model

2.1 Understand Model Families

Hugging Face hosts hundreds of thousands of models, but most derive from a small set of foundation families. Knowing the families shortcuts the search dramatically.

| Family | Best For | Sizes Available | Licence |
| --- | --- | --- | --- |
| Llama 3 / 3.1 / 3.2 (Meta) | General-purpose, instruction following, coding, reasoning | 1B, 3B, 8B, 70B, 405B | Llama Community (commercial use allowed below 700M MAU) |
| Mistral / Mixtral | Fast inference, European language support, MoE efficiency | 7B, 8×7B, 8×22B | Apache 2.0 (fully open) |
| Phi-3 / Phi-4 (Microsoft) | Tiny but capable: on-device, constrained hardware | 3.8B, 7B, 14B | MIT |
| Qwen 2.5 (Alibaba) | Multilingual, math, code, long-context | 0.5B–72B | Apache 2.0 |
| Gemma 2 (Google) | Safe, efficient, strong benchmarks at small sizes | 2B, 9B, 27B | Gemma Terms (custom open) |
| DeepSeek-R1 / V3 | Reasoning, math, code | 7B–671B | MIT |
| BERT / RoBERTa / DeBERTa | Classification, NER, embeddings (encoder-only) | 110M–435M | Apache 2.0 |
| Whisper (OpenAI) | Audio → text transcription | tiny to large-v3 | MIT |

2.2 Instruction-Tuned vs Base Models

🔤 Base Models

Pre-trained on raw text. They complete text — they don't follow instructions. Use these as the starting point for your own fine-tuning pipeline. Examples: meta-llama/Llama-3.1-8B, mistralai/Mistral-7B-v0.1.

💬 Instruction-Tuned Models

Fine-tuned to follow prompts and chat. Use these for immediate deployment or as a starting point for task-specific fine-tuning. Examples: meta-llama/Llama-3.1-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.3.

2.3 How to Read a Model Card

①
Check the Licence
First thing. Apache 2.0 and MIT are fully permissive. Llama Community permits commercial use but restricts platforms above 700M monthly active users. Always read the licence tab before anything else if you plan to deploy.
②
Check Benchmarks
Look at MMLU, HellaSwag, HumanEval (for code), and MT-Bench scores. Compare within size class. A well-fine-tuned 7B can beat a base 13B on specific tasks.
③
Check Context Length
Listed as "max_position_embeddings" in config.json. 4K is limiting. 8K is comfortable. 128K opens document-level tasks.
④
Check Available Quantisations
Look at community forks with "GGUF", "AWQ", or "GPTQ" in the name. These are smaller, memory-efficient versions that sacrifice minimal quality for a dramatic size reduction.
⑤
Check Downloads & Community Activity
High download counts and active discussions in the Community tab mean bugs are caught, documentation is better, and library support is tested.

2.4 Model Size vs Hardware — Quick Reference

| Model Size | 4-bit (VRAM) | 16-bit (VRAM) | Minimum Hardware |
| --- | --- | --- | --- |
| 1–3B params | ~1–2 GB | ~3–6 GB | Any modern laptop, M1 8GB |
| 7–8B params | ~5–6 GB | ~14–16 GB | M1 Pro 16GB, RTX 3060 |
| 13B params | ~9 GB | ~26 GB | M2 Max 32GB, RTX 3090 |
| 34–40B params | ~22 GB | ~70 GB | M2 Ultra 64GB, A6000 |
| 70B params | ~40 GB | ~140 GB | Multi-GPU, A100 80GB |
⚠️
Memory overhead: These numbers are for inference only. Fine-tuning requires additional memory for gradients and optimiser states — even with LoRA, budget an extra 2–4 GB on top of the model size.
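As a sanity check before downloading, the table rows can be approximated with one line of arithmetic: weights take roughly params × bits ⁄ 8 bytes, plus working overhead for activations and the KV cache. A minimal sketch (the `estimate_vram_gb` helper and the flat 1 GB overhead are illustrative assumptions, not a library function):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead_gb: float = 1.0) -> float:
    """Rule-of-thumb VRAM estimate: weights only, plus a flat overhead
    for activations / KV cache. Real usage varies with context length."""
    weights_gb = params_billion * bits / 8  # 1e9 params * (bits/8) bytes ~= GB
    return weights_gb + overhead_gb

# An 8B model in 16-bit vs 4-bit
print(f"{estimate_vram_gb(8, 16):.1f} GB")  # ~16 GB of weights + overhead
print(f"{estimate_vram_gb(8, 4):.1f} GB")   # ~4 GB of weights + overhead
```

The table's slightly higher 4-bit numbers reflect real-world extras this formula ignores (quantisation metadata, longer contexts).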

Chapter 03 Getting a Hugging Face Account

3.1 Sign Up

①
Go to huggingface.co
Click "Sign Up" in the top right. You can register with email or sign in with GitHub/Google. GitHub is recommended if you plan to push model weights or datasets.
②
Verify Your Email
Unverified accounts cannot download gated models (Llama, Gemma, etc.) or push to the Hub. Verify immediately after signup.
③
Create an Access Token
Go to Settings → Access Tokens → New Token. Create a token with "Read" scope for downloading, or "Write" scope if you want to push your own models. Copy and save it — it won't be shown again.
④
Request Access for Gated Models (if needed)
Models like Llama 3 require you to agree to their licence on the model page. Click "Access repository", fill in the form, and wait for approval (usually instant to a few hours).

3.2 Install the Hugging Face CLI

Bash
# Install the CLI (comes with the huggingface_hub package)
pip install huggingface_hub

# Log in — this saves your token to ~/.cache/huggingface/token
huggingface-cli login

# Paste your token when prompted.
# Your token is now saved — no need to pass it to every API call.

# Verify you're logged in
huggingface-cli whoami

3.3 Understand the Cache

Downloaded models are stored in ~/.cache/huggingface/hub/ by default. Each model is stored once and shared across all scripts that reference it. You can change the location with the environment variable:

Bash
# Change the cache directory (add to your .zshrc or .bashrc)
export HF_HOME=/path/to/your/drive/huggingface_cache

# For large models (70B+), point this to an external SSD
ℹ️
Storage planning: A 7B model in float16 is ~14 GB. A 70B model is ~140 GB. Plan your disk space before downloading. Use an external SSD for anything over 13B.

Chapter 04 Downloading a Model Locally

4.1 Three Ways to Download

Method 1 — CLI Snapshot Download (recommended for first-timers)

Bash
# Download an entire model to a local folder
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
  --local-dir ./models/llama-3.2-3b-instruct

# Download a specific file only (e.g. the config)
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct config.json

Method 2 — Python API (most flexible)

Python
from huggingface_hub import snapshot_download

# Downloads to ~/.cache/huggingface/hub/ automatically
# Subsequent calls use the cache — no re-download
model_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    ignore_patterns=["*.pt", "*.ot"],  # skip legacy weights if present
)
print(f"Model cached at: {model_path}")

Method 3 — via transformers (laziest, good for prototyping)

Python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "microsoft/Phi-3.5-mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # use bfloat16 on newer hardware
    device_map="auto",           # auto-detects CUDA, MPS, or CPU
)

# Quick test
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

4.2 Download Quantised Models (for limited memory)

If your hardware can't fit the full model, use a quantised version. The easiest way is via bitsandbytes for CUDA or pre-quantised GGUF files via llama-cpp-python for Apple Silicon.

Python
# 4-bit quantisation on load (CUDA / bitsandbytes)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",       # NF4 is best quality 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
Bash
# Apple Silicon: use GGUF via Ollama (easiest local setup)
# Install: https://ollama.com
ollama pull llama3.2:3b
ollama run llama3.2:3b

4.3 Verify the Download

Python
from transformers import AutoConfig

# Check config loads without errors
config = AutoConfig.from_pretrained("./models/llama-3.2-3b-instruct")
print(f"Architecture: {config.architectures}")
print(f"Context length: {config.max_position_embeddings}")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

Chapter 05 What Data Can You Use?

5.1 Data Types for LLM Training

| Data Type | Used For | Format | Minimum Quantity |
| --- | --- | --- | --- |
| Instruction pairs | Supervised fine-tuning (SFT) | {"prompt": "...", "response": "..."} | 500–2,000 high quality |
| Conversation threads | Chat fine-tuning | ShareGPT / ChatML format | 1,000+ turns |
| Raw domain text | Continued pre-training | Plain text, JSONL | Millions of tokens |
| Preference pairs | DPO / RLHF alignment | {"chosen": "...", "rejected": "..."} | 1,000+ pairs |
| Classification labels | Fine-tuning classifiers | {"text": "...", "label": 0} | 500+ per class |
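A minimal sketch of producing and validating the first row of that table, instruction pairs as JSONL (the file name and the two example rows are placeholders for your own data):

```python
import json

# Hypothetical examples — replace with your real data
pairs = [
    {"prompt": "What is LoRA?", "response": "LoRA is Low-Rank Adaptation, a technique that..."},
    {"prompt": "Name one 4-bit quantisation format.", "response": "NF4."},
]

# One JSON object per line = JSONL
with open("train.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")

# Validate: every line parses and has a non-empty prompt and response
with open("train.jsonl") as f:
    for i, line in enumerate(f, 1):
        row = json.loads(line)
        assert row.get("prompt") and row.get("response"), f"bad row at line {i}"
print("all rows valid")
```

The same shape (one object per line) works for preference pairs and classification labels; only the keys change.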

5.2 Where to Find Datasets

①
Hugging Face Datasets Hub
huggingface.co/datasets — over 100,000 public datasets. Filter by task, language, and licence. Use the datasets library to load any of them in one line.
②
Your Own Data
Internal documents, support tickets, product manuals, code repositories, customer emails (anonymised). This is almost always the most valuable data source for domain-specific fine-tuning.
③
Synthetic Data (Model-Generated)
Use a large model (GPT-4, Llama-3.1-70B) to generate instruction-response pairs from seed examples or documents. Works well for bootstrapping when you have few real examples.
④
Public Web Corpora
Common Crawl, The Pile, RedPajama, FineWeb. These are for continued pre-training, not SFT. Very large — filter aggressively for quality before using.

5.3 Data Licensing — Critical Checklist

5.4 Data Quality Over Quantity

🔬
The LIMA finding: A 2023 paper showed that fine-tuning Llama on just 1,000 carefully curated instruction pairs produced results competitive with models fine-tuned on hundreds of thousands of noisy examples. Quality of examples matters far more than raw count. Spend time on data curation — it's the highest-leverage activity in fine-tuning.

What "quality" means for instruction data: clear, unambiguous instructions; responses that are complete and correct; diversity across topics and styles; no contradictions within the dataset; no degenerate examples (empty strings, encoding errors, repetitive loops).

Python
from datasets import load_dataset

# Load a popular instruction dataset from the Hub
ds = load_dataset("HuggingFaceH4/ultrachat_200k")

# Load your own JSONL file
ds = load_dataset("json", data_files={"train": "my_data.jsonl"})

# Basic quality checks
print(f"Total examples: {len(ds['train'])}")
print(f"Columns: {ds['train'].column_names}")
print(f"First example: {ds['train'][0]}")

# Filter out short responses (likely low quality)
# Assumes a "response" column — adapt the key to your own schema
ds = ds.filter(lambda x: len(x["response"]) > 100)

Chapter 06 Setting Up Your Training Environment

6.1 Install Python and Create a Virtual Environment

Bash
# Python 3.10 or 3.11 recommended (3.12 has some library gaps)
python --version

# Create a dedicated environment for LLM work
python -m venv llm_env
source llm_env/bin/activate       # Linux / macOS
# llm_env\Scripts\activate        # Windows

# Or with conda (easier for CUDA management)
conda create -n llm_env python=3.11
conda activate llm_env

6.2 Core Package Installation

Bash
# Core Hugging Face stack
pip install transformers datasets accelerate peft trl

# Weights & Biases for experiment tracking (optional but recommended)
pip install wandb

# For NVIDIA GPU (CUDA 12.1 — match your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For Apple Silicon (MPS backend is built into PyTorch)
pip install torch torchvision torchaudio

# For 4-bit quantisation on CUDA (bitsandbytes)
pip install bitsandbytes

# For Flash Attention 2 (speeds up training significantly on CUDA)
pip install flash-attn --no-build-isolation

6.3 Verify Your Setup

Python
import torch
import transformers

# Check PyTorch version and device availability
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available:  {torch.backends.mps.is_available()}")

# Should print device info — not "cpu" if you have a GPU
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using device: {device}")

# Quick GPU memory check (CUDA only)
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

6.4 Project Directory Structure

Shell
my_llm_project/
├── data/
│   ├── raw/          # original source files
│   ├── processed/    # cleaned JSONL ready for training
│   └── splits/       # train / val / test splits
├── models/
│   └── base/         # downloaded base model (or symlink to HF cache)
├── checkpoints/      # saved during training
├── outputs/          # final merged model weights
├── scripts/
│   ├── prepare_data.py
│   ├── train.py
│   └── evaluate.py
├── configs/
│   └── lora_config.yaml
└── requirements.txt
✅
Pro tip: Add .env to your .gitignore and store your HF token there. Load it with python-dotenv. Never commit access tokens to version control — even private repos.
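python-dotenv does this in one call (`load_dotenv()`); for illustration, a dependency-free sketch of the same idea (simplified parser, no quoting or export rules, and the `load_env_file` name is our own):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: put KEY=VALUE lines into os.environ.
    Existing environment variables win (setdefault)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# .env contains a line like: HF_TOKEN=hf_xxxx (placeholder)
# load_env_file()
# token = os.environ["HF_TOKEN"]
```

Keeping the token in the environment means your scripts never hard-code it, and huggingface_hub picks up HF_TOKEN automatically.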

Chapter 07 Fine-Tuning with LoRA — Step by Step

LoRA (Low-Rank Adaptation) is the standard fine-tuning technique for LLMs on consumer hardware. Instead of updating all billions of parameters, LoRA injects small trainable weight matrices at key layers. This reduces trainable-parameter memory by 10–100× with minimal quality loss.
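The arithmetic behind that claim, as a quick sketch: a full update to one d_out × d_in weight matrix trains d_out·d_in numbers, while LoRA trains only the two low-rank factors B (d_out × r) and A (r × d_in) of the update W' = W + B·A:

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters trained when updating the full weight matrix."""
    return d_out * d_in

def lora_update_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the LoRA factors: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

# One 4096x4096 attention projection (7B-class hidden size), rank 16
full = full_update_params(4096, 4096)      # 16,777,216
lora = lora_update_params(4096, 4096, 16)  # 131,072
print(f"trainable fraction: {lora / full:.2%}")
```

At rank 16 this single layer trains under 1% of the full parameter count, which is why the optimiser states and gradients fit on a laptop GPU.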

7.1 LoRA Config Explained

Python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # or SEQ_CLS for classification
    r=16,           # rank: higher = more capacity, more memory (try 8–64)
    lora_alpha=32,  # scaling factor — rule of thumb: 2× rank
    lora_dropout=0.05,
    # Which layers to target — these are standard for most LLaMA-family models
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)

7.2 Complete Training Script

Python
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
from trl import SFTTrainer
from datasets import load_dataset
import torch

MODEL_ID   = "meta-llama/Llama-3.2-3B-Instruct"
DATA_PATH  = "./data/processed/train.jsonl"
OUTPUT_DIR = "./checkpoints"

# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # required for batching

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Apply LoRA
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints: trainable params: 41,943,040 || all params: 3,254,878,208 || 1.29%

# 3. Load dataset
dataset = load_dataset("json", data_files={"train": DATA_PATH})

def format_prompt(example):
    return {"text": f"### Instruction:\n{example['prompt']}\n\n### Response:\n{example['response']}"}

dataset = dataset.map(format_prompt)

# 4. Training arguments
args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 16
    warmup_ratio=0.03,
    learning_rate=2e-4,
    bf16=True,                        # bfloat16 on modern hardware
    logging_steps=10,
    save_strategy="epoch",
    report_to="wandb",                # or "none" to disable
)

# 5. Train
# NB: on recent trl releases the tokenizer, dataset_text_field and
# max_seq_length arguments moved into SFTConfig — if these raise a
# TypeError, check your installed trl version's SFTTrainer signature.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer.train()

# 6. Save LoRA adapter (small — only a few hundred MB)
model.save_pretrained("./outputs/lora_adapter")
tokenizer.save_pretrained("./outputs/lora_adapter")

7.3 Merge the Adapter Back into the Base Model

Python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model + adapter, then merge
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.float16,
    device_map="cpu",  # merge on CPU to avoid VRAM spike
)
model = PeftModel.from_pretrained(base_model, "./outputs/lora_adapter")
merged = model.merge_and_unload()     # fuses adapter weights into base

# Save the complete merged model
merged.save_pretrained("./outputs/merged_model")
AutoTokenizer.from_pretrained("./outputs/lora_adapter").save_pretrained(
    "./outputs/merged_model"
)

7.4 Key Hyperparameters for New Fine-Tuners

| Parameter | Recommended Start | Effect of Increasing |
| --- | --- | --- |
| r (LoRA rank) | 8–16 | More capacity, more memory, slower training |
| learning_rate | 2e-4 | Too high → loss diverges. Too low → slow convergence |
| num_train_epochs | 2–5 | Too many → overfitting (model memorises training data) |
| max_seq_length | 1024–2048 | More context, but quadratic memory cost |
| gradient_accumulation_steps | 4–8 | Simulates larger batch without extra VRAM |
⚠️
Watch for overfitting: If your training loss keeps dropping but your validation loss starts rising, stop early. Use load_best_model_at_end=True in TrainingArguments to automatically keep the best checkpoint.
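A hedged sketch of the TrainingArguments pieces involved (argument names follow recent transformers releases; `eval_strategy` was called `evaluation_strategy` on older versions, and this assumes you also pass an eval_dataset to the trainer):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Keep the best checkpoint by validation loss, and stop once it plateaus
args = TrainingArguments(
    output_dir="./checkpoints",
    eval_strategy="epoch",        # evaluate at the end of every epoch
    save_strategy="epoch",        # must align with eval for best-model tracking
    load_best_model_at_end=True,  # restore the lowest-eval-loss checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,      # lower loss is better
)

# Then wire it into the trainer, e.g.:
# trainer = SFTTrainer(..., args=args, eval_dataset=val_ds,
#                      callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```

With patience 2, training stops after two consecutive evaluations without improvement, so a rising validation loss never silently eats your compute budget.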

Chapter 08 Use Cases — Models and When to Retrain

Fine-tuning is not always the right answer. This chapter maps common use cases to the appropriate model choice and training strategy.

8.1 Use Case Landscape

🩺
Medical / Clinical NLP

Summarising clinical notes, ICD coding, drug interaction QA. Use BioMedLM or fine-tune Mistral-7B on PubMed + MIMIC. Domain vocabulary drifts heavily from general pre-training.

Fine-tune recommended
⚖️
Legal Document Analysis

Contract clause extraction, case summarisation, precedent search. Fine-tune on LexGLUE or proprietary contracts. Instruction-tuned general models struggle with precise legal phrasing.

Fine-tune + RAG
💻
Code Generation & Review

Auto-complete, bug detection, test generation. DeepSeek-Coder and CodeLlama are already specialised. Fine-tune on your codebase for proprietary APIs and internal conventions.

Specialised model first
🛎️
Customer Support Chatbot

Answer FAQs, route tickets, escalate appropriately. Fine-tune on historical ticket/resolution pairs. A 3B model fine-tuned on your data beats GPT-4 on your specific ticket vocabulary.

Small model + fine-tune
📄
Document Summarisation

Summarise reports, emails, research papers. Long-context models (8K+) work out-of-the-box. Only fine-tune if you need a specific summary format or domain style.

Base instruct model
🌍
Low-Resource Language

Models trained primarily on English underperform on Swahili, Welsh, Tagalog, etc. Continued pre-training on target-language text dramatically improves quality.

Continued pre-training
🏷️
Named Entity Recognition

Extracting people, organisations, locations, custom entities from text. Fine-tune a BERT/DeBERTa encoder — much smaller and faster than a generative model for pure classification.

Encoder model (BERT)
🎙️
Voice Transcription & QA

Transcribe audio with Whisper, then feed into an LLM for Q&A or summarisation. Fine-tune Whisper on domain-specific speech (accents, jargon, background noise).

Whisper + LLM pipeline
🔎
Semantic Search / RAG

Embed documents and retrieve relevant chunks at query time. Use sentence-transformers (BAAI/bge, E5) for the embedding model. Often more effective than fine-tuning for knowledge retrieval.

Embedding model + RAG

8.2 Decision Tree — To Fine-Tune or Not?

①
Try the base instruct model first
Prompt-engineer a good system prompt. Test on 50–100 real examples. Measure quality. Many tasks don't need fine-tuning — a well-crafted prompt is enough.
②
Add RAG if the problem is knowledge retrieval
If the model is good at reasoning but lacks specific facts (your company's docs, recent events, proprietary knowledge), add a retrieval layer before reaching for fine-tuning.
③
Fine-tune if output format or style is wrong
If the model consistently produces the wrong format, uses the wrong tone, or fails domain-specific tasks even with good prompts, fine-tuning fixes the behaviour permanently.
④
Consider continued pre-training for deep domain shift
If your domain uses vocabulary and concepts the model has never seen (a novel scientific field, proprietary jargon), continued pre-training on domain text before SFT gives the best results.

Chapter 09 After Training — Evaluation & Deployment

9.1 Evaluate Your Fine-Tuned Model

Python
import json
import torch
from transformers import pipeline

# Load your merged model
pipe = pipeline(
    "text-generation",
    model="./outputs/merged_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load a held-out test set (never seen during training)
with open("./data/splits/test.jsonl") as f:
    test_examples = [json.loads(l) for l in f]

# Run inference and compare
results = []
for ex in test_examples[:100]:
    output = pipe(ex["prompt"], max_new_tokens=200, do_sample=False)[0]
    results.append({
        "prompt":    ex["prompt"],
        "expected":  ex["response"],
        "generated": output["generated_text"],
    })

# Save for human review or automated scoring
with open("eval_results.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")

9.2 Evaluation Metrics to Track

| Metric | Use When | Library |
| --- | --- | --- |
| ROUGE-L | Summarisation tasks | evaluate (HF) |
| BLEU | Translation | sacrebleu |
| Exact Match / F1 | QA with ground truth | evaluate |
| Perplexity | Language model quality | transformers |
| Human eval / LLM-as-judge | Open-ended generation | Custom or GPT-4 scoring |
| Task-specific accuracy | Classification | sklearn |
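Exact match and token-level F1 are simple enough to sketch without any library, which also makes clear what the `evaluate` implementations compute (this sketch is simplified: SQuAD-style scoring additionally strips articles and punctuation before comparing):

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> int:
    """1 if prediction equals reference after trivial normalisation."""
    return int(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                  # 1
print(round(token_f1("the cat sat", "the cat"), 2))   # 0.8
```

Run these over the eval_results.jsonl file from 9.1 to get a first quantitative signal before any human review.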

9.3 Push Your Model to the Hub

Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./outputs/merged_model")
tokenizer = AutoTokenizer.from_pretrained("./outputs/merged_model")

# Push to your personal hub (creates repo automatically)
model.push_to_hub("your-username/my-finetuned-llama-3b")
tokenizer.push_to_hub("your-username/my-finetuned-llama-3b")

# Or push just the LoRA adapter (much smaller — recommended)
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
adapter_model = PeftModel.from_pretrained(base_model, "./outputs/lora_adapter")
adapter_model.push_to_hub("your-username/my-lora-adapter")

9.4 Serve the Model Locally

Bash
# Option 1: vLLM (fastest, CUDA only)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/merged_model \
  --port 8000

# Option 2: Ollama (easiest, supports Apple Silicon)
# Convert to GGUF first, then:
ollama create my-model -f ./Modelfile
ollama run my-model

# Option 3: Hugging Face Text Generation Inference (TGI)
# Mount the model into the container; TGI listens on port 80 inside it
docker run --gpus all -p 8080:80 \
  -v "$(pwd)/outputs:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/merged_model
💡
The OpenAI-compatible API: Both vLLM and Ollama expose an OpenAI-compatible endpoint. Any app built against the OpenAI SDK works with your local model by just changing the base_url to http://localhost:8000/v1. Zero application code changes.
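A sketch of that swap using only the standard library instead of the OpenAI SDK (the endpoint URL and `my-model` name are placeholders; with the SDK you would pass the same value to `OpenAI(base_url=...)`):

```python
import json
import urllib.request

def chat_completion_request(base_url: str, model: str, content: str) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions POST against a local server."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_completion_request("http://localhost:8000/v1", "my-model", "Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions

# To actually send it (requires the vLLM/Ollama server to be running):
# body = json.loads(urllib.request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
```

The only server-specific detail is the base URL (vLLM defaults to port 8000, Ollama to 11434); the request body is identical.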

Chapter 10 Common Mistakes & How to Avoid Them

โŒ Training on raw base model

Base models are not instruction-following. Always use an instruct-tuned checkpoint as your starting point for SFT unless you are doing continued pre-training.

โŒ Not separating train/val/test

Hold out at least 10% of data for validation and 10% for final test. Never evaluate on training data โ€” it will always look better than it actually is.

โŒ Learning rate too high

2e-4 is a safe default for LoRA. Going to 1e-3 or higher will often cause the loss to spike. Start conservative and tune upward if training is slow.

โŒ Training too many epochs

3 epochs is often enough for SFT. After 5+, many models start memorising training examples verbatim rather than generalising. Monitor validation loss.

โŒ Ignoring chat templates

Each instruct model has a specific chat template (ChatML, Llama-3, etc.). Using the wrong format at inference will cause poor outputs even from a well-trained model. Use tokenizer.apply_chat_template().

โŒ Mixing data quality

One bad example in 10 is not a problem. One bad example in 100 starts to matter. One bad pattern repeated 50 times will dominate your model's outputs. Curate aggressively.
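The chat-template pitfall above is easiest to see side by side. The token strings below follow the published ChatML and Llama 3 instruct formats, hand-built here purely for illustration; in real code always call tokenizer.apply_chat_template() instead of string-building:

```python
def chatml(user_msg: str) -> str:
    # ChatML format (used by Qwen and several other families)
    return (f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

def llama3(user_msg: str) -> str:
    # Llama 3 instruct format
    return ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
            f"{user_msg}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")

# The same question, two incompatible prompt encodings:
print(chatml("Hi"))
print(llama3("Hi"))
```

Feed a Llama-3 checkpoint the ChatML string and none of its special tokens line up, which is exactly why a well-trained model can still produce garbage at inference time.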

10.1 Final Checklist Before You Start Training

🚀
You're ready: With this guide, a Hugging Face account, Python installed, and a GPU with at least 8 GB of VRAM (or an M-series Mac), you have everything you need to go from a downloaded model to a domain-specific fine-tuned LLM. The loop is always the same: get data → clean data → train → evaluate → iterate.

Chapter 11 Apple Silicon — Getting Maximum Performance

Apple Silicon (M1 through M4) is uniquely suited for on-device LLM work. Unified memory means the GPU and CPU share the same physical pool — a 64 GB M2 Max can hold models that exceed a 24 GB NVIDIA card's limit. But MPS and MLX have different strengths, and knowing which to reach for saves hours of debugging.

11.1 MPS vs MLX — Pick the Right Tool

🔷 PyTorch MPS

The MPS backend plugs into the Hugging Face ecosystem you already know — transformers, PEFT, SFTTrainer all work out of the box. Fastest for batch training workloads. Some ops silently fall back to CPU; bitsandbytes 4-bit quantisation is not supported on MPS.

🍎 MLX (Apple-native)

Purpose-built by Apple for Apple Silicon. No device management needed — unified memory means tensors just exist. Faster than MPS for inference and single-item prediction. The go-to for running quantised models and LoRA fine-tuning when bitsandbytes is not an option.

ℹ️
Performance reality: Benchmarks show PyTorch MPS can outperform MLX on batch training tasks. MLX leads on inference startup latency and single-item throughput. For fine-tuning on consumer Macs, MLX LoRA is the most practical starting point — bitsandbytes does not support MPS, so QLoRA is unavailable in the PyTorch path.

11.2 Using PyTorch MPS Correctly

Python
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Tell accelerate to use MPS when device_map="auto"
os.environ["ACCELERATE_USE_MPS_DEVICE"] = "True"
# Allow unsupported ops to fall back to CPU instead of crashing
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# Verify MPS is active before loading anything
assert torch.backends.mps.is_available(), "MPS not available — check macOS and PyTorch versions"
assert torch.backends.mps.is_built(),     "PyTorch was not built with MPS support"

# Quick sanity check
x = torch.ones(3, device="mps")
print(f"Tensor lives on: {x.device}")   # → mps:0

# Load a model — use device_map="auto" (not "mps", which is not a valid map value)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.float16,   # float16 works on MPS; bfloat16 support varies
    device_map="auto",           # accelerate will route to MPS via env var above
)

# Or for small models, manually place on MPS device:
# model = model.to(torch.device("mps"))

# Confirm layers are actually on MPS
first_layer_device = next(model.parameters()).device
print(f"Model on: {first_layer_device}")   # should show mps:0, not cpu

11.3 MLX — Native Apple Silicon Stack

MLX is Apple's own array framework. The mlx-lm package provides a complete pipeline for running and fine-tuning LLMs. There is no device management — all tensors share unified memory automatically.

Bash
# Install MLX and the LLM wrapper (Apple Silicon only — pip will reject on Intel/CUDA)
pip install mlx mlx-lm

# Run a quantised model — downloads from mlx-community on Hugging Face Hub
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain attention mechanisms in one paragraph" \
  --max-tokens 300

# Fine-tune with LoRA — replaces bitsandbytes QLoRA on Apple Silicon
mlx_lm.lora \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --train \
  --data ./data/processed \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 16 \
  --learning-rate 1e-5

# Convert any Hugging Face model to MLX format yourself
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.2-3B-Instruct \
  --mlx-path ./models/llama-3.2-3b-mlx \
  --quantize \
  --q-bits 4
Python
from mlx_lm import load, generate

# Load — no device placement needed, unified memory handles it
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate — verbose=True prints live tokens/sec
response = generate(
    model,
    tokenizer,
    prompt="What is the difference between SFT and DPO?",
    max_tokens=300,
    verbose=True,    # prints token speed as output streams
)
print(response)

11.4 Ollama — Zero-Config Local Inference

If your goal is just running models locally without configuring Python, Ollama is the fastest path. It uses llama.cpp with Metal acceleration and exposes an OpenAI-compatible API out of the box.

Bash
# Install via Homebrew (or download from https://ollama.com)
brew install ollama

# Start the background server
ollama serve

# Pull models — fully Metal-accelerated on Apple Silicon
ollama pull llama3.2:3b       # ~2 GB on disk
ollama pull mistral:7b        # ~4.1 GB
ollama pull phi4:14b          # ~8.5 GB

# Chat interactively
ollama run llama3.2:3b

# Or call the OpenAI-compatible REST API — works with any OpenAI SDK app
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hello!"}]}'

11.5 Memory Tiers — Which Mac Can Run What

| Mac Config | Unified Memory | Max Model (4-bit) | Best Path |
| --- | --- | --- | --- |
| M1 / M2 base | 8 GB | 1–3B params | Ollama or MLX with Phi-3.5-mini-instruct |
| M1 Pro / M2 Pro | 16 GB | 7–8B params | MLX with Mistral-7B or Llama-3.2-3B comfortably |
| M1 Max / M2 Max | 32–38 GB | 13–14B params | Phi-4-14B or Llama-3.1-8B with headroom for fine-tuning |
| M2 Ultra / M3 Max | 64–96 GB | 34–40B params | Llama-3.3-70B in 4-bit, QwQ-32B |
| M2 Ultra (192 GB) | 192 GB | 70B+ params | Llama-3.1-70B in 8-bit, full fine-tuning of 13B models |
💡
Activity Monitor check: Open Activity Monitor → GPU History while a model runs. If GPU usage sits near 0%, your model is running on CPU — something is wrong with your device setup. You should see sustained 60–100% GPU utilisation during inference and training on Apple Silicon.

11.6 MLX LoRA Fine-Tuning — Data Format

MLX LoRA expects your data in a specific JSONL format inside a folder. Both train.jsonl and valid.jsonl must exist.

Python
import json, os

os.makedirs("./data/processed", exist_ok=True)

# Each line must be a JSON object with a "text" field
# Apply your model's chat template before saving
from mlx_lm import load
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

examples = [
    {"prompt": "What is LoRA?",  "response": "LoRA is Low-Rank Adaptation..."},
    {"prompt": "Define fine-tuning.", "response": "Fine-tuning is..."},
    # ... your data
]

def to_mlx_format(ex):
    msgs = [
        {"role": "user",      "content": ex["prompt"]},
        {"role": "assistant", "content": ex["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

formatted = [to_mlx_format(e) for e in examples]

# 90/10 train/val split
split = int(len(formatted) * 0.9)
for fname, rows in [("train.jsonl", formatted[:split]), ("valid.jsonl", formatted[split:])]:
    with open(f"./data/processed/{fname}", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

print(f"Train: {split} examples | Val: {len(formatted)-split} examples")

Chapter 12 Performance Gains โ€” Speed, Memory & Throughput

Whether you're on NVIDIA or Apple Silicon, these techniques are additive โ€” stack them in order of effort for the highest return. None of them require changing your model or your data.

12.1 The Optimisation Stack โ€” Apply in This Order

โ‘ 
Switch to bfloat16 โ€” one word, instant win
bfloat16 has a wider dynamic range than float16, meaning fewer NaN/inf errors during training. Every A100, H100, and M-series chip supports it natively. Change torch_dtype=torch.float16 to torch.bfloat16 everywhere.
โ‘ก
Quantise the base model โ€” fit a larger model in your VRAM
4-bit NF4 (via bitsandbytes on CUDA) or 4-bit GGUF/MLX (on Apple) cuts model size by 4ร—. A 7B model drops from ~14 GB to ~4โ€“5 GB with less than 2% quality loss on most tasks.
โ‘ข
Enable gradient checkpointing โ€” trade 20% compute for 30โ€“40% less VRAM
Instead of storing all intermediate activations during the forward pass, recompute them during backprop. One flag in TrainingArguments. Essential for fine-tuning on consumer GPUs and MPS Macs.
โ‘ฃ
Gradient accumulation โ€” simulate large batches on small hardware
Run N forward passes before each optimiser step. Effective batch = per_device_batch ร— accumulation_steps. Use a small per-device batch (1โ€“2) and accumulate 8โ€“16 steps.
โ‘ค
Flash Attention 2 (CUDA only) โ€” free 2โ€“4ร— speed in attention layers
Restructures the attention computation so the full attention matrix is never materialised in GPU memory, making it compute-bound rather than memory-bandwidth-bound. One line to enable. MPS does not support Flash Attention — use sequence packing instead for throughput on Apple Silicon.
โ‘ฅ
Sequence packing โ€” eliminate wasted padding
If your examples are short (under 512 tokens), packing multiple examples into one sequence removes padding waste and can 2โ€“4ร— your training throughput. Works on both CUDA and MPS.
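Step ④ is worth seeing in miniature: dividing each micro-batch loss by the accumulation count makes the summed gradients match a single full-batch step. A toy sketch with one linear layer (not a training recipe — the Trainer does this for you):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
data, target = torch.randn(8, 4), torch.randn(8, 1)

# Reference: one full-batch backward pass
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2, each loss scaled by 1/accum_steps
model.zero_grad()
accum_steps = 4
for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    (loss_fn(model(x), y) / accum_steps).backward()   # gradients sum across calls
# optimizer.step() would run here, once per accumulation cycle

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))   # True, up to float rounding
```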

12.2 Quantisation Options Compared

Method | Bits | Quality Loss | Hardware | Best For
--- | --- | --- | --- | ---
NF4 (bitsandbytes) | 4-bit | ~1–2% | CUDA only | QLoRA training, CUDA inference
GPTQ | 4-bit | ~1–2% | CUDA | Fast CUDA inference, pre-quantised Hub models
AWQ | 4-bit | ~0.5% | CUDA | Best quality 4-bit on CUDA
GGUF Q4_K_M | 4-bit | ~1% | CPU + MPS + CUDA | Apple Silicon via Ollama/llama.cpp
GGUF Q8_0 | 8-bit | <0.5% | CPU + MPS + CUDA | Higher quality, 2× size of Q4
MLX 4-bit | 4-bit | ~1% | Apple Silicon only | MLX inference and LoRA fine-tuning on Mac

12.3 Key Training Flags โ€” All in One Place

Python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",

    # --- Memory ---
    gradient_checkpointing=True,        # recompute activations, saves 30-40% VRAM
    gradient_accumulation_steps=8,      # effective batch = per_device x 8
    per_device_train_batch_size=2,      # raise until you hit OOM, then lower by 1

    # --- Precision ---
    bf16=True,                          # bfloat16: safer than fp16, same speed
    # fp16=True,                        # use instead on older GPUs without bf16

    # --- Speed (CUDA) ---
    # attn_implementation="flash_attention_2"  # set in model.from_pretrained instead
    dataloader_num_workers=4,           # parallel data loading (set 0 on Windows)
    dataloader_pin_memory=True,         # faster CPU->GPU transfers (CUDA only)

    # --- Eval memory safety ---
    per_device_eval_batch_size=1,
    eval_accumulation_steps=4,

    # --- Logging, eval, and saving ---
    logging_steps=10,
    eval_strategy="epoch",              # required by load_best_model_at_end; must match save_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,        # auto-keeps best checkpoint by val loss
    metric_for_best_model="eval_loss",
    report_to="wandb",                  # or "none" to disable tracking
)

12.4 Flash Attention 2 โ€” CUDA Only, One Line

Python
# pip install flash-attn --no-build-isolation (CUDA only, not MPS)
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",   # โ† the entire change
)

# Verify it took effect
print(model.config._attn_implementation)      # โ†’ flash_attention_2

12.5 Sequence Packing โ€” Works on CUDA and MPS

Python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,   # packs multiple short examples into one sequence
                    # set False if your examples are already near max_seq_length
)
# Rule of thumb: if average example is under 30% of max_seq_length, packing helps
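Under the hood, packing amounts to greedily concatenating tokenised examples until the context is full. A simplified sketch — trl's real implementation also handles attention masking across example boundaries, and `eos_id` here is a stand-in separator token:

```python
def pack(sequences, max_len, eos_id=0):
    """Greedily concatenate tokenised examples into sequences of at most max_len tokens."""
    packed, current = [], []
    for seq in sequences:
        if current and len(current) + len(seq) + 1 > max_len:
            packed.append(current)       # current sequence is full -- start a new one
            current = []
        current.extend(seq + [eos_id])   # eos separates examples within a pack
    if current:
        packed.append(current)
    return packed

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack(examples, max_len=8))   # two packed sequences instead of four padded ones
```

Note this sketch keeps an over-long example whole; real packers truncate to `max_len`.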

12.6 torch.compile โ€” Free Speed on PyTorch 2.x

Python
import torch

# First call compiles (slow โ€” up to several minutes). All subsequent calls are fast.
model = torch.compile(model, mode="reduce-overhead")

# mode options:
# "default"          โ†’ safe, ~10-20% speedup, wide compatibility
# "reduce-overhead"  โ†’ reduces Python overhead, good for LLMs, ~20-40% faster
# "max-autotune"     โ†’ aggressive compile, fastest runtime, CUDA only
# fullgraph=False    โ†’ safer if model has dynamic control flow

12.7 Monitor What Your Hardware Is Actually Doing

Bash
# NVIDIA — live GPU stats every 0.5 seconds
watch -n 0.5 nvidia-smi

# Better NVIDIA monitor (pip install nvitop)
nvitop

# Apple Silicon — command-line GPU power and utilisation
sudo powermetrics --samplers gpu_power -i 1000 -n 10

Python
# CUDA memory snapshot during training
# (on Apple Silicon, use torch.mps.current_allocated_memory() instead)
import torch
print(f"Allocated : {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved  : {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Peak alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# Reset peak tracking between runs
torch.cuda.reset_peak_memory_stats()
โš ๏ธ
GPU under 60% utilisation during training? Your data loader is the bottleneck. Pre-tokenise your dataset once and save it with dataset.save_to_disk(). Then load the pre-tokenised version at training time โ€” never re-tokenise on each epoch. Also increase dataloader_num_workers to 4.

Chapter 13 Fine-Tuning Fundamentals โ€” What Every New User Must Know

Fine-tuning looks simple from the outside โ€” load a model, run a trainer, save weights. The reality has a dozen invisible traps that silently produce a worse model while the loss curve looks fine. This chapter is everything you need internalised before running a single training step.

13.1 The Core Mental Model

What Fine-Tuning Actually Does

A base model has already learned how language works โ€” grammar, reasoning, world knowledge โ€” from hundreds of billions of tokens. Fine-tuning does not inject new knowledge. It teaches the model how to behave: which format to use, which tone to adopt, which vocabulary to prioritise. If you want it to know new facts โ†’ use RAG. If you want it to act differently โ†’ fine-tune.

13.2 The Four Training Modes

Mode | What It Teaches | Data Needed | When to Use It
--- | --- | --- | ---
SFT (Supervised Fine-Tuning) | Follow a specific instruction format and style | Prompt → Response pairs (500–10K) | First step for almost every use case
Continued Pre-Training | New domain vocabulary and concepts | Raw domain text (millions of tokens) | Deeply specialised domains — medical, legal, novel languages
DPO (Direct Preference Optimisation) | Prefer one response style over another | Chosen / rejected pairs (1K+) | After SFT, to align tone, safety, or output quality
LoRA / QLoRA | Same as SFT — but via efficient adapter layers | Same as SFT | Consumer hardware — almost always your first choice
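For the DPO row, the data shape matters more than the code: trl's DPOTrainer expects `prompt`, `chosen`, and `rejected` columns. A minimal example record (content invented for illustration):

```python
import json

# One DPO record: the same prompt, a preferred and a dispreferred completion
dpo_example = {
    "prompt": "Summarise LoRA in one sentence.",
    "chosen": "LoRA freezes the base weights and trains small low-rank adapter matrices instead.",
    "rejected": "LoRA is a thing that changes models somehow.",
}
print(json.dumps(dpo_example, indent=2))
```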

13.3 Data Preparation โ€” Where 80% of Failures Come From

Python
import json
from datasets import Dataset

# STEP 1: Define one format and apply it 100% consistently
def format_example(prompt: str, response: str) -> dict:
    return {
        "text": (
            f"### Instruction:\n{prompt}\n\n"
            f"### Response:\n{response}"
        )
    }

# STEP 2: Validate every example โ€” the model learns your bugs too
def validate(ex: dict) -> list:
    issues = []
    if not ex.get("prompt") or len(ex["prompt"].strip()) < 10:
        issues.append("prompt too short or empty")
    if not ex.get("response") or len(ex["response"].strip()) < 20:
        issues.append("response too short or empty")
    if ex.get("prompt") == ex.get("response"):
        issues.append("prompt equals response")
    if len(ex.get("response", "")) > 6000:
        issues.append("response suspiciously long โ€” possible duplicate or corruption")
    return issues

# STEP 3: Load, audit, fix
with open("my_data.jsonl") as f:
    raw = [json.loads(line) for line in f]
bad = [(i, issues) for i, ex in enumerate(raw) if (issues := validate(ex))]
print(f"Total: {len(raw)} | Bad: {len(bad)}")
for idx, issues in bad[:10]:
    print(f"  Row {idx}: {issues}")

# STEP 4: Filter and format clean examples only
clean = [format_example(ex["prompt"], ex["response"])
         for ex in raw if not validate(ex)]
dataset = Dataset.from_list(clean)

13.4 Chat Templates โ€” The Invisible Bug

Every instruct model was trained with a specific prompt format. Using the wrong one is one of the most common silent failures โ€” the model degrades with no error message.

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# WRONG: hand-rolled format almost certainly mismatches what the model expects
wrong = "### Instruction:\nWhat is LoRA?\n\n### Response:\n"

# RIGHT: apply_chat_template always produces the exact format the model expects
messages = [
    {"role": "system", "content": "You are a helpful ML assistant."},
    {"role": "user",   "content": "What is LoRA?"},
]
correct = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,   # adds the assistant turn opener
)
print(correct)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are...

# Apply the template at data-prep time — not on the fly inside the training loop
def format_with_template(example):
    msgs = [
        {"role": "user",      "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

dataset = dataset.map(format_with_template)

13.5 Train / Val / Test Split โ€” Non-Negotiable

Python
from datasets import Dataset
import random
import random

random.seed(42)
data = list(dataset)
random.shuffle(data)

n = len(data)
train = data[:int(n * 0.80)]
val   = data[int(n * 0.80):int(n * 0.90)]
test  = data[int(n * 0.90):]

# Save to disk before any training โ€” never mix splits
Dataset.from_list(train).save_to_disk("./data/splits/train")
Dataset.from_list(val).save_to_disk("./data/splits/val")
Dataset.from_list(test).save_to_disk("./data/splits/test")
print(f"Train: {len(train)} | Val: {len(val)} | Test: {len(test)}")
โš ๏ธ
Split before any preprocessing. If you deduplicate after splitting, examples from your test set may have leaked into training. Split first, then process each split independently.
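A minimal way to honour that rule: remove exact duplicates (after whitespace and case normalisation) before calling any split code. This is a sketch — real pipelines also do near-duplicate detection, which is out of scope here:

```python
import hashlib

def dedup(examples):
    """Drop exact duplicates by normalised-text hash -- run BEFORE splitting."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(" ".join(ex["text"].lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

rows = [{"text": "What is LoRA?"}, {"text": "what  is lora?"}, {"text": "Define SFT."}]
print(len(dedup(rows)))   # 2 -- the near-identical pair collapses to one entry
```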

13.6 Reading Your Loss Curves

๐Ÿ“‰
Both losses drop together

Healthy training โ€” the model is generalising. Continue or stop when val loss plateaus.

โœ… Keep going
๐Ÿ“ˆ
Val rises, train drops

Overfitting โ€” the model is memorising. Stop here and use the checkpoint where val was lowest.

๐Ÿ›‘ Stop early
๐Ÿ”€
Loss spikes suddenly

Learning rate too high or a corrupted batch. Halve the learning rate and relaunch.

โš ๏ธ Lower LR
๐Ÿ“Š
Loss stalls from step 1

LR too low, frozen layers, or wrong LoRA target_modules. Check your config.

โš ๏ธ Check config
๐ŸŽฏ
Val loss = NaN

Data issue — an empty batch, a bad label, or a zero-length sequence. Validate and log your data before training starts.

๐Ÿ› Debug data
๐Ÿ
Both losses floor out

The model has learned everything in your data. More epochs won't help โ€” get more diverse data.

โœ… Evaluate now

13.7 The Fastest Path from Idea to Working Fine-Tune

โ‘ 
Always start from an instruct-tuned checkpoint, not a base model
Llama-3.2-3B-Instruct, Phi-3.5-mini-instruct, Mistral-7B-Instruct โ€” these already follow instructions. Your SFT only needs to adjust behaviour, not teach instruction-following from scratch. Base models require far more data to be useful.
โ‘ก
Write 200 examples by hand before touching any training code
If you cannot write 200 clear prompt โ†’ response pairs yourself, your task is not well-defined enough to fine-tune. This is not a suggestion. Good data beats good code every time. Spend a day on this.
โ‘ข
Run a 10-step smoke test before committing to a full run
Set max_steps=10 in TrainingArguments. If you see OOM, NaN loss, or no GPU activity, fix it now โ€” not after 6 hours. Confirm: loss is not NaN, a checkpoint saves, and your monitor shows GPU activity.
โ‘ฃ
Train 1 epoch, evaluate fully, then decide on more epochs
Many tasks are done after one epoch. Evaluate on your held-out test set before spending more compute. You will often be surprised how little training is needed โ€” especially with LoRA.
โ‘ค
Keep the LoRA adapter, not the merged model, until you are satisfied
A LoRA adapter is a few hundred MB. The merged model is full size. Adapters can be re-merged anytime. Keep adapters from each epoch and only merge the winner after your evaluation is done.
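Step ③ in code: the same TrainingArguments you will use for the real run, capped at ten steps. The output path is a placeholder:

```python
from transformers import TrainingArguments

smoke_args = TrainingArguments(
    output_dir="./smoke-test",
    max_steps=10,                    # overrides num_train_epochs for this run only
    per_device_train_batch_size=1,   # smallest batch -- OOM here means trouble later
    logging_steps=1,                 # print the loss on every step so NaN shows immediately
    save_strategy="steps",
    save_steps=10,                   # confirms a checkpoint actually writes to disk
    report_to="none",                # no need to track a smoke test
)
```

Swap `smoke_args` for your real arguments only after all three checks pass: loss is not NaN, the checkpoint appears, and your GPU monitor shows activity.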

13.8 Sanity Checks Before You Ship Anything

๐Ÿง 
The mindset that separates fast learners: Treat your first fine-tune as a debugging exercise, not a production run. The goal is to get the full loop working โ€” data in, model trained, evaluation completed โ€” with any result at all. A model that trains without crashing and produces coherent (even if imperfect) output is a success. From that baseline, iterate on data quality. Everything else is tuning noise.