A practical, no-fluff guide to choosing a large language model on Hugging Face, setting up your local environment, downloading a model, preparing data, and fine-tuning it for your specific use case, written for people who are new to the whole pipeline.
This guide walks the complete path from zero knowledge to running and retraining a real LLM locally. We cover the questions to ask before picking any model, how to create your Hugging Face account, download a model with the CLI, set up your Python environment, understand what data you can use, and walk through concrete retraining use cases.
Read sequentially if you are completely new. Experienced practitioners can jump to any chapter; each is self-contained. All code is tested on Python 3.10+ and works on both CUDA (NVIDIA) and Apple Silicon (MPS).
Picking the wrong model wastes weeks. Before opening Hugging Face, answer these questions honestly. There are no wrong answers, only answers that point you to the right model family.
Hugging Face hosts hundreds of thousands of models, but most derive from a small set of foundation families. Knowing the families shortcuts the search dramatically.
| Family | Best For | Sizes Available | Licence |
|---|---|---|---|
| Llama 3 / 3.1 / 3.2 (Meta) | General-purpose, instruction following, coding, reasoning | 1B, 3B, 8B, 70B, 405B | Llama Community (commercial use allowed under 700M MAU) |
| Mistral / Mixtral | Fast inference, European language support, MoE efficiency | 7B, 8×7B, 8×22B | Apache 2.0 (fully open) |
| Phi-3 / Phi-4 (Microsoft) | Tiny but capable: on-device, constrained hardware | 3.8B, 7B, 14B | MIT |
| Qwen 2.5 (Alibaba) | Multilingual, math, code, long-context | 0.5B–72B | Apache 2.0 |
| Gemma 2 (Google) | Safe, efficient, strong benchmarks at small sizes | 2B, 9B, 27B | Gemma Terms (custom open) |
| DeepSeek-R1 / V3 | Reasoning, math, code | 7B–671B | MIT |
| BERT / RoBERTa / DeBERTa | Classification, NER, embeddings (encoder-only) | 110M–435M | Apache 2.0 |
| Whisper (OpenAI) | Audio → text transcription | tiny to large-v3 | MIT |
**Base models.** Pre-trained on raw text. They complete text; they don't follow instructions. Use these as the starting point for your own fine-tuning pipeline. Examples: meta-llama/Llama-3.1-8B, mistralai/Mistral-7B-v0.1.

**Instruct models.** Fine-tuned to follow prompts and chat. Use these for immediate deployment or as a starting point for task-specific fine-tuning. Examples: meta-llama/Llama-3.1-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.3.
| Model Size | 4-bit (VRAM) | 16-bit (VRAM) | Minimum Hardware |
|---|---|---|---|
| 1–3B params | ~1–2 GB | ~3–6 GB | Any modern laptop, M1 8GB |
| 7–8B params | ~5–6 GB | ~14–16 GB | M1 Pro 16GB, RTX 3060 |
| 13B params | ~9 GB | ~26 GB | M2 Max 32GB, RTX 3090 |
| 34–40B params | ~22 GB | ~70 GB | M2 Ultra 64GB, A6000 |
| 70B params | ~40 GB | ~140 GB | Multi-GPU, A100 80GB |
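The table's numbers follow from a simple rule of thumb: weights alone take roughly `params × bytes_per_param`, plus some margin for activations and the KV cache. A back-of-envelope helper (my own sketch, not from any library — the 20% overhead figure is an assumption):

```python
def estimate_vram_gb(n_params_billion: float, bits: int, overhead: float = 0.2) -> float:
    """Rough VRAM needed to load a model: weights plus a safety margin.

    n_params_billion: parameter count in billions (e.g. 8 for Llama-3.1-8B)
    bits: weight precision (16 for fp16/bf16, 8 or 4 for quantised)
    overhead: extra fraction for activations, KV cache, framework buffers
    """
    weight_gb = n_params_billion * 1e9 * (bits / 8) / 1e9
    return weight_gb * (1 + overhead)

# An 8B model in 16-bit: ~19 GB with overhead (the table's ~14-16 GB is weights only)
print(f"{estimate_vram_gb(8, 16):.1f} GB")  # 19.2 GB
# The same model in 4-bit: ~5 GB, matching the table
print(f"{estimate_vram_gb(8, 4):.1f} GB")   # 4.8 GB
```

If the estimate exceeds your VRAM (or unified memory) even at 4-bit, drop to a smaller model family rather than fighting out-of-memory errors.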
```bash
# Install the CLI (comes with the huggingface_hub package)
pip install huggingface_hub

# Log in - this saves your token to ~/.cache/huggingface/token
huggingface-cli login
# Paste your token when prompted.
# Your token is now saved - no need to pass it to every API call.

# Verify you're logged in
huggingface-cli whoami
```
Downloaded models are stored in ~/.cache/huggingface/hub/ by default. Each model is stored once and shared across all scripts that reference it. You can change the location with the environment variable:
```bash
# Change the cache directory (add to your .zshrc or .bashrc)
export HF_HOME=/path/to/your/drive/huggingface_cache
# For large models (70B+), point this to an external SSD
```
Method 1: CLI snapshot download (recommended for first-timers)
```bash
# Download an entire model to a local folder
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
  --local-dir ./models/llama-3.2-3b-instruct

# Download a specific file only (e.g. the config)
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct config.json
```
Method 2: Python API (most flexible)
```python
from huggingface_hub import snapshot_download

# Downloads to ~/.cache/huggingface/hub/ automatically
# Subsequent calls use the cache - no re-download
model_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    ignore_patterns=["*.pt", "*.ot"],  # skip legacy weights if present
)
print(f"Model cached at: {model_path}")
```
Method 3: via transformers (laziest, good for prototyping)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # use bfloat16 on newer hardware
    device_map="auto",          # auto-detects CUDA, MPS, or CPU
)

# Quick test
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
If your hardware can't fit the full model, use a quantised version. The easiest way is via bitsandbytes for CUDA or pre-quantised GGUF files via llama-cpp-python for Apple Silicon.
```python
# 4-bit quantisation on load (CUDA / bitsandbytes)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NF4 is the best-quality 4-bit type
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```
```bash
# Apple Silicon: use GGUF via Ollama (easiest local setup)
# Install: https://ollama.com
ollama pull llama3.2:3b
ollama run llama3.2:3b
```
```python
from transformers import AutoConfig

# Check the config loads without errors
config = AutoConfig.from_pretrained("./models/llama-3.2-3b-instruct")
print(f"Architecture: {config.architectures}")
print(f"Context length: {config.max_position_embeddings}")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")
```
| Data Type | Used For | Format | Minimum Quantity |
|---|---|---|---|
| Instruction pairs | Supervised fine-tuning (SFT) | {"prompt": "...", "response": "..."} | 500–2,000 high quality |
| Conversation threads | Chat fine-tuning | ShareGPT / ChatML format | 1,000+ turns |
| Raw domain text | Continued pre-training | Plain text, JSONL | Millions of tokens |
| Preference pairs | DPO / RLHF alignment | {"chosen": "...", "rejected": "..."} | 1,000+ pairs |
| Classification labels | Fine-tuning classifiers | {"text": "...", "label": 0} | 500+ per class |
Every format in the table is supported by the `datasets` library, which can load any of them in one line.

What "quality" means for instruction data: clear, unambiguous instructions; responses that are complete and correct; diversity across topics and styles; no contradictions within the dataset; no degenerate examples (empty strings, encoding errors, repetitive loops).
```python
from datasets import load_dataset

# Load a popular instruction dataset from the Hub
ds = load_dataset("HuggingFaceH4/ultrachat_200k")

# Load your own JSONL file
ds = load_dataset("json", data_files={"train": "my_data.jsonl"})

# Basic quality checks
print(f"Total examples: {len(ds['train'])}")
print(f"Columns: {ds['train'].column_names}")
print(f"First example: {ds['train'][0]}")

# Filter out short responses (likely low quality)
ds = ds.filter(lambda x: len(x["response"]) > 100)
```
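Beyond length filtering, exact duplicates are worth catching before training: a repeated example is effectively a higher-weighted one. A stdlib-only sketch using normalised hashing (`dedupe` is my own helper; the `prompt`/`response` field names match the JSONL format above):

```python
import hashlib

def dedupe(examples: list[dict]) -> list[dict]:
    """Drop exact duplicates after light normalisation (case, whitespace)."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            " ".join((ex["prompt"] + ex["response"]).lower().split()).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"prompt": "What is LoRA?", "response": "A low-rank adapter method."},
    {"prompt": "what is lora?", "response": "A  low-rank adapter method."},  # case/space variant
    {"prompt": "Define SFT.", "response": "Supervised fine-tuning."},
]
print(len(dedupe(data)))  # 2 - the normalised duplicate is dropped
```

For near-duplicates that differ by more than casing and whitespace, MinHash or embedding similarity is the heavier-weight follow-up.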
```bash
# Python 3.10 or 3.11 recommended (3.12 has some library gaps)
python --version

# Create a dedicated environment for LLM work
python -m venv llm_env
source llm_env/bin/activate   # Linux / macOS
# llm_env\Scripts\activate    # Windows

# Or with conda (easier for CUDA management)
conda create -n llm_env python=3.11
conda activate llm_env
```
```bash
# Core Hugging Face stack
pip install transformers datasets accelerate peft trl

# Weights & Biases for experiment tracking (optional but recommended)
pip install wandb

# For NVIDIA GPU (CUDA 12.1 - match your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For Apple Silicon (MPS backend is built into PyTorch)
pip install torch torchvision torchaudio

# For 4-bit quantisation on CUDA (bitsandbytes)
pip install bitsandbytes

# For Flash Attention 2 (speeds up training significantly on CUDA)
pip install flash-attn --no-build-isolation
```
```python
import torch
import transformers

# Check PyTorch version and device availability
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

# Should print device info - not "cpu" if you have a GPU
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using device: {device}")

# Quick GPU memory check (CUDA only)
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```
```
my_llm_project/
├── data/
│   ├── raw/             # original source files
│   ├── processed/       # cleaned JSONL ready for training
│   └── splits/          # train / val / test splits
├── models/
│   └── base/            # downloaded base model (or symlink to HF cache)
├── checkpoints/         # saved during training
├── outputs/             # final merged model weights
├── scripts/
│   ├── prepare_data.py
│   ├── train.py
│   └── evaluate.py
├── configs/
│   └── lora_config.yaml
└── requirements.txt
```
Add `.env` to your `.gitignore` and store your HF token there. Load it with python-dotenv. Never commit access tokens to version control, even in private repos.

LoRA (Low-Rank Adaptation) is the standard fine-tuning technique for LLMs on consumer hardware. Instead of updating all billions of parameters, LoRA injects small trainable weight matrices at key layers. This reduces memory by 10–100× with minimal quality loss.
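For the curious, python-dotenv's `load_dotenv()` does essentially the following; a dependency-free sketch of the mechanics (`load_env_file` and the `HF_TOKEN` variable name are my own conventions, not requirements):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): reads KEY=value lines.

    setdefault() means a variable already present in the environment wins,
    matching dotenv's default behaviour.
    """
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# .env contains e.g.:  HF_TOKEN=hf_xxxxxxxx
load_env_file()
token = os.environ.get("HF_TOKEN")  # pass via token=... to from_pretrained() if needed
```

In practice just `pip install python-dotenv` and call `load_dotenv()`; the point is that the token never appears in your source files.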
```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # or SEQ_CLS for classification
    r=16,               # rank: higher = more capacity, more memory (try 8-64)
    lora_alpha=32,      # scaling factor - rule of thumb: 2x rank
    lora_dropout=0.05,
    # Which layers to target - these are standard for most LLaMA-family models
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)
```
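The memory win is easy to quantify. Each targeted weight matrix of shape `d_in × d_out` gains two low-rank factors, `A (d_in × r)` and `B (r × d_out)`, so the trainable parameters per matrix are `r × (d_in + d_out)`. A quick calculation (the dimensions below are illustrative 3B-class assumptions, not exact figures for any specific checkpoint):

```python
def lora_params(r: int, layer_shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_block * n_layers

# Hypothetical dims: hidden 3072, grouped-KV at 1024, MLP 8192, 28 blocks
shapes = [
    (3072, 3072), (3072, 1024), (3072, 1024), (3072, 3072),  # q, k, v, o
    (3072, 8192), (3072, 8192), (8192, 3072),                # gate, up, down
]
total = lora_params(r=16, layer_shapes=shapes, n_layers=28)
print(f"{total:,} trainable params")  # tens of millions, vs ~3 billion total
```

Doubling `r` doubles the adapter size linearly, which is why `r=8` to `r=16` is the usual starting range rather than `r=256`.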
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
from trl import SFTTrainer
from datasets import load_dataset
import torch

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
DATA_PATH = "./data/processed/train.jsonl"
OUTPUT_DIR = "./checkpoints"

# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # required for batching

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Apply LoRA
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints: trainable params: 41,943,040 || all params: 3,254,878,208 || 1.29%

# 3. Load dataset
dataset = load_dataset("json", data_files={"train": DATA_PATH})

def format_prompt(example):
    return {"text": f"### Instruction:\n{example['prompt']}\n\n### Response:\n{example['response']}"}

dataset = dataset.map(format_prompt)

# 4. Training arguments
args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 16
    warmup_ratio=0.03,
    learning_rate=2e-4,
    bf16=True,                      # bfloat16 on modern hardware
    logging_steps=10,
    save_strategy="epoch",
    report_to="wandb",              # or "none" to disable
)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()

# 6. Save the LoRA adapter (small - only a few hundred MB)
model.save_pretrained("./outputs/lora_adapter")
tokenizer.save_pretrained("./outputs/lora_adapter")
```
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model + adapter, then merge
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.float16,
    device_map="cpu",  # merge on CPU to avoid a VRAM spike
)
model = PeftModel.from_pretrained(base_model, "./outputs/lora_adapter")
merged = model.merge_and_unload()  # fuses adapter weights into the base

# Save the complete merged model
merged.save_pretrained("./outputs/merged_model")
AutoTokenizer.from_pretrained("./outputs/lora_adapter").save_pretrained(
    "./outputs/merged_model"
)
```
| Parameter | Recommended Start | Effect of Increasing |
|---|---|---|
| `r` (LoRA rank) | 8–16 | More capacity, more memory, slower training |
| `learning_rate` | 2e-4 | Too high → loss diverges. Too low → slow convergence |
| `num_train_epochs` | 2–5 | Too many → overfitting (model memorises training data) |
| `max_seq_length` | 1024–2048 | More context, but quadratic memory cost |
| `gradient_accumulation_steps` | 4–8 | Simulates a larger batch without extra VRAM |
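Two of these knobs interact: the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps × n_gpus`, and together with your dataset size it fixes the number of optimiser steps. Worth computing before launch (a small helper of my own, not a library function):

```python
import math

def training_steps(n_examples: int, per_device_batch: int,
                   grad_accum: int, n_gpus: int = 1, epochs: int = 3) -> tuple[int, int]:
    """Return (effective_batch_size, total_optimiser_steps)."""
    effective = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective)
    return effective, steps_per_epoch * epochs

# 2,000 examples with the chapter's defaults: batch 2, accumulation 8, 3 epochs
eff, steps = training_steps(2000, per_device_batch=2, grad_accum=8)
print(eff, steps)  # 16 effective batch, 375 optimiser steps total
```

If the total comes out under a hundred steps or so, the warmup ratio and logging cadence both need rethinking; at several thousand steps, consider fewer epochs.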
Set `load_best_model_at_end=True` in `TrainingArguments` to automatically keep the best checkpoint.

Fine-tuning is not always the right answer. This chapter maps common use cases to the appropriate model choice and training strategy.
- **Medical and healthcare**: summarising clinical notes, ICD coding, drug interaction QA. Use BioMedLM or fine-tune Mistral-7B on PubMed + MIMIC. Domain vocabulary drifts heavily from general pre-training. Verdict: fine-tune recommended.
- **Legal**: contract clause extraction, case summarisation, precedent search. Fine-tune on LexGLUE or proprietary contracts. Instruction-tuned general models struggle with precise legal phrasing. Verdict: fine-tune + RAG.
- **Coding assistants**: auto-complete, bug detection, test generation. DeepSeek-Coder and CodeLlama are already specialised. Fine-tune on your codebase for proprietary APIs and internal conventions. Verdict: specialised model first.
- **Customer support**: answer FAQs, route tickets, escalate appropriately. Fine-tune on historical ticket/resolution pairs. A 3B model fine-tuned on your data beats GPT-4 on your specific ticket vocabulary. Verdict: small model + fine-tune.
- **Document summarisation**: summarise reports, emails, research papers. Long-context models (8K+) work out of the box. Only fine-tune if you need a specific summary format or domain style. Verdict: base instruct model.
- **Low-resource languages**: models trained primarily on English underperform on Swahili, Welsh, Tagalog, etc. Continued pre-training on target-language text dramatically improves quality. Verdict: continued pre-training.
- **Named entity recognition**: extracting people, organisations, locations, and custom entities from text. Fine-tune a BERT/DeBERTa encoder, which is much smaller and faster than a generative model for pure classification. Verdict: encoder model (BERT).
- **Speech pipelines**: transcribe audio with Whisper, then feed it into an LLM for Q&A or summarisation. Fine-tune Whisper on domain-specific speech (accents, jargon, background noise). Verdict: Whisper + LLM pipeline.
- **Knowledge retrieval**: embed documents and retrieve relevant chunks at query time. Use sentence-transformers (BAAI/bge, E5) for the embedding model. Often more effective than fine-tuning for knowledge retrieval. Verdict: embedding model + RAG.

```python
import json
import torch
from transformers import pipeline

# Load your merged model
pipe = pipeline(
    "text-generation",
    model="./outputs/merged_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load a held-out test set (never seen during training)
with open("./data/splits/test.jsonl") as f:
    test_examples = [json.loads(l) for l in f]

# Run inference and compare
results = []
for ex in test_examples[:100]:
    output = pipe(ex["prompt"], max_new_tokens=200, do_sample=False)[0]
    results.append({
        "prompt": ex["prompt"],
        "expected": ex["response"],
        "generated": output["generated_text"],
    })

# Save for human review or automated scoring
with open("eval_results.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")
```
| Metric | Use When | Library |
|---|---|---|
| ROUGE-L | Summarisation tasks | evaluate (HF) |
| BLEU | Translation | sacrebleu |
| Exact Match / F1 | QA with ground truth | evaluate |
| Perplexity | Language model quality | transformers |
| Human eval / LLM-as-judge | Open-ended generation | Custom or GPT-4 scoring |
| Task-specific accuracy | Classification | sklearn |
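For QA-style evaluation, exact match and token-level F1 need no library at all; the sketch below follows the style popularised by SQuAD scoring (a simplified version, not the `evaluate` package's exact normalisation rules):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if prediction equals gold after trimming and lowercasing."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                                       # 1.0
print(token_f1("the capital is Paris", "Paris is the capital"))            # 1.0
print(round(token_f1("the capital is Paris", "Paris"), 2))                 # 0.4
```

Averaged over the `eval_results.jsonl` file from the previous step, these two numbers give a fast first read before any human review.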
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./outputs/merged_model")
tokenizer = AutoTokenizer.from_pretrained("./outputs/merged_model")

# Push to your personal hub (creates the repo automatically)
model.push_to_hub("your-username/my-finetuned-llama-3b")
tokenizer.push_to_hub("your-username/my-finetuned-llama-3b")

# Or push just the LoRA adapter (much smaller - recommended)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
adapter_model = PeftModel.from_pretrained(base_model, "./outputs/lora_adapter")
adapter_model.push_to_hub("your-username/my-lora-adapter")
```
```bash
# Option 1: vLLM (fastest, CUDA only)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/merged_model \
  --port 8000

# Option 2: Ollama (easiest, supports Apple Silicon)
# Convert to GGUF first, then:
ollama create my-model -f ./Modelfile
ollama run my-model

# Option 3: Hugging Face Text Generation Inference (TGI)
# Mount the local weights into the container so TGI can read them
docker run --gpus all -v $(pwd)/outputs:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id /data/merged_model \
  --port 8080
```
All three options expose an OpenAI-compatible API: point your existing client's `base_url` at the local server (e.g. `http://localhost:8000/v1` for vLLM). Zero application code changes.

Base models are not instruction-following. Always use an instruct-tuned checkpoint as your starting point for SFT unless you are doing continued pre-training.
Hold out at least 10% of data for validation and 10% for final test. Never evaluate on training data; it will always look better than it actually is.
2e-4 is a safe default for LoRA. Going to 1e-3 or higher will often cause the loss to spike. Start conservative and tune upward if training is slow.
3 epochs is often enough for SFT. After 5+, many models start memorising training examples verbatim rather than generalising. Monitor validation loss.
Each instruct model has a specific chat template (ChatML, Llama-3, etc.). Using the wrong format at inference will cause poor outputs even from a well-trained model. Use tokenizer.apply_chat_template().
One bad example in 10 is not a problem. One bad example in 100 starts to matter. One bad pattern repeated 50 times will dominate your model's outputs. Curate aggressively.
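That last point is checkable before training: a repeated bad pattern surfaces immediately in a frequency count over response openings. A quick heuristic sketch (`common_openings` is my own helper, not a standard tool):

```python
from collections import Counter

def common_openings(responses: list[str], n_words: int = 5, top: int = 3):
    """Most frequent first-n-words of responses - degenerate patterns surface here."""
    openings = Counter(" ".join(r.split()[:n_words]) for r in responses)
    return openings.most_common(top)

responses = ["As an AI model, I cannot help with that request."] * 50 + \
            ["Sure, here is the contract clause you asked about."] * 5
print(common_openings(responses))
# [('As an AI model, I', 50), ('Sure, here is the contract', 5)]
```

If one opening dominates and it is a refusal or a boilerplate apology, cut those examples before they become your model's favourite sentence.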
Apple Silicon (M1 through M4) is uniquely suited for on-device LLM work. Unified memory means the GPU and CPU share the same physical pool: a 64 GB M2 Max can hold models that exceed a 24 GB NVIDIA card's limit. But MPS and MLX have different strengths, and knowing which to reach for saves hours of debugging.

The MPS backend plugs into the Hugging Face ecosystem you already know: transformers, PEFT, and SFTTrainer all work out of the box. It is fastest for batch training workloads. But some ops silently fall back to CPU, and bitsandbytes 4-bit quantisation is not supported on MPS.

MLX is purpose-built by Apple for Apple Silicon. No device management is needed: unified memory means tensors just exist. It is faster than MPS for inference and single-item prediction, and the go-to for running quantised models and LoRA fine-tuning when bitsandbytes is not an option.
```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Tell accelerate to use MPS when device_map="auto"
os.environ["ACCELERATE_USE_MPS_DEVICE"] = "True"
# Allow unsupported ops to fall back to CPU instead of crashing
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# Verify MPS is active before loading anything
assert torch.backends.mps.is_available(), "MPS not available - check macOS and PyTorch versions"
assert torch.backends.mps.is_built(), "PyTorch was not built with MPS support"

# Quick sanity check
x = torch.ones(3, device="mps")
print(f"Tensor lives on: {x.device}")  # -> mps:0

# Load a model - use device_map="auto" (not "mps", which is not a valid map value)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.float16,  # float16 works on MPS; bfloat16 support varies
    device_map="auto",          # accelerate will route to MPS via the env var above
)
# Or for small models, manually place on the MPS device:
# model = model.to(torch.device("mps"))

# Confirm layers are actually on MPS
first_layer_device = next(model.parameters()).device
print(f"Model on: {first_layer_device}")  # should show mps:0, not cpu
```
MLX is Apple's own array framework. The mlx-lm package provides a complete pipeline for running and fine-tuning LLMs. There is no device management: all tensors share unified memory automatically.
```bash
# Install MLX and the LLM wrapper (Apple Silicon only - pip will reject on Intel/CUDA)
pip install mlx mlx-lm

# Run a quantised model - downloads from mlx-community on the Hugging Face Hub
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain attention mechanisms in one paragraph" \
  --max-tokens 300

# Fine-tune with LoRA - replaces bitsandbytes QLoRA on Apple Silicon
mlx_lm.lora \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --train \
  --data ./data/processed \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 16 \
  --learning-rate 1e-5

# Convert any Hugging Face model to MLX format yourself
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.2-3B-Instruct \
  --mlx-path ./models/llama-3.2-3b-mlx \
  --quantize \
  --q-bits 4
```
```python
from mlx_lm import load, generate

# Load - no device placement needed, unified memory handles it
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate - verbose=True prints live tokens/sec
response = generate(
    model,
    tokenizer,
    prompt="What is the difference between SFT and DPO?",
    max_tokens=300,
    verbose=True,  # prints token speed as output streams
)
print(response)
```
If your goal is just running models locally without configuring Python, Ollama is the fastest path. It uses llama.cpp with Metal acceleration and exposes an OpenAI-compatible API out of the box.
```bash
# Install via Homebrew (or download from https://ollama.com)
brew install ollama

# Start the background server
ollama serve

# Pull models - fully Metal-accelerated on Apple Silicon
ollama pull llama3.2:3b   # ~2 GB on disk
ollama pull mistral:7b    # ~4.1 GB
ollama pull phi4:14b      # ~8.5 GB

# Chat interactively
ollama run llama3.2:3b

# Or call the OpenAI-compatible REST API - works with any OpenAI SDK app
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hello!"}]}'
```
| Mac Config | Unified Memory | Max Model (4-bit) | Best Path |
|---|---|---|---|
| M1 / M2 base | 8 GB | 1–3B params | Ollama or MLX with Phi-3.5-mini-instruct |
| M1 Pro / M2 Pro | 16 GB | 7–8B params | MLX with Mistral-7B or Llama-3.2-3B comfortably |
| M1 Max / M2 Max | 32–38 GB | 13–14B params | Phi-4-14B or Llama-3.1-8B with headroom for fine-tuning |
| M2 Ultra / M3 Max | 64–96 GB | 34–40B params | Llama-3.3-70B in 4-bit, QwQ-32B |
| M2 Ultra (192 GB) | 192 GB | 70B+ params | Llama-3.1-70B in 8-bit, full fine-tuning of 13B models |
MLX LoRA expects your data in a specific JSONL format inside a folder. Both `train.jsonl` and `valid.jsonl` must exist.
```python
import json, os
from mlx_lm import load

os.makedirs("./data/processed", exist_ok=True)

# Each line must be a JSON object with a "text" field.
# Apply your model's chat template before saving.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

examples = [
    {"prompt": "What is LoRA?", "response": "LoRA is Low-Rank Adaptation..."},
    {"prompt": "Define fine-tuning.", "response": "Fine-tuning is..."},
    # ... your data
]

def to_mlx_format(ex):
    msgs = [
        {"role": "user", "content": ex["prompt"]},
        {"role": "assistant", "content": ex["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

formatted = [to_mlx_format(e) for e in examples]

# 90/10 train/val split
split = int(len(formatted) * 0.9)
for fname, rows in [("train.jsonl", formatted[:split]), ("valid.jsonl", formatted[split:])]:
    with open(f"./data/processed/{fname}", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

print(f"Train: {split} examples | Val: {len(formatted)-split} examples")
```
Whether you're on NVIDIA or Apple Silicon, these techniques are additive: stack them in order of effort for the highest return. None of them require changing your model or your data.
| Method | Bits | Quality Loss | Hardware | Best For |
|---|---|---|---|---|
| NF4 (bitsandbytes) | 4-bit | ~1–2% | CUDA only | QLoRA training, CUDA inference |
| GPTQ | 4-bit | ~1–2% | CUDA | Fast CUDA inference, pre-quantised Hub models |
| AWQ | 4-bit | ~0.5% | CUDA | Best-quality 4-bit on CUDA |
| GGUF Q4_K_M | 4-bit | ~1% | CPU + MPS + CUDA | Apple Silicon via Ollama/llama.cpp |
| GGUF Q8_0 | 8-bit | <0.5% | CPU + MPS + CUDA | Higher quality, 2× the size of Q4 |
| MLX 4-bit | 4-bit | ~1% | Apple Silicon only | MLX inference and LoRA fine-tuning on Mac |
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    # --- Memory ---
    gradient_checkpointing=True,     # recompute activations, saves 30-40% VRAM
    gradient_accumulation_steps=8,   # effective batch = per_device x 8
    per_device_train_batch_size=2,   # raise until you hit OOM, then lower by 1
    # --- Precision ---
    bf16=True,                       # bfloat16: safer than fp16, same speed
    # fp16=True,                     # use instead on older GPUs without bf16
    # --- Speed (CUDA) ---
    # attn_implementation="flash_attention_2"  # set in model.from_pretrained instead
    dataloader_num_workers=4,        # parallel data loading (set 0 on Windows)
    dataloader_pin_memory=True,      # faster CPU->GPU transfers (CUDA only)
    # --- Eval memory safety ---
    per_device_eval_batch_size=1,
    eval_accumulation_steps=4,
    # --- Logging and saving ---
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,     # auto-keeps the best checkpoint by val loss
    metric_for_best_model="eval_loss",
    report_to="wandb",               # or "none" to disable tracking
)
```
```python
# pip install flash-attn --no-build-isolation (CUDA only, not MPS)
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # <- the entire change
)

# Verify it took effect
print(model.config._attn_implementation)  # -> flash_attention_2
```
```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,  # packs multiple short examples into one sequence;
                   # set False if your examples are already near max_seq_length
)
# Rule of thumb: if the average example is under 30% of max_seq_length, packing helps
```
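That rule of thumb can be checked against your data before training. The sketch below uses whitespace tokens as a cheap proxy for real tokenizer counts (an approximation of my own; swap in `len(tokenizer(t)["input_ids"])` for exact numbers):

```python
def packing_worthwhile(texts: list[str], max_seq_length: int = 2048,
                       threshold: float = 0.3) -> bool:
    """True if the average example fills under `threshold` of the context window."""
    avg_tokens = sum(len(t.split()) for t in texts) / len(texts)
    return avg_tokens / max_seq_length < threshold

short_examples = ["short instruction response pair"] * 100  # ~4 tokens each
print(packing_worthwhile(short_examples))  # True - packing will help a lot
```

For typical instruction datasets with responses of a few hundred tokens against a 2048-token window, this almost always comes out True.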
```python
import torch

# The first call compiles (slow - up to several minutes). All subsequent calls are fast.
model = torch.compile(model, mode="reduce-overhead")

# mode options:
#   "default"         - safe, ~10-20% speedup, wide compatibility
#   "reduce-overhead" - reduces Python overhead, good for LLMs, ~20-40% faster
#   "max-autotune"    - aggressive compile, fastest runtime, CUDA only
# fullgraph=False     - safer if the model has dynamic control flow
```
```bash
# NVIDIA - live GPU stats every 0.5 seconds
watch -n 0.5 nvidia-smi

# Better NVIDIA monitor (pip install nvitop)
nvitop

# Apple Silicon - command-line GPU power and utilisation
sudo powermetrics --samplers gpu_power -i 1000 -n 10
```

```python
# CUDA memory snapshot from Python during training
import torch

print(f"Allocated : {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved  : {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Peak alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# Reset peak tracking between runs
torch.cuda.reset_peak_memory_stats()
```
Tokenise your dataset once and save the result with `dataset.save_to_disk()`. Then load the pre-tokenised version at training time; never re-tokenise on each epoch. Also increase `dataloader_num_workers` to 4.

Fine-tuning looks simple from the outside: load a model, run a trainer, save weights. The reality has a dozen invisible traps that silently produce a worse model while the loss curve looks fine. This chapter is everything you need internalised before running a single training step.
A base model has already learned how language works (grammar, reasoning, world knowledge) from hundreds of billions of tokens. Fine-tuning does not inject new knowledge. It teaches the model how to behave: which format to use, which tone to adopt, which vocabulary to prioritise. If you want it to know new facts, use RAG. If you want it to act differently, fine-tune.
| Mode | What It Teaches | Data Needed | When to Use It |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Follow a specific instruction format and style | Prompt → response pairs (500–10K) | First step for almost every use case |
| Continued Pre-Training | New domain vocabulary and concepts | Raw domain text (millions of tokens) | Deeply specialised domains: medical, legal, novel languages |
| DPO (Direct Preference Optimisation) | Prefer one response style over another | Chosen / rejected pairs (1K+) | After SFT, to align tone, safety, or output quality |
| LoRA / QLoRA | Same as SFT, but via efficient adapter layers | Same as SFT | Consumer hardware; almost always your first choice |
```python
import json
from datasets import Dataset

# STEP 1: Define one format and apply it 100% consistently
def format_example(prompt: str, response: str) -> dict:
    return {
        "text": (
            f"### Instruction:\n{prompt}\n\n"
            f"### Response:\n{response}"
        )
    }

# STEP 2: Validate every example - the model learns your bugs too
def validate(ex: dict) -> list:
    issues = []
    if not ex.get("prompt") or len(ex["prompt"].strip()) < 10:
        issues.append("prompt too short or empty")
    if not ex.get("response") or len(ex["response"].strip()) < 20:
        issues.append("response too short or empty")
    if ex.get("prompt") == ex.get("response"):
        issues.append("prompt equals response")
    if len(ex.get("response", "")) > 6000:
        issues.append("response suspiciously long - possible duplicate or corruption")
    return issues

# STEP 3: Load, audit, fix
raw = [json.loads(l) for l in open("my_data.jsonl")]
bad = [(i, validate(ex)) for i, ex in enumerate(raw) if validate(ex)]
print(f"Total: {len(raw)} | Bad: {len(bad)}")
for idx, issues in bad[:10]:
    print(f"  Row {idx}: {issues}")

# STEP 4: Filter and format clean examples only
clean = [format_example(ex["prompt"], ex["response"])
         for ex in raw if not validate(ex)]
dataset = Dataset.from_list(clean)
```
Every instruct model was trained with a specific prompt format. Using the wrong one is one of the most common silent failures: the model degrades with no error message.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# WRONG: a hand-rolled format almost certainly mismatches what the model expects
wrong = "### Instruction:\nWhat is LoRA?\n\n### Response:\n"

# RIGHT: apply_chat_template always produces the exact format the model expects
messages = [
    {"role": "system", "content": "You are a helpful ML assistant."},
    {"role": "user", "content": "What is LoRA?"},
]
correct = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # adds the assistant turn opener
)
print(correct)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are...

# Apply at data-prep time - not at inference time inside the training loop
def format_with_template(example):
    msgs = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

dataset = dataset.map(format_with_template)
```
```python
import random
from datasets import Dataset

random.seed(42)
data = list(dataset)
random.shuffle(data)

n = len(data)
train = data[:int(n * 0.80)]
val   = data[int(n * 0.80):int(n * 0.90)]
test  = data[int(n * 0.90):]

# Save to disk before any training - never mix splits
Dataset.from_list(train).save_to_disk("./data/splits/train")
Dataset.from_list(val).save_to_disk("./data/splits/val")
Dataset.from_list(test).save_to_disk("./data/splits/test")
print(f"Train: {len(train)} | Val: {len(val)} | Test: {len(test)}")
```
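After splitting, it is worth asserting that no example leaked across splits: duplicates that straddle train and test silently inflate your metrics. A sketch (`assert_no_leakage` is my own helper; it assumes each example dict has a `text` field, as produced above):

```python
def assert_no_leakage(train: list[dict], val: list[dict], test: list[dict]) -> None:
    """Raise if any identical `text` appears in more than one split."""
    t = {ex["text"] for ex in train}
    v = {ex["text"] for ex in val}
    s = {ex["text"] for ex in test}
    assert not (t & v), f"{len(t & v)} examples shared between train and val"
    assert not (t & s), f"{len(t & s)} examples shared between train and test"
    assert not (v & s), f"{len(v & s)} examples shared between val and test"

assert_no_leakage(
    [{"text": "a"}, {"text": "b"}],
    [{"text": "c"}],
    [{"text": "d"}],
)  # passes silently; any cross-split duplicate raises AssertionError
```

Run it once right after saving the splits, and again before any evaluation run that mixes data sources.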
- Healthy training: the model is generalising. Continue, or stop when val loss plateaus. Action: keep going.
- Overfitting: the model is memorising. Stop here and use the checkpoint where val loss was lowest. Action: stop early.
- Learning rate too high, or a corrupted batch. Halve the learning rate and relaunch. Action: lower the LR.
- LR too low, frozen layers, or wrong LoRA `target_modules`. Check your config. Action: check the config.
- Data issue: an empty batch, a bad label, or a zero-length sequence. Add pre-training validation logging. Action: debug the data.
- The model has learned everything in your data. More epochs won't help; get more diverse data. Action: evaluate now.
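The "stop where val loss was lowest" advice can be automated; `transformers` ships an `EarlyStoppingCallback` for exactly this, but the logic is simple enough to show standalone (a toy sketch of my own, not the library implementation):

```python
def best_checkpoint(val_losses: list[float], patience: int = 2) -> int:
    """Index of the epoch to keep: lowest val loss so far, stopping once it
    has failed to improve for `patience` consecutive epochs."""
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` epochs - stop early
    return best_idx

# Val loss falls, then rises: classic overfitting after epoch 2
print(best_checkpoint([1.9, 1.4, 1.2, 1.3, 1.5, 1.8]))  # 2
```

Pair this with `save_strategy="epoch"` and `load_best_model_at_end=True` from earlier, and the best checkpoint is kept without any manual curve-watching.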