Project 04

Fine-Tuning LLMs on Apple Silicon

In 2026, building a large language model from the ground up is not the goal — it's a distraction. The real engineering challenge is knowing how to take a world-class pretrained model, shape it to your needs, and validate that it's genuinely learning. This guide documents that full pipeline, end to end, on an M1 Mac.

🧠 Why We Fine-Tune in 2026 — Not Build From Scratch

Training a capable large language model from zero requires hundreds of GPUs running continuously for months, terabytes of curated training data, and infrastructure budgets that rival a small company's annual operating costs. That was the challenge Meta, Google, and Mistral tackled so that the rest of us don't have to.

What we do instead is fine-tuning: we take an open-weight model that already understands language, reasoning, and the world — then we specialize it. Using a technique called LoRA (Low-Rank Adaptation), we inject a small set of trainable adapter layers on top of a frozen base model. The result is a model that reflects your data, your domain, and your voice, built in hours rather than months, on hardware you already own.

This document covers the complete pipeline: choosing a base model, curating training data, running LoRA fine-tuning with Apple's MLX library, and — critically — validating that the model is actually learning and not just memorizing or degrading. The methodology here is what production ML teams use at scale, condensed for a 16 GB M1 Mac.

Hardware: M1 MacBook Pro · 16 GB unified memory
Framework: Apple MLX + mlx-lm
Base Model: Llama 3.1 8B Instruct
Method: LoRA fine-tuning
Output Format: GGUF → Ollama
Python: 3.11 (pyenv)

Part 00 Understanding Model Size

When people say "8B model," they almost always mean 8 billion parameters — not 8 gigabytes of storage. Here is how those numbers actually map to real-world hardware requirements:

| Term | What It Means |
| --- | --- |
| 8B parameters | ~15 GB on disk at float16 precision (roughly double that at float32) |
| 8B at 4-bit quantization | ~4–5 GB on disk — fits in 16 GB unified RAM |
| Training 8B from scratch | Needs 100+ GB VRAM — not feasible on 16 GB |
| Fine-tuning 8B with LoRA | Needs ~12–14 GB — just barely feasible on a 16 GB M1 |
| "Medium" LLM consensus | 1B–13B parameters is widely considered medium-sized |
💡
The recommendation: On an M1 Mac with 16 GB, do not train from scratch. Start from a pretrained open-weight model like Llama 3.1 8B or Mistral 7B and fine-tune it. Training from scratch requires months on hundreds of GPUs — fine-tuning takes hours on hardware you already own.
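The arithmetic behind those numbers is simple enough to check yourself: parameter count times bytes per parameter. Real checkpoint files run slightly larger, since embeddings and some layers often stay at higher precision.

```python
params = 8e9  # "8B" means 8 billion parameters

# Bytes per parameter at each precision; 4-bit packs two parameters per byte.
for precision, bytes_per_param in [('float32', 4), ('float16', 2), ('4-bit', 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f'{precision:>8}: ~{gb:.0f} GB of raw weights')
```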

Part 01 The Full Pipeline

From a blank terminal to a custom model running in Ollama, the pipeline has seven distinct phases. Each one builds on the last:

01 Choose a Base Model — Pick a pretrained open-weight model: Llama 3, Mistral 7B, Phi-3, or Gemma 2.
02 Gather Training Data — Curate conversation pairs, Q&A datasets, reasoning chains, and domain-specific content.
03 Fine-Tune with LoRA + MLX — Run LoRA training on your M1 Mac using Apple's MLX library, optimized for unified memory.
04 Evaluate Intelligence — Benchmark with lm-evaluation-harness before and after training. Save your baseline — it matters.
05 Merge & Quantize — Fuse LoRA adapter weights back into the base model, then quantize to 4-bit for efficient inference.
06 Convert to GGUF — Package the model in GGUF format, the standard Ollama and llama.cpp expect.
07 Register in Ollama — Write a Modelfile, set your system prompt and generation parameters, and run locally.

Part 02 Choosing Your Base Model

You never train a large model from scratch in 2026. You start from a pretrained base and customize it — think of it as tuning a precision engine rather than casting one from raw steel.

| Model | Notes |
| --- | --- |
| Llama 3.1 8B (Meta) | Best all-around for human conversation — widely supported, huge community |
| Mistral 7B v0.3 | Fast, efficient, excellent instruction following |
| Phi-3 Mini 3.8B | Tiny but surprisingly capable — ideal when RAM is tight |
| Gemma 2 9B (Google) | Strong reasoning, fits in 16 GB with 4-bit quantization |
| Qwen2.5 7B (Alibaba) | Excellent multilingual and code understanding |
🎯
Best choice for M1 16 GB: Llama 3.1 8B Instruct from Meta. Trained for human conversation by default, massive fine-tuning community, works natively with MLX on Apple Silicon.

Downloading the Base Model

# Install Hugging Face CLI
pip install huggingface_hub

# Login with your HF token
huggingface-cli login

# Download Llama 3.1 8B Instruct
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir ./models/llama-3.1-8b-instruct \
  --local-dir-use-symlinks False

Part 03 Training Data: What Makes a Model Smart

Data quality is the single most important variable in how capable your fine-tuned model becomes. Excellent training code on bad data still produces a bad model — there are no shortcuts here.

What Kind of Data You Need

For a model that sounds like a knowledgeable, natural human, you want: conversation pairs (user message + assistant response), Q&A datasets with high-quality answers, reasoning chains showing step-by-step thinking, domain-specific knowledge for your target use case, and diversity across topics, styles, and lengths.

| Dataset | Why Use It |
| --- | --- |
| OpenHermes-2.5 | 1M high-quality GPT-4 conversations — top recommendation |
| UltraChat 200k | Diverse multi-turn dialogue, very human-like |
| ShareGPT | Real user ChatGPT conversations — authentic patterns |
| Alpaca (52k) | Classic instruction dataset — solid starting point |
| OpenOrca | Reasoning-heavy — makes the model measurably smarter |
| LIMA (1,000 examples) | Small but exceptional quality — proves quality beats quantity |
📖
The LIMA insight (2023): 1,000 carefully curated examples outperform 52,000 mediocre ones. Start with OpenHermes-2.5 filtered down to 100k high-quality rows and you have an excellent foundation.

Data Format — What MLX Expects

MLX fine-tuning expects JSONL format — one JSON object per line, using the standard conversation structure:

{"messages": [
  {"role": "system",    "content": "You are a helpful assistant."},
  {"role": "user",      "content": "What is photosynthesis?"},
  {"role": "assistant", "content": "Photosynthesis is the process..."}
]}

Downloading and Preparing Data

from datasets import load_dataset
import json
import os

ds = load_dataset('teknium/OpenHermes-2.5', split='train')
ds = ds.shuffle(seed=42).select(range(100_000))  # random 100k-row subsample

# OpenHermes-2.5 stores ShareGPT-style turns in a 'conversations' column
# as {'from': 'system'|'human'|'gpt', 'value': ...} dicts.
ROLE_MAP = {'system': 'system', 'human': 'user', 'gpt': 'assistant'}

def to_chat_format(row):
    return {'messages': [
        {'role': ROLE_MAP[turn['from']], 'content': turn['value']}
        for turn in row['conversations']
    ]}

converted = [to_chat_format(r) for r in ds]
split = int(len(converted) * 0.95)  # 95/5 train/valid split

os.makedirs('data', exist_ok=True)
with open('data/train.jsonl', 'w') as f:
    for row in converted[:split]:
        f.write(json.dumps(row) + '\n')
with open('data/valid.jsonl', 'w') as f:
    for row in converted[split:]:
        f.write(json.dumps(row) + '\n')
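Before launching a run, it is worth sanity-checking the JSONL you just wrote, since a single malformed line can abort training. A minimal validator for the schema above (the helper name is my own):

```python
import json

VALID_ROLES = {'system', 'user', 'assistant'}

def validate_jsonl_lines(lines):
    """Return the row count if every line parses as the expected chat schema."""
    n = 0
    for n, line in enumerate(lines, 1):
        row = json.loads(line)  # raises ValueError on malformed JSON
        assert 'messages' in row, f'line {n}: missing "messages" key'
        for msg in row['messages']:
            assert msg['role'] in VALID_ROLES, f'line {n}: unknown role {msg["role"]!r}'
            assert isinstance(msg['content'], str), f'line {n}: content must be a string'
    return n

# Usage against the files written above:
# with open('data/train.jsonl') as f:
#     print(validate_jsonl_lines(f), 'rows OK')
```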

Part 04 Environment Setup

| Component | Requirement |
| --- | --- |
| Mac | M1 MacBook Pro — excellent |
| RAM | 16 GB unified memory — minimum viable for 8B LoRA |
| Storage | 50–80 GB free (model + data + checkpoints) |
| macOS | Ventura 13.3+ or Sonoma (required for MLX) |
| Python | 3.11 or 3.12 (3.13 has MLX compatibility issues) |

# 1. Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# 2. Install Python 3.11 via pyenv
brew install pyenv
pyenv install 3.11.9
pyenv global 3.11.9

# 3. Create virtual environment
python -m venv ~/llm-env
source ~/llm-env/bin/activate

# 4. Install MLX and fine-tuning tools
pip install mlx mlx-lm
pip install huggingface_hub datasets transformers
pip install numpy tqdm sentencepiece

# 5. Verify MLX sees Apple Silicon
python -c "import mlx.core as mx; print(mx.default_device())"
# Should print: Device(gpu, 0)
Why MLX? Apple's MLX framework is purpose-built for Apple Silicon's unified memory architecture. CPU and GPU share the same memory pool natively — this is why 16 GB on an M1 goes significantly further than 16 GB on a traditional discrete GPU setup.

Part 05 Measuring Intelligence: Baseline First

Before touching any training, establish your baseline benchmarks. This is non-negotiable — you cannot claim a model improved if you did not measure where it started.

Quick Sanity Test

from mlx_lm import load, generate

model, tokenizer = load('./models/llama-3.1-8b-instruct')

test_prompts = [
    'Explain quantum entanglement simply.',
    'Write a Python function to reverse a linked list.',
    'What is the difference between empathy and sympathy?',
    'If you have 3 apples and give away 1.5, how many remain?',
]

for prompt in test_prompts:
    messages = [{'role': 'user', 'content': prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = generate(model, tokenizer, prompt=text, max_tokens=300)
    print(f'PROMPT: {prompt}')
    print(f'RESPONSE: {response}')
    print('---')

Formal Benchmarks

pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=./models/llama-3.1-8b-instruct \
  --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
  --device mps \
  --output_path ./evals/baseline/

| Benchmark | What It Tests |
| --- | --- |
| MMLU (57 subjects) | Broad knowledge — how "educated" the model is |
| HellaSwag | Common-sense completion — human-like reasoning |
| ARC Challenge | Hard science questions from grade-school exams |
| TruthfulQA | Hallucination rate — does it make things up? |
| WinoGrande | Pronoun disambiguation — subtle language understanding |
| GSM8K | Grade-school math word problems — arithmetic reasoning |
⚠️
Record these numbers. Save your baseline scores before training begins. After fine-tuning, re-run the exact same benchmarks. Scores up = training helped. Scores down significantly = catastrophic forgetting — a common and fixable mistake.

Part 06 LoRA Fine-Tuning with MLX

What is LoRA?

LoRA (Low-Rank Adaptation) makes fine-tuning an 8B model feasible on 16 GB of RAM. Rather than updating all 8 billion weights — which would require 80–160 GB VRAM — LoRA freezes the base model entirely and trains small adapter matrices layered on top of the attention layers. At the end, those adapters merge back in seamlessly.

| Concept | LoRA Explanation |
| --- | --- |
| Full fine-tuning | Update all 8B parameters — needs 80–160 GB VRAM |
| LoRA rank 8 | Trains ~20–40M adapter params — needs ~12–14 GB |
| LoRA rank 16 | Slightly more capacity — use if rank 8 plateaus |
| What gets adapted | Attention layers: Q, K, V, O matrices |
| End result | Adapter file (~100–300 MB) that modifies base behavior |
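The parameter arithmetic explains why this fits in 16 GB. A sketch for a single 4096x4096 attention matrix, the hidden size of a Llama-class model, at rank 8:

```python
d, r = 4096, 8   # Llama-class attention hidden size, LoRA rank

full_params = d * d          # one frozen attention weight matrix W (d x d)
lora_params = d * r + r * d  # its trainable adapters B (d x r) and A (r x d)

print(f'{full_params:,} frozen params vs {lora_params:,} trained '
      f'({lora_params / full_params:.2%} per matrix)')

# The forward pass becomes y = x @ (W + B @ A).T: W never receives gradients,
# only B and A do. The total trainable count scales with how many layers
# and projections you choose to adapt.
```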

Running the Training

mlx_lm.lora \
  --model ./models/llama-3.1-8b-instruct \
  --train \
  --data ./data \
  --batch-size 2 \
  --lora-layers 16 \
  --iters 2000 \
  --learning-rate 1e-5 \
  --grad-checkpoint \
  --adapter-path ./adapters \
  --save-every 200 \
  --val-every 100 \
  --max-seq-length 2048

What to Watch During Training

| Signal | What It Means |
| --- | --- |
| Training loss decreasing | Model is learning — this is what you want |
| Validation loss decreasing too | Model is generalizing, not just memorizing — excellent |
| Val loss rising after a dip | Overfitting — stop here and use the earlier checkpoint |
| Loss flat / not moving | Learning rate too low, or data format is wrong |
| Out-of-memory error | Reduce batch size to 1, reduce max sequence length to 1024 |
| Loss explodes (NaN) | Learning rate too high — reduce to 5e-6 |
📊
Rule of thumb: For 100k examples at batch size 2, 2,000 steps covers roughly 4% of your data. That's enough for noticeable personality and style changes. For deeper knowledge transfer, aim for 5,000–10,000 steps.
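That rule of thumb is just steps times batch size divided by dataset size, which generalizes to any run you plan:

```python
def epoch_coverage(steps, batch_size, num_examples):
    """Fraction of the dataset seen after a given number of optimizer steps."""
    return steps * batch_size / num_examples

# The guide's setup: 100k examples at batch size 2.
for steps in (2_000, 5_000, 10_000):
    frac = epoch_coverage(steps, batch_size=2, num_examples=100_000)
    print(f'{steps:>6} steps -> {frac:.0%} of the data')
```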

Part 07 Validating That the Model Is Learning

This is where most people skip ahead — don't. Validation is the mechanism that distinguishes a model that actually improved from one that just memorized your training set or quietly degraded its general capability.

Test the Fine-Tuned Model

from mlx_lm import load, generate

# Load base model with your LoRA adapter active
model, tokenizer = load(
    './models/llama-3.1-8b-instruct',
    adapter_path='./adapters'
)

# Use the SAME prompts as your baseline — this is the comparison
test_prompts = [
    'Explain quantum entanglement simply.',
    'Write a Python function to reverse a linked list.',
    'What is the difference between empathy and sympathy?',
]

for prompt in test_prompts:
    messages = [{'role': 'user', 'content': prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = generate(model, tokenizer, prompt=text, max_tokens=400)
    print(f'PROMPT: {prompt[:50]}...')
    print(f'RESPONSE: {response}')
    print('===')

Human Quality Evaluation

Read 20 responses and score each dimension on a 1–5 scale. This is still the gold standard — benchmarks measure breadth, human review measures quality.

| Quality Dimension | What to Look For |
| --- | --- |
| Coherence | Does it stay on topic and follow a logical structure? |
| Naturalness | Does it sound like a person, or a robot listing facts? |
| Helpfulness | Does it actually answer what was asked? |
| Honesty | Does it acknowledge uncertainty rather than fabricating? |
| No hallucination | Does it invent facts or cite things that don't exist? |
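Tallying those reviews takes only a few lines. The scores below are illustrative placeholders for your own 1-to-5 ratings:

```python
from statistics import mean

DIMENSIONS = ('coherence', 'naturalness', 'helpfulness', 'honesty', 'no_hallucination')

# One dict per reviewed response; these scores are illustrative placeholders.
reviews = [
    {'coherence': 4, 'naturalness': 3, 'helpfulness': 5, 'honesty': 4, 'no_hallucination': 5},
    {'coherence': 5, 'naturalness': 4, 'helpfulness': 4, 'honesty': 5, 'no_hallucination': 4},
    {'coherence': 3, 'naturalness': 4, 'helpfulness': 4, 'honesty': 4, 'no_hallucination': 5},
]

for dim in DIMENSIONS:
    print(f'{dim:16} {mean(r[dim] for r in reviews):.2f} / 5')
```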

Re-Run Formal Benchmarks

# Merge adapters first — required for lm-eval
python -m mlx_lm.fuse \
  --model ./models/llama-3.1-8b-instruct \
  --adapter-path ./adapters \
  --save-path ./models/finetuned-merged

# Benchmark the merged model
lm_eval --model hf \
  --model_args pretrained=./models/finetuned-merged \
  --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
  --device mps \
  --output_path ./evals/finetuned/

# A drop of >3 points on MMLU = catastrophic forgetting
# Ideal result: same or better on knowledge, improved on conversation style
⚠️
Catastrophic Forgetting: If general benchmark scores drop significantly, the model was trained too long or on too narrow a dataset. The fix: blend general-knowledge data (OpenOrca, UltraChat) with your custom domain content. Aim for roughly 80% general, 20% custom.
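A quick way to spot-check for forgetting is to diff the two benchmark runs side by side. The numbers here are made-up placeholders, not real results; copy in your own scores from the lm-eval output:

```python
# Scores hand-copied from the baseline and fine-tuned lm-eval runs.
# These numbers are illustrative placeholders, not real results.
baseline  = {'mmlu': 66.2, 'hellaswag': 80.1, 'arc_challenge': 55.4}
finetuned = {'mmlu': 65.8, 'hellaswag': 80.6, 'arc_challenge': 55.0}

for task, before in baseline.items():
    delta = finetuned[task] - before
    verdict = 'possible catastrophic forgetting' if delta < -3 else 'ok'
    print(f'{task:15} {before:5.1f} -> {finetuned[task]:5.1f} ({delta:+.1f})  {verdict}')
```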

Part 08 Merging, Quantizing & GGUF

Step 1: Fuse LoRA into the Base Model

python -m mlx_lm.fuse \
  --model ./models/llama-3.1-8b-instruct \
  --adapter-path ./adapters \
  --save-path ./models/finetuned-merged \
  --de-quantize  # ensures full float16 output for best quality

Step 2: Quantize to 4-bit

Quantization compresses from ~15 GB (float16) down to ~4.5 GB (4-bit int). Inference becomes significantly faster and the model fits comfortably alongside other running applications.

python -m mlx_lm.convert \
  --hf-path ./models/finetuned-merged \
  --mlx-path ./models/finetuned-4bit \
  -q --q-bits 4 --q-group-size 64

Step 3: Convert to GGUF

# Build llama.cpp with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_METAL=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j $(sysctl -n hw.ncpu)

# Convert to GGUF (float16 first)
python convert_hf_to_gguf.py \
  ./models/finetuned-merged \
  --outfile ./models/my-llm-f16.gguf \
  --outtype f16

# Quantize to Q4_K_M — best quality/size tradeoff
./build/bin/llama-quantize \
  ./models/my-llm-f16.gguf \
  ./models/my-llm-q4_k_m.gguf \
  Q4_K_M

| Quant Type | Size / Quality Tradeoff |
| --- | --- |
| Q8_0 | ~8 GB — near-lossless quality, best for 16+ GB RAM |
| Q4_K_M | ~4.5 GB — excellent balance (recommended) |
| Q4_K_S | ~4.3 GB — slightly smaller, minimal quality difference |
| Q3_K_M | ~3.5 GB — noticeable quality drop, use only if necessary |
| Q2_K | ~2.8 GB — significant quality loss, not recommended |

Part 09 Deploying to Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./models/my-llm-q4_k_m.gguf

SYSTEM """
You are a highly intelligent, helpful assistant. You are direct,
thoughtful, and give thorough answers when needed but concise answers
when brevity is appropriate. You admit when you don't know something
and never fabricate facts.
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
EOF

# Register and run
ollama serve &
ollama create my-llm -f Modelfile
ollama run my-llm

Generation Parameter Reference

| Use Case | Settings |
| --- | --- |
| Natural conversation | temp=0.75, top_p=0.9, repeat_penalty=1.1 |
| Precise technical answers | temp=0.2, top_p=0.85, repeat_penalty=1.05 |
| Creative writing | temp=0.9, top_p=0.95, repeat_penalty=1.0 |
| Code generation | temp=0.1, top_p=0.95, repeat_penalty=1.05 |
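If you drive the model programmatically, those presets map onto the options field of Ollama's REST API (POST /api/generate on localhost:11434). A small helper, with preset names of my own choosing:

```python
import json

# Preset names are my own labels; the values mirror the table above.
PRESETS = {
    'conversation': {'temperature': 0.75, 'top_p': 0.9,  'repeat_penalty': 1.1},
    'technical':    {'temperature': 0.2,  'top_p': 0.85, 'repeat_penalty': 1.05},
    'creative':     {'temperature': 0.9,  'top_p': 0.95, 'repeat_penalty': 1.0},
    'code':         {'temperature': 0.1,  'top_p': 0.95, 'repeat_penalty': 1.05},
}

def ollama_payload(use_case, prompt, model='my-llm'):
    """Build the JSON body for POST http://localhost:11434/api/generate."""
    return {'model': model, 'prompt': prompt, 'stream': False,
            'options': PRESETS[use_case]}

print(json.dumps(ollama_payload('technical', 'Explain LoRA in two sentences.'), indent=2))
```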

Part 10 Troubleshooting & Iteration

| Problem | Fix |
| --- | --- |
| Model repeats itself | Increase repeat_penalty to 1.15–1.2 |
| Responses too short | Add examples of long, detailed responses to training data |
| Sounds robotic | Add more natural conversation data (ShareGPT, UltraChat) |
| Hallucinating facts | Add TruthfulQA-style data; lower temperature at inference |
| Benchmark scores dropped | Mix in general-knowledge data (OpenOrca) — 80/20 ratio |
| Out of memory during training | batch_size=1, max_seq_length=1024, grad_checkpoint=True |
| Loss not decreasing | Check data format; try increasing learning_rate to 2e-5 |
🔁
The iteration cycle: Change one variable at a time. Train for 500–1,000 steps as a quick test. Run your test prompts — does it feel better? If yes, run benchmarks to confirm scores held. If scores held or improved, continue. If not, revert the change.

Part 11 Complete Checklist

Environment
- macOS 13.3+ with Python 3.11 (pyenv), mlx and mlx-lm installed
- MLX verified on the GPU (prints Device(gpu, 0))
- Base model downloaded from Hugging Face

Data Pipeline
- Dataset curated and converted to JSONL chat format
- 95/5 train/valid split written to data/train.jsonl and data/valid.jsonl

Training
- Baseline benchmarks recorded with lm-evaluation-harness
- LoRA run completed; training and validation loss both decreased

Post-Training & Deployment
- Adapters fused into the base model and quantized to 4-bit
- Benchmarks re-run and compared against the baseline
- GGUF converted, Modelfile written, model registered and running in Ollama

Appendix Quick Reference

# Fine-tune
mlx_lm.lora --model ./models/llama-3.1-8b-instruct --train --data ./data ...

# Test with adapter active
mlx_lm.generate --model ./models/llama-3.1-8b-instruct --adapter-path ./adapters

# Merge adapter into base
python -m mlx_lm.fuse --model [base] --adapter-path ./adapters --save-path [out]

# Convert to GGUF
python llama.cpp/convert_hf_to_gguf.py [merged_model] --outfile model.gguf

# Quantize
./llama.cpp/build/bin/llama-quantize model-f16.gguf model-q4.gguf Q4_K_M

# Register in Ollama
ollama create my-llm -f Modelfile

# Run
ollama run my-llm

Useful Links

MLX Examples ↗  ·  llama.cpp ↗  ·  Hugging Face ↗  ·  Ollama Docs ↗  ·  lm-evaluation-harness ↗