Project 04

Fine-Tuning LLMs on Apple Silicon

In 2026, building a large language model from the ground up is not the goal — it's a distraction. The real engineering challenge is knowing how to take a world-class pretrained model, shape it to your needs, and validate that it's genuinely learning. This guide documents that full pipeline, end to end, on an M1 Mac.

🧠 Why We Fine-Tune in 2026 — Not Build From Scratch

Training a capable large language model from zero requires hundreds of GPUs running continuously for months, terabytes of curated training data, and infrastructure budgets that rival a small company's annual operating costs. That was the challenge Meta, Google, and Mistral tackled so that the rest of us don't have to.

What we do instead is fine-tuning: we take an open-weight model that already understands language, reasoning, and the world — then we specialize it. Using a technique called LoRA (Low-Rank Adaptation), we inject a small set of trainable adapter layers on top of a frozen base model. The result is a model that reflects your data, your domain, and your voice, built in hours rather than months, on hardware you already own.

This document covers the complete pipeline: choosing a base model, curating training data, running LoRA fine-tuning with Apple's MLX library, and — critically — validating that the model is actually learning and not just memorizing or degrading. The methodology here is what production ML teams use at scale, condensed for a 16 GB M1 Mac.

Hardware: M1 MacBook Pro · 16 GB unified memory
Framework: Apple MLX + mlx-lm
Base Model: Llama 3.1 8B Instruct
Method: LoRA fine-tuning
Output Format: GGUF → Ollama
Python: 3.11 (pyenv)

Part 00 Understanding Model Size

When people say "8B model," they almost always mean 8 billion parameters — not 8 gigabytes of storage. Here is how those numbers actually map to real-world hardware requirements:

| Term | What It Means |
| --- | --- |
| 8B parameters | ~15 GB on disk at float16 precision (roughly double that at float32) |
| 8B at 4-bit quantization | ~4–5 GB on disk — fits in 16 GB unified RAM |
| Training 8B from scratch | Needs 100+ GB VRAM — not feasible on 16 GB |
| Fine-tuning 8B with LoRA | Needs ~12–14 GB — just barely feasible on a 16 GB M1 |
| "Medium" LLM consensus | 1B–13B parameters is widely considered medium-sized |
💡
The recommendation: On an M1 Mac with 16 GB, do not train from scratch. Start from a pretrained open-weight model like Llama 3.1 8B or Mistral 7B and fine-tune it. Training from scratch requires months on hundreds of GPUs — fine-tuning takes hours on hardware you already own.
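The arithmetic behind those numbers is simple enough to check yourself: parameter count times bytes per parameter. Real checkpoint files run slightly larger, since embeddings and some layers often stay at higher precision.

```python
params = 8e9  # "8B" means 8 billion parameters

# Bytes per parameter at each precision; 4-bit packs two parameters per byte.
for precision, bytes_per_param in [('float32', 4), ('float16', 2), ('4-bit', 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f'{precision:>8}: ~{gb:.0f} GB of raw weights')
```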

Part 01 The Full Pipeline

From a blank terminal to a custom model running in Ollama, the pipeline has seven distinct phases. Each one builds on the last:

01 Choose a Base Model — Pick a pretrained open-weight model: Llama 3, Mistral 7B, Phi-3, or Gemma 2.
02 Gather Training Data — Curate conversation pairs, Q&A datasets, reasoning chains, and domain-specific content.
03 Fine-Tune with LoRA + MLX — Run LoRA training on your M1 Mac using Apple's MLX library, optimized for unified memory.
04 Evaluate Intelligence — Benchmark with lm-evaluation-harness before and after training. Save your baseline — it matters.
05 Merge & Quantize — Fuse LoRA adapter weights back into the base model, then quantize to 4-bit for efficient inference.
06 Convert to GGUF — Package the model in GGUF format, the standard Ollama and llama.cpp expect.
07 Register in Ollama — Write a Modelfile, set your system prompt and generation parameters, and run locally.

Part 02 Choosing Your Base Model

You never train a large model from scratch in 2026. You start from a pretrained base and customize it — think of it as tuning a precision engine rather than casting one from raw steel.

| Model | Notes |
| --- | --- |
| Llama 3.1 8B (Meta) | Best all-around for human conversation — widely supported, huge community |
| Mistral 7B v0.3 | Fast, efficient, excellent instruction following |
| Phi-3 Mini 3.8B | Tiny but surprisingly capable — ideal when RAM is tight |
| Gemma 2 9B (Google) | Strong reasoning, fits in 16 GB with 4-bit quantization |
| Qwen2.5 7B (Alibaba) | Excellent multilingual and code understanding |
🎯
Best choice for M1 16 GB: Llama 3.1 8B Instruct from Meta. Trained for human conversation by default, massive fine-tuning community, works natively with MLX on Apple Silicon.

Downloading the Base Model

# Install Hugging Face CLI
pip install huggingface_hub

# Login with your HF token
huggingface-cli login

# Download Llama 3.1 8B Instruct
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir ./models/llama-3.1-8b-instruct \
  --local-dir-use-symlinks False

Part 03 Training Data: What Makes a Model Smart

Data quality is the single most important variable in how capable your fine-tuned model becomes. Excellent training code on bad data still produces a bad model — there are no shortcuts here.

What Kind of Data You Need

For a model that sounds like a knowledgeable, natural human, you want: conversation pairs (user message + assistant response), Q&A datasets with high-quality answers, reasoning chains showing step-by-step thinking, domain-specific knowledge for your target use case, and diversity across topics, styles, and lengths.

| Dataset | Why Use It |
| --- | --- |
| OpenHermes-2.5 | 1M high-quality GPT-4 conversations — top recommendation |
| UltraChat 200k | Diverse multi-turn dialogue, very human-like |
| ShareGPT | Real user ChatGPT conversations — authentic patterns |
| Alpaca (52k) | Classic instruction dataset — solid starting point |
| OpenOrca | Reasoning-heavy — makes the model measurably smarter |
| LIMA (1,000 examples) | Small but exceptional quality — proves quality beats quantity |
📖
The LIMA insight (2023): 1,000 carefully curated examples outperform 52,000 mediocre ones. Start with OpenHermes-2.5 filtered down to 100k high-quality rows and you have an excellent foundation.

Data Format — What MLX Expects

MLX fine-tuning expects JSONL format — one JSON object per line, using the standard conversation structure:

{"messages": [
  {"role": "system",    "content": "You are a helpful assistant."},
  {"role": "user",      "content": "What is photosynthesis?"},
  {"role": "assistant", "content": "Photosynthesis is the process..."}
]}

Downloading and Preparing Data

from datasets import load_dataset
import json
import os

ds = load_dataset('teknium/OpenHermes-2.5', split='train')
ds = ds.shuffle(seed=42).select(range(100_000))  # random 100k-row subsample

# OpenHermes-2.5 stores ShareGPT-style turns in a 'conversations' column
# as {'from': 'system'|'human'|'gpt', 'value': ...} dicts.
ROLE_MAP = {'system': 'system', 'human': 'user', 'gpt': 'assistant'}

def to_chat_format(row):
    return {'messages': [
        {'role': ROLE_MAP[turn['from']], 'content': turn['value']}
        for turn in row['conversations']
    ]}

converted = [to_chat_format(r) for r in ds]
split = int(len(converted) * 0.95)  # 95/5 train/valid split

os.makedirs('data', exist_ok=True)
with open('data/train.jsonl', 'w') as f:
    for row in converted[:split]:
        f.write(json.dumps(row) + '\n')
with open('data/valid.jsonl', 'w') as f:
    for row in converted[split:]:
        f.write(json.dumps(row) + '\n')
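Before launching a run, it is worth sanity-checking the JSONL you just wrote, since a single malformed line can abort training. A minimal validator for the schema above (the helper name is my own):

```python
import json

VALID_ROLES = {'system', 'user', 'assistant'}

def validate_jsonl_lines(lines):
    """Return the row count if every line parses as the expected chat schema."""
    n = 0
    for n, line in enumerate(lines, 1):
        row = json.loads(line)  # raises ValueError on malformed JSON
        assert 'messages' in row, f'line {n}: missing "messages" key'
        for msg in row['messages']:
            assert msg['role'] in VALID_ROLES, f'line {n}: unknown role {msg["role"]!r}'
            assert isinstance(msg['content'], str), f'line {n}: content must be a string'
    return n

# Usage against the files written above:
# with open('data/train.jsonl') as f:
#     print(validate_jsonl_lines(f), 'rows OK')
```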

Part 04 Environment Setup

| Component | Requirement |
| --- | --- |
| Mac | M1 MacBook Pro — excellent |
| RAM | 16 GB unified memory — minimum viable for 8B LoRA |
| Storage | 50–80 GB free (model + data + checkpoints) |
| macOS | Ventura 13.3+ or Sonoma (required for MLX) |
| Python | 3.11 or 3.12 (3.13 has MLX compatibility issues) |

# 1. Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# 2. Install Python 3.11 via pyenv
brew install pyenv
pyenv install 3.11.9
pyenv global 3.11.9

# 3. Create virtual environment
python -m venv ~/llm-env
source ~/llm-env/bin/activate

# 4. Install MLX and fine-tuning tools
pip install mlx mlx-lm
pip install huggingface_hub datasets transformers
pip install numpy tqdm sentencepiece

# 5. Verify MLX sees Apple Silicon
python -c "import mlx.core as mx; print(mx.default_device())"
# Should print: Device(gpu, 0)
Why MLX? Apple's MLX framework is purpose-built for Apple Silicon's unified memory architecture. CPU and GPU share the same memory pool natively — this is why 16 GB on an M1 goes significantly further than 16 GB on a traditional discrete GPU setup.

Part 05 Measuring Intelligence: Baseline First

Before touching any training, establish your baseline benchmarks. This is non-negotiable — you cannot claim a model improved if you did not measure where it started.

Quick Sanity Test

from mlx_lm import load, generate

model, tokenizer = load('./models/llama-3.1-8b-instruct')

test_prompts = [
    'Explain quantum entanglement simply.',
    'Write a Python function to reverse a linked list.',
    'What is the difference between empathy and sympathy?',
    'If you have 3 apples and give away 1.5, how many remain?',
]

for prompt in test_prompts:
    messages = [{'role': 'user', 'content': prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = generate(model, tokenizer, prompt=text, max_tokens=300)
    print(f'PROMPT: {prompt}')
    print(f'RESPONSE: {response}')
    print('---')

Formal Benchmarks

pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=./models/llama-3.1-8b-instruct \
  --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
  --device mps \
  --output_path ./evals/baseline/

| Benchmark | What It Tests |
| --- | --- |
| MMLU (57 subjects) | Broad knowledge — how "educated" the model is |
| HellaSwag | Common-sense completion — human-like reasoning |
| ARC Challenge | Hard science questions from grade-school exams |
| TruthfulQA | Hallucination rate — does it make things up? |
| WinoGrande | Pronoun disambiguation — subtle language understanding |
| GSM8K | Grade-school math word problems — arithmetic reasoning |
⚠️
Record these numbers. Save your baseline scores before training begins. After fine-tuning, re-run the exact same benchmarks. Scores up = training helped. Scores down significantly = catastrophic forgetting — a common and fixable mistake.

Part 06 LoRA Fine-Tuning with MLX

What is LoRA?

LoRA (Low-Rank Adaptation) makes fine-tuning an 8B model feasible on 16 GB of RAM. Rather than updating all 8 billion weights — which would require 80–160 GB VRAM — LoRA freezes the base model entirely and trains small adapter matrices layered on top of the attention layers. At the end, those adapters merge back in seamlessly.

| Concept | LoRA Explanation |
| --- | --- |
| Full fine-tuning | Update all 8B parameters — needs 80–160 GB VRAM |
| LoRA rank 8 | Trains ~20–40M adapter params — needs ~12–14 GB |
| LoRA rank 16 | Slightly more capacity — use if rank 8 plateaus |
| What gets adapted | Attention layers: Q, K, V, O matrices |
| End result | Adapter file (~100–300 MB) that modifies base behavior |
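The parameter arithmetic explains why this fits in 16 GB. A sketch for a single 4096x4096 attention matrix, the hidden size of a Llama-class model, at rank 8:

```python
d, r = 4096, 8   # Llama-class attention hidden size, LoRA rank

full_params = d * d          # one frozen attention weight matrix W (d x d)
lora_params = d * r + r * d  # its trainable adapters B (d x r) and A (r x d)

print(f'{full_params:,} frozen params vs {lora_params:,} trained '
      f'({lora_params / full_params:.2%} per matrix)')

# The forward pass becomes y = x @ (W + B @ A).T: W never receives gradients,
# only B and A do. The total trainable count scales with how many layers
# and projections you choose to adapt.
```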

Running the Training

mlx_lm.lora \
  --model ./models/llama-3.1-8b-instruct \
  --train \
  --data ./data \
  --batch-size 2 \
  --lora-layers 16 \
  --iters 2000 \
  --learning-rate 1e-5 \
  --grad-checkpoint \
  --adapter-path ./adapters \
  --save-every 200 \
  --val-every 100 \
  --max-seq-length 2048

What to Watch During Training

| Signal | What It Means |
| --- | --- |
| Training loss decreasing | Model is learning — this is what you want |
| Validation loss decreasing too | Model is generalizing, not just memorizing — excellent |
| Val loss rising after a dip | Overfitting — stop here and use the earlier checkpoint |
| Loss flat / not moving | Learning rate too low, or data format is wrong |
| Out-of-memory error | Reduce batch size to 1, reduce max sequence length to 1024 |
| Loss explodes (NaN) | Learning rate too high — reduce to 5e-6 |
📊
Rule of thumb: For 100k examples at batch size 2, 2,000 steps covers roughly 4% of your data. That's enough for noticeable personality and style changes. For deeper knowledge transfer, aim for 5,000–10,000 steps.
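That rule of thumb is just steps times batch size divided by dataset size, which generalizes to any run you plan:

```python
def epoch_coverage(steps, batch_size, num_examples):
    """Fraction of the dataset seen after a given number of optimizer steps."""
    return steps * batch_size / num_examples

# The guide's setup: 100k examples at batch size 2.
for steps in (2_000, 5_000, 10_000):
    frac = epoch_coverage(steps, batch_size=2, num_examples=100_000)
    print(f'{steps:>6} steps -> {frac:.0%} of the data')
```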

Part 07 Validating That the Model Is Learning

This is where most people skip ahead — don't. Validation is the mechanism that distinguishes a model that actually improved from one that just memorized your training set or quietly degraded its general capability.

Test the Fine-Tuned Model

from mlx_lm import load, generate

# Load base model with your LoRA adapter active
model, tokenizer = load(
    './models/llama-3.1-8b-instruct',
    adapter_path='./adapters'
)

# Use the SAME prompts as your baseline — this is the comparison
test_prompts = [
    'Explain quantum entanglement simply.',
    'Write a Python function to reverse a linked list.',
    'What is the difference between empathy and sympathy?',
]

for prompt in test_prompts:
    messages = [{'role': 'user', 'content': prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = generate(model, tokenizer, prompt=text, max_tokens=400)
    print(f'PROMPT: {prompt[:50]}...')
    print(f'RESPONSE: {response}')
    print('===')

Human Quality Evaluation

Read 20 responses and score each dimension on a 1–5 scale. This is still the gold standard — benchmarks measure breadth, human review measures quality.

| Quality Dimension | What to Look For |
| --- | --- |
| Coherence | Does it stay on topic and follow a logical structure? |
| Naturalness | Does it sound like a person, or a robot listing facts? |
| Helpfulness | Does it actually answer what was asked? |
| Honesty | Does it acknowledge uncertainty rather than fabricating? |
| No hallucination | Does it invent facts or cite things that don't exist? |
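Tallying those reviews takes only a few lines. The scores below are illustrative placeholders for your own 1-to-5 ratings:

```python
from statistics import mean

DIMENSIONS = ('coherence', 'naturalness', 'helpfulness', 'honesty', 'no_hallucination')

# One dict per reviewed response; these scores are illustrative placeholders.
reviews = [
    {'coherence': 4, 'naturalness': 3, 'helpfulness': 5, 'honesty': 4, 'no_hallucination': 5},
    {'coherence': 5, 'naturalness': 4, 'helpfulness': 4, 'honesty': 5, 'no_hallucination': 4},
    {'coherence': 3, 'naturalness': 4, 'helpfulness': 4, 'honesty': 4, 'no_hallucination': 5},
]

for dim in DIMENSIONS:
    print(f'{dim:16} {mean(r[dim] for r in reviews):.2f} / 5')
```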

Re-Run Formal Benchmarks

# Merge adapters first — required for lm-eval
python -m mlx_lm.fuse \
  --model ./models/llama-3.1-8b-instruct \
  --adapter-path ./adapters \
  --save-path ./models/finetuned-merged

# Benchmark the merged model
lm_eval --model hf \
  --model_args pretrained=./models/finetuned-merged \
  --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
  --device mps \
  --output_path ./evals/finetuned/

# A drop of >3 points on MMLU = catastrophic forgetting
# Ideal result: same or better on knowledge, improved on conversation style
⚠️
Catastrophic Forgetting: If general benchmark scores drop significantly, the model was trained too long or on too narrow a dataset. The fix: blend general-knowledge data (OpenOrca, UltraChat) with your custom domain content. Aim for roughly 80% general, 20% custom.
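A quick way to spot-check for forgetting is to diff the two benchmark runs side by side. The numbers here are made-up placeholders, not real results; copy in your own scores from the lm-eval output:

```python
# Scores hand-copied from the baseline and fine-tuned lm-eval runs.
# These numbers are illustrative placeholders, not real results.
baseline  = {'mmlu': 66.2, 'hellaswag': 80.1, 'arc_challenge': 55.4}
finetuned = {'mmlu': 65.8, 'hellaswag': 80.6, 'arc_challenge': 55.0}

for task, before in baseline.items():
    delta = finetuned[task] - before
    verdict = 'possible catastrophic forgetting' if delta < -3 else 'ok'
    print(f'{task:15} {before:5.1f} -> {finetuned[task]:5.1f} ({delta:+.1f})  {verdict}')
```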

Part 08 Merging, Quantizing & GGUF

Step 1: Fuse LoRA into the Base Model

python -m mlx_lm.fuse \
  --model ./models/llama-3.1-8b-instruct \
  --adapter-path ./adapters \
  --save-path ./models/finetuned-merged \
  --de-quantize  # ensures full float16 output for best quality

Step 2: Quantize to 4-bit

Quantization compresses from ~15 GB (float16) down to ~4.5 GB (4-bit int). Inference becomes significantly faster and the model fits comfortably alongside other running applications.

python -m mlx_lm.convert \
  --hf-path ./models/finetuned-merged \
  --mlx-path ./models/finetuned-4bit \
  -q --q-bits 4 --q-group-size 64

Step 3: Convert to GGUF

# Build llama.cpp with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_METAL=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j $(sysctl -n hw.ncpu)

# Convert to GGUF (float16 first)
python convert_hf_to_gguf.py \
  ./models/finetuned-merged \
  --outfile ./models/my-llm-f16.gguf \
  --outtype f16

# Quantize to Q4_K_M — best quality/size tradeoff
./build/bin/llama-quantize \
  ./models/my-llm-f16.gguf \
  ./models/my-llm-q4_k_m.gguf \
  Q4_K_M

| Quant Type | Size / Quality Tradeoff |
| --- | --- |
| Q8_0 | ~8 GB — near-lossless quality, best for 16+ GB RAM |
| Q4_K_M | ~4.5 GB — excellent balance (recommended) |
| Q4_K_S | ~4.3 GB — slightly smaller, minimal quality difference |
| Q3_K_M | ~3.5 GB — noticeable quality drop, use only if necessary |
| Q2_K | ~2.8 GB — significant quality loss, not recommended |

Part 09 Deploying to Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./models/my-llm-q4_k_m.gguf

SYSTEM """
You are a highly intelligent, helpful assistant. You are direct,
thoughtful, and give thorough answers when needed but concise answers
when brevity is appropriate. You admit when you don't know something
and never fabricate facts.
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
EOF

# Register and run
ollama serve &
ollama create my-llm -f Modelfile
ollama run my-llm

Generation Parameter Reference

| Use Case | Settings |
| --- | --- |
| Natural conversation | temp=0.75, top_p=0.9, repeat_penalty=1.1 |
| Precise technical answers | temp=0.2, top_p=0.85, repeat_penalty=1.05 |
| Creative writing | temp=0.9, top_p=0.95, repeat_penalty=1.0 |
| Code generation | temp=0.1, top_p=0.95, repeat_penalty=1.05 |
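If you drive the model programmatically, those presets map onto the options field of Ollama's REST API (POST /api/generate on localhost:11434). A small helper, with preset names of my own choosing:

```python
import json

# Preset names are my own labels; the values mirror the table above.
PRESETS = {
    'conversation': {'temperature': 0.75, 'top_p': 0.9,  'repeat_penalty': 1.1},
    'technical':    {'temperature': 0.2,  'top_p': 0.85, 'repeat_penalty': 1.05},
    'creative':     {'temperature': 0.9,  'top_p': 0.95, 'repeat_penalty': 1.0},
    'code':         {'temperature': 0.1,  'top_p': 0.95, 'repeat_penalty': 1.05},
}

def ollama_payload(use_case, prompt, model='my-llm'):
    """Build the JSON body for POST http://localhost:11434/api/generate."""
    return {'model': model, 'prompt': prompt, 'stream': False,
            'options': PRESETS[use_case]}

print(json.dumps(ollama_payload('technical', 'Explain LoRA in two sentences.'), indent=2))
```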

Part 10 Troubleshooting & Iteration

| Problem | Fix |
| --- | --- |
| Model repeats itself | Increase repeat_penalty to 1.15–1.2 |
| Responses too short | Add examples of long, detailed responses to training data |
| Sounds robotic | Add more natural conversation data (ShareGPT, UltraChat) |
| Hallucinating facts | Add TruthfulQA-style data; lower temperature at inference |
| Benchmark scores dropped | Mix in general-knowledge data (OpenOrca) — 80/20 ratio |
| Out of memory during training | batch_size=1, max_seq_length=1024, grad_checkpoint=True |
| Loss not decreasing | Check data format; try increasing learning_rate to 2e-5 |
🔁
The iteration cycle: Change one variable at a time. Train for 500–1,000 steps as a quick test. Run your test prompts — does it feel better? If yes, run benchmarks to confirm scores held. If scores held or improved, continue. If not, revert the change.

Part 11 Complete Checklist

Environment
- macOS 13.3+ with Python 3.11 (pyenv), mlx and mlx-lm installed
- MLX verified on the GPU (prints Device(gpu, 0))
- Base model downloaded from Hugging Face

Data Pipeline
- Dataset curated and converted to JSONL chat format
- 95/5 train/valid split written to data/train.jsonl and data/valid.jsonl

Training
- Baseline benchmarks recorded with lm-evaluation-harness
- LoRA run completed; training and validation loss both decreased

Post-Training & Deployment
- Adapters fused into the base model and quantized to 4-bit
- Benchmarks re-run and compared against the baseline
- GGUF converted, Modelfile written, model registered and running in Ollama

Appendix Quick Reference

# Fine-tune
mlx_lm.lora --model ./models/llama-3.1-8b-instruct --train --data ./data ...

# Test with adapter active
mlx_lm.generate --model ./models/llama-3.1-8b-instruct --adapter-path ./adapters

# Merge adapter into base
python -m mlx_lm.fuse --model [base] --adapter-path ./adapters --save-path [out]

# Convert to GGUF
python llama.cpp/convert_hf_to_gguf.py [merged_model] --outfile model.gguf

# Quantize
./llama.cpp/build/bin/llama-quantize model-f16.gguf model-q4.gguf Q4_K_M

# Register in Ollama
ollama create my-llm -f Modelfile

# Run
ollama run my-llm

Useful Links

MLX Examples ↗  ·  llama.cpp ↗  ·  Hugging Face ↗  ·  Ollama Docs ↗  ·  lm-evaluation-harness ↗