In 2026, building a large language model from the ground up is not the goal — it's a distraction. The real engineering challenge is knowing how to take a world-class pretrained model, shape it to your needs, and validate that it's genuinely learning. This guide documents that full pipeline, end to end, on an M1 Mac.
Training a capable large language model from zero requires hundreds of GPUs running continuously for months, terabytes of curated training data, and infrastructure budgets that rival a small company's annual operating costs. That was the challenge Meta, Google, and Mistral tackled so that the rest of us don't have to.
What we do instead is fine-tuning: we take an open-weight model that already understands language, reasoning, and the world — then we specialize it. Using a technique called LoRA (Low-Rank Adaptation), we inject a small set of trainable adapter layers on top of a frozen base model. The result is a model that reflects your data, your domain, and your voice, built in hours rather than months, on hardware you already own.
This document covers the complete pipeline: choosing a base model, curating training data, running LoRA fine-tuning with Apple's MLX library, and — critically — validating that the model is actually learning and not just memorizing or degrading. The methodology here is what production ML teams use at scale, condensed for a 16 GB M1 Mac.
When people say "8 GB model," they almost always mean 8 billion parameters — not 8 gigabytes of storage. Here is how those numbers actually map to real-world hardware requirements:
| Term | What It Means |
|---|---|
| 8B parameters | ~16 GB on disk at float16/bfloat16 — the precision most open models ship in (~32 GB at float32) |
| 8B at 4-bit quantization | ~4–5 GB on disk — fits in 16 GB unified RAM |
| Training 8B from scratch | Needs 100+ GB VRAM — not feasible on 16 GB |
| Fine-tuning 8B with LoRA | Needs ~12–14 GB — just barely feasible on 16 GB M1 |
| "Medium" LLM consensus | 1B–13B parameters is widely considered medium-sized |
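The sizes in that table fall out of simple arithmetic: bytes = parameters × bits per parameter ÷ 8. A quick sketch (the 4.5-bit figure is a rough allowance for 4-bit weights plus quantization overhead such as group scales):

```python
def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate on-disk weight size: params * bits / 8, in GB."""
    return n_params * bits_per_param / 8 / 1e9

print(model_size_gb(8e9, 16))   # float16: 16.0 GB
print(model_size_gb(8e9, 32))   # float32: 32.0 GB
print(model_size_gb(8e9, 4.5))  # ~4-bit with scale overhead: 4.5 GB
```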
From a blank terminal to a custom model running in Ollama, the pipeline has seven distinct phases. Each one builds on the last:

1. Choose a base model
2. Curate training data
3. Set up the environment
4. Establish baseline benchmarks
5. Fine-tune with LoRA
6. Validate the result
7. Quantize and deploy
You never train a large model from scratch in 2026. You start from a pretrained base and customize it — think of it as tuning a precision engine rather than casting one from raw steel.
| Model | Notes |
|---|---|
| Llama 3.1 8B (Meta) | Best all-around for human conversation — widely supported, huge community |
| Mistral 7B v0.3 | Fast, efficient, excellent instruction following |
| Phi-3 Mini 3.8B | Tiny but surprisingly capable — ideal when RAM is tight |
| Gemma 2 9B (Google) | Strong reasoning, fits in 16 GB with 4-bit quantization |
| Qwen2.5 7B (Alibaba) | Excellent multilingual and code understanding |
```bash
# Install Hugging Face CLI
pip install huggingface_hub

# Login with your HF token
huggingface-cli login

# Download Llama 3.1 8B Instruct
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir ./models/llama-3.1-8b-instruct \
  --local-dir-use-symlinks False
```
Data quality is the single most important variable in how capable your fine-tuned model becomes. Excellent training code on bad data still produces a bad model — there are no shortcuts here.
For a model that sounds like a knowledgeable, natural human, you want: conversation pairs (user message + assistant response), Q&A datasets with high-quality answers, reasoning chains showing step-by-step thinking, domain-specific knowledge for your target use case, and diversity across topics, styles, and lengths.
| Dataset | Why Use It |
|---|---|
| OpenHermes-2.5 | 1M high-quality GPT-4 conversations — top recommendation |
| UltraChat 200k | Diverse multi-turn dialogue, very human-like |
| ShareGPT | Real user ChatGPT conversations — authentic patterns |
| Alpaca (52k) | Classic instruction dataset — solid starting point |
| OpenOrca | Reasoning-heavy — makes the model measurably smarter |
| LIMA (1,000 examples) | Small but exceptional quality — proves quality beats quantity |
MLX fine-tuning expects JSONL format — one JSON object per line, using the standard conversation structure:
```jsonl
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is photosynthesis?"},
  {"role": "assistant", "content": "Photosynthesis is the process..."}
]}
```

(Shown pretty-printed for readability — in the actual file, each object occupies a single line.)
```python
from datasets import load_dataset
import json

ds = load_dataset('teknium/OpenHermes-2.5', split='train')
ds = ds.shuffle(seed=42).select(range(100000))  # subsample to 100k rows

# OpenHermes-2.5 ships in ShareGPT format: each row has a 'conversations'
# list of {'from': 'system' | 'human' | 'gpt', 'value': ...} turns.
ROLE_MAP = {'system': 'system', 'human': 'user', 'gpt': 'assistant'}

def to_chat_format(row):
    return {'messages': [
        {'role': ROLE_MAP[turn['from']], 'content': turn['value']}
        for turn in row['conversations'] if turn['from'] in ROLE_MAP
    ]}

converted = [to_chat_format(r) for r in ds]
split = int(len(converted) * 0.95)  # 95/5 train/valid split

with open('data/train.jsonl', 'w') as f:
    for row in converted[:split]:
        f.write(json.dumps(row) + '\n')
with open('data/valid.jsonl', 'w') as f:
    for row in converted[split:]:
        f.write(json.dumps(row) + '\n')
```
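Malformed training files tend to fail deep inside the training loop, so a quick sanity pass is worth the few seconds. A minimal validator for the files produced above (path names are just the ones used in this guide):

```python
import json

VALID_ROLES = {'system', 'user', 'assistant'}

def validate_jsonl(path):
    """Return the number of examples, raising on any schema violation."""
    count = 0
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            row = json.loads(line)  # raises on malformed JSON
            msgs = row.get('messages')
            assert isinstance(msgs, list) and msgs, f'line {lineno}: empty messages'
            for msg in msgs:
                assert msg.get('role') in VALID_ROLES, f'line {lineno}: bad role'
                assert isinstance(msg.get('content'), str), f'line {lineno}: bad content'
            count += 1
    return count

# e.g. print(validate_jsonl('data/train.jsonl'), 'training examples OK')
```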
| Component | Requirement |
|---|---|
| Mac | Any Apple Silicon Mac (M1 or later) — MLX requires Apple Silicon, not Intel |
| RAM | 16 GB unified memory — minimum viable for 8B LoRA |
| Storage | 50–80 GB free (model + data + checkpoints) |
| macOS | Ventura 13.3+ or Sonoma (required for MLX) |
| Python | 3.11 or 3.12 (3.13 has MLX compatibility issues) |
```bash
# 1. Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# 2. Install Python 3.11 via pyenv
brew install pyenv
pyenv install 3.11.9
pyenv global 3.11.9

# 3. Create virtual environment
python -m venv ~/llm-env
source ~/llm-env/bin/activate

# 4. Install MLX and fine-tuning tools
pip install mlx mlx-lm
pip install huggingface_hub datasets transformers
pip install numpy tqdm sentencepiece

# 5. Verify MLX sees Apple Silicon
python -c "import mlx.core as mx; print(mx.default_device())"
# Should print: Device(gpu, 0)
```
Before touching any training, establish your baseline benchmarks. This is non-negotiable — you cannot claim a model improved if you did not measure where it started.
```python
from mlx_lm import load, generate

model, tokenizer = load('./models/llama-3.1-8b-instruct')

test_prompts = [
    'Explain quantum entanglement simply.',
    'Write a Python function to reverse a linked list.',
    'What is the difference between empathy and sympathy?',
    'If you have 3 apples and give away 1.5, how many remain?',
]

for prompt in test_prompts:
    messages = [{'role': 'user', 'content': prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    response = generate(model, tokenizer, prompt=text, max_tokens=300)
    print(f'PROMPT: {prompt}')
    print(f'RESPONSE: {response}')
    print('---')
```
```bash
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=./models/llama-3.1-8b-instruct \
  --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
  --device mps \
  --output_path ./evals/baseline/
```
| Benchmark | What It Tests |
|---|---|
| MMLU (57 subjects) | Broad knowledge — how "educated" the model is |
| HellaSwag | Common sense completion — human-like reasoning |
| ARC Challenge | Hard science questions from grade-school exams |
| TruthfulQA | Hallucination rate — does it make things up? |
| WinoGrande | Pronoun disambiguation — subtle language understanding |
| GSM8K | Grade school math word problems — arithmetic reasoning |
LoRA (Low-Rank Adaptation) makes fine-tuning an 8B model feasible on 16 GB of RAM. Rather than updating all 8 billion weights — which would require 80–160 GB VRAM — LoRA freezes the base model entirely and trains small adapter matrices layered on top of the attention layers. At the end, those adapters merge back in seamlessly.
| Concept | LoRA Explanation |
|---|---|
| Full fine-tuning | Update all 8B parameters — needs 80–160 GB VRAM |
| LoRA rank 8 | Trains ~20–40M adapter params — needs ~12–14 GB |
| LoRA rank 16 | Slightly more capacity — use if rank 8 plateaus |
| What gets adapted | Attention layers: Q, K, V, O matrices |
| End result | Adapter file (~100–300 MB) that modifies base behavior |
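The mechanics behind those numbers fit in a few lines of NumPy: the frozen weight `W` stays untouched, and the adapter contributes a rank-r update `B @ A` whose parameter count is 2·d·r instead of d². A sketch with illustrative dimensions (not Llama's actual shapes):

```python
import numpy as np

d, r = 4096, 8  # hidden size, LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight, never updated
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

x = rng.standard_normal(d)
y = W @ x + B @ (A @ x)  # base output plus low-rank correction

# Zero-initialized B makes the adapter a no-op at the start of training,
# so the fine-tune begins exactly at the base model's behavior.
assert np.allclose(y, W @ x)

print(f'full delta-W params: {d * d:,}')      # 16,777,216 per matrix
print(f'LoRA params (r=8):   {2 * d * r:,}')  # 65,536 per matrix
```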
```bash
mlx_lm.lora \
  --model ./models/llama-3.1-8b-instruct \
  --train \
  --data ./data \
  --batch-size 2 \
  --lora-layers 16 \
  --iters 2000 \
  --learning-rate 1e-5 \
  --grad-checkpoint \
  --adapter-path ./adapters \
  --save-every 200 \
  --val-every 100 \
  --max-seq-length 2048
```
| Signal | What It Means |
|---|---|
| Training loss decreasing | Model is learning — this is what you want |
| Validation loss decreasing too | Model is generalizing, not just memorizing — excellent |
| Val loss rising after a dip | Overfitting — stop here and use the earlier checkpoint |
| Loss flat / not moving | Learning rate too low, or data format is wrong |
| Out of memory error | Reduce batch_size to 1, reduce max_seq_length to 1024 |
| Loss explodes (NaN) | Learning rate too high — reduce to 5e-6 |
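The "val loss rising after a dip" rule can be automated. A small helper sketch — it assumes validation losses are logged every `--val-every` iterations; in practice you would round the result to the nearest saved checkpoint:

```python
def best_checkpoint_iter(val_losses, val_every=100):
    """Given validation losses logged every `val_every` iterations,
    return the iteration with the lowest validation loss."""
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return (best_idx + 1) * val_every

# Val loss dips at iteration 600, then climbs: classic overfitting.
losses = [2.10, 1.85, 1.70, 1.62, 1.58, 1.55, 1.59, 1.66, 1.74]
print(best_checkpoint_iter(losses))  # -> 600
```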
This is where most people skip ahead — don't. Validation is the mechanism that distinguishes a model that actually improved from one that just memorized your training set or quietly degraded its general capability.
```python
from mlx_lm import load, generate

# Load base model with your LoRA adapter active
model, tokenizer = load(
    './models/llama-3.1-8b-instruct',
    adapter_path='./adapters'
)

# Use the SAME prompts as your baseline — this is the comparison
test_prompts = [
    'Explain quantum entanglement simply.',
    'Write a Python function to reverse a linked list.',
    'What is the difference between empathy and sympathy?',
]

for prompt in test_prompts:
    messages = [{'role': 'user', 'content': prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    response = generate(model, tokenizer, prompt=text, max_tokens=400)
    print(f'PROMPT: {prompt[:50]}...')
    print(f'RESPONSE: {response}')
    print('===')
```
Read 20 responses and score each dimension on a 1–5 scale. This is still the gold standard — benchmarks measure breadth, human review measures quality.
| Quality Dimension | What to Look For |
|---|---|
| Coherence | Does it stay on topic and follow a logical structure? |
| Naturalness | Does it sound like a person, or a robot listing facts? |
| Helpfulness | Does it actually answer what was asked? |
| Honesty | Does it acknowledge uncertainty rather than fabricating? |
| No hallucination | Does it invent facts or cite things that don't exist? |
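To keep 20-response reviews comparable across runs, it helps to record scores in a fixed structure and average per dimension. A minimal sketch (dimension names mirror the rubric above; the score data is hypothetical):

```python
from statistics import mean

DIMENSIONS = ('coherence', 'naturalness', 'helpfulness', 'honesty', 'no_hallucination')

def summarize_review(scores):
    """Average each 1-5 rubric dimension across reviewed responses."""
    return {dim: round(mean(s[dim] for s in scores), 2) for dim in DIMENSIONS}

# Two hypothetical reviewed responses:
scores = [
    {'coherence': 5, 'naturalness': 4, 'helpfulness': 5, 'honesty': 5, 'no_hallucination': 4},
    {'coherence': 4, 'naturalness': 4, 'helpfulness': 3, 'honesty': 5, 'no_hallucination': 5},
]
print(summarize_review(scores))  # per-dimension averages, e.g. coherence 4.5
```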
```bash
# Merge adapters first — required for lm-eval
python -m mlx_lm.fuse \
  --model ./models/llama-3.1-8b-instruct \
  --adapter-path ./adapters \
  --save-path ./models/finetuned-merged

# Benchmark the merged model
lm_eval --model hf \
  --model_args pretrained=./models/finetuned-merged \
  --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
  --device mps \
  --output_path ./evals/finetuned/

# A drop of >3 points on MMLU = catastrophic forgetting
# Ideal result: same or better on knowledge, improved on conversation style
```
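Once both eval runs exist, the comparison is mechanical. A sketch that flags regressions beyond the 3-point threshold (the score dictionaries here are hypothetical — in practice you would read them out of lm-eval's JSON output files):

```python
def regression_report(baseline, finetuned, threshold=3.0):
    """Flag any benchmark whose score dropped more than `threshold` points."""
    report = {}
    for task, base in baseline.items():
        delta = round(finetuned[task] - base, 2)
        report[task] = {'delta': delta, 'regressed': delta < -threshold}
    return report

# Hypothetical scores for illustration:
baseline  = {'mmlu': 68.2, 'hellaswag': 79.1, 'arc_challenge': 55.4}
finetuned = {'mmlu': 63.9, 'hellaswag': 80.0, 'arc_challenge': 55.1}
report = regression_report(baseline, finetuned)
print(report['mmlu'])  # -> {'delta': -4.3, 'regressed': True}  (catastrophic forgetting)
```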
```bash
python -m mlx_lm.fuse \
  --model ./models/llama-3.1-8b-instruct \
  --adapter-path ./adapters \
  --save-path ./models/finetuned-merged \
  --de-quantize  # ensures full float16 output for best quality
```
Quantization compresses from ~15 GB (float16) down to ~4.5 GB (4-bit int). Inference becomes significantly faster and the model fits comfortably alongside other running applications.
```bash
python -m mlx_lm.convert \
  --hf-path ./models/finetuned-merged \
  --mlx-path ./models/finetuned-4bit \
  -q --q-bits 4 --q-group-size 64
```
```bash
# Build llama.cpp (Metal is enabled by default on Apple Silicon)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j $(sysctl -n hw.ncpu)

# Convert to GGUF (float16 first)
python convert_hf_to_gguf.py \
  ./models/finetuned-merged \
  --outfile ./models/my-llm-f16.gguf \
  --outtype f16

# Quantize to Q4_K_M — best quality/size tradeoff
./build/bin/llama-quantize \
  ./models/my-llm-f16.gguf \
  ./models/my-llm-q4_k_m.gguf \
  Q4_K_M
```
| Quant Type | Size / Quality Tradeoff |
|---|---|
| Q8_0 | ~8 GB — near-lossless quality, best for 16+ GB RAM |
| Q4_K_M | ~4.5 GB — excellent balance (recommended) |
| Q4_K_S | ~4.3 GB — slightly smaller, minimal quality difference |
| Q3_K_M | ~3.5 GB — noticeable quality drop, use only if necessary |
| Q2_K | ~2.8 GB — significant quality loss, not recommended |
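Choosing from that table is a budget calculation: the quantized file must fit in unified memory with room left for the OS, the KV cache, and whatever else is running. A rough sketch (sizes taken from the table above; the 6 GB headroom figure is an assumption — tune it to your workload):

```python
# Approximate file sizes in GB for an 8B model, from the table above.
QUANTS = {'Q8_0': 8.0, 'Q4_K_M': 4.5, 'Q4_K_S': 4.3, 'Q3_K_M': 3.5, 'Q2_K': 2.8}

def best_quant_for(ram_gb, headroom_gb=6.0):
    """Highest-quality quant that fits after reserving headroom for the
    OS, KV cache, and other applications (headroom is a rough guess)."""
    budget = ram_gb - headroom_gb
    fitting = {name: size for name, size in QUANTS.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else None

print(best_quant_for(16))  # -> Q8_0 fits comfortably on a 16 GB machine
print(best_quant_for(8))   # -> None: 8 GB leaves no room for an 8B model
```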
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./models/my-llm-q4_k_m.gguf
SYSTEM """
You are a highly intelligent, helpful assistant. You are direct,
thoughtful, and give thorough answers when needed but concise answers
when brevity is appropriate. You admit when you don't know something
and never fabricate facts.
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
EOF

# Register and run
ollama serve &
ollama create my-llm -f Modelfile
ollama run my-llm
```
| Use Case | Settings |
|---|---|
| Natural conversation | temp=0.75, top_p=0.9, repeat_penalty=1.1 |
| Precise technical answers | temp=0.2, top_p=0.85, repeat_penalty=1.05 |
| Creative writing | temp=0.9, top_p=0.95, repeat_penalty=1.0 |
| Code generation | temp=0.1, top_p=0.95, repeat_penalty=1.05 |
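What those knobs do is easiest to see in a reference implementation of temperature plus nucleus (top-p) sampling — a NumPy sketch of the standard algorithm, not Ollama's actual code:

```python
import numpy as np

def sample_token(logits, temperature=0.7, top_p=0.9, seed=None):
    """Temperature scaling followed by nucleus (top-p) filtering."""
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())  # stable softmax
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

logits = [2.0, 1.0, 0.5, -1.0]
# Near-zero temperature collapses onto the argmax: deterministic output.
print(sample_token(logits, temperature=0.01))  # -> 0
```

Lower temperature sharpens the distribution toward the argmax (precise, repetitive); lower top-p discards the unlikely tail entirely — which is why the "code generation" row pairs a tiny temperature with a fairly high top-p.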
| Problem | Fix |
|---|---|
| Model repeats itself | Increase repeat_penalty to 1.15–1.2 |
| Responses too short | Add examples of long, detailed responses to training data |
| Sounds robotic | Add more natural conversation data (ShareGPT, UltraChat) |
| Hallucinating facts | Add TruthfulQA-style data; lower temperature at inference |
| Benchmark scores dropped | Mix in general knowledge data (OpenOrca) — 80/20 ratio |
| Out of memory during training | batch_size=1, max_seq_length=1024, grad_checkpoint=True |
| Loss not decreasing | Check data format; try increasing learning_rate to 2e-5 |
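The 80/20 mixing advice in the last rows can be implemented as a simple blend-and-shuffle. A sketch (the function and ratio handling are illustrative; rows are the `{'messages': ...}` dicts from the data-prep step):

```python
import random

def mix_datasets(domain_rows, general_rows, domain_frac=0.8, seed=42):
    """Blend domain data with general-knowledge data at roughly
    domain_frac : (1 - domain_frac), then shuffle the result."""
    n_general = round(len(domain_rows) * (1 - domain_frac) / domain_frac)
    rng = random.Random(seed)
    mixed = list(domain_rows) + rng.sample(list(general_rows),
                                           min(n_general, len(general_rows)))
    rng.shuffle(mixed)
    return mixed

domain = [{'messages': f'domain-{i}'} for i in range(80)]
general = [{'messages': f'general-{i}'} for i in range(200)]
mixed = mix_datasets(domain, general)
print(len(mixed))  # -> 100: 80 domain rows + 20 general rows
```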
```bash
# Fine-tune
mlx_lm.lora --model ./models/llama-3.1-8b-instruct --train --data ./data ...

# Test with adapter active
mlx_lm.generate --model ./models/llama-3.1-8b-instruct --adapter-path ./adapters

# Merge adapter into base
python -m mlx_lm.fuse --model [base] --adapter-path ./adapters --save-path [out]

# Convert to GGUF
python llama.cpp/convert_hf_to_gguf.py [merged_model] --outfile model.gguf

# Quantize
./llama.cpp/build/bin/llama-quantize model-f16.gguf model-q4.gguf Q4_K_M

# Register in Ollama
ollama create my-llm -f Modelfile

# Run
ollama run my-llm
```
MLX Examples · llama.cpp · Hugging Face · Ollama Docs · lm-evaluation-harness