Project 02

GPT from Scratch

A complete hands-on implementation of a GPT-style transformer LLM — trained entirely from scratch on Apple Silicon M1. Built as a learning project to understand how large language models actually work under the hood.

Machine: MacBook Pro M1
GPU: Apple Silicon MPS
Python: 3.12 (system framework)
Dataset: TinyStories (HuggingFace)
Model Size: 30,142,848 parameters
Key Libs: PyTorch · Transformers · Flask

Chapter 01 Environment Setup

Installing dependencies failed immediately — the yaml module was missing. The PyPI package is pyyaml but imports as yaml — a common gotcha.

```bash
ModuleNotFoundError: No module named 'yaml'
# Fix
pip3 install pyyaml
# Note: the package is 'pyyaml' but you import it as 'import yaml'
```

```bash
# Full dependency install
python3 -m pip install -q --break-system-packages \
  torch transformers datasets tokenizers sentencepiece \
  tqdm pyyaml numpy safetensors
```
💡
Your terminal's working directory points to an inode. Moving or trashing a folder in Finder does not change where your terminal thinks it is — stay there until you cd elsewhere.

Chapter 02 Bugs Found & Fixed

A full audit of all pipeline scripts revealed four bugs; all were fixed before training began.

Bug 1 — actual_vocab_size Undefined
train.py · line 392 · NameError

The variable was referenced in GPTConfig before it was assigned. The tokenizer needed to be loaded first to get the real vocab size.

```python
from tokenizers import Tokenizer as _Tokenizer
_tok = _Tokenizer.from_file(str(_tok_dir / 'tokenizer.json'))
actual_vocab_size = _tok.get_vocab_size()
config = GPTConfig(vocab_size=actual_vocab_size, ...)  # ✅ now defined
```
Bug 2 — OpenWebText Not Implemented
prepare_data.py · NotImplementedError

openwebtext was listed as a valid CLI option with no implementation. Added download_openwebtext() using streaming to avoid loading the full dataset into memory.
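The point of streaming is that records are consumed one at a time instead of materializing the whole corpus. A minimal sketch of that pattern, with a generator standing in for the HuggingFace streamed split (the real `download_openwebtext()` calls the `datasets` library with `streaming=True`; `streamed_docs` and `write_subset` here are illustrative names, not the project's code):

```python
# Sketch of the constant-memory streaming pattern behind download_openwebtext().
# A generator stands in for the streamed dataset so this runs without a download.
import os
import tempfile
from itertools import islice

def streamed_docs():
    """Stand-in for a lazily-streamed dataset split: yields one record at a time."""
    i = 0
    while True:
        yield {"text": f"document {i}"}
        i += 1

def write_subset(docs, path, max_docs):
    """Consume the stream record by record, so memory use stays constant."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in islice(docs, max_docs):
            f.write(doc["text"] + "\n")
            written += 1
    return written

out_path = os.path.join(tempfile.gettempdir(), "openwebtext_subset.txt")
print(write_subset(streamed_docs(), out_path, max_docs=1000))  # → 1000
```

Because `islice` pulls lazily from the generator, at most one record is in memory at a time, regardless of corpus size.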

Bug 3 — Hardcoded Token IDs
export_hf.py · wrong vocab size

bos_token_id and eos_token_id were hardcoded to GPT-2's value of 50256. Our custom tokenizer has a different vocab size.

```python
actual_vocab_size = cfg.get('vocab_size', 50257)
eos_token_id = actual_vocab_size - 1  # last token = endoftext
```
Bug 4 — Broken Fallback Imports
train_tokenizer.py · ImportError

The try block imported RobertaProcessing, which was never used, and the fallback import path was broken, raising an ImportError on a fresh install.

```python
# AFTER ✅
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
```

Chapter 03 Training

```bash
python3 run_pipeline.py --config configs/tiny.yaml
```
```yaml
# Model architecture (tiny config)
n_layer: 4       # transformer blocks
n_head:  4       # attention heads
n_embd:  256     # embedding dimension
epochs:  2
batch_size: 8
context_len: 512
```
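This config can be read with pyyaml, which is where the Chapter 01 gotcha bites: install `pyyaml`, import `yaml`. A minimal sketch (the pipeline's actual loader may differ):

```python
# Minimal sketch of loading the tiny config with pyyaml -- the pipeline's
# own loader may differ, but the install/import naming gotcha is the same.
import yaml

TINY_CONFIG = """
n_layer: 4
n_head: 4
n_embd: 256
epochs: 2
batch_size: 8
context_len: 512
"""

cfg = yaml.safe_load(TINY_CONFIG)
# The embedding dimension must split evenly across the attention heads.
assert cfg["n_embd"] % cfg["n_head"] == 0
print(cfg["n_embd"] // cfg["n_head"])  # → 64 dims per attention head
```

With `n_embd: 256` and `n_head: 4`, each head attends over a 64-dimensional slice of the embedding.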

Dataset — TinyStories

| Split | Size |
|---|---|
| Training set | 47,500,000 characters |
| Validation set | 2,500,000 characters |
| Tokenizer vocab | 20,712 tokens (custom BPE) |
| Total parameters | 30,142,848 |

Training Metrics

| Metric | Meaning |
|---|---|
| loss | Cross-entropy loss; lower is better, and below 2.0 generally means coherent text generation |
| ppl | Perplexity: how "surprised" the model is by the next token |
| lr | Current learning rate (decays over time via scheduler) |
| step/s | Training throughput in steps per second |
| eta | Estimated time remaining |
💡
A loss of 2.45 after only 10% of training is healthy — the model is learning fast. Loss below 2.0 generally means the model can generate coherent text.
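loss and ppl are two views of the same quantity: perplexity is the exponential of the cross-entropy loss (assuming the loss is in nats, as PyTorch's cross-entropy reports it). A quick check of the tip's figure:

```python
# Perplexity is exp(loss) when loss is cross-entropy in nats (PyTorch default).
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Convert a cross-entropy loss (nats) to perplexity."""
    return math.exp(cross_entropy_loss)

print(round(perplexity(2.45), 1))  # → 11.6
```

A perplexity of ~11.6 means the model is, on average, about as uncertain as a uniform choice among ~12 tokens, already far better than chance over a 20,712-token vocabulary.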

Running Individual Steps

```bash
# Run only step 3 (tokenizer training)
python3 run_pipeline.py --config configs/tiny.yaml --only-step 3

# Run steps 4–11 sequentially
for step in 4 5 6 7 8 9 10 11; do
  python3 run_pipeline.py --config configs/tiny.yaml --only-step $step
done
```

Chapter 04 Chat Interface

| File | Purpose |
|---|---|
| server.py | Flask server that loads the model and serves the UI |
| templates/index.html | Dark terminal-styled chat interface |
| finetune_from_logs.py | Retrains the model on logged conversations |
```bash
pip3 install flask --break-system-packages
cd llm_chat
python3 server.py --checkpoint ../BuildYourLLM_fixed/output/tiny/checkpoints/best.pt
# Open → http://localhost:5000
```

Conversation Log Format

```json
{
  "timestamp":  "2026-03-06T14:30:00",
  "session_id": "sess_abc123",
  "prompt":     "Once upon a time there was",
  "response":   "a little fox who lived in the forest..."
}
```
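Appending records in this shape is a one-liner per turn with the standard library. A sketch, assuming nothing about the server's internals (`log_turn` and the temp-file path are illustrative, not the project's actual code; only the field names come from the format above):

```python
# Illustrative JSONL logging helper -- log_turn and the path are assumptions;
# the record fields match the conversation log format shown above.
import json
import os
import tempfile
from datetime import datetime

def log_turn(path, session_id, prompt, response):
    """Append one conversation turn as a single JSON line (JSONL)."""
    record = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "session_id": session_id,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.gettempdir(), "conversations.jsonl")
log_turn(path, "sess_abc123", "Once upon a time there was",
         "a little fox who lived in the forest...")
```

One JSON object per line keeps the log trivially appendable and streamable, which is exactly what `finetune_from_logs.py` needs to consume later.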

Chapter 05 Continuous Learning

⚠️
The model does not learn while you talk to it — once training is complete, weights are frozen. This is true of all LLMs including ChatGPT. Production systems fine-tune periodically on collected data.
```bash
cd llm_chat
python3 finetune_from_logs.py \
    --checkpoint ../BuildYourLLM_fixed/output/tiny/checkpoints/best.pt \
    --logs logs/conversations.jsonl --epochs 1
python3 server.py --checkpoint .../finetuned_chat/finetuned_best.pt
```

Fine-tuning uses LoRA (Low-Rank Adaptation): only ~1% of the weights are updated while the base weights stay frozen, which avoids catastrophic forgetting.
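Back-of-envelope arithmetic shows why the trainable fraction is so small. The rank and the number of adapted matrices per layer below are illustrative assumptions, not this project's actual LoRA settings:

```python
def lora_trainable_fraction(d_model, n_layer, total_params,
                            rank=8, matrices_per_layer=2):
    """Fraction of weights trained when each adapted d x d matrix gets two
    low-rank factors A (d x r) and B (r x d). rank and matrices_per_layer
    are illustrative assumptions, not this project's exact settings."""
    per_matrix = rank * (d_model + d_model)        # params in A plus B
    lora_params = per_matrix * matrices_per_layer * n_layer
    return lora_params / total_params

# Tiny config: d_model=256, 4 layers, ~30M total parameters
frac = lora_trainable_fraction(256, 4, 30_142_848)
print(f"{frac:.2%} of weights trained")  # a small fraction of 1%
```

Even with generous assumptions, the adapters amount to tens of thousands of parameters against 30 million frozen ones, which is what makes periodic fine-tuning on a laptop practical.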

| Approach | How It Works | Best For |
|---|---|---|
| Periodic LoRA | Fine-tune on conversation logs every N days | Most practical; low cost |
| Instruction tuning | Train on curated prompt/response JSONL pairs | Teaching Q&A behaviour |
| Full fine-tuning | Update all weights on new data | Maximum quality; high cost |

Chapter 06 Capabilities & Limitations

📖
This is a base language model trained on children's stories. It generates story-style text continuations, not answers to questions. It was built as a learning exercise.
| ✅ Can Do | ❌ Cannot Do |
|---|---|
| Continue a story from a prompt | Answer factual questions accurately |
| Generate fluent, grammatical text | Follow complex instructions |
| Produce children's story-style prose | Hold a coherent multi-turn conversation |
| Run entirely on-device (M1 Mac) | Reason or plan like a large model |

Quick Reference

```bash
# Full pipeline
python3 run_pipeline.py --config configs/tiny.yaml

# Skip training, use existing checkpoint
python3 run_pipeline.py --config configs/tiny.yaml --skip-train

# Start chat server
cd llm_chat && python3 server.py \
    --checkpoint ../BuildYourLLM_fixed/output/tiny/checkpoints/best.pt

# Fine-tune on logged conversations
python3 finetune_from_logs.py \
    --checkpoint .../best.pt \
    --logs logs/conversations.jsonl --epochs 1
```