Project 02

GPT from Scratch

A complete hands-on implementation of a GPT-style transformer LLM — trained entirely from scratch on Apple Silicon M1. Built as a learning project to understand how large language models actually work under the hood.

Machine: MacBook Pro M1
GPU: Apple Silicon MPS
Python: 3.12 (system framework)
Dataset: TinyStories (HuggingFace)
Model Size: 30,142,848 parameters
Key Libs: PyTorch · Transformers · Flask

Chapter 01 Environment Setup

Installing dependencies failed immediately — the yaml module was missing. The PyPI package is pyyaml but imports as yaml — a common gotcha.

```bash
ModuleNotFoundError: No module named 'yaml'
# Fix
pip3 install pyyaml
# Note: the package is 'pyyaml' but you import it as 'import yaml'
```

```bash
# Full dependency install
python3 -m pip install -q --break-system-packages \
  torch transformers datasets tokenizers sentencepiece \
  tqdm pyyaml numpy safetensors
```
💡
Your terminal's working directory points to an inode. Moving or trashing a folder in Finder does not change where your terminal thinks it is — stay there until you cd elsewhere.

Chapter 02 Bugs Found & Fixed

A full audit of all pipeline scripts revealed four bugs; all were fixed before training began.

Bug 1 — actual_vocab_size Undefined
train.py · line 392 · NameError

The variable was referenced in GPTConfig before it was assigned. The tokenizer needed to be loaded first to get the real vocab size.

```python
from tokenizers import Tokenizer as _Tokenizer
_tok = _Tokenizer.from_file(str(_tok_dir / 'tokenizer.json'))
actual_vocab_size = _tok.get_vocab_size()
config = GPTConfig(vocab_size=actual_vocab_size, ...)  # ✅ now defined
```
Bug 2 — OpenWebText Not Implemented
prepare_data.py · NotImplementedError

openwebtext was listed as a valid CLI option with no implementation. Added download_openwebtext() using streaming to avoid loading the full dataset into memory.
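The point of streaming is that records are consumed one at a time instead of materializing the whole corpus. A minimal sketch of that pattern, with a generator standing in for the HuggingFace streamed split (the real `download_openwebtext()` calls the `datasets` library with `streaming=True`; `streamed_docs` and `write_subset` here are illustrative names, not the project's code):

```python
# Sketch of the constant-memory streaming pattern behind download_openwebtext().
# A generator stands in for the streamed dataset so this runs without a download.
import os
import tempfile
from itertools import islice

def streamed_docs():
    """Stand-in for a lazily-streamed dataset split: yields one record at a time."""
    i = 0
    while True:
        yield {"text": f"document {i}"}
        i += 1

def write_subset(docs, path, max_docs):
    """Consume the stream record by record, so memory use stays constant."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in islice(docs, max_docs):
            f.write(doc["text"] + "\n")
            written += 1
    return written

out_path = os.path.join(tempfile.gettempdir(), "openwebtext_subset.txt")
print(write_subset(streamed_docs(), out_path, max_docs=1000))  # → 1000
```

Because `islice` pulls lazily from the generator, at most one record is in memory at a time, regardless of corpus size.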

Bug 3 — Hardcoded Token IDs
export_hf.py · wrong vocab size

bos_token_id and eos_token_id were hardcoded to GPT-2's value of 50256. Our custom tokenizer has a different vocab size.

```python
actual_vocab_size = cfg.get('vocab_size', 50257)
eos_token_id = actual_vocab_size - 1  # last token = endoftext
```
Bug 4 — Broken Fallback Imports
train_tokenizer.py · ImportError

The try block imported RobertaProcessing, which was never used, and the fallback import path was broken, raising an ImportError on a fresh install.

```python
# AFTER ✅
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
```

Chapter 03 Training

```bash
python3 run_pipeline.py --config configs/tiny.yaml
```
```yaml
# Model architecture (tiny config)
n_layer: 4       # transformer blocks
n_head:  4       # attention heads
n_embd:  256     # embedding dimension
epochs:  2
batch_size: 8
context_len: 512
```
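This config can be read with pyyaml, which is where the Chapter 01 gotcha bites: install `pyyaml`, import `yaml`. A minimal sketch (the pipeline's actual loader may differ):

```python
# Minimal sketch of loading the tiny config with pyyaml -- the pipeline's
# own loader may differ, but the install/import naming gotcha is the same.
import yaml

TINY_CONFIG = """
n_layer: 4
n_head: 4
n_embd: 256
epochs: 2
batch_size: 8
context_len: 512
"""

cfg = yaml.safe_load(TINY_CONFIG)
# The embedding dimension must split evenly across the attention heads.
assert cfg["n_embd"] % cfg["n_head"] == 0
print(cfg["n_embd"] // cfg["n_head"])  # → 64 dims per attention head
```

With `n_embd: 256` and `n_head: 4`, each head attends over a 64-dimensional slice of the embedding.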

Dataset — TinyStories

| Split | Size |
|---|---|
| Training set | 47,500,000 characters |
| Validation set | 2,500,000 characters |
| Tokenizer vocab | 20,712 tokens (custom BPE) |
| Total parameters | 30,142,848 |

Training Metrics

| Metric | Meaning |
|---|---|
| loss | Cross-entropy loss; lower is better, and below 2.0 generally means coherent text generation |
| ppl | Perplexity: how "surprised" the model is by the next token |
| lr | Current learning rate (decays over time via scheduler) |
| step/s | Training throughput in steps per second |
| eta | Estimated time remaining |
💡
A loss of 2.45 after only 10% of training is healthy — the model is learning fast. Loss below 2.0 generally means the model can generate coherent text.
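loss and ppl are two views of the same quantity: perplexity is the exponential of the cross-entropy loss (assuming the loss is in nats, as PyTorch's cross-entropy reports it). A quick check of the tip's figure:

```python
# Perplexity is exp(loss) when loss is cross-entropy in nats (PyTorch default).
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Convert a cross-entropy loss (nats) to perplexity."""
    return math.exp(cross_entropy_loss)

print(round(perplexity(2.45), 1))  # → 11.6
```

A perplexity of ~11.6 means the model is, on average, about as uncertain as a uniform choice among ~12 tokens, already far better than chance over a 20,712-token vocabulary.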

Running Individual Steps

```bash
# Run only step 3 (tokenizer training)
python3 run_pipeline.py --config configs/tiny.yaml --only-step 3

# Run steps 4–11 sequentially
for step in 4 5 6 7 8 9 10 11; do
  python3 run_pipeline.py --config configs/tiny.yaml --only-step $step
done
```

Chapter 04 Chat Interface

| File | Purpose |
|---|---|
| server.py | Flask server that loads the model and serves the UI |
| templates/index.html | Dark terminal-styled chat interface |
| finetune_from_logs.py | Retrains the model on logged conversations |
```bash
pip3 install flask --break-system-packages
cd llm_chat
python3 server.py --checkpoint ../BuildYourLLM_fixed/output/tiny/checkpoints/best.pt
# Open → http://localhost:5000
```

Conversation Log Format

```json
{
  "timestamp":  "2026-03-06T14:30:00",
  "session_id": "sess_abc123",
  "prompt":     "Once upon a time there was",
  "response":   "a little fox who lived in the forest..."
}
```
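Appending records in this shape is a one-liner per turn with the standard library. A sketch, assuming nothing about the server's internals (`log_turn` and the temp-file path are illustrative, not the project's actual code; only the field names come from the format above):

```python
# Illustrative JSONL logging helper -- log_turn and the path are assumptions;
# the record fields match the conversation log format shown above.
import json
import os
import tempfile
from datetime import datetime

def log_turn(path, session_id, prompt, response):
    """Append one conversation turn as a single JSON line (JSONL)."""
    record = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "session_id": session_id,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.gettempdir(), "conversations.jsonl")
log_turn(path, "sess_abc123", "Once upon a time there was",
         "a little fox who lived in the forest...")
```

One JSON object per line keeps the log trivially appendable and streamable, which is exactly what `finetune_from_logs.py` needs to consume later.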

Chapter 05 Continuous Learning

⚠️
The model does not learn while you talk to it — once training is complete, weights are frozen. This is true of all LLMs including ChatGPT. Production systems fine-tune periodically on collected data.
```bash
cd llm_chat
python3 finetune_from_logs.py \
    --checkpoint ../BuildYourLLM_fixed/output/tiny/checkpoints/best.pt \
    --logs logs/conversations.jsonl --epochs 1
python3 server.py --checkpoint .../finetuned_chat/finetuned_best.pt
```

Fine-tuning uses LoRA (Low-Rank Adaptation): only ~1% of the weights are updated while the base weights stay frozen, which avoids catastrophic forgetting.
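Back-of-envelope arithmetic shows why the trainable fraction is so small. The rank and the number of adapted matrices per layer below are illustrative assumptions, not this project's actual LoRA settings:

```python
def lora_trainable_fraction(d_model, n_layer, total_params,
                            rank=8, matrices_per_layer=2):
    """Fraction of weights trained when each adapted d x d matrix gets two
    low-rank factors A (d x r) and B (r x d). rank and matrices_per_layer
    are illustrative assumptions, not this project's exact settings."""
    per_matrix = rank * (d_model + d_model)        # params in A plus B
    lora_params = per_matrix * matrices_per_layer * n_layer
    return lora_params / total_params

# Tiny config: d_model=256, 4 layers, ~30M total parameters
frac = lora_trainable_fraction(256, 4, 30_142_848)
print(f"{frac:.2%} of weights trained")  # a small fraction of 1%
```

Even with generous assumptions, the adapters amount to tens of thousands of parameters against 30 million frozen ones, which is what makes periodic fine-tuning on a laptop practical.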

| Approach | How It Works | Best For |
|---|---|---|
| Periodic LoRA | Fine-tune on conversation logs every N days | Most practical; low cost |
| Instruction tuning | Train on curated prompt/response JSONL pairs | Teaching Q&A behaviour |
| Full fine-tuning | Update all weights on new data | Maximum quality; high cost |

Chapter 06 Capabilities & Limitations

📖
This is a base language model trained on children's stories. It generates story-style text continuations, not answers to questions. It was built as a learning exercise.
| ✅ Can Do | ❌ Cannot Do |
|---|---|
| Continue a story from a prompt | Answer factual questions accurately |
| Generate fluent, grammatical text | Follow complex instructions |
| Produce children's story-style prose | Hold a coherent multi-turn conversation |
| Run entirely on-device (M1 Mac) | Reason or plan like a large model |

Quick Reference

```bash
# Full pipeline
python3 run_pipeline.py --config configs/tiny.yaml

# Skip training, use existing checkpoint
python3 run_pipeline.py --config configs/tiny.yaml --skip-train

# Start chat server
cd llm_chat && python3 server.py \
    --checkpoint ../BuildYourLLM_fixed/output/tiny/checkpoints/best.pt

# Fine-tune on logged conversations
python3 finetune_from_logs.py \
    --checkpoint .../best.pt \
    --logs logs/conversations.jsonl --epochs 1
```