Project 05 — LoRA SMS Classifier

Fine-Tuning a Spam Detector with LoRA

We take a raw Mistral-7B base model that has never seen an SMS message and fine-tune it — using LoRA on Apple MLX — into a production-ready spam classifier. This document covers the full workflow: data prep, training, evaluation, stress testing, and deployment via model fusion. Every decision is documented and every result is reproducible.

🎯 What We Are Building and Why

Our goal is simple: classify an SMS message as spam or not spam. But the engineering challenge underneath that goal is everything. We start with a 134-million-parameter language model that knows how to speak English but has no concept of what a scam text looks like. Through LoRA (Low-Rank Adaptation), we inject a small set of trainable adapter layers — less than 0.5% of the total parameters — and teach the model a new skill without overwriting what it already knows.

Along the way we will hit every real-world ML failure mode: loss explosions, overfitting, majority-class bias, and format mismatches. We will diagnose each one, fix it scientifically, and validate the fix. The model sitting in models/retrained/fused at the end of this guide correctly identifies both classic spam and sophisticated phishing attacks — while leaving friendly messages alone.

📦

Try it yourself — Apple Silicon required

Download the Starter Project

Everything you need to run the full pipeline on your own Mac. One command — ./runme.sh — downloads the base model, preps the balanced dataset, trains the LoRA adapters, fuses the model, and prints a before/after comparison report. The whole thing completes in under 90 seconds.

✓ Scripts included ✓ Balanced dataset builder ✓ One-command pipeline Python 3.12 Apple MLX ~300 KB zip

⬇ Download lora_sms_classifier.zip View on GitHub ↗

Requirements: Mac with Apple Silicon (M1 or later) · Python 3.12 · ~300 MB free disk for the base model download · internet connection for the first run only. Model weights are not included in the zip — runme.sh fetches them automatically from Hugging Face.

Hardware

M1 MacBook Pro · 16 GB

Framework

Apple MLX + mlx-lm

Base Model

Mistral-7B-Instruct-v0.3 4-bit

Method

LoRA Fine-Tuning

Dataset

HuggingFace sms_spam

Python

3.12

Overview The Full Training Pipeline

The entire workflow runs as a numbered script sequence. Each script hands off to the next, and every artifact — the data, the adapters, the fused model, the eval metrics — is written to disk so we can inspect and reproduce any step in isolation.

Data Preparation — 03_download_data.py

Pull the sms_spam dataset from Hugging Face and write a balanced 50/50 JSONL training file. A locked random seed (42) ensures every run shuffles the same way.

LoRA Fine-Tuning — 04_retrain.py

Configure and launch the MLX LoRA trainer. Monitor validation loss in real time. The script auto-fuses adapters into the base model on completion.

Evaluation — 05_retrained_eval.py

Run the retrained model against 10 fixed test messages and record loss, perplexity, and accuracy. Compare directly to the baseline run.

05+

Stress Test — extended eval set

Add two "adversarial" messages — a phishing URL and a friendly poker question — to verify the model generalizes rather than memorizes.

Fusion — mlx_lm fuse

Bake the trained adapters permanently into the base model weights to create a self-contained, portable production folder.

Interactive Test — 09_chat_test.py

Load the fused model and send it real SMS messages interactively to verify live classification performance.

Step 03 Data Preparation

Downloading the Dataset

We use the sms_spam dataset from Hugging Face. The load_dataset function pulls the full corpus — 5,574 messages — and caches it locally. We request the full "train" split so we can manually slice our own training and validation sets rather than relying on Hugging Face's default splits.

from datasets import load_dataset

ds = load_dataset("sms_spam", split="train")
# → 747 spam messages, 4827 ham messagespython

The Majority-Class Problem — and How We Fix It

The raw dataset is roughly 85% ham / 15% spam. If we train on that imbalance, the model discovers a shortcut: say "not spam" for every message and score 85% accuracy without learning anything useful. We call this Majority-Class Bias, and the fix is a balanced 50/50 split.

⚠️

Critical insight: A model trained on imbalanced data will always default to the majority class under ambiguity. Balancing the dataset forces the model to actually learn the distinguishing features of spam — it can no longer "cheat" by guessing the safe bet.

spam = [row for row in ds if row["label"] == 1]   # 747 total
ham  = [row for row in ds if row["label"] == 0]   # 4827 total

num_train_per_class = 400
num_val_per_class   = 80

train_rows = spam[:400] + ham[:400]   # 400 spam + 400 ham
val_rows   = spam[400:480] + ham[400:480]

random.seed(42)
random.shuffle(train_rows)
random.shuffle(val_rows)

# Each row → JSONL formatted prompt
# "Classify the following SMS message as spam or not spam.\nMessage: TEXT\nClassification: LABEL"python

The Locked Seed

Setting random.seed(42) before every shuffle means the data split is identical on every run. When we change a hyperparameter and retrain, the only variable that changes is our configuration — the training data and validation data stay exactly the same. This is what makes results comparable across experiments.

💡

Pro tip: The prompt format we use during training must be reproduced exactly at inference time. Every evaluation script and every chat interface must use "Classify the following SMS message as spam or not spam.\nMessage: {text}\nClassification:" — a single missing space or period can prevent the model from outputting a label it has been trained to produce.

Step 04 The Training Configuration

The LoRA Config Dictionary

We define all training parameters in a single Python dictionary. This translates directly into the JSON file that MLX requires, and it makes every experiment parameter visible and diffable — we never tweak a command-line flag without recording it here first.

LORA_CONFIG = {
    "model"          : str(BASELINE_DIR),      # models/baseline
    "train"          : True,                   # "Learning Mode" — updates weights
    "data"           : str(DATA_DIR),           # data/train.jsonl + data/valid.jsonl
    "seed"           : 42,                     # locked shuffle
    "lora_layers"    : 16,                     # how many transformer layers we adapt
    "batch_size"     : 4,                      # messages per gradient update
    "iters"          : 400,                    # total training steps
    "learning_rate"  : 5e-5,                   # "gentleness" of each weight update
    "adapter_path"   : str(RETRAINED_DIR / "adapters"),
}python

Understanding Each Parameter

Parameter	Value	Why
`lora_layers`	16	We perform "brain surgery" on the last 16 transformer layers. More layers = more capacity to distinguish spam from ham. Fewer layers is faster but may plateau early.
`learning_rate`	`5e-5`	Think of this as how hard we turn the steering wheel per update. At `1e-3` the loss exploded. At `1e-5` the model was too quiet to learn spam. `5e-5` is the sweet spot for this task.
`iters`	400	With 800 training examples and `batch_size=4`, one full pass through the data takes 200 steps. We run 2 full passes (400 steps) to allow the model to consolidate what it learned.
`batch_size`	4	4 messages per gradient update. Larger batches are more stable but require more memory. At 16 GB unified RAM, 4 is comfortable.
`seed`	42	Same as the data prep seed — ensures internal MLX shuffles are also reproducible.

Understanding Loss and the Three Failure Modes

Loss is the number we watch during training. It represents the model's total prediction error — lower is better. There are three distinct patterns to recognize:

Pattern	What You See	The Fix
✅ Healthy Learning	Loss starts at ~4.0 and steadily drops toward 2.0–2.5 over training	Nothing — let it run
💥 Loss Explosion	Loss spikes to 5, 10, NaN — the math has crashed	Lower the learning rate (try halving it)
🧱 Stall / Plateau	Loss barely moves or stops improving after 100–150 steps	Increase layers, increase learning rate slightly, or stop early and use the best checkpoint

📊

Validation loss is the one that matters. Training loss measures how well the model is memorizing your examples. Validation loss measures how well it performs on examples it has never seen — that is the real test. If training loss keeps dropping but validation loss rises, the model is overfitting and we need to stop early.

Launching the Fine-Tune

python3 scripts/04_retrain.pybash

Internally, the script constructs and executes the following MLX command:

python3 -m mlx_lm lora \
  --model models/baseline \
  --train \
  --data data \
  --batch-size 4 \
  --iters 400 \
  --val-batches 10 \
  --learning-rate 5e-05 \
  --steps-per-eval 50 \
  --save-every 150 \
  --adapter-path models/retrained/adapters \
  --num-layers 16 \
  --seed 42bash

Step 04 — Results Reading the Training Logs

What the Terminal Shows Us

Every 50 iterations MLX pauses to run a validation check. We watch the Val loss column to understand whether the model is learning, plateauing, or overfitting. Here is the log from our final successful run:

Trainable parameters: 0.968% (1.303M/134.515M)
Starting training..., iters: 400

Iter 1:   Val loss 4.290          ← Model is guessing randomly
Iter 50:  Val loss 2.674          ← Rapid early learning
Iter 100: Val loss 2.467          ← Steadily improving
Iter 150: Val loss 2.506          ← Checkpoint saved
Iter 200: Val loss 2.523          ← Stable, not overfitting
Iter 250: Val loss 2.509          ← Holding steady
Iter 300: Val loss 2.507          ← Checkpoint saved
Iter 350: Val loss 2.424          ← Late improvement (balanced data working)
Iter 400: Val loss 2.341          ← Best score — training complete

Fine-tuning complete in 47.0s
Fused model saved to models/retrained/fused

Why This Run Succeeded

The key diagnostic is that the validation loss continued improving all the way to iteration 400. In our first run (8 layers, 1e-5 learning rate), the loss stalled at 2.52 after 150 steps and never recovered. Here, with 16 layers and 5e-5, the model had both the capacity (more layers) and the signal strength (higher learning rate) to keep learning past that plateau.

⚡

Hardware note: Peak memory usage was 1.00 GB — remarkably low for a 134M parameter model. MLX's unified memory architecture on Apple Silicon handles this far more efficiently than a traditional GPU setup would. The entire 400-iteration training run completed in 47 seconds.

Step 05 Evaluation — Before vs. After

The Evaluation Script

The eval script (05_retrained_eval.py) does more than check whether answers are right. It measures how confident the model is, which gives us a much richer picture of what the fine-tuning actually changed.

Metric	What It Measures	Ideal Direction
Eval Loss	Total prediction error across the test set	↓ Lower is better
Perplexity	The model's "confusion" — how surprised it is by the correct answer	↓ Lower is better
Accuracy	Percentage of tokens predicted correctly	↑ Higher is better

📌

Why we use the adapter directly — not just the fused model: The eval script intentionally loads the base model and "snaps" the adapters on top at runtime rather than loading the fused folder. On small models, the mathematical merge of fusing can introduce a tiny rounding error. Loading the adapter dynamically gives us the most mathematically pure measurement of what the fine-tuning achieved.

The Results

Baseline — Before Fine-Tuning

Eval Loss3.3868

Perplexity29.57

Accuracy48.26%

Texts Evaluated10

Retrained — After Fine-Tuning

Eval Loss1.6271

Perplexity5.09

Accuracy68.9%

Texts Evaluated10

What Each Improvement Means

Perplexity dropped 83% — from 29.57 down to 5.09. Before training, the model was genuinely confused by SMS-style text and the spam/not-spam framing. After training, it "expects" the word spam after seeing a prize announcement — that kind of anticipatory confidence is exactly what low perplexity measures.

Accuracy went from 48% to 69% — essentially moving from "coin flip" territory to a functional classifier in 47 seconds of training. The 20-point jump is the concrete payoff of fixing the majority-class bias in the dataset.

Step 05+ The Stress Test — Adversarial Inputs

A model that only works on clean, "textbook" examples is not a useful model. We add two adversarial messages to the evaluation set to test whether the model has learned a genuine concept of spam or is just pattern-matching on trigger words like "FREE" and "WINNER."

The Two Stress Test Cases

Type	Message	Label	Why It's Hard
Phishing URL	Urgent: Action required on your account. Log in at http://bit.ly to avoid suspension.	`spam`	No "FREE" or "WINNER" — uses fear and a shortened link instead. Tests whether the model recognizes the authority + urgency + link pattern used in modern phishing.
Friendly "win"	Did you win that money at the poker game last night? Let me know!	`not spam`	Uses the trigger words "win" and "money" in a completely normal conversational context. Tests whether the model has high precision — it must not flag friends talking about poker.

Stress Test Results

Retrained Evaluation -- STRESS TEST
Load mode   : adapter
Eval Loss   : 1.6607   (+0.03 vs. standard eval)
Perplexity  : 5.2628   (+0.17 vs. standard eval)
Accuracy    : 68.42%   (–0.5% vs. standard eval)
Texts used  : 12       (10 originals + 2 adversarial)

Accuracy barely moved — from 68.9% down to 68.42% — across the two adversarial cases. This is the result we want. A "brittle" model that had only memorized its training examples would have shown a sharp accuracy drop when confronted with novel phrasing. The near-identical scores prove the model is generalizing the underlying concept rather than relying on a lookup table of spam words.

🎯

What "generalization" looks like in practice: A model that generalized correctly won't flag "Did you win that poker game?" as spam just because it learned the word "win." It understands that who is asking and how they're asking matters as much as which words appear.

Step 06 Fusing the Model for Production

After training, we have two separate artifacts: the original base model and the adapter file (adapters.safetensors). For production use, we want a single self-contained folder that includes everything. The fuse step mathematically bakes the adapter weights into the base model's layers — the equivalent of permanently installing a lens attachment rather than clipping it on each time.

Running the Fusion

python3 -m mlx_lm fuse \
    --model models/baseline \
    --adapter-path models/retrained/adapters \
    --save-path models/retrained/fusedbash

Verifying the Output

ls -l models/retrained/fused

# model.safetensors     ← the "brain" with adapters baked in
# config.json           ← model architecture and settings
# tokenizer.json        ← how to convert text ↔ tokens
# tokenizer_config.json ← tokenizer settingsbash

Validating Fusion Accuracy

To confirm that fusing did not degrade performance, we rename the adapter folder (forcing the eval script to fall back to the fused model) and re-run the evaluation:

mv models/retrained/adapters models/retrained/adapters_hidden
python3 scripts/05_retrained_eval.pybash

Adapter Mode (pre-fusion)

Eval Loss1.6607

Perplexity5.2628

Accuracy68.42%

Fused Mode (post-fusion)

Eval Loss1.6608

Perplexity5.2637

Accuracy68.42%

The difference between adapter mode and fused mode is 0.0009 perplexity — effectively zero. The model is production-ready. The models/retrained/fused folder is now a fully self-contained spam classifier that requires no external adapter files to operate.

Step 09 Live Interactive Testing

Starting the Chat Interface

python3 scripts/09_chat_test.pybash

The Final Boss Test Suite

We run the fused model against four progressively harder classification tasks. Each tests a different aspect of what the model learned:

Message	Challenge	Result
Hey, did you get the milk?	Baseline — a completely normal friend text	✅ `not spam`
WINNER! You won a $500 gift card! Click here now.	Classic spam — high-value prize with urgency	✅ `spam`
Did you win that poker game?	Precision test — "win" in friendly context	✅ `not spam`
Hey friend, I tried calling but you didn't answer. I found a way for us both to get a $1000 credit on our accounts today! Just go to http://bit.ly and use my invite code.	"Trojan Horse" — opens like a friend, hides a scam	✅ `spam`
Service Alert: Your mobile data plan has exceeded the monthly limit. View your updated billing statement here: http://my-account-portal.com	Authority impersonation — no prize language, just a "problem" and a link	✅ `spam`

Why the "Trojan Horse" and "Service Alert" Results Matter Most

The "Trojan Horse" message opens with "Hey friend, I tried calling" — the exact kind of friendly opener designed to disarm spam filters. The model looked past the social framing and caught the high-value monetary incentive ($1,000) plus shortened URL pattern that signals phishing.

The "Service Alert" message is arguably harder: it contains no prize words, no "FREE," no "WINNER." It impersonates a utility company using a fear-of-loss angle. The model recognized the authority tone + urgency + external link pattern as phishing — a pattern that trips up simpler keyword-based filters entirely.

🏆

What 5/5 on this test bench means: The model did not just learn to pattern-match on "WINNER" and "FREE." It learned the structural and tonal signatures of spam — urgency, external links, high-value offers, and authority impersonation — regardless of which specific words carry those signals.

Lessons What We Learned — Diagnosing Failures

Three distinct failures occurred before we reached the final working model. Each one taught something important about how LLM fine-tuning actually behaves in practice.

Failure 1 — Loss Explosion (`learning_rate: 1e-3`)

The first training attempt used a learning rate of 1e-3. Within a few iterations, the loss shot to NaN. Think of steering a car: 1e-3 is like jerking the wheel 90 degrees. The model's internal math destabilizes completely. The fix was dropping the learning rate to 2e-5 — a gentle 2-degree correction instead.

Failure 2 — Overfitting (600 iterations, 8 layers)

The second run was stable but too long. Validation loss improved through iteration 150, then started creeping back up. The model had finished learning the general patterns and was starting to memorize the specific 800 training examples — which made it perform worse on new messages. We cut iterations to 200 and lowered the learning rate further to 1e-5. The model stopped overfitting, but a new problem emerged.

Failure 3 — "Too Nice" (Majority-Class Bias)

With only 8 layers and 1e-5 learning rate, the model passed the automated eval (68.9% accuracy) but failed every live test: it classified every message — even "WINNER! Claim your prize!" — as not spam. The automated eval used a prompt format that included the label at the end; the model had learned to output "not spam" reflexively regardless of input. The root cause was the imbalanced dataset. With 85% of examples labeled "not spam," the model discovered that guessing "not spam" was statistically the safest path to a low loss.

⚠️

The fix — three changes simultaneously: (1) Balance the data to 50/50 spam/ham. (2) Increase lora_layers from 8 to 16 to give the model more capacity to represent the difference between the two classes. (3) Raise learning_rate from 1e-5 to 5e-5 so the new spam signal is strong enough to register against the model's existing priors.

The Iteration History

Run	LR	Layers	Iters	Data Split	Outcome
Run 1	`1e-3`	8	600	85/15	💥 Loss explosion (NaN)
Run 2	`2e-5`	8	600	85/15	🧱 Overfit after iter 150
Run 3	`1e-5`	8	200	85/15	😶 68.9% eval, 0% live accuracy
Run 4	`5e-5`	16	400	50/50	✅ 68.42%, 5/5 live tests passed

Summary The Complete ML Lifecycle

What we built here is not just a spam classifier — it is a complete, documented demonstration of the machine learning engineering process. From raw data to a production-ready fused model, every decision is traceable, every result is reproducible, and every failure was diagnosed and fixed scientifically.

Established a Baseline

Proved the original model was weak — 48.26% accuracy, perplexity of 29.57. Essentially coin-flip territory.

Fixed a Loss Explosion

Identified that learning_rate: 1e-3 was destabilizing training. Corrected to 5e-5 through controlled experimentation.

Solved Majority-Class Bias

Discovered the 85/15 data imbalance was causing the model to default to "not spam" for everything. Rebuilt the dataset at 50/50.

Validated Generalization

Added adversarial stress tests — a phishing URL and a poker question — and confirmed accuracy held at 68.42%.

Confirmed Fusion Fidelity

Verified the fused model matched adapter performance to 4 decimal places. The model is self-contained and deployment-ready.

Passed the Final Boss

5/5 on live interactive tests including a "Trojan Horse" phishing message and a no-keyword authority impersonation scam.

🧹

Cleanup — removing the base model cache: The Mistral-7B base model is cached in Hugging Face's hidden directory. When we are done iterating, we can free the disk space with:

rm -rf ~/.cache/huggingface/hub/models--mlx-community--Mistral-7B-Instruct-v0.3-4bit

The fused model in models/retrained/fused is self-contained and does not depend on this cache to run.