Visual · Intuition-first · End-to-End

ML in a Month

From linear regression to LLMs — curated for visual learners. StatQuest + 3Blue1Brown + Karpathy.

28
Days
4
Weeks
56+
Videos
84
Quiz Qs
Week 1
ML Foundations & Core Algorithms
Master the essential ML algorithms visually. StatQuest-powered — no heavy math, all intuition.
Day 01 What is ML? Bias-Variance & the ML Workflow
visual · concept
Watch
A Gentle Introduction to Machine Learning
StatQuest with Josh Starmer
18 min
Machine Learning Fundamentals: Bias and Variance
StatQuest
20 min
Notes

ML is the science of learning patterns from data instead of hard-coding rules. The bias-variance tradeoff is the most important concept in all of ML — high bias underfits (too simple), high variance overfits (memorises noise). Every model selection decision you make is managing this tradeoff.

  • Supervised = labelled data; Unsupervised = find structure without labels; RL = learn from rewards
  • ML workflow: collect data, clean, split train/val/test, train, evaluate, iterate
  • Never touch the test set until final evaluation — it is your unbiased reality check
  • Model complexity vs error: training error always falls, but validation error has a sweet-spot minimum
Quiz

3 questions — test your understanding

Q1. A model too simple to capture the underlying pattern is said to have:

Q2. Which dataset split should you NEVER look at until you have a final model?

Q3. In supervised learning, training examples must include:

Day 02 Linear Regression & Gradient Descent
visual · concept
Watch
Linear Regression, Clearly Explained
StatQuest
27 min
Gradient Descent, Step by Step
StatQuest
20 min
Notes

Linear regression fits a line through data by minimising the sum of squared residuals. Gradient descent is the numerical engine — compute the slope of the loss, then take a small step downhill. Almost every ML algorithm is optimised via some variant of gradient descent.

  • Loss function MSE: average squared difference between predictions and actual values
  • Update rule: weight = weight minus learning_rate times gradient of loss
  • Learning rate too large: overshoots minimum and diverges. Too small: converges very slowly
  • Mini-batch GD: compromise between full-batch (stable, slow) and stochastic (noisy, fast)
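The update rule in the bullets above can be sketched in a few lines of plain Python. The toy dataset, starting weight and learning rate here are invented purely for illustration:

```python
# Toy gradient descent: fit y = w*x to points generated with true w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0      # initial weight
lr = 0.05    # learning rate: too large diverges, too small crawls

for step in range(200):
    # dMSE/dw = average of 2 * (w*x - y) * x over the data
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w = w - lr * grad    # the update rule: weight minus lr times gradient

print(round(w, 4))  # converges towards the true weight, 2.0
```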
Quiz

3 questions — test your understanding

Q1. MSE stands for:

Q2. A learning rate that is too large will cause gradient descent to:

Q3. Weights are updated in the direction of:

Day 03 Logistic Regression & Classification Metrics
visual
Watch
Logistic Regression, Clearly Explained
StatQuest
19 min
ROC and AUC, Clearly Explained
StatQuest
16 min
Notes

Logistic regression predicts probabilities using the sigmoid function, which squashes any real value into the range 0-1. Accuracy alone misleads on imbalanced datasets. Use precision, recall, F1 and AUC-ROC for a complete picture.

  • Sigmoid: 1 / (1 + e^-z) — output is always a probability between 0 and 1
  • Decision threshold default is 0.5 but tuning it changes the precision-recall tradeoff
  • Precision = TP / (TP+FP) — of all predicted positives, how many were actually positive
  • AUC = 1.0 is perfect; AUC = 0.5 is no better than random chance
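A minimal sketch of the two formulas above; the TP/FP counts in the usage lines are made-up numbers:

```python
import math

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def precision(tp, fp):
    # Of all predicted positives, how many were actually positive
    return tp / (tp + fp)

print(sigmoid(0))        # 0.5, where the default decision threshold sits
print(precision(8, 2))   # 0.8
```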
Quiz

3 questions — test your understanding

Q1. The sigmoid function outputs values in the range:

Q2. An AUC score of 0.5 means the model is:

Q3. Precision is defined as TP / (TP + ?):

Day 04 Decision Trees & Cross-Validation
visual
Watch
Decision and Classification Trees, Clearly Explained
StatQuest
22 min
Machine Learning Fundamentals: Cross Validation
StatQuest
8 min
Notes

Decision trees split data to maximise information gain, reducing Gini impurity at each node. They are highly interpretable but overfit easily. Cross-validation gives a much more reliable generalisation estimate than a single train/val split.

  • Gini impurity = 1 minus sum of p_i squared — 0 means perfectly pure node
  • Information gain = parent impurity minus weighted average of children impurity
  • Limit depth: max_depth and min_samples_leaf are your main anti-overfitting controls
  • K-Fold CV: train on k-1 folds, test on 1, rotate k times and average the score
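The Gini and information-gain formulas above, written out as a sketch in plain Python; the tiny label lists are invented for illustration:

```python
def gini(labels):
    # 1 minus sum of p_i squared; 0 means a perfectly pure node
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def information_gain(parent, children):
    # Parent impurity minus the weighted average of the children's impurity
    n = len(parent)
    weighted = sum(len(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

parent = ["a", "a", "b", "b"]
split = [["a", "a"], ["b", "b"]]          # a perfect split
print(gini(parent))                       # 0.5
print(information_gain(parent, split))    # 0.5: all impurity removed
```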
Quiz

3 questions — test your understanding

Q1. A Gini impurity of 0 means the node is:

Q2. A deep unpruned decision tree will tend to:

Q3. The main purpose of k-fold cross-validation is to:

Day 05 Random Forests & Gradient Boosting (XGBoost)
visual
Watch
Random Forests, Clearly Explained
StatQuest
30 min
XGBoost Part 1 — Regression Main Ideas
StatQuest
25 min
Notes

Random Forests train many trees on bootstrapped samples with random feature subsets — averaging uncorrelated trees drops variance dramatically. XGBoost builds trees sequentially, each new tree correcting the errors of the ensemble built so far. These methods dominate tabular data competitions.

  • Bootstrap = sampling with replacement — each tree sees a different subset of training data
  • Out-of-bag error: unused samples act as a free built-in validation set
  • Bagging reduces variance; Boosting reduces bias by focusing on hard examples
  • XGBoost dominates Kaggle for tabular data — learn it well
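Bootstrapping and out-of-bag samples can be illustrated in a few lines; the dataset size and random seed are arbitrary:

```python
import random

random.seed(0)
n = 10
indices = list(range(n))

# Bootstrap sample: draw n points WITH replacement, so duplicates are expected
bag = [random.choice(indices) for _ in range(n)]

# Out-of-bag: the points this tree never saw act as a free validation set
oob = sorted(set(indices) - set(bag))

print(bag, oob)
```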
Quiz

3 questions — test your understanding

Q1. Random Forests reduce variance primarily through:

Q2. Out-of-Bag error is computed using:

Q3. XGBoost is an example of which ensemble technique?

Day 06 SVMs, Regularisation (L1 & L2) & Naive Bayes
visual · concept
Watch
Support Vector Machines — Main Ideas
StatQuest
20 min
Ridge (L2) Regularisation, Clearly Explained
StatQuest
16 min
Notes

SVMs find the maximum margin hyperplane — the widest gap between classes. The kernel trick maps data to higher dimensions for non-linear classification. L1 regularisation creates sparse solutions (feature selection); L2 shrinks all weights proportionally.

  • Support vectors: the few data points closest to the boundary — only they determine the hyperplane
  • Kernel trick: achieve non-linear classification without explicit high-dimensional mapping
  • L2 (Ridge): shrinks all weights proportionally. L1 (Lasso): shrinks some weights to exactly zero
  • Naive Bayes: fast probabilistic classifier that assumes feature independence — great for text
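One way to see the L1-vs-L2 difference concretely is through the closed-form shrinkage each penalty applies to a single weight. This is a sketch assuming a simple quadratic loss; the penalty strength 0.5 is arbitrary:

```python
def ridge_shrink(w, lam):
    # L2: shrinks every weight proportionally, never exactly to zero
    return w / (1.0 + lam)

def lasso_shrink(w, lam):
    # L1 soft-thresholding: small weights are snapped to exactly zero
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [3.0, 0.4, -0.2]
print([ridge_shrink(w, 0.5) for w in weights])  # all shrunk, none zero
print([lasso_shrink(w, 0.5) for w in weights])  # small ones become exactly 0
```

This zeroing behaviour is why Lasso doubles as automatic feature selection.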
Quiz

3 questions — test your understanding

Q1. The support vectors in SVM are:

Q2. L1 (Lasso) regularisation is special because it can:

Q3. The kernel trick in SVMs allows:

Day 07 K-Means Clustering, PCA & Dimensionality Reduction
visual
Watch
K-means Clustering, Clearly Explained
StatQuest
12 min
PCA Step by Step — Main Ideas
StatQuest
21 min
Notes

K-Means assigns each point to the nearest centroid, then recomputes each centroid as the mean of its assigned points — repeat until stable. PCA finds directions of maximum variance and projects data into fewer dimensions while retaining structure. These are the two most important unsupervised techniques.

  • K-Means is sensitive to initialisation — always use k-means++ for better starting centroids
  • Elbow method: plot inertia vs k and pick the bend point to choose the number of clusters
  • PCA principal components are orthogonal directions ordered by variance explained
  • First 2-3 components often capture 80-90% of variance — enough for visualisation
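A minimal sketch of the K-Means loop on 1-D toy data. The points and starting centroids are invented; real code would use k-means++ initialisation as noted above:

```python
def kmeans_1d(points, centroids, iters=10):
    # Lloyd's algorithm in 1-D: assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its cluster.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
print(kmeans_1d(points, centroids=[0.0, 5.0]))  # settles near [1.0, 9.0]
```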
Quiz

3 questions — test your understanding

Q1. K-Means initialisation is improved by using:

Q2. PCA finds directions of:

Q3. The elbow method is used to find the optimal:

Week 2
Deep Learning & Neural Networks
Build real intuition for how neural nets learn — from backprop to CNNs, using 3Blue1Brown and Karpathy from-scratch builds.
Day 08 Neural Networks: The Big Picture
visual · concept
Watch
But what is a Neural Network? — Chapter 1
3Blue1Brown
19 min
Gradient Descent — How Neural Networks Learn, Ch 2
3Blue1Brown
21 min
Notes

A neural network is a massive composition of linear transformations and non-linearities. Each neuron computes a weighted sum then passes it through an activation function. The whole network is differentiable end-to-end, so gradient descent can optimise every weight simultaneously.

  • Layers: input encodes data, hidden layers extract features, output makes predictions
  • ReLU (max(0,x)) is the default activation — fast, avoids vanishing gradients, works well
  • Universal approximation: a sufficiently wide single hidden layer can approximate any continuous function
  • Depth gives compositional power — deep nets learn hierarchical representations
Quiz

3 questions — test your understanding

Q1. The ReLU activation function is defined as:

Q2. The universal approximation theorem states:

Q3. Activation functions are necessary to:

Day 09 Backpropagation — How Neural Nets Actually Learn
visual · concept
Watch
Backpropagation Intuitively — Chapter 3
3Blue1Brown
14 min
Backpropagation Calculus — Chapter 4
3Blue1Brown
10 min
Notes

Backpropagation is just the chain rule of calculus applied to a computation graph. Forward pass computes predictions and loss; backward pass computes how much each weight contributed to the error. These gradients tell us exactly how to update every weight.

  • Forward pass: compute output and loss. Backward pass: compute all gradients via chain rule
  • Chain rule: dL/dw = (dL/dy) times (dy/dw) — gradients multiply along the path
  • Vanishing gradient: in deep sigmoid networks, gradients shrink exponentially through layers
  • PyTorch autograd handles all of this — loss.backward() computes all gradients automatically
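The chain-rule bullet can be checked numerically on a one-weight "network". A sketch with arbitrary toy values:

```python
# Loss for a one-weight model: L(w) = (w*x - y)^2.
# Chain rule: dL/dw = (dL/dy_hat) * (dy_hat/dw) = 2*(w*x - y) * x,
# i.e. gradients multiply along the path.
x, y, w = 3.0, 6.0, 1.5

y_hat = w * x
analytic = 2 * (y_hat - y) * x

# Numerical gradient via finite differences: what autograd saves us from
eps = 1e-6
numeric = (((w + eps) * x - y) ** 2 - ((w - eps) * x - y) ** 2) / (2 * eps)

print(analytic, round(numeric, 4))  # both are approximately -9.0
```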
Quiz

3 questions — test your understanding

Q1. Backpropagation is fundamentally based on:

Q2. Vanishing gradients most severely affect:

Q3. In PyTorch, calling loss.backward() will:

Day 10 Build a Neural Net from Scratch — Karpathy micrograd
code-along
Watch
The Spelled-Out Intro to Backprop: Building micrograd
Andrej Karpathy — CODE ALONG
2.5 hrs
Notes

Karpathy's masterpiece. You build a tiny autograd engine from scratch in pure Python — no libraries. By building it yourself, backpropagation will permanently click. Every line of code reveals how PyTorch works under the hood. Do not just watch — pause and type every line.

  • Every operation (+, *, tanh) stores its own backward function for the chain rule
  • Topological sort of the computation graph ensures gradients flow in the correct order
  • Run this in Google Colab — no local setup needed
  • After this, gradient descent and backprop will never be a black box again
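A heavily stripped-down sketch of the idea, supporting only + and * (Karpathy's actual micrograd also handles tanh, powers and more):

```python
class Value:
    """A minimal micrograd-style scalar autograd node."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # each op stores its own backward fn
        self._prev = set(children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad                # d(a+b)/da = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):                    # topological sort of the graph
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                  # dL/dL = 1
        for node in reversed(topo):
            node._backward()             # chain rule, output back to inputs

a, b = Value(2.0), Value(3.0)
loss = a * b + a            # dloss/da = b + 1 = 4, dloss/db = a = 2
loss.backward()
print(a.grad, b.grad)       # 4.0 2.0
```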
Quiz

3 questions — test your understanding

Q1. Automatic differentiation (autograd) works by:

Q2. Topological sort of the computation graph ensures:

Q3. micrograd operates only on scalar values, meaning each operation is:

Day 11 CNNs — Convolutional Neural Networks
visualconcept
Watch
But what is a convolution? (Visual deep-dive)
3Blue1Brown
23 min
Convolutional Neural Networks — MIT 6.S191
MIT Deep Learning
45 min
Notes

In a CNN, a small filter/kernel slides over the image computing dot products, detecting features wherever they appear. Deep CNNs build a feature hierarchy: early layers detect edges, middle layers shapes, deep layers objects. Parameter sharing makes CNNs vastly more efficient than fully-connected nets on images.

  • Parameter sharing: one 3x3 filter scans the entire image, reducing parameters enormously
  • Feature maps: the output of one filter applied across the image — each filter detects one feature type
  • MaxPooling: take the max in each region, reducing spatial size while keeping dominant activations
  • BatchNorm: normalise activations after each layer, stabilises training and allows higher learning rates
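The sliding-filter idea in plain Python. A sketch with an invented 4x4 image and a simple vertical-edge kernel:

```python
def conv2d_valid(image, kernel):
    # Slide the kernel over the image, computing a dot product at each spot.
    # "Valid" convolution: no padding, output shrinks by kernel_size - 1.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge detector on an image with an edge down the middle
image = [[0, 0, 1, 1]] * 4
kernel = [[-1, 1], [-1, 1]]   # fires where intensity jumps left-to-right
print(conv2d_valid(image, kernel))  # non-zero only at the edge column
```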
Quiz

3 questions — test your understanding

Q1. Parameter sharing in CNNs means:

Q2. MaxPooling is primarily used to:

Q3. Early layers in a deep CNN tend to detect:

Day 12 Training Deep Nets: Adam, Dropout & BatchNorm
concept
Watch
Building makemore Part 3: Activations, Gradients, BatchNorm
Andrej Karpathy
1.5 hrs
Stochastic Gradient Descent, Clearly Explained
StatQuest
10 min
Notes

Adam combines momentum with adaptive per-parameter learning rates — use it as your default. Dropout randomly zeros neurons during training, forcing redundant representations. BatchNorm normalises activations within a batch, dramatically stabilising deep network training.

  • Adam = adaptive moment estimation. Works well out-of-the-box — start here for every task
  • Dropout rate 0.2-0.5 during training; at test time all neurons are active
  • Weight initialisation matters: bad init causes vanishing/exploding activations from step one
  • Learning rate is the most important hyperparameter — search it on a log scale from 1e-4 to 0.1
Quiz

3 questions — test your understanding

Q1. The Adam optimiser combines:

Q2. During inference (test time), dropout should be:

Q3. Poor weight initialisation can cause:

Day 13 Transfer Learning & Fine-tuning
code · concept
Watch
Transfer Learning — fast.ai Practical Deep Learning Lesson 1
fast.ai / Jeremy Howard
1.5 hrs
Notes

Take a model pre-trained on millions of images (ResNet, ViT) and fine-tune its final layers on your small dataset. Early layers learn universal features that transfer to almost any vision task. This approach gives state-of-the-art results with very little data.

  • Feature extraction: freeze all base weights, only train the new classification head
  • Fine-tuning: unfreeze all layers and train with a very small learning rate (1e-5 to 1e-4)
  • Discriminative learning rates: lower LR for early layers, higher for the new head
  • HuggingFace makes transfer learning one line for NLP: AutoModel.from_pretrained()
Quiz

3 questions — test your understanding

Q1. In feature extraction (transfer learning), the base model weights are:

Q2. Early layers in a pre-trained CNN detect:

Q3. When fine-tuning a pre-trained model, the learning rate should typically be:

Day 14 RNNs, LSTMs & Why Transformers Replaced Them
visual
Watch
Illustrated Guide to Recurrent Neural Networks
The A.I. Hacker — Michael Phi
9 min
Illustrated Guide to LSTMs and GRUs
The A.I. Hacker — Michael Phi
12 min
Notes

RNNs process sequences by passing a hidden state from step to step — but gradients vanish over long sequences. LSTMs fix this with a cell state and gating mechanism, yet still must process tokens one at a time. Transformers parallelise over the full sequence at once — this was the key breakthrough that enabled scale.

  • LSTM has three gates: forget (what to erase), input (what to add), output (what to expose)
  • GRU = simpler LSTM with two gates and fewer parameters — often similar performance
  • RNNs are sequential; Transformers are parallel — why transformers scale so much better
  • RNNs still used in streaming/online inference scenarios where full-sequence access is unavailable
Quiz

3 questions — test your understanding

Q1. LSTMs were designed to solve:

Q2. GRUs compared to LSTMs have:

Q3. The key reason Transformers replaced RNNs is:

Week 3
NLP, Attention & Transformers
From word embeddings to building GPT from scratch. This is where it all comes together.
Day 15 Word Embeddings — Word2Vec & Semantic Space
visual
Watch
Word Embedding and Word2Vec — StatQuest NLP Series
StatQuest
22 min
Word2Vec — Illustrated and Explained
Rasa
16 min
Notes

Words can be represented as dense vectors where semantically similar words cluster together. The famous result: King - Man + Woman = Queen. Modern LLMs use contextual embeddings — the same word gets a different vector depending on its context.

  • One-hot = huge sparse vectors; embeddings = compact dense vectors (e.g. 300 dimensions)
  • Cosine similarity measures the angle between vectors — the metric for semantic closeness
  • CBOW: predict centre word from context. Skip-gram: predict context from centre word
  • BERT embeddings are contextual: "bank" gets different vectors in "river bank" vs "bank account"
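Cosine similarity and the analogy arithmetic, sketched with tiny made-up 3-d vectors (real Word2Vec embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    # Angle-based similarity: 1 = same direction, 0 = orthogonal
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented 3-d "embeddings", purely to illustrate the arithmetic
king, man, woman = [0.9, 0.8, 0.1], [0.9, 0.1, 0.1], [0.1, 0.1, 0.9]
queen = [0.1, 0.8, 0.9]

analogy = [k - m + w for k, m, w in zip(king, man, woman)]  # king - man + woman
print(round(cosine(analogy, queen), 3))  # far closer to queen than to man
```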
Quiz

3 questions — test your understanding

Q1. Word embeddings represent words as:

Q2. Cosine similarity between word vectors measures:

Q3. Contextual embeddings (like BERT) differ from Word2Vec because:

Day 16 The Attention Mechanism — The Key Innovation
visual · LLM
Watch
Attention in Transformers, Step by Step — Chapter 6
3Blue1Brown
27 min
Illustrated Guide to Transformers — Step by Step
The A.I. Hacker — Michael Phi
15 min
Notes

Attention lets each token look at every other token and decide relevance. It computes Queries (Q), Keys (K), and Values (V): Q dot K produces relevance scores, softmax normalises them, then we take a weighted sum of V. This lets the model disambiguate words using long-range context.

  • Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) times V — learn this formula
  • Scaling by sqrt(d_k) prevents dot products from becoming too large, causing softmax saturation
  • Multi-head: run attention h times in parallel with different projections, then concatenate
  • Causal masking in GPT: each position can only attend to itself and previous positions
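The formula in the first bullet, sketched in plain Python for two tokens (the Q/K/V matrices are toy values):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)   # relevance of each token, sums to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two tokens, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # each row is a weighted mix of the value vectors
```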
Quiz

3 questions — test your understanding

Q1. We scale the dot product by sqrt(d_k) to:

Q2. Multi-head attention runs attention:

Q3. Causal masking in GPT ensures:

Day 17 Transformer Architecture — The Full Picture
visual · LLM
Watch
Transformers, the Tech Behind LLMs — Chapter 5
3Blue1Brown
27 min
Notes

A Transformer block = Multi-Head Self-Attention + Feed-Forward Network + Residual Connections + LayerNorm. Stack 12 to 96 of these and you have GPT. The residual stream is the backbone — each block reads from it and adds its contribution back.

  • Residual connections: output = x + F(x) — allows gradients to flow through the skip path
  • LayerNorm normalises each token's embedding independently (vs BatchNorm which normalises per-feature)
  • FFN in each block: two linear layers with GELU — acts as a key-value memory store for facts
  • Positional encoding: sinusoidal or learned vectors added to embeddings to encode order
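A sketch of LayerNorm and the pre-norm residual pattern, with the learned scale/shift parameters omitted for clarity; the "sublayer" here is a trivial stand-in for attention or the FFN:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalise ONE token's embedding to mean 0, variance 1
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def block(x, sublayer):
    # Pre-norm residual: output = x + F(LayerNorm(x)),
    # so gradients always have a clean path through the skip connection
    h = layer_norm(x)
    return [xi + fi for xi, fi in zip(x, sublayer(h))]

x = [1.0, 2.0, 3.0, 4.0]
out = block(x, lambda h: [0.1 * hi for hi in h])  # toy "sublayer"
print(out)
```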
Quiz

3 questions — test your understanding

Q1. Residual connections are primarily used to:

Q2. Positional encoding is necessary because:

Q3. The FFN in each transformer block:

Day 18 Build GPT from Scratch — Karpathy nanoGPT
code-along · LLM
Watch
Let's Build GPT: From Scratch, In Code, Spelled Out
Andrej Karpathy — CODE ALONG
2 hrs
Notes

The crown jewel of this curriculum. Karpathy builds a character-level GPT in roughly 200 lines of clean PyTorch. The "Attention Is All You Need" paper becomes actual working code in front of your eyes. After this, transformer papers are just implementation details.

  • Tokenisation: character-level here (65 chars), but GPT-4 uses BPE with about 100k tokens
  • Training objective: predict the next token given all previous tokens — cross-entropy loss
  • Temperature controls randomness at generation: low = conservative, high = creative/wild
  • Scaling = more data + bigger model + longer training, and performance improves predictably
Quiz

3 questions — test your understanding

Q1. GPT is trained with the objective of:

Q2. Setting temperature to 0 during generation produces:

Q3. In nanoGPT at character level, the vocabulary size is:

Day 19 Tokenisation & Byte Pair Encoding (BPE)
LLM · code
Watch
Let's Build the GPT Tokenizer from Scratch
Andrej Karpathy
2.2 hrs
Notes

BPE starts with individual bytes and iteratively merges the most frequent pair until a target vocabulary size is reached. A surprising number of LLM quirks — difficulty counting letters, bad arithmetic, odd spelling — trace directly back to tokenisation artefacts.

  • Tokens are not words: "unbelievable" might be 3-4 tokens; " the" has a leading space
  • Same text in different languages uses very different numbers of tokens
  • OpenAI tiktoken and Google SentencePiece are the two main tokeniser libraries
  • Context window length is always measured in tokens, not words or characters
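The BPE merge loop in miniature. A sketch on an invented toy corpus; real tokenisers count pairs over a huge corpus and run tens of thousands of merges:

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent pair and return the most common one
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def bpe_merge(tokens, pair, new_token):
    # Replace every occurrence of the pair with a single merged token
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
for _ in range(3):                   # three merge rounds
    pair = most_frequent_pair(tokens)
    tokens = bpe_merge(tokens, pair, pair[0] + pair[1])
print(tokens)  # frequent character pairs fuse into larger tokens like "low"
```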
Quiz

3 questions — test your understanding

Q1. BPE works by:

Q2. GPT-4 uses approximately how many tokens in its vocabulary?

Q3. LLM context window length is measured in:

Day 20 BERT vs GPT — Encoders, Decoders & Seq2Seq
LLM · concept
Watch
BERT Neural Network, Clearly Explained
StatQuest
30 min
Notes

BERT uses bidirectional attention — it sees the full context in both directions — excellent for understanding tasks. GPT uses causal (left-to-right) attention, excellent for generation. T5 and BART combine both in an encoder-decoder for tasks like translation and summarisation.

  • BERT pre-trains with Masked LM (predict the ~15% of tokens that are masked) + Next Sentence Prediction
  • GPT pre-trains with causal LM (predict next token) — simpler objective, scales much better
  • Encoder-decoder (T5, BART): encoder reads source, cross-attention lets decoder attend to it
  • BERT for classification/NER; GPT for generation; T5 for translation/summarisation
Quiz

3 questions — test your understanding

Q1. BERT uses bidirectional attention, meaning:

Q2. BERT is pre-trained using:

Q3. Encoder-decoder models like T5 are best suited for:

Day 21 HuggingFace in Practice — Fine-tune in 20 Lines
code
Watch
HuggingFace Transformers Crash Course
Nicholas Renotte
40 min
Notes

Build a sentiment classifier using a pre-trained BERT model in 20 lines. The Trainer API abstracts away the training loop — it handles batching, logging, checkpointing, and evaluation automatically. HuggingFace Hub has 500k+ public models.

  • AutoModel.from_pretrained("bert-base-uncased") downloads weights and config in one line; AutoTokenizer.from_pretrained fetches the matching tokeniser
  • Pipeline: raw text to tokenised tensors to model predictions to decoded labels
  • Trainer API: pass model + dataset + training args, call .train() — handles everything else
  • Never train a language model from scratch for NLP — always fine-tune a pre-trained one
Quiz

3 questions — test your understanding

Q1. HuggingFace AutoModel.from_pretrained() downloads:

Q2. The HuggingFace Trainer API handles:

Q3. For most NLP tasks, training from scratch is:

Week 4
LLMs, RLHF, RAG & Modern AI Systems
How GPT-4, Claude, and Llama actually work. Pre-training, alignment, RAG, fine-tuning, and what is coming next.
Day 22 Intro to Large Language Models — What Are They Really?
LLM · visual
Watch
Intro to Large Language Models — 1-Hour Overview
Andrej Karpathy
1 hr
Notes

Karpathy's one-hour masterclass on LLMs from first principles. An LLM is compressed internet knowledge stored in billions of floating point weights. He covers the full training pipeline: pre-training, supervised fine-tuning, and RLHF for alignment.

  • Stage 1 pre-training: predict next token on massive internet text — creates a base model
  • Stage 2 SFT: fine-tune on high-quality human-written demonstrations — creates assistant model
  • Emergent abilities: capabilities that appear suddenly and unpredictably at large scale
  • LLM as document completer vs LLM as assistant — two very different mental models
Quiz

3 questions — test your understanding

Q1. During LLM pre-training, the model is trained on:

Q2. Emergent abilities in LLMs refer to:

Q3. The transformation from base LLM to helpful assistant is achieved through:

Day 23 RLHF — How ChatGPT Learned to Be Helpful
LLM · concept
Watch
Reinforcement Learning from Human Feedback — Explained
Hugging Face
20 min
DPO — Direct Preference Optimisation Explained
Trelis Research
20 min
Notes

RLHF transforms base LLMs into helpful assistants. Step 1: SFT on demonstrations. Step 2: train a Reward Model on human A/B preference rankings. Step 3: use PPO to optimise the LLM against the reward model. DPO is a newer, simpler alternative that skips the reward model entirely.

  • Reward Model: trained on pairs of responses with human preference labels (A is better than B)
  • PPO: Proximal Policy Optimisation — maximise reward without drifting too far from base model
  • KL divergence penalty: prevents the fine-tuned model from becoming too different from the base
  • DPO: directly optimises on preference data — simpler, stabler, no RL loop needed
Quiz

3 questions — test your understanding

Q1. The Reward Model in RLHF is trained on:

Q2. The KL divergence penalty in RLHF prevents:

Q3. DPO improves on RLHF by:

Day 24 RAG — Retrieval-Augmented Generation
LLMcode
Watch
Retrieval Augmented Generation (RAG) Explained
IBM Technology
10 min
Build a RAG Pipeline from Scratch with LangChain
Sam Witteveen
35 min
Notes

LLMs hallucinate because they rely on weights from training. RAG solves this: embed your documents, store them in a vector database, retrieve the most semantically relevant chunks at query time, and include them in the prompt. This is the dominant pattern in production AI apps today.

  • Pipeline: chunk docs into ~500 token pieces, embed each chunk, store in vector DB (Chroma, Pinecone)
  • At query time: embed the question, find top-k similar chunks by cosine similarity, stuff into prompt
  • ANN (Approximate Nearest Neighbour) makes retrieval fast even with millions of vectors
  • Advanced RAG: reranking with cross-encoder, HyDE (generate hypothetical doc first), parent-child retrieval
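The query-time half of the pipeline, sketched with hypothetical pre-computed 3-d embeddings; a real system would embed text with a model and store the vectors in a DB such as Chroma:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical chunk embeddings (invented numbers, purely illustrative)
chunks = {
    "Paris is the capital of France.":      [0.9, 0.1, 0.0],
    "Mitochondria produce cellular energy.": [0.0, 0.2, 0.9],
    "France borders Spain and Italy.":      [0.8, 0.3, 0.1],
}

def retrieve(query_vec, k=2):
    # Top-k chunks by cosine similarity; these get stuffed into the prompt
    ranked = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec),
                    reverse=True)
    return ranked[:k]

query = [0.95, 0.05, 0.0]   # stand-in for an embedded question about France
print(retrieve(query))      # the two France chunks rank on top
```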
Quiz

3 questions — test your understanding

Q1. RAG primarily addresses the problem of:

Q2. A vector database stores:

Q3. Semantic similarity at query time is measured using:

Day 25 Prompt Engineering & In-Context Learning
LLM · code
Watch
Prompt Engineering Guide — Full Lecture
DAIR.AI
1 hr
Notes

Chain-of-Thought prompting ("think step by step") dramatically improves multi-step reasoning by forcing the model to externalise intermediate steps. Few-shot examples demonstrate exact output format. In-context learning: the model adapts to your examples with zero weight updates.

  • Zero-shot: task description only. Few-shot: 2-5 input-output example pairs before the actual query
  • CoT gives large gains on math, logic, and code tasks — add "let's think step by step"
  • System prompt sets role; user prompt is the task; assistant prompt seeds output format
  • Temperature 0 for code/facts; 0.7-1.0 for creative writing; above 1 gets incoherent quickly
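Temperature is just a divisor applied to the logits before softmax. A sketch with invented logits:

```python
import math

def softmax_with_temperature(logits, t):
    # Low t sharpens the distribution towards the argmax (greedy);
    # high t flattens it towards uniform (creative/wild)
    scaled = [l / t for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```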
Quiz

3 questions — test your understanding

Q1. Chain-of-Thought prompting most improves LLM performance on:

Q2. Few-shot prompting means:

Q3. For deterministic/factual outputs like code, set temperature to:

Day 26 Fine-tuning LLMs: LoRA, QLoRA & PEFT
LLM · code
Watch
LoRA Fine-tuning Explained
Weights and Biases
30 min
Fine-tune Llama with QLoRA on One GPU
Maxime Labonne
25 min
Notes

LoRA inserts tiny trainable adapter matrices alongside frozen base weights — training only 0.1% of parameters. QLoRA adds 4-bit quantisation so you can fine-tune a 7B model on a single consumer GPU. This is how the entire open-source LLM community does domain-specific fine-tuning.

  • LoRA: add delta_W = B times A (low-rank) to each weight matrix; only train B and A
  • Rank r=8 or 16 is usually sufficient — higher rank means more expressiveness but more parameters
  • QLoRA: quantise base weights to 4-bit (NF4), apply LoRA adapters in 16-bit = 4x memory reduction
  • Use HuggingFace PEFT + trl SFTTrainer — 50-100 examples often enough for format fine-tuning
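The first bullet, sketched in plain Python: B is initialised to zero so training starts from the unmodified base model, and only B and A (2 x d x r values) are trainable. The sizes here are toy values:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d, r = 64, 8    # hidden size and LoRA rank (real models: d in the thousands)
random.seed(0)

W0 = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(d)]  # frozen
B  = [[0.0] * r for _ in range(d)]   # B = 0, so delta_W starts at zero
A  = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # trainable

delta_W = matmul(B, A)               # d x d update with rank <= r
W_eff = [[w + dw for w, dw in zip(wr, dr)]
         for wr, dr in zip(W0, delta_W)]

full_params = d * d
lora_params = d * r + r * d
print(full_params, lora_params)  # 4096 vs 1024; the ratio improves as d grows
```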
Quiz

3 questions — test your understanding

Q1. LoRA reduces trainable parameters by:

Q2. QLoRA enables fine-tuning large LLMs on one GPU by:

Q3. LoRA stands for:

Day 27 LLM Internals — How LLMs Store and Recall Facts
visual · LLM
Watch
How Might LLMs Store Facts — Deep Learning Chapter 7
3Blue1Brown
23 min
Notes

The MLP layers inside each transformer block function as a key-value associative memory store. The first linear layer detects "key" patterns (is this about Paris?), the second outputs the associated "value" (the Eiffel Tower is there). Mechanistic interpretability is the emerging science of reverse-engineering these circuits.

  • MLP first layer = pattern detectors (keys); second layer = associated stored values
  • Superposition: networks encode many more features than dimensions using near-orthogonal directions
  • Hallucinations occur when the model's internal confidence is miscalibrated for a fact
  • Anthropic and DeepMind are leading mechanistic interpretability research — fascinating frontier
Quiz

3 questions — test your understanding

Q1. The MLP layers in transformers are thought to function as:

Q2. Superposition in neural networks refers to:

Q3. Mechanistic interpretability research aims to:

Day 28 AI Agents, Scaling Laws & What Comes Next
LLM · concept
Watch
AI Agents Explained — Tool Use, Planning and Memory
IBM Technology
12 min
Scaling Laws for Neural Language Models — Explained
Yannic Kilcher
38 min
Notes

LLM Agents use tools (web search, code execution, APIs) in a ReAct loop: Reason about what to do, Act by calling a tool, Observe the result, repeat. The Chinchilla scaling laws showed model size and training data should scale together — this changed how every major lab trains models.

  • ReAct = Reason + Act + Observe loop — models can self-correct over multiple steps
  • Tool use: function calling lets LLMs trigger APIs, web search, code interpreters, databases
  • Chinchilla law: for 10x more compute, increase model size by ~3x AND training tokens by ~3x
  • Multimodal LLMs embed image patches as tokens alongside text — same transformer architecture
Quiz

3 questions — test your understanding

Q1. The ReAct pattern stands for:

Q2. The Chinchilla scaling law states that for optimal training:

Q3. Multimodal LLMs handle images by: