From linear regression to LLMs — curated for visual learners. StatQuest + 3Blue1Brown + Karpathy.
ML is the science of learning patterns from data instead of hard-coding rules. The bias-variance tradeoff is the most important concept in all of ML — high bias underfits (too simple), high variance overfits (memorises noise). Every model selection decision you make is managing this tradeoff.
3 questions — test your understanding
Q1. A model too simple to capture the underlying pattern is said to have:
Q2. Which dataset split should you NEVER look at until you have a final model?
Q3. In supervised learning, training examples must include:
Linear regression fits a line through data by minimising the sum of squared residuals. Gradient descent is the numerical engine — compute the slope of the loss, then take a small step downhill. Almost every ML algorithm is optimised via some variant of gradient descent.
3 questions — test your understanding
Q1. MSE stands for:
Q2. A learning rate that is too large will cause gradient descent to:
Q3. Weights are updated in the direction of:
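The loop described above fits in a few lines of NumPy. The data here is synthetic (a noisy line with slope 2 and intercept 1, values chosen for this sketch), and the learning rate is an arbitrary small constant:

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (true slope/intercept chosen for this sketch)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # learning rate: how big a step downhill

for _ in range(500):
    y_hat = w * x + b
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w   # step opposite the gradient
    b -= lr * grad_b

# w and b should now be close to the true slope 2 and intercept 1
```

Try a learning rate of 2.0 instead: the loss diverges, which is exactly the "too large" failure mode from Q2.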
Logistic regression predicts probabilities using the sigmoid function, which squashes any real value into the range 0-1. Accuracy alone misleads on imbalanced datasets. Use precision, recall, F1, and AUC-ROC for a complete picture.
3 questions — test your understanding
Q1. The sigmoid function outputs values in the range:
Q2. An AUC score of 0.5 means the model is:
Q3. Precision is defined as TP / (TP + ?):
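A quick numeric illustration: sigmoid at zero, plus precision, recall, and F1 computed from a made-up confusion matrix (the counts are invented for this sketch):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, the decision boundary

# Toy confusion-matrix counts (invented for illustration)
tp, fp, fn = 30, 10, 20
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall    = tp / (tp + fn)   # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```

Note how a classifier could score 90% accuracy on a 90/10 imbalanced dataset by always predicting the majority class, while its recall on the minority class would be 0.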
Decision trees split data to maximise information gain, reducing Gini impurity at each node. They are highly interpretable but overfit easily. Cross-validation gives a much more reliable generalisation estimate than a single train/val split.
3 questions — test your understanding
Q1. A Gini impurity of 0 means the node is:
Q2. A deep unpruned decision tree will tend to:
Q3. The main purpose of k-fold cross-validation is to:
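Gini impurity is simple enough to compute by hand: a pure node scores 0, and a 50/50 two-class node scores 0.5, the maximum for two classes. A small sketch:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two random draws from the node disagree."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure  = gini(["a"] * 10)             # 0.0 -> perfectly pure node
mixed = gini(["a"] * 5 + ["b"] * 5)  # 0.5 -> maximally impure for 2 classes
```

A split's information gain is the parent's impurity minus the weighted impurity of its children; the tree greedily picks the split that maximises it.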
Random Forests train many trees on bootstrapped samples with random feature subsets — averaging uncorrelated trees drops variance dramatically. XGBoost builds trees sequentially where each one corrects the previous tree's mistakes. These dominate tabular data competitions.
3 questions — test your understanding
Q1. Random Forests reduce variance primarily through:
Q2. Out-of-Bag error is computed using:
Q3. XGBoost is an example of which ensemble technique?
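The variance-reduction claim can be checked numerically. Here each "tree" is idealised as an independent noisy estimator; real bootstrapped trees are correlated, which limits the reduction (random feature subsets exist precisely to decorrelate them):

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 trials; each "tree" is an independent noisy estimate of the true value 0
single   = rng.normal(0, 1, size=(10_000,))                   # one tree per trial
ensemble = rng.normal(0, 1, size=(10_000, 50)).mean(axis=1)   # average of 50 trees

# Averaging 50 independent estimators divides the variance by 50
print(single.var(), ensemble.var())   # roughly 1.0 vs roughly 0.02
```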
SVMs find the maximum margin hyperplane — the widest gap between classes. The kernel trick implicitly maps data to higher dimensions for non-linear classification. L1 regularisation creates sparse solutions (feature selection); L2 shrinks all weights toward zero but rarely to exactly zero.
3 questions — test your understanding
Q1. The support vectors in SVM are:
Q2. L1 (Lasso) regularisation is special because it can:
Q3. The kernel trick in SVMs allows:
K-Means assigns each point to the nearest centroid, then recomputes each centroid as the mean of its assigned points — repeat until stable. PCA finds directions of maximum variance and projects data into fewer dimensions while retaining structure. These are the two most important unsupervised techniques.
3 questions — test your understanding
Q1. K-Means initialisation is improved by using:
Q2. PCA finds directions of:
Q3. The elbow method is used to find the optimal:
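The two-step loop (assign, then recompute) in NumPy, run on synthetic blobs. The random init here is the naive version; k-means++ (picking spread-out seeds) is the standard improvement:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Naive init: k random data points (k-means++ picks spread-out seeds instead)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Two well-separated synthetic blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Running this for k = 1..8 and plotting the within-cluster distances gives the elbow curve from Q3.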
A neural network is a massive composition of linear transformations and non-linearities. Each neuron computes a weighted sum then passes it through an activation function. The whole network is differentiable end-to-end, so gradient descent can optimise every weight simultaneously.
3 questions — test your understanding
Q1. The ReLU activation function is defined as:
Q2. The universal approximation theorem states:
Q3. Activation functions are necessary to:
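One layer of a network is just a matrix of weighted sums followed by an activation; the weights below are made up for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)   # ReLU: max(0, z), the non-linearity

# One layer: 2 neurons, each taking a 3-dimensional input (made-up weights)
x = np.array([1.0, -2.0, 0.5])          # input vector
W = np.array([[0.2, 0.4, -0.1],
              [0.7, -0.3, 0.5]])        # one row of weights per neuron
b = np.array([0.1, -0.2])               # one bias per neuron

h = relu(W @ x + b)   # each neuron: weighted sum + bias, then activation
```

Without the `relu`, stacking such layers would collapse into a single linear map — that is why activation functions are necessary (Q3).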
Backpropagation is just the chain rule of calculus applied to a computation graph. Forward pass computes predictions and loss; backward pass computes how much each weight contributed to the error. These gradients tell us exactly how to update every weight.
3 questions — test your understanding
Q1. Backpropagation is fundamentally based on:
Q2. Vanishing gradients most severely affect:
Q3. In PyTorch, calling loss.backward() will:
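The chain rule can be checked by hand on a one-parameter loss. For L = (wx - y)^2, the analytic gradient is 2(wx - y)*x, and a finite-difference slope should agree with it:

```python
# Chain rule: for L = (w*x - y)**2,
# dL/dw = 2*(w*x - y) * x   (outer derivative times inner derivative)
w, x, y = 3.0, 2.0, 1.0

analytic = 2 * (w * x - y) * x   # 2*(6-1)*2 = 20

# Numerical check: central finite differences approximate the same slope
eps = 1e-6
L = lambda w: (w * x - y) ** 2
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
```

Backprop does exactly this analytic computation, node by node, through the whole computation graph — without ever resorting to finite differences.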
Karpathy's masterpiece. You build a tiny autograd engine from scratch in pure Python — no libraries. Building it yourself makes backpropagation permanently click. Every line of code reveals how PyTorch works under the hood. Do not just watch — pause and type every line.
3 questions — test your understanding
Q1. Automatic differentiation (autograd) works by:
Q2. Topological sort of the computation graph ensures:
Q3. micrograd operates only on scalar values, meaning each operation is:
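The core of such an engine fits in a page. This sketch supports only + and * (micrograd itself covers more ops), but it shows the two key ideas: each operation records a local-derivative closure, and backward() replays them in reverse topological order:

```python
class Value:
    """Minimal scalar autograd node in the spirit of micrograd (add/mul only)."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad        # d(a+b)/da = 1
            other.grad += out.grad       # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort: a node's grad must be complete before its parents run
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a    # dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
```

Note the `+=` in the closures: a value used twice (like `a` here) accumulates gradient from both paths, which is exactly the multivariate chain rule.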
Convolution is the core operation: a small filter/kernel slides over the image computing dot products, detecting features wherever they appear. Deep CNNs build a feature hierarchy: early layers detect edges, middle layers shapes, deep layers objects. Parameter sharing makes CNNs vastly more efficient than fully-connected nets on images.
3 questions — test your understanding
Q1. Parameter sharing in CNNs means:
Q2. MaxPooling is primarily used to:
Q3. Early layers in a deep CNN tend to detect:
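The sliding dot product in plain NumPy, with a hand-made vertical-edge kernel applied to a tiny synthetic image (real CNNs learn their kernels from data):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution (strictly, cross-correlation — as in most DL libraries)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same small kernel is reused at every location: parameter sharing
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Tiny synthetic image: left half dark, right half bright
image = np.zeros((5, 5))
image[:, 2:] = 1.0
# Hand-made kernel that responds to a left-to-right increase in brightness
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
edges = conv2d(image, kernel)   # fires near the vertical boundary, 0 elsewhere
```

The one 3x3 kernel here has 9 parameters regardless of image size; a fully-connected layer over the same 5x5 image would need a weight per pixel per neuron.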
Adam combines momentum with adaptive per-parameter learning rates — use it as your default. Dropout randomly zeros neurons during training, forcing redundant representations. BatchNorm normalises activations within a batch, dramatically stabilising deep network training.
3 questions — test your understanding
Q1. The Adam optimiser combines:
Q2. During inference (test time), dropout should be:
Q3. Poor weight initialisation can cause:
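The Adam update itself is only a few lines. A sketch minimising the toy objective f(w) = w^2, with the usual default hyperparameters apart from a learning rate chosen for this demo:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus a per-parameter adaptive scale (v)."""
    m = b1 * m + (1 - b1) * grad          # running mean of gradients (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimise f(w) = w**2 starting from w = 5 (toy objective for this sketch)
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
# w should end up near the minimum at 0
```

Dividing by sqrt(v_hat) is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.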
Take a model pre-trained on millions of images (ResNet, ViT) and fine-tune its final layers on your small dataset. Early layers learn universal features that transfer to almost any vision task. This approach gives state-of-the-art results with very little data.
3 questions — test your understanding
Q1. In feature extraction (transfer learning), the base model weights are:
Q2. Early layers in a pre-trained CNN detect:
Q3. When fine-tuning a pre-trained model, the learning rate should typically be:
RNNs process sequences by passing a hidden state from step to step — but gradients vanish over long sequences. LSTMs fix this with a cell state and gating mechanism. But RNNs must process tokens sequentially. Transformers parallelise over the full sequence at once — this was the key breakthrough that enabled scale.
3 questions — test your understanding
Q1. LSTMs were designed to solve:
Q2. GRUs compared to LSTMs have:
Q3. The key reason Transformers replaced RNNs is:
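The sequential bottleneck is visible in a vanilla RNN forward pass: step t cannot start until step t-1 has produced its hidden state. A sketch with random toy weights:

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Vanilla RNN: a hidden state carries information from step to step."""
    h = np.zeros(Wh.shape[0])
    for x in xs:                        # sequential: step t needs h from step t-1
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

rng = np.random.default_rng(0)
Wx = rng.normal(0, 0.5, (4, 3))   # input-to-hidden weights (random toy values)
Wh = rng.normal(0, 0.5, (4, 4))   # hidden-to-hidden weights
b  = np.zeros(4)
xs = rng.normal(0, 1, (10, 3))    # a sequence of 10 three-dimensional inputs

h = rnn_forward(xs, Wx, Wh, b)    # final hidden state summarises the sequence
```

Backprop through this loop multiplies by Wh (and tanh derivatives) once per step, which is why gradients vanish over long sequences; a Transformer has no such loop.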
Words can be represented as dense vectors where semantically similar words cluster together. The famous result: King - Man + Woman ≈ Queen. Modern LLMs use contextual embeddings — the same word gets a different vector depending on its context.
3 questions — test your understanding
Q1. Word embeddings represent words as:
Q2. Cosine similarity between word vectors measures:
Q3. Contextual embeddings (like BERT) differ from Word2Vec because:
Attention lets each token look at every other token and decide relevance. It computes Queries (Q), Keys (K), and Values (V): Q dot K produces relevance scores, softmax normalises them, then we take a weighted sum of V. This lets the model disambiguate words using long-range context.
3 questions — test your understanding
Q1. We scale the dot product by sqrt(d_k) to:
Q2. Multi-head attention runs attention:
Q3. Causal masking in GPT ensures:
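The Q/K/V computation in NumPy, with toy sizes (5 tokens, d_k = 8) and random matrices standing in for projected embeddings:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every token pair
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 tokens, d_k = 8 (toy sizes)
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so token i's output is a convex combination of all value vectors. Causal masking (Q3) would set `scores[i, j] = -inf` for j > i before the softmax.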
A Transformer block = Multi-Head Self-Attention + Feed-Forward Network + Residual Connections + LayerNorm. Stack 12 to 96 of these and you have GPT. The residual stream is the backbone — each block reads from it and adds its contribution back.
3 questions — test your understanding
Q1. Residual connections are primarily used to:
Q2. Positional encoding is necessary because:
Q3. The FFN in each transformer block:
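A single-head, bias-free sketch of one block (the pre-LN variant, with toy sizes); real implementations add multiple heads, learned LayerNorm scales, causal masking, and dropout:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One pre-LN transformer block: single head, no biases, for illustration."""
    # Self-attention sub-layer, added back to the residual stream
    h = layernorm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    att = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = x + att @ Wo                     # residual connection
    # Feed-forward sub-layer: expand, non-linearity, project back
    h = layernorm(x)
    x = x + np.maximum(0, h @ W1) @ W2   # residual connection
    return x

d, d_ff, T = 16, 64, 6                  # toy model width, FFN width, sequence length
rng = np.random.default_rng(0)
p = lambda *shape: rng.normal(0, 0.1, shape)   # random toy weights
x = p(T, d)                             # 6 token embeddings enter the stream
y = block(x, p(d, d), p(d, d), p(d, d), p(d, d), p(d, d_ff), p(d_ff, d))
```

Because each sub-layer only *adds* to `x`, input and output share the same shape — which is what lets you stack 12 to 96 of these blocks.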
The crown jewel of this curriculum. Karpathy builds a character-level GPT in roughly 200 lines of clean PyTorch. The Attention is All You Need paper becomes actual working code in front of your eyes. After this, transformer papers are just implementation details.
3 questions — test your understanding
Q1. GPT is trained with the objective of:
Q2. Setting temperature to 0 during generation produces:
Q3. In nanoGPT at character level, the vocabulary size is:
BPE starts with individual bytes and iteratively merges the most frequent pair until a target vocabulary size is reached. A surprising number of LLM quirks — difficulty counting letters, bad arithmetic, odd spelling — trace directly back to tokenisation artefacts.
3 questions — test your understanding
Q1. BPE works by:
Q2. GPT-4 uses approximately how many tokens in its vocabulary?
Q3. LLM context window length is measured in:
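The merge loop in a few lines of Python, trained on a tiny made-up corpus (real tokenisers start from raw bytes and run tens of thousands of merges):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [list(w) for w in words]   # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_corpus = []
        for w in corpus:                   # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

# Toy corpus invented for this sketch
merges, corpus = bpe_train(["low", "lower", "lowest"] * 3, num_merges=2)
# After two merges, "low" is a single token; "lower" is ["low", "e", "r"]
```

The model never sees characters, only these learned tokens — which is exactly why letter-counting and spelling trip LLMs up.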
BERT uses bidirectional attention — it sees the full context in both directions — excellent for understanding tasks. GPT uses causal (left-to-right) attention, excellent for generation. T5 and BART combine both in an encoder-decoder for tasks like translation and summarisation.
3 questions — test your understanding
Q1. BERT uses bidirectional attention, meaning:
Q2. BERT is pre-trained using:
Q3. Encoder-decoder models like T5 are best suited for:
Build a sentiment classifier using a pre-trained BERT model in 20 lines. The Trainer API abstracts away the training loop — it handles batching, logging, checkpointing, and evaluation automatically. HuggingFace Hub has 500k+ public models.
3 questions — test your understanding
Q1. HuggingFace AutoModel.from_pretrained() downloads:
Q2. The HuggingFace Trainer API handles:
Q3. For most NLP tasks, training from scratch is:
Karpathy's one-hour masterclass on LLMs from first principles. An LLM is compressed internet knowledge stored in billions of floating point weights. He covers the full training pipeline: pre-training, supervised fine-tuning, and RLHF for alignment.
3 questions — test your understanding
Q1. During LLM pre-training, the model is trained on:
Q2. Emergent abilities in LLMs refer to:
Q3. The transformation from base LLM to helpful assistant is achieved through:
RLHF transforms base LLMs into helpful assistants. Step 1: SFT on demonstrations. Step 2: train a Reward Model on human A/B preference rankings. Step 3: use PPO to optimise the LLM against the reward model. DPO is a newer, simpler alternative that skips the reward model entirely.
3 questions — test your understanding
Q1. The Reward Model in RLHF is trained on:
Q2. The KL divergence penalty in RLHF prevents:
Q3. DPO improves on RLHF by:
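The DPO objective is compact enough to compute directly. This simplified sketch treats each response's summed log-probability as a single number (the values below are invented); the loss rewards the policy for widening its chosen-vs-rejected margin relative to the reference model:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Simplified DPO loss on summed log-probs of chosen vs rejected responses."""
    # How much more the policy prefers chosen over rejected, vs the reference
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))   # -log sigmoid

# Invented log-probs: here the policy already widened the preference margin,
# so its loss is below the zero-margin baseline of log(2)
improved = dpo_loss(policy_chosen=-4.0, policy_rejected=-9.0,
                    ref_chosen=-5.0, ref_rejected=-8.0)
baseline = dpo_loss(-5.0, -8.0, -5.0, -8.0)   # policy identical to reference
```

No reward model, no PPO rollout: the preference pair supervises the policy directly, with the reference terms playing the role of RLHF's KL penalty.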
LLMs hallucinate because they can only draw on knowledge frozen into their weights at training time. RAG solves this: embed your documents, store them in a vector database, retrieve the most semantically relevant chunks at query time, and include them in the prompt. This is the dominant pattern in production AI apps today.
3 questions — test your understanding
Q1. RAG primarily addresses the problem of:
Q2. A vector database stores:
Q3. Semantic similarity at query time is measured using:
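The retrieval step is just nearest-neighbour search over embeddings. This sketch uses hand-made 3-dimensional vectors in place of a real embedding model and vector database:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Return the k chunks most cosine-similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = D @ q                     # cosine similarity with every chunk
    top = np.argsort(-sims)[:k]      # indices of the k best matches
    return [docs[i] for i in top]

# Stand-in embeddings; a real system would use an embedding model + vector DB
docs = ["chunk about pricing", "chunk about refunds", "chunk about shipping"]
doc_vecs = np.array([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.1, 0.0, 1.0]])
query_vec = np.array([0.1, 0.9, 0.0])   # a "refunds"-flavoured query in this toy space

context = retrieve(query_vec, doc_vecs, docs)
prompt = "Answer using this context:\n" + "\n".join(context)
```

The final prompt carries the retrieved chunks, so the model answers from evidence rather than from its frozen weights.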
Chain-of-Thought prompting ("think step by step") dramatically improves multi-step reasoning by forcing the model to externalise intermediate steps. Few-shot examples demonstrate exact output format. In-context learning: the model adapts to your examples with zero weight updates.
3 questions — test your understanding
Q1. Chain-of-Thought prompting most improves LLM performance on:
Q2. Few-shot prompting means:
Q3. For deterministic/factual outputs like code, set temperature to:
LoRA inserts tiny trainable adapter matrices alongside frozen base weights — training on the order of 0.1% of the parameters. QLoRA adds 4-bit quantisation so you can fine-tune a 7B model on a single consumer GPU. This is how the entire open-source LLM community does domain-specific fine-tuning.
3 questions — test your understanding
Q1. LoRA reduces trainable parameters by:
Q2. QLoRA enables fine-tuning large LLMs on one GPU by:
Q3. LoRA stands for:
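The arithmetic behind the savings is easy to verify. For a d x d weight, the adapter has only 2*d*r trainable parameters; the sizes below are toy values, and B is zero-initialised so training starts from the base model's exact behaviour:

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (toy values)
rng = np.random.default_rng(0)

W = rng.normal(0, 0.02, (d, d))      # frozen base weight
A = rng.normal(0, 0.02, (r, d))      # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialised
alpha = 16                           # scaling factor for the adapter path

def lora_forward(x):
    # Base path plus low-rank update: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

full    = W.size                     # frozen parameters: d*d = 1,048,576
adapter = A.size + B.size            # trainable parameters: 2*d*r = 16,384
x = np.ones(d)
y = lora_forward(x)                  # identical to W @ x while B is still zero
print(adapter / full)                # 0.015625, i.e. ~1.6% at these toy sizes
```

At deployment, B @ A can be merged into W, so LoRA adds zero inference cost.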
The MLP layers inside each transformer block function as a key-value associative memory store. The first linear layer detects "key" patterns (is this about Paris?), the second outputs the associated "value" (the Eiffel Tower is there). Mechanistic interpretability is the emerging science of reverse-engineering these circuits.
3 questions — test your understanding
Q1. The MLP layers in transformers are thought to function as:
Q2. Superposition in neural networks refers to:
Q3. Mechanistic interpretability research aims to:
LLM Agents use tools (web search, code execution, APIs) in a ReAct loop: Reason about what to do, Act by calling a tool, Observe the result, repeat. The Chinchilla scaling laws showed model size and training data should scale together — this changed how every major lab trains models.
3 questions — test your understanding
Q1. The ReAct pattern stands for:
Q2. The Chinchilla scaling law states that for optimal training:
Q3. Multimodal LLMs handle images by:
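The loop structure can be sketched with a hard-coded "policy" standing in for the LLM and a toy calculator tool; everything here (the function names, the fixed question) is invented for illustration:

```python
def calculator(expr):
    """Toy tool: evaluate an arithmetic expression (demo only; never eval untrusted input)."""
    return str(eval(expr, {"__builtins__": {}}))

def fake_llm(question, observations):
    # Stand-in for the model: Reason about the question, then either Act or finish
    if not observations:
        return ("act", "calculator", "37 * 24")      # decide a tool call is needed
    return ("finish", f"The answer is {observations[-1]}")

def react(question, max_steps=5):
    observations = []
    for _ in range(max_steps):                        # Reason -> Act -> Observe loop
        step = fake_llm(question, observations)
        if step[0] == "act":
            _, tool, arg = step
            observations.append(calculator(arg))      # Observe the tool result
        else:
            return step[1]

answer = react("What is 37 * 24?")
```

A real agent replaces `fake_llm` with an actual model call whose output is parsed into (thought, action, argument), and registers many tools, not one.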