From linear regression to LLMs — curated for visual learners. StatQuest + 3Blue1Brown + Karpathy.
ML is the science of learning patterns from data instead of hard-coding rules. The bias-variance tradeoff is the most important concept in all of ML — high bias underfits (too simple), high variance overfits (memorises noise). Every model selection decision you make is managing this tradeoff.
3 questions — test your understanding
Q1. A model too simple to capture the underlying pattern is said to have:
Q2. Which dataset split should you NEVER look at until you have a final model?
Q3. In supervised learning, training examples must include:
Linear regression fits a line through data by minimising the sum of squared residuals. Gradient descent is the numerical engine — compute the slope of the loss, then take a small step downhill. Almost every ML algorithm is optimised via some variant of gradient descent.
3 questions — test your understanding
Q1. MSE stands for:
Q2. A learning rate that is too large will cause gradient descent to:
Q3. Weights are updated in the direction of:
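The loop described above fits in a few lines of NumPy. The data here is synthetic (a noisy line with slope 2 and intercept 1, values chosen for this sketch), and the learning rate is an arbitrary small constant:

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (true slope/intercept chosen for this sketch)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # learning rate: how big a step downhill

for _ in range(500):
    y_hat = w * x + b
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w   # step opposite the gradient
    b -= lr * grad_b

# w and b should now be close to the true slope 2 and intercept 1
```

Try a learning rate of 2.0 instead: the loss diverges, which is exactly the "too large" failure mode from Q2.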
Logistic regression predicts probabilities using the sigmoid function, which squashes any real value into the range 0-1. Accuracy alone misleads on imbalanced datasets. Use precision, recall, F1, and AUC-ROC for a complete picture.
3 questions — test your understanding
Q1. The sigmoid function outputs values in the range:
Q2. An AUC score of 0.5 means the model is:
Q3. Precision is defined as TP / (TP + ?):
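A quick numeric illustration: sigmoid at zero, plus precision, recall, and F1 computed from a made-up confusion matrix (the counts are invented for this sketch):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, the decision boundary

# Toy confusion-matrix counts (invented for illustration)
tp, fp, fn = 30, 10, 20
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall    = tp / (tp + fn)   # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```

Note how a classifier could score 90% accuracy on a 90/10 imbalanced dataset by always predicting the majority class, while its recall on the minority class would be 0.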
Decision trees split data to maximise information gain, reducing Gini impurity at each node. They are highly interpretable but overfit easily. Cross-validation gives a much more reliable generalisation estimate than a single train/val split.
3 questions — test your understanding
Q1. A Gini impurity of 0 means the node is:
Q2. A deep unpruned decision tree will tend to:
Q3. The main purpose of k-fold cross-validation is to:
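Gini impurity is simple enough to compute by hand: a pure node scores 0, and a 50/50 two-class node scores 0.5, the maximum for two classes. A small sketch:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two random draws from the node disagree."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure  = gini(["a"] * 10)             # 0.0 -> perfectly pure node
mixed = gini(["a"] * 5 + ["b"] * 5)  # 0.5 -> maximally impure for 2 classes
```

A split's information gain is the parent's impurity minus the weighted impurity of its children; the tree greedily picks the split that maximises it.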
Random Forests train many trees on bootstrapped samples with random feature subsets — averaging uncorrelated trees drops variance dramatically. XGBoost builds trees sequentially where each one corrects the previous tree's mistakes. These dominate tabular data competitions.
3 questions — test your understanding
Q1. Random Forests reduce variance primarily through:
Q2. Out-of-Bag error is computed using:
Q3. XGBoost is an example of which ensemble technique?
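The variance-reduction claim can be checked numerically. Here each "tree" is idealised as an independent noisy estimator; real bootstrapped trees are correlated, which limits the reduction (random feature subsets exist precisely to decorrelate them):

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 trials; each "tree" is an independent noisy estimate of the true value 0
single   = rng.normal(0, 1, size=(10_000,))                   # one tree per trial
ensemble = rng.normal(0, 1, size=(10_000, 50)).mean(axis=1)   # average of 50 trees

# Averaging 50 independent estimators divides the variance by 50
print(single.var(), ensemble.var())   # roughly 1.0 vs roughly 0.02
```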
SVMs find the maximum margin hyperplane — the widest gap between classes. The kernel trick implicitly maps data to higher dimensions for non-linear classification. L1 regularisation creates sparse solutions (feature selection); L2 shrinks all weights toward zero but rarely to exactly zero.
3 questions — test your understanding
Q1. The support vectors in SVM are:
Q2. L1 (Lasso) regularisation is special because it can:
Q3. The kernel trick in SVMs allows:
K-Means assigns each point to the nearest centroid, then recomputes each centroid as the mean of its assigned points — repeat until stable. PCA finds directions of maximum variance and projects data into fewer dimensions while retaining structure. These are the two most important unsupervised techniques.
3 questions — test your understanding
Q1. K-Means initialisation is improved by using:
Q2. PCA finds directions of:
Q3. The elbow method is used to find the optimal:
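The two-step loop (assign, then recompute) in NumPy, run on synthetic blobs. The random init here is the naive version; k-means++ (picking spread-out seeds) is the standard improvement:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Naive init: k random data points (k-means++ picks spread-out seeds instead)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Two well-separated synthetic blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Running this for k = 1..8 and plotting the within-cluster distances gives the elbow curve from Q3.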
A neural network is a massive composition of linear transformations and non-linearities. Each neuron computes a weighted sum then passes it through an activation function. The whole network is differentiable end-to-end, so gradient descent can optimise every weight simultaneously.
3 questions — test your understanding
Q1. The ReLU activation function is defined as:
Q2. The universal approximation theorem states:
Q3. Activation functions are necessary to:
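One layer of a network is just a matrix of weighted sums followed by an activation; the weights below are made up for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)   # ReLU: max(0, z), the non-linearity

# One layer: 2 neurons, each taking a 3-dimensional input (made-up weights)
x = np.array([1.0, -2.0, 0.5])          # input vector
W = np.array([[0.2, 0.4, -0.1],
              [0.7, -0.3, 0.5]])        # one row of weights per neuron
b = np.array([0.1, -0.2])               # one bias per neuron

h = relu(W @ x + b)   # each neuron: weighted sum + bias, then activation
```

Without the `relu`, stacking such layers would collapse into a single linear map — that is why activation functions are necessary (Q3).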
Backpropagation is just the chain rule of calculus applied to a computation graph. Forward pass computes predictions and loss; backward pass computes how much each weight contributed to the error. These gradients tell us exactly how to update every weight.
3 questions — test your understanding
Q1. Backpropagation is fundamentally based on:
Q2. Vanishing gradients most severely affect:
Q3. In PyTorch, calling loss.backward() will:
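The chain rule can be checked by hand on a one-parameter loss. For L = (wx - y)^2, the analytic gradient is 2(wx - y)*x, and a finite-difference slope should agree with it:

```python
# Chain rule: for L = (w*x - y)**2,
# dL/dw = 2*(w*x - y) * x   (outer derivative times inner derivative)
w, x, y = 3.0, 2.0, 1.0

analytic = 2 * (w * x - y) * x   # 2*(6-1)*2 = 20

# Numerical check: central finite differences approximate the same slope
eps = 1e-6
L = lambda w: (w * x - y) ** 2
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
```

Backprop does exactly this analytic computation, node by node, through the whole computation graph — without ever resorting to finite differences.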
Karpathy's masterpiece. You build a tiny autograd engine from scratch in pure Python — no libraries. Building it yourself makes backpropagation permanently click. Every line of code reveals how PyTorch works under the hood. Do not just watch — pause and type every line.
3 questions — test your understanding
Q1. Automatic differentiation (autograd) works by:
Q2. Topological sort of the computation graph ensures:
Q3. micrograd operates only on scalar values, meaning each operation is:
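The core of such an engine fits in a page. This sketch supports only + and * (micrograd itself covers more ops), but it shows the two key ideas: each operation records a local-derivative closure, and backward() replays them in reverse topological order:

```python
class Value:
    """Minimal scalar autograd node in the spirit of micrograd (add/mul only)."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad        # d(a+b)/da = 1
            other.grad += out.grad       # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort: a node's grad must be complete before its parents run
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a    # dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
```

Note the `+=` in the closures: a value used twice (like `a` here) accumulates gradient from both paths, which is exactly the multivariate chain rule.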
Convolution is the core operation: a small filter/kernel slides over the image computing dot products, detecting features wherever they appear. Deep CNNs build a feature hierarchy: early layers detect edges, middle layers shapes, deep layers objects. Parameter sharing makes CNNs vastly more efficient than fully-connected nets on images.
3 questions — test your understanding
Q1. Parameter sharing in CNNs means:
Q2. MaxPooling is primarily used to:
Q3. Early layers in a deep CNN tend to detect:
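The sliding dot product in plain NumPy, with a hand-made vertical-edge kernel applied to a tiny synthetic image (real CNNs learn their kernels from data):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution (strictly, cross-correlation — as in most DL libraries)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same small kernel is reused at every location: parameter sharing
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Tiny synthetic image: left half dark, right half bright
image = np.zeros((5, 5))
image[:, 2:] = 1.0
# Hand-made kernel that responds to a left-to-right increase in brightness
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
edges = conv2d(image, kernel)   # fires near the vertical boundary, 0 elsewhere
```

The one 3x3 kernel here has 9 parameters regardless of image size; a fully-connected layer over the same 5x5 image would need a weight per pixel per neuron.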
Adam combines momentum with adaptive per-parameter learning rates — use it as your default. Dropout randomly zeros neurons during training, forcing redundant representations. BatchNorm normalises activations within a batch, dramatically stabilising deep network training.
3 questions — test your understanding
Q1. The Adam optimiser combines:
Q2. During inference (test time), dropout should be:
Q3. Poor weight initialisation can cause:
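The Adam update itself is only a few lines. A sketch minimising the toy objective f(w) = w^2, with the usual default hyperparameters apart from a learning rate chosen for this demo:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus a per-parameter adaptive scale (v)."""
    m = b1 * m + (1 - b1) * grad          # running mean of gradients (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimise f(w) = w**2 starting from w = 5 (toy objective for this sketch)
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
# w should end up near the minimum at 0
```

Dividing by sqrt(v_hat) is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.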
Take a model pre-trained on millions of images (ResNet, ViT) and fine-tune its final layers on your small dataset. Early layers learn universal features that transfer to almost any vision task. This approach gives state-of-the-art results with very little data.
3 questions — test your understanding
Q1. In feature extraction (transfer learning), the base model weights are:
Q2. Early layers in a pre-trained CNN detect:
Q3. When fine-tuning a pre-trained model, the learning rate should typically be:
RNNs process sequences by passing a hidden state from step to step — but gradients vanish over long sequences. LSTMs fix this with a cell state and gating mechanism. But RNNs must process tokens sequentially. Transformers parallelise over the full sequence at once — this was the key breakthrough that enabled scale.
3 questions — test your understanding
Q1. LSTMs were designed to solve:
Q2. GRUs compared to LSTMs have:
Q3. The key reason Transformers replaced RNNs is:
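The sequential bottleneck is visible in a vanilla RNN forward pass: step t cannot start until step t-1 has produced its hidden state. A sketch with random toy weights:

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Vanilla RNN: a hidden state carries information from step to step."""
    h = np.zeros(Wh.shape[0])
    for x in xs:                        # sequential: step t needs h from step t-1
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

rng = np.random.default_rng(0)
Wx = rng.normal(0, 0.5, (4, 3))   # input-to-hidden weights (random toy values)
Wh = rng.normal(0, 0.5, (4, 4))   # hidden-to-hidden weights
b  = np.zeros(4)
xs = rng.normal(0, 1, (10, 3))    # a sequence of 10 three-dimensional inputs

h = rnn_forward(xs, Wx, Wh, b)    # final hidden state summarises the sequence
```

Backprop through this loop multiplies by Wh (and tanh derivatives) once per step, which is why gradients vanish over long sequences; a Transformer has no such loop.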
Words can be represented as dense vectors where semantically similar words cluster together. The famous result: King - Man + Woman ≈ Queen. Modern LLMs use contextual embeddings — the same word gets a different vector depending on its context.
3 questions — test your understanding
Q1. Word embeddings represent words as:
Q2. Cosine similarity between word vectors measures:
Q3. Contextual embeddings (like BERT) differ from Word2Vec because:
Attention lets each token look at every other token and decide relevance. It computes Queries (Q), Keys (K), and Values (V): Q dot K produces relevance scores, softmax normalises them, then we take a weighted sum of V. This lets the model disambiguate words using long-range context.
3 questions — test your understanding
Q1. We scale the dot product by sqrt(d_k) to:
Q2. Multi-head attention runs attention:
Q3. Causal masking in GPT ensures:
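The Q/K/V computation in NumPy, with toy sizes (5 tokens, d_k = 8) and random matrices standing in for projected embeddings:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every token pair
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 tokens, d_k = 8 (toy sizes)
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so token i's output is a convex combination of all value vectors. Causal masking (Q3) would set `scores[i, j] = -inf` for j > i before the softmax.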
A Transformer block = Multi-Head Self-Attention + Feed-Forward Network + Residual Connections + LayerNorm. Stack 12 to 96 of these and you have GPT. The residual stream is the backbone — each block reads from it and adds its contribution back.
3 questions — test your understanding
Q1. Residual connections are primarily used to:
Q2. Positional encoding is necessary because:
Q3. The FFN in each transformer block:
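A single-head, bias-free sketch of one block (the pre-LN variant, with toy sizes); real implementations add multiple heads, learned LayerNorm scales, causal masking, and dropout:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One pre-LN transformer block: single head, no biases, for illustration."""
    # Self-attention sub-layer, added back to the residual stream
    h = layernorm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    att = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = x + att @ Wo                     # residual connection
    # Feed-forward sub-layer: expand, non-linearity, project back
    h = layernorm(x)
    x = x + np.maximum(0, h @ W1) @ W2   # residual connection
    return x

d, d_ff, T = 16, 64, 6                  # toy model width, FFN width, sequence length
rng = np.random.default_rng(0)
p = lambda *shape: rng.normal(0, 0.1, shape)   # random toy weights
x = p(T, d)                             # 6 token embeddings enter the stream
y = block(x, p(d, d), p(d, d), p(d, d), p(d, d), p(d, d_ff), p(d_ff, d))
```

Because each sub-layer only *adds* to `x`, input and output share the same shape — which is what lets you stack 12 to 96 of these blocks.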
The crown jewel of this curriculum. Karpathy builds a character-level GPT in roughly 200 lines of clean PyTorch. The Attention is All You Need paper becomes actual working code in front of your eyes. After this, transformer papers are just implementation details.
3 questions — test your understanding
Q1. GPT is trained with the objective of:
Q2. Setting temperature to 0 during generation produces:
Q3. In nanoGPT at character level, the vocabulary size is:
BPE starts with individual bytes and iteratively merges the most frequent pair until a target vocabulary size is reached. A surprising number of LLM quirks — difficulty counting letters, bad arithmetic, odd spelling — trace directly back to tokenisation artefacts.
3 questions — test your understanding
Q1. BPE works by:
Q2. GPT-4 uses approximately how many tokens in its vocabulary?
Q3. LLM context window length is measured in:
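The merge loop in a few lines of Python, trained on a tiny made-up corpus (real tokenisers start from raw bytes and run tens of thousands of merges):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [list(w) for w in words]   # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_corpus = []
        for w in corpus:                   # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

# Toy corpus invented for this sketch
merges, corpus = bpe_train(["low", "lower", "lowest"] * 3, num_merges=2)
# After two merges, "low" is a single token; "lower" is ["low", "e", "r"]
```

The model never sees characters, only these learned tokens — which is exactly why letter-counting and spelling trip LLMs up.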
BERT uses bidirectional attention — it sees the full context in both directions — excellent for understanding tasks. GPT uses causal (left-to-right) attention, excellent for generation. T5 and BART combine both in an encoder-decoder for tasks like translation and summarisation.
3 questions — test your understanding
Q1. BERT uses bidirectional attention, meaning:
Q2. BERT is pre-trained using:
Q3. Encoder-decoder models like T5 are best suited for:
Build a sentiment classifier using a pre-trained BERT model in 20 lines. The Trainer API abstracts away the training loop — it handles batching, logging, checkpointing, and evaluation automatically. HuggingFace Hub has 500k+ public models.
3 questions — test your understanding
Q1. HuggingFace AutoModel.from_pretrained() downloads:
Q2. The HuggingFace Trainer API handles:
Q3. For most NLP tasks, training from scratch is:
Karpathy's one-hour masterclass on LLMs from first principles. An LLM is compressed internet knowledge stored in billions of floating point weights. He covers the full training pipeline: pre-training, supervised fine-tuning, and RLHF for alignment.
3 questions — test your understanding
Q1. During LLM pre-training, the model is trained on:
Q2. Emergent abilities in LLMs refer to:
Q3. The transformation from base LLM to helpful assistant is achieved through:
RLHF transforms base LLMs into helpful assistants. Step 1: SFT on demonstrations. Step 2: train a Reward Model on human A/B preference rankings. Step 3: use PPO to optimise the LLM against the reward model. DPO is a newer, simpler alternative that skips the reward model entirely.
3 questions — test your understanding
Q1. The Reward Model in RLHF is trained on:
Q2. The KL divergence penalty in RLHF prevents:
Q3. DPO improves on RLHF by:
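The DPO objective is compact enough to compute directly. This simplified sketch treats each response's summed log-probability as a single number (the values below are invented); the loss rewards the policy for widening its chosen-vs-rejected margin relative to the reference model:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Simplified DPO loss on summed log-probs of chosen vs rejected responses."""
    # How much more the policy prefers chosen over rejected, vs the reference
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))   # -log sigmoid

# Invented log-probs: here the policy already widened the preference margin,
# so its loss is below the zero-margin baseline of log(2)
improved = dpo_loss(policy_chosen=-4.0, policy_rejected=-9.0,
                    ref_chosen=-5.0, ref_rejected=-8.0)
baseline = dpo_loss(-5.0, -8.0, -5.0, -8.0)   # policy identical to reference
```

No reward model, no PPO rollout: the preference pair supervises the policy directly, with the reference terms playing the role of RLHF's KL penalty.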
LLMs hallucinate because they can only draw on knowledge frozen into their weights at training time. RAG solves this: embed your documents, store them in a vector database, retrieve the most semantically relevant chunks at query time, and include them in the prompt. This is the dominant pattern in production AI apps today.
3 questions — test your understanding
Q1. RAG primarily addresses the problem of:
Q2. A vector database stores:
Q3. Semantic similarity at query time is measured using:
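The retrieval step is just nearest-neighbour search over embeddings. This sketch uses hand-made 3-dimensional vectors in place of a real embedding model and vector database:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Return the k chunks most cosine-similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = D @ q                     # cosine similarity with every chunk
    top = np.argsort(-sims)[:k]      # indices of the k best matches
    return [docs[i] for i in top]

# Stand-in embeddings; a real system would use an embedding model + vector DB
docs = ["chunk about pricing", "chunk about refunds", "chunk about shipping"]
doc_vecs = np.array([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.1, 0.0, 1.0]])
query_vec = np.array([0.1, 0.9, 0.0])   # a "refunds"-flavoured query in this toy space

context = retrieve(query_vec, doc_vecs, docs)
prompt = "Answer using this context:\n" + "\n".join(context)
```

The final prompt carries the retrieved chunks, so the model answers from evidence rather than from its frozen weights.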
Chain-of-Thought prompting ("think step by step") dramatically improves multi-step reasoning by forcing the model to externalise intermediate steps. Few-shot examples demonstrate exact output format. In-context learning: the model adapts to your examples with zero weight updates.
3 questions — test your understanding
Q1. Chain-of-Thought prompting most improves LLM performance on:
Q2. Few-shot prompting means:
Q3. For deterministic/factual outputs like code, set temperature to:
LoRA inserts tiny trainable adapter matrices alongside frozen base weights — training on the order of 0.1% of the parameters. QLoRA adds 4-bit quantisation so you can fine-tune a 7B model on a single consumer GPU. This is how the entire open-source LLM community does domain-specific fine-tuning.
3 questions — test your understanding
Q1. LoRA reduces trainable parameters by:
Q2. QLoRA enables fine-tuning large LLMs on one GPU by:
Q3. LoRA stands for:
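The arithmetic behind the savings is easy to verify. For a d x d weight, the adapter has only 2*d*r trainable parameters; the sizes below are toy values, and B is zero-initialised so training starts from the base model's exact behaviour:

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (toy values)
rng = np.random.default_rng(0)

W = rng.normal(0, 0.02, (d, d))      # frozen base weight
A = rng.normal(0, 0.02, (r, d))      # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialised
alpha = 16                           # scaling factor for the adapter path

def lora_forward(x):
    # Base path plus low-rank update: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

full    = W.size                     # frozen parameters: d*d = 1,048,576
adapter = A.size + B.size            # trainable parameters: 2*d*r = 16,384
x = np.ones(d)
y = lora_forward(x)                  # identical to W @ x while B is still zero
print(adapter / full)                # 0.015625, i.e. ~1.6% at these toy sizes
```

At deployment, B @ A can be merged into W, so LoRA adds zero inference cost.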
The MLP layers inside each transformer block function as a key-value associative memory store. The first linear layer detects "key" patterns (is this about Paris?), the second outputs the associated "value" (the Eiffel Tower is there). Mechanistic interpretability is the emerging science of reverse-engineering these circuits.
3 questions — test your understanding
Q1. The MLP layers in transformers are thought to function as:
Q2. Superposition in neural networks refers to:
Q3. Mechanistic interpretability research aims to:
LLM Agents use tools (web search, code execution, APIs) in a ReAct loop: Reason about what to do, Act by calling a tool, Observe the result, repeat. The Chinchilla scaling laws showed model size and training data should scale together — this changed how every major lab trains models.
3 questions — test your understanding
Q1. The ReAct pattern stands for:
Q2. The Chinchilla scaling law states that for optimal training:
Q3. Multimodal LLMs handle images by:
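The loop structure can be sketched with a hard-coded "policy" standing in for the LLM and a toy calculator tool; everything here (the function names, the fixed question) is invented for illustration:

```python
def calculator(expr):
    """Toy tool: evaluate an arithmetic expression (demo only; never eval untrusted input)."""
    return str(eval(expr, {"__builtins__": {}}))

def fake_llm(question, observations):
    # Stand-in for the model: Reason about the question, then either Act or finish
    if not observations:
        return ("act", "calculator", "37 * 24")      # decide a tool call is needed
    return ("finish", f"The answer is {observations[-1]}")

def react(question, max_steps=5):
    observations = []
    for _ in range(max_steps):                        # Reason -> Act -> Observe loop
        step = fake_llm(question, observations)
        if step[0] == "act":
            _, tool, arg = step
            observations.append(calculator(arg))      # Observe the tool result
        else:
            return step[1]

answer = react("What is 37 * 24?")
```

A real agent replaces `fake_llm` with an actual model call whose output is parsed into (thought, action, argument), and registers many tools, not one.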