CS231n: Convolutional Neural Networks for Visual Recognition

Fei-Fei Li, Andrej Karpathy, Justin Johnson · Stanford University · cs231n.stanford.edu

TL;DR

CS231n is Stanford's foundational course on deep learning for computer vision. Taught by Fei-Fei Li, Andrej Karpathy, and Justin Johnson, it covers everything from k-NN classifiers and loss functions through backprop, CNNs, batch normalization, and modern architectures (AlexNet → ResNet), all the way to detection, segmentation, and visualization. It is the single best place to build a rigorous, mathematical understanding of how vision models actually work.

1. Why CS231n?

Most machine learning courses teach models as black boxes. CS231n does the opposite: it forces you to derive every gradient by hand, implement every layer from scratch in NumPy, and understand why design choices like ReLU, batch norm, or skip connections exist. The assignments, especially the CNN implementation and the style transfer project, are among the most educational exercises in any public ML curriculum.

The course is particularly valuable for interview preparation. Questions about convolution output sizes, backprop through a ReLU, what batch norm does at test time, or why ResNet works: all of these appear frequently in ML engineering and research interviews at top AI labs.

Recommended path: Watch lecture videos (2017 or 2022 versions), read the notes on cs231n.github.io, and complete at least Assignment 1 (k-NN, SVM, softmax, two-layer net) and Assignment 2 (BatchNorm, Dropout, ConvNet). The notes are exceptionally well-written and can be read independently.

2. Image Classification Fundamentals

CS231n opens with the image classification problem: given a fixed set of categories (say, the 1000 ImageNet classes), assign the correct label to any input image. This is harder than it sounds: a cat can appear in any pose or lighting and may be partially occluded, yet each image is just a 3D array of numbers to the model.

k-Nearest Neighbors

The course starts with k-NN as a conceptual baseline. At test time, find the k training images closest in pixel space (L1 or L2 distance) to the query, and take a majority vote. k-NN requires no training (all computation is deferred to inference), but it scales terribly (O(N) per query), and raw pixel distances are a poor measure of semantic similarity.

Linear Classifiers

A linear classifier computes a score for each class as a dot product of the weight matrix W and the input vector x (plus a bias b). For an image with pixel values flattened into a D-dimensional vector and C classes:

Linear classifier scores
f(x_i, W, b) = W x_i + b \quad \in \mathbb{R}^C

W has shape C × D. Each row of W can be visualized as a 'template' for a class: the classifier learns one template per class and scores an image by how much it resembles each template. This is powerful but limited: one template per class cannot handle multimodal distributions (a car from the front looks nothing like a car from the side).
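
The shapes are worth seeing once in code. A minimal NumPy sketch (random weights and CIFAR-10-like dimensions are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

D, C = 3072, 10                          # e.g. a 32x32x3 image flattened; 10 classes
W = rng.standard_normal((C, D)) * 0.01   # one row ('template') per class, shape C x D
b = np.zeros(C)

x = rng.standard_normal(D)               # one flattened input image

scores = W @ x + b                       # shape (C,): one score per class
pred = int(np.argmax(scores))            # predicted class = highest-scoring template
print(scores.shape, pred)
```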

Loss Functions

Two loss functions dominate the early course: SVM (hinge) loss and Softmax (cross-entropy) loss.

The multiclass SVM loss for example i sums over all wrong classes j ≠ y_i, penalizing any wrong class whose score comes within the margin Δ of the correct class score (or exceeds it):

Multiclass SVM (hinge) loss
L_i = \sum_{j \neq y_i} \max(0,\, f_j - f_{y_i} + \Delta)

With Δ = 1, the loss is zero only when the correct class score exceeds every wrong class score by at least 1. The Softmax loss converts raw scores into probabilities via the softmax function, then takes the negative log probability of the correct class:

Softmax cross-entropy loss
L_i = -\log\!\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)
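
Both losses fit in a few lines of NumPy. A sketch for a single example (function names and the illustrative scores are ours; the max-shift in the softmax is the standard numerical-stability trick):

```python
import numpy as np

def svm_loss(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for one example.

    scores: (C,) raw class scores f; y: index of the correct class.
    A wrong class contributes loss whenever it is not below the
    correct score by at least the margin delta.
    """
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                      # the sum excludes j == y
    return margins.sum()

def softmax_loss(scores, y):
    """Softmax cross-entropy loss for one example.

    Shifting scores by their max before exponentiating does not change
    the probabilities but avoids overflow.
    """
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

scores = np.array([3.2, 5.1, -1.7])       # three classes; class 0 is correct
print(svm_loss(scores, y=0))              # ~2.9: only the second class violates the margin
print(softmax_loss(scores, y=0))
```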

3. Optimization: From SGD to Adam

Once we have a loss function, the goal is to find parameters W that minimize it. CS231n covers the full optimization toolkit used in modern deep learning.

Stochastic Gradient Descent (SGD)

Vanilla gradient descent computes the gradient of the loss over the entire dataset before each update, which is prohibitively expensive. SGD approximates this by using a small mini-batch of B examples (typically 32–256) to estimate the gradient:

SGD update
W \leftarrow W - \alpha \nabla_W L

where α is the learning rate. The mini-batch gradient is a noisy but unbiased estimate of the full-batch gradient. The noise actually helps: it allows the optimizer to escape shallow local minima and saddle points.

Momentum and Adaptive Methods

Plain SGD can oscillate in narrow ravines of the loss landscape. Momentum accumulates a velocity vector in the direction of persistent gradients, dampening oscillation:

SGD with momentum
v \leftarrow \mu v - \alpha \nabla_W L, \quad W \leftarrow W + v

Typical momentum coefficient: μ = 0.9. RMSprop adapts the learning rate per parameter by dividing by a running average of squared gradients, preventing large updates in directions that already have large gradient magnitudes. Adam combines momentum and RMSprop, and is the default optimizer for most deep learning work today:

Adam update rule
m \leftarrow \beta_1 m + (1-\beta_1)g, \quad v \leftarrow \beta_2 v + (1-\beta_2)g^2, \quad W \leftarrow W - \frac{\alpha\, \hat{m}}{\sqrt{\hat{v}}+\epsilon}

Here \hat{m} = m/(1-\beta_1^t) and \hat{v} = v/(1-\beta_2^t) are bias-corrected estimates at step t, compensating for the zero initialization of m and v.
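
A minimal NumPy sketch of one Adam step (the function signature and the toy quadratic objective are ours; hyperparameter defaults follow common practice):

```python
import numpy as np

def adam_step(W, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m, v are running moment estimates;
    t is the 1-indexed step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * g          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * g**2       # RMSprop-like second moment
    m_hat = m / (1 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.
W = np.array([5.0])
m, v = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 2001):
    g = 2 * W
    W, m, v = adam_step(W, g, m, v, t, alpha=0.05)
print(W)   # close to the minimum at 0
```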

4. Backpropagation from Scratch

Backpropagation is the algorithm that computes gradients efficiently by applying the chain rule through a computational graph. CS231n's treatment is exceptionally clear: every node in the graph has a forward pass (compute output) and a backward pass (multiply incoming gradient by local gradient).

Chain rule at a node
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}

The upstream gradient ∂L/∂y flows back from downstream nodes. The local gradient ∂y/∂x depends only on the current node's inputs and outputs. Their product gives the gradient flowing into the current node's inputs.

Gate Intuitions

CS231n gives memorable names to common gate patterns:

  • Add gate: distributes the gradient equally to both inputs (∂(x+y)/∂x = 1, ∂(x+y)/∂y = 1). Gradient flows through unchanged.
  • Multiply gate: swaps the inputs as gradients (∂(xy)/∂x = y, ∂(xy)/∂y = x). Each input's gradient is the other input times the upstream gradient.
  • Max gate: routes the gradient to the input that was largest (∂max(x,y)/∂x = 1 if x > y, else 0). Acts as a switch.
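
A tiny worked example of these gate rules, on f = (x + y) · z with arbitrary values:

```python
# Forward pass for f = (x + y) * z, then backward pass using the gate rules.
x, y, z = 3.0, -4.0, 2.0
q = x + y          # add gate:      q = -1.0
f = q * z          # multiply gate: f = -2.0

df = 1.0           # gradient of f with respect to itself
# Multiply gate: each input's gradient is the other input times upstream.
dq = z * df        # 2.0
dz = q * df        # -1.0
# Add gate: the gradient is distributed to both inputs unchanged.
dx = 1.0 * dq      # 2.0
dy = 1.0 * dq      # 2.0
print(dx, dy, dz)  # 2.0 2.0 -1.0
```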

5. Neural Networks: Layers and Activations

A fully connected (FC) layer applies a linear transformation followed by an elementwise nonlinearity. Stack multiple FC layers and you have a neural network. Without the nonlinearity, stacking layers is equivalent to a single linear transformation, so no expressive power is gained.

Activation Functions

| Activation | Formula | Key property / issue |
| --- | --- | --- |
| Sigmoid | σ(x) = 1/(1+e⁻ˣ) | Output ∈ (0, 1); saturates → vanishing gradient; not zero-centered |
| Tanh | tanh(x) | Output ∈ (−1, 1); zero-centered; still saturates at extremes |
| ReLU | max(0, x) | Fast to compute; no vanishing gradient for x > 0; dying ReLU problem |
| Leaky ReLU | max(αx, x), α ≈ 0.01 | Fixes dying ReLU by allowing a small gradient for x < 0 |

ReLU is the default choice for hidden layers in CNNs. Its gradient is either 0 (for negative inputs) or 1 (for positive inputs), which avoids the vanishing gradient problem that plagued sigmoid and tanh at depth. The 'dying ReLU' issue (neurons that always output zero because their input is always negative) can be mitigated with careful initialization, smaller learning rates, or Leaky ReLU.

6. Convolutional Layers

The key insight of CNNs is that natural images have spatial structure: pixels near each other are related, and the same patterns (edges, textures) appear at multiple locations. A convolutional layer exploits this by using shared, spatially local filters instead of fully connected weights.

The Convolution Operation

A filter (kernel) of size F × F slides across the input feature map with a given stride S. At each position, compute the dot product between the filter weights and the corresponding receptive field patch. The result is a single activation value. Using P pixels of zero-padding around the input keeps the spatial dimensions controlled.

Conv output size formula
W_{\text{out}} = \frac{W_{\text{in}} - F + 2P}{S} + 1

This formula applies to height and width independently (using each dimension's own F, P, and S). The number of output channels equals the number of filters used. Each filter produces one output feature map.
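
The formula is easy to wrap in a helper (the function name and the divisibility check are ours):

```python
def conv_output_size(w_in, f, p, s):
    """Output width of a conv layer: (W - F + 2P) / S + 1.
    Raises if the filter does not tile the padded input evenly."""
    num = w_in - f + 2 * p
    if num % s != 0:
        raise ValueError("filter does not fit cleanly; adjust padding or stride")
    return num // s + 1

# AlexNet's first layer: 227x227 input, 11x11 filters, stride 4, no padding
print(conv_output_size(227, f=11, p=0, s=4))   # 55

# 'Same' padding for a 3x3 filter at stride 1: P = 1 preserves spatial size
print(conv_output_size(32, f=3, p=1, s=1))     # 32
```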

Weight sharing is what makes CNNs powerful: the same filter weights are applied at every spatial position, enforcing translational equivariance. The network learns that a horizontal edge detector, for example, is useful everywhere in the image, not just at one fixed position.

7. Pooling and Architecture Design

Pooling layers reduce spatial dimensions, providing a form of translation invariance and controlling computational cost. The two standard operations are max pooling and average pooling.

Max pooling with a 2×2 window and stride 2 divides the feature map into 2×2 non-overlapping regions and keeps the maximum value from each. This halves width and height, reducing the feature map to 1/4 of its area. Max pooling is by far the most common; it retains the most prominent activation in each region, which correlates with detecting whether a feature is present.

Average pooling computes the mean of each region. It is used less frequently in feature extraction but appears at the end of many modern architectures as Global Average Pooling (GAP): reduce each feature map to a single number by averaging all spatial positions. GAP replaces large FC layers, dramatically reducing parameters and improving regularization.
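
Both operations reduce to a few lines of NumPy (the helper name and toy inputs are ours):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) feature map (H, W even)."""
    h, w = x.shape
    # Reshape into (H/2, 2, W/2, 2) blocks and take the max within each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 6, 9, 8],
              [7, 1, 2, 3]], dtype=float)
print(max_pool_2x2(x))
# [[4. 2.]
#  [7. 9.]]

# Global Average Pooling: one number per channel of a (C, H, W) volume.
feats = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
print(feats.mean(axis=(1, 2)))   # shape (2,): one value per feature map
```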

Modern trend: Many recent architectures replace pooling with strided convolutions for downsampling, giving the network learned downsampling rather than a fixed heuristic. The spatial size reduction is the same, but the network can optimize how it reduces resolution.

8. Batch Normalization

Batch normalization (Ioffe & Szegedy, 2015) is one of the most impactful techniques in deep learning. It normalizes the activations of each layer across the mini-batch, then applies a learnable affine transformation. This dramatically stabilizes training and allows much higher learning rates.

For a mini-batch B = {x_1, ..., x_m}, batch norm computes the batch mean and variance, normalizes each activation, then scales and shifts with learnable parameters γ and β:

Batch norm: normalize
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}
Batch norm: scale and shift
y_i = \gamma\,\hat{x}_i + \beta

ε is a small constant (e.g., 1e-5) for numerical stability. γ and β are parameters learned during training; they allow the network to undo the normalization if that is optimal. If γ = √(σ_B² + ε) and β = μ_B, the output exactly recovers the original activations.

Train vs. Test Behavior

During training, batch norm uses the statistics (μ_B, σ_B) of the current mini-batch. During inference, this is problematic: the batch might contain only one example, or statistics may vary. The solution: track running averages of mean and variance during training (with momentum), and use these fixed statistics at test time.
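
A minimal NumPy sketch of both modes in one forward function (the interface is ours, not the assignment's exact API):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """Batch norm over an (N, D) mini-batch: one mean/variance per feature.

    Training: normalize with batch statistics and update running averages.
    Inference: normalize with the stored running statistics instead.
    """
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(64, 3))   # off-center activations
gamma, beta = np.ones(3), np.zeros(3)
out, rm, rv = batchnorm_forward(x, gamma, beta, np.zeros(3), np.ones(3))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```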

9. Regularization Techniques

Regularization prevents overfitting by adding constraints or noise to the training process. CS231n covers several techniques that are ubiquitous in practice.

L1 and L2 Regularization

L2 regularization (weight decay) adds a penalty proportional to the squared magnitude of all weights to the loss: λ Σ w². This penalizes large weights, pushing them toward zero. L1 regularization adds λ Σ |w|, which tends to produce sparse weights (many exactly zero). In practice, L2 is used almost universally for neural networks; L1 is more common in sparse feature selection settings.

Dropout

Dropout (Srivastava et al., 2014) randomly sets each neuron's activation to zero with probability p (typically 0.5 for FC layers, 0.1–0.2 for conv layers) during training. This prevents neurons from co-adapting: no single neuron can rely on the presence of any other specific neuron, forcing the network to learn redundant representations.

At test time, all neurons are active. To compensate for the scale change (activations are now larger on average), multiply all activations by (1-p), or equivalently, scale the activations by 1/(1-p) during training (inverted dropout). The latter is more common in practice.
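
Inverted dropout is essentially a two-line mask. A sketch (the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p, train=True):
    """Inverted dropout: scale kept units by 1/(1-p) at train time,
    so test time is a plain identity pass."""
    if not train:
        return x                               # test time: no rescaling needed
    mask = (rng.random(x.shape) >= p) / (1 - p)  # 0 with prob p, else 1/(1-p)
    return x * mask

p = 0.5
x = np.ones(10000)
out = dropout_forward(x, p)
print(out.mean())   # ~1.0: the expected activation scale is preserved
```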

Data Augmentation

Data augmentation artificially expands the training set by applying label-preserving transformations: random horizontal flips, random crops, color jitter (brightness, contrast, saturation), cutout, mixup. These transformations make the model invariant to irrelevant variations in the input, effectively providing more diverse training examples without collecting new data.

10. CNN Architecture Evolution: AlexNet → ResNet

CS231n traces the rapid evolution of CNN architectures through the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Each major architecture introduced a key idea that influenced subsequent research.

| Architecture | Year | Top-5 error | Key innovation |
| --- | --- | --- | --- |
| AlexNet | 2012 | 15.3% | ReLU, dropout, GPU training, data augmentation |
| VGGNet | 2014 | 7.3% | Very deep (16–19 layers) with only 3×3 convolutions |
| GoogLeNet | 2014 | 6.7% | Inception module: parallel paths with 1×1/3×3/5×5 convs |
| ResNet | 2015 | 3.57% | Residual (skip) connections enabling 152-layer networks |

ResNet (He et al., 2015) solved the degradation problem: deeper plain networks were harder to train, not better. Residual (skip) connections let the network learn F(x) = H(x) − x instead of H(x) directly, making the identity function easy to approximate and allowing gradients to flow directly to early layers. See the dedicated ResNet page for a full derivation.
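
The skip connection itself is just an addition. A toy NumPy residual block (weights, shapes, and the two-layer residual branch are illustrative, not ResNet's exact layout) shows why the identity is easy to approximate: with a zeroed residual branch the block passes its input straight through:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """Toy residual block: y = ReLU(x + F(x)), where F is two small
    linear layers with a ReLU in between (biases and norm omitted)."""
    fx = relu(x @ W1) @ W2      # the residual branch F(x)
    return relu(x + fx)         # skip connection adds the identity path

D = 8
x = np.random.default_rng(0).standard_normal((4, D))

# With a zero-initialized residual branch, F(x) = 0 and the block
# reduces to a ReLU of the identity.
W1, W2 = np.zeros((D, D)), np.zeros((D, D))
out = residual_block(x, W1, W2)
print(np.allclose(out, relu(x)))   # True
```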

11. Transfer Learning

Training a CNN from scratch requires millions of labeled examples and days of GPU time. Transfer learning sidesteps this by starting from a model pretrained on a large dataset (usually ImageNet) and adapting it to a new task with far less data.

The intuition: early CNN layers learn general features (edges, textures, colors) that are useful across many visual tasks. Later layers become increasingly task-specific. We can reuse the general features and only retrain the task-specific parts.

Fine-tuning Strategies

CS231n recommends the following decision process based on dataset size and similarity:

| Scenario | Strategy |
| --- | --- |
| Small dataset, similar to pretrain domain | Linear probe only: freeze all conv layers, train only a new classifier head |
| Small dataset, different from pretrain domain | Train only the top few layers + classifier head; more risks overfitting |
| Large dataset, similar to pretrain domain | Fine-tune all or most layers with a small LR (10–100× smaller than training from scratch) |
| Large dataset, very different from pretrain domain | Fine-tune everything; pretrained weights still provide a better starting point than random init |

A critical implementation detail: when fine-tuning, use a much lower learning rate for the pretrained layers than for the newly added layers. Common practice is to use a 10× or 100× smaller LR for the backbone, preventing the pretrained features from being destroyed in the first few steps.

12. Beyond Classification: Detection & Segmentation

Classification assigns a single label to an entire image. Real-world applications require localizing objects (detection) or labeling every pixel (segmentation). CS231n covers the main approaches for both.

Object Detection: R-CNN Family

R-CNN (Girshick et al., 2014) was the first major deep learning approach to detection. It uses selective search to propose ~2000 candidate regions, warps each to a fixed size, runs a CNN to extract features, and classifies with an SVM. Fast R-CNN eliminated the per-proposal CNN forward pass by running the CNN once on the full image and then extracting RoI-pooled features for each proposal. Faster R-CNN replaced selective search with a Region Proposal Network (RPN) that shares CNN features with the detection head, making proposals nearly free.

YOLO (You Only Look Once) took a completely different approach: divide the image into a grid and predict bounding boxes and class probabilities directly from the full image in a single forward pass. It is much faster than the R-CNN family (real-time capable), at some cost to accuracy on small objects.

Semantic Segmentation: FCN

Fully Convolutional Networks (Long et al., 2015) adapted classification CNNs to output pixel-wise predictions. The key insight: replace the FC layers with equivalent convolutional layers, making the network fully convolutional and applicable to any input size. The spatial downsampling from pooling is reversed via learned upsampling (transposed convolutions or bilinear interpolation + convolution), producing a dense output map with per-pixel class predictions.

13. Visualization: Understanding What CNNs Learn

CS231n dedicates significant attention to visualizing and understanding CNN representations, a topic that is both scientifically interesting and practically useful for debugging.

  • Saliency maps: Compute the gradient of the class score with respect to input pixels. Large gradients indicate pixels the model is most sensitive to: a rough map of 'where the model is looking.'
  • Gradient ascent / activation maximization: Start from noise and optimize the input to maximize a specific neuron's activation. Reveals the 'ideal input' for that neuron, showing what patterns it detects.
  • DeepDream: Run gradient ascent on an image (rather than noise), amplifying patterns the network detects. Produces the psychedelic, feature-amplified images that went viral in 2015.
  • Neural style transfer: Separate and recombine content (high-level structure) and style (texture statistics) by optimizing an image to match a content image's feature activations and a style image's Gram matrices (correlations between feature channels). Gatys et al., 2015.
  • t-SNE of feature embeddings: Extract penultimate-layer features for many images and embed them in 2D with t-SNE. Reveals the geometric structure of the learned representation: images of the same class should cluster together.

14. Recommended Study Path

CS231n is a large course. Here is a prioritized path for different goals:

For ML interview preparation (4–6 weeks)

  1. Read the kNN, linear classification, and loss function notes. Do Assignment 1 (k-NN, SVM, softmax sections).
  2. Read the backpropagation notes thoroughly. Implement a two-layer net from scratch in NumPy.
  3. Read the CNN notes and the architecture overview. Do Assignment 2 (BatchNorm, Dropout, ConvNets).
  4. Read transfer learning and regularization notes. Review the detection overview lecture.

For CV research (full course)

Watch all lectures in order (2017 or 2022 versions, both available on YouTube). Complete all three assignments. Pay particular attention to the detection, segmentation, and visualization lectures, which are not fully covered in the notes and contain important practical wisdom.

Key resources