Ilya's Top 30 Papers

Ilya Sutskever's personal reading list — the papers that shaped modern deep learning. From Kolmogorov complexity to Transformers, residual networks to scaling laws.

Original list by Aman Chadha ↗
1

The First Law of Complexodynamics

Scott Aaronson

Full breakdown

Why does complexity in physical systems rise, peak, and fall — unlike entropy, which only grows? Introduces 'complextropy' as a bounded complexity measure.

Kolmogorov complexity · Sophistication · Entropy

2

The Unreasonable Effectiveness of Recurrent Neural Networks

Andrej Karpathy · 2015

Full breakdown

RNNs trained character-by-character on raw text can produce surprisingly coherent outputs — code, Shakespeare, math papers. A must-read for building intuition about sequence models.

RNN · Sequence Modeling · Character-level LM

3

Understanding LSTM Networks

Christopher Olah · 2015

Full breakdown

The clearest explanation of how LSTM gates (forget, input, output) enable long-term memory. Required reading before diving into Transformers.

LSTM · Gating Mechanisms · Long-term Dependencies

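The gate structure described above can be sketched in a few lines of numpy. This is a minimal toy, not code from the article: the weight shapes, the gate ordering in the stacked matrices, and all names are my assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, X), U: (4H, H), b: (4H,).
    Gate order assumed in the stacked weights: forget, input, output, candidate."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0*H:1*H])   # forget gate: what to erase from the cell
    i = sigmoid(z[1*H:2*H])   # input gate: what to write into the cell
    o = sigmoid(z[2*H:3*H])   # output gate: what to expose as output
    g = np.tanh(z[3*H:4*H])   # candidate cell contents
    c = f * c_prev + i * g    # cell state: gated long-term memory update
    h = o * np.tanh(c)        # hidden state: gated readout of the cell
    return h, c

rng = np.random.default_rng(0)
X, H = 3, 4
W, U, b = rng.normal(size=(4*H, X)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, W, U, b)
```

The cell state `c` is the additive memory path the gates protect; `h` is the gated view of it that downstream layers see.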
4

Recurrent Neural Network Regularization

Zaremba, Sutskever, Vinyals · 2014

Full breakdown

Dropout for LSTMs: apply it only on the non-recurrent connections. A simple fix that significantly improves generalization on language modeling tasks.

Dropout · LSTM · Regularization

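The "non-recurrent connections only" rule can be illustrated with a toy numpy layer. A hedged sketch under my own naming, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(x, p, train=True):
    """Inverted dropout: zero units with probability p, rescale survivors
    by 1/(1-p) so activations keep the same expected value."""
    if not train or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

def recurrent_layer_step(x, h_prev, W, U, p=0.5):
    """Per Zaremba et al., dropout goes only on the non-recurrent input,
    never on the h_{t-1} -> h_t path, so the recurrent memory is not
    corrupted anew at every time step."""
    x = dropout(x, p)                    # non-recurrent connection: dropout OK
    return np.tanh(W @ x + U @ h_prev)   # recurrent connection: left intact
```

Applying dropout to the recurrent path instead would inject fresh noise at every step and wash out long-range information, which is exactly what the paper avoids.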
5

Keeping Neural Networks Simple by Minimizing the Description Length of the Weights

Hinton, van Camp · 1993

Full breakdown

Apply Minimum Description Length (MDL) to networks: add Gaussian noise to weights to compress them. A 1993 precursor to modern weight regularization and Bayesian deep learning.

MDL Principle · Weight Compression · Bayesian NNs

6

Pointer Networks

Vinyals, Fortunato, Jaitly · 2015

Full breakdown

Attention mechanism that points to positions in the input instead of a fixed output vocabulary. Solves variable-output problems like convex hull and TSP.

Pointer Mechanism · Attention · Combinatorial Optimization

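The pointing mechanism is just additive attention whose softmax is used directly as the output distribution. A toy numpy sketch with invented weights (the paper's actual parameterization lives inside an LSTM encoder-decoder):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pointer_attention(dec_state, enc_states, W1, W2, v):
    """Score every input position against the decoder state, then
    softmax over positions. The softmax itself is the output: the
    'vocabulary' is the set of input indices, so it grows and shrinks
    with the input length."""
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state)
                       for e in enc_states])
    return softmax(scores)   # probability of pointing at each input index

rng = np.random.default_rng(2)
D = 4
enc = [rng.normal(size=D) for _ in range(6)]   # 6 encoded input elements
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
v = rng.normal(size=D)
p = pointer_attention(rng.normal(size=D), enc, W1, W2, v)
```

A conventional seq2seq decoder would instead project to a fixed vocabulary, which cannot express "output element #5 of this particular input".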
7

ImageNet Classification with Deep CNNs (AlexNet)

Krizhevsky, Sutskever, Hinton · 2012

Full breakdown

The paper that started the deep learning revolution. AlexNet used ReLU activations, GPU training, and dropout to win ImageNet 2012 by more than 10 percentage points of top-5 error over the runner-up.

CNN · ReLU · GPU Training · Dropout

8

Order Matters: Sequence to Sequence for Sets

Vinyals, Bengio, Kudlur · 2015

Full breakdown

The order you feed inputs into seq2seq models matters significantly — even for set-structured problems. Proposes methods for learning optimal input/output orderings.

Seq2Seq · Order Invariance · Attention

9

GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism

Huang et al. (Google) · 2018

Full breakdown

Split model layers across accelerators and pipeline micro-batches through them. Enables training billion-parameter models without any single accelerator having to hold the whole network.

Model Parallelism · Pipeline Parallelism · Scaling

10

Deep Residual Learning for Image Recognition (ResNet)

He, Zhang, Ren, Sun · 2015

Full breakdown

Skip connections solve the degradation problem in very deep networks. ResNet-152 won ImageNet 2015; residual blocks are now everywhere in deep learning.

Skip Connections · Residual Blocks · Vanishing Gradients

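The core idea fits in one line. A minimal sketch (the function `f` stands in for the block's conv-BN-ReLU stack, which I omit here):

```python
import numpy as np

def residual_block(x, f):
    """y = x + F(x): the block learns only the residual F.
    If the optimal mapping is near the identity, F can be driven
    toward zero, which is far easier for SGD than squeezing an
    identity map through stacked nonlinear layers."""
    return x + f(x)

# With F == 0 the block is exactly the identity, so stacking many
# blocks cannot make the representation worse -- the degradation
# problem the paper diagnoses in plain deep nets.
x = np.arange(4.0)
y = residual_block(x, lambda t: np.zeros_like(t))
```

The skip connection also gives gradients an unobstructed additive path back through the network, which is why very deep ResNets remain trainable.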
11

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, Koltun · 2015

Full breakdown

Dilated convolutions expand the receptive field exponentially without losing resolution. Key for semantic segmentation tasks.

Dilated Convolutions · Semantic Segmentation · Receptive Field

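A 1D numpy sketch of the mechanism (my own simplified implementation with symmetric zero padding, not the paper's 2D segmentation network):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1D convolution that reads inputs `dilation` steps
    apart. Output length equals input length, so no resolution is lost."""
    k = len(w)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

x = np.random.default_rng(3).normal(size=16)
y = x
# Stacking kernel-3 layers with dilations 1, 2, 4 grows the receptive
# field to 15 samples while every layer keeps full resolution; plain
# stride-1 convolutions would need 7 layers for the same coverage.
for d in (1, 2, 4):
    y = dilated_conv1d(y, np.array([0.25, 0.5, 0.25]), d)
```

Doubling the dilation at each layer is what makes the receptive field grow exponentially in depth rather than linearly.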
12

Neural Message Passing for Quantum Chemistry

Gilmer et al. (Google) · 2017

Full breakdown

Unifies GNN variants under a single Message Passing Neural Network framework for predicting molecular properties from graphs.

Graph Neural Networks · Message Passing · Molecular Properties

13

Attention Is All You Need

Vaswani et al. (Google Brain) · 2017

Full breakdown

Introduced the Transformer — replacing RNNs entirely with self-attention. The foundation of every modern LLM.

Self-Attention · Multi-Head Attention · Positional Encoding · Encoder-Decoder

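The Transformer's core operation is scaled dot-product attention, which the paper defines as softmax(QK^T / sqrt(d_k)) V. A single-head numpy sketch with random inputs:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: every position attends to every
    other position in one step, with no recurrence. The 1/sqrt(d_k)
    scaling keeps the logits from saturating the softmax as d_k grows."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(4)
T, d = 5, 8                         # 5 positions, dimension 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
```

In the full model, Q, K, and V are learned projections of the token embeddings, multiple heads run this in parallel, and positional encodings supply the order information the attention itself ignores.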
14

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, Bengio · 2014

Full breakdown

The original attention paper. The decoder learns to focus on relevant source words dynamically — a precursor to the Transformer's self-attention.

Attention Mechanism · Alignment · NMT

15

Identity Mappings in Deep Residual Networks

He, Zhang, Ren, Sun · 2016

Full breakdown

ResNet v2: move BN and ReLU before the convolution for clean identity mappings. Enables training a 1001-layer ResNet.

Identity Mappings · Residual Networks · Signal Propagation

16

A Simple Neural Network Module for Relational Reasoning

Santoro et al. (DeepMind) · 2017

Full breakdown

Relation Networks: a small module that computes all pairwise object relations. State-of-the-art on visual QA with a strikingly simple design.

Relational Reasoning · Pairwise Relations · Visual QA

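The module's form is RN(O) = f( sum over pairs of g(o_i, o_j) ). A toy numpy sketch where lambdas stand in for the paper's MLPs g and f:

```python
import numpy as np

def relation_network(objects, g, f):
    """Apply the same small network g to every ordered pair of objects,
    sum the results, then post-process with f. Summing over all pairs
    makes the module permutation-invariant in the object set."""
    pair_sum = sum(g(oi, oj) for oi in objects for oj in objects)
    return f(pair_sum)

rng = np.random.default_rng(5)
objs = [rng.normal(size=4) for _ in range(3)]
g = lambda a, b: np.concatenate([a, b])   # toy stand-in for an MLP
f = lambda s: s.sum()                     # toy stand-in for the readout MLP
score = relation_network(objs, g, f)
perm = [objs[2], objs[0], objs[1]]        # same objects, different order
```

Because every pair goes through the same g, the parameter count stays small even as the number of objects grows; the cost is the quadratic number of pairs.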
17

Variational Lossy Autoencoder

Chen, Kingma, Salimans et al. · 2017

Full breakdown

Combines VAEs with autoregressive models: use an autoregressive decoder to capture local details and let the VAE latent capture global structure.

VAE · Variational Inference · Autoregressive Models

18

Relational Recurrent Neural Networks

Santoro et al. (DeepMind) · 2018

Full breakdown

Relational Memory Core: uses multi-head attention for memory-to-memory interactions. Improves performance on tasks that require tracking relations over time.

Memory Augmented NNs · Multi-Head Attention · Relational Reasoning

19

The Coffee Automaton: Coarse-graining, Symmetry Breaking, and Possible Futures

Aaronson, Carroll, Ouellette · 2014

Full breakdown

Cellular automata model of complexity: shows complexity peaks at intermediate times using coarse-grained Kolmogorov complexity. Companion to paper #1.

Coarse-graining · Complexity Dynamics · Kolmogorov Complexity

20

Neural Turing Machines

Graves, Wayne, Danihelka (DeepMind) · 2014

Full breakdown

Neural network + differentiable external memory = can learn algorithms (copy, sort, associative recall). Uses both content-based and location-based addressing.

External Memory · Differentiable Programming · Turing Completeness

21

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Amodei et al. (Baidu Research) · 2015

Full breakdown

End-to-end deep learning for ASR that approaches human performance on several benchmarks. Works across English and Mandarin, different accents, and noisy environments.

ASR · CTC · Batch Normalization · End-to-End

22

Scaling Laws for Neural Language Models

Kaplan et al. (OpenAI) · 2020

Full breakdown

Loss follows power laws in N (params), D (data), and C (compute). Optimal allocation: scale N and D together. Directly led to GPT-3 and the LLM scaling era.

Scaling Laws · Power Laws · Compute Optimal · LLM

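The parameter-limited law has the form L(N) = (N_c / N)^alpha_N. The constants below are the fits Kaplan et al. report for their setup; treat them as illustrative of the shape, not as universal values:

```python
import numpy as np

def loss_vs_params(N, N_c=8.8e13, alpha_N=0.076):
    """Parameter-limited scaling law L(N) = (N_c / N)^alpha_N from
    Kaplan et al.; N_c and alpha_N are their reported fits."""
    return (N_c / N) ** alpha_N

# A power law is a straight line on a log-log plot: every 10x increase
# in parameters multiplies the loss by the same factor, 10**(-alpha_N).
l1 = loss_vs_params(1e8)    # 100M params
l2 = loss_vs_params(1e9)    # 1B params
l3 = loss_vs_params(1e10)   # 10B params
```

The constant multiplicative improvement per decade of scale is what made extrapolating to GPT-3-sized models a calculated bet rather than a leap of faith.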
23

A Tutorial Introduction to the Minimum Description Length Principle

Peter Grünwald · 2004

Full breakdown

MDL principle: the best model is the one that compresses the data most. A bridge between Kolmogorov complexity and practical statistics / model selection.

MDL · Model Selection · Data Compression · Kolmogorov Complexity

24

Machine Super Intelligence (Dissertation)

Shane Legg (DeepMind) · 2008

Full breakdown

Theoretical foundations of machine superintelligence: formal definition of intelligence, pathways to superintelligence, early AI safety framing.

AGI · Intelligence Measures · AI Safety · Recursive Self-improvement

25

Kolmogorov Complexity and Algorithmic Randomness

Shen, Uspensky, Vereshchagin

Full breakdown

Comprehensive technical textbook on Kolmogorov complexity: incompressibility, algorithmic randomness, mutual information. The math behind intelligence measures.

Kolmogorov Complexity · Algorithmic Randomness · Information Theory

26

CS231n: CNNs for Visual Recognition (Stanford)

Fei-Fei Li et al.

Full breakdown

The gold standard CNN course — backpropagation, convolutions, batch norm, transfer learning. Still the best technical intro to deep learning for vision.

CNN · Backpropagation · Batch Normalization · Transfer Learning

27

Better & Faster LLMs via Multi-Token Prediction

Gloeckle, Idrissi, Rozière et al. (Meta) · 2024

Full breakdown

Instead of predicting one next token, predict the next k tokens in parallel with k independent heads. Faster inference + better code/reasoning performance.

Multi-Token Prediction · Parallel Decoding · Inference Speed

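The "k independent heads on a shared trunk" idea can be sketched with numpy. This is a toy with linear heads; in the paper each head is a transformer layer, and all shapes and names here are my assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_heads(hidden, heads, unembed):
    """One shared trunk state, k independent output heads: head j
    predicts the token at offset j+1. All k distributions come from
    a single forward pass, which is what enables speculative-style
    parallel decoding at inference time."""
    return [softmax(unembed @ (Wh @ hidden)) for Wh in heads]

rng = np.random.default_rng(6)
d, vocab, k = 8, 20, 4
hidden = rng.normal(size=d)                        # shared trunk output
heads = [rng.normal(size=(d, d)) for _ in range(k)]  # one matrix per head
unembed = rng.normal(size=(vocab, d))              # shared unembedding
dists = multi_token_heads(hidden, heads, unembed)
```

At training time each head gets its own next-k-token loss; at inference the extra heads can either be dropped (standard decoding) or used to draft several tokens at once.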
28

Dense Passage Retrieval for Open-Domain QA

Karpukhin et al. (Meta) · 2020

Full breakdown

Dual-encoder BERT for dense retrieval dramatically outperforms sparse BM25. The foundation of modern RAG systems.

Dense Retrieval · Dual Encoder · Open-Domain QA · Embeddings

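Once questions and passages are embedded, retrieval is just a nearest-neighbor search by inner product. A toy numpy sketch where unit-normalized random vectors stand in for the two BERT encoders:

```python
import numpy as np

def retrieve(query_vec, passage_vecs, top_k=2):
    """Dense retrieval: score every passage by inner product with the
    query embedding and return the indices of the top-k scores. In DPR
    the two vectors come from two separately trained BERT encoders."""
    scores = passage_vecs @ query_vec
    return np.argsort(-scores)[:top_k]

rng = np.random.default_rng(7)
passages = rng.normal(size=(10, 16))                          # 10 toy passages
passages /= np.linalg.norm(passages, axis=1, keepdims=True)   # unit vectors
query = passages[3]            # a query embedded right on top of passage 3
top = retrieve(query, passages)
```

At scale, the brute-force matrix product is replaced by an approximate nearest-neighbor index (DPR used FAISS), but the scoring function is the same.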
29

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)

Lewis et al. (Meta AI) · 2020

Full breakdown

Combine a pre-trained seq2seq model with a dense retriever over Wikipedia. Factual, updateable knowledge without retraining the model.

RAG · Knowledge Retrieval · Parametric Memory · Seq2Seq

30

Zephyr: Direct Distillation of LM Alignment

Tunstall et al. (HuggingFace) · 2023

Full breakdown

Distill alignment from a larger teacher LLM to a smaller student using dSFT + dDPO — no PPO, no reward model. Zephyr-7B beats larger RLHF models.

Knowledge Distillation · Alignment · DPO · Instruction Tuning

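The DPO objective at the heart of Zephyr's dDPO stage can be written down directly. A per-pair numpy sketch with made-up log-probabilities (the real loss averages this over a dataset of teacher-ranked preference pairs):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid( beta * [(logpi_c - ref_c) - (logpi_r - ref_r)] ).
    No reward model and no PPO rollouts: the policy's log-ratios
    against a frozen reference model act as an implicit reward."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# The loss falls as the policy favours the chosen answer more than the
# reference does, and rises when it favours the rejected one.
better = dpo_loss(-5.0, -9.0, -6.0, -6.0)   # policy prefers chosen
worse  = dpo_loss(-9.0, -5.0, -6.0, -6.0)   # policy prefers rejected
```

Because the gradient only needs log-probabilities from the policy and a frozen reference, training reduces to a classification-style loop, which is what makes the distilled-alignment recipe so cheap compared with RLHF.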

Cards marked 'Full breakdown' link to interactive deep-dives. Others link to the original paper.