Ilya's Top 30 Papers
Ilya Sutskever's personal reading list: the papers that shaped modern deep learning, from Kolmogorov complexity to Transformers, residual networks to scaling laws.
Original list by Aman Chadha
The First Law of Complexodynamics
Scott Aaronson
Why does complexity in physical systems rise, peak, and fall, while entropy only grows? Introduces 'complextropy' as a bounded complexity measure.
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy · 2015
RNNs trained character-by-character on raw text can produce surprisingly coherent outputs: code, Shakespeare, math papers. A must-read for building intuition about sequence models.
Understanding LSTM Networks
Christopher Olah · 2015
The clearest explanation of how LSTM gates (forget, input, output) enable long-term memory. Required reading before diving into Transformers.
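The gate equations can be sketched at scalar scale. A minimal sketch, assuming invented toy weights rather than learned parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # Each gate sees the current input x and the previous hidden state h_prev;
    # w maps gate name -> (input weight, recurrent weight, bias).
    f = sigmoid(w['f'][0] * x + w['f'][1] * h_prev + w['f'][2])    # forget gate
    i = sigmoid(w['i'][0] * x + w['i'][1] * h_prev + w['i'][2])    # input gate
    g = math.tanh(w['g'][0] * x + w['g'][1] * h_prev + w['g'][2])  # candidate memory
    o = sigmoid(w['o'][0] * x + w['o'][1] * h_prev + w['o'][2])    # output gate
    c = f * c_prev + i * g   # cell state: keep old memory, blend in new
    h = o * math.tanh(c)     # hidden state: gated readout of the cell
    return h, c

# Toy weights that saturate the forget gate open and the input gate shut,
# so the cell state survives many steps almost unchanged.
w = {'f': (0.0, 0.0, 10.0), 'i': (0.0, 0.0, -10.0),
     'g': (1.0, 0.0, 0.0),  'o': (0.0, 0.0, 10.0)}
h, c = 0.0, 5.0
for x in [0.1, -0.2, 0.3]:
    h, c = lstm_step(x, h, c, w)
```

With the forget gate pinned near 1 and the input gate near 0, `c` stays at roughly 5.0 across steps, which is exactly the long-term-memory behavior the gates exist to provide.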
Recurrent Neural Network Regularization
Zaremba, Sutskever, Vinyals · 2014
Dropout for LSTMs: apply it only on non-recurrent connections. Simple fix that significantly improves generalization on language modeling tasks.
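The placement rule can be sketched as follows; the `rnn_step` body is a hypothetical stand-in for the real LSTM arithmetic, but the dropout placement matches the paper's prescription:

```python
import random

def dropout(vec, p, rng):
    # Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p).
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in vec]

def rnn_step(x, h_prev, p_drop, rng):
    # Zaremba et al.'s rule: drop only the non-recurrent (layer-to-layer) input;
    # the recurrent state h_prev passes through untouched, so the network's
    # memory is never corrupted by noise.
    x = dropout(x, p_drop, rng)                    # non-recurrent: dropped
    return [xi + hi for xi, hi in zip(x, h_prev)]  # stand-in for the real update

rng = random.Random(0)
h = rnn_step([1.0] * 4, [2.0] * 4, 0.5, rng)
```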
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Hinton, van Camp · 1993
Apply Minimum Description Length (MDL) to networks: add Gaussian noise to weights to compress them. A 1993 precursor to modern weight regularization and Bayesian deep learning.
Pointer Networks
Vinyals, Fortunato, Jaitly · 2015
Attention mechanism that points to positions in the input instead of a fixed output vocabulary. Solves variable-output problems like convex hull and TSP.
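A minimal sketch of the pointing step, using a dot-product score where the paper uses an additive one; the query and encoder vectors are made-up 2-d values:

```python
import math

def pointer(query, encoder_states):
    # Score each encoder position against the decoder query, then softmax
    # over *positions*: the "vocabulary" is the input sequence itself.
    scores = [sum(q * e for q, e in zip(query, enc)) for enc in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs.index(max(probs)), probs

# Three encoded input positions; the pointer picks the best-matching one.
idx, probs = pointer([1.0, 0.0], [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]])
```

Because the output distribution is over input positions, the same network handles inputs of any length, which is what fixed-vocabulary seq2seq cannot do.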
ImageNet Classification with Deep CNNs (AlexNet)
Krizhevsky, Sutskever, Hinton · 2012
The paper that started the deep learning revolution. AlexNet combined ReLU activations, GPU training, and dropout to win ImageNet 2012 with a top-5 error more than 10 points below the runner-up.
Order Matters: Sequence to Sequence for Sets
Vinyals, Bengio, Kudlur · 2015
The order you feed inputs into seq2seq models matters significantly, even for set-structured problems. Proposes methods for learning optimal input/output orderings.
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Huang et al. (Google) · 2018
Split model layers across accelerators and pipeline micro-batches through them. Enables training models far too large for any single accelerator's memory.
Deep Residual Learning for Image Recognition (ResNet)
He, Zhang, Ren, Sun · 2015
Skip connections solve the degradation problem in very deep networks. ResNet-152 wins ImageNet 2015; residual blocks are now everywhere in deep learning.
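The residual idea fits in a few lines, a sketch over plain Python lists with a hypothetical residual function `f`:

```python
def residual_block(x, f):
    # y = x + F(x): the identity path gives gradients a direct route through
    # the block, so extra depth cannot hurt if F learns to stay near zero.
    return [xi + fi for xi, fi in zip(x, f(x))]

# If F collapses to zero, the block is exactly the identity, which is why a
# deep stack of residual blocks can always emulate a shallower one.
zero_f = lambda x: [0.0] * len(x)
y = residual_block([1.0, 2.0, 3.0], zero_f)
```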
Multi-Scale Context Aggregation by Dilated Convolutions
Yu, Koltun · 2015
Dilated convolutions expand the receptive field exponentially without losing resolution. Key for semantic segmentation tasks.
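The exponential growth claim is easy to verify with the standard receptive-field recurrence, sketched here for a stack of 3-wide convolutions:

```python
def receptive_field(kernel_size, dilations):
    # Each layer extends the receptive field by (kernel_size - 1) * dilation.
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Doubling the dilation per layer (1, 2, 4, ...) grows the field exponentially
# with depth; the same depth at dilation 1 grows it only linearly.
rf_dilated = receptive_field(3, [1, 2, 4, 8, 16])  # -> 63
rf_plain   = receptive_field(3, [1, 1, 1, 1, 1])   # -> 11
```

Same five layers, same parameter count, nearly 6x the context, and no pooling, so the feature map keeps full resolution for dense prediction.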
Neural Message Passing for Quantum Chemistry
Gilmer et al. (Google) · 2017
Unifies GNN variants under a single Message Passing Neural Network framework for predicting molecular properties from graphs.
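One round of the framework can be sketched with sum aggregation and an additive update; real MPNNs replace both with learned functions, and the graph below is a made-up toy:

```python
def message_passing_round(features, edges):
    # One MPNN round: each node sums messages from its neighbors (here just
    # the raw neighbor feature) and adds the aggregate to its own state.
    msgs = {v: 0.0 for v in features}
    for u, v in edges:          # undirected edges send messages both ways
        msgs[v] += features[u]
        msgs[u] += features[v]
    return {v: features[v] + msgs[v] for v in features}

# A path graph a-b-c standing in for a tiny molecule.
out = message_passing_round({'a': 1.0, 'b': 2.0, 'c': 4.0},
                            [('a', 'b'), ('b', 'c')])
```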
Attention Is All You Need
Vaswani et al. (Google Brain) · 2017
Introduced the Transformer, replacing RNNs entirely with self-attention. The foundation of every modern LLM.
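The core operation, scaled dot-product attention, fits in a short sketch over plain Python lists (no batching, no masking, no learned projections):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d)) V, computed row by row.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs; the matching key gets more weight.
out = attention([[1.0, 0.0]],               # query
                [[1.0, 0.0], [0.0, 1.0]],   # keys
                [[10.0, 0.0], [0.0, 10.0]]) # values
```

The output is a convex combination of the value rows, weighted by query-key similarity; stacking this with learned Q/K/V projections and multiple heads gives the full Transformer layer.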
Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Cho, Bengio · 2014
The original attention paper. The decoder learns to focus on relevant source words dynamically, a precursor to the Transformer's self-attention.
Identity Mappings in Deep Residual Networks
He, Zhang, Ren, Sun · 2016
ResNet v2: move BN and ReLU before the convolution for clean identity mappings. Enables training ResNet-1001.
A Simple Neural Network Module for Relational Reasoning
Santoro et al. (DeepMind) · 2017
Relation Networks: a small module that computes all pairwise object relations. State-of-the-art on visual QA with a strikingly simple design.
Variational Lossy Autoencoder
Chen, Kingma, Salimans et al. · 2017
Combines VAEs with autoregressive models: use autoregressive decoder to capture local details, let the VAE latent capture global structure.
Relational Recurrent Neural Networks
Santoro et al. (DeepMind) · 2018
Relational Memory Core: uses multi-head attention for memory-to-memory interactions. Improves on tasks requiring tracking relations over time.
The Coffee Automaton: Coarse-graining, Symmetry Breaking, and Possible Futures
Aaronson, Carroll, Ouellette · 2014
Cellular automata model of complexity: shows complexity peaks at intermediate times using coarse-grained Kolmogorov complexity. Companion to paper #1.
Neural Turing Machines
Graves, Wayne, Danihelka (DeepMind) · 2014
A neural network coupled to differentiable external memory can learn simple algorithms (copy, sort, associative recall), using both content- and location-based addressing.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Amodei et al. (Baidu Research) · 2015
End-to-end deep learning for ASR that approaches human-level transcription on several benchmarks. Works across English and Mandarin, varied accents, and noisy environments.
Scaling Laws for Neural Language Models
Kaplan et al. (OpenAI) · 2020
Loss follows power laws in N (parameters), D (data), and C (compute). For compute-optimal training, model size should grow faster than dataset size. Directly led to GPT-3 and the LLM scaling era.
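The parameter-limited power law can be sketched numerically; the constants below are the approximate fitted values reported in the paper:

```python
def loss_from_params(n, n_c=8.8e13, alpha=0.076):
    # Kaplan et al.'s parameter-limited form: L(N) = (N_c / N) ** alpha_N,
    # with approximate fitted constants from the paper.
    return (n_c / n) ** alpha

# A power law means every 10x in parameters shrinks loss by the same
# constant factor, 10 ** -0.076 (roughly a 16% reduction).
l_small = loss_from_params(1e8)
l_big = loss_from_params(1e9)
ratio = l_big / l_small
```

That constant-ratio-per-decade behavior is what makes the curves straight lines on log-log plots, and what made extrapolating to GPT-3 scale credible.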
A Tutorial Introduction to the Minimum Description Length Principle
Peter Grünwald · 2004
MDL principle: the best model is the one that compresses the data most. A bridge between Kolmogorov complexity and practical statistics / model selection.
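The two-part code at the heart of MDL can be sketched on coin flips; the 8-bit charge for the biased model's parameter is an arbitrary illustrative choice:

```python
import math

def two_part_code_length(model_bits, data, predict):
    # MDL two-part code: L(model) + L(data | model), where the data cost is
    # the negative log2-likelihood the model assigns to each observation.
    data_bits = -sum(math.log2(predict(x)) for x in data)
    return model_bits + data_bits

# Heads-heavy coin flips: the biased model pays extra bits to describe its
# parameter, but compresses the data enough to win overall.
data = ['H'] * 90 + ['T'] * 10
biased = lambda x: 0.9 if x == 'H' else 0.1
fair = lambda x: 0.5
cost_biased = two_part_code_length(8.0, data, biased)  # 8 + ~46.9 bits
cost_fair = two_part_code_length(0.0, data, fair)      # 0 + 100 bits
```

With only a handful of flips the fair model's zero-cost description would win instead; MDL's complexity penalty is exactly this trade between model bits and data bits.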
Machine Super Intelligence (Dissertation)
Shane Legg (DeepMind) · 2008
Theoretical foundations of machine superintelligence: formal definition of intelligence, pathways to superintelligence, early AI safety framing.
Kolmogorov Complexity and Algorithmic Randomness
Shen, Uspensky, Vereshchagin
Comprehensive technical textbook on Kolmogorov complexity: incompressibility, algorithmic randomness, mutual information. The math behind intelligence measures.
CS231n: CNNs for Visual Recognition (Stanford)
Fei-Fei Li et al.
The gold-standard CNN course: backpropagation, convolutions, batch norm, transfer learning. Still the best technical intro to deep learning for vision.
Better & Faster LLMs via Multi-Token Prediction
Gloeckle, Idrissi, Rozière et al. (Meta) · 2024
Instead of predicting one next token, predict the next k tokens in parallel with k independent heads. Faster inference + better code/reasoning performance.
Dense Passage Retrieval for Open-Domain QA
Karpukhin et al. (Meta) · 2020
Dual-encoder BERT for dense retrieval dramatically outperforms sparse BM25. The foundation of modern RAG systems.
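At inference time the dual-encoder idea reduces to a maximum-inner-product search; a sketch with hypothetical 2-d embeddings standing in for BERT encoder outputs:

```python
def retrieve(query_vec, passages):
    # Dense retrieval: rank passages by inner product between the query
    # embedding and precomputed passage embeddings, return the top passage.
    def score(p):
        return sum(q * e for q, e in zip(query_vec, p[1]))
    return max(passages, key=score)[0]

# (text, embedding) pairs; embeddings here are invented toy vectors.
passages = [("Paris is the capital of France.", [0.9, 0.1]),
            ("GPUs accelerate matrix multiplies.", [0.1, 0.9])]
top = retrieve([1.0, 0.0], passages)
```

Because passage embeddings are computed once offline, production systems swap the linear scan for an approximate nearest-neighbor index.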
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)
Lewis et al. (Meta AI) · 2020
Combine a pre-trained seq2seq model with a dense retriever over Wikipedia. Factual, updateable knowledge without retraining the model.
Zephyr: Direct Distillation of LM Alignment
Tunstall et al. (HuggingFace) · 2023
Distill alignment from a larger teacher LLM to a smaller student using dSFT + dDPO, with no PPO and no reward model. Zephyr-7B beats larger RLHF-trained models.
Cards marked 'Full breakdown' link to interactive deep-dives. Others link to the original paper.