Ilya's Top 30 Papers

Ilya Sutskever's personal reading list — the papers that shaped modern deep learning. From Kolmogorov complexity to Transformers, residual networks to scaling laws.

Original list by Aman Chadha ↗
1

The First Law of Complexodynamics

Scott Aaronson

Full breakdown

Why does complexity in physical systems rise, peak, and fall — unlike entropy, which only grows? Introduces 'complextropy' as a bounded complexity measure.

Kolmogorov complexity · Sophistication · Entropy

2

The Unreasonable Effectiveness of Recurrent Neural Networks

Andrej Karpathy · 2015

Full breakdown

RNNs trained character-by-character on raw text can produce surprisingly coherent outputs — code, Shakespeare, math papers. A must-read for building intuition about sequence models.

RNN · Sequence Modeling · Character-level LM

3

Understanding LSTM Networks

Christopher Olah · 2015

Full breakdown

The clearest explanation of how LSTM gates (forget, input, output) enable long-term memory. Required reading before diving into Transformers.

LSTM · Gating Mechanisms · Long-term Dependencies

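The gate structure described above can be sketched in a few lines of numpy. This is a minimal toy, not code from the article: the weight shapes, the gate ordering in the stacked matrices, and all names are my assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, X), U: (4H, H), b: (4H,).
    Gate order assumed in the stacked weights: forget, input, output, candidate."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0*H:1*H])   # forget gate: what to erase from the cell
    i = sigmoid(z[1*H:2*H])   # input gate: what to write into the cell
    o = sigmoid(z[2*H:3*H])   # output gate: what to expose as output
    g = np.tanh(z[3*H:4*H])   # candidate cell contents
    c = f * c_prev + i * g    # cell state: gated long-term memory update
    h = o * np.tanh(c)        # hidden state: gated readout of the cell
    return h, c

rng = np.random.default_rng(0)
X, H = 3, 4
W, U, b = rng.normal(size=(4*H, X)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, W, U, b)
```

The cell state `c` is the additive memory path the gates protect; `h` is the gated view of it that downstream layers see.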
4

Recurrent Neural Network Regularization

Zaremba, Sutskever, Vinyals · 2014

Full breakdown

Dropout for LSTMs: apply it only on the non-recurrent connections. A simple fix that significantly improves generalization on language modeling tasks.

Dropout · LSTM · Regularization

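The "non-recurrent connections only" rule can be illustrated with a toy numpy layer. A hedged sketch under my own naming, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(x, p, train=True):
    """Inverted dropout: zero units with probability p, rescale survivors
    by 1/(1-p) so activations keep the same expected value."""
    if not train or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

def recurrent_layer_step(x, h_prev, W, U, p=0.5):
    """Per Zaremba et al., dropout goes only on the non-recurrent input,
    never on the h_{t-1} -> h_t path, so the recurrent memory is not
    corrupted anew at every time step."""
    x = dropout(x, p)                    # non-recurrent connection: dropout OK
    return np.tanh(W @ x + U @ h_prev)   # recurrent connection: left intact
```

Applying dropout to the recurrent path instead would inject fresh noise at every step and wash out long-range information, which is exactly what the paper avoids.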
5

Keeping Neural Networks Simple by Minimizing the Description Length of the Weights

Hinton, van Camp · 1993

Full breakdown

Apply Minimum Description Length (MDL) to networks: add Gaussian noise to weights to compress them. A 1993 precursor to modern weight regularization and Bayesian deep learning.

MDL Principle · Weight Compression · Bayesian NNs

6

Pointer Networks

Vinyals, Fortunato, Jaitly · 2015

Full breakdown

Attention mechanism that points to positions in the input instead of a fixed output vocabulary. Solves variable-output problems like convex hull and TSP.

Pointer Mechanism · Attention · Combinatorial Optimization

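The pointing mechanism is just additive attention whose softmax is used directly as the output distribution. A toy numpy sketch with invented weights (the paper's actual parameterization lives inside an LSTM encoder-decoder):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pointer_attention(dec_state, enc_states, W1, W2, v):
    """Score every input position against the decoder state, then
    softmax over positions. The softmax itself is the output: the
    'vocabulary' is the set of input indices, so it grows and shrinks
    with the input length."""
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state)
                       for e in enc_states])
    return softmax(scores)   # probability of pointing at each input index

rng = np.random.default_rng(2)
D = 4
enc = [rng.normal(size=D) for _ in range(6)]   # 6 encoded input elements
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
v = rng.normal(size=D)
p = pointer_attention(rng.normal(size=D), enc, W1, W2, v)
```

A conventional seq2seq decoder would instead project to a fixed vocabulary, which cannot express "output element #5 of this particular input".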
7

ImageNet Classification with Deep CNNs (AlexNet)

Krizhevsky, Sutskever, Hinton · 2012

Full breakdown

The paper that started the deep learning revolution. AlexNet used ReLU activations, GPU training, and dropout to win ImageNet 2012 by more than 10 percentage points of top-5 error over the runner-up.

CNN · ReLU · GPU Training · Dropout

8

Order Matters: Sequence to Sequence for Sets

Vinyals, Bengio, Kudlur · 2015

Full breakdown

The order you feed inputs into seq2seq models matters significantly — even for set-structured problems. Proposes methods for learning optimal input/output orderings.

Seq2Seq · Order Invariance · Attention

9

GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism

Huang et al. (Google) · 2018

Full breakdown

Split model layers across accelerators and pipeline micro-batches through them. Enables training billion-parameter models without any single accelerator having to hold the whole network.

Model Parallelism · Pipeline Parallelism · Scaling

10

Deep Residual Learning for Image Recognition (ResNet)

He, Zhang, Ren, Sun · 2015

Full breakdown

Skip connections solve the degradation problem in very deep networks. ResNet-152 won ImageNet 2015; residual blocks are now everywhere in deep learning.

Skip Connections · Residual Blocks · Vanishing Gradients

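The core idea fits in one line. A minimal sketch (the function `f` stands in for the block's conv-BN-ReLU stack, which I omit here):

```python
import numpy as np

def residual_block(x, f):
    """y = x + F(x): the block learns only the residual F.
    If the optimal mapping is near the identity, F can be driven
    toward zero, which is far easier for SGD than squeezing an
    identity map through stacked nonlinear layers."""
    return x + f(x)

# With F == 0 the block is exactly the identity, so stacking many
# blocks cannot make the representation worse -- the degradation
# problem the paper diagnoses in plain deep nets.
x = np.arange(4.0)
y = residual_block(x, lambda t: np.zeros_like(t))
```

The skip connection also gives gradients an unobstructed additive path back through the network, which is why very deep ResNets remain trainable.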
11

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, Koltun · 2015

Full breakdown

Dilated convolutions expand the receptive field exponentially without losing resolution. Key for semantic segmentation tasks.

Dilated Convolutions · Semantic Segmentation · Receptive Field

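A 1D numpy sketch of the mechanism (my own simplified implementation with symmetric zero padding, not the paper's 2D segmentation network):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1D convolution that reads inputs `dilation` steps
    apart. Output length equals input length, so no resolution is lost."""
    k = len(w)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

x = np.random.default_rng(3).normal(size=16)
y = x
# Stacking kernel-3 layers with dilations 1, 2, 4 grows the receptive
# field to 15 samples while every layer keeps full resolution; plain
# stride-1 convolutions would need 7 layers for the same coverage.
for d in (1, 2, 4):
    y = dilated_conv1d(y, np.array([0.25, 0.5, 0.25]), d)
```

Doubling the dilation at each layer is what makes the receptive field grow exponentially in depth rather than linearly.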
12

Neural Message Passing for Quantum Chemistry

Gilmer et al. (Google) · 2017

Full breakdown

Unifies GNN variants under a single Message Passing Neural Network framework for predicting molecular properties from graphs.

Graph Neural Networks · Message Passing · Molecular Properties

13

Attention Is All You Need

Vaswani et al. (Google Brain) · 2017

Full breakdown

Introduced the Transformer — replacing RNNs entirely with self-attention. The foundation of every modern LLM.

Self-Attention · Multi-Head Attention · Positional Encoding · Encoder-Decoder

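The Transformer's core operation is scaled dot-product attention, which the paper defines as softmax(QK^T / sqrt(d_k)) V. A single-head numpy sketch with random inputs:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: every position attends to every
    other position in one step, with no recurrence. The 1/sqrt(d_k)
    scaling keeps the logits from saturating the softmax as d_k grows."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(4)
T, d = 5, 8                         # 5 positions, dimension 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
```

In the full model, Q, K, and V are learned projections of the token embeddings, multiple heads run this in parallel, and positional encodings supply the order information the attention itself ignores.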
14

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, Bengio · 2014

Full breakdown

The original attention paper. The decoder learns to focus on relevant source words dynamically — a precursor to the Transformer's self-attention.

Attention Mechanism · Alignment · NMT

15

Identity Mappings in Deep Residual Networks

He, Zhang, Ren, Sun · 2016

Full breakdown

ResNet v2: move BN and ReLU before the convolution for clean identity mappings. Enables training a 1001-layer ResNet.

Identity Mappings · Residual Networks · Signal Propagation

16

A Simple Neural Network Module for Relational Reasoning

Santoro et al. (DeepMind) · 2017

Full breakdown

Relation Networks: a small module that computes all pairwise object relations. State-of-the-art on visual QA with a strikingly simple design.

Relational Reasoning · Pairwise Relations · Visual QA

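The module's form is RN(O) = f( sum over pairs of g(o_i, o_j) ). A toy numpy sketch where lambdas stand in for the paper's MLPs g and f:

```python
import numpy as np

def relation_network(objects, g, f):
    """Apply the same small network g to every ordered pair of objects,
    sum the results, then post-process with f. Summing over all pairs
    makes the module permutation-invariant in the object set."""
    pair_sum = sum(g(oi, oj) for oi in objects for oj in objects)
    return f(pair_sum)

rng = np.random.default_rng(5)
objs = [rng.normal(size=4) for _ in range(3)]
g = lambda a, b: np.concatenate([a, b])   # toy stand-in for an MLP
f = lambda s: s.sum()                     # toy stand-in for the readout MLP
score = relation_network(objs, g, f)
perm = [objs[2], objs[0], objs[1]]        # same objects, different order
```

Because every pair goes through the same g, the parameter count stays small even as the number of objects grows; the cost is the quadratic number of pairs.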
17

Variational Lossy Autoencoder

Chen, Kingma, Salimans et al. · 2017

Full breakdown

Combines VAEs with autoregressive models: use an autoregressive decoder to capture local details and let the VAE latent capture global structure.

VAE · Variational Inference · Autoregressive Models

18

Relational Recurrent Neural Networks

Santoro et al. (DeepMind) · 2018

Full breakdown

Relational Memory Core: uses multi-head attention for memory-to-memory interactions. Improves performance on tasks that require tracking relations over time.

Memory Augmented NNs · Multi-Head Attention · Relational Reasoning

19

The Coffee Automaton: Coarse-graining, Symmetry Breaking, and Possible Futures

Aaronson, Carroll, Ouellette · 2014

Full breakdown

Cellular automata model of complexity: shows complexity peaks at intermediate times using coarse-grained Kolmogorov complexity. Companion to paper #1.

Coarse-graining · Complexity Dynamics · Kolmogorov Complexity

20

Neural Turing Machines

Graves, Wayne, Danihelka (DeepMind) · 2014

Full breakdown

Neural network + differentiable external memory = can learn algorithms (copy, sort, associative recall). Uses both content-based and location-based addressing.

External Memory · Differentiable Programming · Turing Completeness

21

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Amodei et al. (Baidu Research) · 2015

Full breakdown

End-to-end deep learning for ASR that approaches human performance on several benchmarks. Works across English and Mandarin, different accents, and noisy environments.

ASR · CTC · Batch Normalization · End-to-End

22

Scaling Laws for Neural Language Models

Kaplan et al. (OpenAI) · 2020

Full breakdown

Loss follows power laws in N (params), D (data), and C (compute). Optimal allocation: scale N and D together. Directly led to GPT-3 and the LLM scaling era.

Scaling Laws · Power Laws · Compute Optimal · LLM

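The parameter-limited law has the form L(N) = (N_c / N)^alpha_N. The constants below are the fits Kaplan et al. report for their setup; treat them as illustrative of the shape, not as universal values:

```python
import numpy as np

def loss_vs_params(N, N_c=8.8e13, alpha_N=0.076):
    """Parameter-limited scaling law L(N) = (N_c / N)^alpha_N from
    Kaplan et al.; N_c and alpha_N are their reported fits."""
    return (N_c / N) ** alpha_N

# A power law is a straight line on a log-log plot: every 10x increase
# in parameters multiplies the loss by the same factor, 10**(-alpha_N).
l1 = loss_vs_params(1e8)    # 100M params
l2 = loss_vs_params(1e9)    # 1B params
l3 = loss_vs_params(1e10)   # 10B params
```

The constant multiplicative improvement per decade of scale is what made extrapolating to GPT-3-sized models a calculated bet rather than a leap of faith.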
23

A Tutorial Introduction to the Minimum Description Length Principle

Peter Grünwald · 2004

Full breakdown

MDL principle: the best model is the one that compresses the data most. A bridge between Kolmogorov complexity and practical statistics / model selection.

MDL · Model Selection · Data Compression · Kolmogorov Complexity

24

Machine Super Intelligence (Dissertation)

Shane Legg (DeepMind) · 2008

Full breakdown

Theoretical foundations of machine superintelligence: formal definition of intelligence, pathways to superintelligence, early AI safety framing.

AGI · Intelligence Measures · AI Safety · Recursive Self-improvement

25

Kolmogorov Complexity and Algorithmic Randomness

Shen, Uspensky, Vereshchagin

Full breakdown

Comprehensive technical textbook on Kolmogorov complexity: incompressibility, algorithmic randomness, mutual information. The math behind intelligence measures.

Kolmogorov Complexity · Algorithmic Randomness · Information Theory

26

CS231n: CNNs for Visual Recognition (Stanford)

Fei-Fei Li et al.

Full breakdown

The gold standard CNN course — backpropagation, convolutions, batch norm, transfer learning. Still the best technical intro to deep learning for vision.

CNN · Backpropagation · Batch Normalization · Transfer Learning

27

Better & Faster LLMs via Multi-Token Prediction

Gloeckle, Idrissi, Rozière et al. (Meta) · 2024

Full breakdown

Instead of predicting one next token, predict the next k tokens in parallel with k independent heads. Faster inference + better code/reasoning performance.

Multi-Token Prediction · Parallel Decoding · Inference Speed

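The "k independent heads on a shared trunk" idea can be sketched with numpy. This is a toy with linear heads; in the paper each head is a transformer layer, and all shapes and names here are my assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_heads(hidden, heads, unembed):
    """One shared trunk state, k independent output heads: head j
    predicts the token at offset j+1. All k distributions come from
    a single forward pass, which is what enables speculative-style
    parallel decoding at inference time."""
    return [softmax(unembed @ (Wh @ hidden)) for Wh in heads]

rng = np.random.default_rng(6)
d, vocab, k = 8, 20, 4
hidden = rng.normal(size=d)                        # shared trunk output
heads = [rng.normal(size=(d, d)) for _ in range(k)]  # one matrix per head
unembed = rng.normal(size=(vocab, d))              # shared unembedding
dists = multi_token_heads(hidden, heads, unembed)
```

At training time each head gets its own next-k-token loss; at inference the extra heads can either be dropped (standard decoding) or used to draft several tokens at once.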
28

Dense Passage Retrieval for Open-Domain QA

Karpukhin et al. (Meta) · 2020

Full breakdown

Dual-encoder BERT for dense retrieval dramatically outperforms sparse BM25. The foundation of modern RAG systems.

Dense Retrieval · Dual Encoder · Open-Domain QA · Embeddings

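Once questions and passages are embedded, retrieval is just a nearest-neighbor search by inner product. A toy numpy sketch where unit-normalized random vectors stand in for the two BERT encoders:

```python
import numpy as np

def retrieve(query_vec, passage_vecs, top_k=2):
    """Dense retrieval: score every passage by inner product with the
    query embedding and return the indices of the top-k scores. In DPR
    the two vectors come from two separately trained BERT encoders."""
    scores = passage_vecs @ query_vec
    return np.argsort(-scores)[:top_k]

rng = np.random.default_rng(7)
passages = rng.normal(size=(10, 16))                          # 10 toy passages
passages /= np.linalg.norm(passages, axis=1, keepdims=True)   # unit vectors
query = passages[3]            # a query embedded right on top of passage 3
top = retrieve(query, passages)
```

At scale, the brute-force matrix product is replaced by an approximate nearest-neighbor index (DPR used FAISS), but the scoring function is the same.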
29

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)

Lewis et al. (Meta AI) · 2020

Full breakdown

Combine a pre-trained seq2seq model with a dense retriever over Wikipedia. Factual, updateable knowledge without retraining the model.

RAG · Knowledge Retrieval · Parametric Memory · Seq2Seq

30

Zephyr: Direct Distillation of LM Alignment

Tunstall et al. (HuggingFace) · 2023

Full breakdown

Distill alignment from a larger teacher LLM to a smaller student using dSFT + dDPO — no PPO, no reward model. Zephyr-7B beats larger RLHF models.

Knowledge Distillation · Alignment · DPO · Instruction Tuning

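The DPO objective at the heart of Zephyr's dDPO stage can be written down directly. A per-pair numpy sketch with made-up log-probabilities (the real loss averages this over a dataset of teacher-ranked preference pairs):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid( beta * [(logpi_c - ref_c) - (logpi_r - ref_r)] ).
    No reward model and no PPO rollouts: the policy's log-ratios
    against a frozen reference model act as an implicit reward."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# The loss falls as the policy favours the chosen answer more than the
# reference does, and rises when it favours the rejected one.
better = dpo_loss(-5.0, -9.0, -6.0, -6.0)   # policy prefers chosen
worse  = dpo_loss(-9.0, -5.0, -6.0, -6.0)   # policy prefers rejected
```

Because the gradient only needs log-probabilities from the policy and a frozen reference, training reduces to a classification-style loop, which is what makes the distilled-alignment recipe so cheap compared with RLHF.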

Cards marked 'Full breakdown' link to interactive deep-dives. Others link to the original paper.