TL;DR
Entropy always increases, but the interesting 'complexity' of a closed system first rises and then falls. Aaronson proposes a measure called complextropy, defined via Kolmogorov complexity and model compression, which captures this arc: random noise and uniform gas are both low-complextropy, while galaxies, organisms, and brains sit in the hard-to-describe-briefly middle. He conjectures a First Law: the complextropy of a closed system first increases, then decreases toward zero.
1. The Entropy Paradox
The Second Law of Thermodynamics says entropy never decreases in a closed system. A glass of water and a drop of ink start in a low-entropy state, all the ink molecules clustered together, and over time the ink disperses until the solution is a uniform blue. Entropy has increased, and it will never spontaneously decrease.
Now consider a box of gas. At time 0, all molecules are crammed into one corner: extremely ordered, low entropy. As time passes, the gas expands and fills the box uniformly: high entropy. Both the initial state (all in one corner) and the final state (perfectly uniform) seem simple and easy to describe. But the intermediate states, with molecules half-mixed in complicated swirling patterns, seem far more complex and harder to describe.
The puzzle: Entropy increases monotonically, but something we might call 'complexity' or 'sophistication' first increases and then decreases. If complexity is not entropy, what is it, and can we define it precisely?
This is not a trivial observation. The early universe was also low entropy and 'simple', yet it gave rise to stars, galaxies, cells, and brains, all of which seem vastly more complex than the final heat-death equilibrium. Something interesting happens in between, and entropy alone cannot capture it.
2. Kolmogorov Complexity
To make 'complexity' precise, we turn to algorithmic information theory. The Kolmogorov complexity K(x) of a string x is the length of the shortest program that outputs x on a universal Turing machine U:

K(x) = min { |p| : U(p) = x }
In other words: K(x) measures how compressible x is. A string of one million zeros has very low Kolmogorov complexity: the program 'print 0 one million times' is short. A random string of one million bits has high Kolmogorov complexity: no program shorter than the string itself can reproduce it.
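K(x) itself is uncomputable, but any real compressor gives an upper bound on it, so a quick sketch in Python, using zlib as an admittedly crude stand-in for K, makes the contrast concrete:

```python
import os
import zlib

# zlib compressed length as a crude, computable stand-in for K(x):
# it upper-bounds K(x) up to the (constant-size) decompressor.
zeros = b"0" * 1_000_000          # one million '0' characters: highly regular
noise = os.urandom(1_000_000)     # one million random bytes: no structure

print(len(zlib.compress(zeros, 9)))   # ~ 1 KB: the regularity is found
print(len(zlib.compress(noise, 9)))   # ~ 1 MB: essentially incompressible
```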
But Kolmogorov complexity fails to capture the 'interesting middle'. A random string has maximum K(x), yet it is not complex in any meaningful sense; it is just noise. A perfectly uniform string has minimum K(x); it is also not interesting. The things we care about (organisms, languages, theorems) sit somewhere in between.
Kolmogorov complexity as a thermodynamic analogy
Shannon entropy H(X) and Kolmogorov complexity K(x) are deeply related: for a string drawn from a distribution, E[K(x)] ≈ H(X) up to logarithmic terms. But K(x) is defined for individual strings, not distributions, and that is exactly what we need for a physical system whose microstate evolves deterministically.
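A small experiment makes the correspondence tangible. The sketch below, again with zlib as an assumed proxy for K, compares the compressed length of a biased coin sequence to n times its Shannon entropy:

```python
import math
import random
import zlib

# Empirical check that compressed length roughly tracks Shannon entropy,
# with zlib standing in for K. Biased coin: each symbol is 1 with p = 0.1.
random.seed(0)
p, n = 0.1, 100_000
x = bytes(1 if random.random() < p else 0 for _ in range(n))

h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy, bits/symbol
print(f"n * H(X)      = {n * h:,.0f} bits")
print(f"8 * |zlib(x)| = {8 * len(zlib.compress(x, 9)):,} bits")
# The two agree up to compressor overhead: E[K(x)] ~ H(X) per symbol.
```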
3. Sophistication: Complexity Given a Model
The key insight is that a random string, despite having high K(x), is not 'sophisticated'. It has no exploitable structure: its description can only be the string itself. Sophistication tries to capture the question: how much of x's complexity comes from non-trivial structure versus mere randomness?
Formally, a model C for x is a succinctly described set containing x. The sophistication of x with parameter k is the minimum description length of any model C such that: (1) K(C) ≤ k, i.e., the model itself is short to describe, and (2) K(x|C) ≤ K(x) − k + O(1), i.e., given the model, x still requires roughly K(x) − k bits to specify, meaning x is 'typical' of the model rather than special within it.
Intuitively: sophistication asks 'what is the shortest description of the best model for x?' A random string has soph(x) ≈ 0: the best model is just 'all strings of length n', which is trivial to describe. A string that is the output of a complex but compressible computation has high sophistication: you need a non-trivial model to compress it.
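Here is a toy illustration of the two-part intuition (an assumed construction for this article, not Aaronson's formalism): a string built from a random motif plus sparse noise admits a description whose model part and residual part are both substantial, yet together far shorter than the string itself:

```python
import os
import random

# Toy two-part description: x is a random 64-byte motif tiled 1000 times,
# then 1% of its bytes are corrupted.
random.seed(0)
motif = os.urandom(64)
x = bytearray(motif * 1000)                  # 64,000 bytes of pure structure
flips = random.sample(range(len(x)), 640)    # sparse noise on top
for i in flips:
    x[i] ^= 0xFF

model_cost = len(motif) + 8          # ~K(C): the motif plus 'tile it 1000 times'
residual_cost = 3 * len(flips)       # ~K(x|C): ~3 bytes to name each flipped position
print(model_cost + residual_cost, "bytes of two-part code for a", len(x), "byte string")
# Both parts are non-trivial (the motif is incompressible, the flip list is
# long), yet together they are ~2 KB for a 64 KB string: real structure plus
# real residual randomness.
```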
4. Complextropy Defined
Sophistication is the right idea, but it is technically fragile: small changes in k can cause soph(x) to jump discontinuously. Aaronson proposes a smoother, more robust measure called complextropy. The intuition is to minimize, over all models C, the total two-part description length (model size plus residual) subject to neither part being trivial:

complextropy(x) = min over models C of [ K(C) + K(x|C) ], subject to neither K(C) nor K(x|C) being negligible
Without the balance constraint, this minimum equals K(x) (just take C to be the empty model), which is not interesting. The balance constraint says that neither K(C) nor K(x|C) dominates: we want a model that genuinely compresses x, not one that either explains everything (memorization) or nothing (trivial model).
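A rough sketch of how one might operationalize this with a compressor; both the cost proxies and the exact form of the balance constraint here are assumptions for illustration:

```python
import zlib

def two_part_cost(model: bytes, residual: bytes) -> int:
    # Proxy costs: K(C) ~ |zlib(C)|, K(x|C) ~ |zlib(residual)|.
    return len(zlib.compress(model, 9)) + len(zlib.compress(residual, 9))

def complextropy_proxy(x: bytes, decompositions) -> int:
    # decompositions yields (model, residual) pairs that jointly describe x.
    # Balance constraint (one assumed form of it): reject pairs where either
    # part falls below a small fraction of the whole description, so neither
    # the empty model nor outright memorization can win.
    eps = 0.05 * len(zlib.compress(x, 9))
    costs = [two_part_cost(c, r) for c, r in decompositions
             if len(zlib.compress(c, 9)) > eps and len(zlib.compress(r, 9)) > eps]
    # No balanced decomposition at all means nothing worth modeling: ~0,
    # as for pure noise or a trivial string.
    return min(costs) if costs else 0
```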
Complextropy vs. entropy vs. sophistication

Measure          Random noise   Perfect crystal   Living organism
Entropy / K(x)   maximal        minimal           intermediate
Sophistication   ≈ 0            ≈ 0               high
Complextropy     ≈ 0            ≈ 0               high
5. The First Law Conjecture
With complextropy defined, Aaronson states the conjecture that he calls the First Law of Complexodynamics:
First Law of Complexodynamics (conjecture)
For a typical closed physical system evolving from a low-entropy initial state: complextropy first increases over time, reaches a maximum at some intermediate time, and then decreases toward near zero as the system approaches thermal equilibrium.
This is a conjecture, not a theorem. The difficulty is both technical (Kolmogorov complexity is not computable) and conceptual (what does 'typical closed system' mean precisely?). But the intuitive content is crisp: there is a sense in which the universe has been on the rising phase of its complextropy arc ever since the Big Bang, with the long-run heat death being the eventual descent.
Why should complextropy first increase? Because starting from a fully correlated low-entropy state, the system begins generating structure: correlations between distant parts that are non-trivial to describe but that can be captured by a good model. Why should it then decrease? Because as the system fully thermalizes, those correlations wash out into pure noise, which is low-complextropy for the same reason that a random string has soph ≈ 0.
6. Three Examples: Noise, Crystal, Organism
To make the definitions concrete, consider three canonical objects and how they fare on complextropy:
Example 1: Random noise (high-entropy gas at equilibrium)
A string of random bits has K(x) ≈ |x|: nearly incompressible. Its entropy is maximal. But what is soph(x)? The only 'model' for a random string is the trivial one: 'all strings of this length'. K(C) = O(log |x|), and K(x|C) ≈ K(x). The total K(C) + K(x|C) ≈ K(x), but neither term is genuinely reduced by the model. Complextropy ≈ 0. Random noise is complex in the entropy sense but not in the complextropy sense: there is nothing to model.
Example 2: Perfect crystal (low-entropy, highly ordered)
A perfect crystal, say a string 'ABABAB...AB' of length n, has K(x) = O(log n): very compressible. Its entropy is nearly zero. The best model C is the pattern 'AB repeated n/2 times', which has K(C) = O(log n). Given C, K(x|C) ≈ 0: the string is fully determined by the model. So K(C) + K(x|C) = O(log n), which is negligible. Complextropy ≈ 0. Crystals are simple in both the entropy sense and the complextropy sense.
Example 3: A living organism (medium entropy, high sophistication)
The genome and molecular machinery of a bacterium are neither perfectly random nor perfectly regular. They have genuine structure (genetic code, metabolic pathways, regulatory networks) that can be partially described by a model C of non-trivial length. Given C, K(x|C) is still large: there are many specific details the model does not capture. Both K(C) and K(x|C) are large and comparable. Complextropy is high. The organism is in the interesting middle: it has structure worth modeling, and residual specificity worth noting.
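The three examples can be mimicked with byte strings and a compressor. The sketch below (zlib standing in for K, with an invented 'organism' built from a motif plus scattered noise) reproduces the qualitative ordering:

```python
import os
import zlib

# The three canonical objects as byte strings; zlib length is the K(x) proxy.
n = 65_536
noise = os.urandom(n)                        # Example 1: random noise
crystal = b"AB" * (n // 2)                   # Example 2: perfect crystal
motif = os.urandom(256)                      # Example 3: genuine structure...
organism = bytearray(motif * (n // 256))
for i in range(0, n, 97):                    # ...plus scattered residual detail
    organism[i] ^= 0xFF

for name, s in [("noise", noise), ("crystal", crystal), ("organism", bytes(organism))]:
    print(f"{name:8s} zlib size = {len(zlib.compress(s, 9)):6d} of {n}")
# Expected ordering: noise ~ n, crystal ~ tens of bytes, organism in between:
# compressible thanks to the motif, but nowhere near the crystal's size.
```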
The arc from Big Bang to heat death passes through all three regimes: from an initial near-crystal (ultra-low-entropy vacuum state) through the interesting middle (stars, planets, life) to a final near-random noise (thermal equilibrium). Complextropy rises and then falls, tracing the bell curve of cosmic history.
7. Why This Matters for Understanding Intelligence
Aaronson's essay is not just physics. It has sharp implications for what 'intelligence' and 'learning' mean. If complextropy measures genuine structured complexity, then the goal of learning, in both biological and machine intelligence, is precisely to find the model C that achieves the balance: capturing real structure without memorizing noise.
A neural network that memorizes its training data has found a model C with K(C) ≈ K(x): it is a lookup table. The residual K(x|C) is near zero, but C itself is expensive. This is low-complextropy in the bad way: the model is not generalizing. A network that learns genuine features (grammar, physics, causal structure) has found a short C (the inductive bias built into the architecture plus learned weights) such that K(x|C) is small. That is high-complextropy intelligence.
Connection to generalization: Minimum description length (MDL) and Bayesian model selection both operationalize this balance. In MDL, the best model minimizes K(C) + K(data|C). In Bayesian inference, the posterior favors models with short descriptions and high likelihood. Complextropy gives a physical grounding for why generalization is the right goal.
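As a concrete instance of the MDL balance, here is a minimal sketch (with assumed, simplified coding costs, not a full MDL implementation) that selects a polynomial degree by two-part description length and recovers the true quadratic:

```python
import numpy as np

# MDL-flavored model selection: total cost = bits for the parameters
# + bits for the residuals under a Gaussian coding model.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 2.0 * x**2 - x + rng.normal(0.0, 0.1, x.size)   # quadratic signal + noise

def description_length(degree: int, bits_per_param: float = 32.0) -> float:
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    # Residual cost ~ (n/2) * log2(mse), up to constants; parameter cost is
    # an assumed flat 32 bits per coefficient.
    return bits_per_param * (degree + 1) + 0.5 * x.size * np.log2(mse)

best = min(range(9), key=description_length)
print("degree chosen by two-part cost:", best)   # typically 2: the true signal
```

Lower degrees pay heavily in residual bits (underfitting); higher degrees pay in parameter bits without reducing the residual (memorizing noise). The minimum sits at the model that captures the real structure.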
8. Connection to the Coffee Automaton
Aaronson illustrates the First Law with a thought experiment he calls the 'coffee automaton': a cellular automaton (a discrete toy physics) starting from a state where one region is black (coffee) and the rest is white (cream). As the automaton runs, the two regions mix:
t = 0: two clean blocks. Low entropy, low complextropy.
t = T: complex swirling pattern. High entropy, high complextropy.
t = ∞: uniform gray. Max entropy, near-zero complextropy.
At t = 0, the state is easy to describe: 'left half black, right half white'. K(x) is small. Complextropy is near zero because K(C) + K(x|C) is small (the model is simple and the residual is also small).
At intermediate t, the swirling pattern cannot be described by a simple model (K(C) is large), but it is not fully random either: structures inherited from the initial condition remain (K(x|C) is still non-trivial). Both terms are substantial. Complextropy is high.
At t = ∞, the state is uniform gray. Coarse-grained, it is trivial to describe; fine-grained, it is pure noise. The best model is the trivial 'uniform distribution over all states', with K(C) ≈ 0, and the residual K(x|C) carries no structure worth modeling. Total complextropy ≈ 0.
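The whole arc can be watched numerically. The sketch below is inspired by, but much simpler than, Aaronson's actual coffee automaton: random adjacent swaps mix the grid, zlib length of the raw grid proxies entropy, and zlib length of a coarse-grained grid proxies the visible structure:

```python
import random
import zlib

# Coffee-automaton sketch: 'coffee' (1s) fills the left half of an N x N grid,
# and each step swaps a random pair of horizontally adjacent cells.
random.seed(0)
N, BLOCK, STEP = 64, 8, 400_000
grid = [[1 if c < N // 2 else 0 for c in range(N)] for r in range(N)]

def coarse(g):
    # Quantize each BLOCK x BLOCK tile to black / gray / white.
    out = bytearray()
    for r in range(0, N, BLOCK):
        for c in range(0, N, BLOCK):
            frac = sum(g[r + i][c + j] for i in range(BLOCK) for j in range(BLOCK)) / BLOCK**2
            out.append(0 if frac < 0.25 else 2 if frac > 0.75 else 1)
    return bytes(out)

def zsize(b):
    return len(zlib.compress(b, 9))

for frame in range(6):
    if frame:                                   # advance between snapshots
        for _ in range(STEP):
            r, c = random.randrange(N), random.randrange(N - 1)
            grid[r][c], grid[r][c + 1] = grid[r][c + 1], grid[r][c]
    fine = bytes(v for row in grid for v in row)
    print(f"t={frame * STEP:>9}  entropy~{zsize(fine):>4}  structure~{zsize(coarse(grid)):>3}")
# Typical run: the fine-grained size climbs and saturates (Second Law); the
# coarse-grained size starts near-trivial, peaks mid-mixing, then falls back
# toward trivial: the complextropy arc.
```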
The coffee automaton is a microcosm of cosmic history. The universe started as coffee and cream in separate corners; it is now in the swirling phase; heat death will be the uniform gray. We, all the interesting structures, are the swirl.