Integral World: Exploring Theories of Everything
An independent forum for a critical discussion of the integral philosophy of Ken Wilber
Frank Visser, who graduated as a psychologist of culture and religion, founded IntegralWorld in 1997. He has worked as a production manager for various publishing houses and as a service manager for various internet companies, and lives in Amsterdam. Books: Ken Wilber: Thought as Passion (SUNY, 2003) and The Corona Conspiracy: Combatting Disinformation about the Coronavirus (Kindle, 2020).

NOTE: This essay contains AI-generated content

The Illusion of Thinking?

A Critical Review of Apple's Controversial 2025 AI Paper

Frank Visser / ChatGPT

In 2025, Apple Machine Learning Research published one of the most debated AI papers of the year: The Illusion of Thinking. The paper challenged the increasingly popular claim that modern “reasoning models”—such as OpenAI's o-series, Anthropic's Claude Thinking, Google's Gemini Thinking, and DeepSeek-R1—were beginning to display genuine reasoning abilities. Apple's researchers argued instead that these systems merely simulate reasoning convincingly up to moderate complexity levels before collapsing under combinatorial pressure. The paper quickly became a flashpoint in the broader debate over whether Large Language Models are actually “thinking” or merely generating sophisticated statistical imitations of thought.

The controversy emerged not only because of the paper's conclusions, but because it directly confronted the central mythology driving the contemporary AI boom: the assumption that scaling chain-of-thought reasoning will naturally evolve into artificial general intelligence.

Apple's Central Argument

The paper's core thesis is straightforward but provocative. Apple argues that Large Reasoning Models (LRMs) exhibit competent performance only within bounded complexity regimes. Once problems exceed a certain threshold, performance does not degrade gracefully; instead, it collapses abruptly.

The authors claim this reveals a fundamental limitation in current transformer architectures. According to their interpretation, the models are not performing robust symbolic reasoning or compositional planning. Rather, they are exploiting statistical regularities that work impressively for shallow or medium-depth tasks but fail once deeper recursive structure is required.

This distinction matters enormously.

If Apple is correct, then current AI systems may represent a sophisticated plateau of probabilistic pattern matching rather than an early form of machine cognition.

Why the Paper Was Scientifically Important

One of the paper's strongest aspects was its experimental design.

Instead of relying on contaminated benchmarks—such as standardized math problems or coding tasks that may already exist in training data—Apple constructed controlled puzzle environments:

• Tower of Hanoi

• Blocks World

• River Crossing

• Checker Jumping

These environments allowed complexity to be scaled precisely while maintaining known optimal solutions.
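What makes such puzzles attractive as test beds is that any proposed answer can be checked mechanically. The following is a minimal sketch of such a checker for Tower of Hanoi, not Apple's actual evaluation code; the function name, the peg numbering (0, 1, 2) and the move format (source peg, target peg) are assumptions made for illustration.

```python
def is_valid_hanoi_solution(n_disks, moves):
    """Replay (source, target) peg moves and report whether the puzzle ends solved.

    Hypothetical helper: pegs are numbered 0, 1, 2 and all disks start on peg 0.
    """
    # Each peg is a stack; disk n_disks is the largest, disk 1 the smallest.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False  # unknown peg, or a move that goes nowhere
        if not pegs[src]:
            return False  # nothing to pick up from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Solved only when every disk sits on the last peg, largest at the bottom.
    return pegs[2] == list(range(n_disks, 0, -1))


# A correct three-move answer for two disks, and an illegal attempt.
print(is_valid_hanoi_solution(2, [(0, 1), (0, 2), (1, 2)]))
print(is_valid_hanoi_solution(2, [(0, 2), (0, 2)]))
```

Because the checker only replays moves, the same function works for any number of disks, which is what allows difficulty to be raised one step at a time.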

This was methodologically significant because modern AI evaluation has increasingly suffered from benchmark contamination. Many supposedly impressive performances may partly reflect retrieval or memorization rather than generalized reasoning.

Apple therefore shifted the focus away from: “Did the model solve the problem?”

toward: “How does performance change as compositional complexity systematically increases?”

That is a much deeper scientific question.

The Discovery of the “Reasoning Cliff”

Perhaps the paper's most important contribution was its identification of what might be called a reasoning cliff.

The models often performed impressively at low and medium complexity levels, only to fail catastrophically once the task crossed a threshold.

Human cognition usually does not fail in this way. Humans:

• preserve partial structure,

• simplify difficult problems,

• externalize memory,

• switch strategies,

• maintain conceptual coherence even while failing.

The models in Apple's experiments frequently did something different:

• abandoning valid solution paths,

• hallucinating impossible states,

• forgetting constraints,

• losing internal consistency,

• terminating reasoning prematurely.

This pattern suggests that the models are not executing a stable algorithm.

The Most Interesting Observation

One especially revealing finding involved reasoning effort itself.

Apple observed that reasoning models initially produced longer chains of thought as tasks became harder. Beyond a moderate level of complexity, however, the opposite occurred: reasoning effort declined even though much of the available token budget remained unused.

That is deeply interesting because it implies the systems were not persistently exploring deeper solution structures.

Instead, the models appeared to “give up” in a statistical sense.

This supports Apple's hypothesis that chain-of-thought may function less like explicit reasoning and more like probabilistic trajectory expansion.

The implication is subtle but important: the models may generate reasoning-like language without maintaining durable internal symbolic representations.

Where Apple Overstated Its Case

The paper's greatest weakness was its title.

“The Illusion of Thinking” suggests a sweeping philosophical conclusion: that apparent machine reasoning is fundamentally fake.

But the actual experiments support a narrower claim: current reasoning architectures scale poorly on certain formal planning tasks.

That is a meaningful distinction.

The paper demonstrates brittleness, not necessarily total absence of reasoning.

In this sense, Apple's rhetoric exceeded its evidence.

The Tower of Hanoi Controversy

One major criticism concerned output length limitations.

The minimum number of moves needed to solve a Tower of Hanoi puzzle with n disks grows exponentially:

$2^n - 1$

A 15-disk instance, for example, already requires 32,767 moves.

Critics argued that Apple partly conflated reasoning limitations with output-token limitations. A model might internally understand the recursive solution while still struggling to print an enormous move sequence coherently.

This criticism carries real weight.

Later rebuttals showed that some models could generate compressed recursive algorithms rather than explicit move-by-move outputs. That suggests at least partial abstraction capability remained present.
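The gap between a compact algorithm and its written-out answer is easy to see in code. The sketch below is the standard textbook recursion for Tower of Hanoi, not output from any model discussed here; the function name and peg labels are illustrative assumptions.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return
    # Move the n - 1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest remaining disk straight to the target peg.
    yield (source, target)
    # Bring the n - 1 smaller disks from the spare peg onto the target peg.
    yield from hanoi_moves(n - 1, spare, target, source)


# The program stays a few lines long while its output grows as 2**n - 1.
for n in (3, 10, 15):
    count = sum(1 for _ in hanoi_moves(n))
    assert count == 2**n - 1
    print(n, count)
```

The function body is the same size for every n, whereas the explicit move list roughly doubles with each added disk. A system that can state the recursion but cannot print every move is therefore limited by output length on large instances, which is the distinction the critics pressed.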

Apple acknowledged this issue but arguably underestimated its interpretive importance.

Problems With the Benchmark Design

Another controversy involved the River Crossing tasks.

Some critics claimed certain puzzle instances were mathematically unsolvable because of hidden constraint structures, yet models were still evaluated negatively when they failed.

If true, this weakens confidence in parts of the evaluation framework.

Benchmark construction in AI research is notoriously delicate. Small hidden ambiguities can drastically affect conclusions.

The broader point is not that Apple's experiments were invalid, but that highly controlled reasoning evaluations require equally rigorous formal verification.
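One way to make that verification concrete is to search a puzzle's entire state space before using it as a test item. The sketch below does this for the classic "jealous couples" river crossing, which is assumed here as a stand-in; the rules used in Apple's paper may differ, and the function names and encoding are hypothetical.

```python
from collections import deque
from itertools import combinations


def is_safe(group, n):
    """Assumed rule: a wife may share a place with other husbands only if her own is there.

    Husbands are numbered 0..n-1 and the wife of husband i is numbered i + n.
    """
    husbands = {p for p in group if p < n}
    return all((w - n) in husbands or not husbands for w in group if w >= n)


def is_solvable(n, capacity):
    """Breadth-first search over (people on the left bank, boat on the left?) states."""
    everyone = frozenset(range(2 * n))
    start = (everyone, True)
    seen = {start}
    queue = deque([start])
    while queue:
        left, boat_left = queue.popleft()
        if not left:
            return True  # everybody has reached the right bank
        bank = left if boat_left else everyone - left
        for size in range(1, capacity + 1):
            for crew in combinations(sorted(bank), size):
                crew = frozenset(crew)
                new_left = left - crew if boat_left else left | crew
                state = (new_left, not boat_left)
                # The rule must hold in the boat and on both banks after landing.
                if (state not in seen and is_safe(crew, n)
                        and is_safe(new_left, n)
                        and is_safe(everyone - new_left, n)):
                    seen.add(state)
                    queue.append(state)
    return False  # the whole reachable state space was explored without success


print(is_solvable(3, 2))
print(is_solvable(4, 2))
```

Under these assumed rules, the search confirms the classical result that three couples can cross with a two-seat boat while four couples cannot. Running a check of this kind on every generated instance would separate genuinely unsolvable puzzles from model failures.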

The Missing Dimension: Tool Use

Apple's experiments isolated “pure” transformer reasoning without external aids such as:

• code execution,

• scratchpads,

• external memory,

• search,

• planning modules,

• iterative agents.

Critics therefore argued the paper attacked an outdated conception of AI systems.

Modern frontier systems increasingly depend on hybrid architectures that integrate tools and iterative workflows. Under these augmented conditions, performance can improve dramatically.

Yet Apple's deeper target was not merely practical AI engineering.

The real question was: Can chain-of-thought transformers alone scale into robust reasoning systems?

Apple's evidence suggests the answer may be: not indefinitely.

The Return of an Old AI Debate

The paper reopened a classic conflict in artificial intelligence: symbolic reasoning versus statistical learning.

Apple's position resembles earlier critiques from figures such as Gary Marcus, who long argued that language models imitate reasoning behavior without constructing genuine compositional world models.

The debate is now resurfacing in a modern form.

Are LLMs:

• reasoning systems,

• predictive engines,

• compressed internet simulators,

• or some hybrid cognitive architecture?

The answer may not fit traditional categories.

Human cognition itself contains large amounts of heuristic approximation and predictive inference. Biological intelligence is probably not purely symbolic either.

Thus the real question becomes: Can statistical architectures eventually stabilize into generalized abstraction engines?

Apple's paper suggests only limited success so far.

What the Paper Actually Proves

The paper does not prove:

• transformers can never reason,

• AGI is impossible,

• chain-of-thought is useless,

• statistical learning cannot generate abstraction.

What it does show is more precise:

• current reasoning models are brittle,

• benchmark success can conceal shallow cognition,

• compositional complexity exposes weaknesses,

• scaling alone may encounter architectural limits.

That is already a substantial contribution.

The Larger Significance

The enduring importance of The Illusion of Thinking may lie less in its specific experiments and more in how it shifted the AI conversation.

For years, the field celebrated benchmark victories:

• Olympiad mathematics,

• coding competitions,

• professional exams,

• legal reasoning tests.

Apple redirected attention toward systematic complexity scaling.

That was healthy for the field.

The paper also helped puncture the increasingly mystical rhetoric surrounding AI. Some commentators had begun treating chain-of-thought outputs as evidence of consciousness, understanding, or proto-selfhood.

Apple forced the discussion back toward empirical performance and structural limitations.

Its central weakness was not skepticism. Its weakness was philosophical overreach.

The title promised a metaphysical verdict on machine thought.
The paper itself delivered something narrower but still highly valuable: a rigorous critique of the fragility of current reasoning architectures.


