The Illusion of Thinking
Although I have no background in computer science (I’ve only watched Grant Sanderson’s Neural Networks series), the Illusion of Thinking paper (Shojaee P, Mirzadeh I, Alizadeh K, Horton M, Bengio S, and Farajtabar M, 2025) was well worth the read. The paper details some of the fundamental limits of large language models (LLMs) and large reasoning models (LRMs). I think it’s important to note here that although these models might appear to be “thinking” (and indeed, the industry uses terms like “thinking” and “reasoning” to mean something very different from what one would expect), they’re just doing math.
Current benchmarks for testing LLMs and LRMs, such as AIME24, suffer from data contamination: their problems have leaked into the training data, and thus cannot be used to reliably test model performance. The authors of this study instead took a set of puzzles whose difficulty scales with a complexity parameter N, including the River Crossing Puzzle and the Tower of Hanoi, and tested these models on them. The authors found that “state of the art LRMs still fail to develop generalizable problem solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments.” Both LLMs and LRMs suffer this limitation, though LRMs delay it longer than LLMs:
In the first regime where problem complexity is low, we observe that non-thinking models are capable to obtain performance comparable to, or even better than thinking models with more token-efficient inference. In the second regime with medium complexity, the advantage of reasoning models capable of generating long chain-of-thought begin to manifest, and the performance gap between model pairs increases. The most interesting regime is the third regime where problem complexity is higher and the performance of both models have collapsed to zero. Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts.
This accuracy collapse isn’t merely the models giving a wrong answer: the models themselves seem to stop functioning as intended. As part of their function, LRMs generate additional text to feed into themselves for further “reasoning.” But as problem complexity approaches the collapse threshold, these models start spending fewer tokens on this reasoning process, even though they still have token budget to spare. The models aren’t doing this because they’ve arrived at an optimal solution. Instead, they give up without finding a correct solution! Even when given the algorithmic solution to a problem, the models still cannot produce a solution for problems beyond the collapse threshold! So much for “general intelligence.”
Models seemed to do well on the Tower of Hanoi problems, solving N=5 problems with “near-perfect accuracy.” The River Crossing problem had a much lower complexity threshold for model collapse (collapse at N=3). The authors noted that River Crossing examples seem to be far scarcer online than Tower of Hanoi examples, which makes me suspect that these bots are merely regurgitating solutions they’ve “memorized” from their training data.
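For a sense of what the complexity parameter N means for the Tower of Hanoi: the shortest solution takes 2^N - 1 moves, so N=5 already needs 31 moves, and every extra disk doubles that. Below is a minimal Python sketch of the textbook recursive solution plus a simple checker. To be clear, this is my own illustration, not the paper's code or the prompt the authors gave the models; the function names `hanoi_moves` and `is_valid_solution` and the peg labels are made up for the example.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the (disk, from_peg, to_peg) moves that solve an n-disk puzzle."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear the n-1 smaller disks out of the way
        + [(n, source, target)]                      # move the largest disk to the target peg
        + hanoi_moves(n - 1, spare, target, source)  # stack the smaller disks back on top of it
    )


def is_valid_solution(n, moves):
    """Replay the moves on three pegs and check that every move is legal."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # last item = top disk
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk cannot sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks end up on peg C


for n in (3, 5, 10):
    moves = hanoi_moves(n)
    print(n, len(moves), is_valid_solution(n, moves))
```

The whole algorithm is about ten lines, and a computer running it never "collapses" as N grows; it just takes longer.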
This paper solidified the status of LLMs in my mind as bullshit machines. I doubt that the results of this paper will push the slop mills or their enablers away from churning out filth, but it does make me feel justified in my skepticism. The whole paper is worth reading, as is Eryk Salvaggio’s article “Complete Accuracy Collapse,” where he discusses it:
Generative AI has a market because people are anxious about taking creative risks. That makes these risks – and the people willing to take them – more valuable than ever. The rapid convergence of ideas generated by creative industries will burn itself out. No creative industry can survive on the rapid delivery of its competitors accumulated averages.
I still don’t know what problem “generative AI” is trying to solve. All it seems to do right now is burn up money at a superexponential rate.
Updated by Elliott Weix.