The material finding of this paper is that reasoning models are better than non-reasoning models at solving puzzles of intermediate complexity (where that's defined, essentially, by how many steps are required), but that performance collapses past a certain threshold. This threshold differs for different puzzle types. It occurs even if a model is explicitly supplied with an algorithm it can use to solve the puzzle, and it's not a consequence of limited context window size.
The authors speculate that this pattern is a consequence of reasoning models actually solving these puzzles by way of pattern-matching to training data, which covers some puzzles at greater depth than others.
Great. That's one possible explanation. How might you support it?
- You could systematically examine the training data, to see if less representation of a puzzle type there reliably correlates with worse LLM performance.
- You could test how successfully LLMs can play novel games that have no representation in the training data, given instructions.
- Ultimately, using mechanistic interpretability techniques, you could look at what's actually going on inside a reasoning model.
This paper, however, doesn't attempt any of these. People are getting way out ahead of the evidence in accepting its speculation as fact.
While I agree overall, LLMs are pattern matching, just in a complicated way.
You transform your training data into a very strange, high-dimensional space. Then, when you write an input, you compute the distance between that input and the closest point in that space.
So, in some sense, you pattern-match your input against the training data. Of course, in a way that is very unintuitive for humans.
Now, that doesn't necessarily imply things like 'models cannot solve new problems they have never seen before': a problem might get matched to something that looks completely unrelated to us but makes sense in that space.
So with your proposed experiments, if the model manages to solve a puzzle it has never seen before, you still won't know why, and it doesn't rule out that the new puzzle was matched, in some sense, to earlier data in the training set.
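To make the analogy concrete, here is a toy sketch of "match the input to the closest training point in some space". It is not how a transformer actually computes anything; the embedding below is just a bag-of-words counter standing in for the learned high-dimensional space:

    import math
    from collections import Counter

    def embed(text):
        # Stand-in for a learned, high-dimensional embedding.
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    training = ["move the smallest disc first",
                "swap the two outer pegs",
                "stack the discs by size"]
    query = "which disc should move first"
    vectors = [embed(t) for t in training]
    q = embed(query)
    best = max(range(len(training)), key=lambda i: cosine(q, vectors[i]))
    print(training[best])  # the training point the query "pattern-matches" to

In a real model the space is learned and the matching is implicit in the weights, which is exactly why a match can look unrelated to us while still making sense in that space.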
"They're super expensive pattern matchers that break as soon as we step outside their training distribution" - I find it really weird that things like these are seen as some groundbreaking endgame discovery about LLMs
LLMs have a real issue with polarisation. It's probably smart people saying all this stuff about knockout blows, and LLM uselessness, but I find them really useful. Is there some emperor's-new-clothes thing going on here, or am I just a dumbass who can't see that he's getting excited about a random noise generator?
It's like if I saw a headline about a knockout blow for cars because SomeBigName discovered it's possible to crash them.
It wouldn't change my normal behaviour, it would just make me think "huh, I should avoid anything SomeBigName is doing with cars then if they only just realised that."
Marcus's schtick is going on about things like "knockout blows, and LLM uselessness". He's kind of the go-to AI/LLM naysayer.
Finding it useful is different than "I can replace my whole customer service department with an LLM", which the hype is convincing people is possible. You're not a dumbass; I hate LLMs and even I admit they're pretty good at coding.
Marcus's writing is from a scientific perspective: this isn't general artificial intelligence, and probably not a meaningful path to AGI.
But the ivory tower misses the point of how LLMs have improved the ability of regular people to interact with information and technology.
While it might not be the grail they were seeking, it's still a useful thing that will improve life and, in turn, be improved.
Marcus's writing is from the perspective of someone who is situated in the branch of AI that didn't work out - symbolic systems - and has a bit of an axe to grind against LLMs.
He's not always wrong, and sometimes useful as a contrarian foil, but not a source of much insight.
I don't see the relevance of this paper to AGI, assuming one considers humans generally intelligent. Humans exhibit the same behavior: each person has a complexity limit beyond which they are unable to solve tasks in a reasonable amount of time. For very complex tasks, even training becomes unfeasible.
From a quick glance, it seems to be about spatial reasoning problems. I think there are good reasons why it's tricky to become extremely good at these from being trained on text and static images. Future models trained further on multimodal data, video and then physics simulators, should handle this much better, I think.
There's a recent talk about this by Jim Fan from Nvidia https://youtu.be/_2NijXqBESI
It would be awesome to develop some theory around what kinds of problems LLMs can and cannot solve. That should deter some leads from pushing to solve the unsolvable with this technology.
That being said, this isn’t a knockout blow by any stretch. The strength of LLMs lies in the people who are excited about them. And there’s a perfect reinforcing mechanism for the excitement - the chatbots that use the models.
Admit for a second that you’re a human with biases. If you see something more frequently, you’ll think it’s more important. If you feel good when doing something, you’ll feel good about that thing. If all your friends say something, you’re likely to adopt it as your own belief.
If you have a chatbot that can talk to you more coherently than anyone you’ve ever met, and implement these two nested loops that you’ve always struggled with, you’re poised to become a fan, an enthusiast. You start to believe.
And belief is power. Just as advances in neuroscience have not been able to retire the concept of body-and-soul dualism, testing LLMs will not be able to retire the idea that AI is poised to dominate everything soon.
>It would be awesome to develop some theory around what kinds of problems LLMs can and cannot solve. That should deter some leads from pushing to solve the unsolvable with this technology.
That could have unfortunate consequences. Most people stopped looking at neural nets for years because they incorrectly believed that Minsky and Papert's 1969 proof that perceptrons (linear neural nets) couldn't solve basic problems applied to neural nets in general. So the field basically abandoned neural nets for a couple of decades, which were more or less wasted on "symbolic" approaches to AI that accomplished little.
The first figure in the paper, Accuracy vs. Complexity, makes the whole point moot. The authors find that the performance of Claude 3.7 collapses around complexity 3 while Claude 3.7 Thinking collapses around complexity 7. That is a massive improvement in the complexity horizon that can be dealt with. It's real and it's quantitative, so what's the point of philosophical arguments about whether it is truly "reasoning" or not? All LLMs have various horizons: a context horizon/length, a complexity horizon, and so on. Reasoning pushes these out further, but not to some infinite, algorithmically perfect recurrent-reasoning effect. But I bet humans pretty much just have a complexity horizon of 12 or 20 or whatever, and bigger models trained on bigger data with bigger reasoning posttraining and better distillation will push the horizons further and further.
> bigger models trained on bigger data with bigger reasoning posttraining and better distillation will push the horizons further and further
There is no evidence this is the case.
We could be in an era of diminishing returns where bigger models do not yield substantial improvements in quality, but instead become faster, cheaper, and more resource-efficient.
I would claim that o1 -> o3 is evidence of exactly that, and supposedly in half a year we will have even better reasoning models (with a further complexity horizon), so what could that be besides what I am describing?
Is there some breakthrough in reasoning between o1 and o3 that we are all missing?
And no one cares what we may have in the future. OpenAI etc. already have an issue with credibility.
No breakthrough, it's just better in some quantitatively measurable way.
The empirical scaling laws are evidence. They're not deductive evidence, but still evidence.
The scaling laws themselves advertise diminishing returns, something like a natural log. This was never debated by AI optimists, so it's odd to suggest otherwise as if it contradicts anything the AI optimists have been saying.
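To put a rough number on that diminishing-returns shape: the published scaling laws fit loss to a power law in model size. A quick back-of-the-envelope sketch; the constants are roughly the Kaplan et al. (2020) fit for parameter count, so treat the exact numbers as assumptions rather than gospel:

    # Power-law scaling in model size N: L(N) ~ (Nc / N) ** alpha.
    # Nc and alpha are roughly the Kaplan et al. (2020) parameter-only fit;
    # they are illustrative here, not authoritative.
    Nc, alpha = 8.8e13, 0.076
    for N in [1e9, 1e10, 1e11, 1e12]:
        print(f"params = {N:.0e}, predicted loss = {(Nc / N) ** alpha:.3f}")
    # Every 10x in parameters multiplies the predicted loss by a constant
    # factor (~0.84 here), so each order of magnitude buys a smaller
    # absolute improvement, which is the diminishing-returns shape above.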
The scaling laws are kind of a worst case scenario, anyway. They assume no paradigm shift in methodology. As we saw when the test-time scaling law was discovered, you can't bet on stasis here.
How about we stop treating AI (I guess LLMs here) as some monolithic block that applies to every problem we could ever have on this earth? Claude might suck at Tower of Hanoi, but in the end we will just use something better suited to the job. Nobody complains that ML or vision models "fail spectacularly" at something they're not suited for.
I get the criticism, but "on the ground" there's real stuff getting done that couldn't be done before. All of this boils down to an intellectual study which, while good to know, is meaningless in the long run. The only thing that matters is whether the dollars put in can be recouped to the level of hype created, and that answer is probably "maybe" in some areas but not others.
This AI doomerism is getting just as annoying as people claiming AI will replace everyone and everything.
I always assumed LLMs would be one component of “AGI”, but there would be “coprocessors” like logic engines or general purpose code interpreters that would be driven by code or data produced by LLMs just in time.
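Something like that division of labour already exists with tool use: the LLM writes code just in time and a plain interpreter does the exact part. A minimal sketch, where llm_generate is a hypothetical stand-in for a real model call:

    # The "coprocessor" split: the LLM produces code, a deterministic
    # interpreter does the exact computation.
    def llm_generate(prompt: str) -> str:
        # Hypothetical stand-in for a real model API; pretend the model
        # returned this Python in response to the prompt.
        return "def solve(n):\n    return sum(range(1, n + 1))"

    namespace = {}
    exec(llm_generate("Write solve(n) returning the sum of 1..n"), namespace)
    print(namespace["solve"](100))  # 5050, computed by the interpreter, not the LLM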
> neural networks of various kinds can generalize within a training distribution of data they are exposed to, but their generalizations tend to break down outside that distribution.
So the AI companies really took that to heart and tried to put everything into the training distribution. My stuff, your stuff, their stuff. I remember the good old days of feeding the Wikimedia dump into a Markov chain.
The paper shows that reasoning is better than no reasoning, that reasoning needs more tokens to work for simple tasks, and that models get confused when things get too complicated. Nothing interesting; it's on the level of what an undergrad would write for a side project. If it wasn't "from Apple", no one would be mentioning it.
They do cite https://arxiv.org/abs/2503.23829, which is interesting: if you have lots of tokens to burn, just try the task lots of times, and you can find better solutions than reasoning would on its first try. Only tested on small models, though.
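The idea there boils down to best-of-N sampling against a checker. A rough sketch, assuming you have some way to draw candidate answers and a cheap task-specific verifier; both functions below are hypothetical placeholders, not anything from the paper:

    import random

    def sample_candidate(task):
        # Placeholder for one stochastic model completion.
        return random.randint(0, 10)

    def is_correct(task, answer):
        # Placeholder for a task-specific verifier (unit tests, exact match, ...).
        return answer == task["target"]

    task = {"target": 7}
    attempts = [sample_candidate(task) for _ in range(64)]   # spend tokens on many tries
    passing = [a for a in attempts if is_correct(task, a)]
    print(passing[0] if passing else "no attempt passed the check")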
I think these kinds of papers are necessary to ground people back in reality - the hype machine is too strong to go unchecked.
All the papers I’ve seen show models have limits. This is just an attention grab by lazy “researchers” cashing in on their Apple credentials.
Samy Bengio is a co-author of Torch. His credentials speak for themselves.
Oh, another LLM skepticism paper from Apple.
This paper from last year hasn't aged well due to the rapid proliferation of reasoning models.
https://machinelearning.apple.com/research/gsm-symbolic
Is Apple putting out these papers just to justify its seeming inability to properly integrate these models into its software?
They seem to be very skeptical of large models.
While everyone else learned the bitter lesson, Apple chose to focus on small on-device models even after the explosion of ChatGPT.
I think it's because they are betting on JEPA and they're trying to carve out brand identity by being special little Steve Jobs clones. They're taking "Think Different" to its neurotic limits.
"See this is why we can't build with transformers and had to use JEPA and look how much better it is!"
Related discussion: https://news.ycombinator.com/item?id=44203562
Argh please stop. Everyone knows LLMs aren't AGI currently and they have annoying limitations like hallucinations. Even the "giving up" thing was known before Apple's paper.
You aren't winning anything by saying "aha! I told you they are useless!" because they demonstrably aren't.
Yes, everybody is hoping that someone will come up with a better algorithm that solves these problems, but until they do it's a little like complaining about the invention of the railway because it can only go on tracks while humans can go pretty much anywhere.
You don’t need to feel offended on behalf of LLMs; they are tremendously useful, but not above criticism.
There is a belief being peddled that AGI is right around the corner and that we can get there just by scaling up LLMs.
Papers like this are a good takedown of that thinking.
> aha! I told you they are useless
You said this. Neither Apple nor the author did.
The focus was specifically on LLMs' reasoning capabilities, not on whether they are entirely useless.
This is relevant because countless startups and a great deal of investment are predicated on LLMs' current capabilities being something that can be improved and built on top of. If this is a technological dead end, then we could be in for another long lull in progress, and companies like OpenAI should have their valuations massively cut.
It also constrains the level of investment Apple would need in order to be comparable to top-tier LLM companies.
Just before you write your comment, consider that the author very specifically says that LLMs are not useless.
He talks about the limits of reasoning with the Tower of Hanoi game. So I asked Gemini to make it — and then to solve it. You can try it yourself: https://g.co/gemini/share/eb8b68d1dace
Why would they need to execute the algorithm? That feels like complaining your fork doesn’t cut things like a knife would…
The point they draw is that the results are non sequiturs. Reasoning models produce chains of thought (and are sometimes even correct in the chain of thought) and still produce a wrong, logically inconsistent answer. The extreme example of this is giving the model a step-by-step guide to complete the puzzle (the Towers problem) and finding that it still cannot produce answers consistent with the provided plan. So not only is it unable to produce a robust chain of thought; even when the chain is correct and explicit, it cannot turn that information into a reasoned response.
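For context, the procedure those models were handed is tiny; the failure is in executing it faithfully, not in discovering it. A sketch of the standard recursive solution (peg names are arbitrary placeholders):

    # Standard recursive Tower of Hanoi: move n discs from src to dst via aux.
    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        moves = [] if moves is None else moves
        if n > 0:
            hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 discs on the spare peg
            moves.append((src, dst))             # move the largest remaining disc
            hanoi(n - 1, aux, src, dst, moves)   # re-stack the n-1 discs on top of it
        return moves

    print(len(hanoi(8)))  # 255 moves, i.e. 2**8 - 1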
I made a zero-copy video recorder in C for Linux, and I barely know anything about C, pointers, or Vulkan at all. LLMs are improving rapidly.
Algorithms made to imitate humans exhibit human weaknesses. What a terrible, unexpected outcome! I love how the article is written, but it is literally proving the opposite of its premise.
Playbook:
1) you want to "disprove" some version of AI. Doesn't really matter what.
Take a problem humans face. For example, an almost total inability to follow simple rules for a long time to make a calculation. It's almost impossible to get a human to do this.
Check if AI algorithms, which are algorithms made to imitate humans, have this same problem. Now of course, in practice, if they do have that problem, that is actually a success: an algorithm made to imitate humans ... imitates humans successfully, strengths and weaknesses alike! But of course, if you find it, you describe it as total proof that this algorithm is worthless.
An easy source for these problems is of course computers. Anything humans use computers for ... it's because humans suck at doing it themselves. Keeping track of history or facts. Exact calculation. Symbolic computation. Logic (i.e. exactly correct answers). More generally, math and even the positive sciences as a whole are an endless supply of such problems.
2) you want to "prove" some version of AI.
Find something humans are good at. Point out that AIs do this. Humans are social animals, so how about influencing other humans? From convincing your boss, or on a larger scale using a social network to win an election, right up to actual seduction. Use what humans use to do it, of course (i.e. be inaccurate, lie, ...).
Point out what a great success this is. How magical it is that machines can now do this.
3) you want to make a boatload of money
Take something humans are good at but hate, have an AI do it for money.
In other news, water is wet.
I don't think anybody who uses LLMs professionally day-to-day thinks that they can reason like human beings... If some people do think this, they fundamentally do not understand how LLMs work under the hood.
I think most people, my close relatives included, who use LLMs professionally day-to-day do not understand how LLMs work under the hood.
There are quite a lot of options out at the extremes, and both ends seem to presuppose things about consciousness that even specialists in the field have debated for years.
I'm OK with thinking it's possible that some subset of consciousness might exist in LLMs while also being well aware of their limitations. Cognitive science has plenty of examples of mental impairments showing that there are individuals who lack some of the things that LLMs also lack, and we would hardly deny that those individuals are conscious. The bar for what counts as thought is lower, but no less complex.
Before we had machines pushing at these boundaries, there were very learned people debating these issues, but it seems like now some gut instincts from people who have chatted to a bot for a bit are carrying the day.
Oh buddy, step out of your bubble. There are people out there who swear by LLMs being a modern-day messiah. And no, these are not just SV VCs trying to sell their investments.
Sure but VCs always exaggerate. Remember the dot com bubble? Or "there's an app for that"?
The internet and smartphones were still extremely useful. There's no need to refute VC exaggeration. It's like writing articles to prove that perfume won't get you Brad Pitt. Nobody literally believes the adverts, but that doesn't mean perfume is a lie.
I'm not saying that VCs only push good ideas - e.g. flying cars & web3 aren't going to work. Just that their claims are obviously exaggerated and can be ignored, even for useful ideas.
Citing a few points to justify my own conclusion:
> Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs.
> LLMs are no substitute for good well-specified conventional algorithms.
> will continue have their uses, especially for coding and brainstorming and writing
> But anybody who thinks LLMs are a direct route to the sort AGI that could fundamentally transform society for the good is kidding themselves.
I agree with the assessment but disagree with the conclusion:
Being good at coding, writing, etc. is precisely the sort of labor that both counts as “general intelligence” and will radically change society when clerical jobs are mechanized, and LLMs' ability to write (and interface with) classical algorithms to buttress their performance will only improve.
This is like when machines came for artisans.