LLMS: A PROBABILISTIC HEURISTIC, NOT A GUARANTEED SOLVER
Introduction
It was not long ago that LLMs were called stochastic parrots: models that ingest the internet, amplify biases, and push hegemonic worldviews. Their reasoning and mathematical skills were weak, and the weaknesses easily noticeable. However, with each release, better versions of LLMs were pushed, sometimes very rapidly, to counter any questions about their capabilities: ChatGPT 5.0 was released in September and 5.1 just a couple of months later. Similarly, Gemini 3.0 was released in November 2025, while 2.5 had come out in March 2025.
Looking back over the last three to four years, there has been remarkable progress in technical capabilities, architecture, science, and methodology, which has made the job of critics evaluating LLMs far more daunting and demanding.
LLMs are built from many layers of transformers trained with self-supervised learning to predict the next token. They are then fine-tuned on domain-specific data (research papers, code repositories, curated web data) and later on specially prepared labelled data. In this article we walk through the major breakthroughs of the last four to five years that changed how LLM reasoning skills improve, or appear to improve.
More is Different: Emergent Capabilities
Back in 2021-2022 there was ample evidence to suggest that LLMs functioned like students who cannot answer out-of-syllabus questions. However, things improved with each release. It is still not clear whether the observed performance gains were due to more training, more parameters (GPT-4 was widely rumored to be a multi-trillion-parameter model), exposure during training to all kinds of benchmark data, or to Emergent Capabilities (explained below).
One hypothesis states that, owing to the massive datasets and training involved, these models have developed Emergent Capabilities, with reasoning ability, problem solving, and in-context learning believed to be among them. The idea of emergence was originally put forward by Nobel Laureate Philip Anderson in his landmark essay More Is Different. His work was in physics, but the underlying philosophy has been applied to other sciences, including AI. The essay critiqued the reliance on fundamental laws alone for all problems and emphasized how scale brings genuinely new behavior, which came to be called emergent. For example, understanding a single electron is one thing, but understanding the collective behavior of billions of electrons leads us to superconductivity, an emergent property. This hypothesis was later carried over to LLMs: compared with smaller models, large models appear to develop new capabilities purely from added scale and complexity.
Another interesting development was research showing that structured prompts help unlock these emergent capabilities in large models, whereas smaller models show no such capabilities. These structured prompts, a series of step-by-step prompts, came to be known as Chain of Thought (discussed in detail below) and gave rise to the whole new field of prompt engineering. Using COT prompting, GPT-4 and Gemini 1.0 Ultra fared better on the GSM8K benchmark, scoring 87% and 92% respectively.
Even though the more-is-different hypothesis is sound, no single rule or hypothesis governs the science underneath LLMs. For example, InstructGPT, despite being over 100 times smaller than GPT-3, outperformed GPT-3 on a variety of evaluations. This was achieved through fine-tuning with Reinforcement Learning from Human Feedback (RLHF). So is emergent capability (developed through massive size and training) not the only ingredient of an intelligent LLM? Are there other techniques which, if implemented in a small model, let it outclass much larger ones?
Chain of Thought (COT) & Reinforcement Learning
Chain of Thought (COT) and Reinforcement Learning from Human Feedback (RLHF) were major breakthroughs that remarkably changed how LLMs work. We discussed earlier how COT-style prompts surface emergent capabilities better than standard prompting.
GPT-4 and Gemini 1.0 Ultra scored 53.2% and 52.9% respectively on advanced math benchmarks. Using COT, however, the same models fared far better on GSM8K, scoring 87% and 92% respectively. COT began as a prompt-engineering technique built around step-by-step structured querying. It was later built natively into models such as OpenAI's o1 and DeepSeek R1, which were also trained with reinforcement learning; these were the first LRMs (Large Reasoning Models). Notably, DeepSeek-R1-Zero was trained using reinforcement learning alone, without supervised fine-tuning (SFT).
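To make the difference concrete, here is a minimal sketch contrasting a standard prompt with a COT-style prompt for a GSM8K-style word problem. The question, the exemplar, and the "think step by step" cue are illustrative; no real model API is called.

```python
# Illustrative sketch: the same grade-school math question posed as a
# standard prompt versus a chain-of-thought (COT) prompt.

QUESTION = ("A shop sells pens at $3 each. Tom buys 4 pens and pays "
            "with a $20 bill. How much change does he get?")

def standard_prompt(question: str) -> str:
    # Direct question: the model is asked only for the final answer.
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    # One worked exemplar showing intermediate steps, then the real
    # question with an explicit step-by-step cue.
    exemplar = (
        "Q: A box holds 6 eggs. How many eggs are in 3 boxes?\n"
        "A: Each box holds 6 eggs. 3 boxes hold 3 * 6 = 18 eggs. "
        "The answer is 18.\n"
    )
    return exemplar + f"Q: {question}\nA: Let's think step by step."

print(standard_prompt(QUESTION))
print(cot_prompt(QUESTION))
```

The only change is in the prompt text, which is what made COT so easy to adopt: no retraining, just a worked example and a cue that elicits intermediate steps.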
The reasoning models performed well across benchmarks. Still, questions were raised: the models excelled at problems they had seen before or been trained on, but struggled with novel variations, indicating limits to true reasoning. This led critics to the popular belief that LLM reasoning is a mirage, an illusion of thinking emerging from memorized or interpolated patterns in the training data rather than logical inference.
A paper by Apple demonstrated that, on a set of complex problems, increasing the complexity has adverse effects on reasoning ability: the models fared well on easy and mid-level problems but failed to produce accurate results for advanced ones. Despite faring well on mathematical benchmarks, they would fail on even minor variations of problems they had been answering correctly. So were LLMs still stochastic parrots, with any appearance of structured reasoning a mirage? Repeating my point that no single hypothesis governs the science of LLMs, another conflicting piece of evidence then arrived: the discovery of symbolic reasoning inside LLMs.
Mechanistic Interpretability & Symbolic Reasoning
Mechanistic interpretability is a research field in AI that seeks to reverse-engineer LLMs, or any neural network, to understand how they work internally. Using this approach, researchers behind the article Abstract Reasoning in Large Language Models were able to establish the presence of abstract reasoning in some of the attention heads within Llama-3-70B. The kind of abstraction they found is what we understand as symbolic reasoning in humans.
Consider the line "The glorp hits the dax". Our mind shows symbolic reasoning by immediately recognizing glorp as the actor and dax as the object. A similar abstraction happens in some of the network's attention heads, which, much like the brain, assign variables to contexts. For example, the pattern dog, cat, dog is assigned the abstract pattern A, B, A, and tiger, goat, tiger receives the same assignment. These abstraction heads take their values from the context of the sentence, irrespective of variation in the surface sequence. The depth of such abstraction is not clear from the article, but even proving the presence of such ability, however nominal, is game-changing for LLMs. We will explore symbolic AI below, but that LLMs develop this ability even without symbolic training is striking.
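The variable assignment described above can be imitated with a few lines of explicit code. This is a toy symbolic re-implementation of the idea, not the actual attention-head mechanism the researchers found inside Llama-3-70B:

```python
# Toy sketch: map concrete token sequences to abstract role variables
# (A, B, A), so that (dog, cat, dog) and (tiger, goat, tiger) land on
# the same abstract pattern regardless of surface tokens.

def abstract_pattern(tokens):
    variables = {}   # concrete token -> abstract variable
    labels = []
    for tok in tokens:
        if tok not in variables:
            # assign the next unused variable name: A, B, C, ...
            variables[tok] = chr(ord("A") + len(variables))
        labels.append(variables[tok])
    return labels

print(abstract_pattern(["dog", "cat", "dog"]))      # ['A', 'B', 'A']
print(abstract_pattern(["tiger", "goat", "tiger"])) # ['A', 'B', 'A']
```

The interesting finding is that circuits doing something functionally similar appear to arise inside the network from training alone, without anyone writing such rules.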
Neuro-Symbolic Artificial Intelligence
LLMs draw inspiration from cognitive science. If AI had a role model, it would definitely be cognitive science.
There are two foundational paradigms on which AI has been based: connectionism and symbolism. Connectionism is what we understand as neural networks, from which layers of attention and transformers are built. Symbolism refers to the concepts, knowledge, and logical rules constructed by human beings.
Cognitive scientists describe symbolic reasoning in humans as the ability to internally represent numbers, logical relationships, and mathematical rules in an abstract, amodal fashion. For example, when we see a danger signboard with a deer, we use symbolic logic to deduce: Triangle = Warning + Deer = Risk → "Drive Carefully."
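The road-sign deduction above is exactly the kind of thing symbolic AI writes down explicitly. A minimal sketch, with illustrative symbol and rule names of my choosing:

```python
# Toy symbolic deduction: explicit symbols plus a hand-written rule,
# the opposite of learned weights in a neural network.

SYMBOLS = {"triangle": "warning", "deer": "wildlife_risk"}

def deduce(shapes):
    # translate perceived shapes into abstract meanings
    meanings = {SYMBOLS[s] for s in shapes if s in SYMBOLS}
    # rule: warning + wildlife_risk -> drive carefully
    if {"warning", "wildlife_risk"} <= meanings:
        return "Drive carefully"
    return "No action"

print(deduce(["triangle", "deer"]))  # Drive carefully
```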
Neuro-Symbolic AI is a combination of these two divergent paradigms: Connectionism + Symbolic AI.
One example use case of neuro-symbolic AI in LLMs is guardrails. Most LLMs have guardrails against hate speech: the input text is checked against the rules specified in the guardrails, and if it passes it is sent on to the neural network for a reply; otherwise the message is blocked or a standard reply is sent back to the user. This is a simplistic explanation of a very basic use case; what happens underneath is rather more complicated.
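The routing described above can be sketched in a few lines. The rule set and the `call_llm` stub are illustrative assumptions, not how any real product implements moderation:

```python
# Minimal sketch of a symbolic guardrail in front of a neural model:
# a hand-written rule layer inspects the input first; only inputs that
# pass the rules reach the (stubbed-out) LLM.

BLOCKED_TERMS = {"hate_term_1", "hate_term_2"}  # placeholder rule set

def call_llm(text: str) -> str:
    return f"LLM reply to: {text}"  # stand-in for a real model call

def guarded_reply(user_input: str) -> str:
    words = set(user_input.lower().split())
    if words & BLOCKED_TERMS:                    # symbolic rule check
        return "Sorry, I can't help with that."  # standard blocked reply
    return call_llm(user_input)                  # connectionist path

print(guarded_reply("tell me about transformers"))
```

The design point is the division of labor: brittle but auditable rules handle the policy decision, while the neural model handles open-ended generation.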
Neuro-symbolic implementation has no standard architecture as of now. As we found above, a degree of symbolic understanding also develops by itself in LLMs, as one of the emergent capabilities discussed earlier. Meanwhile, research is under way on fusing ontologies, Markov logic networks, and graph neural networks into a reasoning layer for LLMs. Such integration has the potential to yield enterprise-grade models that drastically reduce hallucination.
Note: Emergent symbolic-like circuits in LLMs are distinct from intentionally engineered neuro-symbolic architectures.
Agentic AI
Agentic AI is the synergy of connectionist neural models (LLMs), symbolic reasoning structures (rules, tools, graphs), and autonomous agent architectures (planning, memory, self-directed action).
Before delving into Agentic AI, let us understand its inspiration in cognitive science. Humans constantly talk to themselves, in what is called inner thought or inner speech. Cognitive researchers describe inner speech as a dynamic, unstable, fluid phenomenon that appears momentarily between the more clearly formed and stable poles of verbal thinking, that is, between word and thought. Vygotsky, the Soviet psychologist, examined inner speech in detail in his published work. Further research also found inner speech to be a conversation, not a monologue with oneself. This internal dialogue is a direct translation of our thoughts into a structured format that can be used to evaluate choices and guide future actions.
Taking inspiration from the above, Google Brain researchers conceived ReACT (Reasoning + Action) in the landmark paper ReAct: Synergizing Reasoning and Acting in Language Models. They introduced a prompting framework that interleaves thoughts with actions when required; a simple example of an action is querying Wikipedia. This was an improvement over COT, whose step-by-step prompting was monotonic (linear) and gave no role to observations or external information that could shape further reasoning and thereby a more accurate answer.
What started as prompting was later integrated into the models themselves: internal reasoning (as in COT) was exposed to actions such as querying search engines, and the resulting observations could influence the next thought. This cycle of internal reasoning, acting, and observing is the foundation of how current LLM agents operate, and the stronger internal reasoning powering autonomous agents is a large part of Agentic AI's success.
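The thought → action → observation cycle can be sketched as a loop. Here both the model and the Wikipedia tool are stubs of my own invention; a real agent would call an LLM for each thought and a live search API for each action:

```python
# Hedged sketch of a ReACT-style loop: reason, act, observe, repeat,
# until the model emits a final answer.

def fake_llm(prompt: str) -> str:
    # Stand-in policy: ask for one lookup, then answer using it.
    if "Observation:" not in prompt:
        return "Action: search[Eiffel Tower]"
    return "Answer: The Eiffel Tower is in Paris."

def fake_wikipedia(query: str) -> str:
    return f"{query} is a landmark in Paris, France."  # canned observation

def react_loop(question: str, max_steps: int = 3) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step  # the loop terminates with a final answer
        if step.startswith("Action: search["):
            query = step[len("Action: search["):-1]
            # the observation feeds back into the next round of reasoning
            transcript += f"Observation: {fake_wikipedia(query)}\n"
    return "Answer: unknown"

print(react_loop("Where is the Eiffel Tower?"))
```

The key contrast with plain COT is visible in the loop body: each observation is appended to the transcript, so external information can steer the next thought.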
Conclusion: Are LLMs' reasoning skills still a mirage?
We discussed how LLMs keep improving their reasoning skills and their scores on all benchmarks, including mathematical ones. The scale at which these models train is huge, and we still do not know whether LLMs are genuinely capable of structured, human-like reasoning or merely mimic this capacity by statistically approximating their training data, failing on out-of-distribution inputs.
Symbolic AI and the integration of autonomous agents have introduced deductive reasoning; however, how effective their abstract reasoning is remains questionable.
A recent 2025 paper highlights this ambiguity. Mathematicians working with GPT-5 on a complex open problem involving the Malliavin–Stein method found GPT-5 making a blunder in one of the calculations. The catch: the problem was new to the LLM. Since the solution did not exist in the pre-training data, the model relied on its generalization capabilities and, despite prompting by the expert authors, failed to self-correct, replying only with what it had learned in training. Interestingly, the model would likely solve this problem correctly today, thanks to agentic workflows (RAG) that allow it to retrieve the now-published solution from the internet.
Challenging LLMs is no easy feat now. Given the amount of data the models are trained on, and Agentic AI letting them query the internet to improve answers, it is difficult to gauge the level of reasoning these models truly possess. Having said that, there is no doubt that an LLM is the best assistant for providing the most probable answer, but it is not a guaranteed solver.


