When Machines Count Without Calculating: A Journey to the Heart of a Misunderstanding

MATHEMATICS & AI

By Jp@NeuroStratum — Originally published on January 13, 2026

In brief: LLMs can produce “7 × 8 = 56” while doing “nothing more” than predicting the next word. A paradox? No, a misunderstanding. Describing the statistical mechanism does not tell you what behavior it can produce, and that behavior is sometimes uncanny. These models have learned to mimic the arithmetic patterns of their training data. It often works, but it guarantees nothing. The moment the numbers grow or the formats stray from familiar ground, the house of cards starts to wobble. The rule of thumb: if accuracy matters, hand it off to a real calculator. If the stakes are low, an approximation will do.

⏱ Estimated reading time: 7 minutes

How LLMs produce arithmetic results — and why they don’t always deserve your trust.

The Objection That Stings

“Large language models don’t think. They predict the next token.”

The claim lands like a verdict. And without fail, the counterattack fires back: “Oh really? Then how do they handle 347 × 28?”

You have to admit, the question is sly. If these machines do nothing but guess the next likely word in a sentence, how on earth could they handle a multiplication? Nobody’s ever seen a parrot do long division.

Except a delicious misunderstanding hides behind the exchange. “Predicting the next token” describes the technical how, not the observable what. Treating the two as the same thing is like confusing the mechanics of the piano with the music that comes out of it. A pianist produces sound by striking keys; that doesn’t reduce Chopin to random percussion.

Let me try to clear this up: untangle the paradox, see when LLMs get their calculations right (and why they sometimes fail so spectacularly), and above all give you a practical compass for knowing when to trust them.

Ready for the journey?

First, Let’s Get Our Terms Straight

“Predicting the next token” is the heart of the engine. The model looks at what came before, estimates the probability of each possible next token (a word or a fragment of one), and picks one. Then it starts again with that token added to the text. That’s the mechanism, not the purpose.
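
As a rough illustration of that loop, here is a minimal Python sketch of a single prediction step. The probability table is invented for the example (a real model computes scores over its whole vocabulary), and the names next_token_probs, greedy_pick, and sample_pick are hypothetical.

```python
import random

# Hypothetical next-token probabilities for the prompt "7 x 8 =".
# These numbers are made up for illustration.
next_token_probs = {" 56": 0.97, " 54": 0.02, " 58": 0.01}

def greedy_pick(probs):
    """Return the single most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
print(repr(greedy_pick(next_token_probs)))       # ' 56'
print(repr(sample_pick(next_token_probs, rng)))  # usually ' 56', occasionally not
```

Nothing in either function checks whether the chosen token is arithmetically correct. The right answer wins here only because it was given the highest probability.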

“Reasoning,” in the strong sense, means applying logical rules systematically, with guarantees of validity. A formal computation program reasons: it unfolds axioms and inference rules without flinching.

The distinction that changes everything has three layers: the mechanism — how the output gets made (statistical prediction); the behavior — what the output appears to show (solving a problem); and the guarantee — what’s formally assured (spoiler: nothing, on the LLM side).

What the research tells us is that the behavior can look strikingly like reasoning even when the underlying mechanism offers none of its guarantees. It’s not human reasoning in the strict sense. But it’s not “nothing” either. It’s something else — and that something is fascinating.

Try It Yourself

Open your favorite LLM — Claude, GPT, Gemini, it doesn’t matter — and run these three exercises by it.

First test: the warm-up. Calculate: 234 + 567. Expected answer: 801. Most models handle it without flinching.

Second test: turning up the heat. Calculate: 987654321 × 123456789. Now things get murky. The exact answer, 121932631112635269, is eighteen digits long, and models working without tools often get some of those digits wrong.

Third test: the nasty trap. Which is bigger: 9.11 or 9.9?

You know the answer: 9.9 is bigger. But some models reply “9.11”. One plausible explanation is that they compare the digits after the decimal point as if they were whole numbers (11 > 9).
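
For reference, the three answers can be checked with exact arithmetic. The sketch below uses Python's built-in integers, which have arbitrary precision, and the standard decimal module. The last lines reproduce the faulty comparison described above, purely as an illustration of one plausible failure mode, not as a claim about what happens inside any particular model.

```python
from decimal import Decimal

# Test 1: small addition.
warm_up = 234 + 567
print(warm_up)  # 801

# Test 2: Python integers have arbitrary precision, so the product is exact.
big_product = 987654321 * 123456789
print(big_product)            # 121932631112635269
print(len(str(big_product)))  # 18

# Test 3: compare the values as decimal numbers.
correct_comparison = Decimal("9.9") > Decimal("9.11")
print(correct_comparison)  # True

# The trap: comparing the digits after the point as if they were integers.
frac_a = "9.11".split(".")[1]  # "11"
frac_b = "9.9".split(".")[1]   # "9"
trap_comparison = int(frac_a) > int(frac_b)
print(trap_comparison)  # True, which wrongly suggests 9.11 is bigger
```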

Why these differences? The key is called tokenization. To you, “987654321” is a number. To the model, it’s a string of chunks sliced up according to rules that have nothing to do with mathematics. See the gap?
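
To get a feel for that slicing, here is a toy Python sketch. It is not a real tokenizer: it simply cuts a digit string into chunks of at most three characters, either from the left or from the right, as a stand-in for the kind of fixed rules a tokenizer might apply. The function names are hypothetical.

```python
def chunk_left_to_right(digits, size=3):
    """Cut a digit string into chunks of `size`, starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_right_to_left(digits, size=3):
    """Cut a digit string into chunks of `size`, aligned on the right."""
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks

print(chunk_left_to_right("987654321"))  # ['987', '654', '321']
print(chunk_left_to_right("1234567"))    # ['123', '456', '7']
print(chunk_right_to_left("1234567"))    # ['1', '234', '567']
```

The same number, 1234567, ends up as different pieces depending on the slicing direction, and only the right-aligned version lines up with thousands, millions, and so on. The chunks are chosen by string rules, not by place value.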

How It Works (When It Works)

Memory of regularities. An LLM trained on billions of texts has run into “7 × 8 = 56” a staggering number of times. It’s absorbed these patterns like a sponge. For small numbers, this implicit memorization does the heavy lifting. No need to “calculate”: recognizing the pattern and completing it is enough.

The limits of generalization. The GSM-Symbolic study made waves in 2024. Changing only the numerical values of a grade-school math problem, without touching its logic, measurably shifts performance. More troubling: adding a single clause of irrelevant information caused performance drops of up to 65% across the models tested. The noise pulls the model off course.

Exact or plausible? That’s the knot at the heart of the problem. An LLM doesn’t intrinsically distinguish between “correct answer” and “answer that sounds right.” To the model, “7 × 8 = 56” and “7 × 8 = 54” are two strings of tokens; one simply appears far more often in the training data.
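
A deliberately crude Python caricature makes the point. The “model” below is just a frequency table over a made-up corpus, which a real LLM is not, but it shares the key property: the answer it returns is the most common one, with no check on whether it is true. The corpus and the function name are invented for the example.

```python
from collections import Counter

# Made-up "training corpus": the correct line is simply the most frequent one.
corpus = ["7 x 8 = 56"] * 98 + ["7 x 8 = 54"] * 2

def most_frequent_completion(prompt, lines):
    """Return the completion seen most often after `prompt`, or None if unseen."""
    counts = Counter(
        line[len(prompt):] for line in lines if line.startswith(prompt)
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(most_frequent_completion("7 x 8 = ", corpus))     # 56
print(most_frequent_completion("347 x 28 = ", corpus))  # None
```

Flip the counts in the corpus and the table would answer “54” with the same confidence. For a prompt it has never seen, a lookup table has nothing to offer. A real model generalizes far better than this, but what it optimizes is still likelihood, not correctness.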

The Distinction That Changes Everything

Empirical performance: “The model answers correctly in X% of cases.” That’s a statistical measure. Useful, but partial.

Formal guarantee: “The system always produces the right answer.” That’s what your pocket calculator offers.

The difference isn’t one of degree. It’s one of nature. It’s the difference between “often works” and “always works.” Between talent and certainty.
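
One way to make the gap concrete, under simplifying assumptions: suppose a model gets each individual calculation right 99% of the time (a made-up figure) and that its errors are independent of one another. The chance that a whole chain of calculations is correct then shrinks with the length of the chain, whereas a calculator stays at 100%.

```python
def chance_all_correct(per_step_accuracy, steps):
    """Probability that every step is right, assuming independent errors."""
    return per_step_accuracy ** steps

# 0.99 is an illustrative accuracy, not a measured one.
for steps in (1, 10, 100):
    print(steps, round(chance_all_correct(0.99, steps), 3))
# 1 0.99
# 10 0.904
# 100 0.366
```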

The Pianist and the Mathematician

A virtuoso can play a sonata sublimely. But a wrong note can slip in. The mathematician proving a theorem doesn’t get that luxury: either the proof holds, or it collapses.

The LLM is on the pianist’s side. Brilliant, often right, sometimes off-key.

The Real Solution: Hybrid Systems

The naked LLM. Default mode. You ask a question, the model generates a text answer. No guarantee of arithmetic accuracy. It’s text about calculation — not calculation itself.

The tooled LLM. This is the approach that reshuffles the deck. GPT-4 with Code Interpreter, Claude with its tool-use capabilities: the LLM interprets the question and generates the code, and an external interpreter runs the actual computation. In the evaluation of clinical calculations listed under further reading (npj Digital Medicine), adding tools cut errors by a factor of roughly 5 to 13. That’s not a marginal improvement; it’s a paradigm shift.
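
The division of labor can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product implements tool use: the string model_tool_request stands in for whatever expression a model might hand to its tool, and evaluate is a hypothetical helper that accepts only integer arithmetic with +, -, * and parentheses.

```python
import ast
import operator

# Only these binary operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}

def evaluate(expression):
    """Exactly evaluate an integer expression; reject anything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))

# Hypothetical tool request, standing in for what a model might emit.
model_tool_request = "347 * 28"
print(evaluate(model_tool_request))  # 9716
```

The model's job is to translate the question into the expression. The number itself comes from the interpreter, so it is computed rather than predicted. If the model writes the wrong expression, the tool will faithfully compute the wrong thing, which is why the result still deserves a sanity check.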

What to Take Away

LLMs can produce correct arithmetic results while being “only” token predictors. These models have learned regularities that work well in common cases. And they fail in predictable ways at the edge cases.

The Rule of Thumb

If accuracy is critical — accounting, engineering, medical work — offload to a dedicated tool and check the result. Always.

If the stakes are low — estimation, orders of magnitude — an approximation will do.

In between? Use your judgment. And when in doubt, a calculator costs nothing.

As the old saying goes: “Trust, but verify.” LLMs didn’t invent that advice. They’ve just made it more relevant than ever.
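
In practice, verifying can be as simple as recomputing. A minimal Python sketch, where check_product is a hypothetical helper and the claimed values stand in for answers copied from a model's reply:

```python
def check_product(a, b, claimed):
    """Recompute a * b exactly and compare it with the claimed result."""
    exact = a * b
    return exact == claimed, exact

print(check_product(347, 28, 9716))  # (True, 9716)
print(check_product(347, 28, 9616))  # (False, 9716)
```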

And You — What Do You Think?

Have you ever been surprised — pleasantly or not — by an LLM’s arithmetic?

Share your experience in the comments: I’m curious whether you’re in the verify-everything camp, or the trust-your-gut camp.


Written with the support of AI to help organize the thinking and shape the language.

Jp@NeuroStratum

For Further Reading

  • GSM-Symbolic: Mirzadeh et al. (2024), Understanding the Limitations of Mathematical Reasoning in LLMs, ICLR 2025. The study that shifted the conversation on LLM reasoning capabilities. arxiv.org/abs/2410.05229
  • Tokenization Counts: Singh & Strouse (2024), The Impact of Tokenization on Arithmetic in Frontier LLMs. The first systematic study on the link between tokenization and arithmetic. arxiv.org/abs/2402.14903
  • Survey Mathematical Reasoning: Ahn et al. (2024), Large Language Models for Mathematical Reasoning: Progresses and Challenges, EACL 2024. The most comprehensive map of the territory. arxiv.org/abs/2402.00157
  • Numerical Precision: Feng et al. (2024), How Numerical Precision Affects Arithmetical Reasoning Capabilities of LLMs. The theoretical analysis of arithmetic capabilities in Transformers. arxiv.org/abs/2410.13857
  • LLM Agents + Tools: Goodell et al. (2025), Large Language Model Agents Can Use Tools to Perform Clinical Calculations, npj Digital Medicine. The most rigorous evaluation of the benefit of tools for clinical calculations. nature.com/articles/s41746-025-01475-8

Originally published on Skool IA Mastery on January 13, 2026.
