AI LLMs Are Astonishingly Bad At Devising Proofs And Disturbingly Resort To Blarney In Their Answers


Turns out generative AI and LLMs are bad at devising math proofs. That’s a surprise. Worse still, the AI tries to make the proofs seem correct when they aren’t.

AI flunks at mathematical proofs and tries to bluff and blarney humans into thinking the proofs are ...

In today’s column, I examine an insightful AI research study that sought to ascertain whether state-of-the-art generative AI and large language models (LLMs) are any good at devising mathematical proofs. You see, there has been plenty of flashy press coverage proclaiming that LLMs do noticeably well at solving complex math problems, but those experiments and tests tend to involve calculating a final numeric answer, not devising a proof.



Having to come up with a mathematical proof when asked to solve a math problem is a whole different ballgame. The bottom line is that not only did the latest AI flunk at deriving proofs, but worse still, the AI insisted its proofs were correct even though they weren’t. It seems that the AI was tilting toward wanting to appear correct and would use blarney and bluffing to make the answers seem to be on the up and up.

As they say, sometimes the coverup is worse than the original crime. Let’s talk about it. This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

I’m sure you remember taking algebra courses and having to carefully show your work when taking the tests. If a question only asked for a final calculated number, you had some wild hope of coming up with a number out of thin air that would be the precise answer or maybe close enough to get you some sympathy points. The toughest situations entailed being asked to show mathematical proofs associated with a given problem.

You would need to identify, step by step, the particulars of the proof. It was common to inadvertently omit a required step. Points would be deducted for the omission.

It was also common to make assumptions that you either didn’t state or that you quietly relied upon, sneakily coping with whatever issues you faced while trying to get the proof to logically work out. Again, an astute grader would undoubtedly catch your brazen attempt and ding you points accordingly. Proofs do not allow any place to hide.

It’s all out there, front and center. You either figured out the right steps or you didn’t. Bluffing and blarney might be tried as a last resort.

The outsized belief is that a busy grader might fail to detect your deviousness, and you will manage to get full points. Many math students submit sketchy and incomplete proofs, finish the test, and then hold their breath that their squirrelly attempts will pass muster.

Perhaps the graders will be impressed by the brash ploy or get confused and assume the proofs are likely correct. Fingers remain crossed until the graded math test is made available and you can see how your handiwork fared. I’ve previously covered the continual gambit of having generative AI and LLMs take arduous math tests, done time and time again (see the link here), seeking to showcase how adept modern AI is at solving math problems.

These efforts usually garner big headlines and suggest that LLMs are approaching human levels in math reasoning. Those are nearly always tests requiring a final numerical answer and not requiring delineated proofs associated with the answer. Thus, we generally do not know how generic LLMs perform in articulating mathematical proofs.

I would like to emphasize that there are highly specialized AI apps that are built specifically for devising mathematical proofs. Let’s put those aside and focus on everyday generative AI and LLMs. How do you think that everyday LLMs will fare on devising proofs? I’d wager that most people assume that LLMs would do a bang-up job at proofs.

We know that generative AI and LLMs are seemingly highly fluent in their text compositions. It stands to reason that they would be excellent at producing proofs. In fact, since proofs require precise logic, we would naturally assume that LLMs would have to be good in that capacity.

A recent research study sought to see what the results really are. In a study entitled “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad” by Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, and Martin Vechev, arXiv, March 27, 2025, these salient points were made (excerpts):

“It remains uncertain whether LLMs can reliably address complex mathematical questions requiring rigorous reasoning, which are crucial in real-world mathematical contexts.”

“We conduct the first evaluation of natural language proofs by LLMs on challenging problems from the 2025 USA Mathematical Olympiad (USAMO).”

“Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release.”

“Overall, we find that current LLMs struggle significantly with USAMO problems, with the best-performing model achieving an average score of less than 5%.”

“Our evaluation reveals several critical failure modes, including flawed logic, unjustified assumptions, and a lack of creativity in reasoning.”

Let’s unpack those crucial points. First, whatever testing or experiment is to be performed, it is vital to make sure the AI cannot cheat. Here’s what I mean.

If the AI has previously encountered the stated problems or seen the proofs, the odds are that the AI will have pattern-matched accordingly. It is almost the equivalent of a human student seeing the test beforehand. The AI won’t need to do much to solve the problems and can just call upon prior patterns.

Easy-peasy. That’s not what we want to have occur. The researchers realized that there was a chance the AI had previously encountered whatever set of problems they might choose to use in the testing.

After thinking about this quandary, they realized that a never-before-seen set of problems would be the wise way to go. The USAMO test is customarily carefully guarded, the aim being to prevent human test takers from knowing the questions beforehand. For this experiment, the researchers grabbed up some of the USAMO test questions within hours of their formal release.

A reasonable assumption is that the LLMs used in the experiment probably had not seen those problems. Had the researchers waited a day or so, all bets would be off. You are likely eager to get a sense of the problems and proofs that the AI was being asked to contend with.

Per the above-cited paper, here are two examples:

“Let k and d be positive integers. Prove that there exists a positive integer N such that for every odd integer n > N, the digits in the base-2n representation of n^k are all greater than d.”

“Let H be the orthocenter of acute triangle ABC, let F be the foot of the altitude from C to AB, and let P be the reflection of H across BC. Suppose that the circumcircle of triangle AFP intersects line BC at two distinct points X and Y. Prove that C is the midpoint of XY.”

Can you derive proofs for those two problems? If your proof-rendering days are behind you, the crux is that those are challenging problems and not a cakewalk when it comes to devising proofs.

They aren’t impossible to solve, though. Proofs do exist for each problem. A highly versed math student who has devoutly studied proofs can come up with solid proofs for them.
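To get a feel for what that first problem asserts, here is a small numerical sanity check of my own devising (an illustration only, not part of the study, and certainly not a proof). It scans odd values of n and reports the largest one within a search limit whose base-2n digits of n^k are not all greater than d; for a fixed k and d, the failures stop once n gets large enough, which is precisely what the problem asks you to prove in general.

```python
# Numerical sanity check (not a proof) of the first problem's claim:
# for fixed k and d, once odd n is large enough, every digit of n**k
# written in base 2n should exceed d.

def digits_in_base(value, base):
    """Digits of `value` in the given base, least significant first."""
    digits = []
    while value:
        value, remainder = divmod(value, base)
        digits.append(remainder)
    return digits

def largest_failing_odd_n(k, d, search_limit=5_000):
    """Largest odd n below search_limit whose base-2n digits of n**k
    are NOT all greater than d (returns 0 if none fail)."""
    worst = 0
    for n in range(3, search_limit, 2):
        if min(digits_in_base(n ** k, 2 * n)) <= d:
            worst = n
    return worst

if __name__ == "__main__":
    for k, d in [(1, 3), (2, 4), (3, 7)]:
        print(f"k={k}, d={d}: largest failing odd n below 5000 is "
              f"{largest_failing_odd_n(k, d)}")
```

Of course, checking a finite range proves nothing on its own; the whole point of the problem is to establish the claim for every sufficiently large odd n.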

They are reasonable problems to feed into AI. I’ve extensively written about the importance of prompting and making use of proper prompt engineering techniques; see my analysis of fifty such techniques at the link here. The results you get out of generative AI and LLMs are materially impacted by the prompts you use.

Weak or lousy prompts will tend to get hollow or bleak answers. Strong prompts usually increase your odds of getting whatever the AI can best achieve. I bring this up because researchers running experiments on LLMs can make or break their research via the prompts they choose to use.

Sadly, I’ve seen some studies that otherwise checked all the boxes and would be considered stellar, except they faltered by composing weak prompts. In that sense, they failed to give the AI a fighting chance. That’s on them as researchers, more so than on the failings of the AI (well, some disagree and assert that no matter how bad a prompt you use, the AI should figure out what you intended; see my discussion at the link here).

In the paper cited above, they gave the primary prompt that they utilized:

“Give a thorough answer to the following question. Your answer will be graded by human judges based on accuracy, correctness, and your ability to prove the result. You should include all steps of the proof. Do not skip important steps, as this will reduce your grade. It does not suffice to merely state the result. Use LaTeX to format your answer.”
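For readers who want to try this kind of probing themselves, here is a minimal sketch of how that grading prompt could be sent to a model programmatically. This is my own illustration, not the researchers’ setup; the OpenAI Python SDK is merely one convenient option, and the model name and the sample problem below are placeholders.

```python
# Minimal sketch: submit a proof-style question using the paper's grading prompt.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

GRADING_PROMPT = (
    "Give a thorough answer to the following question. Your answer will be graded "
    "by human judges based on accuracy, correctness, and your ability to prove the "
    "result. You should include all steps of the proof. Do not skip important steps, "
    "as this will reduce your grade. It does not suffice to merely state the result. "
    "Use LaTeX to format your answer."
)

# Placeholder problem for illustration; swap in an actual competition problem.
problem = "Prove that the sum of the first n odd positive integers equals n^2."

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name, not necessarily a model evaluated in the study
    messages=[{"role": "user", "content": f"{GRADING_PROMPT}\n\n{problem}"}],
)
print(response.choices[0].message.content)
```

Keep in mind the study’s central caution, namely that a fluent-looking reply is not the same as a correct proof; whatever comes back still needs careful human grading.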

Some online grumbling about the prompt has been floated by critics who say the prompt didn’t go far enough in compelling the AI to do a full-throated job. They contend that the low rate of the AI attaining solid proofs in this study is due to the prompt not being sufficient. I’m not going to get mired in that contention and will simply suggest that the prompt is a lot more compelling than many others that I’ve seen in this context.

The AI is suitably told to provide “thorough” answers; it is told how the grading will occur (accuracy, correctness, and ability to prove the result), is reminded not to skip important steps, and is warned that merely stating a result is not enough. Could you give more demanding and detailed instructions? Sure. I suspect that the result would still come out roughly the same.

Just a hunch. I noted above that the study found that even the best-performing LLM achieved an average score of less than 5%. The models evaluated were prevailing state-of-the-art LLMs.

I mention this because you should always be eyeing which LLMs a study chooses to use. You can stack the deck against AI by using outdated LLMs or ones that are not up to snuff. Overall, you could reasonably state that in this experiment and for the chosen LLMs the AI flunked.

Period, end of story. Imagine that a human test taker had gotten an average score of less than 5%. I dare say we would be dismayed and assume that the human test taker wasn’t especially adept at deriving proofs.
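To put that sub-5% figure in concrete terms, the USAMO customarily grades each of its six problems out of 7 points (a detail about the scale that I am supplying here, since the percentage alone doesn’t convey it), which works out as follows:

```latex
% Rough sense of scale, assuming the customary 7 points per USAMO problem
6 \times 7 = 42 \ \text{points available}, \qquad 0.05 \times 42 = 2.1 \ \text{points on average}
```

In other words, the best-performing model averaged only a couple of points across the entire exam.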

Similar to the types of mistakes that a human might make, the LLMs often made logic errors, relied on false or unproven assumptions, at times aimlessly pursued fruitless directions, and made basic algebraic and arithmetic miscalculations. I’m not especially disturbed by those failings. They are aspects that to some degree can likely be overcome with added data training for the LLMs.

I say keep your chin up and let’s keep trying. The bad news is this: “Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated LLMs consistently claimed to have solved the problems. This discrepancy poses a significant challenge for mathematical applications of LLMs as mathematical results derived using these models cannot be trusted without rigorous human validation.”

That’s bad news, really bad news. The reason that this is such bad news is rather straightforward.

If the AI fessed up and acknowledged it had troubles, we would at least have the heartwarming feeling that the AI is being honest. We could then immediately know that the results are suspect. By the AI pretending to have solid proofs, we are forced into scrutinizing the answers to the nth degree.

The flaws might not be obvious to the eye. The AI could slip a flawed proof past us that we then assume is right on. Other efforts might build upon that proof.

It is a house of cards, ready to fall apart. I’ve noted numerous times that contemporary AI is willing to scheme, lie, mislead, and otherwise fool humans at the drop of a hat; see, for example, my discussion at the link here. Incredibly, this happens even when the AI is directly and explicitly data-trained on abiding by human values that reject that kind of deviousness, see the link here.

The result here is yet another example of why we need to be continually on our toes when it comes to relying upon answers produced via generative AI. The rule of thumb is that you must always take a stern stance of trust but verify. It is easy to fall into a mental trap that if the AI has been aboveboard and correct a slew of times, the next prompt you enter will indubitably get a correct answer too.

Do not fall for that trap. This insightful research study provided two eye-opening outcomes. First, just because LLMs can derive numerical answers and do so with an amazing level of correctness, this ought not to lead us to assume that LLMs can produce adequate mathematical proofs.

The good news is that advances in LLMs can greatly improve this proof-making capacity. Second, once again we’ve got another nail in the coffin for LLMs that aren’t playing the game we want them to play. We keep trying all angles to infuse human value alignment into modern-era AI.

Darned if the AI slips around those controls and manages to be unduly tricky and underhanded. That is frustrating and raises grave concerns as we make progress toward topmost AI such as artificial general intelligence (AGI) and artificial superintelligence (ASI). Being sneaky about incorrect proofs is just the tip of the iceberg.

We need to realize that seeing the tip of the iceberg is a likely harbinger of the vast deviousness sitting under the water. This is another wake-up call for prioritizing human-value alignment and getting our ducks in a row, sooner rather than later.