Fluent and wrong look exactly the same
Why the plan is the most important step in working with an AI agent or LLM
The thing nobody warns you about when you work with AI agents and LLMs is that confidently right and confidently wrong arrive in the same paragraph. Same tone. Same fluency. Same finished-looking output. Fluency is supposed to be a signal of competence. With language models it’s just a default setting.
I run into this because I’m building two agents in my spare time. One is a chief of staff that runs the logistics of my life, and the other helps me with investing. The investing one sits on top of a financial analytics engine I built that can run a discounted cash flow and spit out an intrinsic value for any stock, which is a fancy way of saying it estimates what a share is actually worth. So when I tell you it’s capable of being confidently wrong, I mean it can hand me a precise dollar figure, delivered with total composure, that happens to be off by a factor of a few hundred.
It did exactly that on Memorial Day.
The agent ran its undervaluation scan and flagged thirty-one stocks as deeply undervalued. NVDA was one of them, with a computed fair value of $44,701 per share. That number is obviously absurd. But obvious is the lucky case. A quieter version of the same bug could have flagged a stock at a merely wrong price instead of a comic one, and I’d never have caught that by eye.
The agent didn’t touch the fix, at least not yet. Before it’s allowed to act on anything non-trivial, it has to stop and write down five things: what it thinks the problem actually is, what it knows versus what it’s only assuming, the one question it’s least sure about, what “solved” will look like as a number I can check, and the smallest change that could possibly work.
That’s the whole intervention. Make it write the plan before it writes the code.
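For the concretely minded, here is a minimal sketch of what that gate could look like in Python. It is not my agent's actual code: the class name, the field names, and the `act` helper are all made up for illustration. The only idea it carries is that the work refuses to start until all five sections exist.

```python
from dataclasses import dataclass, fields


@dataclass
class PlanMemo:
    """The five sections written before any work starts (names are illustrative)."""
    problem: str = ""              # what the agent thinks the problem actually is
    known_vs_assumed: str = ""     # what it knows versus what it is only assuming
    least_sure_question: str = ""  # the one question it is least sure about
    done_number: str = ""          # what "solved" looks like as a checkable number
    smallest_change: str = ""      # the smallest change that could possibly work

    def missing_sections(self) -> list[str]:
        # A section counts as missing if it is empty or only whitespace.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


def act(memo: PlanMemo, work):
    """Run `work` only if every section of the memo has been written."""
    missing = memo.missing_sections()
    if missing:
        raise ValueError(f"plan incomplete, missing: {', '.join(missing)}")
    return work()
```

The check is deliberately dumb. It can't tell a good plan from a bad one; it only guarantees the plan exists before the code does, which is the part that matters.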
It sounds like process hygiene, the kind of checklist people tape to a wall and ignore. It isn’t, and the reason is the part I didn’t see coming. The value isn’t in the five sections. It’s in when they get written. Reasoning produced before any work exists is a plan. The same reasoning produced after the work is built is a rationalization, because by then the agent (like a person) is explaining the thing it already made rather than deciding what to make. Those two documents can contain the identical words and mean opposite things. One you can veto. The other just makes you feel informed while you rubber-stamp.
On the NVDA run the section that earned its keep was the boring one: what am I assuming. The agent wrote down that it was assuming the valuation pulled clean annual financials, and it flagged that it hadn’t actually checked. That admission is what aimed the whole investigation at the data instead of the math. The bug was a query pulling mixed annual and quarterly rows, which produced a 268 percent growth rate, which compounded into a roughly half-quadrillion-dollar company.
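To see how a data mix-up like that can blow up a valuation, here is a toy version in Python. The rows, the numbers, and the function names are invented for illustration; this is not my query, my engine, or real company data. It only shows the mechanism: compare a full-year figure against a quarterly one and the growth rate explodes, then compounding does the rest.

```python
# Invented rows in the order a query might return them. Not real data.
rows = [
    {"label": "FY2023", "period": "annual", "revenue": 100.0},
    {"label": "Q1 2024", "period": "quarter", "revenue": 30.0},  # slipped in
    {"label": "FY2024", "period": "annual", "revenue": 120.0},
]


def latest_growth(rows):
    """Growth between the last two rows, e.g. 0.2 for 20 percent."""
    prev, last = rows[-2], rows[-1]
    return last["revenue"] / prev["revenue"] - 1.0


def compound(growth, years):
    """Multiple on today's value after `years` of constant growth."""
    return (1.0 + growth) ** years


# Mixed rows: a full year is compared against a single quarter.
mixed_growth = latest_growth(rows)  # 120 / 30 - 1 = 3.0, i.e. 300 percent

# Annual rows only: the comparison the valuation was assuming.
annual_rows = [r for r in rows if r["period"] == "annual"]
clean_growth = latest_growth(annual_rows)  # roughly 0.2, i.e. 20 percent

print(f"mixed: {mixed_growth:.0%} growth, {compound(mixed_growth, 5):,.0f}x in 5 years")
print(f"clean: {clean_growth:.0%} growth, {compound(clean_growth, 5):.2f}x in 5 years")
```

The math in both branches is identical and correct. Only the input differs, which is why the written-down assumption about the data pointed at the right place.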
When your LLM or AI agent hands you a fluent, finished answer, you can’t grade the answer; it’s built to look right. What you can grade is the plan it would have written before it started, so make it write that plan first. Force it to separate what it knows from what it’s guessing. Force it to name the one number that proves the job is done. Do that before any work exists, while the reasoning is still a decision and not a defense.
It costs a few minutes every time, and I genuinely resented that at first. What I got back was strange: I read the agent’s actual output less carefully now, not more, because I stopped trusting the output and started trusting the plan. The slow part was never the typing anyway. It was the thinking, and the memo is just the place I make it happen where I can see it before it’s too late to matter.
