The Jagged Frontier of Your Business
AI is brilliant at some tasks and confidently wrong on others that look identical. For an operator, the whole game is telling them apart before a fluent wrong answer reaches a customer.
Part of the AI for the People Who Run Things series. Start with AI Won't Run Your Business.
In 2024, a team at Purdue presented a study in which they fed 517 real programming questions from Stack Overflow to ChatGPT and graded the answers. 52% of the answers contained incorrect information.
Here's the part that should worry anyone running a business. When they showed those answers to people, the people preferred ChatGPT's answer 35% of the time anyway, and missed the errors in 39% of cases. The answers were wrong and well-written, and the second part beat the first.
That's the problem with AI in one experiment. It's not that it makes mistakes. Everything makes mistakes. It's that it makes them in a confident, fluent, well-organized voice that reads exactly like the answers it gets right. The wrong answer and the right answer look identical until you check.
In the last article I mentioned the jagged frontier, the term Harvard and BCG researchers gave to AI's strange capability map. Some tasks sit inside the frontier, where it's as good as your best person. Some sit outside, where it's worse than useless. This one is about why you can't see the line, and what to do about it anyway.
The Line Isn't Where You'd Guess
The frontier doesn't follow difficulty the way a human's would. AI can draft a clean contract clause and then botch grade-school arithmetic. It can summarize a forty-page inspection report and then invent a building code that doesn't exist. Hard-for-humans and hard-for-AI are different maps, and yours is not the model's.
Even the tools built specifically to be safe get this wrong. In 2024, Stanford's RegLab tested the AI legal-research tools sold by LexisNexis and Thomson Reuters, the ones marketed to lawyers as the careful, hallucination-free option. Across 202 real legal queries, they fabricated information somewhere between 17% and a third of the time. These are purpose-built, expensive, professional tools with whole companies behind them. They still made things up one query in six, or worse.
If the legal-AI vendors can't keep it inside the frontier with that much effort, you're not going to do it by trusting the output because it reads well.
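To make those rates concrete, here is a small Python sketch of the arithmetic. The 17% and one-third figures are the ones quoted above; the workload of 50 queries a week is a made-up number for illustration, and the function names are hypothetical.

```python
def expected_bad_answers(queries: int, error_rate: float) -> float:
    """Expected number of fabricated answers at a given per-query error rate."""
    return queries * error_rate


def one_in_n(error_rate: float) -> int:
    """Express an error rate as 'about one in N', rounded to the nearest whole N."""
    return round(1 / error_rate)


LOW_RATE = 0.17      # low end of the range quoted above
HIGH_RATE = 1 / 3    # "a third", the high end of the range
WEEKLY_QUERIES = 50  # hypothetical workload, not from the study

low = expected_bad_answers(WEEKLY_QUERIES, LOW_RATE)
high = expected_bad_answers(WEEKLY_QUERIES, HIGH_RATE)

print(f"Out of {WEEKLY_QUERIES} queries, expect roughly {low:.1f} to {high:.1f} fabricated answers.")
print(f"That is about one in {one_in_n(LOW_RATE)} at best and one in {one_in_n(HIGH_RATE)} at worst.")
# Out of 50 queries, expect roughly 8.5 to 16.7 fabricated answers.
# That is about one in 6 at best and one in 3 at worst.
```

The point of running the numbers is that a rate that sounds small per query adds up quickly across a normal week of work.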
The Only Question That Matters Is Whether You Can Check It
So here's the sorting rule I actually use. Forget "is this task hard." Ask one thing: can I verify the answer in about ten seconds?
Tasks where you can: drafting an email you'll read before it sends. Summarizing notes you already have in front of you. Reformatting a list you can eyeball. Pulling quotes from a document you can search. These sit inside the frontier not because they're easy, but because a wrong answer gets caught the instant it appears. The cost of a mistake is basically zero.
Tasks where you can't: the final price on a custom job. A compliance answer. A safety call. Anything where the output ships straight to a customer and you find out it was wrong when they're angry. These sit outside the frontier not because AI won't attempt them (it always will) but because you can't catch the confident wrong answer before it costs you something.
The Harvard and BCG study put numbers on this. On tasks inside the frontier, the consultants using AI did more and better work. On one task that sat just outside it, the consultants using AI were 19 percentage points less likely to reach the correct answer than the people working with no AI at all. They trusted a wrong answer they couldn't easily check.
The Model Has No Tell
The real trap is that AI gives you nothing to read. A person who's unsure hedges. They slow down, they say "I think," their voice changes. The model writes its wildest guess in the same steady, certain prose as its best work. There's no tremor in it.
The people in the Purdue study who missed the errors, which happened in 39% of cases, weren't lazy. They were human, trusting a signal we all use every day: this is well-organized and confident, so it's probably right. That signal is reasonably reliable when it comes from a person. The technology breaks it, because a model writes just as fluently when it is wrong as when it is right.
So you have to supply the doubt the model won't. Every output gets the same question before you act on it: how would I know if this were wrong, and how fast would I find out?
If the answer is "instantly, I'd see it," use AI freely and move fast. If the answer is "I'd hear about it from a customer next week," keep a person in the loop or keep AI out of that job entirely.
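The rule above can be written down as a tiny triage function. This is only a sketch under stated assumptions: the ten-second threshold borrows the "about ten seconds" figure from earlier in the article, the detection times for the example tasks are invented, and the names `Task` and `triage` are hypothetical.

```python
from dataclasses import dataclass

# Assumed cutoff: the article's "about ten seconds" verification window.
QUICK_CHECK_SECONDS = 10

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


@dataclass
class Task:
    name: str
    seconds_to_detect_error: float  # how long before you would know the output was wrong


def triage(task: Task) -> str:
    """Sort a task by how quickly a wrong answer would be caught, not by how hard it looks."""
    if task.seconds_to_detect_error <= QUICK_CHECK_SECONDS:
        return "use AI freely"
    return "keep a person in the loop, or keep AI out"


# Illustrative tasks; the detection times are guesses, not measurements.
tasks = [
    Task("draft an email you read before sending", 5),
    Task("reformat a list you can eyeball", 3),
    Task("final price on a custom job", ONE_WEEK_SECONDS),
    Task("compliance answer sent to a customer", ONE_WEEK_SECONDS),
]

for task in tasks:
    print(f"{task.name}: {triage(task)}")
```

Note that the function never asks how difficult the task is. The only input is how fast a mistake surfaces, which is the whole argument of this section.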
The frontier is jagged and you cannot see it from the outside. So stop staring at the answer to judge whether it's good. Start asking how cheaply you can prove it's wrong. That single habit is most of the difference between the owners who get the real gains and the ones who get burned by something that sounded perfect.
Part of the AI for the People Who Run Things series. Continue with Why Most Owners Get Nothing From AI.
This article reflects the state of AI tooling as of June 2026. What the models can and can't do reliably is still moving.
Sources
- Magesh, Varun et al. "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools" (2024, Stanford RegLab; Journal of Empirical Legal Studies, 2025)
- Kabir, Samia et al. "Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions" (2024, CHI '24)
- Dell'Acqua, Fabrizio et al. "Navigating the Jagged Technological Frontier" (2023, Harvard Business School Working Paper 24-013)
- Stanford HAI: "AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries" (2024)



