Hallucinations Are LLMs Working As Designed
Chuck Wendig, science fiction and horror writer, has a cat. Or has had many cats. Or has killed cats. The cat is named Boomba. Or the cat is named Franken. Or Dartanian. Or Catlin. And six dogs, though their relationship to the cat, or cats, is unclear. Oh, and he has cancer.
Chuck Wendig has no cat.
Never has. And while he does have dogs, he has two, not six. And no animal he has ever owned has carried any of the names listed as belonging to his nonexistent cats. Wendig, of course, found the false information about his supposed pets via a Google Gemini search. The problem for imitative AI systems like Gemini, and our problem as a society where these systems are being forced into most software, is that the answers Gemini provided about Wendig’s pets are exactly what Gemini is designed to produce. What their purveyors call hallucinations are, to the system, perfectly acceptable answers.
Gemini, and other imitative AI systems, are not designed to produce true answers. They are designed to produce text that resembles a valid answer. To simplify: they probabilistically determine what a correct answer, or the next piece of text, should look like. They have no means of determining the truth or falsity of an answer because they have no model of the real world to rely upon. What Gemini said about Wendig’s pets looked the way a real answer would look. Based on its training data, the answer to “Does Chuck Wendig have pets?” could very well look like “Chuck Wendig has a cat named Boomba”. Therefore, Gemini successfully completed its task.
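To make that concrete, here is a toy sketch in Python of how next-token selection works. The vocabulary and the probabilities are invented for illustration; real models operate over enormous vocabularies and billions of parameters, but the principle is the same: the continuation is chosen by plausibility, and nothing in the loop ever asks whether it is true.

```python
import random

# Toy illustration with invented numbers, not a real model: a language model
# scores possible continuations of "Chuck Wendig has a ..." by how plausible
# each token is given its training data. Truth never enters into it.
next_token_probs = {
    "cat": 0.41,      # pet-shaped sentences make pet words highly plausible
    "dog": 0.35,
    "novel": 0.15,
    "podcast": 0.09,
}

def sample_next_token(probs):
    """Pick a continuation in proportion to its plausibility score."""
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print("Chuck Wendig has a", sample_next_token(next_token_probs))
# Run it a few times and you will get different, equally "plausible" answers.
```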
When you see scientific papers that cite papers that do not exist, that is not a mistake made by the imitative AI system that produced them. Citations are a core component of real scientific research, and so the output is plausibly correct. And since plausibly correct is the only kind of correct any imitative AI system can achieve, that is a designed output. Essentially, every output given by an imitative AI system is a hallucination. Another way of saying this is that “hallucination” is a marketing term. The makers of these systems want you to believe that there are two types of outputs: correct and incorrect. In reality, there is only one type of output: probabilistically plausible based on training data.
This fact explains why imitative AI systems are inconsistent even in the areas where they show some usefulness. Programming is the canonical example. The range of plausible answers is better modeled in the training data for some languages than for others. Languages like R, Java, and Python have enormous amounts of code available in open-access repositories, so the models have a better chance of producing a plausible output that is also useful. For other languages, such as those in the .NET ecosystem, the data is much more limited, and so plausible results that are also useful are less common. But, again, this does not mean that the systems are correct. Imitative AI still produces more bugs than human-written code, for example. It merely means that some languages have a more robust probability space.
Okay, so doesn’t that mean that the hyper-scalers, the people who say that all you need is more training data, are correct? Can’t more data solve this problem? Unfortunately, no. First, the cost of training existing models is already enormous. Second, you cannot have perfect training data at the scale these systems require. It is impossible to guarantee that training data contains no false information, meaning that falsehoods will appear plausible as well. Third, the math of the systems essentially guarantees that they will pass over uncertainty and simply provide plausible answers. Indeed, the way these systems are built, they are discouraged from admitting uncertainty. The best known way to limit incorrect answers is to have the machines admit uncertainty and refuse to provide an output. Unfortunately, it appears that doing so would produce an “I don’t know” rate of around half in current models. That is, obviously, not an especially useful tool.
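To see why the refusal rate balloons, consider a minimal sketch with invented confidence scores: once you add a cutoff below which the system must say “I don’t know,” every weakly supported but plausible-looking answer gets withheld.

```python
# Sketch of the "admit uncertainty" fix, with invented confidence scores.
# Refusing to answer below a threshold removes a large share of outputs,
# because many plausible-looking answers are only weakly supported.
answers = [
    {"text": "Chuck Wendig has two dogs.", "confidence": 0.92},
    {"text": "Chuck Wendig has a cat named Boomba.", "confidence": 0.48},
    {"text": "Chuck Wendig has six dogs.", "confidence": 0.31},
    {"text": "Chuck Wendig has cancer.", "confidence": 0.22},
]

THRESHOLD = 0.6  # arbitrary cutoff chosen for this illustration

for answer in answers:
    if answer["confidence"] >= THRESHOLD:
        print(answer["text"])
    else:
        print("I don't know.")  # three of the four answers are withheld here
```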
Well, then, why not just build a way for these systems to validate their answers? Some systems do try to do that kind of post-output validation, but it is not easy. Often there is ambiguity that must be parsed, and checking outside sources doesn’t provide a clear answer. Anything that is simple to check would likely just be the equivalent of a lookup table, not needing imitative AI at all. And, of course, extra steps and extra resources make the systems even more expensive to run.
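Here is a rough sketch of what post-output validation runs into, using a hypothetical fact store in place of “outside sources.” Anything it can settle is effectively a lookup table; anything it cannot settle comes back as ambiguous, which is exactly the hard case.

```python
# Sketch of post-output validation against a hypothetical fact store that
# stands in for "checking outside sources." The simple cases reduce to a
# lookup table; the ambiguous cases come back unresolved.
known_facts = {
    ("Chuck Wendig", "number of dogs"): "two",
    # No entry for ("Chuck Wendig", "cat name"): the sources are silent or
    # ambiguous, so the validator cannot reach a verdict.
}

def validate(subject, attribute, claimed):
    recorded = known_facts.get((subject, attribute))
    if recorded is None:
        return "unverifiable"  # the ambiguity an extra step cannot resolve
    return "supported" if recorded == claimed else "contradicted"

print(validate("Chuck Wendig", "number of dogs", "six"))  # contradicted
print(validate("Chuck Wendig", "cat name", "Boomba"))     # unverifiable
```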
Great, then just build a model of the world in these things. Yep, that would do it. But building an accurate, useful model of the world is not easy. Our inability to do that is largely responsible for the joke that artificial intelligence has been ten years away for the last fifty years. If it were that easy, it would have been done already.
This is not to say that these models cannot be improved. Better training data, more domain-specific training data, better fine-tuning, and external validation in some cases can all help. But the simple fact is that the inherent design of these systems, the very math they are built upon, makes it impossible for them to be precise. They will always hallucinate because, in the truest sense, that is all they ever do.
So what? Why does the presence of mistakes, of untruths, matter so much? After all, humans make mistakes. The point of automation is that it is repeatable and precise. It can do the same thing with the same result, over and over, faster and more reliably than a human can. While nothing is ever perfect, the repeatability and precision of automation is what makes it economically beneficial, and quality assurance is only a small part of the overall process. Imitative AI is not repeatable: ask it the same question and you will often get different answers. See Chuck’s cat saga if you need more proof. And that significantly limits the value of imitative AI as a business tool.
There are industries where slop is good enough, or perceived as good enough. But the quality assurance process for imitative AI requires everything to be checked. That is difficult and costly to do, and likely to let more errors slip through than if people just did the work from the start. Human beings are poor at finding uncommon errors in a stream of work that is usually accurate enough. Most businesses are turning away from AI, and studies show that programmers think they are more productive with imitative AI but are actually less productive. The unavoidability of errors means that imitative AI is not a good automation tool. And if it is not, then it is almost certainly not going to be able to produce enough economic value to justify its costs. All because hallucinations are its output, not its mistakes.
Oh, and Chuck Wendig says he doesn’t have cancer. But, really, who are you going to believe — Chuck, or the machine that produced a statistically plausible output?

