The Problem Has Not Gone Away
Despite everything the industry has tried, hallucinations have not been eliminated. Modern models hallucinate less than their predecessors on well-understood topics, but they still generate confident-sounding false information. A model that correctly answers 95 of 100 factual questions still gets 5 wrong, and that is a serious problem in any application where those 5 have consequences - medical information, legal advice, financial analysis.
The problem is structural: language models are trained to produce plausible text, not truthful text. They optimize for coherence and typicality, not accuracy. A well-trained model will generate an answer that sounds right more often than it generates one that is right. That asymmetry is fundamental, not a bug that a clever prompting trick can fix.
What Actually Helps
Retrieval augmentation is the most reliable practical intervention. Grounding model outputs in retrieved documents forces the model to respond from specific information rather than generating from implicit knowledge. A RAG system that retrieves relevant documents and asks the model to answer from those documents will hallucinate far less than a model relying on its training data. The key is retrieval quality: if the system retrieves wrong or irrelevant documents, the model will confidently synthesize wrong answers from them.
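As a minimal sketch of the grounding step - assuming a toy keyword-overlap retriever and an invented prompt format, not any particular framework's API - the core move is to hand the model specific passages and instruct it to answer only from them:

```python
# Minimal RAG grounding sketch. The corpus, the keyword-overlap scoring,
# and the prompt wording are all illustrative assumptions; production
# systems typically use embedding-based retrieval instead.

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    q_terms = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )[:k]

def build_grounded_prompt(query: str, docs: list[str]) -> str:
    """Instruct the model to answer only from the retrieved passages."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return (
        "Answer the question using ONLY the passages below. "
        "If they do not contain the answer, say you cannot answer.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

corpus = [
    "The refund policy was last updated in March 2024.",
    "Shipping takes 3-5 business days within the EU.",
]
query = "When was the refund policy updated?"
prompt = build_grounded_prompt(query, retrieve(query, corpus))
# `prompt` is then sent to whatever chat model the system uses.
```

The instruction to decline when the passages lack the answer matters as much as the retrieval itself: it gives the model a sanctioned alternative to fabricating.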
Structured output formats reduce a specific class of hallucinations. When you constrain a model to output JSON matching a schema - via constrained decoding or post-hoc validation - the model cannot introduce fields that are not in the schema, though it can still hallucinate the values inside them. This is a significant improvement for data extraction and structured reasoning tasks. The hallucination problem for free-form generation is much harder than for constrained structured output.
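As an illustration of the validation side, here is a sketch using pydantic (v2) with an invented Invoice schema; the point is that a fabricated out-of-schema field becomes a hard error instead of flowing downstream as a hallucinated fact:

```python
# Schema validation sketch using pydantic v2. The Invoice fields are
# invented for illustration; "extra": "forbid" makes out-of-schema
# fields a validation error instead of silently passing them through.
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    model_config = {"extra": "forbid"}
    invoice_id: str
    total: float
    currency: str

# Model output containing a fabricated field the schema never asked for.
raw = '{"invoice_id": "INV-001", "total": 42.5, "currency": "USD", "vendor_tax_id": "XX-999"}'

try:
    invoice = Invoice.model_validate_json(raw)
except ValidationError as err:
    print(err)  # flags vendor_tax_id as a forbidden extra field
```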
Explicit uncertainty signaling helps when calibrated properly. Some models expose token-level probabilities that correlate reasonably with actual accuracy. Using those signals to flag low-confidence responses - rather than always returning a confident answer - can route uncertain outputs to human review or trigger additional retrieval steps.
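One common routing pattern is sketched below, using mean token probability as a crude confidence proxy and an assumed 0.8 threshold; real systems usually verify calibration empirically or train a separate scorer:

```python
import math
from dataclasses import dataclass

@dataclass
class ModelResponse:
    text: str
    token_logprobs: list[float]  # per-token log probabilities from the API

def confidence(resp: ModelResponse) -> float:
    """Crude proxy: geometric-mean token probability. Assumes the
    provider exposes logprobs; must be checked against actual accuracy."""
    return math.exp(sum(resp.token_logprobs) / len(resp.token_logprobs))

def route(resp: ModelResponse, threshold: float = 0.8) -> tuple[str, str]:
    """Only return the answer directly when confidence clears the
    (assumed) threshold; otherwise escalate instead of asserting."""
    if confidence(resp) >= threshold:
        return ("answer", resp.text)
    return ("human_review", resp.text)  # or trigger another retrieval pass
```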
What Does Not Work Well
Prompting techniques like chain-of-thought or self-consistency do not eliminate hallucinations; they make reasoning more transparent and sometimes reduce certain error types, but confident wrong reasoning is still confident and still wrong. Few-shot prompting has similar limitations - examples guide the model toward correct patterns but do not guarantee accuracy on novel inputs.
Post-generation fact-checking with a second LLM call is popular but unreliable. The checking model often shares the hallucination tendencies of the original model, and the overhead of double inference is substantial. Some teams have built production pipelines on this approach, but controlled evaluations suggest it catches a minority of hallucinations rather than most of them.
Building Honest Systems
The most mature teams are treating hallucination as a risk management problem rather than a solved problem. They design their systems to minimize reliance on model accuracy where it matters - using retrieval for factual grounding, adding explicit citations, building human review workflows for high-stakes outputs, and clearly communicating uncertainty to users rather than presenting every response as equally authoritative.
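Concretely, that often means the system returns a structured envelope rather than bare text, so provenance and uncertainty travel with the answer. A hypothetical shape, with field names and the review threshold invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class GroundedAnswer:
    """Hypothetical response envelope: the answer never travels without
    its provenance and an explicit uncertainty signal."""
    text: str
    citations: list[str] = field(default_factory=list)  # retrieved source IDs
    confidence: float = 0.0       # calibrated score, not raw model bravado
    needs_review: bool = False    # set when the answer falls below policy

def finalize(answer: GroundedAnswer, review_threshold: float = 0.8) -> GroundedAnswer:
    # No citations or low confidence: surface the uncertainty, don't hide it.
    answer.needs_review = (
        answer.confidence < review_threshold or not answer.citations
    )
    return answer
```

The design choice is that downstream consumers cannot ignore the signal: a rendering layer that receives `needs_review=True` has to show the caveat or route to a human, rather than presenting every response as equally authoritative.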