Study suggests that even the best AI models hallucinate a bunch

Robots work on a contract and review a legal book to illustrate AI usage in law.
Image Credits: mathisworks / Getty Images

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o. In other words, the models are unreliable narrators, sometimes to hilarious effect, other times problematically so.

But not all models make things up at the same rate. And the kinds of mistruths they spout depend on which sources of info they’ve been exposed to.

A recent study from researchers at Cornell, the universities of Washington and Waterloo and the nonprofit research institute AI2 sought to benchmark hallucinations by fact-checking models like GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that models that hallucinated the least did so partly because they refused to answer questions they’d otherwise get wrong.

“The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations,” Wenting Zhao, a doctoral student at Cornell and a co-author on the research, told TechCrunch. “At present, even the best models can generate hallucination-free text only about 35% of the time.”

There have been other academic attempts at probing the “factuality” of models, including one by a separate AI2-affiliated team. But Zhao notes that these earlier tests asked models questions with answers easily found on Wikipedia, which is not exactly the toughest ask, considering most models are trained on Wikipedia data.

To make their benchmark more challenging — and to more accurately reflect the types of questions people ask of models — the researchers identified topics around the web that don’t have a Wikipedia reference. Just over half the questions in their test can’t be answered using Wikipedia (they included some Wikipedia-sourced ones for good measure), and touch on topics including culture, geography, astronomy, pop culture, finance, medicine, computer science and celebrities.

For their study, the researchers evaluated over a dozen different popular models, many of which were released in the past year. In addition to GPT-4o, they tested “open” models such as Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B and Cohere’s Command R+, as well as gated-behind-API models like Perplexity’s Sonar Large (which is based on Llama), Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 Opus.

The results suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic and the other big generative AI players.

GPT-4o and OpenAI’s much older flagship GPT-3.5 performed about the same in terms of the percentage of questions they answered factually correctly on the benchmark. (GPT-4o was marginally better.) OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R and Perplexity’s Sonar models.

Questions pertaining to celebrities and finance gave the models the hardest time, while questions about geography and computer science were the easiest for them to answer (perhaps because their training data contained more references to these topics). In cases where the source of an answer wasn’t Wikipedia, every model answered less factually on average (but especially GPT-3.5 and GPT-4o), suggesting that they’re all informed heavily by Wikipedia content.

Even models that can search the web for information, like Command R and Perplexity’s Sonar models, struggled with “non-Wiki” questions in the benchmark. Model size didn’t matter much; smaller models (e.g. Anthropic’s Claude 3 Haiku) hallucinated roughly as frequently as larger, ostensibly more capable models (e.g. Claude 3 Opus).

So what does all this mean — and where are the improvements that vendors promised?

Well, we wouldn’t put it past vendors to exaggerate their claims. But a more charitable take is the benchmarks they’re using aren’t fit for this purpose. As we’ve written about before, many, if not most, AI evaluations are transient and devoid of important context, doomed to fall victim to Goodhart’s law.

Regardless, Zhao says that she expects the issue of hallucinations to “persist for a long time.”

“Empirical results in our paper indicate that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement achievable with these methods is limited,” she said. “Additionally, our analysis reveals that even the knowledge found on the internet can often be conflicting, partly because the training data — authored by humans — can also contain hallucinations.”

An interim solution could be simply programming models to refuse to answer more often, the technical equivalent of telling a know-it-all to knock it off.

In the researchers’ testing, Claude 3 Haiku answered only around 72% of the questions it was asked, choosing to abstain from the rest. When accounting for the abstentions, Claude 3 Haiku was in fact the most factual model of them all — at least in the sense that it lied least often.
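To see why abstaining can make a model look more factual, consider a rough sketch of the arithmetic in Python. This is an illustration, not the researchers' scoring code: the `Tally` class, the function names and every count below are hypothetical. Only the 72% response rate echoes the figure reported for Claude 3 Haiku.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Hypothetical per-model counts; these are NOT figures from the study."""
    asked: int     # questions posed to the model
    answered: int  # questions the model chose to answer
    factual: int   # answered questions judged free of hallucinations


def response_rate(t: Tally) -> float:
    """Share of questions the model answered instead of abstaining."""
    return t.answered / t.asked


def hallucinated_share(t: Tally) -> float:
    """Share of ALL questions that produced a hallucinated answer."""
    return (t.answered - t.factual) / t.asked


def factual_share(t: Tally) -> float:
    """Share of ALL questions answered factually (abstentions earn nothing)."""
    return t.factual / t.asked


# Made-up models: one abstains often, the other answers everything.
cautious = Tally(asked=100, answered=72, factual=36)
eager = Tally(asked=100, answered=100, factual=40)

for name, tally in (("cautious", cautious), ("eager", eager)):
    print(
        f"{name}: answered {response_rate(tally):.0%}, "
        f"hallucinated on {hallucinated_share(tally):.0%}, "
        f"factual on {factual_share(tally):.0%}"
    )
```

With these invented numbers, the cautious model hallucinates on 36% of all questions versus 60% for the eager one, even though the eager model gets slightly more questions right overall (40% versus 36%). Which model counts as "most factual" depends on whether an abstention is scored as a miss or simply as not lying.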

But will people use a model that doesn’t answer many questions? Zhao thinks not and says vendors should focus more of their time and effort on hallucination-reducing research. Eliminating hallucinations entirely may not be possible, but they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development, she asserts.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” Zhao added. “There are still numerous opportunities to make significant impacts in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content and offering corrections for hallucinated texts.”
