The researchers found that large language models from top tech companies "hallucinated" — produced text with incorrect information — 69% to 88% of the time when answering legal questions, and their performance worsened as the questions grew more difficult.
The models did no better than random guessing when asked to assess the precedential relationship between two cases, and they "hallucinated" at least 75% of the time when asked about a court's main ruling, according to the study.
"These findings suggest that LLMs are not yet able to perform the kind of legal reasoning that attorneys perform when they assess the precedential relationship between cases — a core objective of legal research," the researchers said in a blog post Thursday.
The researchers asked OpenAI's GPT-3.5, Meta's Llama 2 and Google's PaLM 2 more than 200,000 legal questions of varying difficulty, according to the study. The questions ranged from identifying who wrote a court opinion to determining whether two cases agree with each other.
The researchers found that the models got it wrong more frequently when asked about decisions from lower courts than when asked about rulings from higher courts like the U.S. Supreme Court.
"This suggests that LLMs may struggle with localized legal knowledge that is often crucial in lower court cases, and calls into doubt claims that LLMs will reduce long-standing access to justice barriers in the United States," the researchers said.
The researchers also found that the models were wrong more often when asked about the Supreme Court's oldest and newest cases, suggesting that their peak performance lags behind current legal doctrine.
In addition, the researchers found that the AI models are vulnerable to what they called "contra-factual bias": when a legal question rests on a false premise, the models tend to accept the premise as true and answer accordingly rather than flag the error.
Meta, Google and OpenAI did not respond to requests for comment.
Since OpenAI released its chatbot ChatGPT in November 2022, several legal tech companies and law firms have built their own tools using the technology.
While OpenAI's GPT-3.5 and its more advanced GPT-4 are popular choices for legal tools, law firms and legal tech companies are also experimenting with models from other providers, such as Meta and Google.
In November, LegalOn Technologies published a study finding that OpenAI's GPT-4 outscored the average law student on multiple-choice legal ethics questions.
--Editing by Robert Rudinger.