Hor Druma
These types of ideas often involve aspects of legal work that seem straightforward to them but are actually sophisticated, nuanced use cases and workflows that require legal expertise and reasoning. I thought it would be helpful to explain where to draw the line between tasks that large language models (LLMs) are good at and tasks they simply cannot do (yet). Below, I discuss how LLMs have performed on tests designed to measure their legal reasoning abilities, why they performed the way they did and what that means.
AI passed the bar exam: Milestone?
As we know, the bar exam primarily tests memorization and application of legal knowledge. It’s a test that large language models can excel at because LLMs are, in a nutshell, fancy maths and statistics built on billions of data points. Large language models use that maths, those statistics and that data to predict the next word in a sequence. The training data covers an enormous range of material, including subjects similar to those found on the bar exam, such as content from law student forums and bar prep tools. This means the models have seen the relevant patterns and can predict, with good accuracy, the next word in the kinds of sequences a bar exam involves.
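To make that concrete, here is a deliberately tiny Python sketch of the statistical idea behind next-word prediction. It is my own toy illustration, not how any real model is built: actual LLMs use neural networks with billions of parameters rather than simple word counts, but the underlying principle of predicting the most likely continuation from patterns in the training data is the same.

```python
from collections import Counter, defaultdict

# A toy "training corpus" (invented). Real models train on billions of words.
corpus = (
    "the contract is governed by the ucc . "
    "the contract is governed by the common law . "
    "the contract is governed by the ucc ."
).split()

# Count which word follows each word: a simple bigram model.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word that most often followed `word` in the training text."""
    return following[word].most_common(1)[0][0]

print(predict_next("governed"))  # -> "by"
print(predict_next("the"))       # -> "contract" (its most frequent successor)
```

Scale that idea up by many orders of magnitude and you get text that reads as if it were written by someone who studied for the bar.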
As well, the very format of the bar exam plays to LLMs’ specific strengths in pattern recognition and text generation. It’s true that many other legal tasks can benefit from these strengths, such as classifying clauses or refining the language of correspondence. But the bar exam doesn’t fully capture the nuanced reasoning, ethical judgment and creative problem-solving that define great lawyers. So, if the bar exam isn’t an adequate test, what is?
LegalBench: A more rigorous challenge?
LegalBench is a benchmark that tries to test the capabilities of LLMs in legal work more accurately. It presents a suite of tasks, such as asking an LLM whether a statute contains a private right of action or, given a description of evidence, whether that evidence could be considered hearsay. Arguably these are more rigorous challenges than the bar exam, but they are still grounded in prediction and pattern recognition.
LLMs performed well in LegalBench, hitting scores as high as 82.9 in the issue spotting category and 89.9 in the conclusion category. But what do these scores really mean? Let’s look at a task from a subset of the conclusion category, in which the top score was effectively 100 (i.e. perfect):
Task: The UCC (through Article 2) governs the sale of goods, which are defined as moveable tangible things (cars, apples, books, etc.) whereas the common law governs contracts for real estate and services. For the following contracts, determine if they are governed by the UCC or by common law.
Contract: Alice and Bob enter into a contract for Alice to sell her bike to Bob for $50. Is this contract governed by the UCC or the common law?
Governed by: UCC
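To show how little understanding a correct label can require, here is a crude Python sketch of pure pattern matching applied to a task like the one above. It is a caricature I made up for illustration (the keyword lists and contracts are invented, and this is not how an LLM works internally), but it makes the distinction concrete: a system can attach the right label without grasping anything about Article 2.

```python
# A crude keyword matcher, for illustration only. The keyword lists and the
# contracts below are invented examples, not part of LegalBench.

GOODS_WORDS = {"bike", "car", "apples", "books", "goods", "equipment"}
SERVICE_WORDS = {"services", "consulting", "repair", "landscaping", "estate"}

def classify(contract_text: str) -> str:
    """Label a contract UCC or common law by counting keyword hits."""
    words = set(contract_text.lower().replace(".", " ").replace(",", " ").split())
    goods_hits = len(words & GOODS_WORDS)
    service_hits = len(words & SERVICE_WORDS)
    return "UCC" if goods_hits > service_hits else "Common law"

print(classify("Alice agrees to sell her bike to Bob for $50."))        # UCC
print(classify("Alice agrees to provide consulting services to Bob."))  # Common law
```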
This prediction and pattern recognition capability can result in better efficiency when applied correctly. For example, this capability could support contract analysis and case assessment, which in turn can make for easier access to legal information. But it takes deep expertise to dive into a contract and truly understand the complex interaction of rights and obligations to then advise on them — it’s a lot more difficult than just identifying the governing law with specific instructions on how to do so. While an LLM tool performing the above task could be used as an assistant to kick off the analysis, it is not reasoning — even if it appears that it might be.
The Abstraction and Reasoning Corpus (ARC): AI falls short
If what we see today from AI is prediction and pattern recognition, how far off is reasoning? The Abstraction and Reasoning Corpus (ARC) is a lesser-known benchmark, but as the name suggests, it homes in on abstract reasoning abilities rather than prediction. That makes it especially useful for testing reasoning. For reference, humans tend to score around 80 per cent on ARC. Now compare this to stock LLMs, where even the much-vaunted state-of-the-art OpenAI o1 achieves only 21 per cent (on par with Anthropic’s Claude 3.5).
ARC tasks involve identifying patterns, applying rules and making logical inferences based on abstract visual information. These are skills that all humans, including lawyers, commonly use (especially when deciphering handwritten medical notes … but I digress). The tasks are typically simple visual pattern-recognition problems that most humans find relatively easy to solve. While the test is based on visual information rather than words, it is a good proxy for legal reasoning, which requires a good deal of abstraction. (You can find an example here.)
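For readers who prefer not to click through, here is a toy example, in Python, in the spirit of an ARC task. It is not an actual ARC puzzle: the grids and the hidden rule (mirror the grid left to right) are my own invented illustration. The solver’s job is to abstract the rule from the training pairs and apply it to a new input.

```python
# A toy task in the spirit of ARC (not an actual ARC puzzle). Grids are lists
# of integers, where each integer stands for a colour. The hidden rule in this
# invented example is "mirror the grid left to right"; a human infers it from
# the training pairs and applies it to the unseen test input.

train_pairs = [
    ([[1, 0, 0],
      [2, 0, 0]],
     [[0, 0, 1],
      [0, 0, 2]]),
    ([[0, 3, 0],
      [4, 0, 0]],
     [[0, 3, 0],
      [0, 0, 4]]),
]

def mirror_left_right(grid):
    """The rule a human would abstract from the examples above."""
    return [list(reversed(row)) for row in grid]

# The inferred rule explains every training pair ...
assert all(mirror_left_right(inp) == out for inp, out in train_pairs)

# ... so apply it to a previously unseen test input.
test_input = [[5, 0, 6],
              [0, 7, 0]]
print(mirror_left_right(test_input))  # [[6, 0, 5], [0, 7, 0]]
```

The catch, and a big part of why LLMs score so poorly, is that each real ARC task hides a different rule, so there is no memorized pattern to fall back on; the rule has to be abstracted fresh every time.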
Not that hard, right? The fact that AI consistently fails these tests reveals a fundamental gap in reasoning capabilities. But it also shows that there is a lot of room for improvement, and an immense amount of room still for human reasoning and effort, especially in a subject as complex as law. LLMs excel at parsing data and summarizing at impressive speeds, but when it comes to genuine, thorough reasoning, they fall short.
Why do users get fooled into thinking that LLMs are reasoning?
AI’s performance on legal tasks might seem to contradict its shortcomings on ARC. But remember that most modern AI tools simply predict the next word in a sequence. This looks like convincing reasoning for two reasons: first, we communicate through language, and statistical language prediction is exactly what LLMs excel at; second, LLMs have been specifically trained to produce pleasing output that the reader is likely to accept.
While this results in a tool that can convince users it is engaging in reasoning, this approach falls short when faced with novel situations or tasks requiring true abstract reasoning. LLM technology lacks built-in mechanisms for logical reasoning or causal understanding, relying instead on pattern recognition based on training data. AI researchers are working on other models that can do better on reasoning tasks, but those are beyond the scope of this article and aren’t broadly available yet anyway.
What does this mean for lawyers?
The limitations of AI in abstract reasoning underscore the continued necessity of human lawyers. While LLM-based AI can process vast amounts of information and handle routine tasks efficiently, it cannot replace the creative thinking, ethical judgment, and emotional intelligence that define skilled attorneys. Complex litigation, nuanced negotiations and strategic planning all require levels of reasoning and contextual understanding that current AI systems cannot match.
Does this mean you shouldn’t use AI? No, not at all — as I mentioned, there are still some hugely useful capabilities. At my firm, we bear this in mind when we implement new tools for our lawyers. We ensure that everyone understands what AI can (and can’t) do, and that they use it to maximum effect while eliminating risks. We use AI to augment our lawyers, not to replace them, because the truth is, it simply can’t. It lacks the ability to conduct real reasoning — the type of reasoning that ensures clients get the outcomes they’re looking for.
Hor Druma is the senior manager and legal solutions architect at Fasken. A former litigator turned innovator, Hor combines his legal knowledge and technical expertise to innovate legal services. He is a strong believer in the potential impact of AI but believes every tool has its place.
The opinions expressed are those of the author(s) and do not necessarily reflect the views of the author’s firm, its clients, LexisNexis Canada, Law360 Canada or any of its or their respective affiliates. This article is for general information purposes and is not intended to be and should not be taken as legal advice.