Forget The Hype… What Does Good Legal AI Even Look Like?


The Robot Lawyer revolution remains on hold.
Despite advances in generative AI and growing adoption throughout the legal industry, we must all continue to wait for a lawyerly version of Skynet to become fully self-aware… and sign its own surrender deal with the Trump administration.
But in the meantime, legal applications of generative AI continue to produce remarkable, task-focused, time-saving tools. As developers work to bring the latest advancements in LLMs to bigger and better legal applications, what does it even look like to build something that works in law?
And, no, it’s not “feed it all court cases” unless you’re a deeply unserious person.
Thomson Reuters CTO Joel Hron opened up in a recent article about the company’s approach to benchmarking large language models as it builds out its AI offering. It’s a detailed yet approachable account of the philosophical and practical concerns that go into melding the near-daily shifts of the generative AI world into a coherent product that approaches tasks in a way that produces usable results for attorneys.
One might think that one of the biggest factors in building a more sophisticated tool is being able to handle more content. And that’s true to a point. That said, as Hron points out, tokens ain’t everything, and simply stuffing a million tokens into a model isn’t a magic spell for accuracy:
When GPT-4 was first released in 2023, it featured a context window of 8K tokens, equivalent to approximately 6,000 words or 20 pages of text. To process documents longer than this, it was necessary to split them into smaller chunks, process each chunk individually, and synthesize the final answer. Today, most major LLMs have context windows ranging from 128K to over 1M tokens. However, the ability to fit 1M tokens into an input window does not guarantee effective performance with that much text. Often, the more text included, the higher the risk of missing important details. To ensure CoCounsel’s effectiveness with long documents, we’ve developed rigorous testing protocols to measure long context effectiveness.
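For the non-developers in the audience, that “split, process, synthesize” workaround is worth sketching out, since it comes back later in the piece. In rough, generic terms it looks something like the following (the call_llm() helper and the chunk size are hypothetical stand-ins, not CoCounsel’s actual pipeline):

```python
# Rough, generic sketch of the "split, process, synthesize" pattern described
# above. The call_llm() helper and the chunk size are hypothetical stand-ins,
# not CoCounsel's actual pipeline.

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM API you happen to be using."""
    raise NotImplementedError

def split_into_chunks(text: str, max_chars: int = 24_000) -> list[str]:
    """Naive splitter: break a long document into roughly equal pieces."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def answer_over_long_document(question: str, document: str) -> str:
    # 1. Split the document so each piece fits the model's context window.
    chunks = split_into_chunks(document)

    # 2. Process each chunk individually, pulling out anything relevant.
    partials = [
        call_llm(
            f"Question: {question}\n\nExcerpt:\n{chunk}\n\n"
            "List any passages in this excerpt relevant to the question."
        )
        for chunk in chunks
    ]

    # 3. Synthesize the per-chunk findings into one final answer.
    findings = "\n\n".join(partials)
    return call_llm(
        f"Question: {question}\n\nFindings from each section:\n{findings}\n\n"
        "Synthesize a single final answer."
    )
```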
More text means more chances to miss the important detail: it’s a paradox that every lawyer on the wrong end of an irrelevant document dump knows well. Yet without sifting through everything in the kitchen sink, an attorney can’t get comfortable that nothing slipped past.
So developers need to get AI to a place where it can handle large amounts of information without missing the most important point. But legal work is rarely about plucking a single needle from a haystack so much as identifying material strewn throughout a mass of text.
Our initial benchmarks measure LLM performance across key capabilities critical to our skills. We use over 20,000 test samples from open and private benchmarks covering legal reasoning, contract understanding, hallucinations, instruction following, and long context capability. These tests have easily gradable answers (e.g., multiple-choice questions), allowing for full automation and easy evaluation of new LLM releases.
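The “easily gradable” bit is doing a lot of work there: with multiple-choice answers, grading 20,000 samples collapses into a comparison loop along these lines (a generic sketch with a made-up item format and ask_model() helper, not Thomson Reuters’ actual harness):

```python
# Generic sketch of fully automated grading for multiple-choice benchmark
# items. The item format and ask_model() helper are hypothetical, not
# Thomson Reuters' actual test harness.

from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    prompt: str          # the question plus its answer choices
    correct_choice: str  # e.g., "B"

def ask_model(prompt: str) -> str:
    """Placeholder: returns the model's chosen letter for a question."""
    raise NotImplementedError

def run_benchmark(items: list[BenchmarkItem]) -> float:
    """Score a model on a set of multiple-choice items; returns accuracy."""
    correct = sum(
        ask_model(item.prompt).strip().upper() == item.correct_choice
        for item in items
    )
    return correct / len(items)
```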
Thomson Reuters employs a multi-LLM approach, so it’s not evaluating potential AI “engines” as a binary “use/don’t use” test, but figuring out which tasks a given model is well-suited or ill-suited to perform and adjusting its role within the “secret sauce” accordingly. It’s not about crowning a winner, but about building a functional team.
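If you want to picture what that “team” looks like in code, the simplest mental model is a routing table: each task type goes to whichever model benchmarked best for it. The task names and model labels below are invented for illustration:

```python
# The simplest possible version of a multi-LLM "team": a routing table that
# sends each task type to whichever model benchmarked best for it. The task
# names and model labels here are invented for illustration.

MODEL_FOR_TASK = {
    "long_document_summary": "model_with_strong_long_context",
    "contract_clause_extraction": "model_with_strong_instruction_following",
    "quick_classification": "cheap_fast_model",
}

def pick_model(task_type: str) -> str:
    # Fall back to a general-purpose default for tasks nobody has benchmarked yet.
    return MODEL_FOR_TASK.get(task_type, "general_purpose_default")
```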
For our long context benchmarks, we use tests from LOFT, which measures the ability to answer questions from Wikipedia passages, and NovelQA, which assesses the ability to answer questions from English novels. Both tests accommodate up to 1M input tokens and measure key long context capabilities critical to our skills, such as multihop reasoning (synthesizing information from multiple locations in the input text) and multitarget reasoning (locating and returning multiple pieces of information). These capabilities are essential for applications like interpreting contracts or regulations, where the definition of a term in one part of the text determines how another part is interpreted or applied.
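To make “multihop reasoning” concrete, here’s a toy example of the kind of question those benchmarks pose. The contract language is invented for illustration, not drawn from LOFT or NovelQA:

```python
# Toy illustration of a "multihop" test item: answering correctly requires
# combining a definition from one part of a document with a clause from
# another. The contract language is invented for illustration.

multihop_item = {
    "passages": [
        "Section 1. 'Confidential Information' means any non-public "
        "pricing data disclosed by either party.",
        "Section 12. The Recipient shall destroy all Confidential "
        "Information within 30 days of termination.",
    ],
    "question": (
        "After termination, how long does the Recipient have to destroy "
        "non-public pricing data?"
    ),
    # Hop 1: pricing data qualifies as Confidential Information (Section 1).
    # Hop 2: Confidential Information must be destroyed within 30 days (Section 12).
    "answer": "30 days",
}
```

Multitarget reasoning is the sibling problem: the answer lives in several places at once, and the model has to find and return all of them.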
After this round of testing, they run the models through skill-specific tests designed to mimic rubber-meets-road legal tasks:
Once a skill flow is fully developed, it undergoes evaluation using LLM-as-a-judge against attorney-authored criteria. For each skill, our team of attorney subject matter experts (SMEs) has generated hundreds of tests representing real use cases. Each test includes a user query (e.g., “What was the basis of Panda’s argument for why they believed they were entitled to an insurance payout?”), one or more source documents (e.g., a complaint and demand for jury trial), and an ideal minimum viable answer capturing the key data elements necessary for the answer to be useful in a legal context. Our SMEs and engineers collaborate to create grading prompts so that an LLM judge can score skill outputs against the ideal answers written by our SMEs. This is an iterative process, where LLM-as-a-judge scores are manually reviewed, grading prompts are adjusted, and ideal answers are refined until the LLM-as-a-judge scores align with our SME scores. More details on our skill-specific benchmarks are discussed in our previous post.
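For anyone who hasn’t run into the “LLM-as-a-judge” pattern, the skeleton is simple: a grading model compares the skill’s output against the attorney-written ideal answer and returns a score. Here’s a bare-bones, generic sketch (the prompt and function names are placeholders, not Thomson Reuters’ actual grading prompts):

```python
# Bare-bones, generic sketch of LLM-as-a-judge: a grading model scores a
# skill's output against an attorney-written ideal answer. The prompt and
# function names are placeholders, not Thomson Reuters' grading prompts.

def call_judge_llm(prompt: str) -> str:
    """Placeholder for the grading model's API."""
    raise NotImplementedError

GRADING_PROMPT = """You are grading a legal research answer.
Question: {question}
Ideal minimum viable answer (written by an attorney): {ideal_answer}
Candidate answer: {candidate_answer}

Does the candidate answer contain every key element of the ideal answer?
Reply with only a score from 1 (missing key elements) to 5 (complete and accurate)."""

def judge(question: str, ideal_answer: str, candidate_answer: str) -> int:
    reply = call_judge_llm(GRADING_PROMPT.format(
        question=question,
        ideal_answer=ideal_answer,
        candidate_answer=candidate_answer,
    ))
    return int(reply.strip()[0])  # crude parse; real harnesses validate the output
```

The iterative calibration Hron describes, tweaking the grading prompts until the judge’s scores line up with the attorneys’, is the step that keeps this from being one robot grading another on vibes.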
A takeaway from this process is that the advertised context windows from LLM designers don’t necessarily pan out in complex legal work. In fact, models with smaller windows can perform better when it comes to complex tasks because larger context windows can lose effectiveness as they get stretched. For this reason, Thomson Reuters still employs a “split and synthesize” approach for some documents to avoid this problem.
When you look at the advertised context window for leading models today, don’t be fooled into thinking this is a solved problem. It is exactly the kind of complex, reasoning-heavy real-world problem where that effective context window shrinks. Our challenge to the model builders: keep stretching and stress-testing that boundary!
After all this, human subject matter experts perform a manual review to catch the nuanced issues that the automated layers might miss.
And that’s how they build an AI infrastructure on a multi-LLM strategy. It’s a buddy cop show: one AI is the straight-laced, by-the-book type, the other’s the unorthodox loose cannon. Together, they solve crimes… or at least contract reviews.