A recent study on OpenAI’s GPT-4, the advanced large language model powering CoCounsel, is turning heads. 

The paper takes a deep dive into GPT-4’s 90th-percentile score on the Uniform Bar Exam (UBE) and has garnered heavy media attention, with some of the biggest names in news and entertainment taking notice. From The Wall Street Journal and The New York Times to The Late Show with Stephen Colbert, everyone seems to be talking about the implications of the research.

We’re able to share particular insight into and reflections on this milestone because two of our Casetext team were study co-authors. Pablo Arredondo, our co-founder, Chief Innovation Officer, and a fellow at Stanford’s Center for Legal Informatics (Stanford CodeX), collaborated with OpenAI to co-author the study on GPT-4’s UBE performance, along with Casetext senior machine learning researcher Shang Gao; Daniel Katz, a Professor at Illinois Tech—Chicago Kent College of Law; and Michael Bommarito, Professor and Head of Research at Reinvent Law Laboratory at Michigan State College of Law. 

GPT-4’s predecessor was GPT-3.5, which powered ChatGPT, the application released in November 2022 that took the world by storm. OpenAI announced GPT-4 less than six months later, in mid-March 2023, stating its newest, most advanced large language model is far more accurate and capable than GPT-3.5 was. The study analyzes their respective performances on the UBE, which is just one marker highlighting the vast differences between the models’ capabilities.

GPT-4 didn’t just pass the bar exam: it scored in the 90th percentile

GPT-4 is the first AI to pass the bar exam, scoring in the 90th percentile on both the multiple-choice (MBE) and written (MEE and MPT) portions of the UBE. By comparison, ChatGPT (i.e., GPT-3.5), whose performance on the MBE alone was analyzed by Katz and Bommarito late in 2022, scored in the 10th percentile. 

Arredondo et al.’s study also broke down GPT-4’s scores on each section of the UBE—the MBE, MEE, and MPT—and compared them to ChatGPT’s section scores. Most notably, GPT-4’s biggest increase was on the MBE, with a score of 158 points, up from ChatGPT’s 116. 

GPT-4’s 298-point overall score is 25 points higher than Arizona’s minimum passing score of 273, which is the highest threshold among the 36 states and jurisdictions using the test. A combined score of 266 is enough to pass in several states, including New York, New Jersey, and Illinois.

While ChatGPT passed the evidence and torts sections of the MBE, the model failed the MBE as a whole. With an overall score of 213 points out of 400, ChatGPT was correct just over 50% of the time, while real test-takers answered 68% of the questions correctly. This led many to speculate the AI would remain far from ready for professional use for some time, perhaps even years. Its failure was analogous to a bar exam candidate’s failure: it simply wasn’t qualified to be relied upon in the practice of law.

But less than three months later, GPT-4 passed all sections of the UBE with an accuracy rate of 74.5%. That is 6.5 percentage points above the 68% average of real test-takers, or roughly 9.5% higher in relative terms. GPT-4’s overall score was 298 points, 85 points higher than ChatGPT’s score of 213. This significant jump demonstrates just how much progress has been made in a relatively short amount of time, and indicates GPT-4 is far more powerful than its predecessor, surpassing the majority of bar candidates, who spend three intense years studying law, followed by two rigorous months of pre-bar study.
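For readers who want to check the comparisons quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the numbers are the ones reported in this article, and the variable and function names are our own assumptions, not anything taken from the study.

```python
# Figures as quoted in this article; all names here are illustrative.
GPT4_SCORE = 298        # GPT-4 overall UBE score
CHATGPT_SCORE = 213     # ChatGPT score quoted above
GPT4_ACCURACY = 74.5    # percent of questions answered correctly
HUMAN_ACCURACY = 68.0   # average for real test-takers, in percent

# Minimum passing scores mentioned in the article.
PASSING_SCORES = {
    "Arizona": 273,
    "New York": 266,
    "New Jersey": 266,
    "Illinois": 266,
}


def relative_gain(new: float, baseline: float) -> float:
    """Return how much higher `new` is than `baseline`, as a percent of `baseline`."""
    return (new - baseline) / baseline * 100


# Gap between the two models' overall scores, in points.
score_gap = GPT4_SCORE - CHATGPT_SCORE

# Gap between GPT-4 and human accuracy, in percentage points.
accuracy_gap_points = GPT4_ACCURACY - HUMAN_ACCURACY

# The same gap expressed relative to the human average.
accuracy_gap_relative = relative_gain(GPT4_ACCURACY, HUMAN_ACCURACY)

# How far GPT-4's score sits above each jurisdiction's passing threshold.
margins = {state: GPT4_SCORE - cutoff for state, cutoff in PASSING_SCORES.items()}

if __name__ == "__main__":
    print(f"Score gap: {score_gap} points")
    print(f"Accuracy gap: {accuracy_gap_points:.1f} percentage points")
    print(f"Relative accuracy gain: {accuracy_gap_relative:.1f}%")
    for state, margin in margins.items():
        print(f"{state}: {margin} points above the passing score")
```

Separating the percentage-point gap from the relative gain makes it clear which of the two a given headline figure refers to.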

Why does passing the bar matter for AI?

An AI that can pass the bar is without a doubt impressive, and perhaps that’s obvious. It’s worth thinking about what precisely is impressive about it, though. The results chart above tells this story: GPT-2, at the far left, provided few if any correct answers, and each subsequent model has answered more questions correctly (more or less). The differences among any of these models seem to be ones of degree—that is, until GPT-4 exceeded the “passing” threshold.

Much of the news coverage about GPT-4 reflects this perspective, characterizing the model as, among other things, more precise, more accurate, and more expert than ChatGPT. But as Arredondo said during a webinar hosted by Legaltech, “I cannot stress enough how much better this new model is than anything we’ve seen before.” So much better, in fact, that GPT-4 is different in kind—it’s a step change. “We are now in a new age … where computers have, essentially, literacy,” Arredondo continued. “To my mind it’s not the generation of text that’s so important. It’s that these large language models are now capable of reading the text, interpreting it, classifying it, analyzing it, and doing all sorts of other things that are so key to the practice of law.”

GPT-4 isn’t just capable of doing more of what ChatGPT could do. It’s capable of doing things ChatGPT couldn’t do. GPT-4 grasps the deep structures of language, an understanding necessary for making sense of nuance and subtlety, recognizing humor, and “reading between the lines.” This is the difference between a model that can do interesting and entertaining things and a model that can power solutions suitable for professional use. This is the before-and-after moment we’re in. 

So “passing every section of the bar” matters because only a model with this facility with language could get enough correct answers to pass. The bar exam is the hurdle people must clear before they’re deemed ready to practice law. And now, for the first time, an AI can clear that hurdle, too.

Changing more than the practice of law—increasing access to justice

Passing the bar is only one requirement for practicing law, which is why GPT-4’s hitting this milestone does not mean it can replace lawyers. Casetext’s Gao, an AI engineer who co-authored the study, explained that “by passing the bar, GPT-4 has demonstrated reasoning and comprehension capabilities previous models are unable to match, consistently performing at an unprecedented level a wide range of tasks that are challenging even for humans.” He continued, “And we’re already seeing this directly translate into time saved for lawyers across the diverse set of skills we’ve built out in CoCounsel.”

And that’s why the most compelling takeaway about this new model is not what it can do, but what it enables. Combined with subject matter expertise, product design, and necessary security and privacy measures, such as the systems we’ve developed at Casetext, GPT-4 is the engine powering a variety of applications across industries as diverse as fintech, healthcare, and education—and law. It’s what our new AI legal assistant, CoCounsel, is built on. 

Even CoCounsel, with its specialized focus, cannot replace lawyers—and isn’t meant to—but rather reliably and securely performs an array of tasks fundamental to legal practice. CoCounsel gives lawyers more time for things a machine can’t do—like thinking creatively, devising and applying strategies, and building and strengthening relationships with colleagues and clients—in the service of deepening their expertise, growing their practice, and serving more people’s legal needs. And it’s all possible because we’ve at last hit this extraordinary tipping point where “It’d be irresponsible for me to trust this product with my professional legal work” is fast becoming “It’d be irresponsible for me not to.” 

CoCounsel has been described as a “force multiplier,” and according to Arredondo, this latest advance in the underlying model is “the most important thing that could happen for access to justice, because it amplifies what a single attorney can do.” This impact is particularly pronounced for the legal aid community, whose resources more often than not are severely limited. “We are profoundly failing to offer anything close to the ideals of just, speedy, and inexpensive resolution, and part of it is that it’s a lot of work to do that, it takes a lot of work to bring justice,” Arredondo concluded during the Legaltech webinar. “Having an AI that’s now sophisticated enough to provide a lot of that lift, I think enables now a lot more representation.”

