Casetext CIO co-authors study on impressive UBE score by GPT-4, the AI powering CoCounsel

The new study is making waves in the media, including The Wall Street Journal, New York Times, and The Late Show with Stephen Colbert

A recent study on OpenAI’s GPT-4, the advanced large language model powering CoCounsel, is turning heads. 

The paper takes a deep dive into GPT-4’s 90th-percentile score on the Uniform Bar Exam (UBE) and has garnered heavy media attention, with some of the biggest names in news and entertainment taking notice. From The Wall Street Journal and The New York Times to The Late Show with Stephen Colbert, everyone seems to be talking about the implications of the research.

We’re able to share particular insight into and reflections on this milestone because two of our Casetext team were study co-authors. Pablo Arredondo, our co-founder, Chief Innovation Officer, and a fellow at Stanford’s Center for Legal Informatics (Stanford CodeX), collaborated with OpenAI to co-author the study on GPT-4’s UBE performance, along with Casetext senior machine learning researcher Shang Gao; Daniel Katz, a Professor at Illinois Tech—Chicago Kent College of Law; and Michael Bommarito, Professor and Head of Research at Reinvent Law Laboratory at Michigan State College of Law. 

GPT-4’s predecessor was GPT-3.5, which powered ChatGPT, the application released in November 2022 that took the world by storm. OpenAI announced GPT-4 less than six months later, in mid-March 2023, stating its newest, most advanced large language model is far more accurate and capable than GPT-3.5 was. The study analyzes their respective performances on the UBE, which is just one marker highlighting the vast differences between the models’ capabilities.

GPT-4 didn’t just pass the bar exam—it scored in the 90th percentile

GPT-4 is the first AI to pass the bar exam, scoring in the 90th percentile on both the multiple-choice (MBE) and written (MEE and MPT) portions of the UBE. By comparison, ChatGPT (i.e., GPT-3.5), whose performance on the MBE alone was analyzed by Katz and Bommarito late in 2022, scored in the 10th percentile. 

Arredondo et al.’s study also broke down GPT-4’s scores on each section of the UBE—the MBE, MEE, and MPT—and compared them to ChatGPT’s section scores. Most notably, GPT-4’s biggest increase was on the MBE, with a score of 158 points, up from ChatGPT’s 116. 

GPT-4’s 298-point overall score is 25 points higher than Arizona’s minimum passing score of 273, which is the highest threshold among the 36 states and jurisdictions using the test. A combined score of 266 is enough to pass in several states, including New York, New Jersey, and Illinois.

While ChatGPT passed the evidence and torts sections of the MBE, the model failed the MBE as a whole. With an overall score of 213 points out of 400, ChatGPT was correct just over 50% of the time, while real test-takers answered 68% of the questions correctly. This led many to speculate that the AI would remain far from ready for professional use for some time, perhaps even years. Its failure was analogous to a bar exam candidate’s failure: it simply wasn’t qualified to be relied upon in the practice of law.

But less than three months later, GPT-4 passed all sections of the UBE with an accuracy rate of 74.5%, roughly 9.5% higher in relative terms (6.5 percentage points) than the 68% average of real test-takers. GPT-4’s overall score was 298 points, 85 points higher than ChatGPT’s score of 213. This significant jump demonstrates just how much progress has been made in a relatively short amount of time, and indicates GPT-4 is far more powerful than its predecessor, surpassing the majority of bar candidates, who spend three intense years studying law, followed by two rigorous months of pre-bar study.
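For readers who want to check the numbers, the comparisons in this section come down to simple arithmetic. The short Python sketch below uses only the figures quoted above; the constant and function names are illustrative choices of ours, not anything taken from the study. It also separates a percentage-point gap from a relative gap, since the two are easy to conflate.

```python
# Figures quoted in this article. All names below are illustrative only.
GPT4_UBE = 298          # GPT-4 overall UBE score
CHATGPT_UBE = 213       # ChatGPT (GPT-3.5) overall score
ARIZONA_PASS = 273      # highest passing threshold cited
COMMON_PASS = 266       # threshold cited for New York, New Jersey, Illinois
GPT4_ACCURACY = 74.5    # percent of questions answered correctly
HUMAN_ACCURACY = 68.0   # average for real test-takers, in percent


def margin(score: float, threshold: float) -> float:
    """Points by which a score exceeds a threshold (negative if it falls short)."""
    return score - threshold


def point_gap(a: float, b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return a - b


def relative_gap(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


if __name__ == "__main__":
    print("Margin over Arizona threshold:", margin(GPT4_UBE, ARIZONA_PASS))
    print("Margin over 266 threshold:", margin(GPT4_UBE, COMMON_PASS))
    print("Gain over ChatGPT:", margin(GPT4_UBE, CHATGPT_UBE))
    print("Accuracy gap (points):", point_gap(GPT4_ACCURACY, HUMAN_ACCURACY))
    print("Accuracy gap (relative %):",
          round(relative_gap(GPT4_ACCURACY, HUMAN_ACCURACY), 1))
```

The margins of 25 and 85 points match the figures in the text. The accuracy gap is 6.5 percentage points, which works out to a little over 9.5% when expressed relative to the 68% human average.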

Why does passing the bar matter for AI?

An AI that can pass the bar is without a doubt impressive, and perhaps that’s obvious. It’s worth thinking about what precisely is impressive about it, though. The results chart above tells this story: GPT-2, at the far left, provided few if any correct answers, and each subsequent model has answered more questions correctly (more or less). The differences among any of these models seem to be ones of degree—that is, until GPT-4 exceeded the “passing” threshold.

Much of the news coverage about GPT-4 reflects this perspective, characterizing the model as, among other things, more precise, more accurate, and more expert than ChatGPT. But as Arredondo said during a webinar hosted by Legaltech, “I cannot stress enough how much better this new model is than anything we’ve seen before.” So much better, in fact, that GPT-4 is different in kind—it’s a step change. “We are now in a new age … where computers have, essentially, literacy,” Arredondo continued. “To my mind it’s not the generation of text that’s so important. It’s that these large language models are now capable of reading the text, interpreting it, classifying it, analyzing it, and doing all sorts of other things that are so key to the practice of law.”

GPT-4 isn’t just capable of doing more of what ChatGPT could do. It’s capable of doing things ChatGPT couldn’t do. GPT-4 grasps the deep structures of language, an understanding necessary for making sense of nuance and subtlety, recognizing humor, and “reading between the lines.” This is the difference between a model that can do interesting and entertaining things and a model that can power solutions suitable for professional use. This is the before-and-after moment we’re in. 

So “passing every section of the bar” matters because only a model with this facility with language could get enough correct answers to pass. The bar exam is the hurdle people must clear before they’re deemed ready to practice law. And now, for the first time, an AI can clear that hurdle, too.

Changing more than the practice of law—increasing access to justice

Passing the bar is only one requirement for practicing law, which is why GPT-4’s hitting this milestone does not mean it can replace lawyers. Casetext’s Gao, an AI engineer who co-authored the study, explained that “by passing the bar, GPT-4 has demonstrated reasoning and comprehension capabilities previous models are unable to match, consistently performing at an unprecedented level a wide range of tasks that are challenging even for humans.” He continued, “And we’re already seeing this directly translate into time saved for lawyers across the diverse set of skills we’ve built out in CoCounsel.”

And that’s why the most compelling takeaway about this new model is not what it can do, but what it enables. Combined with subject matter expertise, product design, and necessary security and privacy measures, such as the systems we’ve developed at Casetext, GPT-4 is the engine powering a variety of applications across industries as diverse as fintech, healthcare, and education—and law. It’s what our new AI legal assistant, CoCounsel, is built on. 

Even CoCounsel, with its specialized focus, cannot replace lawyers—and isn’t meant to—but rather reliably and securely performs an array of tasks fundamental to legal practice. CoCounsel gives lawyers more time for things a machine can’t do—like thinking creatively, devising and applying strategies, and building and strengthening relationships with colleagues and clients—in the service of deepening their expertise, growing their practice, and serving more people’s legal needs. And it’s all possible because we’ve at last hit this extraordinary tipping point where “It’d be irresponsible for me to trust this product with my professional legal work” is fast becoming “It’d be irresponsible for me not to.” 

CoCounsel has been described as a “force multiplier,” and according to Arredondo, this latest advance in the underlying model is “the most important thing that could happen for access to justice, because it amplifies what a single attorney can do.” This impact is particularly pronounced for the legal aid community, whose resources more often than not are severely limited. “We are profoundly failing to offer anything close to the ideals of just, speedy, and inexpensive resolution, and part of it is that it’s a lot of work to do that, it takes a lot of work to bring justice,” Arredondo concluded during the Legaltech webinar. “Having an AI that’s now sophisticated enough to provide a lot of that lift, I think enables now a lot more representation.”

© 2024 Casetext Inc., a part of Thomson Reuters
Casetext, part of Thomson Reuters, is not a law firm and does not provide legal advice.

Draft Correspondence

Rapidly draft common legal letters and emails.

How this skill works

  • Specify the recipient, topic, and tone of the correspondence you want.

  • CoCounsel will produce a draft.

  • Chat back and forth with CoCounsel to edit the draft.

Review Documents

Get comprehensive answers to your questions about a set of documents.

Legal Research Memo

Get answers to your research questions, with explanations and supporting sources.

How this skill works

  • Enter a question or issue, along with relevant facts such as jurisdiction, area of law, etc.

  • CoCounsel will retrieve relevant legal resources and provide an answer with explanation and supporting sources.

  • Behind the scenes, Conduct Research generates multiple queries using keyword search, terms and connectors, boolean, and Parallel Search to identify the on-point case law, statutes, and regulations, reads and analyzes the search results, and outputs a summary of its findings (i.e. an answer to the question), along with the supporting sources and applicable excerpts.

Prepare for a Deposition

Get a thorough deposition outline in no time, just by describing the deponent and what’s at issue.

How this skill works

  • Describe the deponent and what’s at issue in the case, and CoCounsel identifies multiple highly relevant topics to address in the deposition and drafts questions for each topic.

  • Refine topics by including specific areas of interest and get a thorough deposition outline.

Extract Contract Data

Ask questions of contracts that are analyzed in a line-by-line review.

How this skill works

  • Upload a set of contracts and a set of questions.

  • CoCounsel will answer each question for each contract or, if a question is not relevant to a contract, tell you so.

  • Upload up to 10 contracts at once.

  • Ask up to 10 questions of each contract.

  • Relevant results hyperlink to the identified passages in the corresponding contract.

Contract Policy Compliance

Get a list of all parts of a set of contracts that don’t comply with a set of policies.

How this skill works

  • Upload a set of contracts and then describe a policy or set of policies that the contracts should comply with, e.g. "contracts must contain a right to injunctive relief, not merely the right to seek injunctive relief."

  • CoCounsel will review your contracts and identify any contractual clauses relevant to the policy or policies you specified.

  • If there is any conflict between a contractual clause and a policy you described, CoCounsel will recommend a revised clause that complies with the relevant policy. It will also identify the risks presented by a clause that does not conform to the policy you described.

Summarize

Get an overview of any document in straightforward, everyday language.

How this skill works

  • Upload a document, e.g. a legal memorandum, judicial opinion, or contract.

  • CoCounsel will summarize the document using everyday terminology.

Search a Database

Find all instances of relevant information in a database of documents.

How this skill works

  • Select a database and describe what you're looking for in detail, such as templates and precedents to use as a starting point for drafting documents, or specific clauses and provisions you'd like to include in new documents you're working on.

  • CoCounsel identifies and delivers every instance of what you're searching for, citing sources in the database for each instance.

  • Behind the scenes, CoCounsel generates multiple queries using keyword search, terms and connectors, boolean, and Parallel Search to identify the on-point passages from every document in the database, reads and analyzes the search results, and outputs a summary of its findings (i.e. an answer to the question), citing applicable excerpts in specific documents.
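To make the "multiple queries, then analyze, then summarize with citations" idea above concrete, here is a toy Python sketch. It is not CoCounsel's implementation: every name in it is hypothetical, the "database" is an in-memory dictionary, matching is plain substring search, and Parallel Search and the language-model analysis step are not modeled at all. It only illustrates running several query styles over every passage, de-duplicating the hits, and reporting each one with its source document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A query is modeled as a predicate over a passage (a hypothetical simplification).
Query = Callable[[str], bool]


@dataclass(frozen=True)
class Hit:
    doc_id: str    # which document the passage came from
    excerpt: str   # the matching passage itself


def keyword_query(terms: List[str]) -> Query:
    """Match a passage that contains ANY of the terms (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return lambda passage: any(t in passage.lower() for t in lowered)


def boolean_and_query(terms: List[str]) -> Query:
    """Match a passage that contains ALL of the terms (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return lambda passage: all(t in passage.lower() for t in lowered)


def search(database: Dict[str, List[str]], queries: List[Query]) -> List[Hit]:
    """Run every query over every passage; return de-duplicated hits in document order."""
    hits: List[Hit] = []
    seen = set()
    for doc_id, passages in database.items():
        for passage in passages:
            if any(query(passage) for query in queries):
                hit = Hit(doc_id, passage)
                if hit not in seen:
                    seen.add(hit)
                    hits.append(hit)
    return hits


def summarize(hits: List[Hit]) -> str:
    """Produce a plain-text report that cites the source document of each excerpt."""
    lines = [f"{len(hits)} matching passage(s) found."]
    lines += [f"[{hit.doc_id}] {hit.excerpt}" for hit in hits]
    return "\n".join(lines)


if __name__ == "__main__":
    demo_db = {
        "nda.txt": [
            "Either party may seek injunctive relief.",
            "This agreement is governed by New York law.",
        ],
        "lease.txt": ["Tenant shall pay rent monthly."],
    }
    demo_queries = [
        keyword_query(["injunctive"]),
        boolean_and_query(["governed", "law"]),
    ]
    print(summarize(search(demo_db, demo_queries)))
```

Modeling each query as a predicate keeps the sketch small; a real system would rank results and hand them to a model for analysis rather than simply listing them.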
