In conjunction with Stanford’s Regulation, Evaluation, and Governance Lab, we’re excited to present the Overruling dataset: a benchmark for the task of determining whether a sentence overrules a prior decision. This is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences extracted from legal opinions. In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, either by a constitutionally valid statute or by a decision of the same or a higher-ranking court that establishes a different rule on the point of law involved. The Overruling dataset consists of 2,400 sentences.
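Because the task is binary classification, a model's predictions can be scored with the usual accuracy, precision, recall, and F1 metrics. The following is a minimal sketch, assuming labels are encoded as 1 for overruling and 0 for non-overruling. That encoding, and the names `BinaryMetrics` and `score_predictions`, are chosen here for illustration and are not taken from the dataset itself.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def score_predictions(gold: Sequence[int], pred: Sequence[int]) -> BinaryMetrics:
    """Score binary predictions, treating label 1 (overruling) as the positive class.

    Assumes every label is 0 or 1; this encoding is an illustrative assumption.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one example")

    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # overruling, caught
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # flagged in error
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # overruling, missed
    tn = len(pairs) - tp - fp - fn                      # everything else

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return BinaryMetrics(accuracy, precision, recall, f1)
```

For a citator-style application, recall on the positive class is likely the number to watch, since a missed overruling sentence means a case may be wrongly treated as good law.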
Casetext constructed this dataset by selecting positive overruling samples through manual annotation by attorneys and negative samples by randomly sampling sentences from the Casetext law corpus. This procedure has a low false positive rate for negative samples because overruling sentences are rare across the law as a whole: fewer than 1% of cases overrule another case, and within those cases usually only a single sentence contains overruling language. Casetext validates this procedure by estimating the false positive rate on a subset of sentences randomly sampled from the law, then extrapolating this rate to the whole set of random samples to determine the proportion of sampled sentences that human reviewers should check for quality assurance.
The Overruling task is important for lawyers because verifying that cited authorities are still valid and that cases have not been overruled is critical to ensuring the validity of legal arguments. This need has led to the broad adoption of proprietary systems, such as Shepard’s (on Lexis Advance) and KeyCite (on Westlaw), as well as SmartCite (Casetext’s AI-based Citator system). High language model performance on the Overruling task could enable further automation of the identification of cases that are no longer good law.
You can download the Overruling dataset here. To learn more about how current models perform on the Overruling dataset, see this recent work.