Search Results Evaluation Efforts at Casetext


To improve our ranking algorithms, we need to measure how effective our new models or systems are in comparison with our existing ones. We have tried several approaches, including qualitative and quantitative evaluation frameworks. One qualitative framework we have experimented with consists of showing expert attorneys, on the same webpage, two or more ranked lists produced by different systems or ranking models for a given query, and then having them choose the result list they prefer. One problem with this approach is that it does not make it easy to iterate rapidly when fine-tuning models. To achieve rapid iteration through automatic evaluation of our new ranking models and systems, one of the main solutions used in the field of information retrieval is a quantitative evaluation framework based on static test collections. In the remainder of this article, I will explain how we use that framework at Casetext.

Test collections

To determine how effective a search system or a retrieval model is, we typically need a test collection. A test collection comprises a representative set of queries, together with ranked lists of documents generated by various search systems for each query. Since some queries can return thousands or even millions of documents, we minimize the set of documents assessed by our attorney colleagues by pooling only some of the documents from each search system. The pooled documents are then graded by those assessors, and finally these relevance assessments are used to compute mathematical evaluation measures.


For a given search query, the search engine may return thousands or even millions of documents. Assessing every one of those documents is nearly impossible, especially for a small startup like Casetext. To get a representative sample of documents assessed by our experts, we adopt a technique called pooling. Pooling is a popular solution adopted by several research institutions. For example, the National Institute of Standards and Technology (NIST) in the United States has used it for many test collections, such as that of the TREC Legal track, a legal information retrieval task at the Text REtrieval Conference (TREC), co-sponsored by NIST between 2006 and 2012. Only the top-k results returned by each search system are included in the set of documents to be assessed, where k could be 10, 20, 100 or any manageable number. In theory, the pool can comprise all the documents from the ranked lists of all available search systems, as long as the total remains manageable. But it is not necessary to pool all documents in order to obtain a reliable effectiveness measure.
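As a minimal sketch of top-k pooling, assuming each system's output for a query is simply an ordered list of document ids (the function name `pool_top_k`, the system names and the ids are hypothetical, chosen for illustration only):

```python
from typing import Dict, List, Set


def pool_top_k(runs: Dict[str, List[str]], k: int) -> Set[str]:
    """Return the union of the top-k document ids from each system's ranked list."""
    pooled: Set[str] = set()
    for ranked_docs in runs.values():
        # Only the first k results of each run enter the pool.
        pooled.update(ranked_docs[:k])
    return pooled


# Hypothetical ranked lists from two systems for a single query.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
}
print(sorted(pool_top_k(runs, k=2)))  # ['d1', 'd2', 'd3', 'd5']
```

Because the pool is a set, a document returned by several systems is assessed only once.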

In the future, we could consider alternative pooling strategies that do not simply pool the top-k documents for assessment, but instead give higher-ranked documents in a ranked list a higher probability of being sampled. One simple way of achieving this would be to loop through the documents starting from the highest ranked, algorithmically flipping a coin at every step to decide whether to include the document, until k documents are selected.
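The coin-flipping idea could be sketched as follows. This is an illustration rather than a description of any system in use: the function name, the `p_include` parameter (0.5 for a fair coin) and the choice to stop after a single pass, even if fewer than k documents were selected, are all assumptions.

```python
import random
from typing import List, Optional


def coin_flip_sample(
    ranked_docs: List[str],
    k: int,
    p_include: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Walk the ranking from the top, keeping each document with probability p_include."""
    rng = rng or random.Random()
    selected: List[str] = []
    for doc_id in ranked_docs:
        if len(selected) >= k:
            break  # the sample is full, so lower-ranked documents are never considered
        if rng.random() < p_include:  # the "coin flip"
            selected.append(doc_id)
    # Assumption: a single pass only, so the result may hold fewer than k documents.
    return selected


# Hypothetical usage with a seeded generator for reproducibility.
sample = coin_flip_sample([f"d{i}" for i in range(1, 101)], k=10, rng=random.Random(0))
```

Higher-ranked documents are favored because the walk stops as soon as k documents are selected, so a document deep in the list is only reached when few documents above it were kept.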

Relevance Judgments

After sampling documents for each search query or information need, our assessors proceed to judge the selected documents. Being former or current legal professionals, our assessors are familiar with legal research and well positioned to provide good assessments. Each information need or search can be assigned to a single assessor, or to an odd number of assessors, in which case the final grade retained for a given document is the grade that receives the majority of the assessors' votes. Relevance assessments can be done in a binary fashion, where a document is either relevant or irrelevant with respect to an information need. They can also be done on a graded scale, in which case the assessor is asked to assign a relevance grade to a legal document given an information need. At Casetext, for example, we use 0 for irrelevant, 1 for somewhat relevant, 2 for relevant, and 3 for exactly on-point.
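A minimal sketch of the vote aggregation, assuming grades on the 0 to 3 scale above. On a graded scale, an odd number of assessors can still disagree completely (for example 0, 1 and 3), and the text does not say how such cases are resolved; falling back to the median grade is an assumption made here for illustration, as is the function name.

```python
from collections import Counter
from statistics import median_low
from typing import List


def majority_grade(grades: List[int]) -> int:
    """Return the most frequently assigned grade for one document."""
    if not grades:
        raise ValueError("at least one grade is required")
    counts = Counter(grades).most_common()
    best_count = counts[0][1]
    winners = [grade for grade, count in counts if count == best_count]
    if len(winners) == 1:
        return winners[0]
    # Assumption: when no single grade wins, fall back to the median grade.
    return median_low(grades)


print(majority_grade([2, 3, 2]))  # 2
print(majority_grade([0, 1, 3]))  # 1 (no winner, median fallback)
```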

Evaluation measures

It has been widely suggested that for most lawyers, the most important measure is recall. Recall is the ratio of the number of relevant documents that a search system retrieves to the number of relevant documents in the test collection. The higher the recall a system obtains, the more complete its result set is. This is an important measure because lawyers want as much and as complete information as possible, since information can be viewed as the ability to reduce uncertainty. Recall has been viewed as crucial because in American case law, it is the lawyer’s duty to know all information relevant to their client’s case. Lawyers are thus liable for not being fully informed. Consequently, it seems on the surface that a system that does not maximize recall is a system that does not fulfill the minimum expectations.

However, precision is also a very important factor. Precision is the ratio of the number of relevant retrieved documents to the total number of retrieved documents; it measures the exactness of the result set. But it is essential not to focus too heavily on this number, since doing so can restrict the retrieved results to a smaller set that the system is absolutely certain about, and leave out many other relevant results. Some researchers and legal experts argue that what online legal researchers really need is the ability to find a few on-point legal documents effectively and fast, and then use these documents to discover other on-point cases (for instance through citation links). Thus there is a clear trade-off between precision and recall. For these reasons, information retrieval practitioners tend to use both of these measures, or a measure that combines precision and recall in a balanced way, such as the F-score.
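The three measures can be computed from a set of retrieved documents and a set of judged-relevant documents. The sketch below uses binary relevance and the balanced F-score (F1, the harmonic mean of precision and recall); the function name and the document ids are hypothetical.

```python
from typing import Iterable, Set, Tuple


def precision_recall_f1(retrieved: Iterable[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for one query under binary relevance."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical query: 4 documents retrieved, 5 relevant in the collection, 2 in common.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5", "d6", "d7"})
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.500 0.400 0.444
```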

Furthermore, with research showing that searchers usually assess ranked results from top to bottom, many information retrieval experts strive to ensure that highly relevant documents are ranked at the top of the list. Thus good evaluation measures should account for the position of the document in the ranked list, and focus on judging only the top-10 or top-20 ranked documents.

Relevance judgments on a graded scale, as opposed to binary relevance judgments, are used for computing such measures. One example of such measures is nDCG. The normalized Discounted Cumulative Gain (nDCG) measure rewards documents with high relevance grades and discounts the gains of documents that are ranked at lower positions. For several experiments at Casetext, we have adopted nDCG@10 and nDCG@20. Another evaluation measure used with graded relevance judgments, the Expected Reciprocal Rank (ERR), is defined as the expected reciprocal length of time it takes the user to find a relevant document; it also takes into account the position of the document as well as the relevance of the documents shown above it.
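To make the two measures concrete, here is a minimal sketch of nDCG@k and ERR@k over a list of relevance grades given in ranked order. It assumes the commonly used exponential gain 2^grade − 1 with a log2 rank discount for nDCG, and the stopping probability (2^grade − 1) / 2^max_grade for ERR; the article does not state which variants are used at Casetext, so these choices and all function names are assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(grades: Sequence[int], k: int) -> float:
    """Discounted cumulative gain of the first k grades (rank 1 is the top result)."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(ranked_grades: Sequence[int], judged_grades: Sequence[int], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal ordering of all judged grades."""
    ideal_dcg = dcg_at_k(sorted(judged_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def err_at_k(ranked_grades: Sequence[int], k: int, max_grade: int = 3) -> float:
    """Expected reciprocal rank at which the user stops, satisfied."""
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, g in enumerate(ranked_grades[:k], start=1):
        p_stop = (2 ** g - 1) / 2 ** max_grade  # chance this document satisfies the user
        err += p_reach * p_stop / rank
        p_reach *= 1 - p_stop
    return err


# Hypothetical grades (0 to 3) of the documents a system ranked for one query.
ranking = [2, 3, 0, 1]
print(ndcg_at_k(ranking, judged_grades=ranking, k=10))
print(err_at_k(ranking, k=10))
```

In both measures, a highly relevant document contributes less the lower it is ranked; in ERR it also contributes less when the documents above it are already likely to have satisfied the user.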

Beyond Topical Relevance

Thus far, we have been using the concept of relevance to measure how on-point a document is, given a query. This notion is very much tied to the concept of topicality: the assessor would grade a document as exactly on-point if the document covers the topic of the query or information need. Other important dimensions are not necessarily accounted for. Examples of such dimensions are: legal issues, the party that the user is representing (e.g. defense or prosecution), relevant jurisdictions, relevant causes of action, relevant motion types and seminality. It would be immensely difficult to create an evaluation framework that accounts for every single one of these dimensions.

One of the approaches we are considering at Casetext for factoring these dimensions into the evaluation measure is to first identify the most important dimensions in addition to topicality (e.g. seminality and relevant jurisdiction). Next, we would modify the topicality-focused relevance judgment so that relevance grades can be increased by one when the legal case is either a seminal case or a case from a relevant jurisdiction. In the example above, where 0 is for irrelevant cases, 1 for somewhat relevant cases, 2 for relevant cases, and 3 for exactly on-point cases, we would now assign a grade of 4 to cases that are both exactly on-point topically and from a relevant jurisdiction. We would then assign a grade of 5 to documents that are also seminal cases, in addition to being both from the relevant jurisdiction and on-point topically.
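One possible reading of this scheme is sketched below. The text only gives examples for exactly on-point cases, so two points here are assumptions: each extra dimension adds one to any non-zero topical grade, and topically irrelevant cases stay at 0. The function name is hypothetical.

```python
def adjusted_grade(topical_grade: int, relevant_jurisdiction: bool, seminal: bool) -> int:
    """Raise a 0-3 topical grade by one for each extra dimension the case satisfies."""
    if not 0 <= topical_grade <= 3:
        raise ValueError("topical_grade must be between 0 and 3")
    if topical_grade == 0:
        # Assumption: an irrelevant case gains nothing from being seminal or in-jurisdiction.
        return 0
    return topical_grade + int(relevant_jurisdiction) + int(seminal)


print(adjusted_grade(3, relevant_jurisdiction=True, seminal=False))  # 4
print(adjusted_grade(3, relevant_jurisdiction=True, seminal=True))   # 5
```

If such grades were fed into a graded measure like nDCG or ERR, the maximum grade used by the measure would need to be raised from 3 to 5 accordingly.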

An even better way to judge legal documents could be to assess them in terms of their usefulness in a search session, rather than their topical relevance. This concept of usefulness would help us assess a document not simply by how on-point it is with respect to the information need, but by how much it helps satisfy the user’s information need in a search session. Whereas assessing documents based on how on-point they are presumes that search is a sequence of unrelated events, usefulness-based assessment treats search as a dynamic information-seeking process that involves tasks and contexts. Usefulness-based assessment should therefore account for how a document seen in a previous search interaction in the same session can affect progress towards the overall goal or a sub-goal of the task. A search session is a sequence of interactions between a searcher and a search engine. During each search interaction, the searcher provides a query, gets a ranked list of documents as a result, examines the snippets of some or all documents, and then clicks and reads some or all documents with the purpose of learning more about a specific topic. Usefulness, as referred to here, is a more general concept than relevance, and it encompasses various factors such as the number of steps to complete a sub-goal, the reading time of ranked documents, the user’s actions to save, highlight, copy with citation, bookmark, revisit, classify and use documents, and explicit judgments such as relevance and usefulness grades.

Factoring in how much the system helps legal professionals learn about the topic of their search is another challenging axis of research we will be considering in the future. Much of legal precedence retrieval is concerned with finding prior cases relevant to a legal search query, and the goal of searchers is to find pieces of information that will help them learn more about their topic. Evaluating a search-for-learning system is much more challenging than simply evaluating with respect to a single query. Since the search task spans more than one search query, it would be more beneficial to evaluate queries in the same search session non-independently, and determine how much the system helps the searcher learn at each step.

To properly evaluate search systems that are meant to help searchers learn throughout a search session, one has to understand the concept of learning. Although there is no universally accepted definition, learning can be defined as the process of acquiring information that updates a person’s state of knowledge, either by providing her with new information or by strengthening what she already knows. Various techniques could be devised to measure search-for-learning. One technique proposed by some researchers consists of asking searchers to demonstrate what they have learned by writing a summary. After the summaries are written, the researchers count either how many facts and statements each summary contains or how many subtopics it covers. Researchers have also devised evaluation techniques based on Bloom’s taxonomy, which describes what students are expected to learn as a result of instruction. Bloom’s taxonomy comprises six stages of cognitive process: remembering, understanding, applying, analyzing, evaluating and creating. Using the same summary-writing technique, these researchers proposed to capture the “understanding” component by measuring the quality of facts recalled in the text, and the “analysis” component by assessing the interpretation of facts into statements. Finally, they proposed to capture the “evaluation” component by identifying statements that either compared facts or used facts to challenge other facts. However, these evaluation techniques are arduous and require a lot of human effort. Less arduous and more efficient evaluation techniques have yet to be proposed.

Alongside the research field of search evaluation, we will be investigating and striving to adopt the best and most efficient evaluation techniques to measure how well our search engine supports our users in their search tasks.


Anderson, John Robert. Learning and memory: An integrated approach. John Wiley & Sons Inc, 2000.

Anderson, Lorin W., and David R. Krathwohl. “A Taxonomy for learning, teaching, and assessing: A revision of Bloom’s Taxonomy of.”

Berring, Robert C. “Full-text databases and legal research: Backing into the future.” High Technology Law Journal 1, no. 1 (1986): 27-60.

Dabney, Daniel P. “The curse of Thamus: An analysis of full-text legal document retrieval.” Law Libr. J. 78 (1986): 5.

Klir, George J. Uncertainty and Information: Foundations of Generalized Information Theory. Hoboken, NJ: John Wiley & Sons, 2005.

Mandal, Arpan, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Saptarshi Ghosh. “Overview of the FIRE 2017 IRLeD Track: Information Retrieval from Legal Documents.”

Maxwell, K. Tamsin, and Burkhard Schafer. “Concept and Context in Legal Information Retrieval.” In JURIX, pp. 63-72. 2008.

Thenmozhi, D., Kawshik Kannan, and Chandrabose Aravindan. “A Text Similarity Approach for Precedence Retrieval from Legal Documents.”

Wilson, Mathew J., and Max L. Wilson. “A comparison of techniques for measuring sensemaking and learning within participant‐generated summaries.” Journal of the American Society for Information Science and Technology 64, no. 2 (2013): 291-306.