Dataset Card for LegalBench

Homepage: https://hazyresearch.stanford.edu/legalbench/
Repository: https://github.com/HazyResearch/legalbench/
Paper: https://arxiv.org/abs/2308.11462

Dataset Description

Dataset Summary

The LegalBench project is an ongoing open science effort to collaboratively curate tasks for evaluating legal reasoning in English large language models (LLMs). The benchmark currently consists of 162 tasks gathered from 40 contributors.

If you have questions about the project or would like to get involved, please see the website for more information.

Supported Tasks and Leaderboards

LegalBench tasks span multiple types (binary classification, multi-class classification, extraction, generation, entailment), multiple types of text (statutes, judicial opinions, contracts, etc.), and multiple areas of law (evidence, contracts, civil procedure, etc.). For more information on tasks, we recommend visiting the website, where you can search through task descriptions, or the GitHub repository, which contains more granular task descriptions. We also recommend reading the paper, which provides more background on the significance of each task and how it was constructed.

Languages

All LegalBench tasks are in English.

Dataset Structure

Data Instances

Detailed descriptions of the instances for each task can be found in the GitHub repository. An example instance from the abercrombie task is provided below:

{
  "text": "The mark \"Ivory\" for a product made of elephant tusks.",
  "label": "generic",
  "idx": 0
}

A substantial number of LegalBench tasks are binary classification tasks, which require the LLM to determine if a piece of text has some legal attribute. Because these are framed as Yes/No questions, the label space is "Yes" or "No".
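Because the label space for these tasks is just "Yes" or "No", free-form model output has to be mapped onto one of the two labels before it can be scored. The snippet below is a minimal sketch of one way to do that, not the project's official evaluation code: the helper names `normalize_yes_no` and `accuracy` are hypothetical, and taking the first word of the generation is an assumption about how the model answers.

```python
from typing import List, Optional


def normalize_yes_no(generation: str) -> Optional[str]:
    """Map a raw generation to "Yes" or "No"; return None if neither leads the answer."""
    # Assumption: the model states its answer as the first word of its output.
    words = generation.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return "Yes"
    if words[0] == "no":
        return "No"
    return None


def accuracy(generations: List[str], labels: List[str]) -> float:
    """Fraction of generations whose normalized answer matches the gold label."""
    correct = sum(normalize_yes_no(g) == y for g, y in zip(generations, labels))
    return correct / len(labels)


# Toy generations, made up for illustration only.
score = accuracy(["Yes.", "no", "Maybe"], ["Yes", "No", "No"])
```

Generations that cannot be mapped to either label are counted as incorrect here. Whether to do that or to report them separately is a design choice left to the evaluator.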

Data Fields

Detailed descriptions of the fields for each task can be found in the GitHub repository.

Data Splits

Each task has a training split and an evaluation split. Following RAFT, the train splits consist of only a few labeled instances, reflecting the few-shot setting in which most LLMs are used.
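Since the train split holds only a handful of labeled instances, a natural use for it is as in-context demonstrations. The following is a rough sketch of assembling a few-shot prompt from such rows. The function name `build_few_shot_prompt`, the instruction wording, and the "Text:/Label:" template are all assumptions made for illustration; the field names `text` and `label` follow the example instance shown under Data Instances.

```python
from typing import Dict, List


def build_few_shot_prompt(
    instruction: str, train_rows: List[Dict[str, str]], query_text: str
) -> str:
    """Join an instruction, labeled demonstrations, and one unlabeled query."""
    blocks = [instruction]
    for row in train_rows:
        blocks.append(f"Text: {row['text']}\nLabel: {row['label']}")
    # The query ends with an empty "Label:" slot for the model to complete.
    blocks.append(f"Text: {query_text}\nLabel:")
    return "\n\n".join(blocks)


# Demonstrations in the shape of abercrombie train rows.
train_rows = [
    {"text": 'The mark "Ivory" for a product made of elephant tusks.', "label": "generic"},
    {"text": 'The mark "Tasty" for bread.', "label": "descriptive"},
]
prompt = build_few_shot_prompt(
    "Classify the distinctiveness of the mark.",  # hypothetical instruction
    train_rows,
    'The mark "Caress" for body soap.',
)
```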

Dataset Creation

Curation Rationale

LegalBench was created to enable researchers to better benchmark the legal reasoning capabilities of LLMs.

Source Data

Initial Data Collection and Normalization

Broadly, LegalBench tasks are drawn from three sources. The first source is datasets and corpora that were already available, most of which were originally released for non-LLM evaluation settings. In creating LegalBench tasks from these sources, we often significantly reformatted the data and restructured the prediction objective. For instance, the original CUAD dataset contains annotations on long documents and is intended for evaluating extraction with span-prediction models. We restructure this corpus to generate a binary classification task for each type of contractual clause. While the original corpus emphasized the long-document aspects of contracts, our restructured tasks emphasize whether LLMs can identify the distinguishing features of different types of clauses. The second source is datasets that were previously constructed by legal professionals but never released. These are primarily datasets hand-coded by legal scholars as part of prior empirical legal projects. The third source is tasks that were developed specifically for LegalBench by the authors of the paper. Overall, tasks are drawn from 36 distinct corpora. Please see the Appendix of the paper for more details.
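To make the CUAD restructuring concrete, the sketch below turns clause-level annotations into one binary classification task per clause type. It illustrates the general idea only and is not the procedure used to build the released tasks: the input format, the function name `to_binary_tasks`, and the toy clauses are all hypothetical, and the actual construction (for example, how negative examples were chosen) is described in the paper.

```python
from typing import Dict, List


def to_binary_tasks(
    clauses: List[Dict], clause_types: List[str]
) -> Dict[str, List[Dict[str, str]]]:
    """Build one Yes/No task per clause type from clauses tagged with their types."""
    tasks: Dict[str, List[Dict[str, str]]] = {t: [] for t in clause_types}
    for clause in clauses:
        for clause_type in clause_types:
            # A clause is a positive example only for the types it was annotated with.
            label = "Yes" if clause_type in clause["types"] else "No"
            tasks[clause_type].append({"text": clause["text"], "label": label})
    return tasks


# Toy clauses, made up for illustration; not taken from CUAD.
clauses = [
    {"text": "This Agreement is governed by the laws of Delaware.", "types": ["Governing Law"]},
    {"text": "Supplier shall not compete with Buyer for two years.", "types": ["Non-Compete"]},
]
tasks = to_binary_tasks(clauses, ["Governing Law", "Non-Compete"])
```

Each resulting task asks a single question about a short clause, which matches the shift in emphasis from long-document extraction to recognizing the features of a clause type.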

Who are the source language producers?

LegalBench data was created by humans. Demographic information for these individuals is not available.

Annotations

Annotation process

Please see the paper for more information on the annotation process used in the creation of each task.

Who are the annotators?

Please see the paper for more information on the identity of annotators for each task.

Personal and Sensitive Information

Data in this benchmark has either been synthetically generated, or derived from an already public source (e.g., contracts from the EDGAR database).

Several tasks have been derived from the LearnedHands corpus, which consists of public posts on /r/LegalAdvice. Some posts may discuss sensitive issues.

Considerations for Using the Data

Social Impact of Dataset

Please see the original paper for a discussion of social impact.

Discussion of Biases

Please see the original paper for a discussion of biases.

Other Known Limitations

LegalBench primarily contains tasks corresponding to American law.

Additional Information

Dataset Curators

Please see the website for a full list of participants in the LegalBench project.

Licensing Information

LegalBench tasks are subject to different licenses. Please see the paper for a description of the licenses.

Citation Information

If you intend to reference LegalBench broadly, please use the first citation below. If you are working with a particular task, please also use the task-specific citation, which can be found on the task page on the website or in the GitHub repository.

@misc{guha2023legalbench,
      title={LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models}, 
      author={Neel Guha and Julian Nyarko and Daniel E. Ho and Christopher Ré and Adam Chilton and Aditya Narayana and Alex Chohlas-Wood and Austin Peters and Brandon Waldon and Daniel N. Rockmore and Diego Zambrano and Dmitry Talisman and Enam Hoque and Faiz Surani and Frank Fagan and Galit Sarfaty and Gregory M. Dickinson and Haggai Porat and Jason Hegland and Jessica Wu and Joe Nudell and Joel Niklaus and John Nay and Jonathan H. Choi and Kevin Tobia and Margaret Hagan and Megan Ma and Michael Livermore and Nikon Rasumov-Rahe and Nils Holzenberger and Noam Kolt and Peter Henderson and Sean Rehaag and Sharad Goel and Shang Gao and Spencer Williams and Sunny Gandhi and Tom Zur and Varun Iyer and Zehua Li},
      year={2023},
      eprint={2308.11462},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@article{koreeda2021contractnli,
  title={ContractNLI: A dataset for document-level natural language inference for contracts},
  author={Koreeda, Yuta and Manning, Christopher D},
  journal={arXiv preprint arXiv:2110.01799},
  year={2021}
}
@article{hendrycks2021cuad,
  title={Cuad: An expert-annotated nlp dataset for legal contract review},
  author={Hendrycks, Dan and Burns, Collin and Chen, Anya and Ball, Spencer},
  journal={arXiv preprint arXiv:2103.06268},
  year={2021}
}
@article{wang2023maud,
  title={MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding},
  author={Wang, Steven H and Scardigli, Antoine and Tang, Leonard and Chen, Wei and Levkin, Dimitry and Chen, Anya and Ball, Spencer and Woodside, Thomas and Zhang, Oliver and Hendrycks, Dan},
  journal={arXiv preprint arXiv:2301.00876},
  year={2023}
}
@inproceedings{wilson2016creation,
  title={The creation and analysis of a website privacy policy corpus},
  author={Wilson, Shomir and Schaub, Florian and Dara, Aswarth Abhilash and Liu, Frederick and Cherivirala, Sushain and Leon, Pedro Giovanni and Andersen, Mads Schaarup and Zimmeck, Sebastian and Sathyendra, Kanthashree Mysore and Russell, N Cameron and others},
  booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={1330--1340},
  year={2016}
}
@inproceedings{zheng2021does,
  title={When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings},
  author={Zheng, Lucia and Guha, Neel and Anderson, Brandon R and Henderson, Peter and Ho, Daniel E},
  booktitle={Proceedings of the eighteenth international conference on artificial intelligence and law},
  pages={159--168},
  year={2021}
}
@article{zimmeck2019maps,
  title={Maps: Scaling privacy compliance analysis to a million apps},
  author={Zimmeck, Sebastian and Story, Peter and Smullen, Daniel and Ravichander, Abhilasha and Wang, Ziqi and Reidenberg, Joel R and Russell, N Cameron and Sadeh, Norman},
  journal={Proc. Priv. Enhancing Tech.},
  volume={2019},
  pages={66},
  year={2019}
}
@article{ravichander2019question,
  title={Question answering for privacy policies: Combining computational and legal perspectives},
  author={Ravichander, Abhilasha and Black, Alan W and Wilson, Shomir and Norton, Thomas and Sadeh, Norman},
  journal={arXiv preprint arXiv:1911.00841},
  year={2019}
}
@article{holzenberger2021factoring,
  title={Factoring statutory reasoning as language understanding challenges},
  author={Holzenberger, Nils and Van Durme, Benjamin},
  journal={arXiv preprint arXiv:2105.07903},
  year={2021}
}
@article{lippi2019claudette,
  title={CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service},
  author={Lippi, Marco and Pa{\l}ka, Przemys{\l}aw and Contissa, Giuseppe and Lagioia, Francesca and Micklitz, Hans-Wolfgang and Sartor, Giovanni and Torroni, Paolo},
  journal={Artificial Intelligence and Law},
  volume={27},
  pages={117--139},
  year={2019},
  publisher={Springer}
}
