Dataset Card for the_pile_books3
Dataset Summary
This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset.
This dataset contains all of Bibliotik in plain .txt form: roughly 197,000 books, processed in exactly the same way as bookcorpusopen (a.k.a. books1). It seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately, OpenAI will not give details, so we know very little about any differences. People suspect it's "all of libgen", but that is pure conjecture.
|download_size|36.8 GiB|
|dataset_size|100.9 GiB|
Supported Tasks and Leaderboards
This dataset is used for Language Modeling.
Languages
The dataset is in English.
Dataset Structure
Data Instances
{'title': '07 LEGO Ninjago - The Search For Zane (Scholastic) - Kate Howard (retail)',
'text': '\n\nTITLE PAGE\n\nFROM THE JOURNAL OF SENSEI GARMADON\n\nCHAPTER 1\n\nCHAPTER 2\n\nCHAPTER 3\n\nCHAPTER 4\n\nCHAPTER 5\n\nCHAPTER 6\n\nCHAPTER 7\n\nCHAPTER 8\n\nCHAPTER 9\n\nCOPYRIGHT\n\nThroughout Ninjago", five ninja are well-known for their speed, strength, and of course the elemental powers that help them protect our world from evil. But there are others who possess some of the same powers as the ninja. Others who may not always use their powers for good.\n\nBefore now, the ninja believed they were special. They di.......'}
Data Fields
title: title of the book
text: text content of the book
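As a minimal sketch of working with records of this shape, the snippet below checks that a record carries the two documented string fields and builds a short summary of it. The names `BookRecord`, `validate_record`, and `summarize` are hypothetical helpers introduced for illustration; they are not part of the dataset or of any library, and the example record is a shortened stand-in for the instance shown under Data Instances.

```python
from typing import TypedDict


class BookRecord(TypedDict):
    """Hypothetical type mirroring the two documented fields."""
    title: str
    text: str


def validate_record(record: dict) -> BookRecord:
    """Check that a record has the documented fields and that both are strings."""
    for field in ("title", "text"):
        if field not in record:
            raise KeyError(f"missing field: {field}")
        if not isinstance(record[field], str):
            raise TypeError(f"field {field!r} must be a string")
    return BookRecord(title=record["title"], text=record["text"])


def summarize(record: BookRecord, preview_chars: int = 60) -> dict:
    """Return the title, the text length, and a whitespace-normalized preview."""
    text = record["text"]
    preview = " ".join(text.split())[:preview_chars]
    return {"title": record["title"], "num_chars": len(text), "preview": preview}


# Shortened stand-in for the documented instance (not the full book text).
example = {
    "title": "07 LEGO Ninjago - The Search For Zane (Scholastic) - Kate Howard (retail)",
    "text": "\n\nTITLE PAGE\n\nFROM THE JOURNAL OF SENSEI GARMADON\n\nCHAPTER 1\n\n",
}

print(summarize(validate_record(example)))
```

Normalizing whitespace in the preview is a design choice for readability only: the documented instance begins with a run of newline-separated headings, which would otherwise dominate a short preview.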
Data Splits
|split|num examples|
|train|196640|
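The split count can be combined with the sizes reported in the summary to get a rough sense of scale. The sketch below assumes that "Gib" in the card means gibibytes (2**30 bytes) and that the reported dataset size is dominated by book text; storage overhead and the title field are ignored, so the per-book figure is only an estimate. The function names are hypothetical.

```python
GIB = 2**30  # assumption: the card's "Gib" means gibibytes

DOWNLOAD_SIZE_GIB = 36.8      # download_size from the card
DATASET_SIZE_GIB = 100.9      # dataset_size from the card
NUM_TRAIN_EXAMPLES = 196_640  # train split count from the card


def average_bytes_per_book(dataset_gib: float, num_books: int) -> float:
    """Estimate the mean size of one example in bytes."""
    return dataset_gib * GIB / num_books


def download_ratio(download_gib: float, dataset_gib: float) -> float:
    """Fraction of the prepared dataset size that has to be downloaded."""
    return download_gib / dataset_gib


avg = average_bytes_per_book(DATASET_SIZE_GIB, NUM_TRAIN_EXAMPLES)
ratio = download_ratio(DOWNLOAD_SIZE_GIB, DATASET_SIZE_GIB)

print(f"average size per book: about {avg / 1024:.0f} KiB")
print(f"download is about {ratio:.0%} of the prepared size")
```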
Dataset Creation
Curation Rationale
[Needs More Information]
Source Data
Initial Data Collection and Normalization
[Needs More Information]
Who are the source language producers?
[Needs More Information]
Annotations
Annotation process
[Needs More Information]
Who are the annotators?
[Needs More Information]
Personal and Sensitive Information
[Needs More Information]
Considerations for Using the Data
Social Impact of Dataset
[Needs More Information]
Discussion of Biases
[Needs More Information]
Other Known Limitations
[Needs More Information]
Additional Information
Dataset Curators
[Needs More Information]
Licensing Information
MIT
Citation Information
@article{pile,
title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
Contributions
Thanks to @shawwn for creating this dataset. Thanks to @richarddwang for adding this dataset.