description string	citation string	homepage string	license string	features dict	post_processed null	supervised_keys dict	task_templates list	builder_name string	config_name string	version dict	splits dict	download_checksums dict	download_size int64	post_processing_size null	dataset_size int64	size_in_bytes int64
"LibriSpeech is a corpus of approximately 1000 hours of read English speech with sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.87 "	"@inproceedings{panayotov2015librispeech, title={Librispeech: an ASR corpus based on public domain audio books}, author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev}, booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on}, pages={5206--5210}, year={2015}, organization={IEEE} } "	"http://www.openslr.org/12"	""	{ "file": { "dtype": "string", "id": null, "_type": "Value" }, "audio": { "sampling_rate": 16000, "mono": true, "decode": true, "id": null, "_type": "Audio" }, "text": { "dtype": "string", "id": null, "_type": "Value" }, "speaker_id": { "dtype": "int64", "id": null, "_type": "Value" }, "chapter_id": { "dtype": "int64", "id": null, "_type": "Value" }, "id": { "dtype": "string", "id": null, "_type": "Value" } }	null	{ "input": "file", "output": "text" }	[ { "task": "automatic-speech-recognition", "audio_column": "audio", "transcription_column": "text" } ]	"myspeechasr"	"all"	{ "version_str": "2.1.0", "description": "", "major": 2, "minor": 1, "patch": 0 }	{ "train.clean.100": { "name": "train.clean.100", "num_bytes": 337965294, "num_examples": 2864, "dataset_name": "myspeechasr" }, "train.clean.360": { "name": "train.clean.360", "num_bytes": 337965294, "num_examples": 2864, "dataset_name": "myspeechasr" }, "train.other.500": { "name": "train.other.500", "num_bytes": 337965294, "num_examples": 2864, "dataset_name": "myspeechasr" }, "validation.clean": { "name": "validation.clean", "num_bytes": 337283662, "num_examples": 2864, "dataset_name": "myspeechasr" }, "validation.other": { "name": "validation.other", "num_bytes": 337283662, "num_examples": 2864, "dataset_name": "myspeechasr" }, "test.clean": { 
"name": "test.clean", "num_bytes": 337965294, "num_examples": 2864, "dataset_name": "myspeechasr" }, "test.other": { "name": "test.other", "num_bytes": 337965294, "num_examples": 2864, "dataset_name": "myspeechasr" } }	{ "/content/drive/MyDrive/dev-clean.tar.gz": { "num_bytes": 314305928, "checksum": "12661c48e8c3fe1de2c1caa4c3e135193bfb1811584f11f569dd12645aa84365" }, "/content/drive/MyDrive/dev-other.tar.gz": { "num_bytes": 314305928, "checksum": "12661c48e8c3fe1de2c1caa4c3e135193bfb1811584f11f569dd12645aa84365" }, "/content/drive/MyDrive/test-clean.tar.gz": { "num_bytes": 314305928, "checksum": "12661c48e8c3fe1de2c1caa4c3e135193bfb1811584f11f569dd12645aa84365" }, "/content/drive/MyDrive/test-other.tar.gz": { "num_bytes": 314305928, "checksum": "12661c48e8c3fe1de2c1caa4c3e135193bfb1811584f11f569dd12645aa84365" }, "/content/drive/MyDrive/train-clean-100.tar.gz": { "num_bytes": 314305928, "checksum": "12661c48e8c3fe1de2c1caa4c3e135193bfb1811584f11f569dd12645aa84365" }, "/content/drive/MyDrive/train-clean-360.tar.gz": { "num_bytes": 314305928, "checksum": "12661c48e8c3fe1de2c1caa4c3e135193bfb1811584f11f569dd12645aa84365" }, "/content/drive/MyDrive/train-other-500.tar.gz": { "num_bytes": 314305928, "checksum": "12661c48e8c3fe1de2c1caa4c3e135193bfb1811584f11f569dd12645aa84365" } }	2,200,141,496	null	2,364,393,794	4,564,535,290

Dataset Card for librispeech_asr

Dataset Summary

LibriSpeech is a corpus of approximately 1000 hours of read English speech sampled at 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

Supported Tasks and Leaderboards

automatic-speech-recognition, audio-speaker-identification: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it into written text. The most common evaluation metric is the word error rate (WER). The task has an active leaderboard at https://huggingface.co/spaces/huggingface/hf-speech-bench, which ranks models uploaded to the Hub by their WER. An external leaderboard at https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean ranks the latest models from research and academia.
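Since WER is the metric both leaderboards rank by, a minimal sketch of how it can be computed may help. It assumes the common definition (word-level edit distance divided by the number of reference words) and whitespace tokenization. The function name `word_error_rate` and the hypothesis string are illustrative assumptions, not part of the dataset or of any leaderboard's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Reference transcript in the corpus style; the hypothesis is made up
# and drops one word.
reference = "A MAN SAID TO THE UNIVERSE SIR I EXIST"
hypothesis = "A MAN SAID TO THE UNIVERSE SIR EXIST"
print(word_error_rate(reference, hypothesis))
```

One dropped word out of nine reference words gives a WER of 1/9. Real evaluations usually normalize casing and punctuation before scoring; that step is omitted here.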

Languages

The audio is in English. There are two configurations: clean and other. The speakers in the corpus were ranked according to the WER of transcripts produced by a model trained on a different dataset, and were divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other".

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file, usually called file, and its transcription, called text. Some additional information about the speaker and the passage that contains the transcription is also provided.

{'chapter_id': 141231,
 'file': '/home/siddhant/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac',
 'audio': {'path': '/home/siddhant/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac',
           'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346,
                            0.00091553,  0.00085449], dtype=float32),
           'sampling_rate': 16000},
 'id': '1272-141231-0000',
 'speaker_id': 1272,
 'chapter_id': 141231,
 'text': 'A MAN SAID TO THE UNIVERSE SIR I EXIST'}
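
In the instance above, the id appears to be built from the speaker id, the chapter id, and a running utterance number joined by hyphens. The card does not state this layout, so the sketch below treats it as an assumption inferred from this single example. The names `UtteranceId` and `parse_utterance_id` are hypothetical helpers.

```python
from typing import NamedTuple


class UtteranceId(NamedTuple):
    speaker_id: int
    chapter_id: int
    utterance_index: int


def parse_utterance_id(sample_id: str) -> UtteranceId:
    """Split an id such as '1272-141231-0000' into its numeric parts.

    Assumes the '<speaker>-<chapter>-<utterance>' layout seen in the example
    instance; raises ValueError for anything else.
    """
    parts = sample_id.split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected id format: {sample_id!r}")
    speaker, chapter, index = (int(part) for part in parts)
    return UtteranceId(speaker, chapter, index)


# Values copied from the example data point.
sample = {"id": "1272-141231-0000", "speaker_id": 1272, "chapter_id": 141231}
parsed = parse_utterance_id(sample["id"])

# The parsed parts agree with the separate speaker_id and chapter_id fields.
assert parsed.speaker_id == sample["speaker_id"]
assert parsed.chapter_id == sample["chapter_id"]
print(parsed)
```

A check like this can serve as a quick sanity test when loading the data, provided the assumed layout holds for every sample.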

Data Fields

file: path to the downloaded audio file in .flac format.
audio: a dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Accessing the audio column, as in dataset[0]["audio"], automatically decodes the audio file and resamples it to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so always query the sample index before the "audio" column: dataset[0]["audio"] should be preferred over dataset["audio"][0].
text: the transcription of the audio file.
id: unique id of the data sample.
speaker_id: unique id of the speaker. The same speaker id can appear in multiple data samples.
chapter_id: id of the audiobook chapter that contains the transcription.
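
The advice on access order for the audio field can be illustrated with a toy model. The class below is a hypothetical stand-in, not the `datasets` library's implementation: it only counts how many files would be decoded under each access pattern, assuming, as the field description says, that reading the column decodes every file in it.

```python
class LazyAudioColumn:
    """Toy stand-in for an audio column whose files are decoded on access."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0

    def _decode(self, path):
        # Placeholder for real decoding and resampling work.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def row(self, index):
        """Row-first access, analogous to dataset[index]["audio"]."""
        return self._decode(self.paths[index])

    def column(self):
        """Column-first access, analogous to dataset["audio"]."""
        return [self._decode(path) for path in self.paths]


# Hypothetical file names, used only to give the toy column some length.
paths = [f"utt-{i:04d}.flac" for i in range(1000)]

row_first = LazyAudioColumn(paths)
row_first.row(0)            # decodes one file

column_first = LazyAudioColumn(paths)
column_first.column()[0]    # decodes all files, then keeps one

print(row_first.decode_calls, column_first.decode_calls)
```

Both patterns return the same sample, but under this model the column-first form pays for every file in the split before discarding all but one.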

Data Splits

The size of the corpus makes it impractical, or at least inconvenient for some users, to distribute it as a single large archive. The training portion of the corpus is therefore split into three subsets of approximately 100, 360, and 500 hours. A simple automatic procedure was used to select the audio in the first two sets to be, on average, of higher recording quality and with accents closer to US English.

An acoustic model was trained on WSJ’s si-84 data subset and used to recognize the audio in the corpus, with a bigram LM estimated on the text of the respective books. The Word Error Rate (WER) of this automatic transcript was computed relative to the reference transcripts obtained from the book texts. The speakers in the corpus were ranked according to the WER of the WSJ model’s transcripts, and were divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other".

For "clean", the data is split into train, validation, and test sets. The train set is further split into train.100 and train.360, which account for about 100 h and 360 h of the training data respectively. For "other", the data is likewise split into train, validation, and test sets; the train set contains approximately 500 h of recorded speech.

	Train.500	Train.360	Train.100	Valid	Test
clean	-	104014	28539	2703	2620
other	148688	-	-	2864	2939
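
The clean/other partition described above can be sketched as follows. The per-speaker WER values and speaker names are invented for illustration, and cutting at exactly `len(ranked) // 2` is an assumption: the text only says the speakers were divided "roughly in the middle".

```python
from typing import Dict, List, Tuple


def split_speakers_by_wer(speaker_wer: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Rank speakers by WER and cut the ranking in half.

    Returns (clean, other): the lower-WER half and the higher-WER half.
    """
    ranked = sorted(speaker_wer, key=speaker_wer.get)
    middle = len(ranked) // 2  # assumed cut point
    return ranked[:middle], ranked[middle:]


# Hypothetical per-speaker WERs from an automatic transcript.
speaker_wer = {"spk_a": 0.04, "spk_b": 0.21, "spk_c": 0.09, "spk_d": 0.33}
clean, other = split_speakers_by_wer(speaker_wer)
print("clean:", clean)
print("other:", other)
```

With an odd number of speakers this cut puts the extra speaker in "other"; the original procedure may have handled that case differently.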

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in this dataset.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

The dataset was initially created by Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur.

Licensing Information

CC BY 4.0

Citation Information

@inproceedings{panayotov2015librispeech,
  title={Librispeech: an ASR corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}