Snow Mountain
Dataset Summary
The Snow Mountain dataset contains audio recordings (in .mp3 format) and the corresponding text of the Bible, covering both the Old Testament (OT) and the New Testament (NT), in 11 Indian languages. The recordings were made in a studio setting by native speakers, with a single speaker per language. Most of these languages are geographically concentrated in the northern part of India, around the state of Himachal Pradesh. Being related to Hindi, they all use the Devanagari script for transcription.
We have used this dataset for experiments in ASR tasks, but it could also be used for other applications in the speech domain, such as speaker recognition, language identification, or even as an unlabelled corpus for pre-training.
Supported Tasks and Leaderboards
Automatic speech recognition, Speech-to-Text, Speaker recognition, Language identification
Languages
Hindi, Haryanvi, Bilaspuri, Dogri, Bhadrawahi, Gaddi, Kangri, Kulvi, Mandeali, Kulvi Outer Seraji, Pahari Mahasui, Malayalam, Kannada, Tamil, Telugu
Dataset Structure
data
  |- cleaned
    |- lang1
      |- book1_verse_audios.tar.gz
      |- book2_verse_audios.tar.gz
        ...
        ...
      |- all_verses.tar.gz
      |- short_verses.tar.gz
    |- lang2
      ...
      ...
  |- experiments
    |- lang1
      |- train_500.csv
      |- val_500.csv
      |- test_common.csv
        ...
        ...
    |- lang2
      ...
      ...
  |- raw
    |- lang1
      |- chapter1_audio.mp3
      |- chapter2_audio.mp3
        ...
        ...
      |- text
        |- book1.csv
        |- book1.usfm
          ...
          ...
    |- lang2
      ...
      ...
Data Instances
A data point comprises the path to the audio file, called path, and its transcription, called sentence.
{'sentence': 'क्यूँके तू अपणी बात्तां कै कारण बेकसूर अर अपणी बात्तां ए कै कारण कसूरवार ठहराया जावैगा',
'audio': {'path': 'data/cleaned/haryanvi/MAT/MAT_012_037.wav',
'array': array([0., 0., 0., ..., 0., 0., 0.]),
'sampling_rate': 16000},
'path': 'data/cleaned/haryanvi/MAT/MAT_012_037.wav'}
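The path in the instance above appears to encode the language, book, chapter, and verse. A minimal sketch of recovering those parts follows. It assumes, from this single example only, that files are laid out as data/cleaned/<language>/<BOOK>/<BOOK>_<CHAPTER>_<VERSE>.wav; the names VerseRef and parse_verse_path are hypothetical helpers and not part of the dataset.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple


class VerseRef(NamedTuple):
    """Hypothetical container for the parts encoded in a cleaned audio path."""
    language: str
    book: str
    chapter: int
    verse: int


# Assumed file-name pattern, inferred from 'MAT_012_037.wav':
# <BOOK>_<CHAPTER>_<VERSE>.wav
_VERSE_FILE = re.compile(
    r"^(?P<book>[A-Za-z0-9]+)_(?P<chapter>\d+)_(?P<verse>\d+)\.wav$"
)


def parse_verse_path(path: str) -> VerseRef:
    """Split a path like data/cleaned/<language>/<BOOK>/<file> into its parts."""
    p = PurePosixPath(path)
    match = _VERSE_FILE.match(p.name)
    if match is None or len(p.parts) < 3:
        raise ValueError(f"path does not follow the assumed layout: {path}")
    # Assumption: the language directory sits two levels above the file.
    language = p.parts[-3]
    return VerseRef(
        language, match["book"], int(match["chapter"]), int(match["verse"])
    )


ref = parse_verse_path("data/cleaned/haryanvi/MAT/MAT_012_037.wav")
print(ref)  # VerseRef(language='haryanvi', book='MAT', chapter=12, verse=37)
```

If other languages or books follow a different naming scheme, the regular expression would need to be adjusted accordingly.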
Data Fields
path: The path to the audio file.
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: dataset[0]["audio"] the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0].
sentence: The transcription of the audio file.
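To make the cost difference between the two access orders concrete, here is a toy model of a table whose audio column is decoded on access. It is an illustration of the behaviour described above under simplified assumptions, not the implementation of any real library; LazyAudioTable and its decode counter are hypothetical.

```python
class LazyAudioTable:
    """Toy stand-in for a dataset whose 'audio' column is decoded on access."""

    def __init__(self, paths):
        self._paths = list(paths)
        self.decode_calls = 0  # counts how many files were "decoded"

    def _decode(self, path):
        # Placeholder for the expensive decode-and-resample step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only this one file is decoded.
            path = self._paths[key]
            return {"path": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: every file is decoded before indexing.
            return [self._decode(p) for p in self._paths]
        if key == "path":
            return list(self._paths)
        raise KeyError(key)


table = LazyAudioTable(f"clip_{i}.wav" for i in range(1000))

_ = table[0]["audio"]
print(table.decode_calls)  # 1

_ = table["audio"][0]
print(table.decode_calls)  # 1001
```

In this model, indexing the row first touches one file, while indexing the column first touches all 1000 before the first element is returned.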
Data Splits
We create splits of the cleaned data for training and analysing the performance of ASR models. The splits are available in the experiments directory. The file names indicate the experiment and the split category. Additionally, two CSV files are included in the data splits: all_verses and short_verses. The various data splits were generated from these two main CSVs. short_verses.csv contains audios of length < 10s and the corresponding transcriptions. all_verses.csv contains the complete cleaned verses, including long and short audios. Due to their large size (>10MB), we keep these CSVs compressed in tar.gz format in the cleaned folder.
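Because the two main CSVs are stored compressed, they can be read without unpacking them to disk. The sketch below uses only the Python standard library. It assumes the archive holds a single .csv member with a header row; the column names are not documented here, so the helper returns whatever header the file provides. read_csv_from_targz is a hypothetical helper name.

```python
import csv
import io
import tarfile


def read_csv_from_targz(archive_path, encoding="utf-8"):
    """Return the rows (as dicts) of the first .csv member in a .tar.gz file."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".csv"):
                raw = archive.extractfile(member)
                # Wrap the binary stream so csv can read it as text.
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                return list(csv.DictReader(text))
    raise FileNotFoundError(f"no .csv member found in {archive_path}")
```

A call such as read_csv_from_targz("data/cleaned/lang1/short_verses.tar.gz") would then give a list of row dictionaries, where lang1 stands for the actual language directory.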
Dataset Loading
The raw folder has chapter-wise audios in .mp3 format. For doing experiments, we might need audios in .wav format. Verse-wise audio files are kept in the cleaned folder in .wav format. This results in a much larger size, which contributes to a longer loading time into memory. Here is the approximate time needed for loading the dataset:
- Hindi (OT books): ~20 minutes
- Hindi minority languages (NT books): ~9 minutes
- Dravidian languages (OT+NT books): ~30 minutes
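The size of uncompressed .wav audio grows linearly with its duration, which is why the cleaned folder is much larger than the raw .mp3 files. As a rough sketch, the helpers below measure a clip's duration with the standard wave module and estimate its PCM payload size. The 16000 Hz sampling rate comes from the data instance above; 16-bit mono samples are an assumption, and both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM .wav file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def pcm_size_bytes(seconds, sampling_rate=16000, sample_width=2, channels=1):
    """Approximate size of the raw audio payload (header not included).

    sample_width=2 (16-bit) and channels=1 (mono) are assumptions.
    """
    return int(seconds * sampling_rate * sample_width * channels)


# A 10 s clip, the cut-off used for short_verses, under these assumptions:
print(pcm_size_bytes(10))  # 320000
```

Under these assumptions each second of audio takes about 32 KB, so a 10 s verse is roughly 320 KB before any decoding overhead.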
Details
Please refer to the paper for more details on the creation and the rationale for the splits we created in the dataset.
Licensing Information
The data is licensed under the Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0).
Citation Information
Please cite this work if you make use of it:
@inproceedings{Raju2022SnowMD,
title={Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages},
author={Kavitha Raju and V. Anjaly and R. Allen Lish and Joel Mathew},
year={2022}
}