Snow Mountain

Dataset Summary

The Snow Mountain dataset contains audio recordings (in .mp3 format) and the corresponding text of the Bible, covering both the Old Testament (OT) and the New Testament (NT), in 11 Indian languages. The recordings were made in a studio setting by native speakers, and each language has a single speaker in the dataset. Most of these languages are geographically concentrated in the northern part of India, around the state of Himachal Pradesh. Being related to Hindi, they all use the Devanagari script for transcription.

We have used this dataset for experiments on ASR tasks, but it could also be used for other applications in the speech domain, such as speaker recognition, language identification, or as an unlabelled corpus for pre-training.

Supported Tasks and Leaderboards

Automatic speech recognition, Speech-to-Text, Speaker recognition, Language identification

Languages

Hindi, Haryanvi, Bilaspuri, Dogri, Bhadrawahi, Gaddi, Kangri, Kulvi, Mandeali, Kulvi Outer Seraji, Pahari Mahasui, Malayalam, Kannada, Tamil, Telugu

Dataset Structure

data
  |- cleaned
    |- lang1
      |- book1_verse_audios.tar.gz
      |- book2_verse_audios.tar.gz
        ...
        ...
      |- all_verses.tar.gz
      |- short_verses.tar.gz
    |- lang2
      ...
      ...   
  |- experiments 
    |- lang1
      |- train_500.csv
      |- val_500.csv
      |- test_common.csv
        ...
        ...
    |- lang2
      ...
      ...
  |- raw
    |- lang1
      |- chapter1_audio.mp3
      |- chapter2_audio.mp3
        ...
        ...
      |- text
        |- book1.csv
        |- book1.usfm
          ...
          ...
    |- lang2
      ...
      ...
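
Since the verse-level files are shipped as .tar.gz archives, it can be useful to inspect an archive before unpacking it. The following is a minimal sketch using only Python's standard tarfile module. The function name list_archive_files is hypothetical, and the internal layout of the archives is not documented here, so the sketch only lists whatever regular files it finds.

```python
import tarfile


def list_archive_files(archive_path):
    """Return the sorted names of all regular files inside a .tar.gz archive.

    Assumption: the archive is gzip-compressed, as the .tar.gz extension
    in the directory layout suggests. Nothing is extracted to disk.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(
            member.name for member in archive.getmembers() if member.isfile()
        )
```

A call such as list_archive_files("data/cleaned/lang1/short_verses.tar.gz") would then return the file names stored in that archive, with lang1 replaced by an actual language folder.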

Data Instances

A data point comprises the path to the audio file (path), the decoded audio (audio), and its transcription (sentence).

{'sentence': 'क्यूँके तू अपणी बात्तां कै कारण बेकसूर अर अपणी बात्तां ए कै कारण कसूरवार ठहराया जावैगा',
 'audio': {'path': 'data/cleaned/haryanvi/MAT/MAT_012_037.wav',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 16000},
 'path': 'data/cleaned/haryanvi/MAT/MAT_012_037.wav'}
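
The path in the instance above appears to encode the language, book, chapter, and verse. The sketch below splits such a path into those parts. It assumes the pattern data/cleaned/<language>/<book>/<book>_<chapter>_<verse>.wav, which is inferred from this single example rather than stated by the dataset authors; VerseRef and parse_verse_path are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class VerseRef(NamedTuple):
    language: str
    book: str
    chapter: int
    verse: int


def parse_verse_path(path: str) -> VerseRef:
    """Split a cleaned-audio path into language, book, chapter, and verse.

    Assumption: the file stem is <book>_<chapter>_<verse> and the language
    folder sits two levels above the file.
    """
    p = PurePosixPath(path)
    book, chapter, verse = p.stem.split("_")
    language = p.parts[-3]
    return VerseRef(language, book, int(chapter), int(verse))


ref = parse_verse_path("data/cleaned/haryanvi/MAT/MAT_012_037.wav")
print(ref)  # VerseRef(language='haryanvi', book='MAT', chapter=12, verse=37)
```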

Data Fields

path: The path to the audio file

audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when the audio column is accessed, as in dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time. It is therefore important to query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0].

sentence: The transcription of the audio file.

Data Splits

We create splits of the cleaned data for training ASR models and analysing their performance. The splits are available in the experiments directory. The file names indicate the experiment and the split category. Additionally, two CSV files are included in the data splits: all_verses and short_verses. The various data splits were generated from these two main CSVs. short_verses.csv contains audio clips shorter than 10 s and their corresponding transcriptions. all_verses.csv contains all cleaned verses, including both long and short audio clips. Due to their large size (>10MB), these CSVs are kept compressed in tar.gz format in the cleaned folder.
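
As an illustration of the 10-second criterion behind short_verses, the sketch below computes a clip's duration from a decoded audio dictionary of the form shown under Data Instances (keys array and sampling_rate). The helper names are hypothetical, and this is not necessarily how the authors produced the split; it only shows the arithmetic.

```python
def duration_seconds(audio):
    """Clip length in seconds: number of samples divided by sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


def is_short_verse(audio, max_seconds=10.0):
    """True if the clip is strictly shorter than max_seconds (10 s by default)."""
    return duration_seconds(audio) < max_seconds


# Synthetic 5-second clip at 16 kHz; a plain list stands in for the array.
clip = {"array": [0.0] * (16000 * 5), "sampling_rate": 16000}
print(duration_seconds(clip), is_short_verse(clip))  # 5.0 True
```

The functions accept any sequence that supports len(), so a NumPy array returned by the audio decoder works the same way as the list used here.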

Dataset Loading

The raw folder has chapter-wise audio in .mp3 format. Experiments may need audio in .wav format, so verse-wise audio files are kept in the cleaned folder in .wav format. This results in a much larger size, which contributes to a longer loading time into memory. The approximate times needed to load the dataset are:

Hindi (OT books): ~20 minutes
Hindi minority languages (NT books): ~9 minutes
Dravidian languages (OT+NT books): ~30 minutes

Details

Please refer to the paper for more details on the creation and the rationale for the splits we created in the dataset.

Licensing Information

The data is licensed under the Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0).

Citation Information

Please cite this work if you make use of it:

@inproceedings{Raju2022SnowMD,
  title={Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages},
  author={Kavitha Raju and V. Anjaly and R. Allen Lish and Joel Mathew},
  year={2022}
}