GMaSC: GEC Barton Hill Malayalam Speech Corpus

GMaSC is a Malayalam text and speech corpus created by the Government Engineering College Barton Hill with an emphasis on Malayalam-accented English. The corpus contains 2,000 text-audio pairs of Malayalam sentences spoken by 2 speakers, totalling approximately 139 minutes of audio. Each sentence has at least one English word common in Malayalam speech.

Dataset Structure

The dataset consists of 2,000 instances with the fields text, speaker, and audio. The audio is mono, sampled at 48 kHz. The transcription is normalized and includes only Malayalam characters and common punctuation. The table below shows how the 2,000 instances are split between the speakers, along with some basic speaker information:

Speaker	Gender	Age	Time (HH:MM:SS)	Sentences
Sonia	Female	43	01:02:17	1,000
Anil	Male	48	01:17:23	1,000
Total			02:19:40	2,000
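
As a quick sanity check on the table above, the per-speaker durations can be summed and compared with the Total row. The sketch below uses only the figures from the table; the helper names to_seconds and to_hms are illustrative and not part of the corpus tooling.

```python
# Durations copied from the speaker table above.
durations = {"Sonia": "01:02:17", "Anil": "01:17:23"}
sentences_per_speaker = 1000


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an HH:MM:SS string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


total_seconds = sum(to_seconds(value) for value in durations.values())
print(to_hms(total_seconds))  # 02:19:40

# Average utterance length per speaker, in seconds.
for speaker, value in durations.items():
    mean = to_seconds(value) / sentences_per_speaker
    print(f"{speaker}: {mean:.2f} s per sentence on average")
```

The averages work out to roughly 3.7 s per sentence for Sonia and 4.6 s for Anil.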

Data Instances

An example instance is given below:

{'text': 'സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്',
 'speaker': 'Sonia',
 'audio': {'path': None,
  'array': array([0.00036621, 0.00033569, 0.0005188 , ..., 0.00094604, 0.00091553,
         0.00094604]),
  'sampling_rate': 48000}}
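
Because each instance carries both the decoded samples and the sampling rate, the length of a clip follows directly from the number of samples divided by the rate. The following is a minimal sketch under that assumption; clip_duration_seconds is a hypothetical helper, and the instance uses a synthetic array of zeros in place of real audio.

```python
def clip_duration_seconds(instance: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = instance["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a real instance: 1.5 s of silence at 48 kHz.
# A real instance holds a NumPy array here; len() works the same way
# on it because the audio is mono (one-dimensional).
example = {
    "text": "സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്",
    "speaker": "Sonia",
    "audio": {"path": None, "array": [0.0] * 72000, "sampling_rate": 48000},
}

print(clip_duration_seconds(example))  # 1.5
```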

Data Fields

text (str): Transcription of the audio file
speaker (str): The name of the speaker
audio (dict): Audio object containing the loaded audio array, the sampling rate, and the path to the audio file (always None)

Data Splits

We provide all the data in a single train split. The loaded dataset object thus looks like this:

DatasetDict({
     train: Dataset({
         features: ['text', 'speaker', 'audio'],
         num_rows: 2000
     })
 })
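
Since only a train split is provided, users who need held-out data have to create it themselves. One possible approach, sketched below, holds out the same fraction of sentences from each speaker so that both voices appear in both subsets. The function stratified_split, its parameters, and the placeholder records are assumptions made for illustration; they are not part of the corpus.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.1, seed=0):
    """Split records into (train, test), holding out the same fraction per speaker."""
    by_speaker = defaultdict(list)
    for record in records:
        by_speaker[record["speaker"]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for speaker in sorted(by_speaker):
        group = by_speaker[speaker][:]  # copy so the input is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Placeholder records standing in for the 2,000 real instances (audio omitted).
records = [
    {"text": f"sentence {i}", "speaker": speaker}
    for speaker in ("Sonia", "Anil")
    for i in range(1000)
]

train, test = stratified_split(records)
print(len(train), len(test))  # 1800 200
```

When the corpus is loaded through the Hugging Face datasets library, the built-in Dataset.train_test_split method is an alternative to a hand-written helper like this one.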

Additional Information

Licensing

The corpus is made available under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).