GMaSC: GEC Barton Hill Malayalam Speech Corpus
GMaSC is a Malayalam text and speech corpus created by the Government Engineering College Barton Hill with an emphasis on Malayalam-accented English. The corpus contains 2,000 text-audio pairs of Malayalam sentences spoken by 2 speakers, totalling approximately 139 minutes of audio. Each sentence has at least one English word common in Malayalam speech.
Dataset Structure
The dataset consists of 2,000 instances with the fields text, speaker, and audio. The audio is mono, sampled at 48 kHz. The transcription is normalized and includes only Malayalam characters and common punctuation. The table below shows how the 2,000 instances are split between the speakers, along with some basic speaker information:
Speaker | Gender | Age | Time (HH:MM:SS) | Sentences |
---|---|---|---|---|
Sonia | Female | 43 | 01:02:17 | 1,000 |
Anil | Male | 48 | 01:17:23 | 1,000 |
Total | | | 02:19:40 | 2,000 |
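The total in the table can be checked against the per-speaker durations with a few lines of arithmetic. The sketch below is only an illustration: the helper name `parse_hms` and the dictionary `speaker_durations` are made up for this example, and the values are copied from the table.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Per-speaker durations copied from the table above.
speaker_durations = {"Sonia": "01:02:17", "Anil": "01:17:23"}

# Sum the durations, starting from a zero timedelta.
total = sum((parse_hms(value) for value in speaker_durations.values()), timedelta())
total_minutes = total.total_seconds() / 60

print(total)                    # total duration as H:MM:SS
print(round(total_minutes, 1))  # total duration in minutes
```

The sum comes to 02:19:40, which matches the Total row and the roughly 139 minutes quoted in the introduction.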
Data Instances
An example instance is given below:
{'text': 'സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്',
'speaker': 'Sonia',
'audio': {'path': None,
'array': array([0.00036621, 0.00033569, 0.0005188 , ..., 0.00094604, 0.00091553,
0.00094604]),
'sampling_rate': 48000}}
Data Fields
- text (str): Transcription of the audio file
- speaker (str): The name of the speaker
- audio (dict): Audio object including loaded audio array, sampling rate and path to audio (always None)
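Since each audio object carries both the sample array and the sampling rate, the length of a clip follows from dividing one by the other. The sketch below uses a synthetic, silent instance rather than real corpus data. The function name `clip_duration_seconds` and the placeholder text are assumptions made for illustration only.

```python
def clip_duration_seconds(instance: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = instance["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in with the same fields as a corpus instance.
# A real instance holds a NumPy array; a plain list works the same way here.
example = {
    "text": "placeholder transcription",
    "speaker": "Sonia",
    "audio": {
        "path": None,
        "array": [0.0] * 96000,  # 2 seconds of silence at 48 kHz
        "sampling_rate": 48000,
    },
}

print(clip_duration_seconds(example))
```

Because the function relies only on `len`, it should work unchanged on the one-dimensional mono arrays described above.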
Data Splits
We provide all the data in a single train split. The loaded dataset object thus looks like this:
DatasetDict({
    train: Dataset({
        features: ['text', 'speaker', 'audio'],
        num_rows: 2000
    })
})
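Once the train split is loaded, one quick sanity check is to count the sentences per speaker and compare the result with the speaker table. The sketch below is a minimal illustration that runs on a stand-in list of speaker names instead of the real corpus. The function name `count_by_speaker` is hypothetical, and the stand-in list is built from the counts in the table.

```python
from collections import Counter
from typing import Iterable


def count_by_speaker(speakers: Iterable[str]) -> Counter:
    """Count how many instances belong to each speaker."""
    return Counter(speakers)


# Stand-in for the speaker column of the train split,
# built from the per-speaker sentence counts in the table.
speaker_column = ["Sonia"] * 1000 + ["Anil"] * 1000

counts = count_by_speaker(speaker_column)
print(counts["Sonia"], counts["Anil"], sum(counts.values()))
```

With the actual dataset, the speaker column of the train split could presumably be passed in place of the stand-in list. Reading only that column should avoid decoding any audio.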
Additional Information
Licensing
The corpus is made available under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).