Dataset Card for Kathbath
Dataset Summary
Kathbath is a human-labelled ASR dataset containing 1,684 hours of labelled speech data across 12 Indian languages, collected from 1,218 contributors located in 203 districts of India.
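The repository ships with a loading script, so individual languages can in principle be loaded through the datasets library. The snippet below is a minimal sketch only; the config name ("hindi") and split name ("valid") are illustrative assumptions and may differ from what the loading script actually exposes.

```python
# Minimal sketch: load one language of Kathbath with the datasets library.
# The config name ("hindi") and split name ("valid") are assumptions and
# should be checked against the configurations the loading script defines.
from datasets import load_dataset

ds = load_dataset("ai4bharat/kathbath", "hindi", split="valid")
print(ds[0])
```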
Languages
- Bengali
- Gujarati
- Kannada
- Hindi
- Malayalam
- Marathi
- Odia
- Punjabi
- Sanskrit
- Tamil
- Telugu
- Urdu
Dataset Structure
Audio Data
data
├── bengali
│   ├── <split_name>
│   │   ├── 844424931537866-594-f.m4a
│   │   ├── 844424931029859-973-f.m4a
│   │   ├── ...
├── gujarati
├── ...
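Given a local copy laid out as above, the clips for one language and split can be gathered with a small amount of path handling. This is a minimal sketch assuming the "data" root and the language/split directory names shown in the tree; the split name used in the example is a placeholder.

```python
# Minimal sketch: collect the .m4a clips for one language/split from a
# local copy of Kathbath laid out as data/<language>/<split_name>/.
# The root path, language, and split names below are placeholders.
from pathlib import Path

def list_audio_files(root: str, language: str, split: str) -> list[Path]:
    """Return all .m4a files under <root>/<language>/<split>/."""
    split_dir = Path(root) / language / split
    return sorted(split_dir.glob("*.m4a"))

files = list_audio_files("data", "bengali", "valid")  # "valid" is an assumed split name
print(f"{len(files)} clips found")
```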
Transcripts
data
├── bengali
│   ├── <split_name>
│   │   ├── transcription_n2w.txt
├── gujarati
├── ...
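Each split directory carries a single transcription_n2w.txt file. A common convention for such files is one utterance per line, with the audio filename and its transcript separated by a tab; the sketch below assumes that layout, and the delimiter and column order should be verified against the actual file.

```python
# Minimal sketch: read transcription_n2w.txt into a filename -> transcript map,
# assuming each line is "<audio_filename>\t<transcript>". The delimiter and
# column order are assumptions, not documented facts about the file format.
from pathlib import Path

def load_transcripts(root: str, language: str, split: str) -> dict[str, str]:
    transcripts: dict[str, str] = {}
    path = Path(root) / language / split / "transcription_n2w.txt"
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            filename, text = line.split("\t", 1)
            transcripts[filename] = text
    return transcripts

mapping = load_transcripts("data", "bengali", "valid")  # "valid" is an assumed split name
```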
Licensing Information
The IndicSUPERB dataset is released under this licensing scheme:
- We do not own any of the raw text used in creating this dataset.
- The text data comes from the IndicCorp dataset which is a crawl of publicly available websites.
- The audio transcriptions of the raw text and labelled annotations of the datasets have been created by us.
- We license the actual packaging of all this data under the Creative Commons CC0 license ("no rights reserved").
- To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to the IndicSUPERB dataset.
- This work is published from: India.
Citation Information
@misc{https://doi.org/10.48550/arxiv.2208.11761,
doi = {10.48550/ARXIV.2208.11761},
url = {https://arxiv.org/abs/2208.11761},
author = {Javed, Tahir and Bhogale, Kaushal Santosh and Raman, Abhigyan and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
title = {IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
Contributions
We would like to thank the Ministry of Electronics and Information Technology (MeitY) of the Government of India and the Centre for Development of Advanced Computing (C-DAC), Pune for generously supporting this work and providing us with access to multiple GPU nodes on the Param Siddhi Supercomputer. We would like to thank the EkStep Foundation and Nilekani Philanthropies for their generous grant, which went into hiring the human resources as well as the cloud resources needed for this work. We would like to thank DesiCrew for connecting us to native speakers for collecting data. We would like to thank Vivek Seshadri from Karya Inc. for helping set up the data collection infrastructure on the Karya platform. We would like to thank all the members of the AI4Bharat team for helping create the Query by Example dataset.