Datasets:
Tasks:
Question Answering
Sub-tasks:
extractive-qa
Languages:
Chinese
Multilinguality:
monolingual
Size Categories:
10K<n<100K
Language Creators:
crowdsourced
Annotations Creators:
crowdsourced
Source Datasets:
original
License:
cc-by-sa-4.0
Dataset Card for "cmrc2018"
Dataset Summary
A span-extraction dataset for Chinese machine reading comprehension, created to add language diversity to this area. The dataset is composed of nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts. It also includes an annotated challenge set containing questions that require comprehensive understanding and multi-sentence inference across the context.
Supported Tasks and Leaderboards
Languages
Dataset Structure
Data Instances
default
- Size of downloaded dataset files: 11.50 MB
- Size of the generated dataset: 22.31 MB
- Total amount of disk used: 33.83 MB
An example of 'validation' looks as follows.
This example was too long and was cropped:
{
"answers": {
"answer_start": [11, 11],
"text": ["光荣和ω-force", "光荣和ω-force"]
},
"context": "\"《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武...",
"id": "DEV_0_QUERY_0",
"question": "《战国无双3》是由哪两个公司合作开发的?"
}
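To make the relationship between answer_start and text concrete, here is a minimal Python sketch. It assumes answer_start is a 0-based character offset into context, which is consistent with the example above but not stated explicitly in this card. It uses only the opening sentence of the cropped context, without the leading escaped quote shown in the display, which appears to be a cropping artifact (an assumption). The helper name spans_match is hypothetical.

```python
# Record based on the cropped 'validation' example; only the first sentence
# of the context is kept, and the leading escaped quote is dropped (assumption).
example = {
    "id": "DEV_0_QUERY_0",
    "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。",
    "question": "《战国无双3》是由哪两个公司合作开发的?",
    "answers": {
        "answer_start": [11, 11],
        "text": ["光荣和ω-force", "光荣和ω-force"],
    },
}


def spans_match(record):
    """Return True if every answer text sits at its answer_start offset in the context."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


print(spans_match(example))
```

A record for which this check returns False would suggest an offset or text-normalization mismatch worth inspecting before training an extractive model.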
Data Fields
The data fields are the same among all splits.
default
- id: a string feature.
- context: a string feature.
- question: a string feature.
- answers: a dictionary feature containing:
  - text: a string feature.
  - answer_start: an int32 feature.
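The field list can also be expressed as a light structural check. The following is a sketch under stated assumptions: the helper name validate_record is hypothetical, and text and answer_start are treated as parallel lists, as they appear in the example instance, rather than as scalar values.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def validate_record(record):
    """Return a list of problems; an empty list means the record matches the field list."""
    problems = []
    # id, context and question are plain string features.
    for key in ("id", "context", "question"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} should be a string")

    answers = record.get("answers")
    if not isinstance(answers, dict):
        problems.append("answers should be a dictionary")
        return problems

    texts = answers.get("text")
    starts = answers.get("answer_start")
    texts_ok = isinstance(texts, list) and all(isinstance(t, str) for t in texts)
    starts_ok = isinstance(starts, list) and all(
        isinstance(s, int) and not isinstance(s, bool) and INT32_MIN <= s <= INT32_MAX
        for s in starts
    )
    if not texts_ok:
        problems.append("answers.text should be a list of strings")
    if not starts_ok:
        problems.append("answers.answer_start should be a list of int32 values")
    # Assumption: each answer text is paired with exactly one start offset.
    if texts_ok and starts_ok and len(texts) != len(starts):
        problems.append("answers.text and answers.answer_start differ in length")
    return problems


# Small record shaped like the example instance (context shortened for brevity).
record = {
    "id": "DEV_0_QUERY_0",
    "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。",
    "question": "《战国无双3》是由哪两个公司合作开发的?",
    "answers": {"answer_start": [11, 11], "text": ["光荣和ω-force", "光荣和ω-force"]},
}
print(validate_record(record))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a malformed record at once.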
Data Splits
| name | train | validation | test |
|---|---|---|---|
| default | 10142 | 3219 | 1002 |
Dataset Creation
Curation Rationale
Source Data
Initial Data Collection and Normalization
Who are the source language producers?
Annotations
Annotation process
Who are the annotators?
Personal and Sensitive Information
Considerations for Using the Data
Social Impact of Dataset
Discussion of Biases
Other Known Limitations
Additional Information
Dataset Curators
Licensing Information
Citation Information
@inproceedings{cui-emnlp2019-cmrc2018,
title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
author = "Cui, Yiming and
Liu, Ting and
Che, Wanxiang and
Xiao, Li and
Chen, Zhipeng and
Ma, Wentao and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1600",
doi = "10.18653/v1/D19-1600",
pages = "5886--5891",
}
Contributions
Thanks to @patrickvonplaten, @mariamabarham, @lewtun, @thomwolf for adding this dataset.