Dataset Card for Turku Paraphrase Corpus
Dataset Summary
The project gathered a large dataset of Finnish paraphrase pairs (over 100,000). The paraphrases are selected and classified manually, so as to minimize lexical overlap, and provide examples that are maximally structurally and lexically different. The objective is to create a dataset which is challenging and better tests the capabilities of natural language understanding. An important feature of the data is that most paraphrase pairs are distributed in their document context. The primary application for the dataset is the development and evaluation of deep language models, and representation learning in general.
Usage:
from datasets import load_dataset
dataset = load_dataset('TurkuNLP/turku_paraphrase_corpus', name="plain")
where `name` is one of the supported loading options: `plain`, `plain-context`, `classification`, `classification-context`, or `generation`. See Data Fields for more information.
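A minimal sketch of guarding the `name` argument, assuming the five options listed above are the complete set. The helper `load_paraphrase_config` is hypothetical and not part of the dataset or the `datasets` library; it only wraps the `load_dataset` call shown above.

```python
# Configuration names as listed in this card (assumed to be the complete set).
SUPPORTED_CONFIGS = (
    "plain",
    "plain-context",
    "classification",
    "classification-context",
    "generation",
)


def load_paraphrase_config(name="plain"):
    """Hypothetical helper: validate `name`, then load that configuration."""
    if name not in SUPPORTED_CONFIGS:
        raise ValueError(
            f"Unknown configuration {name!r}; expected one of {SUPPORTED_CONFIGS}"
        )
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    return load_dataset("TurkuNLP/turku_paraphrase_corpus", name=name)
```

A misspelled option such as `"plain_context"` then fails immediately with a `ValueError` instead of a later loading error.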
Supported Tasks and Leaderboards
- Paraphrase classification
- Paraphrase generation
Languages
Finnish
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
The dataset consists of pairs of text passages. A typical passage is about a sentence long, but a passage may also be longer or shorter than a sentence. Each example includes two text passages (string), a manually annotated label indicating the paraphrase type (string), and additional metadata.
The dataset includes three different configurations: `plain`, `classification`, and `generation`. The `plain` configuration loads the original data without any additional preprocessing or transformations. The `classification` configuration builds the data in a form suitable for training a paraphrase classifier: each example appears twice, once in each direction, (text1, text2, label) --> (text2, text1, label), and the label is flipped where needed (paraphrases with the directionality flag < or >). The `generation` configuration preprocesses the examples to suit the paraphrase generation task. Paraphrases not suitable for generation (negative and highly context-dependent paraphrases) are discarded, and directional paraphrases are provided so that generation goes from the more detailed passage to the more general one, to prevent model hallucination (i.e. the model learning to introduce new information). The rest of the paraphrases are provided in both directions, (text1, text2, label) --> (text2, text1, label).
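The doubling described for the classification configuration can be sketched as follows. This is an illustration, not the dataset's own loading code: it assumes the directionality flag appears as a literal `<` or `>` character inside the label string, and the helper names and toy labels (`4>`, `4`) are hypothetical.

```python
def flip_direction(label):
    """Swap the directionality flags '<' and '>' in a label string."""
    return label.translate(str.maketrans("<>", "><"))


def double_examples(examples):
    """Yield each pair in both directions, flipping directional labels."""
    for ex in examples:
        yield ex
        reversed_ex = dict(ex)  # copy so the original example is untouched
        reversed_ex["text1"], reversed_ex["text2"] = ex["text2"], ex["text1"]
        reversed_ex["label"] = flip_direction(ex["label"])
        yield reversed_ex


# Toy input with placeholder strings and illustrative labels, not corpus data.
toy = [
    {"text1": "passage A", "text2": "passage B", "label": "4>"},
    {"text1": "passage C", "text2": "passage D", "label": "4"},
]
doubled = list(double_examples(toy))
```

Labels without a directionality flag pass through unchanged, so only directional pairs get a different label in the reversed copy.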
Each pair in the `plain` and `classification` configurations includes the following fields:
- `id`: Identifier of the paraphrase pair (string)
- `gem_id`: Identifier of the paraphrase pair in the GEM dataset (string)
- `goeswith`: Identifier of the document from which the paraphrase was extracted; `not available` if the paraphrase does not come from document-structured data. All examples with the same `goeswith` value (other than `not available`) should be kept together in any train/dev/test split. Most users won't need this (string)
- `fold`: 0-99. The data is split into 100 parts respecting document boundaries; you can use this e.g. to implement cross-validation safely, as all paraphrases from one document are in one fold. Most users won't need this (int)
- `text1`: First paraphrase passage (string)
- `text2`: Second paraphrase passage (string)
- `label`: Manually annotated label (string)
- `binary_label`: Label turned into binary, with values `positive` (paraphrase) and `negative` (not-paraphrase) (string)
- `is_rewrite`: Indicator of whether the example is a human-produced rewrite or a naturally occurring paraphrase (bool)
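Because every document falls entirely within one fold, a split by `fold` keeps all pairs that share a `goeswith` value together. A minimal sketch, assuming an 80/10/10 layout over the fold numbers; the fold ranges, the helper name `split_by_fold`, and the toy records are illustrative and not an official split.

```python
def split_by_fold(examples, dev_folds, test_folds):
    """Partition examples into train/dev/test by their `fold` value."""
    splits = {"train": [], "dev": [], "test": []}
    for ex in examples:
        if ex["fold"] in test_folds:
            splits["test"].append(ex)
        elif ex["fold"] in dev_folds:
            splits["dev"].append(ex)
        else:
            splits["train"].append(ex)
    return splits


# Toy records: only the fields needed for the split, with made-up values.
toy = [
    {"id": "a", "goeswith": "doc1", "fold": 3},
    {"id": "b", "goeswith": "doc1", "fold": 3},
    {"id": "c", "goeswith": "doc2", "fold": 85},
    {"id": "d", "goeswith": "not available", "fold": 97},
]
# Assumed 80/10/10 layout: folds 0-79 train, 80-89 dev, 90-99 test.
splits = split_by_fold(toy, dev_folds=range(80, 90), test_folds=range(90, 100))
```

The same function covers cross-validation: pass a different block of folds as `test_folds` on each round.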
Each pair in the `generation` configuration includes the same fields, except that `text1` and `text2` are renamed to `input` and `output` to indicate the generation direction. Thus the fields are: `id`, `gem_id`, `goeswith`, `fold`, `input`, `output`, `label`, `binary_label`, and `is_rewrite`.
Context: Most (but not all) of the paraphrase pairs are identified in their document context. By default, these contexts are not included, to conserve memory, but they can be accessed using the configurations `plain-context` and `classification-context`. These are exactly like `plain` and `classification`, with these additional fields:
- `context1`: a dictionary with the fields `doctext` (string), `begin` (int), `end` (int). These mean that the paraphrase in `text1` was extracted from `doctext[begin:end]`. In most cases, `doctext[begin:end]` and `text1` are the exact same string, but occasionally they differ, e.g. when intervening punctuation or other unrelated text was "cleaned" from `text1` during annotation. If the context is not available, `doctext` is an empty string and `begin == end == 0`.
- `context2`: same as `context1`, but for `text2`
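The relationship between a passage and its context can be checked with a few lines. A minimal sketch, assuming the context dictionary has exactly the `doctext`, `begin`, and `end` fields described above; the helper names and the English toy example are hypothetical.

```python
def context_span(context):
    """Return doctext[begin:end], or None when no context is available."""
    if not context["doctext"]:
        return None
    return context["doctext"][context["begin"]:context["end"]]


def matches_context(text, context):
    """True if the passage is identical to its span in the document."""
    return context_span(context) == text


# Toy example with placeholder English text, not taken from the corpus.
example = {
    "text1": "It rained all day.",
    "context1": {
        "doctext": "Monday. It rained all day. We stayed in.",
        "begin": 8,
        "end": 26,
    },
}
# The shape a missing context takes: empty doctext and zero offsets.
missing = {"doctext": "", "begin": 0, "end": 0}
```

Pairs for which `matches_context` returns False while a context exists are the cases where the passage was cleaned during annotation.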
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
[More Information Needed]
Citation Information
[More Information Needed]
Contributions