Datasets:
The dataset viewer is not available for this split.
Error code: ResponseAlreadyComputedError
Need help to make the dataset viewer work? Open a discussion for direct support.
Dataset Card for Taskmaster-2
Dataset Summary
Taskmaster is dataset for goal oriented conversations. The Taskmaster-2 dataset consists of 17,289 dialogs in the seven domains which include restaurants, food ordering, movies, hotels, flights, music and sports. Unlike Taskmaster-1, which includes both written "self-dialogs" and spoken two-person dialogs, Taskmaster-2 consists entirely of spoken two-person dialogs. In addition, while Taskmaster-1 is almost exclusively task-based, Taskmaster-2 contains a good number of search- and recommendation-oriented dialogs. All dialogs in this release were created using a Wizard of Oz (WOz) methodology in which crowdsourced workers played the role of a 'user' and trained call center operators played the role of the 'assistant'. In this way, users were led to believe they were interacting with an automated system that “spoke” using text-to-speech (TTS) even though it was in fact a human behind the scenes. As a result, users could express themselves however they chose in the context of an automated interface.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
The dataset is in English language.
Dataset Structure
Data Instances
A typical example looks like this
{
"conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
"instruction_id": "flight-6",
"utterances": [
{
"index": 0,
"segments": [],
"speaker": "USER",
"text": "Hi, I'm looking for a flight. I need to visit a friend."
},
{
"index": 1,
"segments": [],
"speaker": "ASSISTANT",
"text": "Hello, how can I help you?"
},
{
"index": 2,
"segments": [],
"speaker": "ASSISTANT",
"text": "Sure, I can help you with that."
},
{
"index": 3,
"segments": [],
"speaker": "ASSISTANT",
"text": "On what dates?"
},
{
"index": 4,
"segments": [
{
"annotations": [
{
"name": "flight_search.date.depart_origin"
}
],
"end_index": 37,
"start_index": 27,
"text": "March 20th"
},
{
"annotations": [
{
"name": "flight_search.date.return"
}
],
"end_index": 45,
"start_index": 41,
"text": "22nd"
}
],
"speaker": "USER",
"text": "I'm looking to travel from March 20th to 22nd."
}
]
}
Data Fields
Each conversation in the data file has the following structure:
conversation_id
: A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.utterances
: A list of utterances that make up the conversation.instruction_id
: A reference to the file(s) containing the user (and, if applicable, agent) instructions for this conversation.
Each utterance has the following fields:
index
: A 0-based index indicating the order of the utterances in the conversation.speaker
: Either USER or ASSISTANT, indicating which role generated this utterance.text
: The raw text of the utterance. In case of self dialogs (one_person_dialogs), this is written by the crowdsourced worker. In case of the WOz dialogs, 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.segments
: A list of various text spans with semantic annotations.
Each segment has the following fields:
start_index
: The position of the start of the annotation in the utterance text.end_index
: The position of the end of the annotation in the utterance text.text
: The raw text that has been annotated.annotations
: A list of annotation details for this segment.
Each annotation has a single field:
name
: The annotation name.
Data Splits
There are no deafults splits for all the config. The below table lists the number of examples in each config.
Config | Train |
---|---|
flights | 2481 |
food-orderings | 1050 |
hotels | 2355 |
movies | 3047 |
music | 1602 |
restaurant-search | 3276 |
sports | 3478 |
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
[More Information Needed]
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
[More Information Needed]
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
The dataset is licensed under Creative Commons Attribution 4.0 License
Citation Information
[More Information Needed]
@inproceedings{48484,
title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
year = {2019}
}
Contributions
Thanks to @patil-suraj for adding this dataset.
- Downloads last month
- 1,464