Dataset Card for PMD

Dataset Summary

Introduced in the FLAVA paper, Public Multimodal Dataset (PMD) is a collection of publicly-available image-text pair datasets. PMD contains 70M image-text pairs in total with 68M unique images. The dataset contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome and a subset of YFCC100M dataset.

If you use PMD, please cite the original FLAVA paper as follows, along with the individual datasets (see the Citation Information section below for references):

@inproceedings{singh2022flava,
  title={Flava: A foundational language and vision alignment model},
  author={Singh, Amanpreet and Hu, Ronghang and Goswami, Vedanuj and Couairon, Guillaume and Galuba, Wojciech and Rohrbach, Marcus and Kiela, Douwe},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15638--15650},
  year={2022}
}

You can load this dataset by first logging in to Hugging Face using huggingface-cli login and then running the following commands:

from datasets import load_dataset
pmd = load_dataset("facebook/pmd", use_auth_token=True)

You can also load the dataset in streaming mode if you don't want to download the big dataset files (> 50GB locally without the images):

pmd = load_dataset("facebook/pmd", use_auth_token=True, streaming=True)
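
In streaming mode the returned object is iterated rather than indexed, so one way to inspect a few pairs is to slice the iterator. The following is a minimal sketch under that assumption: `peek` and `fake_stream` are hypothetical names, and the generator is only a stand-in for the real stream so the snippet runs without network access.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(stream, n))


def fake_stream():
    # Stand-in for the streamed training split; yields dicts shaped like PMD rows.
    for i in range(10):
        yield {"text": f"caption {i}", "source": "coco", "image_url": None}


first = peek(fake_stream(), n=2)
print([row["text"] for row in first])

# With the real dataset (requires login and network access), assuming the
# streamed object exposes a "train" split:
# first = peek(pmd["train"], n=2)
```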

Dataset Preprocessing

This dataset doesn't download all of the images locally by default. Instead, it exposes URLs for some of the images. To fetch the images, use the following code:

from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


USER_AGENT = get_datasets_user_agent()


def fetch_single_image(image_data, timeout=None, retries=0):
    image_url, image = image_data
    if image is not None:
        return image

    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": USER_AGENT},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, zip(batch["image_url"], batch["image"])))
    return batch


num_threads = 20
dset = load_dataset("facebook/pmd", use_auth_token=True)
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
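
Because fetch_single_image returns None when every attempt fails, some rows may still have no image after the map call. The sketch below shows one possible way to drop such rows. `has_image` is a hypothetical helper, and the list of dicts is a stand-in for real rows.

```python
def has_image(example):
    """Return True when the row holds a decoded image rather than None."""
    return example.get("image") is not None


# Stand-in rows: one with an image-like object, one whose download failed.
rows = [
    {"image": object(), "text": "downloaded"},
    {"image": None, "text": "failed"},
]
kept = [row for row in rows if has_image(row)]
print(len(kept))

# With the dataset object from above, the same predicate can be passed to filter:
# dset = dset.filter(has_image)
```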

Save to disk

You can also save the dataset to disk for faster, direct loading next time, but beware of the disk space required:

dset.save_to_disk(</path/to/save>)

Load Subsets

You can also download a specific subset of PMD by using:

dset = load_dataset("facebook/pmd", <choice>, use_auth_token=True)

The choices are:

"all", "coco", "sbu", "wit", "localized_narratives", "conceptual_captions", "visual_genome", "conceptual_captions_12M", "redcaps", "yfcc100M_subset", "localized_narratives_openimages", "localized_narratives_ade20k", "localized_narratives_coco"

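Some of the subset names mix upper and lower case (for example "conceptual_captions_12M" and "yfcc100M_subset"), which is easy to mistype. As a small sketch, assuming the list above is complete, a helper such as the hypothetical `resolve_config` below can map a case-insensitive name to the exact spelling before it is passed to load_dataset.

```python
PMD_CONFIGS = (
    "all", "coco", "sbu", "wit", "localized_narratives",
    "conceptual_captions", "visual_genome", "conceptual_captions_12M",
    "redcaps", "yfcc100M_subset", "localized_narratives_openimages",
    "localized_narratives_ade20k", "localized_narratives_coco",
)

# Map lower-cased names back to the exact spelling listed above.
_BY_LOWER = {name.lower(): name for name in PMD_CONFIGS}


def resolve_config(name):
    """Return the exact subset name, ignoring letter case and outer spaces."""
    try:
        return _BY_LOWER[name.strip().lower()]
    except KeyError:
        raise ValueError(
            f"Unknown PMD subset {name!r}; expected one of {PMD_CONFIGS}"
        ) from None


print(resolve_config("YFCC100m_subset"))  # prints yfcc100M_subset
```
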
Flickr30K Localized Narratives Subset

The Flickr30K subset of Localized Narratives is not included by default because it requires a manual download. To include it, sign the agreement, download the tar file from here to </path/to/Downloads>, and then load either the whole PMD or only the Localized Narratives subset:

# Load the whole PMD
dset = load_dataset("facebook/pmd", data_dir=</path/to/Downloads/flickr30k-images.tar.gz>, use_auth_token=True, use_flickr30k_ln=True)

# Load LN subset only
dset = load_dataset("facebook/pmd", "localized_narratives", data_dir=</path/to/Downloads/flickr30k-images.tar.gz>, use_auth_token=True, use_flickr30k_ln=True)

Facing issues?

If you are facing issues, you can try loading a specific revision of the repo by using:

dset = load_dataset("facebook/pmd", use_auth_token=True, revision="311cd48")

Supported Tasks and Leaderboards

In the FLAVA paper, the dataset has been used to pretrain the FLAVA model as a source of well-aligned image-text pairs. This allows having a generic vision-and-language model which can be fine-tuned for a variety of tasks.

We anticipate that the dataset can be used to train deep neural networks that perform image captioning and that learn transferable visual representations for a variety of downstream visual recognition tasks (image classification, object detection, instance segmentation). We also anticipate that the dataset could be used for a variety of vision-and-language (V&L) tasks, such as image or text retrieval or text-to-image synthesis.

Languages

All of the subsets in PMD use English as their primary language.

Dataset Structure

Data Instances

Each instance in PMD represents a single image-text pair:

{
    'image_url': None, 
    'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7FCFF86A1E80>, 
    'text': 'A woman wearing a net on her head cutting a cake. ', 
    'source': 'coco', 
    'meta': '{\n  "annotation": [\n    "A woman wearing a net on her head cutting a cake. "\n  ],\n  "image_path": "zip:/val2014/COCO_val2014_000000522418.jpg::http:/images.cocodataset.org/zips/val2014.zip"\n}'
}

Data Fields

image_url: Static URL for downloading the image associated with the text. Can be None if image is locally available.
image: A PIL Image object for the image associated with the text. Can be None if image is not locally available.
text: str, A textual description corresponding to the image.
source: str, The PMD subset which this pair is from.
meta: str, A json representation of the original annotation from the dataset.
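
Since meta is stored as a JSON string rather than a nested object, it has to be decoded before the original annotation can be read. The sketch below decodes the COCO instance shown under Data Instances. The keys inside meta ("annotation", "image_path") are taken from that one example and may differ for other subsets.

```python
import json

# The COCO example from the Data Instances section (image omitted).
example = {
    "image_url": None,
    "text": "A woman wearing a net on her head cutting a cake. ",
    "source": "coco",
    "meta": '{\n  "annotation": [\n    "A woman wearing a net on her head cutting a cake. "\n  ],\n  "image_path": "zip:/val2014/COCO_val2014_000000522418.jpg::http:/images.cocodataset.org/zips/val2014.zip"\n}',
}

# Decode the JSON string into a regular dict.
meta = json.loads(example["meta"])

print(sorted(meta))
print(meta["annotation"][0].strip())
```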

Data Splits

All the data is contained in the training set. The training set has nearly 70M instances.

We intend for this dataset to be used primarily for pre-training with one or more specific downstream tasks in mind, so all of the instances should be used for pre-training. We specifically make sure that there is no overlap with Karpathy's COCO validation set, so users can use that subset for validation if required. Users can load Karpathy's val subset by specifying the "validation" split while loading PMD. This also loads the "validation" splits of other subsets where they are available.

Dataset Creation

Curation Rationale

From the paper:

Purely contrastive methods, however, also have important shortcomings. Their cross-modal nature does not make them easily usable on multimodal problems that require dealing with both modalities at the same time. They require large corpora, which for both CLIP and ALIGN have not been made accessible to the research community and the details of which remain shrouded in mystery, notwithstanding well-known issues with the construction of such datasets.

Source Data

Initial Data Collection and Normalization

From the paper:

Data Collection Pipeline

For the YFCC100M dataset, we filter the image-text data by discarding non-English captions and keeping only captions that contain more than two words. We take the caption from the description field of each image; if it does not pass our filters, we consider the title field instead. Other than that, we did not do any additional filtering.
For the VisualGenome, COCO and Localized Narratives subsets, we remove any overlaps with Karpathy's COCO val and test sets.
For Localized Narratives, we split the original caption, which is a paragraph, into multiple captions using the spaCy library and take the Cartesian product, so that each resulting caption becomes a separate image-text pair.
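
The two text-processing steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the English-language check is omitted, and a simple regular expression stands in for spaCy's sentence splitter.

```python
import re


def pick_yfcc_caption(description, title, min_words=3):
    """Prefer the description, fall back to the title.

    A candidate passes only if it has more than two words (min_words=3).
    The non-English filter described above is NOT implemented here.
    """
    for candidate in (description, title):
        if candidate and len(candidate.split()) >= min_words:
            return candidate.strip()
    return None


def split_narrative(image_id, narrative):
    """Split a paragraph into sentences and pair each one with the image.

    Assumption: a regex split on sentence-ending punctuation replaces spaCy.
    """
    parts = re.split(r"(?<=[.!?])\s+", narrative)
    return [(image_id, part.strip()) for part in parts if part.strip()]


caption = pick_yfcc_caption("IMG 1234", "Sunset over the bay")
pairs = split_narrative("img1", "There is a table. A cat sits on it.")
print(caption)
print(pairs)
```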

Compared to original FLAVA paper

The PMD dataset in this repo doesn't correspond exactly 1:1 to the original PMD dataset used in the FLAVA paper, even though this repo is built by the same authors. This is due to the difficulty of reproducing the WIT and YFCC100M subsets exactly. In general, this repo contains more data than the PMD used in the FLAVA paper and hence should probably result in better performance.

Who are the source language producers?

Please refer to the original dataset papers to understand where the content is coming from.

Annotations

Annotation process

The dataset is a combination of existing public datasets with some filtering applied on top, so there is no annotation process involved.

Who are the annotators?

Please refer to the original dataset papers to understand where the content is coming from.

Personal and Sensitive Information

Please refer to the original dataset papers to understand where the content is coming from. For example, a detailed description on this for RedCaps can be found here.

Considerations for Using the Data

Social Impact of Dataset

From the paper:

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?

No.

Discussion of Biases

Please refer to the original dataset papers to understand where the content is coming from. For example, a detailed description on this for RedCaps can be found here.

Other Known Limitations

From the paper:

Are there any errors, sources of noise, or redundancies in the dataset?

PMD is noisy by design since image-text pairs on the internet are noisy and unstructured. Though, since it contains sources such as COCO, Visual Genome, and Localized Narratives which are hand-curated by annotators, it has a lot of well-aligned data as well. So, it is definitely more aligned compared to e.g. LAION.

Some instances may also have duplicate images and captions but should have almost no effect in training large-scale models.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)?

Not that the authors know of. Please refer to the original dataset papers to understand where the content is coming from. For example, a detailed description on this for RedCaps can be found here.

Additional Information

Dataset Curators

The authors of the original dataset papers, as well as the authors of the FLAVA paper (Amanpreet, Ronghang, Vedanuj, Guillaume, Wojciech, Marcus and Douwe).

Licensing Information

Here are the individual licenses from each of the datasets that apply if you use this dataset:

COCO

The annotations in the COCO dataset belong to the COCO Consortium and are licensed under a Creative Commons Attribution 4.0 License.

The COCO Consortium does not own the copyright of the images. Use of the images must abide by the Flickr Terms of Use. The users of the images accept full responsibility for the use of the dataset, including but not limited to the use of any copies of copyrighted images that they may create from the dataset.

Conceptual Captions

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

WIT

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Visual Genome

Visual Genome by Ranjay Krishna et al is licensed under a Creative Commons Attribution 4.0 International License.

Localized Narratives

All the annotations available through this website are released under a CC BY 4.0 license. You are free to redistribute and modify the annotations, but we ask you to please keep the original attribution to our paper.

YFCC100M

Use of the original media files is subject to the Creative Commons licenses chosen by their creators/uploaders. License information for each media file can be found within the YFCC100M metadata. Use of the dataset is subject to the relevant Webscope License Agreement, which you need to agree to if you use this dataset.

RedCaps

The image metadata is licensed under the CC-BY 4.0 license. Additionally, uses of this dataset are subject to the Reddit API terms (https://www.reddit.com/wiki/api-terms), and users must comply with the Reddit User Agreement, Content Policy, and Privacy Policy – all accessible at https://www.redditinc.com/policies.

Similar to RedCaps:

PMD should only be used for non-commercial research. PMD should not be used for any tasks that involve identifying features related to people (facial recognition, gender, age, ethnicity identification, etc.) or make decisions that impact people (mortgages, job applications, criminal sentences; or moderation decisions about user-uploaded data that could result in bans from a website). Any commercial and for-profit uses of PMD are restricted – it should not be used to train models that will be deployed in production systems as part of a product offered by businesses or government agencies.

Citation Information

Please cite the main FLAVA paper in which PMD was introduced along with each of the subsets used in PMD as follows:

@inproceedings{singh2022flava,
  title={Flava: A foundational language and vision alignment model},
  author={Singh, Amanpreet and Hu, Ronghang and Goswami, Vedanuj and Couairon, Guillaume and Galuba, Wojciech and Rohrbach, Marcus and Kiela, Douwe},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15638--15650},
  year={2022}
}

@article{chen2015microsoft,
  title={Microsoft coco captions: Data collection and evaluation server},
  author={Chen, Xinlei and Fang, Hao and Lin, Tsung-Yi and Vedantam, Ramakrishna and Gupta, Saurabh and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  journal={arXiv preprint arXiv:1504.00325},
  year={2015}
}

@inproceedings{ordonez2011sbucaptions,
  Author    = {Vicente Ordonez and Girish Kulkarni and Tamara L. Berg},
  Title     = {Im2Text: Describing Images Using 1 Million Captioned Photographs},
  Booktitle = {Neural Information Processing Systems ({NIPS})},
  Year      = {2011},
}

@article{krishna2017visual,
  title={Visual genome: Connecting language and vision using crowdsourced dense image annotations},
  author={Krishna, Ranjay and Zhu, Yuke and Groth, Oliver and Johnson, Justin and Hata, Kenji and Kravitz, Joshua and Chen, Stephanie and Kalantidis, Yannis and Li, Li-Jia and Shamma, David A and others},
  journal={International journal of computer vision},
  volume={123},
  number={1},
  pages={32--73},
  year={2017},
  publisher={Springer}
}

@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}

@inproceedings{sharma2018conceptual,
  title={Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning},
  author={Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu},
  booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2556--2565},
  year={2018}
}

@inproceedings{changpinyo2021conceptual,
  title={Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts},
  author={Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3558--3568},
  year={2021}
}

@inproceedings{ponttuset2020localized,
  author    = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title     = {Connecting Vision and Language with Localized Narratives},
  booktitle = {ECCV},
  year      = {2020}
}

@article{thomee2016yfcc100m,
  title={YFCC100M: The new data in multimedia research},
  author={Thomee, Bart and Shamma, David A and Friedland, Gerald and Elizalde, Benjamin and Ni, Karl and Poland, Douglas and Borth, Damian and Li, Li-Jia},
  journal={Communications of the ACM},
  volume={59},
  number={2},
  pages={64--73},
  year={2016},
  publisher={ACM New York, NY, USA}
}

@misc{desai2021redcaps,
      title={RedCaps: web-curated image-text data created by the people, for the people},
      author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
      year={2021},
      eprint={2111.11431},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contributions

Thanks to @aps, Thomas Wang, and @VictorSanh for adding this dataset.
