Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

Last update: Dec 07, 2022

Overview

Conceptual 12M

We introduce Conceptual 12M (CC12M), a dataset with ~12 million image-text pairs meant to be used for vision-and-language pre-training. It is larger and covers a much more diverse set of visual concepts than Conceptual Captions (CC3M), a dataset that is widely used for pre-training and end-to-end training of image captioning models. See our paper for further details.

Download

Click here to download (2.5GB)

Format (.tsv)

[image_url_1]\t[caption_1]
[image_url_2]\t[caption_2]
[image_url_3]\t[caption_3]
…
[image_url_N]\t[caption_N]
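
As a minimal sketch of reading this format, the snippet below splits each line on its first tab into a (URL, caption) pair. The function name `iter_pairs` and the file name `cc12m.tsv` mentioned in the comment are hypothetical, chosen only for illustration, and the sample rows are made up rather than taken from the dataset.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_pairs(tsv: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (image_url, caption) pairs from tab-separated lines."""
    for line in tsv:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so any later tab stays in the caption.
        url, sep, caption = line.partition("\t")
        if not sep:
            continue  # skip malformed lines that contain no tab
        yield url, caption


# Made-up sample rows standing in for the real file. With the downloaded
# file you would instead pass an open handle, for example:
#   with open("cc12m.tsv", encoding="utf-8") as f:   # hypothetical file name
#       for url, caption in iter_pairs(f): ...
sample = io.StringIO(
    "http://example.com/a.jpg\ta cat on a sofa\n"
    "http://example.com/b.jpg\ttwo dogs playing\n"
)
pairs = list(iter_pairs(sample))
```

Streaming line by line avoids loading the whole 2.5GB file into memory at once.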

Cite

If you use this dataset in your research, please cite:

Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. CVPR 2021.

@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}

FAQs

Q1: Can you provide image pixels?

A1: We do not own any of the images in the dataset and hence cannot legally provide them to you. The owner of an image can choose to delete it at any time, in which case the image will no longer be available. As a result, some images in the dataset will unfortunately be lost over time, and we are unable to help with this issue.

Q2: Is it normal that a subset of images cannot be retrieved from the provided URLs?

A2: Yes. See Q1.
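
Because some URLs will stop resolving, a download script should probably treat a failed fetch as a skipped example rather than a fatal error. The sketch below is one way to do that with the Python standard library. The name `fetch_image`, the `opener` parameter, and the 10-second timeout are assumptions made for illustration and are not part of any tooling shipped with the dataset. The demo at the bottom uses stand-in openers, so it runs without network access.

```python
import http.client
import io
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_image(url: str, timeout: float = 10.0,
                opener: Callable = urllib.request.urlopen) -> Optional[bytes]:
    """Return the raw bytes at `url`, or None if they cannot be retrieved."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError (e.g. 404) and timeouts;
        # ValueError covers malformed URLs.
        return None


# Offline demo with stand-in openers (hypothetical, for illustration only).
def _gone(url, timeout):
    raise urllib.error.URLError("image was deleted")


def _ok(url, timeout):
    return io.BytesIO(b"fake image bytes")


missing = fetch_image("http://example.com/deleted.jpg", opener=_gone)
present = fetch_image("http://example.com/still-there.jpg", opener=_ok)
```

Callers can then drop every pair whose fetch returned `None` and keep a count of how many were lost.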

Q3: Is CC12M an “expanded” CC3M?

A3: No. CC12M is purposely designed for vision-and-language pre-training and is meant to be disjoint from CC3M. CC3M is cleaner and more appropriate for fine-tuning, but it can be used in conjunction with CC12M for pre-training, as illustrated in our paper. Coincidentally, their intersection is non-zero: approximately 63K URLs.
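
If you combine the two datasets for pre-training, you may want to account for the shared URLs. The sketch below finds URLs present in both sets and drops them from one side before concatenating. It assumes overlap is decided by exact URL string match, which is an assumption and not necessarily how the ~63K figure was computed. The function names and sample rows are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

Pair = Tuple[str, str]  # (image_url, caption)


def shared_urls(urls_a: Iterable[str], urls_b: Iterable[str]) -> Set[str]:
    """Return URLs that appear in both collections (exact string match)."""
    return set(urls_a) & set(urls_b)


def drop_shared(pairs: Iterable[Pair], shared: Set[str]) -> Iterator[Pair]:
    """Yield only the pairs whose URL is not in `shared`."""
    for url, caption in pairs:
        if url not in shared:
            yield url, caption


# Made-up rows standing in for the two datasets.
cc3m = [("http://example.com/1.jpg", "a red car"),
        ("http://example.com/2.jpg", "a beach")]
cc12m = [("http://example.com/2.jpg", "people walking on a beach"),
         ("http://example.com/3.jpg", "a mountain at sunrise")]

shared = shared_urls((u for u, _ in cc3m), (u for u, _ in cc12m))
combined = cc3m + list(drop_shared(cc12m, shared))
```

Which side to keep for a shared URL is a design choice. This sketch keeps the CC3M caption, on the grounds that the answer above describes CC3M as cleaner.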

Contact Us

If you have a question that is not answered in the FAQs above, please create an issue in this repository.

If you would like to share feedback or report concerns, please email us at [email protected].


Owner

Google Research Datasets
