RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can download the images for the officially released annotations, and can also collect more image-text data from any subreddit over an arbitrary time span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images, and arrange all the data in the recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── <subreddit>_<year>.json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    │   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    │   └── ...
    │
    ├── <subreddit>/
    │   ├── <image_id>.jpg
    │   └── ...
    └── ...

Create an empty directory and symlink it relative to this code directory:

cd redcaps-downloader

# Edit path here:
mkdir -p /path/to/redcaps
ln -s /path/to/redcaps ./datasets/redcaps

Download official RedCaps annotations from Dropbox and unzip them.

cd datasets/redcaps
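# Quote the URL and name the output file, so that the saved file matches the unzip command below.
wget -O redcaps_v1.0_annotations.zip "https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1"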
unzip redcaps_v1.0_annotations.zip

Download images using the redcaps download-imgs command. It processes a single annotation file, so loop over all of them:
```
for ann_file in ./datasets/redcaps/annotations/*.json; do
    redcaps download-imgs -a "$ann_file" --save-to path/to/images --resize 512 -j 4
    # Resizing the shorter edge saves disk space; set --resize -1 to turn it off.
done
```
Parallelize the download by changing -j. RedCaps images are sourced from Reddit, Imgur and Flickr, each of which has its own request limits. This code contains approximate sleep intervals to manage them. To massively parallelize downloading, use multiple machines (i.e. different IP addresses) or a cluster.
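When several machines share the work, one simple scheme is to give each machine a disjoint subset of the annotation files. The sketch below is only an illustration: shard_files, the shard index, and the shard count are hypothetical names and are not part of the redcaps tool. It assumes every machine sees the same annotation files in the same sorted order.

```shell
# Print the annotation files assigned to one machine (shard).
# Hypothetical helper: shard_id counts from 0 and must be below num_shards.
shard_files() {
    local shard_id="$1" num_shards="$2" ann_dir="$3"
    local i=0 ann_file
    for ann_file in "$ann_dir"/*.json; do
        [ -e "$ann_file" ] || continue   # skip when the glob matches nothing
        if [ $(( i % num_shards )) -eq "$shard_id" ]; then
            echo "$ann_file"
        fi
        i=$(( i + 1 ))
    done
}

# Example for machine 0 of 4 (uncomment to run):
# shard_files 0 4 ./datasets/redcaps/annotations | while read -r ann_file; do
#     redcaps download-imgs -a "$ann_file" --save-to path/to/images --resize 512 -j 4
# done
```

Round-robin assignment keeps the shards roughly equal in file count, although individual annotation files may differ a lot in size.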

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit, so you can reproduce the entire collection pipeline or create your own variant of RedCaps. Here, we show how to collect annotations from r/roses (2020) as an example. The same steps apply to any subreddit and any year.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which are served by the Pushshift API and official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

{
    "reddit": {
        "client_id": "Your client ID here",
        "client_secret": "Your client secret here",
        "username": "Your Reddit username here",
        "password": "Your Reddit password here",
        "user_agent": "...: ..."
    },
    "imgur": {
        "client_id": "Your client ID here",
        "client_secret": "Your client secret here"
    }
} 

Register a new Reddit app. Reddit will provide Client ID and Client Secret tokens; fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to anything (e.g. your name) and keep it unchanged.
Register a new Imgur app, then fill the provided Client ID and Client Secret in ./credentials.json.

Download pre-trained weights of an NSFW detection model.

wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models
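
Before collecting data, it may help to confirm that the one-time setup is complete. The sketch below is a hypothetical helper (check_setup is not a redcaps command). It assumes that unfilled fields still carry the "Your ... here" placeholder text from the template and that the model weights were saved to the path used above. It cannot tell whether the keys are actually valid.

```shell
# Return 0 only if the credentials file exists, no "Your ... here" placeholder
# remains in it, and the NSFW model weights file is present.
check_setup() {
    local cred_file="${1:-./credentials.json}"
    local model_file="${2:-./datasets/redcaps/models/nsfw.299x299.h5}"
    local status=0
    if [ ! -f "$cred_file" ]; then
        echo "missing: $cred_file" >&2
        status=1
    elif grep -q 'Your .* here' "$cred_file"; then
        echo "unfilled placeholder in: $cred_file" >&2
        status=1
    fi
    if [ ! -f "$model_file" ]; then
        echo "missing: $model_file" >&2
        status=1
    fi
    return "$status"
}

# Example (uncomment to run from the code directory):
# check_setup && echo "setup looks complete"
```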

Data collection from `r/roses` (2020)

download-anns: Download annotations of image posts made in a single month (e.g. January).

redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations

# Similarly, download annotations for all months of 2020:
# Months are zero-padded to match the 2020-01 form used above.
for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    redcaps download-anns --subreddit roses --month 2020-$month -o ./datasets/redcaps/annotations
done

NOTE: You may not get all the annotations present in the official release, as some posts may have been deleted over time. After this step, the dataset directory will contain 12 annotation files:

    ./datasets/redcaps/
    └── annotations/
        ├── roses_2020-01.json
        ├── roses_2020-02.json
        ├── ...
        └── roses_2020-12.json
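
A long download loop can be interrupted partway, so it may be worth checking that all 12 monthly files exist before merging. The sketch below is a hypothetical helper (missing_months is not a redcaps command). It relies only on the file naming shown in the listing above and does not inspect file contents.

```shell
# Print every month of the given year that has no annotation file yet.
# File names are assumed to follow <subreddit>_<year>-<month>.json.
missing_months() {
    local subreddit="$1" year="$2" ann_dir="$3"
    local month ann_file
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        ann_file="$ann_dir/${subreddit}_${year}-${month}.json"
        [ -f "$ann_file" ] || echo "${year}-${month}"
    done
}

# Example (uncomment to run): re-download only the months that are missing.
# missing_months roses 2020 ./datasets/redcaps/annotations | while read -r month; do
#     redcaps download-anns --subreddit roses --month "$month" -o ./datasets/redcaps/annotations
# done
```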

merge: Merge all the monthly annotation files into a single file.

redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
    -o ./datasets/redcaps/annotations/roses_2020.json --delete-old

--delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:

    ./datasets/redcaps/
    └── annotations/
        └── roses_2020.json

download-imgs: Download all images for this annotation file. This step is the same as the image download step in basic usage.
```
redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
```
- --update-annotations removes annotations whose images were not downloaded.
filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.
```
redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --images ./datasets/redcaps/images
```
filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.
```
redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --images ./datasets/redcaps/images \
    --model ./datasets/redcaps/models/nsfw.299x299.h5
```
filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.
```
redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --images ./datasets/redcaps/images  # Model weights auto-downloaded
```
validate: All the above steps create a single annotation file (and download images) similar to the official RedCaps annotations. To double-check this, run the following command and expect no errors to be printed.
```
redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json
```
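
The three filters and the final validation can be chained so that a failure stops the sequence early. The sketch below is one possible wrapper, using only the commands and flags shown above. The function name filter_and_validate is hypothetical, and the sketch assumes each redcaps command exits with a non-zero status on failure, which this README does not state.

```shell
# Run the filtering steps and the validation on one annotation file,
# stopping at the first command that reports a failure.
filter_and_validate() {
    local ann_file="$1" img_dir="$2" nsfw_model="$3"
    redcaps filter-words --annotations "$ann_file" --images "$img_dir" &&
    redcaps filter-nsfw --annotations "$ann_file" --images "$img_dir" --model "$nsfw_model" &&
    redcaps filter-faces --annotations "$ann_file" --images "$img_dir" &&
    redcaps validate --annotations "$ann_file"
}

# Example (uncomment to run):
# filter_and_validate ./datasets/redcaps/annotations/roses_2020.json \
#     ./datasets/redcaps/images ./datasets/redcaps/models/nsfw.299x299.h5
```

Because the filters modify the annotation file in place and delete images, keeping a backup copy of the merged annotation file before running the chain is a reasonable precaution.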

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}
