Code accompanying the paper Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs (Chen et al., CVPR 2020, Oral).

Last update: Dec 29, 2022

Overview

Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

This repository contains the PyTorch implementation of our paper Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs (CVPR 2020).

Prerequisites

Python 3 and PyTorch 1.3.

# clone the repository
git clone https://github.com/cshizhe/asg2cap.git
cd asg2cap
# clone caption evaluation codes
git clone https://github.com/cshizhe/eval_cap.git
export PYTHONPATH=$(pwd):${PYTHONPATH}

Training & Inference

cd controlimcap/driver

# supported caption models: [node, node.role,
# rgcn, rgcn.flow, rgcn.memory, rgcn.flow.memory]
# see our paper for details
mtype=rgcn.flow.memory

# setup config files
# you should modify data paths in configs/prepare_*_imgsg_config.py
python configs/prepare_coco_imgsg_config.py $mtype
resdir='' # copy the output string of the previous step

# training
python asg2caption.py $resdir/model.json $resdir/path.json $mtype --eval_loss --is_train --num_workers 8

# inference
python asg2caption.py $resdir/model.json $resdir/path.json $mtype --eval_set tst --num_workers 8

Datasets

Annotations

Annotations for the MSCOCO and VisualGenome datasets can be downloaded from Google Drive.

(Image, ASG, Caption) annotations: regionfiles/image_id.json

JSON format:
{
  "region_id": {
    "objects": [
      {
        "object_id": int,
        "name": str,
        "attributes": [str],
        "x": int,
        "y": int,
        "w": int,
        "h": int
      }],
    "relationships": [
      {
        "relationship_id": int,
        "subject_id": int,
        "object_id": int,
        "name": str
      }],
    "phrase": str
  }
}
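As a rough illustration of how this layout can be consumed, the sketch below reads one annotation file and turns a region into a list of objects plus (subject, relation, object) edges. The function names are hypothetical and not part of this repository, and the box conversion assumes that x and y denote the top-left corner, which the format above does not state.

```python
import json
from typing import Any, Dict, List, Tuple


def load_regions(path: str) -> Dict[str, Any]:
    """Read one regionfiles/<image_id>.json file into a dict keyed by region_id."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def region_to_graph(region: Dict[str, Any]) -> Tuple[List[dict], List[Tuple[int, str, int]]]:
    """Return (objects, edges) for one region.

    Each edge is (subject_index, relationship_name, object_index), where the
    indices point into the returned objects list.
    """
    objects = region["objects"]
    # Map every object_id to its position in the objects list.
    index_of = {obj["object_id"]: i for i, obj in enumerate(objects)}
    edges = []
    for rel in region.get("relationships", []):
        subj, obj = rel["subject_id"], rel["object_id"]
        # Skip relationships that point at objects missing from this region.
        if subj in index_of and obj in index_of:
            edges.append((index_of[subj], rel["name"], index_of[obj]))
    return objects, edges


def box_xyxy(obj: Dict[str, Any]) -> Tuple[int, int, int, int]:
    """Convert (x, y, w, h) to (x1, y1, x2, y2).

    Assumption: (x, y) is the top-left corner of the box.
    """
    return (obj["x"], obj["y"], obj["x"] + obj["w"], obj["y"] + obj["h"])
```

A typical use would be to call load_regions on one file, then loop over its values, passing each region to region_to_graph and reading region["phrase"] as the caption paired with that graph.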

Vocabularies: int2word.npy: [word]; word2int.json: {word: int}
Data splits: public_split directory containing trn_names.npy, val_names.npy, tst_names.npy
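The vocabulary files can be read along the following lines. This is a minimal sketch with hypothetical helper names. It assumes that int2word.npy and word2int.json are inverse mappings, that captions are tokenized by lowercasing and splitting on whitespace, and that the .npy file may be stored as an object array (hence allow_pickle=True, which should only be used on files you trust). The repository's own data loaders may handle tokenization and unknown words differently.

```python
import json
from typing import Dict, List, Optional, Sequence, Tuple


def load_vocab(int2word_path: str, word2int_path: str) -> Tuple[List[str], Dict[str, int]]:
    """Load int2word.npy (list of words) and word2int.json (word -> index)."""
    import numpy as np  # imported here so the pure helpers below need no NumPy

    int2word = np.load(int2word_path, allow_pickle=True).tolist()
    with open(word2int_path, "r", encoding="utf-8") as f:
        word2int = json.load(f)
    return int2word, word2int


def vocab_is_consistent(int2word: Sequence[str], word2int: Dict[str, int]) -> bool:
    """Check the assumption that the two files are inverse mappings."""
    return all(word2int.get(word) == idx for idx, word in enumerate(int2word))


def encode(phrase: str, word2int: Dict[str, int], unk_id: Optional[int] = None) -> List[int]:
    """Map a phrase to word indices.

    Out-of-vocabulary words are replaced by unk_id when it is given and
    dropped otherwise (an illustrative policy, not the repository's).
    """
    ids = []
    for word in phrase.lower().split():
        if word in word2int:
            ids.append(word2int[word])
        elif unk_id is not None:
            ids.append(unk_id)
    return ids


def decode(ids: Sequence[int], int2word: Sequence[str]) -> str:
    """Map word indices back to a space-separated phrase."""
    return " ".join(int2word[i] for i in ids)
```

The split files (trn_names.npy, val_names.npy, tst_names.npy) can be loaded with the same np.load call used in load_vocab.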

Features

Features for MSCOCO and VisualGenome datasets are available at BaiduNetdisk (code: 6q32).

We also provide pretrained models and code to extract features for new images.

Global Image Feature: the last mean pooling feature of ResNet101 pretrained on ImageNet

format: npy array, shape=(num_fts, dim_ft) corresponding to the order in data_split names
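Because the rows of the global feature array follow the order of the split names, a name-to-row lookup is enough to fetch the feature for a given image. The sketch below shows one way to build it. The function names and the file paths passed in are placeholders, and the handling of byte-string names is a precaution rather than something the format description requires.

```python
from typing import Dict, Sequence, Union


def row_index(names: Sequence[Union[str, bytes]]) -> Dict[str, int]:
    """Map each image name to its row in the global feature array."""
    index = {}
    for i, name in enumerate(names):
        # Names saved from older NumPy or Python versions may be bytes.
        key = name.decode("utf-8") if isinstance(name, bytes) else str(name)
        index[key] = i
    return index


def load_global_features(names_path: str, feature_path: str):
    """Load a split's names and its (num_fts, dim_ft) feature array together."""
    import numpy as np  # imported here so row_index works without NumPy

    names = np.load(names_path, allow_pickle=True).tolist()
    feats = np.load(feature_path)
    if feats.shape[0] != len(names):
        raise ValueError(
            f"{feats.shape[0]} feature rows but {len(names)} names; "
            "the feature file may belong to a different split"
        )
    return row_index(names), feats
```

With the returned pair (index, feats), the global feature of an image is feats[index[name]].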

Region Image Feature: fc7 layer of Faster-RCNN pretrained on VisualGenome

format: hdf5 files, "image_id".jpg.hdf5

key: 'image_id'.jpg

attrs: {"image_w": int, "image_h": int, "boxes": 4d array (x1, y1, x2, y2)}
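A region feature file in this layout could be read with h5py roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the file name and dataset key are built exactly as described above, and the attributes are taken to be attached to the dataset stored under that key. Normalizing boxes by the image size is shown only as an example of using the attributes and is not claimed to match the repository's preprocessing.

```python
import os
from typing import List, Sequence, Tuple


def region_file_and_key(image_id: str, feature_dir: str) -> Tuple[str, str]:
    """Build the hdf5 path and dataset key for one image."""
    key = f"{image_id}.jpg"
    return os.path.join(feature_dir, key + ".hdf5"), key


def normalize_boxes(
    boxes: Sequence[Sequence[float]], image_w: float, image_h: float
) -> List[Tuple[float, float, float, float]]:
    """Scale (x1, y1, x2, y2) boxes into the [0, 1] range."""
    return [
        (x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
        for x1, y1, x2, y2 in boxes
    ]


def load_region_features(image_id: str, feature_dir: str):
    """Return (features, normalized_boxes) for one image."""
    import h5py  # imported here so the helpers above work without h5py

    path, key = region_file_and_key(image_id, feature_dir)
    with h5py.File(path, "r") as f:
        dataset = f[key]
        feats = dataset[...]  # read the full feature array into memory
        # Assumption: image_w, image_h and boxes are attributes of this dataset.
        boxes = dataset.attrs["boxes"]
        image_w = dataset.attrs["image_w"]
        image_h = dataset.attrs["image_h"]
    return feats, normalize_boxes(boxes, image_w, image_h)
```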

Citations

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@article{chen2020say,
  title={Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs},
  author={Chen, Shizhe and Jin, Qin and Wang, Peng and Wu, Qi},
  journal={CVPR},
  year={2020}
}

License

MIT License

Owner

Shizhe Chen
