Official PyTorch implementation of the paper Dual-Level Collaborative Transformer for Image Captioning (AAAI 2021).

Overview

Dual-Level Collaborative Transformer for Image Captioning

This repository contains the reference code for the paper Dual-Level Collaborative Transformer for Image Captioning.

Experiment setup

Please refer to M2 Transformer.

Data preparation

  • Annotation. Download the annotation file annotation.zip, extract it, and put it in the project root directory.
  • Feature. You can download our ResNeXt-101 features (HDF5 file) here. Access code: jcj6.
  • Evaluation. Download the evaluation tools here. Access code: jcj6. Extract them and put them in the project root directory.

There are five kinds of keys in our .hdf5 file:

  • ['%d_features' % image_id]: region features (N_regions, feature_dim)
  • ['%d_boxes' % image_id]: bounding box of region features (N_regions, 4)
  • ['%d_size' % image_id]: size of original image (for normalizing bounding box), (2,)
  • ['%d_grids' % image_id]: grid features (N_grids, feature_dim)
  • ['%d_mask' % image_id]: geometric alignment graph, (N_regions, N_grids)

We extract features with the code in grid-feats-vqa.

The first three keys are produced when extracting region features with extract_region_feature.py. The fourth key is produced when extracting grid features with the code in grid-feats-vqa. The last key is produced by align.ipynb.
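Below is a minimal sketch of reading one image's entries from the HDF5 file with h5py and using the size entry to normalize box coordinates. The file path, the image id, and the (x1, y1, x2, y2) / (width, height) conventions are illustrative assumptions; check extract_region_feature.py for the exact layout.

import h5py
import numpy as np

features_path = "./data/coco_all_align.hdf5"   # path to the downloaded feature file
image_id = 9                                   # any COCO image id present in the file

with h5py.File(features_path, "r") as f:
    regions = f["%d_features" % image_id][()]  # (N_regions, feature_dim)
    boxes = f["%d_boxes" % image_id][()]       # (N_regions, 4)
    size = f["%d_size" % image_id][()]         # (2,) original image size
    grids = f["%d_grids" % image_id][()]       # (N_grids, feature_dim)
    mask = f["%d_mask" % image_id][()]         # (N_regions, N_grids) alignment graph

# Example use of the size entry: normalize box coordinates to [0, 1].
# Assumes boxes are (x1, y1, x2, y2) in pixels and size is (width, height).
w, h = float(size[0]), float(size[1])
norm_boxes = boxes / np.array([w, h, w, h], dtype=np.float32)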

Training

python train.py --exp_name dlct --batch_size 50 --head 8 --features_path ./data/coco_all_align.hdf5 --annotation annotation --workers 8 --rl_batch_size 100 --image_field ImageAllFieldWithMask --model DLCT --rl_at 17 --seed 118

Evaluation

python eval.py --annotation annotation --workers 4 --features_path ./data/coco_all_align.hdf5 --model_path path_of_model_to_eval --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed --dump_json gen_res.json --beam_size 5

Important args:

  • --features_path: path to the HDF5 feature file
  • --model_path: path to the checkpoint to evaluate
  • --dump_json: file to dump the generated captions to (see the sketch below)
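As a quick sanity check after evaluation, the dumped JSON can be inspected as follows. The exact schema of the dump is defined by eval.py, so this sketch only assumes the file is valid JSON and prints a small sample of whatever structure it contains.

import json

# Inspect the captions produced by eval.py (--dump_json gen_res.json).
with open("gen_res.json") as f:
    results = json.load(f)

# Print a few entries regardless of whether the dump is a dict or a list.
if isinstance(results, dict):
    sample = list(results.items())[:5]
elif isinstance(results, list):
    sample = results[:5]
else:
    sample = [results]

for item in sample:
    print(item)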

A pretrained model is available here. Access code: jcj6. Evaluating the pretrained model yields:

{'BLEU': [0.8136727001615207, 0.6606095421082421, 0.5167535314080227, 0.39790755018790197], 'METEOR': 0.29522868252436046, 'ROUGE': 0.5914367650104326, 'CIDEr': 1.3382047139781112, 'SPICE': 0.22953477359195887}

References

[1] M2

[2] grid-feats-vqa

[3] butd

Acknowledgements

Thanks to the original M2 Transformer and the amazing work of grid-feats-vqa.
