Generate text captions for images from their CLIP embeddings. Includes PyTorch model code and example training script.

Last update: Dec 21, 2022

Overview

clip-text-decoder

Example Predictions

Example captions were computed with the pretrained model mentioned below.

"A man riding a wave on top of a surfboard."

"A baseball player is swinging a bat at a ball."

"A dog running across a field with a frisbee."

Installation

Install for easier access to the following objects/classes:

clip_text_decoder.datasets.ClipCocoCaptionsDataset
clip_text_decoder.model.ClipDecoder
clip_text_decoder.model.ClipDecoderInferenceModel
clip_text_decoder.tokenizer.Tokenizer

The train.py script will not be available in the installed package, since it's located in the root directory. To train new models, either clone this repository or recreate train.py locally.

Using pip:

pip install clip-text-decoder

From source:

git clone https://github.com/fkodom/clip-text-decoder.git
cd clip-text-decoder
pip install .

NOTE: You'll also need to install openai/CLIP to encode images with CLIP. ClipCocoCaptionsDataset requires it as well, to build the captions dataset the first time it is used (the result is cached for subsequent calls).

pip install "clip @ git+https://github.com/openai/CLIP.git"

The CLIP dependency can't be included in the PyPI package, because CLIP itself is not an officially published package.

Training

Launch your own training session using the provided script (train.py):

python train.py --max-epochs 5

Training CLI arguments, along with their default values:

--max-epochs 5  # (int)
--num-layers 6  # (int)
--dim-feedforward 256  # (int)
--precision 16  # (16 or 32)
--seed 0  # (int)
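
The flags above map naturally onto an argparse parser. The following is a minimal sketch, not the actual contents of train.py: the function name build_parser is hypothetical, and only the flag names, types, and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Train a CLIP text decoder")
    parser.add_argument("--max-epochs", type=int, default=5)
    parser.add_argument("--num-layers", type=int, default=6)
    parser.add_argument("--dim-feedforward", type=int, default=256)
    # Only the two documented precision values are accepted.
    parser.add_argument("--precision", type=int, choices=(16, 32), default=16)
    parser.add_argument("--seed", type=int, default=0)
    return parser


# Equivalent to running: python train.py --max-epochs 5
args = build_parser().parse_args(["--max-epochs", "5"])
print(vars(args))
```

Flags that are omitted on the command line fall back to the defaults listed above.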

Inference

The training script will produce a model.zip archive, containing the Tokenizer and trained model parameters. To perform inference with it:

import clip
from PIL import Image
import torch

from clip_text_decoder.model import ClipDecoderInferenceModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ClipDecoderInferenceModel.load("path/to/model.zip").to(device)
clip_model, clip_preprocessor = clip.load("ViT-B/32", device=device, jit=False)

# Create a blank dummy image
dummy_image = Image.new("RGB", (224, 224))
preprocessed = clip_preprocessor(dummy_image).to(device)
# Add a batch dimension using '.unsqueeze(0)'
encoded = clip_model.encode_image(preprocessed.unsqueeze(0))
text = model(encoded)

print(text)
# Probably some nonsense, because we used a dummy image.

Pretrained Models

A pretrained CLIP decoder is hosted on my Google Drive and can be downloaded with:

from clip_text_decoder.model import ClipDecoderInferenceModel

model = ClipDecoderInferenceModel.download_pretrained()

To cache the pretrained model locally, so that it's not re-downloaded each time:

model = ClipDecoderInferenceModel.download_pretrained("/path/to/model.zip")
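
Caching to a fixed path follows the common "download only if the file is missing" pattern. The sketch below illustrates that pattern with hypothetical names: fetch_if_missing and fake_download are not part of clip_text_decoder, and the real download_pretrained may be implemented differently.

```python
import os
import tempfile
from typing import Callable


def fetch_if_missing(dest_path: str, download: Callable[[str], None]) -> str:
    """Call download(dest_path) only when dest_path does not exist yet."""
    if not os.path.exists(dest_path):
        parent = os.path.dirname(dest_path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        download(dest_path)
    return dest_path


# Stand-in for a real network download: writes placeholder bytes and records the call.
calls = []


def fake_download(path: str) -> None:
    calls.append(path)
    with open(path, "wb") as f:
        f.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cache", "model.zip")
    fetch_if_missing(target, fake_download)  # file is missing, so it is downloaded
    fetch_if_missing(target, fake_download)  # file exists, so the download is skipped
    print(len(calls))
```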

Shortcomings

Only works well with COCO-style images. If you go outside the distribution of COCO objects, you'll get nonsense text captions.
Relatively short training time. Even within the COCO domain, you'll occasionally see incorrect captions. Quite a few captions will have bad grammar, repetitive descriptors, etc.

Comments

Decoding Text Embeddings Coded Using Hugging Face CLIPTextModel

Suppose that I have text embeddings created with Hugging Face's CLIPTextModel, as follows:

import torch
from transformers import CLIPTokenizer, CLIPTextModel

class_list = ["i love going home and playing with my wife and kids", "i love going home", "playing with my wife and kids", 
"family", "war", "writing"]

model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(class_list, padding=True, return_tensors="pt")
outputs = model(**inputs)
hidden_state = outputs.last_hidden_state
embeddings = outputs.pooler_output

Questions:

Is it possible to use the clip-text-decoder to convert the embeddings back to text?
If it is indeed possible to do so, could you provide an example of how?

Looking forward to receiving your feedback.

opened by mbdzi 6

Fix string error when loading clip models.

error

The model name string (VIT-xxx) in the check_vision_backbone function does not match the model name string (ViT-xxx) used by the clip repository. This causes an error either in the check_vision_backbone function or when loading the clip model.

solution

In this PR, the model name string in the check_vision_backbone function is modified to ViT-xxx to make it compatible with the clip repository.
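
The mismatch matters because model names are compared as exact, case-sensitive strings. A small illustration, under stated assumptions: the set below is a hand-written subset of CLIP model names, and is_supported_backbone is a hypothetical helper, not the repository's check_vision_backbone.

```python
# Assumed subset of CLIP model names; note the lowercase "i" in "ViT".
SUPPORTED_BACKBONES = {"ViT-B/32", "ViT-B/16", "ViT-L/14"}


def is_supported_backbone(name: str) -> bool:
    """Exact, case-sensitive membership test (hypothetical helper)."""
    return name in SUPPORTED_BACKBONES


print(is_supported_backbone("ViT-B/32"))  # spelling used by the clip repository
print(is_supported_backbone("VIT-B/32"))  # all-caps spelling fails the exact match
```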

opened by Adenialzz 1

BLIP vision backbone

Added blip backbone; still cleaning up last pieces

Bug fixes for training script, and remove debug code.

Fix dependencies in test workflow; update README statistics

Fix test issue with CUDA device

Update unit tests for newer Python, torch versions

Test up to Python 3.10

Test up to Python 3.9

Install lavis first
opened by fkodom 0

Feature: Beam Search

Add beam search, clip dependency to setup.py

Fix installation instructions

Remove main clause

Add '--beam-size' option to 'train.py' script.

Update README; propagate the '--beam-size' arg through eval functions

Update setup.cfg, add pre-commit hooks

Reformat images

Remove fixed image width

Add detail to README; comments to call method for beam search

Updated README headline
opened by fkodom 0
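
For readers unfamiliar with the technique: beam search keeps the beam_size highest-scoring partial sequences at each decoding step, instead of only the single best one. The toy sketch below shows the idea. The toy_step function is a hand-written stand-in for a decoder's next-token distribution, and nothing here reflects the repository's actual implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    step_fn: Callable[[Sequence], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "<eos>",
) -> List[Tuple[Sequence, float]]:
    """Return up to beam_size (sequence, log-probability) pairs, best first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Sequence, float]] = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished; carry forward unchanged
                continue
            for token, log_p in step_fn(seq).items():
                candidates.append((seq + (token,), score + log_p))
        # Keep only the beam_size best candidates by total log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams


def toy_step(seq: Sequence) -> Dict[str, float]:
    """Toy distribution in which the locally best first token is not globally best."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    probs = table.get(seq, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


best_seq, best_score = beam_search(toy_step, beam_size=2, max_len=4)[0]
print(" ".join(best_seq), round(math.exp(best_score), 2))
```

With beam_size=1 the search reduces to greedy decoding and commits to "a" (probability 0.6) at the first step, ending with a total probability of 0.3. With beam_size=2 the sequence starting with "the" survives the first step and wins with a total probability of 0.36.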
Bug Fixes for Broken Tests

Cache the old fashioned way :)

Fix silly typo in test for image caption model

Apply black and isort formatting

Install latest version of 'black', reapply formatting

Fix flake8 issue (duplicate function definition), and install latest patch version of pytorch for tests.

Skip slow tests by default, add 'slow' marker to inference model tests.
opened by fkodom 0

GPT2 Decoder

Update model to use DistilGPT2 as a pre-trained decoder.

Removed tokenizer (no longer used), fixed bugs in Model source file, and updated model unit tests.

Backwards compatibility for 'gdown.download' method.

Update installation requirements, caption examples in README
opened by fkodom 0

Upgrade CodeSee workflow to version 2

CodeSee is a code visibility platform.

This change updates the CodeSee workflow file to the latest version for security, maintenance, and support improvements (see changelog below).

That workflow file:

runs CodeSee's code analysis on every PR push and merge

uploads that analysis to CodeSee.

It does not transmit your code.

The code analysis is used to generate maps and insights about this codebase.

CodeSee workflow changelog:

Improved security: Updates permission to be read-only.

Improved future maintenance: Replaces the body of the workflow with a single github action: codesee-action. This makes it significantly easier for CodeSee to introduce future improvements and fixes without requiring another PR like this.

Improved Python support: The action now properly supports Python 3.11, and will continue to support new Python versions as they are released.
opened by codesee-maps[bot] 1

Incompatible checksum error

I see the following error when trying to load the pretrained model.

    tokenizer=pickle.loads(tokenizer_buffer.read()),
  File "stringsource", line 6, in spacy.pipeline.trainable_pipe.__pyx_unpickle_TrainablePipe
_pickle.PickleError: Incompatible checksums (102742709 vs 0x417ddeb = (cfg, model, name, vocab))

Am I missing something?

opened by dapurv5 0
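
An "Incompatible checksums" error from pickle typically indicates that an object was pickled under one version of a compiled library (here, spaCy) and is being unpickled under a different one. The sketch below shows how a loader could surface a clearer message. load_pickled is a hypothetical helper, not part of clip_text_decoder, and the hint text is an assumption about the likely cause rather than a confirmed diagnosis of this issue.

```python
import pickle


def load_pickled(data: bytes, source: str = "model.zip"):
    """Unpickle data, re-raising pickle failures with a version hint."""
    try:
        return pickle.loads(data)
    except pickle.PickleError as exc:
        raise RuntimeError(
            f"Could not unpickle object from {source}: {exc}. "
            "The installed library versions may differ from those used "
            "when the file was created."
        ) from exc


# Round trip with a plain dictionary standing in for a pickled tokenizer.
print(load_pickled(pickle.dumps({"vocab_size": 3})))
```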

Releases (1.4.4)

1.4.4 (Nov 7, 2022)
What's Changed

Fix string error when loading clip models. by @Adenialzz in https://github.com/fkodom/clip-text-decoder/pull/12

New Contributors

@Adenialzz made their first contribution in https://github.com/fkodom/clip-text-decoder/pull/12

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.3...1.4.4

1.4.3 (Nov 7, 2022)
What's Changed

Refactor Dataset by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/11

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.2...1.4.3

1.4.2 (Oct 26, 2022)
What's Changed

Huggingface Evaluate by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/9

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.1...1.4.2

1.4.1 (Oct 26, 2022)
What's Changed

Datapipes by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/8

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.0...1.4.1

1.4.0 (Oct 23, 2022)
What's Changed

BLIP vision backbone by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/7

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.3.0...1.4.0

1.3.0 (Oct 2, 2022)
What's Changed

Feature: Beam Search by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/5

Bug Fix: PyPI Release by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/6

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.2.0...1.3.0

1.2.0 (Jan 29, 2022)
What's Changed

Cache CLIP embeddings for the dataset, rather than recomputing them each time.

Reduce model file sizes by storing at lower precision

Add an ImageCaptionInferenceModel class for easier out-of-the-box use

Fix some broken unit tests

Better Data Caching by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/3

Bug Fixes for Broken Tests by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/4

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.1.0...1.2.0

1.1.0 (Dec 22, 2021)
What's Changed

GPT2 Decoder by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/2

New Contributors

@fkodom made their first contribution in https://github.com/fkodom/clip-text-decoder/pull/2

Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.0.0...1.1.0

1.0.0 (Nov 15, 2021)

0.1.1 (Nov 14, 2021)

Add installation docs to README, and automatically publish to PyPI using GitHub Actions workflow.

0.1.0 (Nov 14, 2021)

First pre-release with pretrained models in README.

Owner

Frank Odom

Director of Innovation at Plainsight. I like neural nets, and neural nets like me.
