A deep learning model that describes the content of an image as speech, using attention-based caption generation trained on the Flickr8K dataset.

Last update: Feb 08, 2022

Related tags

Text Data & NLP Eye_for_the_blind

Overview

Eye for the blind

The goal is to build a deep learning model that describes the content of an image as speech, using caption generation with an attention mechanism trained on the Flickr8K dataset. Such a model can help blind and visually impaired people understand an image by listening to its description. The caption generated by the CNN-RNN model is converted to speech with a text-to-speech library.

This problem combines deep learning and natural language processing. A CNN-based encoder extracts the features of an image, and an RNN-based decoder turns those features into a caption.

The project extends the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention": https://arxiv.org/abs/1502.03044
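To make the attention idea concrete, here is a minimal sketch of one soft-attention step in plain Python. It assumes an additive scoring function (tanh over projected image features and the decoder state), which is one common reading of the paper's setup; the weight names w_feat, w_hid and v and all toy values are hypothetical, and this is not the project's actual implementation.

```python
import math

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, hidden, w_feat, w_hid, v):
    """Return (context, weights) for one decoding step.

    features: list of image-region vectors from the encoder
    hidden:   current decoder state
    w_feat, w_hid, v: hypothetical learned parameters
    """
    proj_hidden = matvec(w_hid, hidden)
    scores = []
    for region in features:
        proj_region = matvec(w_feat, region)
        activated = [math.tanh(a + b) for a, b in zip(proj_region, proj_hidden)]
        scores.append(sum(vi * ai for vi, ai in zip(v, activated)))
    weights = softmax(scores)
    dim = len(features[0])
    # The context vector is the attention-weighted sum of the region features.
    context = [sum(w * region[d] for w, region in zip(weights, features))
               for d in range(dim)]
    return context, weights

# Toy values: three image regions with two features each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5]
w_feat = [[1.0, 0.0], [0.0, 1.0]]
w_hid = [[1.0], [1.0]]
v = [1.0, 1.0]
context, weights = soft_attention(features, hidden, w_feat, w_hid, v)
```

In a real model the decoder would consume the context vector together with the previous word at every step, so the weights show which image regions influenced each generated word.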

The dataset is taken from Kaggle. It consists of 8,000 images, each paired with five different captions that clearly describe the salient entities and events in the image.
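As a rough illustration of the one-image-to-many-captions structure, the sketch below groups captions by image file name. The CSV layout (a header row with image and caption columns) and the sample rows are assumptions made for the example; check the actual caption file in your copy of the dataset before relying on it.

```python
import csv
import io
from collections import defaultdict

def load_captions(lines):
    """Group captions by image name from CSV rows of (image, caption).

    The column names "image" and "caption" are assumed, not confirmed.
    """
    reader = csv.DictReader(lines)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["image"]].append(row["caption"].strip())
    return dict(grouped)

# Hypothetical sample standing in for the real caption file.
SAMPLE = io.StringIO(
    "image,caption\n"
    "img_001.jpg,A dog runs across a field .\n"
    "img_001.jpg,A brown dog is running outside .\n"
    "img_002.jpg,Two children play on a beach .\n"
)

captions = load_captions(SAMPLE)
for image_name, texts in captions.items():
    print(image_name, len(texts))
```

With the full dataset, each image name would be expected to map to a list of five captions.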

Project Pipeline

The project pipeline can be briefly summarized in the following five steps:

Data Understanding: Load the data and understand how it is represented.
Data Preprocessing: Process both images and captions into the desired format.
Train/Test Split: Combine images and captions to create the train and test datasets.
Model Building: Create the image captioning model by building the Encoder, Attention, and Decoder components.
Model Evaluation: Evaluate the model using greedy search and BLEU score.
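The evaluation step can be sketched as follows. This is a simplified illustration, not the project's code: greedy search picks the highest-scoring word at each step, and bleu1 computes only unigram BLEU (clipped precision times the brevity penalty) rather than the usual 4-gram score. The toy_step function and its score table are hypothetical stand-ins for a trained decoder.

```python
import math
from collections import Counter

def greedy_decode(step_fn, start_token, end_token, max_len=20):
    """Append the single best next token until end_token or max_len."""
    tokens = [start_token]
    for _ in range(max_len):
        scores = step_fn(tokens)  # dict mapping token -> score
        next_token = max(scores, key=scores.get)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token

def bleu1(candidate, references):
    """Unigram BLEU: clipped precision multiplied by the brevity penalty."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for tok, cnt in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], cnt)
    clipped = sum(min(cnt, max_ref_counts[tok])
                  for tok, cnt in cand_counts.items())
    precision = clipped / len(candidate)
    # Use the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda n: (abs(n - len(candidate)), n))
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    return penalty * precision

# Hypothetical decoder: scores depend only on the previous token.
TOY_TABLE = {
    "<start>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.8, "runs": 0.2},
    "dog": {"runs": 0.7, "<end>": 0.3},
    "runs": {"<end>": 0.9, "a": 0.1},
}

def toy_step(tokens):
    return TOY_TABLE[tokens[-1]]

caption = greedy_decode(toy_step, "<start>", "<end>")
score = bleu1(caption, [["a", "dog", "runs"],
                        ["the", "dog", "is", "running"]])
```

Greedy search is fast but can miss captions that a wider search such as beam search would find; for reported results, an established BLEU implementation is a safer choice than this simplified version.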

Owner

Ragesh Hajela

