Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss (ATVGnet)

Last update: Dec 27, 2022

By Lele Chen, Ross K Maddox, Zhiyao Duan, Chenliang Xu.

University of Rochester.

Introduction
Citation
Running
Model
Results
Disclaimer and known issues

Introduction

This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. The code can be applied directly to LRW and GRID. The model outputs are visualized as follows: the first is the landmark sequence synthesized by ATnet; the rest are the attention map, motion map, and final result from VGnet.

Citation

If you use any code, models, or ideas from this repo in your research, please cite:

@inproceedings{chen2019hierarchical,
  title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
  author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7832--7841},
  year={2019}
}

Running

This code is tested under Python 2.7. The provided model is trained on LRW, but it also works well on GRID, VoxCeleb, and other datasets. You can compare this model directly with your own model on other datasets; we consider this a fair comparison.
PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch)
Install the requirements (pip install -r requirement.txt)
Download the pretrained ATnet and VGnet weights from Google Drive and put them under the model folder.
Run the demo code: python demo.py
- -device_ids: gpu id
- -cuda: using cuda or not
- -vg_model: pretrained VGnet weight
- -at_model: pretrained ATnet weight
- -lstm: use lstm or not
- -p: input example image
- -i: input audio file
- -sample_dir: folder to save the outputs
- ...
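As a rough illustration of the demo step, the sketch below assembles a `demo.py` command line from the flags listed above. All file paths are hypothetical placeholders, and the flag spellings are copied as they appear in the list, so it is worth confirming them against `python demo.py -h` before relying on this. The `-cuda` and `-lstm` flags are left out because the list does not say whether they take a value.

```python
import subprocess


def build_demo_command(image_path, audio_path, sample_dir,
                       at_model, vg_model, device_ids="0"):
    """Assemble a demo.py invocation from the flags listed in this README.

    Flag spellings are taken from the list above and are not verified
    against the script's argument parser.
    """
    return [
        "python", "demo.py",
        "-device_ids", device_ids,   # gpu id
        "-at_model", at_model,       # pretrained ATnet weight
        "-vg_model", vg_model,       # pretrained VGnet weight
        "-p", image_path,            # input example image
        "-i", audio_path,            # input audio file
        "-sample_dir", sample_dir,   # folder to save the outputs
    ]


def run_demo(command):
    """Run the assembled command; call this from the repository root."""
    subprocess.check_call(command)


# Every path below is a hypothetical placeholder, not a file shipped with the repo.
cmd = build_demo_command(
    image_path="my_face.jpg",
    audio_path="my_speech.wav",
    sample_dir="results",
    at_model="model/atnet_weights.pth",
    vg_model="model/vgnet_weights.pth",
)
print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.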
Download and unzip the training data from LRW
Preprocess the data (extract landmarks and crop the images with dlib).
Train the ATnet model: python atnet.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_dir: folder to save weights
- -lstm: use lstm or not
- -sample_dir: folder to save visualized images during training
- ...
Test the model: python atnet_test.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_name: pretrained weights
- -sample_dir: folder to save the outputs
- -lstm: use lstm or not
- ...
Train the VGnet: python vgnet.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_dir: folder to save weights
- -sample_dir: folder to save visualized images during training
- ...
Test the VGnet: python vgnet_test.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_name: pretrained weights
- -sample_dir: folder to save the outputs
- ...
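For the preprocessing step above (landmark extraction and cropping with dlib), the following is a minimal sketch of one way it could look. It is not the authors' preprocessing pipeline, which this README describes as not yet released: the crop margin, the use of the first detected face, and the predictor file path are all assumptions, and the model may expect a different alignment.

```python
def square_crop_box(points, image_size, margin=0.3):
    """Return (left, top, right, bottom) of a box centred on the landmarks.

    points     : list of (x, y) landmark coordinates
    image_size : (width, height) of the source image
    margin     : assumed fractional padding added around the landmarks
    The box is square before clamping; clamping to the image bounds can
    make it non-square for faces close to an edge.
    """
    width, height = image_size
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * (1.0 + margin)
    half = side / 2.0
    left = max(0, int(round(cx - half)))
    top = max(0, int(round(cy - half)))
    right = min(width, int(round(cx + half)))
    bottom = min(height, int(round(cy + half)))
    return left, top, right, bottom


def detect_landmarks(image_path, predictor_path):
    """Detect the first face with dlib and return its landmark points.

    predictor_path points to a dlib shape predictor file that has to be
    downloaded separately (hypothetical path supplied by the caller).
    Returns None when no face is found.
    """
    import dlib  # imported lazily so the crop helper works without dlib

    image = dlib.load_rgb_image(image_path)
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)
    faces = detector(image, 1)  # upsample once to find smaller faces
    if len(faces) == 0:
        return None
    shape = predictor(image, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]


# Small worked example with two made-up landmark points.
box = square_crop_box([(40, 40), (60, 80)], image_size=(200, 200), margin=0.0)
print(box)  # (30, 40, 70, 80)
```

Keeping the box computation separate from the dlib calls makes it easy to check the crop geometry on its own and to swap in a different landmark detector.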

Model

Overall ATVGnet
Regression-based discriminator network

Results

Result visualization on different datasets:
Results compared with other SOTA methods:
Study of image robustness with respect to landmark accuracy:
Quantitative results:

Disclaimer and known issues

The code is implemented in PyTorch.
In this paper, we train on LRW and GRID separately.
The model is sensitive to input images. Please use the correct preprocessing code.
The data processing code is not finished yet; I will release it soon. In the meantime, you can try the model with your own image.
If you want to train these models using this version of PyTorch without modifications, please note that:
- You need at least 12 GB of GPU memory.
- There might be some other untested issues.
There is another interesting and useful piece of research on audio-to-landmark generation. Please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.
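Before starting a training run, the 12 GB memory note above can be checked programmatically. The sketch below is only an illustration: it treats "12 GB" as 12 × 10^9 bytes (an assumption, chosen because cards sold as 12 GB usually report slightly less than 12 GiB), and it only looks at total device memory, not at what is currently free.

```python
REQUIRED_GB = 12  # figure taken from the note above


def has_enough_memory(total_bytes, required_gb=REQUIRED_GB):
    """Return True if total_bytes is at least required_gb decimal gigabytes."""
    return total_bytes >= required_gb * 10 ** 9


def check_gpu(device_index=0):
    """Query the total memory of one CUDA device through PyTorch.

    Returns False when CUDA is unavailable. torch is imported lazily so
    that has_enough_memory can be used without PyTorch installed.
    """
    import torch

    if not torch.cuda.is_available():
        return False
    total = torch.cuda.get_device_properties(device_index).total_memory
    return has_enough_memory(total)


# Pure-Python checks that do not need a GPU.
print(has_enough_memory(12 * 10 ** 9))  # True
print(has_enough_memory(8 * 10 ** 9))   # False
```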

Todos

Release training data

License

MIT
