The official code for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".

Last update: Dec 23, 2022

Related tags

Computer Vision SpeechDrivesTemplates

Overview

SpeechDrivesTemplates

The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".

[arxiv / video]

Our paper and this repo focus on upper-body pose generation from audio. To synthesize images from poses, please refer to this Pose2Img repo.

Code
Model
Data preparation

Package Hierarchy

|-- config
|     |-- default.py
|     |-- voice2pose_s2g_speech2gesture.yaml        # baseline: speech2gesture
|     |-- voice2pose_sdt_vae_speech2gesture.yaml    # ours (VAE)
|     |-- pose2pose_speech2gesture.yaml             # gesture reconstruction  
|     `-- voice2pose_sdt_bp_speech2gesture.yaml     # ours (Backprop)
|
|-- core
|     |-- datasets
|     |-- netowrks
|     |-- pipelines
|     \-- utils
|
|-- dataset
|     \-- speech2gesture  # create a soft link here
|
|-- output
|     \-- <date-config-tag>  # A directory for each experiment
|
`-- main.py

Setup the Dataset

Datasets shuold be placed in the dataset directory. Just create a soft link like this:

ln -s <path-to-SPEECH2GESTURE-dataset> ./dataset/speech2gesture

For your own dataset, you need to implement a subclass of torch.utils.data.Dataset in core/datasets/custom_dataset.py.

Train

Train a Model from Scratch

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --tag DEV \
    SYS.NUM_WORKERS 32

--tag set the name of the experiment which wil be displayed in the outputfile.
You can overwrite the any parameters defined in voice2pose_default.py by simply adding it at the end of the command. The example above set SYS.NUM_WORKERS to 32 temporarily.

Resume Training from an Interrupted Experiment

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --resume_from <checkpoint-to-continue-from>

This command will load the state_dict from the checkpoint for both the model and the optimizer, and write results to the original directory that the checkpoint lies in.

Training from a pretrained model

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --pretrain_from <checkpoint-to-continue-from> \
    --tag DEV

This command will only load the state_dict for the model, and write results to a new base directory.

Test

To test the model, run this command:

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --tag DEV \
    --test-only \
    --checkpoint <path-to-checkpoint>

Demo

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --tag <DEV> \
    --demo_input <audio.wav> \
    --checkpoint <path-to-checkpoint> \
    DATASET.SPEAKER oliver \
    SYS.VIDEO_FORMAT "['mp4']"

Important Details

Dataset caching

We turn on dataset caching (DATASET.CACHING) by default to speed up training.

If you encounter errors in the dataloader like RuntimeError: received 0 items of ancdata, please increase ulimit by running the command ulimit -n 262144. (refer to this issue)

DataParallel and DistributedDataParallel

We use single GPU (warpped by DataParallel) by default since it is fast enough with dataset caching. For multi-GPU training, we recommand using DistributedDataParallel (DDP) because it provide SyncBN across GPU cards. To enable DDP, set SYS.DISTRIBUTED to True and set SYS.WORLD_SIZE according to the number of GPUs.

When using DDP, assure that the batch_size can be divided exactly by SYS.WORLD_SIZE.

Misc

To run any module other than the main files in the root directory, for example the core\datasets\speech2gesture.py file, you should run python -m core.datasets.speech2gesture rather than python core\datasets\speech2gesture.py. This is an interesting problem of Python's relative importing which deserves in-depth thinking.
We save a checkpoint and conduct validation after each epoch. You can change the interval in the config file.
We generate and save 2 videos in each epoch when training. During validation, we sample 8 videos for each epoch. These videos are saved in tensorborad (without sound) and mp4 (with sound). You can change the SYS.VIDEO_FORMAT parameter to select one or two of them.
We usually sett NUM_WORKERS to 32 for best performance. If you encounter any error about memory, try lower NUM_WORKERS.

@inproceedings{qian2021speech,
  title={Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates},
  author={Qian, Shenhan and Tu, Zhi and Zhi, YiHao and Liu, Wen and Gao, Shenghua},
  journal={International Conference on Computer Vision (ICCV)},
  year={2021}
}

The official code for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".

Related tags

Overview

SpeechDrivesTemplates

Package Hierarchy

Setup the Dataset

Train

Train a Model from Scratch

Resume Training from an Interrupted Experiment

Training from a pretrained model

Test

Demo

Important Details

Dataset caching

DataParallel and DistributedDataParallel

Misc

Owner

Qian Shenhan

PyQT5 app that colorize black & white pictures using CNN(use pre-trained model which was made with OpenCV)

A collection of resources (including the papers and datasets) of OCR (Optical Character Recognition).

aardio的opencv库

This is an API written in python that uses FastAPI. It is a simple API that can detect discord tokens in Images.

3点クリックで円を指定し、極座標変換を行うサンプルプログラム

[ICCV, 2021] Cloud Transformers: A Universal Approach To Point Cloud Processing Tasks

A post-processing tool for scanned sheets of paper.

An Implementation of the FOTS: Fast Oriented Text Spotting with a Unified Network

computer vision, image processing and machine learning on the web browser or node.

Detect and fix skew in images containing text

A facial recognition device is a device that takes an image or a video of a human face and compares it to another image faces in a database.

Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition:

Controlling Volume by Hand Gestures

Using computer vision method to recognize and calcutate the features of the architecture.

A simple Digits Recogniser made in Python

Lightning Fast Language Prediction 🚀

[EMNLP 2021] Improving and Simplifying Pattern Exploiting Training

A PyTorch implementation of ECCV2018 Paper: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

a deep learning model for page layout analysis / segmentation.