Video Autoencoder: self-supervised disentanglement of 3D structure and motion

Overview

Video Autoencoder: self-supervised disentanglement of 3D structure and motion

This repository contains the code (in PyTorch) for the model introduced in the following paper:

Video Autoencoder: self-supervised disentanglement of 3D structure and motion
Zihang Lai, Sifei Liu, Alexi A. Efros, Xiaolong Wang
ICCV, 2021
[Paper] [Project Page] [12-min oral pres. video] [3-min supplemental video]

Figure

Citation

@inproceedings{Lai21a,
        title={Video Autoencoder: self-supervised disentanglement of 3D structure and motion},
        author={Lai, Zihang and Liu, Sifei and Efros, Alexei A and Wang, Xiaolong},
        booktitle={ICCV},
        year={2021}
}

Contents

  1. Introduction
  2. Data preparation
  3. Training
  4. Evaluation
  5. Pretrained model

Introduction

Figure We present Video Autoencoder for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the Video Autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera poses for each frame. These two representations will then be re-entangled for rendering the input video frames. Video Autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.

Dependencies

The following dependencies are not strict - they are the versions that we use.

Data preparation

RealEstate10K:

  1. Download the dataset from RealEstate10K.
  2. Download videos from RealEstate10K dataset, decode videos into frames. You might find the RealEstate10K_Downloader written by cashiwamochi helpful. Organize the data files into the following structure:
RealEstate10K/
    train/
        0000cc6d8b108390.txt
        00028da87cc5a4c4.txt
        ...
    test/
        000c3ab189999a83.txt
        000db54a47bd43fe.txt
        ...
dataset/
    train/
        0000cc6d8b108390/
            52553000.jpg
            52586000.jpg
            ...
        00028da87cc5a4c4/
            ...
    test/
        000c3ab189999a83/
        ...
  1. Subsample the training set at one-third of the original frame-rate (so that the motion is sufficiently large). You can use scripts/subsample_dataset.py.
  2. A list of videos ids that we used (10K for training and 5K for testing) is provided here:
    1. Training video ids and testing video ids.
    2. Note: as time changes, the availability of videos could change.

Matterport 3D (this could be tricky):

  1. Install habitat-api and habitat-sim. You need to use the following repo version (see this SynSin issue for details):

    1. habitat-sim: d383c2011bf1baab2ce7b3cd40aea573ad2ddf71
    2. habitat-api: e94e6f3953fcfba4c29ee30f65baa52d6cea716e
  2. Download the models from the Matterport3D dataset and the point nav datasets. You should have a dataset folder with the following data structure:

    root_folder/
         mp3d/
             17DRP5sb8fy/
                 17DRP5sb8fy.glb  
                 17DRP5sb8fy.house  
                 17DRP5sb8fy.navmesh  
                 17DRP5sb8fy_semantic.ply
             1LXtFkjw3qL/
                 ...
             1pXnuDYAj8r/
                 ...
             ...
         pointnav/
             mp3d/
                 ...
    
  3. Walk-through videos for pretraining: We use a ShortestPathFollower function provided by the Habitat navigation package to generate episodes of tours of the rooms. See scripts/generate_matterport3d_videos.py for details.

  4. Training and testing view synthesis pairs: we generally follow the same steps as the SynSin data instruction. The main difference is that we precompute all the image pairs. See scripts/generate_matterport3d_train_image_pairs.py and scripts/generate_matterport3d_test_image_pairs.py for details.

###Replica:

  1. Testing view synthesis pairs: This procedure is similar to step 4 in Matterport3D - with only the specific dataset changed. See scripts/generate_replica_test_image_pairs.py for details.

Configurations

Finally, change the data paths in configs/dataset.yaml to your data location.

Pre-trained models

  • Pre-trained model (RealEstate10K): Link
  • Pre-trained model (Matterport3D): Link

Training:

Use this script:

CUDA_VISIBLE_DEVICES=0,1 python train.py --savepath log/train --dataset RealEstate10K

Some optional commands (w/ default value in square bracket):

  • Select dataset: --dataset [RealEstate10K]
  • Interval between clip frames: --interval [1]
  • Change clip length: --clip_length [6]
  • Increase/decrease lr step: --lr_adj [1.0]
  • For Matterport3D finetuning, you need to set --clip_length 2, because the data are pairs of images.

Evaluation:

1. Generate test results:

Use this script (for testing RealEstate10K):

CUDA_VISIBLE_DEVICES=0 python test_re10k.py --savepath log/model --resume log/model/checkpoint.tar --dataset RealEstate10K

or this script (for testing Matterport3D/Replica):

CUDA_VISIBLE_DEVICES=0 python test_mp3d.py --savepath log/model --resume log/model/checkpoint.tar --dataset Matterport3D

Some optional commands:

  • Select dataset: --dataset [RealEstate10K]
  • Max number of frames: --frame_limit [30]
  • Max number of sequences: --video_limit [100]
  • Use training set to evaluate: --train_set

Running this will generate a output folder where the results (videos and poses) save. If you want to visualize the pose, use packages for evaluation of odometry, such as evo. If you want to quantitatively evaluate the results, see 2.1, 2.2.

2.1 Quantitative Evaluation of synthesis results:

Use this script:

python eval_syn_re10k.py [OUTPUT_DIR] (for RealEstate10K)
python eval_syn_mp3d.py [OUTPUT_DIR] (for Matterport3D)

Optional commands:

  • Evaluate LPIPS: --lpips

2.2 Quantitative Evaluation of pose prediction results:

Use this script:

python eval_pose.py [POSE_DIR]

Contact

For any questions about the code or the paper, you can contact zihang.lai at gmail.com.

Owner
Working from home
MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions

MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions Project Page | Paper If you find our work useful for your research, please con

96 Jan 04, 2023
SE-MSCNN: A Lightweight Multi-scaled Fusion Network for Sleep Apnea Detection Using Single-Lead ECG Signals

SE-MSCNN: A Lightweight Multi-scaled Fusion Network for Sleep Apnea Detection Using Single-Lead ECG Signals Abstract Sleep apnea (SA) is a common slee

9 Dec 21, 2022
TF Image Segmentation: Image Segmentation framework

TF Image Segmentation: Image Segmentation framework The aim of the TF Image Segmentation framework is to provide/provide a simplified way for: Convert

Daniil Pakhomov 546 Dec 17, 2022
ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

ENet in Caffe Execution times and hardware requirements Network 1024x512 1280x720 Parameters Model size (fp32) ENet 20.4 ms 32.9 ms 0.36 M 1.5 MB SegN

Timo Sämann 561 Jan 04, 2023
Python implementation of ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images, AAAI2022.

ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images Binh M. Le & Simon S. Woo, "ADD:

2 Oct 24, 2022
On Effective Scheduling of Model-based Reinforcement Learning

On Effective Scheduling of Model-based Reinforcement Learning Code to reproduce the experiments in On Effective Scheduling of Model-based Reinforcemen

laihang 8 Oct 07, 2022
AgML is a comprehensive library for agricultural machine learning

AgML is a comprehensive library for agricultural machine learning. Currently, AgML provides access to a wealth of public agricultural datasets for common agricultural deep learning tasks.

Plant AI and Biophysics Lab 1 Jul 07, 2022
Phy-Q: A Benchmark for Physical Reasoning

Phy-Q: A Benchmark for Physical Reasoning Cheng Xue*, Vimukthini Pinto*, Chathura Gamage* Ekaterina Nikonova, Peng Zhang, Jochen Renz School of Comput

29 Dec 19, 2022
Python Wrapper for Embree

pyembree Python Wrapper for Embree Installation You can install pyembree (and embree) via the conda-forge package. $ conda install -c conda-forge pyem

Anthony Scopatz 67 Dec 24, 2022
A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.

collie_recs Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Coll

ShopRunner 97 Jan 03, 2023
A minimalist environment for decision-making in autonomous driving

highway-env A collection of environments for autonomous driving and tactical decision-making tasks An episode of one of the environments available in

Edouard Leurent 1.6k Jan 07, 2023
1st place solution to the Satellite Image Change Detection Challenge hosted by SenseTime

1st place solution to the Satellite Image Change Detection Challenge hosted by SenseTime

Lihe Yang 209 Jan 01, 2023
Face Library is an open source package for accurate and real-time face detection and recognition

Face Library Face Library is an open source package for accurate and real-time face detection and recognition. The package is built over OpenCV and us

52 Nov 09, 2022
Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js

Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js

Chris Donahue 98 Dec 14, 2022
The implementation of CVPR2021 paper Temporal Query Networks for Fine-grained Video Understanding, by Chuhan Zhang, Ankush Gupta and Andrew Zisserman.

Temporal Query Networks for Fine-grained Video Understanding 📋 This repository contains the implementation of CVPR2021 paper Temporal_Query_Networks

55 Dec 21, 2022
TensorFlow-LiveLessons - "Deep Learning with TensorFlow" LiveLessons

TensorFlow-LiveLessons Note that the second edition of this video series is now available here. The second edition contains all of the content from th

Deep Learning Study Group 830 Jan 03, 2023
A PyTorch implementation of Implicit Q-Learning

IQL-PyTorch This repository houses a minimal PyTorch implementation of Implicit Q-Learning (IQL), an offline reinforcement learning algorithm, along w

Garrett Thomas 30 Dec 12, 2022
Repository for the paper "PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation", CVPR 2021.

PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation Code repository for the paper: PoseAug: A Differentiable Pose Augme

Pyjcsx 328 Dec 17, 2022
Background Matting: The World is Your Green Screen

Background Matting: The World is Your Green Screen By Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, and Ira Kemelmacher-Shlizerman Th

Soumyadip Sengupta 4.6k Jan 04, 2023
Nonnegative spatial factorization for multivariate count data

Nonnegative spatial factorization for multivariate count data This repository contains supporting code to facilitate reproducible analysis. For detail

Will Townes 24 Dec 19, 2022