Unofficial & improved implementation of NeRF--: Neural Radiance Fields Without Known Camera Parameters

Overview

[Unofficial code-base] NeRF--: Neural Radiance Fields Without Known Camera Parameters

[ Project | Paper | Official code base ] ⬅️ Thanks to the original authors for the great work!

  • ⚠️ This is an unofficial PyTorch re-implementation of the paper NeRF--: Neural Radiance Fields Without Known Camera Parameters.
  • I have reproduced the results on the LLFF-fern dataset, the LLFF-flower dataset, personal photos, and some YouTube video clips that I chose.
  • This repo contains implementations of both the original paper and my personal modifications & refinements.

Example results

  • input: raw images of the same scene (in arbitrary order)
  • output (after joint optimization):
    • camera intrinsics (focal_x and focal_y)
    • camera extrinsics (inverse of poses, rotations and translations) of each image
    • a 3D implicit representation framework [NeRF] that models both appearance and geometry of the scene

Source 1: random YouTube video clips, time from 00:10:36 to 00:10:42

ReLU-based NeRF--
(no refinement, stuck in local minima)
SIREN-based NeRF--
(no refinement)
input
32 raw photos, sampled at 5 fps
32 x [540 x 960 x 3]
castle_input
learned scene model size 1.7 MiB / 158.7k params
8+ MLPs with width of 128
learned camera poses castle_1041_relu_pose castle_1041_pose_siren
predicted rgb
(appearance)
(with novel view synthesis)
castle_1041_relu castle_1041_siren
predicted depth
(geometry)
(with novel view synthesis)
castle_1041_relu_depth castle_1041_siren

Source 2: random YouTube video clips, time from 00:46:17 to 00:46:28

ReLU-based NeRF--
(with refinement, still stuck in local minima)
SIREN-based NeRF--
(with refinement)
input
27 raw photos, sampled at 2.5 fps
27 x [540 x 960 x 3]
castle_4614_input
learned scene model size 1.7 MiB / 158.7k params
8+ MLPs with width of 128
learned camera poses castle_4614_pose_siren castle_4614_pose_siren
predicted rgb
(appearance)
(with novel view synthesis)
castle_1041_siren castle_1041_siren
predicted depth
(geometry)
(with novel view synthesis)
castle_1041_siren castle_1041_siren

Source 3: photos by @crazyang

ReLU-based NeRF--
(no refinement)
SIREN-based NeRF--
(no refinement)
input
22 raw photos
22 x [756 x 1008 x3]
piano_input
learned scene model size 1.7 MiB / 158.7k params
8+ MLPs with width of 128
learned camera poses piano_relu_pose piano_siren_pose
predicted rgb
(appearance)
(with novel view synthesis)
piano_relu_rgb piano_siren_rgb
predicted depth
(geometry)
(with novel view synthesis)
piano_relu_depth piano_siren_depth

Notice that the reflectance of the piano's side is misinterpreted as transmittance, which is reasonable and acceptable since no prior on the shape of the piano is provided.

What is NeRF and what is NeRF--

NeRF

NeRF is a neural (differentiable) rendering framework with great potential. Please view [NeRF Project Page] for more details.

It represents scenes as a continuous function (typically modeled by several MLP layers with non-linear activations); the same idea underlies DeepSDF, SRN, DVR, and so on.

It is suggested to refer to [awesome-NeRF] and [awesome-neural-rendering] to catch up with the recent 'exploding' development in these areas.

NeRF--

NeRF-- modifies the original NeRF from requiring known camera parameters to supporting unknown and learnable camera parameters.

  • NeRF-- does the following work in the training process:

    • Joint optimization of
      • camera intrinsics
      • camera extrinsics
      • a NeRF model (appearance and geometry)
    • Using pure raw real-world images and using just photometric loss (image reconstruction loss).
  • SfM+MVS

    • In other words, NeRF-- tackles the same problem as a basic SfM+MVS system like COLMAP, but learns the camera parameters, geometry and appearance of the scene simultaneously in a more natural and holistic way, requiring no hand-crafted feature extraction procedures like SIFT, or points, lines, surfaces, etc.
  • How?

    • Since NeRF is a neural rendering framework (which means the whole framework is differentiable), one can directly compute the gradients of the photometric loss with respect to the camera parameters.
  • 🚀 Wide future of NeRF-based framework --- vision by inverse computer graphics

    • Expect more to come! Imagine directly computing gradients of the photometric loss w.r.t. illumination? object poses? object motion? object deformation? object & background decomposition? object relationships?...
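
To make the "How?" point above concrete, here is a minimal sketch and not the repo's code. It assumes PyTorch, and it replaces NeRF volume rendering with a hypothetical toy renderer (`toy_render`); the names `focal` and `shift` are illustrative only. What it shows is that once rendering is built from differentiable operations, a photometric loss yields gradients for the camera parameters directly.

```python
import torch

def toy_render(focal, shift, pixels):
    """Hypothetical stand-in for NeRF rendering (assumption, not the repo's renderer).

    Maps pixel coordinates to an intensity through the camera parameters,
    using only differentiable operations.
    """
    dirs = (pixels + shift) / focal            # ray directions depend on the camera
    return torch.sin(3.0 * dirs).sum(dim=-1)   # smooth toy "scene" seen along each ray

torch.manual_seed(0)
pixels = torch.rand(256, 2) * 2 - 1            # sampled pixel coordinates in [-1, 1]

# "Observed" intensities, produced by a camera we pretend not to know.
target = toy_render(torch.tensor(1.5), torch.tensor([0.2, -0.1]), pixels)

focal = torch.tensor(1.0, requires_grad=True)  # unknown intrinsics (toy)
shift = torch.zeros(2, requires_grad=True)     # unknown extrinsics (toy translation)

# Photometric (image reconstruction) loss between rendered and observed values.
loss = ((toy_render(focal, shift, pixels) - target) ** 2).mean()
loss.backward()                                # gradients reach the camera parameters

print(focal.grad, shift.grad)
```

In a real setup these gradients would be passed to an optimizer together with the gradients of the scene model, which is what joint optimization means here.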

My modifications & refinements / optional features

This repo first implements NeRF-- with nothing changed from the original paper. It also supports the following optional modifications, and will keep being updated.

All the options are configured using yaml configuration files in the configs folder. See details about how to use these configs in the configuration section.

SIREN-based NeRF as backbone

Replace the ReLU activations of NeRF with sinusoidal (sin) activations. Code borrowed and modified from [lucidrains' implementation of pi-GAN]. Please refer to SIREN and pi-GAN for more theoretical details.

To config:

model:
  framework: SirenNeRF # options: [NeRF, SirenNeRF]
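
As a rough illustration of what a sine-activated layer looks like, here is a minimal sketch assuming PyTorch. The class name `SineLayer`, the default `w0=30.0`, and the initialization bounds follow the SIREN paper and are assumptions here; they are not taken from this repo's code and may differ from what it uses.

```python
import math
import torch
from torch import nn

class SineLayer(nn.Module):
    """Linear layer followed by sin(w0 * x), in the style of SIREN (illustrative)."""

    def __init__(self, in_dim, out_dim, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_dim, out_dim)
        # Weight initialization suggested in the SIREN paper:
        # a wider range for the first layer, a w0-scaled range for later layers.
        bound = 1.0 / in_dim if is_first else math.sqrt(6.0 / in_dim) / w0
        with torch.no_grad():
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

layer = SineLayer(3, 128, is_first=True)   # xyz input, width 128
out = layer(torch.rand(1024, 3))           # 1024 sample points
print(out.shape)                           # torch.Size([1024, 128])
```

Because the activation is a sine, every output lies in [-1, 1] and the layer is smooth everywhere, which is consistent with the smoother shapes described in the comparison that follows.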

📌 SIREN-based NeRF compared with ReLU-based NeRF

  • SirenNeRF can lead to smoother learned scenes (especially smoother shapes). (below: ReLU-based on the left, SIREN-based on the right)
ReLU-based NeRF-- (no refinement) SIREN-based NeRF-- (no refinement)
image-20210418015802348 depth_siren
  • SirenNeRF can lead to better results (smaller losses at convergence, better SSIM/PSNR metrics).

siren_vs_relu_loss

The above two conclusions are also evidenced by the DeepSDF results shown in the SIREN project.

  • SirenNeRF performs somewhat worse on scenes with lots of sharp and messy edges (due to its continuously differentiable & smoothing nature).

e.g. LLFF-flower scene

ReLU-based NeRF-- (with refinement) SIREN-based NeRF-- (with refinement)
relu_rgb rgb_siren
relu_depth depth_siren

Note: since the raw output of SirenNeRF grows relatively slowly, I multiply the raw output (sigma) of SirenNeRF by a factor of 30 (previously 10). To configure, use model:siren_sigma_mul.

[WIP] Perceptual model

For fewer shots with large viewport changes, I add an option to use a perceptual model (CLIP) and an additional perceptual loss along with the reconstruction loss, as in DietNeRF.

To config:

data:
  N_rays: -1 # options: -1 for whole image and no sampling, an integer > 0 for number of ray samples
training:
  w_perceptual: 0.01 # options: 0. for no perceptual model & loss, >0 to enable

Note: the CLIP model requires at least 224x224 resolution and a whole image (not sampled rays) as input, so:

  • data:N_rays must be set to -1 to generate whole images when training
  • data:downscale must be set to a proper value, and a GPU with more memory is required
    • or proper up-sampling is required

More choices of rotation representations / intrinsics parameterization

  • rotation representation

Refer to this paper for theoretical suggestions for different choices of SO(3) representations.

To config:

model:
  so3_representation: 'axis-angle' # options: [quaternion, axis-angle, rotation6D]
  • intrinsics parameterization

To config:

model:
  intrinsics_representation: 'square' # options: [square, ratio, exp]
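
For the rotation representation option above, the axis-angle choice stores each rotation as a 3-vector (axis times angle) and converts it to a rotation matrix with Rodrigues' formula, R = I + sin(θ) K + (1 - cos(θ)) K². The following is a plain-Python sketch of that conversion for illustration; the function name is hypothetical, and a trainable version would be written with tensor operations and a differentiable small-angle branch instead of the identity fallback used here.

```python
import math

def axis_angle_to_matrix(v):
    """Rodrigues' formula: rotation vector v (unit axis * angle in radians) -> 3x3 matrix."""
    theta = math.sqrt(sum(c * c for c in v))
    if theta < 1e-8:
        # Near-zero rotation: return the identity (illustrative shortcut only).
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (c / theta for c in v)
    # Cross-product (skew-symmetric) matrix of the unit axis.
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    KK = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c * KK[i][j] for j in range(3)]
            for i in range(3)]

# A 90-degree rotation about the z-axis sends the x-axis to the y-axis.
R = axis_angle_to_matrix([0.0, 0.0, math.pi / 2])
x_axis_rotated = [R[i][0] for i in range(3)]
print([round(c, 6) for c in x_axis_rotated])
```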

Usage

hardware

  • 💻 OS: tested on Ubuntu 16 & 18
  • GPU (all figures assume 1024 ray samples (the default) + 756 x 1000 resolution + 128 network width)
    • scene model parameter size
      • 🌟 1.7 MiB for float32.
      • For NeRF scene model, it's just 8+ layers of MLPs with ReLU/sin activation, with width of 128.
    • 🕐 training time on 2080Ti
      • <10 mins or 0-200 epochs: learning poses mainly, and rough appearances
      • ~4 hours, from ~300 to 10000 epochs: the poses change very little further; the NeRF model learns fine details (geometry & appearance) of the scene
    • GPU memory:
      • (training) ~3300 MiB GPU memory usage
      • (testing / rendering) lower GPU memory usage, but potentially higher GPU utilization, since testing runs at full resolution while training uses a small batch of sampled pixels per iteration.

software

  • Python >= 3.5

  • To install requirements, run:

    • Simply run (an Anaconda environment is suggested):

      ## install torch & cuda & torchvision using your favorite tools, conda/pip
      # pip install torch torchvision
      
      ## install other requirements
      pip install numpy pyyaml addict imageio imageio-ffmpeg scikit-image tqdm tensorboardX "pytorch3d>=0.3.0" opencv-python
    • Or

      conda env create -f environment.yml
  • Before running any python scripts for the first time, cd to the project root directory and add the root project directory to the PYTHONPATH by running:

    cd /path/to/improved-nerfmm
    source set_env.sh

configuration

There are three choices for giving configuration values:

  • [DO NOT change] configs/base.yaml contains all the default values of the whole config.
  • [Your playground] Specific config yamls in configs folder are for specific tasks. You only need to put related config keys here. It is given to the python scripts using python xxx.py --config /path/to/xxx.yaml.
  • You can also give additional runtime command-line arguments with python xxx.py --xxx:yyy val, to change the config dict: 'xxx':{'yyy': val}

The configuration overwriting priority order:

  • command line args >>overwrites>> --config /path/to/xxx.yaml >>overwrites>> configs/base.yaml
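
The priority order above can be sketched as two nested dictionary merges. This is only an illustration of the described behavior, not the repo's loader: the helper names `deep_update` and `parse_cli_overrides` and the default values in `base` are hypothetical, and command-line values are kept as strings here, whereas a real loader would presumably convert them to the right types.

```python
import copy

def deep_update(base, override):
    """Recursively overwrite entries of `base` with those of `override`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged

def parse_cli_overrides(args):
    """Turn ['--data:N_rays', '512'] into {'data': {'N_rays': '512'}}."""
    overrides = {}
    for flag, value in zip(args[::2], args[1::2]):
        node = overrides
        *parents, leaf = flag.lstrip("-").split(":")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return overrides

# Hypothetical contents standing in for configs/base.yaml and a task config.
base = {"data": {"N_rays": 1024, "downscale": 1.0}, "model": {"framework": "NeRF"}}
task = {"data": {"downscale": 4.0}, "model": {"framework": "SirenNeRF"}}
cli = parse_cli_overrides(["--data:N_rays", "512"])

# base.yaml is overwritten by the task config, which is overwritten by the command line.
config = deep_update(deep_update(base, task), cli)
print(config)
# {'data': {'N_rays': '512', 'downscale': 4.0}, 'model': {'framework': 'SirenNeRF'}}
```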

data

  • LLFF: download the LLFF example data using the script (run in the project root directory): bash dataio/download_example_data.sh. File path: (automatic).
  • YouTube video clips: https://www.youtube.com/watch?v=hWagaTjEa3Y. File paths: ./data/castle_1041 and ./data/castle_4614.
  • piano: photos by @crazyang, available on Google-drive. File path: ./data/piano.

pre-trained models

You can get pre-trained models in either of the following two ways:

  • Clone the repo using git-lfs, and you will get the pre-trained models and configs in the pretrained folder.
  • Download them from the pretrained folder in the google-drive

Training

Before running any python scripts for the first time, cd to the project root directory and add the root project directory to the PYTHONPATH by running:

cd /path/to/improved-nerfmm
source set_env.sh

Train on example data (without refinement)

Download LLFF example data using the scripts (run in project root directory)

bash dataio/download_example_data.sh

Start training:

python train.py --config configs/fern.yaml
  • To specify which GPUs to use (e.g. 2 and 3; in most cases, one GPU is enough):

    python train.py --config configs/fern.yaml --device_ids 2,3
    • Currently, this repo uses torch.nn.DataParallel for multi-GPU training.
  • View the training logs and stats output in the experiment folder: ./logs/fern

  • Run tensorboard to monitor the training process:

    tensorboard --logdir logs/fern/events
  • To resume previously interrupted training:

    python train.py --load_dir logs/fern
    • Note: The full config is automatically backed up in the experiment directory when training starts. Thus, when loading from a directory, the scripts will only read from your_exp_dir/config.yaml, and configs/base.yaml will not be used.

🚀 Train on your own data

  • 📌 Note on suitable input

    • ①static scene ②forward-facing view ③with small view-port changes.
      • Smaller viewport change / forward facing views
        • So that a certain face of a certain object should appear in all views
        • Otherwise the training will fail in the early stages (it fails to learn reasonable camera poses, and hence gives the NeRF no chance).
        • This is mostly because it processes all input images at once.
      • No moving / deforming objects. (e.g. a car driving across the street, a bird flying across the sky, people waving hands)
      • No significant illumination/exposure changes. (e.g. camera moving from pointing towards the sun to back to the sun)
      • No focal length changes. Currently all inputs are assumed to share the same camera intrinsics.
      • Just temporarily! (All imaginable limitations have imaginable solutions. Stay tuned!)
  • 📌 Note on training

    • When training with no refinement, the training process is roughly split into two phases:
      • [0 to about 100-300 epochs] The NeRF model learns some rough blurry pixel blocks, and these rough blocks help with optimizing the camera extrinsics.
      • [300 epochs+ to end] The camera extrinsics are almost fixed, with very small further changes; the NeRF model learns the fine details of the scene.
    • You should monitor the first 100~300 epochs of the training process. If no meaningful camera poses (especially the camera translation on the xy-plane) are learned during these early stages, further training is very unlikely to recover.
    • I have not tested on >50 images, but you can give it a try.
  • First, prepare your photos and put them into one separate folder, say /path/to/your_photos/xxx.png.

  • Second:

    • Write a new config file for your data: (you can put any config key mentioned in configs/base.yaml)

      expname: your_expname
      data:
        data_dir: /path/to/your_photos
        #=========
        N_rays: 1024 # number of sampled rays in training.
        downscale: 4.
        #=========
    • And run

      python train.py --config /path/to/your_config.yaml
    • Or you can use some existing config file and run:

      python train.py --config /path/to/xxx.yaml --data:data_dir /path/to/your_photos --expname your_expname
  • The logs and stats will be in the logs/your_expname folder.

  • Monitor the training process with:

    tensorboard --logdir logs/your_expname/events
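
To give a feel for the N_rays setting in the config above, here is a small sketch of per-iteration pixel sampling. It is an assumption about how such sampling could work, not the repo's data loader: `sample_pixels` is a hypothetical helper, and the image size is just an example of a 756 x 1008 photo with downscale: 4. applied as a plain division.

```python
import random

def sample_pixels(height, width, n_rays, rng=random):
    """Pick pixel (row, col) pairs for one training iteration.

    n_rays == -1 returns every pixel (whole image, no sampling);
    otherwise n_rays distinct pixels are drawn uniformly at random.
    """
    if n_rays == -1:
        return [(r, c) for r in range(height) for c in range(width)]
    flat = rng.sample(range(height * width), n_rays)
    return [divmod(i, width) for i in flat]   # flat index -> (row, col)

# Example: 756 x 1008 downscaled by 4 gives 189 x 252; draw 1024 rays from it.
pixels = sample_pixels(189, 252, 1024, random.Random(0))
whole = sample_pixels(4, 5, -1)
print(len(pixels), len(whole))   # 1024 20
```

Each sampled pixel corresponds to one camera ray, so N_rays directly controls how much of the image contributes to the loss per iteration and therefore the memory used in training.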

Train on video clips

  • First, clip your video.mp4 with ffmpeg.

    ffmpeg -ss 00:10:00 -i video.mp4 -to 00:00:05 -c copy video_clip.mp4
    • Note:

      • time format: hh:mm:ss.xxx
      • -ss means starting timestamp
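      • -to is normally an end timestamp, but because -ss comes before -i the output timestamps restart at zero, so here it acts as the clip duration (5 seconds), not an end timestamp in the original video.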
  • Second, convert video_clip.mp4 into images:

    mkdir output_dir
    ffmpeg -i video_clip.mp4 -filter:v fps=fps=3/1 output_dir/img-%04d.png
    • Note:

      • 3/1 means 3 frames per second. 3 is the numerator, 1 is the denominator.
  • Then train on your images following the instructions in 🚀 Train on your own data

Automatic training with a pre-train stage and refine stage

Run

python train.py --config ./configs/fern_prefine.yaml

Or

python train.py --config ./configs/fern.yaml --training:num_epoch_pre 1000 --expname fern_prefine

You can also try on your own photos using similar configurations.

Refining a pre-trained NeRF--

This is the step suggested by the original NeRF-- paper: drop all pre-trained parameters except for the camera parameters, and refine.

For example, refine a pre-trained LLFF-fern scene, with original config stored in ./configs/fern.yaml, a pre-trained checkpoint in ./logs/fern/ckpts/final_xxxx.pt, and with a new experiment name fern_refine:

python train.py --config ./configs/fern.yaml --expname fern_refine --training:ckpt_file ./logs/fern/ckpts/final_xxxx.pt  --training:ckpt_only_use_keys cam_params

Note:

  • --training:ckpt_only_use_keys cam_params is used to drop all the keys in the pre-trained state_dict except cam_params when loading the checkpoints.
    • Some warnings like Could not find xxx in checkpoint will be printed, which is OK and is exactly the desired behavior.
  • a new expname is specified and hence a new experiment directory would be used, since we do not want to concatenate and mix the new logging stats with the old ones.
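
The effect of keeping only the cam_params keys can be pictured as a simple filter over the checkpoint's state_dict. The sketch below is illustrative only: `filter_state_dict` is a hypothetical helper, the parameter names in `checkpoint` are made up, and the prefix-matching rule is an assumption about how keys are selected.

```python
def filter_state_dict(state_dict, only_use_keys):
    """Keep only entries whose name equals, or is nested under, one of the given keys."""
    kept = {}
    for name, value in state_dict.items():
        if any(name == key or name.startswith(key + ".") for key in only_use_keys):
            kept[name] = value
    return kept

# Hypothetical checkpoint contents; real parameter names will differ.
checkpoint = {
    "cam_params.phi": [0.0, 0.1, 0.0],
    "cam_params.t": [0.0, 0.0, 0.2],
    "model.layers.0.weight": [[0.3, -0.2]],
}

kept = filter_state_dict(checkpoint, ["cam_params"])
missing = sorted(set(checkpoint) - set(kept))
print(sorted(kept))   # ['cam_params.phi', 'cam_params.t']
print(missing)        # ['model.layers.0.weight']
```

The dropped entries are the ones a loader would report as not found in the checkpoint, so the scene model restarts from fresh weights while the camera parameters carry over.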

Testing

Free view port rendering

  • To render with camera2world matrices interpolated from the learned pose
python vis/free_viewport_rendering.py --load_dir /path/to/pretrained/exp_dir --render_type interpolate
  • To render with spiral camera paths as in the original NeRF repo
python vis/free_viewport_rendering.py --load_dir /path/to/pretrained/exp_dir --render_type spiral

Visualize learned camera pose

python vis/plot_camera_pose.py --load_dir /path/to/pretrained/exp_dir

Note that the learned camera phi & t are actually for camera2world matrices, i.e. the inverse of the camera extrinsics.

You will get a matplotlib window like this:

image-20210425181218826
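
Since the learned phi & t describe camera2world transforms, converting them to extrinsics (world2camera) means inverting a rigid transform: for x ↦ R x + t the inverse is x ↦ Rᵀ x − Rᵀ t. The plain-Python sketch below illustrates this; the function name and the example pose are hypothetical.

```python
def invert_rigid(R, t):
    """Invert the rigid transform x -> R x + t, returning (R^T, -R^T t)."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]   # transpose = inverse rotation
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return Rt, t_inv

# Example camera2world pose: 90-degree rotation about z, camera centre at (1, 2, 3).
R_c2w = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t_c2w = [1.0, 2.0, 3.0]

R_w2c, t_w2c = invert_rigid(R_c2w, t_c2w)
print(t_w2c)   # [-2.0, 1.0, -3.0]

# Sanity check: the camera centre in world coordinates maps to the camera origin.
centre_in_cam = [sum(R_w2c[i][k] * t_c2w[k] for k in range(3)) + t_w2c[i] for i in range(3)]
print(centre_in_cam)   # [0.0, 0.0, 0.0]
```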

Road-map & updates

Basic NeRF model

  • 2021-04-17 Basic implementation of the original paper, including training & logging
    • Add quaternion, axis-angle, rotation6D as the rotation representation
    • Add exp, square, ratio for different parameterizations of camera focal_x and focal_y
    • Siren-ized NeRF
    • refinement process as in the original NeRF-- paper
  • Change DataParallel to DistributedDataParallel

Efficiency & training

  • 2021-04-19 Add pre-train for 1000 epochs then refine for 10000 epochs, similar to the official code base.

  • to list: recent work on speeding up NeRF

More experiments

  • vSLAM tasks and datasets
  • traditional SfM datasets
  • 2021-04-15 raw videos handling

Better SfM strategy

  • tolist

More applicable for more scenes

  • NeRF++ & NeRF-- for handling unconstrained scenes

  • NeRF-W

  • some dynamic NeRF framework for dynamic scenes

  • Finish perceptual loss, for fewer shots

Related/used code bases

Citations

  • NeRF--
@article{wang2021nerf,
  title={Ne{RF}$--$: Neural Radiance Fields Without Known Camera Parameters},
  author={Wang, Zirui and Wu, Shangzhe and Xie, Weidi and Chen, Min and Prisacariu, Victor Adrian},
  journal={arXiv preprint arXiv:2102.07064},
  year={2021}
}
  • SIREN
@inproceedings{sitzmann2020siren,
  author={Sitzmann, Vincent and Martel, Julien NP and Bergman, Alexander W and Lindell, David B and Wetzstein, Gordon},
  title={Implicit neural representations with periodic activation functions},
  booktitle={Proc. NeurIPS},
  year={2020}
}
  • Perceptual model / semantic consistency from DietNeRF
@article{jain2021dietnerf,
  title={Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis},
  author={Ajay Jain and Matthew Tancik and Pieter Abbeel},
  journal={arXiv},
  year={2021}
}
Owner
Jianfei Guo