zhengqili / neural-scene-flow-fields Goto Github PK

View Code? Open in Web Editor NEW

702.0 702.0 94.0 92.14 MB

PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes"

License: MIT License

Python 96.42% C++ 0.42% Cuda 3.13% Shell 0.03%

neural-scene-flow-fields's People

Contributors

Stargazers

Watchers

Forkers

liuguoyou peterzs barryzm tamwaiban peterzhousz ashbt jasonlsc edgar-tr longjohncoder ayulockin andreas-ui lutznw rob813 bruinxiong springwald tiamat-tech aquaeatrae sansmoraxz codewhizz k-washi davgit dw5 damywise lulu1315 satoshirobatofujimoto gsygsy96 baldrlector ken2576 dongwoohhh santolina andrewchiyz nirvanalan refarde derrick-xwp peterouzh rohaldb zhanghongyong123456 raghavendra-yanni zhangtianjia ml-and-ai-repo bencoster arthikbhandary sinhasam teehanming hkmtechnology x-kimna passysosysmas frankhome61 gaochen315 liubl1217 chikayan rslnmtvv c00renut qbeer r-levi rogerqi liruilong940607 mavende atlasgooo2 kepengxu whuzhang753 dave-dush oximi123 taoranyi ashishd luciferbobo jackzhousz maivanquanbk jerryzfc cmpt-985-term-project cmpt-983-term-project ayanamisaki phoenixdigitalfx lipet2k yurkar2333 gg-big-org xianzhuoliu imclab qianqian121 manideep1999 rachmadionl martincastellano wez234 buesma 5l1v3r1 rakyat-game heshengyiyue dearborn-open-ai lutyyyy imnotprepared goodsmash louis-leee mickaelbressieux rogakann

neural-scene-flow-fields's Issues

urlopen error [Errno 111]

I get the following error while loading the pre-trained ResNet when I run run_midas.py from a remote server. On my local machine, it however worked (with python 3.9.12).
Here I use python 3.7.4, but I also tried with python 3.8.5 with the same result.

Traceback (most recent call last): File "run_midas.py", line 267, in <module> args.resize_height) File "run_midas.py", line 158, in run model = MidasNet(model_path, non_negative=True) File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_scripts/models/midas_net.py", line 30, in __init__ self.pretrained, self.scratch = _make_encoder(features, use_pretrained) File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_scripts/models/blocks.py", line 6, in _make_encoder pretrained = _make_pretrained_resnext101_wsl(use_pretrained) File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_scripts/models/blocks.py", line 26, in _make_pretrained_resnext101_wsl resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl") File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_venv/lib64/python3.7/site-packages/torch/hub.py", line 403, in load repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose, skip_validation) File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_venv/lib64/python3.7/site-packages/torch/hub.py", line 170, in _get_cache_or_reload repo_owner, repo_name, branch = _parse_repo_info(github) File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_venv/lib64/python3.7/site-packages/torch/hub.py", line 124, in _parse_repo_info with urlopen(f"https://github.com/{repo_owner}/{repo_name}/tree/main/"): File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/request.py", line 222, in urlopen return opener.open(url, data, timeout) File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/request.py", line 525, in open response = self._open(req, data) File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/request.py", line 543, in _open '_open', req) File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/request.py", line 503, in _call_chain result = func(*args) File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/request.py", line 1360, in https_open context=self._context, check_hostname=self._check_hostname) File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/request.py", line 1319, in do_open raise URLError(err) urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

How would you recommend adapting NSFF to non-forward facing scenes?

Hello,

first of all, thank you for releasing the implementation for your amazing project.
The question I wanted to ask is how does one adapt NSFF to support reconstruction in euclidean space, thereby extending it to also work on non-forward facing scenes?

In other words, which parts of the codebase would I need to modify to enable the codebase to run on such scenes? I'm guessing just setting the "no_ndc" flag to "True" inside the config file wouldn't be enough.

Number of training images on Nivida dynamic dataset

HI,

How many training images do you use on each time instance?

MiDaS depth prediction -- inverse depth?

Hi,

I just want to quickly inquire about Midas depth prediction. The original midas approach seems to predict disparity or inverse depth, rather than euclidean depth: isl-org/MiDaS#42.

However, as far as I can tell, the rendered depth map is based on z-values distance, not inverse depth. Thus, is your MiDaS model pretrained on depth? Thanks.

Faster Training.

Hello,
Do you have any suggestions for making the training faster given GPUs with more memory?
I'm working with 2 A6000s and would like to fully leverage the memory capacity.

What's the minimum system requirements for running the inference？

poor results on a new dataset

I trained on a new dataset, (full 500000) but had to change the netwidth to 128 (from 256) to be able to train on 16GB GPU. See attached video, the result is poor, the scene is obviously complex, with original forward motions into moving windmills. Do you have any advice? Is the 128 MLP the cause of the poor results, or the dataset is too complex for the approach?

YYVID_BT.mp4

thanks

Implementation different from paper?

Neural-Scene-Flow-Fields/nsff_exp/render_utils.py

Line 487 in d400175

raw_ref = network_query_fn(pts_ref, viewdirs, network_fn)

Here the network only takes input of points and view directions and predicts the forward and backward flow to neighbor frames. But in paper eq(4), the network additionally needs the time i. Is the i not useful in practice?

If that's the case, I'm wondering how the network can distinguish the different frames, if there are some points and view dirs that happen to be the same in different frames.

RuntimeError: stack expects each tensor to be equal size & AttributeError: 'NoneType' object has no attribute 'shape'

#local run
colmap feature_extractor \
--database_path ./database.db --image_path ./dense/images/

colmap exhaustive_matcher \
--database_path ./database.db

colmap mapper \
--database_path ./database.db \
--image_path ./dense/images \
--output_path ./dense/sparse

colmap image_undistorter \
--image_path ./dense/images \
--input_path ./dense/sparse/0 \
--output_path ./dense \
--output_type COLMAP \
--max_image_size 2000

#colab run

%cd /content/drive/MyDrive/neural-net
!git clone https://github.com/zl548/Neural-Scene-Flow-Fields
%cd Neural-Scene-Flow-Fields
!pip install configargparse
!pip install matplotlib
!pip install opencv
!pip install scikit-image
!pip install scipy
!pip install cupy
!pip install imageio.
!pip install tqdm
!pip install kornia

my Images are 288x512 pixels

%cd /content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nsff_scripts/
    # create camera intrinsics/extrinsic format for NSFF, same as original NeRF where it uses imgs2poses.py script from the LLFF code: https://github.com/Fyusion/LLFF/blob/master/imgs2poses.py
!python save_poses_nerf.py --data_path "/content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nerf_data/bolli/dense"
    # Resize input images and run single view model, 
    # argument resize_height: resized image height for model training, width will be resized based on original aspect ratio
!python run_midas.py --data_path "/content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nerf_data/bolli/dense"  --resize_height 512
!bash ./download_models.sh
    # Run optical flow model
!python run_flows_video.py --model models/raft-things.pth --data_path /content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nerf_data/bolli/dense

Error:

Traceback (most recent call last):
  File "run_flows_video.py", line 448, in <module>
    run_optical_flows(args)
  File "run_flows_video.py", line 350, in run_optical_flows
    images = load_image_list(images)
  File "run_flows_video.py", line 257, in load_image_list
    images = torch.stack(images, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 288] at entry 0 and [3, 512, 287] at entry 31

So input_w = is not consistent, eventhough my images are all dimensions 288x512

Even if I modify the script:

def load_image(imfile):
    long_dim = 512

    img = np.array(Image.open(imfile)).astype(np.uint8)

    # Portrait Orientation
    if img.shape[0] > img.shape[1]:
        input_h = long_dim
        input_w = 288

The dimensions error is gone, but another error:

...
flow input w 288 h 512
0
/usr/local/lib/python3.7/dist-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "run_flows_video.py", line 448, in <module>
    run_optical_flows(args)
  File "run_flows_video.py", line 363, in run_optical_flows
    (img_train.shape[1], img_train.shape[0]), 
AttributeError: 'NoneType' object has no attribute 'shape'

Evaluation metrics

Hi, I am wondering if there is a standard for SSIM and LPIPS.

For SSIM: I see you use the implementation of scikit-image. When I use kornia's implementation with window size = 11 (I don't know what size scikit-image uses if it's not set), it seems to yield different result... Do you have idea what other authors use?

For LPIPS:

Do the other authors also use alexnet?

Neural-Scene-Flow-Fields/nsff_exp/evaluation.py

Line 326 in 62a1770

lpips = model.forward(gt_img_0, rgb_0)

The network expects the rgb to be scaled to [-1, 1]. If it's [0, 1] it seems that you need to pass argument normalize=True

Neural-Scene-Flow-Fields/nsff_exp/models/__init__.py

Lines 26 to 34 in 62a1770

 def forward(self, pred, target, mask=None, normalize=False): 

 """ 

  Pred and target are Variables. 

  If normalize is True, assumes the images are between [0,1] and then scales them between [-1,+1] 

  If normalize is False, assumes the images are already between [-1,+1] 

  Inputs pred and target are Nx3xHxW 

  Output pytorch Variable N long 

  """

So I'm afraid the evaluation you have is not exactly correct...

It makes me think that if there's no common standard for these metrics that might differ from one implementation to the other, or that sometimes the authors make mistake in the evaluation process, then only the PSNR score is credible...

Evaluation Set

Hi,

I found the evaluation.py actually use the training images. Could you please share how to get the exact number in Table 3 of your paper. More specifically, how to know which are

the remaining 11 held-out images per time instance for evaluation

Question about Least Kinetic Motion Prior

Hi, I was wondering why

sf_sm_loss += args.w_sm * compute_sf_lke_loss(ret['raw_pts_ref'], 
                                                    ret['raw_pts_post'], 
                                                    ret['raw_pts_prev'], 
                                                    H, W, focal)

is called twice at the following two lines?:

Neural-Scene-Flow-Fields/nsff_exp/run_nerf.py

Line 589 in d400175

sf_sm_loss += args.w_sm * compute_sf_lke_loss(ret['raw_pts_ref'],

Neural-Scene-Flow-Fields/nsff_exp/run_nerf.py

Line 593 in 86ad6dd

sf_sm_loss += args.w_sm * compute_sf_lke_loss(ret['raw_pts_ref'],

Should compute_sf_lke_loss compute the $L_{temp}$ term?

Thank you!

Instructions of Using COLMAP

Hi, thank you for sharing this amazing work!

I would like to try a personal video using your model and wonder if you can share more detailed instructions on how to use COLMAP to get the training data. For instance, do you use only sparse reconstruction to get the data? If so, what do you set for the data type and shared intrinsics options? Furthermore, how do you deal with merging iterations of sparse reconstruction?

Thanks

Clear usage instruction with one's own data

Hi,
I've been trying to work with your code (very cool project by the way). Is there any plan to make the instructions cristal clear on how to use your code to infer scenes from one's video or image. Is there anyway you could for example create a google colab out of it? I've given a shot at it and couldn't complete colmap's installation.
I'm really looking forward to being able to experiment with your project!
Thanks in advance

Does default config match the implementation detail of the paper

Hi i want to reproduce the original quality
Is running kids config in cofigs folder the same as the original?
How many iters should i train
2000000 or 360000?

Coordinate System Operations

Hi team, amazing paper!

I am trying to adapt your model to regularise the volume-rendered scene flow against a monocular scene flow estimator from a different paper. The third-party scene flow estimator produces results in world coordinates (not normalised) so for me to compare against this model's scene flow, which is in NDC space, I need to transform one coordinate system to the other. This has me confused by some of the code you use to do coordinate system conversions.

Firstly, in the supplementary document, and in the code, you reference "euclidean space". I couldn't find anything online about whether this is world space or camera space. Could you please clarify?

The supplementary document references the NDC ray space derivation from the NeRF paper. That derivation outlines how to convert points from camera space (o) to NDC space (o'):

Following this, I found this function, which appears to do the inverse operation of eq 25 above.

Neural-Scene-Flow-Fields/nsff_exp/run_nerf_helpers.py

Lines 534 to 541 in 5620494

 def NDC2Euclidean(xyz_ndc, H, W, f): 

 z_e = 2./ (xyz_ndc[..., 2:3] - 1. + 1e-6) 

 x_e = - xyz_ndc[..., 0:1] * z_e * W/ (2. * f) 

 y_e = - xyz_ndc[..., 1:2] * z_e * H/ (2. * f) 

 xyz_e = torch.cat([x_e, y_e, z_e], -1) 

 return xyz_e

That is, it converts from NDC to what I can conclude must be camera space. However, when I look at its invocation, the variable name suggests that this function converts from NDC to world coordinates

Neural-Scene-Flow-Fields/nsff_exp/run_nerf_helpers.py

Line 552 in 5620494

pts_3d_e_world = NDC2Euclidean(pts_3d, H, W, f)

Following this, the pipeline to project from 3d NDC to a 2d image has me quite confused:

Neural-Scene-Flow-Fields/nsff_exp/run_nerf_helpers.py

Lines 546 to 563 in 5620494

 def projection_from_ndc(c2w, H, W, f, weights_ref, raw_pts, n_dim=1): 

 R_w2c = c2w[:3, :3].transpose(0, 1) 

 t_w2c = -torch.matmul(R_w2c, c2w[:3, 3:]) 

 pts_3d = torch.sum(weights_ref[...,None] * raw_pts, -2) # [N_rays, 3] 

 pts_3d_e_world = NDC2Euclidean(pts_3d, H, W, f) 

 if n_dim == 1: 

 pts_3d_e_local = se3_transform_points(pts_3d_e_world, 

 R_w2c.unsqueeze(0), 

 t_w2c.unsqueeze(0)) 

 else: 

 pts_3d_e_local = se3_transform_points(pts_3d_e_world, 

 R_w2c.unsqueeze(0).unsqueeze(0), 

 t_w2c.unsqueeze(0).unsqueeze(0)) 

 pts_2d = perspective_projection(pts_3d_e_local, H, W, f)

a) I assume se3_transform_points converts from world space to camera space, is that correct?
b) Why do you perform the perspective projection from camera space? Everything I have read online seems to perform perspective projection from either world coordinates / ndc.

Generally, it would be very helpful to me if you could point me to where you obtained the operations for perspective_projection, se3_transform_points and NDC2Euclidean.

My graphics knowledge is limited so apologies if these questions are trivial. Your help is greatly appreciated :)

NSFF Quality on Custom Dataset

Hi all. I'm trying to run this method on custom data, with mixed success so far. I was wondering if you had any insight about what might be happening. I'm attaching some videos and images to facilitate discussion.

First of all, we're able to run static NeRF using poses from colmap, and it seems to do fine

01_nerf_result.mp4

Likewise, setting the dynamic blending weight to 0 in your model, and using only the color reconstruction loss produces plausible results (novel view synthesis result below, for fixed time)

Using the dynamic model, while setting all the frame indices to 0 should also emulate a static NeRF. It does alright, but includes some strange haze

Finally, running NSFF on our full video sequence with all losses for 130k iterations produces a lot of ghosting (04_nsff_result.mp4).

04_nsff_result.mp4

Even though the data driven monocular depth / flow losses are phased out during training, I wonder if monocular depth is perhaps causing these issues? Although again the monocular depth and flow both look reasonable.

Let me know if you have any insights about what might be going on, and how we can improve quality here -- I'm a bit stumped at the moment. I'm also happy to send you the short dataset / video sequence that we're using if you'd like to take a look.

All the best,
~Ben

Question about softsplat

Hi! I was wondering more about the method solving novel time synthesis(average splatting). Is this a mathematical transformation or a method using pretrain model?

Singularly in NDC2Euclidean

NDC2Euclidean appears to be attempting to prevent a divide-by-zero error by the addition of an epsilon value:

def NDC2Euclidean(xyz_ndc, H, W, f):
    z_e = 2./ (xyz_ndc[..., 2:3] - 1. + 1e-6)
    x_e = - xyz_ndc[..., 0:1] * z_e * W/ (2. * f)
    y_e = - xyz_ndc[..., 1:2] * z_e * H/ (2. * f)

    xyz_e = torch.cat([x_e, y_e, z_e], -1)
 
    return xyz_e

However, since the coordinates have scene flow field vectors added to them, and the scene flow field output ranges (-1.0,1.0), it is possible to have xyz_ndc significantly outside of the normal range. This means that a divide-by-zero can still happen in the above code if the z value hits (1.0+1e-6), which it does in our training.

We suggest clamping to valid NDC values to the range (-1.0, 0.99), with 0.99 chosen to prevent the Euclidean far plane from getting too large. This choice of clamping has significantly stabilized our training in early iterations:

z_e = 2./ (torch.clamp(xyz_ndc[..., 2:3], -1.0, 0.99) - 1.0)

https://github.com/zhengqili/Neural-Scene-Flow-Fields/blob/main/nsff_exp/run_nerf_helpers.py#L535

Correct approach to separate static and dynamic regions

Hi, I have originally mentioned this issue in #1, but it seems to deviate from the original question, so I decided to open a new issue.

As discussed in #1, I tried setting the raw_blend_w to either 0 or 1 to create "static only" and "dynamic only" images that theoretically would look like the Fig. 5 in the paper and in the video. However, this approach seems to be wrong because from the result, the static part looks ok-ish but the dynamic part is almost everything, which is not good at all (we want only the moving part, e.g. only the running kid).

It's been a week that I have been testing this while waiting for some response, but still to no avail. @zhengqili @sniklaus @snavely @owang Sorry for bothering, but could any of the authors kindly clarify what's wrong with my approach to separate static/dynamic by setting the blending weight to either 0 or 1? I also tried blending the sigmas (opacity in the code) instead of alphas as in the paper, or directly use the rgb_map_ref_dy as output image, but neither helped.

Neural-Scene-Flow-Fields/nsff_exp/render_utils.py

Lines 804 to 809 in 7d8a336

 opacity_dy = act_fn(raw_dy[..., 3] + noise)#.detach() #* raw_blend_w 

 opacity_rigid = act_fn(raw_rigid[..., 3] + noise)#.detach() #* (1. - raw_blend_w)  

 # alpha with blending weights 

 alpha_dy = (1. - torch.exp(-opacity_dy * dists) ) * raw_blend_w 

 alpha_rig = (1. - torch.exp(-opacity_rigid * dists)) * (1. - raw_blend_w)

I have applied the above approach to other pretrained scenes, but none of them produces good results.

^{Left: static (raw_blend_w=0). Right: dynamic (raw_blend_w=1).}

I believe there's something wrong with my approach, but I cannot figure out. I really appreciate if the authors could kindly point out what's the correct approach. Thank you very much.

Evaluation on broom and curl dataset

I really appreciate to your great work!!

And I have a question though.

I think there are no multi-view image for evaluation on broom and curl scene dataset.

Could you let me know how to evaluate it?

I guess that the only way to evaluate it is creating validation dataset from original nerfies dataset or loading the original one using their data format.

Am I right or do we have some other ways?

If you give me the answer, it would be really helpful!! :)

thank you!

replicating the CVD comaprison

I would like to replicate the CVD experiment you have on the project page.

Since CVD estimates depth, I wonder how you did the texturing
Is it possible to plug different depth maps into your software? For example, I have a depth map for each image (like RGBD)?
thanks

Is there any memory estimations or tips for higher resolution models?

I ran the example training on an RTX 3090 and it took 41 hours to complete. Is there any recommended settings for maximum results for using a 24 GB card other than trial and error if training time isn't a concern? I noticed when I run the example the quality is much lower than your gifs on your project page? Are you using more samples in your configuration?

Also can this generate depth map output or am I misunderstanding what the video was describing? It looked like you had a visualization for depth per frame. Is extracting that information supported in the released implementation easily?

poses_bounds.npy only contains poses from one camera

I wanted to use the data provided in nvidia_data_full.zip for evaluating my own methods. During this I recognised that when I directly load the data with np.load the dimension of the loaded array is (24, 17). This leads me to believe that these poses (probably) only correspond to the poses in the images directory, which would mean that no poses are available for the images in the mv_images directory, since 24 corresponds to the number of time steps, but not to the number of cameras, which would be 12.

I am not really sure what this implies for the provided evaluation script, which loads the images from the mv_images directory but relies on the output poses from the load_nvidia_data function.
In the evaluation script the ids of the cameras are used to access the first dimension of the poses array.

Neural-Scene-Flow-Fields/nsff_exp/evaluation.py

Line 298 in d400175

c2w = poses[camera_i]

In my understanding this dimensions corresponds not to the cameras but to the time steps since it is the only dimension influenced by modifications of the start_frame and end_frame parameters. Since the number of time steps (24) is higher than the number of cameras (12) this does not throw an error, as long as the start_frame and end_frame parameters are selected accordingly.

I would really appreciate clarification on this, even if the confusion is just due to my misunderstanding of the code.
Also please let me know if there is another way to obtain the poses for the images in the the mv_images directory.

OS-specific code in run_midas.py and run_flows_video.py

I'm running your program from Windows and I noticed a few things.

Very minor, but I needed to install tensorboard as a dependency that isn't listed in the README in order to train.

On lines https://github.com/zhengqili/Neural-Scene-Flow-Fields/blob/main/nsff_scripts/run_midas.py#L84 and 104 you have linux commands for copying and removing files.

I don't know python, but import shutil and this sounded right:

        src_files = os.listdir(imgdir_orig)
        for file_name in src_files:
            full_file_name = os.path.join(imgdir_orig, file_name)
            if os.path.isfile(full_file_name):
                shutil.copy(full_file_name, imgdir)

Also https://github.com/zhengqili/Neural-Scene-Flow-Fields/blob/main/nsff_scripts/run_flows_video.py#L218 and 219 use rm.

shutil.rmtree(mask_dir)
shutil.rmtree(semantic_dir)

When run_flows_video.py runs it puts the motion masks into the image directory. If I copy them to the motion_masks folder then copy the images_512x288 images to the images directory I can begin training. (Assuming that's where they're supposed to be). I'm training on a single 3090, so this might take a while. (I turned off the NaN detection in the recent pull request which seems to have sped things up quite a bit).

Multi-gpu training

In readme you say it takes 2 days on 2 V100 gpus, but I don't see any option setting the number of gpus to use in run_nerf.py. Does it mean this code only supports single gpu training?

Error in training

I am trying to train on my dataset, a small one of 12 images (computed colmap and llff and all seems fine up to the training), and I encounter this error, which seems to come from freeing memory or something similar. Any advice? I am happy to share my image data. Using linux, (I tried gcc 6.3, 7.5 and 9.1, same), with cuda 11.0 and pytorch 1.8.0. see error file

error.txt

sceneflow visualization

Thank you for your great work!!
How did you visualize 3D sceneflow as in demo video?
Could you share a code or resource?

was nsff tested with 360 captured scene?

Were any experiments conducted with 360 captured scene? Curious to know if there is any difference between training forward facing vs 360 scene.

error when run the evaluation.py with trained model

Hello, I really appreciate to your awesome work!!

However, I have an error when I try to run the evaluation.py file with our trained model.

I think there are lack of declaration on --chain_sf in the file.

I used the given configuration file of kid-running scene.

Here are script that I run

python evaluation.py --datadir /data1/dogyoon/neural_sceneflow_data/nerf_data/kid-running/dense/ --expname Default_Test --config configs/config_kid-running.txt

Do I should add the argument in evaluation.py file or there are any way to solve this problem?

Thanks!!

README Demo Not Working

Hi,

I believe there is an issue with the demo as outlined in the readme. Specifically, when trying to use the pre-trained model to render any of the 3 interpolation methods - time, viewpoint, or both - the resulting images (as found in nsff_exp/logs/kid-running_ndc_5f_sv_of_sm_unify3_testing_F00-30/<interpolation_dependent_name>/images) come out looking like this:

or

depending on which form of interpolation I run.

I have a colab file set up that obtains the above result. It pulls from a forked repository which has only two changes: I add a requirements.txt file and update the data_dir in nsff_exp/configs/config_kid-running.txt.

Have I missed something or am I correct in saying the demo broken? Thanks!

How to get the Motion mask?

I download the kidrun data but when I run the run_nerf.py it causes the FileNotFoundError. Because I don't have the motion mask data. How can I get the mask data?It was created by this model or use other mehtod to get the motion mask?

Question about image warping and blending weight

Hi, the idea of warping and having a blending weight predicting how to merge static and dynamic results is a good idea, however I have been confused by some operations in the implementation. Initially I was not sure if these confusing parts lead to bad results, until now I see several issues concerning the performance on own datasets, so I decided to post this issue, hoping to provide some insights that could potentially solve some of the issues.

Predicting the blending weight by the static network seems really counterintuitive since dynamic objects move inside the scene independently of the static network. A more reasonable way is to predict this weight by the dynamic network as suggested by this paper. Or a even simpler way is to totally remove this blending weight and use the addition strategy as in NeRF-W. I did a short experiment on the difference between these two blending strategies (blending weight vs addition), and find that addition produces better reconstruction and novel view results. Although this finding might not be always correct for different data, at least I think predicting the blending weight by the static network is not the ideal way to go.
For image warping, you do blending in the current timestamp

Neural-Scene-Flow-Fields/nsff_exp/render_utils.py

Lines 1019 to 1022 in 4cb2ef4

rgb_map_ref_dy, depth_map_ref_dy, weights_ref_dy = raw2outputs_blending(raw_rgba_ref, raw_rgba_rigid,

raw_blend_w,

z_vals, rays_d,

raw_noise_std)

but only render the dynamic parts of t-1 and t+1

Neural-Scene-Flow-Fields/nsff_exp/render_utils.py

Lines 1055 to 1057 in 4cb2ef4

rgb_map_prev_dy, _, weights_prev_dy = raw2outputs_warp(raw_rgba_prev,

z_vals, rays_d,

raw_noise_std)

this is really confusing, basically it means there are two rendering pipelines, and the network should learn to maximize the performance of both pipelines. It is hard to tell how this causes problem exactly, but I doubt that this reduces the final performance that uses blending. In my opinion, suppose that having the static network gives better result, you should use blending in current time rendering and in image warping too. Again, I don't know whether changing this yields better result, but this part is confusing.

I would like to know @zhengqili 's opinion on these points, and maybe suggest the users to try these modifications to see if it solves some problem.

Greyed images in dpretrained rendering

I followed the instruction on the kid's data and I am getting this kind of image

The prep commands all worked fine, and the motion mask looks alright. I redid the whole process twice but no change.
It is on linux/cuda11,pytorch 1.8

Any insight as to why?

Running on my dataset

Hi, I want to run nsff on my dataset. When I run the preprocessing section on my dataset, motion_masks are all white. Does that mean there are some issues with my dataset or maybe I cannot run it with my dataset? How can I solve it? Thanks!

Other data and the motion mask accuracy

Hi, thanks for the code! Do you plan to publish the full data (running kid, and other data you used in the paper other than the NVIDIA ones) as well?

In fact, the thing I'd like to check the most is your motion masks' accuracy. I'd like know if it's really possible to let the network learn to separate the background and the foreground by only providing the "coarse mask" that you mentioned in the supplementary.

For example for the bubble scene on the project page, how accurate should the mask be to clearly separate the bubbles from the background like you showed? Have you also experimented on the influence of the mask quality, i.e. if masks are more coarse (larger), then how well can the model separate bg/fg?

	def forward(self, pred, target, mask=None, normalize=False):
	"""
	Pred and target are Variables.
	If normalize is True, assumes the images are between [0,1] and then scales them between [-1,+1]
	If normalize is False, assumes the images are already between [-1,+1]

	Inputs pred and target are Nx3xHxW
	Output pytorch Variable N long
	"""

	def NDC2Euclidean(xyz_ndc, H, W, f):
	z_e = 2./ (xyz_ndc[..., 2:3] - 1. + 1e-6)
	x_e = - xyz_ndc[..., 0:1] * z_e * W/ (2. * f)
	y_e = - xyz_ndc[..., 1:2] * z_e * H/ (2. * f)

	xyz_e = torch.cat([x_e, y_e, z_e], -1)

	return xyz_e

	def projection_from_ndc(c2w, H, W, f, weights_ref, raw_pts, n_dim=1):
	R_w2c = c2w[:3, :3].transpose(0, 1)
	t_w2c = -torch.matmul(R_w2c, c2w[:3, 3:])

	pts_3d = torch.sum(weights_ref[...,None] * raw_pts, -2) # [N_rays, 3]

	pts_3d_e_world = NDC2Euclidean(pts_3d, H, W, f)

	if n_dim == 1:
	pts_3d_e_local = se3_transform_points(pts_3d_e_world,
	R_w2c.unsqueeze(0),
	t_w2c.unsqueeze(0))
	else:
	pts_3d_e_local = se3_transform_points(pts_3d_e_world,
	R_w2c.unsqueeze(0).unsqueeze(0),
	t_w2c.unsqueeze(0).unsqueeze(0))

	pts_2d = perspective_projection(pts_3d_e_local, H, W, f)

	opacity_dy = act_fn(raw_dy[..., 3] + noise)#.detach() #* raw_blend_w
	opacity_rigid = act_fn(raw_rigid[..., 3] + noise)#.detach() #* (1. - raw_blend_w)

	# alpha with blending weights
	alpha_dy = (1. - torch.exp(-opacity_dy * dists) ) * raw_blend_w
	alpha_rig = (1. - torch.exp(-opacity_rigid * dists)) * (1. - raw_blend_w)

	rgb_map_ref_dy, depth_map_ref_dy, weights_ref_dy = raw2outputs_blending(raw_rgba_ref, raw_rgba_rigid,
	raw_blend_w,
	z_vals, rays_d,
	raw_noise_std)

	rgb_map_prev_dy, _, weights_prev_dy = raw2outputs_warp(raw_rgba_prev,
	z_vals, rays_d,
	raw_noise_std)

zhengqili / neural-scene-flow-fields Goto Github PK

neural-scene-flow-fields's People

Contributors

Stargazers

Watchers

Forkers

neural-scene-flow-fields's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs