autonomousvision / giraffe

This repository contains the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

Home Page: https://m-niemeyer.github.io/project-pages/giraffe/index.html

License: MIT License

Languages: Python 97.91%, Shell 2.09%
Topics: cvpr2021, generative-model, generative-modelling, generative-adversarial-network, nerf, implicit-surfaces, neural-scene-representations

giraffe's Introduction

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

[Animations: Add Clevr, Translation Horizontal Cars, Interpolate Shape Faces]

If you find our code or paper useful, please cite as

@inproceedings{GIRAFFE,
    title = {GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields},
    author = {Niemeyer, Michael and Geiger, Andreas},
    booktitle = {Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)},
    year = {2021}
}

TL; DR - Quick Start

[Animations: Rotating Cars, Translation Horizontal Cars]

First, make sure that you have all dependencies in place. The simplest way to do so is to use Anaconda.

You can create an Anaconda environment called giraffe using

conda env create -f environment.yml
conda activate giraffe

You can now test our code on the provided pre-trained models. For example, simply run

python render.py configs/256res/cars_256_pretrained.yaml

This script should create a model output folder out/cars256_pretrained. The animations are then saved to the respective subfolders in out/cars256_pretrained/rendering.

Usage

Datasets

To train a model from scratch or to use our ground truth activations for evaluation, you have to download the respective dataset.

For this, please run

bash scripts/download_dataset.sh

and follow the instructions. This script should download and unpack the data automatically into the data/ folder.

Controllable Image Synthesis

To render images of a trained model, run

python render.py CONFIG.yaml

where you replace CONFIG.yaml with the correct config file. The easiest way is to use a pre-trained model. You can do this by using one of the config files which are indicated with *_pretrained.yaml.

For example, for our model trained on Cars at 256x256 pixels, run

python render.py configs/256res/cars_256_pretrained.yaml

or for celebA-HQ at 256x256 pixels, run

python render.py configs/256res/celebahq_256_pretrained.yaml

Our script will automatically download the model checkpoints and render images. You can find the outputs in the out/*_pretrained folders.

Please note that the config files *_pretrained.yaml are only for evaluation or rendering, not for training new models: when these configs are used for training, the model will be trained from scratch, but during inference our code will still use the pre-trained model.

FID Evaluation

For evaluation of the models, we provide the script eval.py. You can run it using

python eval.py CONFIG.yaml

The script generates 20000 images and calculates the FID score.

Note: For some experiments, the numbers in the paper might slightly differ because we used the evaluation protocol from GRAF to fairly compare against the methods reported in GRAF.
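
For reference, the FID compares the statistics of Inception activations of real and generated images. The sketch below shows only the final distance computation and is not the repository's eval.py (which also handles image generation and activation extraction); it assumes the activations are already available as NumPy arrays.

import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians fitted to activation statistics."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(cov1.dot(cov2), disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    return diff.dot(diff) + np.trace(cov1 + cov2 - 2.0 * covmean)

# Hypothetical activations; the standard FID uses 2048-d Inception pool3 features.
act_real = np.random.randn(1000, 64)
act_fake = np.random.randn(1000, 64)
fid = frechet_distance(act_real.mean(axis=0), np.cov(act_real, rowvar=False),
                       act_fake.mean(axis=0), np.cov(act_fake, rowvar=False))
print(f"FID: {fid:.2f}")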

Training

Finally, to train a new network from scratch, run

python train.py CONFIG.yaml

where you replace CONFIG.yaml with the name of the configuration file you want to use.

You can monitor the training process at http://localhost:6006 using TensorBoard:

cd OUTPUT_DIR
tensorboard --logdir ./logs

where you replace OUTPUT_DIR with the respective output directory. For available training options, please take a look at configs/default.yaml.
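
As a rough illustration of how a config file is typically combined with configs/default.yaml (the loader below is a simplified sketch, not the im2scene implementation; chairs_64.yaml is just one of the configs mentioned in this README):

import yaml

def update_recursive(base, override):
    """Merge the override dict into the base dict, recursing into nested dicts."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            update_recursive(base[key], value)
        else:
            base[key] = value
    return base

# Layer an experiment config over the defaults.
with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)
with open("configs/64res/chairs_64.yaml") as f:
    cfg = update_recursive(cfg, yaml.safe_load(f))
print(sorted(cfg.keys()))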

2D-GAN Baseline

For convenience, we have implemented a 2D-GAN baseline which closely follows the GAN_stability repo. For example, you can train a 2D-GAN on CompCars at 64x64 pixels, similar to our GIRAFFE method, by running

python train.py configs/64res/cars_64_2dgan.yaml

Using Your Own Dataset

If you want to train a model on a new dataset, you first need to generate ground truth activations for the intermediate or final FID calculations. For this, you can use the script in scripts/calc_fid/precalc_fid.py. For example, if you want to generate an FID file for the comprehensive cars dataset at 64x64 pixels, you need to run

python scripts/precalc_fid.py  "data/comprehensive_cars/images/*.jpg" --regex True --gpu 0 --out-file "data/comprehensive_cars/fid_files/comprehensiveCars_64.npz" --img-size 64

or for LSUN churches, you need to run

python scripts/precalc_fid.py path/to/LSUN --class-name scene_categories/church_outdoor_train_lmdb --lsun True --gpu 0 --out-file data/church/fid_files/church_64.npz --img-size 64

Note: We apply the same transformations to the ground truth images for this FID calculation as we do during training. If you want to use your own dataset, you need to adjust the image transformations in the script accordingly. Further, you might need to adjust the object-level and camera transformations to your dataset.
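
As an illustration of the kind of preprocessing to keep consistent (the actual transformations live in the FID script and the training data loader; the pipeline below is only a hedged sketch with torchvision, and the 64-pixel target mirrors the --img-size 64 examples above):

from PIL import Image
from torchvision import transforms

img_size = 64
preprocess = transforms.Compose([
    transforms.Resize(img_size),      # resize the shorter edge to img_size
    transforms.CenterCrop(img_size),  # then crop to a square
    transforms.ToTensor(),            # scale pixel values to [0, 1]
])

img = Image.new("RGB", (300, 200))    # stand-in for a dataset image
print(preprocess(img).shape)          # torch.Size([3, 64, 64])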

Evaluating Generated Images

We provide the script eval_files.py for evaluating the FID score of your own generated images. For example, if you would like to evaluate your images on CompCars at 64x64 pixels, save them to an npy file and run

python eval_files.py --input-file "path/to/your/images.npy" --gt-file "data/comprehensive_cars/fid_files/comprehensiveCars_64.npz"
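
A minimal sketch of producing such an .npy file is shown below; the array layout and dtype are assumptions, so check eval_files.py for the exact format it expects.

import numpy as np

# Random data stands in for real generator output here.
images = np.random.rand(100, 64, 64, 3).astype(np.float32)  # (N, H, W, C) in [0, 1]
np.save("path/to/your/images.npy", images)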

Further Information

More Work on Implicit Representations

If you like the GIRAFFE project, please check out related work on neural representations from our group.

giraffe's People

Contributors

m-niemeyer, seriousran, xh-liu-tech


giraffe's Issues

How to obtain the 3D bounding box of an object from its {s,t,R}

Thanks for your excellent work! I am really impressed by the controllable object generation!

I wonder if there is a way to extract the 3D bounding box of an object from its affine transformation {s,t,R}.

I tried to establish the relationship between the 3D bounding box and {s,t,R}, but failed. I transformed a cube using equation (6) in the paper, but their relationship does not seem to follow equation (6). Do you have any suggestions?
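
For context, the object-to-scene mapping in the paper has the form k(x) = R · diag(s) · x + t; a hedged sketch of pushing the corners of a canonical cube through such a transform is given below. The canonical cube extent and the example values are assumptions for illustration only.

import numpy as np

def transform_bbox_corners(s, t, R, half_extent=0.5):
    """Map the corners of an axis-aligned canonical cube through R @ diag(s) @ x + t."""
    corners = np.array([[x, y, z]
                        for x in (-half_extent, half_extent)
                        for y in (-half_extent, half_extent)
                        for z in (-half_extent, half_extent)])
    return (R @ np.diag(s) @ corners.T).T + t

# Hypothetical example values: uniform scale, no translation, rotation about z.
s = np.array([0.21, 0.21, 0.21])
t = np.array([0.0, 0.0, 0.0])
angle = 0.5 * 2 * np.pi
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])
print(transform_bbox_corners(s, t, R))  # 8 x 3 scene-space bounding-box corners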

Fitting large number of objects in memory

Thanks for sharing this awesome work, and congrats on winning best paper!

I'd like to train GIRAFFE on a custom dataset with up to 20+ objects per image, but I'm finding that a batch of 32 images won't fit into 11GB of GPU memory. At 64x64 resolution I can render at most a batch of 18 images, and at 256x256 resolution at most a batch of 9 images. I haven't tried training yet, but I would expect it to take at least as much memory as inference.

Do you think it would be safe to reduce the training batch size, or would that make the GAN training unstable at some point? Thanks.

Why do the input images require grad?

Hi, I have a question about im2scene/giraffe/training.py, in the function train_step_discriminator(): line 152 (x_real.requires_grad_()) and line 168 (x_fake.requires_grad_()). Does this mean that x_real (from the dataset) and x_fake (from the generator) are optimized over and over again during training? Why do these two variables need gradients, and what is the idea behind this?
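
For context only, and not as a statement about this repository's exact training objective: one common reason to call requires_grad_() on discriminator inputs is to enable a gradient penalty on those inputs, such as the R1 regularizer used in the GAN_stability setup that the README's 2D-GAN baseline follows. The images themselves are not optimized; their gradients only feed the regularizer. A minimal sketch:

import torch

def r1_penalty(d_out, x_real):
    """Penalize the gradient of the discriminator output w.r.t. the real images."""
    grad, = torch.autograd.grad(outputs=d_out.sum(), inputs=x_real,
                                create_graph=True)
    return grad.pow(2).reshape(grad.size(0), -1).sum(1).mean()

# Hypothetical discriminator and batch, just to show why requires_grad_ is needed.
discriminator = torch.nn.Sequential(torch.nn.Flatten(),
                                    torch.nn.Linear(3 * 64 * 64, 1))
x_real = torch.rand(4, 3, 64, 64)
x_real.requires_grad_()  # without this, no gradients w.r.t. the images exist
reg = r1_penalty(discriminator(x_real), x_real)
print(reg.item())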

Question about dataset

Thanks for your reply. Regarding the second question in #20: I mean, how do you obtain the config parameters for a different dataset?
Specifically the parameters bounding_box_generator_kwargs and generator_kwargs.

About how to get the shape and appearance codes

Hi, thanks for your great work. I would like to know how the shape and appearance codes are obtained: are they encoded from an image during training and sampled randomly at test time?

Issues training with other datasets

I'm having an issue where the model output fades to black during training when using a subset of my dataset. This has happened with a wide array of config setups. I've included two screen captures of the TensorBoard images tab to help illustrate the issue. Please let me know if there is something I am missing or if this issue has come up before.

About multi-object disentanglement and how to set the list probs

Hi, I'm very interested in your extraordinary work! I'm trying to verify how the model handles multi-object cases. I'm training from scratch on the Clevr2345 dataset, but I set n_boxes to 4 in the config file to see what the network learns when the number of object branches (n_boxes) in the model is smaller than the number of objects in the dataset images. However, I ran into a problem in im2scene/giraffe/models/generator.py: in the function get_object_existance, the list probs is hard-coded to handle only the n_boxes=5 case. How should I change the list probs for my purpose, and what is the mathematical meaning of this list?
Looking forward to your reply, thanks.

The number of images in the CelebA-HQ dataset: 30k or 200k?

Hi, sorry to bother you; I have a question about which folder I should choose.

I'd like to use images at 128x128 resolution. I notice that the celeba-128 folder contains 30k images, but the img_celeba folder has 200k images, so I am a little confused. Could you tell me which folder I should choose? Thanks! :)


FID score results on the Chairs data

Hi, I got an FID score of 28.01, which is larger than the paper's result of 20. I trained the model for enough epochs.

I used your configs/64res/chairs_64.yaml without any changes. Is there a problem?

I also trained the model on the Cats data and got an FID score of 10.24; your result is 8.


Some questions on model training

Hey, awesome work!

I had a few questions regarding training:

  1. What was the hardware used for training the models in the paper? And, given that hardware, what was the total size of the model during training?

  2. Currently, Giraffe operates at a maximum resolution of 256x256. What would you say are the main bottlenecks that make training at higher resolutions more challenging?

About the scale, translation, and rotation

Congrats on the best paper award.

I would like to know how to obtain these parameters for a new dataset.

for example:

scale_range_min: [0.21, 0.21, 0.21]
scale_range_max: [0.21, 0.21, 0.21]
translation_range_min: [0., 0., 0.]
translation_range_max: [0., 0., 0.]
rotation_range: [0.375, 0.625]

Thanks!
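
Purely as an illustration of how such ranges could be consumed (the uniform sampling and the interpretation of rotation_range as a fraction of a full turn are assumptions, not a description of the repository's bounding box generator):

import numpy as np

rng = np.random.default_rng(0)

def sample_in_range(lo, hi):
    """Uniformly interpolate each component between its min and max value."""
    lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
    return lo + rng.random(lo.shape) * (hi - lo)

scale = sample_in_range([0.21, 0.21, 0.21], [0.21, 0.21, 0.21])  # fixed scale
translation = sample_in_range([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])  # fixed translation
rotation_frac = sample_in_range([0.375], [0.625])[0]
rotation_deg = rotation_frac * 360.0  # assumption: range is a fraction of a full turn
print(scale, translation, rotation_deg)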

Some error during the training process

Hi, thanks for your great work. I ran train.py, and after a few iterations the program broke down:
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

Future works

Are you solving the following problem? If not, we can discuss it.

Disentanglement Failures. For Churches, the background sometimes contains a church, and for CompCars, the object sometimes contains background parts or vice versa. We attribute these to mismatches between the assumed uniform distributions over object and camera poses and their real distributions, and identify learning them instead as interesting future work.

Training Time

Thank you for your interesting work.
I'm not familiar with this field (neural rendering), so I have no idea about the training time.
Could you let me know the approximate training time on a 1080 Ti (64x64) and a V100 (256x256), respectively?

How the model achieves the ability to condition on controllable parameters (shape, appearance, etc.)

Hi,

Thanks for the awesome work! I'm curious how the trained model gains the ability to be conditioned on controllable parameters; my questions break down as follows.

  1. Shape and appearance latent codes: is the discriminator also conditioned on the shape and appearance codes? I cannot find this in the code. If it is indeed not conditioned, then how does the model ultimately learn that the shape latent code is the variable that controls shape in the generated data? Likewise for the appearance latent code.
  2. When sampling the transformation (s, R, T) and the camera pose for each batch, does the corresponding real data have similar properties (s, R, T and camera pose)? If not, again, how can the model associate these controllable variables correctly? For example, an "unwanted case" would be when we sample T so that the generated object is on the left, but the corresponding real data used to train the discriminator has the object on the right.

Many thanks.

Question about the coordinate system transformation

Thanks for the great work! However, I still have a question about the coordinate system transformation.

# Arange Pixels
pixels = arange_pixels((res, res), batch_size,
                       invert_y_axis=False)[1].to(device)
pixels[..., -1] *= -1.

In the code above, you first generate the coordinates in the image coordinate system and then invert the y-axis. What is the purpose of this operation?

Looking forward to your reply.

Rendering error with trained model

Hi,

I trained a model locally using the comprehensive car dataset (64x64). Then using this model I tried to render images to see if I can reproduce the results of the original pretrained model. To do this, I copied the config file cars_64_pretrained.yaml and changed the test model_file location to my trained model instead of the online pretrained model location. Then I ran the following command:

python render.py "my config file"

I am getting the following error.

      File "im2scene/giraffe/config.py", line 62, in get_model
          if cfg['test']['take_generator_average']:
      TypeError: string indices must be integers

"take_generator_average" is set to True in default.yaml. As per the render.py code, default.yaml is being loaded in addition to my config file. Other than the model file location under test, should I be making any other change in my config file in order to use my model instead of the pretrained model?

Thanks for your help.

3D mesh extraction and controlled shape generation

Hi,

  1. Using a trained NeRF model, we can extract the 3D mesh of an image using marching cubes (https://github.com/bmild/nerf/blob/master/extract_mesh.ipynb).
    Can we do the same using the trained Giraffe model? If so, could you please provide some guidance on how the 3D mesh can be extracted for a given image after the model is trained?

  2. Also, your paper indicates that we can control the shape and appearance in the latent space without supervision because the feature space is disentangled. Does the released code support this? I have used the StyleGAN2 projector for controlled image generation, but I am very interested in knowing whether GIRAFFE can be used for editing the 3D shape. For example, after training a GIRAFFE model on chairs, can we input an image of a chair with arms and transform the latent space to reconstruct the 3D mesh/image of the same chair without arms?

Thank you very much.
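
Not specific to GIRAFFE, but for reference, the usual NeRF-style recipe for the first question is to evaluate the volume density on a 3D grid and run marching cubes on it. Below is a hedged sketch using scikit-image and trimesh: density_fn is a hypothetical wrapper around a trained model's density output, and the grid bounds and threshold are assumptions.

import numpy as np
from skimage import measure
import trimesh

def extract_mesh(density_fn, resolution=128, bound=0.5, threshold=10.0):
    """Evaluate a density function on a grid and extract an isosurface mesh."""
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (R, R, R, 3)
    sigma = density_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(sigma, level=threshold)
    # Rescale vertices from voxel indices back to scene coordinates.
    verts = verts / (resolution - 1) * (2 * bound) - bound
    return trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)

# Toy density (a soft sphere) standing in for a trained model's sigma output.
mesh = extract_mesh(lambda p: 50.0 * np.exp(-8.0 * np.linalg.norm(p, axis=-1)))
mesh.export("object.ply")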

About the number of objects in the scene

Hi, thanks for your great work. If N is larger than the actual number of objects in the scene, what will be affected? Just like the second situation described in the paper.

Some doubts regarding the disentanglement of objects w.r.t. each other and w.r.t. background

Hello,

Thanks so much for sharing the code for your amazing work. I had a few doubts regarding the disentanglement part of the work:

  • Is the object-background disentanglement explicit (i.e. using background-foreground masks to train one part of the generator on background pixels only and the remaining parts on foreground pixels), or does the model learn it implicitly? I saw that the paper mentions that scale and translation are fixed for the background to make it span the entire scene and be centered at the origin. But does the model learn to generate a background feature field with this configuration in an unsupervised way, or is there some explicit supervision as well? The paper seems to suggest it is unsupervised, but I just wanted to confirm.

  • I saw that you have N+1 generators for N objects (one for the background). Are all N object generator MLPs essentially the same generator with shared weights, or are they different? Assuming all objects are, say, cars, then one generator would probably be enough to generate all of them; but if we have different objects in the scene, like cars, bicycles, pedestrians, etc., then a per-category object generator would probably make sense?

Thanks again!

Bug in the config of chairs_64

The translation range is set to zero in chairs_64.yaml, but translation is included in the chairs' default render output (they cannot translate at all).

Testing the provided pre-trained models does not complete successfully

When I run python render.py configs/256res/cars_256_pretrained.yaml, it shows a RuntimeError. I don't know how to solve it; could you give me some help? Thank you so much!

The error is as follows:
PS D:\ChromeD\giraffe-main> python render.py configs/256res/cars_256_pretrained.yaml
https://s3.eu-central-1.amazonaws.com/avg-projects/giraffe/models/checkpoint_cars256-d9ea5e11.pt
=> Loading checkpoint from url...
Traceback (most recent call last):
File "render.py", line 26, in
checkpoint_io.load(cfg['test']['model_file'])
File "D:\ChromeD\giraffe-main\im2scene\checkpoints.py", line 62, in load
return self.load_url(filename)
File "D:\ChromeD\giraffe-main\im2scene\checkpoints.py", line 93, in load_url
state_dict = model_zoo.load_url(url, progress=True)
File "D:\Appstore3\Anaconda\envs\giraffe\lib\site-packages\torch\hub.py", line 559, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "D:\Appstore3\Anaconda\envs\giraffe\lib\site-packages\torch\serialization.py", line 587, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "D:\Appstore3\Anaconda\envs\giraffe\lib\site-packages\torch\serialization.py", line 242, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory
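
For what it's worth, "failed finding central directory" usually indicates an incomplete or corrupted download of the cached checkpoint rather than a code bug. A hedged way to check is to delete the cached file under the torch hub directory and re-download it, using the checkpoint URL printed in the log above:

import os
import torch

url = ("https://s3.eu-central-1.amazonaws.com/avg-projects/giraffe/models/"
       "checkpoint_cars256-d9ea5e11.pt")
cached = os.path.join(torch.hub.get_dir(), "checkpoints", os.path.basename(url))
if os.path.exists(cached):
    os.remove(cached)  # drop the possibly truncated cached file
state_dict = torch.hub.load_state_dict_from_url(url, progress=True)
print(type(state_dict))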

Config differences between GIRAFFE and GRAF on CelebA-HQ

Hi, thanks for sharing the great work. I wonder why there are some differences in the config parameters on CelebA-HQ, e.g. the near/far bounds and the range of u, v. How do you determine the min/max bounds of the camera pose and the object-to-scene transformation for the CelebA-HQ dataset in GIRAFFE?

Confusion about camera and world coordinates

Thanks for your great work! I ran into some confusion when running the code:

  1. Why do we change the second and first dimensions of the translation when changing the object's depth and its horizontal position, respectively? What is the relationship between camera and world coordinates?
  2. How do you determine the object-to-scene scale and translation ranges, e.g. for the face experiments on CelebA-HQ? Do we need some prior for that?

About the role of the prior file for the CLEVR dataset and how to obtain it

Hello, I noticed that for the CLEVR dataset you do not obtain the translation t in the transformation {s,t,R} through random sampling, while s and R are randomly sampled during training. Instead, you sample the translation t from a prior file, which consists of a large number of coordinate values to sample from. I have two questions about this: first, what is the role of this prior file, and why can't we obtain t in the same way as s and R? Second, I see that the prior file is included with the images when downloading the dataset, but if we only have an image dataset, how can we obtain the prior file? In other words, where does such a prior file come from?

About the model structure

Hi, thanks for your great work. I have two questions about the model:

  1. Why was the patch-based input and discriminator from GRAF abandoned?
  2. How does this model handle the data requirements of the dataset, such as scene bounds, etc.? (GRAF uses LLFF or COLMAP.)

[code] details about collision check

In the bounding box generator class, there is a function 'check_for_collison'. Can you explain why this is done for n_boxes==2, and how it works (an intuitive explanation of the s and t parameters)?

FFHQ FID score reproduction check

Hello,

I trained the model with the ffhq_256.yaml file.

But I was not able to reproduce the FID score of the pretrained GIRAFFE model you provided (ffhq_256_pretrained.yaml).

Could you please check the configuration file (ffhq_256.yaml)?

FFHQ | FID (20000 images)
Pretrained from GitHub | 31.507948
My reproduced model | 43.068982

About image resolution

Hello, sorry to trouble you. I would like to know if the resolution of the images can be changed to 512x512 or 640x480.

About fitting the latent codes of a given image

Hello, thanks for your great work. I tried to optimize the latent codes of a given face image with your pre-trained CelebA-HQ model, but failed with 'CUDA out of memory' (on a single TITAN Xp GPU with 12GB). I simply use an RGB L1 loss between the ground-truth image and the generated image, with the Adam optimizer. Could you please offer some official code or a strategy for solving this problem?
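
As a purely generic sketch of this kind of latent-code fitting (the generator here is a toy stand-in, not the repository's API; reducing the rendering resolution and the number of variables optimized at once are the usual ways to cut memory):

import torch

class ToyGenerator(torch.nn.Module):
    """Toy stand-in for a generator that maps a latent code to an image."""
    def __init__(self, z_dim=256, res=64):
        super().__init__()
        self.net = torch.nn.Linear(z_dim, 3 * res * res)
        self.res = res

    def forward(self, z):
        return torch.sigmoid(self.net(z)).view(-1, 3, self.res, self.res)

generator = ToyGenerator()
target = torch.rand(1, 3, 64, 64)            # stand-in for the ground-truth image
z = torch.randn(1, 256, requires_grad=True)  # latent code to optimize
optimizer = torch.optim.Adam([z], lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    loss = (generator(z) - target).abs().mean()  # RGB L1 loss
    loss.backward()
    optimizer.step()
print(loss.item())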

About object separation

Hi,

Thank you for releasing the code. Your work is impressive.

I have some questions about object separation. It seems that your work can separate the objects in the input images, and we can set N to change the number of objects we want the network to recognize. So my question is: what would happen if we set N much larger than the number of objects in one scene? In general, we see one person or one face as a whole object, but can the network learn different parts of one general object (like the eyes, mouth, and nose of a face)? If the network can do that, how? Would it need more constraints?

Best regards,
Hengfei

Optimizer choice

Thank you for sharing this great work. I've been doing some experiments with the code, but I've noticed that the model refuses to train if I change the optimizer to Adam (I've only tried the LSUNCAR dataset; with RMSprop the model trains well). Is there any reason why you chose RMSprop? And do you know why Adam would fail? Thanks.

[code] details about code

In giraffe/models/decoder.py, line 131, why is the unsqueeze operation needed? The shape of net is (batch, hidden), and the output of self.fc_z(z_shape) is (batch, hidden) too.

Inconsistent docstring with code

When reading the source code and trying to understand it, I found some inconsistency between docstrings and the code.

As two examples, the docstrings of im2scene.common.origin_to_world and im2scene.common.images_points__to_world say that the argument invert is

invert (bool): whether to invert matrices (default: true)

but the code is invert=False. Can you please check these? Thanks!

Multi-GPU training

Hi,

Thank you for releasing your code. I have a few questions.

  1. I tried to train the model on cars_256 on my machine, which has 2 GPUs, each with around 11GB of memory. It ran into an OOM error, because the code currently uses only one GPU and, in an earlier response, you indicated that the 256x256 config requires 16GB. So I am thinking of changing the code to use multiple GPUs with DataParallel, as shown below. Is there anything from a computational perspective that needs to be taken care of when running the code on multiple GPUs (will there be any inaccuracies in the results if the training code is run on multiple GPUs)?

model = torch.nn.DataParallel(model, device_ids=gpu_list)

thanks

360 rotation skipping angles?

I've tried to train the model without the neural renderer at 64x64 resolution on the CompCars dataset and rendered the 360-degree rotation images as well as the acc map. The rotation angle is a linear interpolation from 0 to 1, but for some reason the rotation seems to skip a lot of angles.

I'm not quite sure where the problem might be. Any pointer would be helpful. Thanks!

Error when training the code

Thanks for your great work. When I try to train GIRAFFE on FFHQ, I get an error:

(giraffe) ➜  giraffe-main python train.py configs/256res/ffhq_256.yaml           
/home/rjs/.conda/envs/giraffe/lib/python3.8/site-packages/kornia/augmentation/augmentation.py:1872: DeprecationWarning: GaussianBlur is no longer maintained and will be removed from the future versions. Please use RandomGaussianBlur instead.
  warnings.warn(
Start loading file addresses ...
done! time: 0.00021123886108398438
Number of images found: 0
Traceback (most recent call last):
  File "train.py", line 54, in <module>
    train_loader = torch.utils.data.DataLoader(
  File "/home/rjs/.conda/envs/giraffe/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 262, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore
  File "/home/rjs/.conda/envs/giraffe/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 103, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
