inferno-pytorch / inferno

A utility library around PyTorch

License: Other

Python 99.47% Shell 0.09% Makefile 0.44%

deep-learning neural-networks pytorch

inferno's Issues

Example does not make sense

I just checked the example we have in the README and I don't think it makes sense.
We add a Softmax layer and then use CrossEntropyLoss, which already combines LogSoftmax and NLLLoss, so the softmax is effectively applied twice.
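
To make the problem concrete, here is a minimal pure-Python sketch (no PyTorch involved). The helpers `softmax` and `cross_entropy` are written just for this illustration and are not inferno or PyTorch functions; `cross_entropy` mimics the per-sample behaviour of a loss that applies log-softmax followed by NLL.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Log-softmax followed by negative log-likelihood, for one sample.
    return -math.log(softmax(logits)[target])

logits = [4.0, 1.0, 0.5]
target = 0

# Model outputs raw logits: the loss sees a confident, correct prediction.
loss_on_logits = cross_entropy(logits, target)

# Model already ends in Softmax: the loss squashes the probabilities again.
loss_on_probabilities = cross_entropy(softmax(logits), target)
```

With these numbers the second loss is clearly larger than the first, even though the prediction is the same. Because probabilities lie in [0, 1], the doubly-squashed loss can never get close to zero, which would flatten the training signal.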

Unit tests too aggressive

@nasimrahaman I think the unit tests are too aggressive.

They tend to fail too often on Travis.
I guess we should relax these tests a bit:

self.assertLess(trainer.get_state('validation_error_averaged'), (1 - 1/self.NUM_CLASSES))
E       AssertionError: 0.9244186046511628 not less than 0.9

Store version information only once

Right now the version information is present in two different files:

https://github.com/inferno-pytorch/inferno/blob/master/setup.py#L41
https://github.com/inferno-pytorch/inferno/blob/master/inferno/__init__.py#L14

Would be better to store this in a single place. I am not sure about the best practices though.
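
One common approach is to keep the version only in the package's `__init__.py` and have `setup.py` read it from there without importing the package. A minimal sketch follows, assuming `__init__.py` contains a line of the form `__version__ = '...'`; the helper name `read_version` is made up for this example.

```python
import re
from pathlib import Path

def read_version(init_path):
    """Return the string assigned to __version__ in the given Python file."""
    text = Path(init_path).read_text(encoding="utf-8")
    # Match a line such as: __version__ = '0.4.0'
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", text, re.MULTILINE)
    if match is None:
        raise RuntimeError(f"No __version__ string found in {init_path}")
    return match.group(1)
```

In `setup.py` this could then be used as `version=read_version("inferno/__init__.py")`, so the number only has to be bumped in one file.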

Description

You guys seem to be doing something very similar to the torchsample project and it seems to me that there's a ton of room for collaboration / code merging. Please consider at least pulling in some of the functionality from that library. Also, I really like the way the callbacks are structured in torchsample. I didn't see anything similar in inferno but I think it would be a good idea.

Thanks

Require a save directory?

Several parts of the Trainer class require a location to save to but don't complain until it is too late.
The obvious example is when a save point is specified via save_every, but the trainer also defaults to saving after a validation run, even when no save directory has been set.

Clean up tests and set up Travis CI

Some tests require GPU, and they need to be unittest.skip-ped.
Need more travis-friendly tests for Trainer, perhaps with a dummy model on a dummy dataset with a dummy criterion and a dummy metric.

create conda recipe

cannot import name 'is_image_file'

inferno version:v0.1.7
Python version:3.6
Operating System:ubuntu16

Description

I tried to use the inferno trainers and got the following error:
cannot import name 'is_image_file'

What I Did

$ python alderley_patchcwganp.py 
Traceback (most recent call last):
  File "alderley_patchcwganp.py", line 9, in <module>
    from inferno.trainers.basic import Trainer
  File "/home/jiangsht/anaconda2/envs/chi/inferno/__init__.py", line 6, in <module>
    from . import io
  File "/home/jiangsht/anaconda2/envs/chi/inferno/io/__init__.py", line 1, in <module>
    from . import box
  File "/home/jiangsht/anaconda2/envs/chi/inferno/io/box/__init__.py", line 3, in <module>
    from .camvid import CamVid, get_camvid_loaders
  File "/home/jiangsht/anaconda2/envs/chi/inferno/io/box/camvid.py", line 9, in <module>
    from torchvision.datasets.folder import is_image_file, default_loader
ImportError: cannot import name 'is_image_file'

Remove deprecated imsave

scipy.misc.imsave is deprecated and not part of recent scipy versions any more.
We still use it here

We should use skimage.io.imsave or imageio.imwrite instead.

cc @FynnBe

Have Trainer support networks with multiple inputs and outputs.

This requires:

an extra method in the main Trainer class to specify the number of expected inputs and outputs,
Tensorboard support
(optional) support for multiple metrics,
(optional) method to bundle multiple criteria to one

Gradient clip?

Is there a way to register call back for gradient clip?

Remove deprecated toimage

scipy.misc.toimage is deprecated since scipy 1.0.0 and removed since 1.3.0.
We use it here:

inferno/inferno/trainers/callbacks/logging/tensorboard.py

Line 4 in d8287f5

from scipy.misc import toimage

Replace the print statements in Trainer with a Verbosity callback

Trainer.print may stay, but the print statements must be optional (especially with Tensorboard logging). This can be easily done with a Verbosity callback.

Can't run the basic example

inferno version: v0.3.1
Python version: 3.6.7
Operating System: macOS Mojave 10.14.4

Description

I have tried to run the script in the Readme, after having set the three directories that must be set and disabling CUDA. When running the script with python3 hello_world.py I got two errors, I made the first disappear (see below), but the second is still present. The expected behavior is to get no error.

What I Did

The full code is reported below, in a file called hello_world.py.

import torch.nn as nn
from inferno.io.box.cifar import get_cifar10_loaders
from inferno.trainers.basic import Trainer
from inferno.trainers.callbacks.logging.tensorboard import TensorboardLogger
from inferno.extensions.layers.convolutional import ConvELU2D
from inferno.extensions.layers.reshape import Flatten

# Fill these in:
LOG_DIRECTORY = 'log'
SAVE_DIRECTORY = 'save'
DATASET_DIRECTORY = 'data'
DOWNLOAD_CIFAR = True
USE_CUDA = False

# Build torch model
model = nn.Sequential(
    ConvELU2D(in_channels=3, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    ConvELU2D(in_channels=256, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    ConvELU2D(in_channels=256, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    Flatten(),
    nn.Linear(in_features=(256 * 4 * 4), out_features=10),
    nn.LogSoftmax(dim=1)
)

# Load loaders
train_loader, validate_loader = get_cifar10_loaders(DATASET_DIRECTORY,
                                                    download=DOWNLOAD_CIFAR)

# Build trainer
trainer = Trainer(model) \
  .build_criterion('NLLLoss') \
  .build_metric('CategoricalError') \
  .build_optimizer('Adam') \
  .validate_every((2, 'epochs')) \
  .save_every((5, 'epochs')) \
  .save_to_directory(SAVE_DIRECTORY) \
  .set_max_num_epochs(10) \
  .build_logger(TensorboardLogger(log_scalars_every=(1, 'iteration'),
                                  log_images_every='never'),
                log_directory=LOG_DIRECTORY)

# Bind loaders
trainer \
    .bind_loader('train', train_loader) \
    .bind_loader('validate', validate_loader)

if USE_CUDA:
  trainer.cuda()

# Go!
trainer.fit()

I first created the three folders specified in the script with mkdir log, mkdir save, mkdir data. I then ran the script with python3 hello_world.py. I first got the error:

  File "hello_world.py", line 2, in <module>
    from inferno.io.box.cifar import get_cifar10_loaders
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/__init__.py", line 6, in <module>
    from . import io
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/io/__init__.py", line 4, in <module>
    from . import volumetric
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/io/volumetric/__init__.py", line 1, in <module>
    from .volume import VolumeLoader, HDF5VolumeLoader, TIFVolumeLoader
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/io/volumetric/volume.py", line 8, in <module>
    from ...utils import io_utils as iou
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/utils/io_utils.py", line 5, in <module>
    from scipy.misc import imsave
ImportError: cannot import name 'imsave'

which I could solve by running conda install -c anaconda scipy. I was not expecting this error because, since I installed inferno with conda, I expected all the dependencies to be already installed.

The second error that now I get is the following:

Traceback (most recent call last):
  File "hello_world.py", line 2, in <module>
    from inferno.io.box.cifar import get_cifar10_loaders
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/__init__.py", line 7, in <module>
    from . import trainers
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/__init__.py", line 1, in <module>
    from . import basic
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/basic.py", line 20, in <module>
    from .callbacks.logging.base import Logger
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/__init__.py", line 4, in <module>
    from .tensorboard import TensorboardLogger
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 1, in <module>
    import tensorboardX as tX
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/writer.py", line 15, in <module>
    from .event_file_writer import EventFileWriter
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
    serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX\"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')

How to fix it?

tensorboard logger asking for a detach()

inferno version:
output from conda list | grep inferno

inferno                   v0.4.0                     py_0    conda-forge
inferno-pytorch           0.4.0                    pypi_0    pypi

Python version:
3.7.4
Operating System:
centOS

Description

I get this error (I show only the last line of the backtrace)

  File "/home/my_username/anaconda3/envs/my_project/lib/python3.7/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 292, in extract_images_from_batch
    batch = batch.float().numpy()
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

Produced by running the following code (I post only the part that should be relevant)

    # ...
    trainer = Trainer(vae)
    trainer.save_to_directory(folder)
    trainer.cuda()
    trainer.build_criterion(vae.loss_function())
    trainer.build_optimizer('Adam', lr=0.001)
    trainer.save_every((1, 'epochs'))
    trainer.set_max_num_epochs(100)
    trainer.build_logger(TensorboardLogger(log_scalars_every=(1, 'iteration'),
                                           log_images_every=(1, 'iteration'),
                                           log_directory=folder))
    # ...

Neuroglancer integration in tensorboard logger.

It would be great to have the neuroglancer viewer available for 3D volumetric data during inference.
This would make data inspection much easier especially for data with multiple channels.

create conda forge recipe

readthedocs vs self-build + githubpage hosted

Maintaining and building documentation on readthedocs is a pain in the ass:

Hard / impossible to debug
No GPU => all examples that produce plots rely in some way on a CUDA GPU. We will never get one on readthedocs => the example gallery will look poor there

Building the docs ourselves and hosting them via https://pages.github.com/ is not very hard; it just means we need to do this on a regular basis. But we get a super nice automatic example gallery, and it is not fragile at all.

@nasimrahaman what do you think?

Question about save_now logic

I think there is a logic bug in save_now. Please correct me if I am wrong.

https://github.com/inferno-pytorch/inferno/blob/master/inferno/trainers/basic.py#L484

The second condition is currently:

elif self._is_iteration_with_best_validation_score:
            return self._save_at_best_validation_score

Shouldn't that be:

elif self._save_at_best_validation_score:
            return self._is_iteration_with_best_validation_score

If you are only saving at the best score, then only save_now if you are the best score.

However, if you are currently at the best score and save at best is off, it will not save. Should be an easy fix just swapping those two variables.
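
A small truth-table sketch can show where the two variants actually differ. The surrounding branches of save_now are not quoted in this issue, so the `fallback` argument below is an assumption standing in for whatever the later branches would return; the function names are made up for illustration.

```python
from itertools import product

def save_now_current(is_best, save_at_best, fallback):
    # Branch as quoted from the current code.
    if is_best:
        return save_at_best
    return fallback  # hypothetical: result of the remaining branches

def save_now_proposed(is_best, save_at_best, fallback):
    # Branch with the two variables swapped, as proposed above.
    if save_at_best:
        return is_best
    return fallback  # hypothetical: result of the remaining branches

# Collect every input combination on which the two variants disagree.
differences = [
    (is_best, save_at_best, fallback)
    for is_best, save_at_best, fallback in product([False, True], repeat=3)
    if save_now_current(is_best, save_at_best, fallback)
    != save_now_proposed(is_best, save_at_best, fallback)
]
```

Under this assumption the variants disagree only when a later branch would have requested a save: the current code suppresses that save on a best-score iteration with save-at-best turned off, while the proposed code suppresses it on a non-best iteration with save-at-best turned on. Taken in isolation (fallback always False), both branches reduce to the same logical AND.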

pytorch 1.0 multi-processing not stable

It looks like the multi-processing is not really stable in pytorch 1.0 yet.
This leads to a few non-terminating tests (not an issue in 0.4.1 !).
I have disabled the 1.0 tests for now:
https://github.com/inferno-pytorch/inferno/blob/master/.travis.yml#L11

We should check again if this works once there is a new torch release.

TensorboardLogger defaults don't make sense

There are some issues with the TensorboardLogger default arguments.
All log_X_every get the default argument None, which will be mapped to once every iteration.
This is problematic:

log_histograms_every set to once every iteration will lead to calling log_histogram and raise a NotImplementedError
log_images_every set to once every iteration can result in huge log-files, because it stores a lot of images.

Probably the best solution is to change the handling of None for log_images and log_histogram

Update inferno-pytorch version in pypi

The latest version in pypi right now is 0.1.7.

Momentum is not suitable for smoothing validation score

In the current implementation of validation smoothing, we use momentum.
This puts a very high importance on the first validation score.
E.g. for 3 validation scores [.75, .2, .1] the smoothed value would be something like 0.7.

I think using a sliding window with some decay would be more appropriate.
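
For illustration, here is a rough comparison of the two ideas on the scores from the example. The momentum value of 0.95 and the window parameters are assumptions chosen for this sketch, not inferno's actual defaults, and both function names are made up.

```python
def momentum_smooth(scores, momentum=0.95):
    # Exponential moving average seeded with the first score.
    smoothed = scores[0]
    for score in scores[1:]:
        smoothed = momentum * smoothed + (1 - momentum) * score
    return smoothed

def window_smooth(scores, window=3, decay=0.5):
    # Weighted mean over the last `window` scores: the newest score has
    # weight 1 and each older score is down-weighted by a factor `decay`.
    recent = scores[-window:]
    weights = [decay ** age for age in range(len(recent) - 1, -1, -1)]
    return sum(w * s for w, s in zip(weights, recent)) / sum(weights)

scores = [0.75, 0.2, 0.1]
momentum_value = momentum_smooth(scores)  # stays close to the first score
window_value = window_smooth(scores)      # follows the recent scores
```

With these assumed parameters the momentum estimate stays near 0.69, while the decayed window lands near 0.22, much closer to the latest scores.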

Tensorboard image summaries are not systematically tagged

As of 77fd5e7, they're tagged training_prediction/0, training_prediction/1 and so on. We should have something like training_prediction/batch_0/channel_0, training_prediction/batch_2/channel_1/z_0.

Recursion error since there is no tensor/variable distinction

inferno version: 0.1.8 (installed via conda install -c pytorch -c conda-forge inferno)
Python version: 3.6.8
Operating System: Ubuntu 18.04

Description

Initialising a model raised a recursion error.
Specifically, in inferno/extensions/initializers/presets.py line 23

if isinstance(tensor, Variable):
    self.call_on_tensor(tensor.data)

the if clause is always true and one gets stuck in infinite recursion.

What I Did

Delete the lines

        if isinstance(tensor, Variable):
            self.call_on_tensor(tensor.data)
            return tensor

Print training / validation accuracies

Hey team. I was just wondering what the best way to get the trainer to print training accuracies to the console was? Thanks

project page is not up to date

The project page is not up to date: the version numbers referenced there as goals have already been reached.

Wrong import in UNet Tutorial?

inferno/examples/plot_unet_tutorial.py

Line 71 in 5493e9e

from inferno.extensions.layers import ResBlockUNet

Am I missing something or shouldn't this import be from inferno.extensions.model instead?

(Btw. thanks for the Unets and the tutorial Thorsten)

RandomScaleSegmentation twice in transform/image.py

The code for class RandomScaleSegmentation is duplicated.
The two copies are identical except for the padding mode (and empty lines).

inferno/inferno/io/transform/image.py

Line 86 in c83f2ec

class RandomScaleSegmentation(Transform):

inferno/inferno/io/transform/image.py

Line 644 in c83f2ec

class RandomScaleSegmentation(Transform):

Better support for Tensorboard

This project looks like a good replacement for the manual tensorboard business we currently have going. It makes it much easier to integrate histograms, distributions, and even audio.

Graph model fails to replicate on multiple devices.

As of this commit, the problem can be reproduced as follows:

import torch
from torch.autograd import Variable
import torch.nn as nn
from torch.nn.parallel.data_parallel import data_parallel
from inferno.extensions.containers.graph import Graph

input_shape = [8, 1, 3, 128, 128]
model = Graph()\
    .add_input_node('input')\
    .add_node('conv0', nn.Conv3d(1, 10, 3, padding=1), previous='input')\
    .add_node('conv1', nn.Conv3d(10, 1, 3, padding=1), previous='conv0')\
    .add_output_node('output', previous='conv1')

model.cuda()
input = Variable(torch.rand(*input_shape).cuda())
output = data_parallel(model, input, device_ids=[0, 1, 2, 3])

This raises:

RuntimeError: tensors are on different GPUs

Could this be due to this add_module?

make GarbageCollection a default

We should make the behavior of the GarbageCollection callback the default and make this callback obsolete.
We should provide an API like the following and have reasonable defaults:

trainer.garbage_collect(collect_every=(1, 'iteration'))

Power users can disable gc via

trainer.garbage_collect(collect_every='never')

Batches with Tags

I think it would be useful to have a nicer interface for input and output batches with multiple elements.

Currently it can be cumbersome to keep track of what is where in the batch, especially when using multiple transforms that add or remove elements from the batch, or when using multiple loss functions that act on different ground truth/predictions.

This would be a lot easier if elements in the batch could have tags (such as 'raw', 'segmentation', 'affinities'). Transforms and loss functions could use these tags to select what they act on, and also label their outputs.

I will probably implement this at least for myself, but doing so in a nice way while keeping the current functionality will be harder. So I am interested in whether this feature would be useful to others, and if someone has ideas on how to implement it.
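
As a starting point for discussion, a very small sketch of what a tagged batch could look like is given below. Everything here (`TaggedBatch`, `select`, `apply`) is hypothetical and not part of inferno; plain lists stand in for tensors.

```python
class TaggedBatch(dict):
    """A batch whose elements are addressed by tag instead of by position."""

    def select(self, *tags):
        # Return the requested elements in the order the tags were given.
        return [self[tag] for tag in tags]

    def apply(self, function, tags, output_tags=None):
        # Apply `function` to each tagged element and store the results under
        # `output_tags` (by default the inputs are overwritten in the copy).
        output_tags = tags if output_tags is None else output_tags
        results = [function(self[tag]) for tag in tags]
        updated = TaggedBatch(self)
        updated.update(zip(output_tags, results))
        return updated

batch = TaggedBatch(raw=[1, 2, 3], segmentation=[0, 1, 1])
doubled = batch.apply(lambda values: [2 * v for v in values],
                      tags=['raw'], output_tags=['raw_doubled'])
```

A transform would then only name the tags it acts on, and a loss could pick its prediction and target with something like `batch.select('affinities', 'segmentation')`, independent of where they sit in the batch.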

Migrate to networkx 2.x

inferno/inferno/extensions/containers/graph.py

Line 389 in 2f51e50

for source, target in graph.edges_iter():

This line crashes with networkx version 2.2
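
networkx 2.x removed edges_iter(); edges() now returns an iterable view. A small compatibility helper along the following lines might bridge both versions. The helper name `iter_edges` is made up for this sketch, and it only relies on the two method names mentioned above.

```python
def iter_edges(graph):
    # networkx 1.x exposed edges_iter(); networkx 2.x dropped it and made
    # edges() return an iterable view instead.
    if hasattr(graph, 'edges_iter'):
        return graph.edges_iter()
    return iter(graph.edges())
```

The loop in graph.py would then read `for source, target in iter_edges(graph):`. If support for networkx 1.x is not needed, iterating over `graph.edges()` directly should be enough.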

Implement infinite training

As of now, Trainer does not work if max_num_epochs or max_num_iterations is not specified. Not providing either should result in the trainer training till interrupted (via Ctrl+C or SIGINT).
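
The intended control flow could look roughly like the sketch below. It is not the Trainer's actual loop; `train_until_interrupted` and `train_step` are hypothetical names, and the sketch relies on Python's default SIGINT handling, which raises KeyboardInterrupt in the main thread.

```python
def train_until_interrupted(train_step, max_num_iterations=None):
    """Call train_step(iteration) until the limit is reached or Ctrl+C is hit."""
    iteration = 0
    try:
        # With no limit given, the loop only ends on an interrupt.
        while max_num_iterations is None or iteration < max_num_iterations:
            train_step(iteration)
            iteration += 1
    except KeyboardInterrupt:
        # Stop quietly so that the caller can still save a checkpoint.
        pass
    return iteration
```

Returning the number of completed iterations lets the caller decide what to do after the interrupt, for example saving the model once more.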

ImportError from torchvision

inferno version: 0.1.8 (installed via conda install -c pytorch -c conda-forge inferno)
Python version: 3.6.8
Operating System: Ubuntu 18.04

Description

When importing inferno, I got the error:

  File "<stdin>", line 1, in <module>
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/__init__.py", line 6, in <module>
    from . import io
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/io/__init__.py", line 1, in <module>
    from . import box
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/io/box/__init__.py", line 3, in <module>
    from .camvid import CamVid, get_camvid_loaders
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/io/box/camvid.py", line 9, in <module>
    from torchvision.datasets.folder import is_image_file, default_loader
ImportError: cannot import name 'is_image_file'

I use torchvision 0.2.1 and pytorch 1.0.1

What I Did

Commenting out the first two lines in

/home/sdamrich/anaconda3/envs/condaenv/inferno/io/box/__init__.py

solved the issue for me.

Normalize channels separately?

inferno version: n/a
Python version: n/a
Operating System: n/a

Description

inferno/io/transform/generic.py
class Normalize(Transform)

def tensor_function(self, tensor):
    mean = np.asarray(tensor.mean()) if self.mean is None else self.mean
    std = np.asarray(tensor.std()) if self.std is None else self.std
    # Figure out how to reshape mean and std
    reshape_as = [-1] + [1] * (tensor.ndim - 1)
    # Normalize
    tensor = (tensor - mean.reshape(*reshape_as))/(std.reshape(*reshape_as) + self.eps)
    return tensor

Issue

I am not sure I'm getting the intentions here, but I guess this reshaping the mean and std part is meant to apply separate means and stds for channels, right?
In this case it looks like it wouldn't work if the mean and std were not supplied as arguments (tensor.mean() would return the mean of a flattened array by default?)
Was it meant like this?
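
If per-channel statistics are what is wanted, the mean and std would have to be reduced over every axis except the channel axis. A minimal NumPy sketch, assuming the channel axis comes first (as the `[-1] + [1] * ...` reshape suggests) and using a made-up function name and eps default:

```python
import numpy as np

def normalize_per_channel(tensor, eps=1e-4):
    # Reduce over all axes except the leading (channel) axis; keepdims=True
    # leaves shape (C, 1, ..., 1), which broadcasts against the input.
    reduce_axes = tuple(range(1, tensor.ndim))
    mean = tensor.mean(axis=reduce_axes, keepdims=True)
    std = tensor.std(axis=reduce_axes, keepdims=True)
    return (tensor - mean) / (std + eps)

# tensor.mean() without an axis is a single scalar, so reshaping it gives
# shape (1, 1, 1): one global statistic rather than one per channel.
example = np.arange(24, dtype=float).reshape(2, 3, 4)
global_mean_shape = np.asarray(example.mean()).reshape(-1, 1, 1).shape
normalized = normalize_per_channel(example)
```

So the code quoted above does normalize when mean and std are omitted, but with global statistics; the reshape only has a per-channel effect when arrays of length C are passed in.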

IOU is broken

With the current pytorch (1.0), the IOU metric fails with

File "/home/pape/Work/software/conda/miniconda3/envs/torch10/lib/python3.7/site-packages/inferno/extensions/metrics/categorical.py", line 104, in forward
numerator = (flattened_prediction * onehot_targets).sum(-1)
RuntimeError: expected type torch.cuda.FloatTensor but got torch.cuda.LongTensor

I could fix this by casting onehot_targets to float before this line:
https://github.com/inferno-pytorch/inferno/blob/master/inferno/extensions/metrics/categorical.py#L104
But we should probably double check that this is the right thing to do.

Clean up `extensions/layers`

We currently have a bunch of files in extensions/layers that implement somewhat redundant functionality:

building_blocks: Implements residual block in ResBlockBase and ResBlock
prefab: Implements residual block in ResidualBlock
res_unet: Implements residual u-net.
unet_base: Implements u-net base class.

I would vote to merge building_blocks and prefab and if possible also merge the residual block implementations in there. I like @DerThorsten suggestions to name the new file conv_blocks,
because this makes clear what's in there.

Regarding the unet:
Maybe put everything into a single unet file?

Add more examples:

We should add more examples for the following things:

inferno's transformation pipeline; show / highlight the difference between torchvision's and inferno's transformations
usage of trainer with a non-trivial dataset (something where num_input > 1 and num_output > 1)
load a model which was trained/saved via inferno and use this trained model to predict
...

github badges/shields seem to be down.

Badges/shields seem to be down / 'gray'.
I guess the travis, readthedocs and other URLs need to be updated due to the repo ownership transfer.

Volatile was removed

inferno/inferno/trainers/basic.py

Line 1050 in 17e7262

 batch = type(batch)([Variable(_batch, requires_grad=requires_grad, volatile=volatile) 

UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.

There's no good way to prevent TensorboardLogger from logging images.

Something along the lines of TensorboardLogger(log_images_every='never') is what we're after. The cleanest way of getting that going is by making inferno.utils.train_utils.Frequency understand what 'never' means.
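
To sketch the idea: a frequency object could treat 'never' as a special value that never matches. The class below is a simplified stand-in written for this illustration (`SimpleFrequency` and its `match` signature are assumptions), not the actual inferno.utils.train_utils.Frequency.

```python
class SimpleFrequency:
    """Simplified (value, unit) schedule that also understands 'never'."""

    def __init__(self, spec):
        if spec == 'never':
            self.value, self.unit = None, None
        else:
            self.value, self.unit = spec

    def match(self, iteration_count=None, epoch_count=None):
        # 'never' short-circuits before any counting logic.
        if self.value is None:
            return False
        # Accept both singular and plural units, e.g. 'iteration'/'iterations'.
        if self.unit.startswith('iteration'):
            count = iteration_count
        else:
            count = epoch_count
        return count is not None and count % self.value == 0
```

A logger could then check `SimpleFrequency('never').match(iteration_count=i)` on every iteration and simply skip image logging.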

Extended Support for PyTorch v0.4+

This is an issue to track the next inferno release built around PyTorch 0.4. Below is a list of what is to come, feel free to populate it and/or suggest changes.

Core

Adapt to the Pytorch v0.4 paradigm and deprecate Variables,
Integrate zero-dimensional tensors (and get rid of all variable.data[0] in the codebase),
Integrate the new device-agnostic constructs (tensors.to(...) or model.to(...)),
Wrapper to manage gradient checkpointing,
Integrate support for reduce=False in all inferno-managed loss functions,

Visualization

Replace the existing tensorboard backend with lanpa/tensorboardX.

General

Make the trainer class more modular without compromising on functionality. Break-up the oversized Trainer class to smaller classes to facilitate future support for multi-model trainers.

Updates

15 Aug 2018

To fully implement all 0.4+ features without bloating the codebase, we'd need to deprecate v0.3 and below, potentially invalidating a lot of code. I guess this can wait till v1.0.

Tensorboard logging fails with tensorboardX 1.4

The TensorboardLogger fails when logging images and using tensorboardX 1.4 with the
stack trace below.
Note that this error does not occur in tensorboardX 1.2.

  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 354, in log_image_or_volume_batch
    self.log_images(tag, image_list, step)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 395, in log_images
    self.writer.add_image(tag, img_tensor=image, global_step=step)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/tensorboardX/writer.py", line 412, in add_image
    self.file_writer.add_summary(image(tag, img_tensor), global_step, walltime)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/tensorboardX/summary.py", line 205, in image
    image = make_image(tensor, rescale=rescale)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/tensorboardX/summary.py", line 243, in make_image
    image = Image.fromarray(tensor)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/PIL/Image.py", line 2463, in fromarray
    raise TypeError("Cannot handle this data type")

Remove Variable

Already mentioned in #103, we should finally remove torch.autograd.Variable

SaveAtBestValidationScore does not save the first computed value.

If one uses SaveAtBestValidationScore, the first computed validation score is wrongly never considered the best and is therefore not saved.

[INFO    ] Breaking to validate.                                                                                                         
[INFO    ] Validating.                                                                                                                   
[INFO    ] validate generator exhausted, breaking.                                                                                       
[INFO    ] Done validating. Logging results...                                                                                           
[INFO    ] Validation loss: 3.2442782860133543; validation error: None                                                                   
[INFO    ] Current smoothed validation score 3.2442782860133543 is not better than the best smoothed validation score 3.2442782860133543.
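
The log suggests that the current score is compared against a best score that already equals it. The root cause in inferno is not shown here, so the following is only a sketch of the behaviour one would expect, with a made-up `BestScoreTracker` class and the assumption that lower scores are better.

```python
class BestScoreTracker:
    """Tracks the best (lowest) score seen so far."""

    def __init__(self):
        self.best = None  # no score has been seen yet

    def update(self, score):
        # The first score is the best by definition; afterwards a score has
        # to be strictly lower than the stored best to count as an improvement.
        is_best = self.best is None or score < self.best
        if is_best:
            self.best = score
        return is_best
```

With this logic the comparison happens before the stored best is overwritten, so the first validation run always triggers a save.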

Error when trying to continue training a saved model

inferno version:
Python version:
Operating System:

Description

I built a model and saved it using

trainer.save_every((1, 'epochs'))
trainer.save_to_directory(folder)

When I rerun my Python script to load and continue training the previous model I get an error.

What I Did

This is my code.

def train(load=False, folder='out'):
    print('starting training')
    os.makedirs(folder, exist_ok=True)

    # setup logger
    Logger.instance().setup('log')

    vae = Vae()

    ds = MyDataset(root_folder=root_folder, training=True)
    train_loader = torch.utils.data.DataLoader(ds, batch_size=512, num_workers=16)

    # build trainer
    trainer = Trainer(vae)
    trainer.cuda()

    trainer.build_criterion(vae.loss_function())
    trainer.build_optimizer('Adam', lr=0.001)
    # trainer.validate_every((2, 'epochs'))
    trainer.save_every((1, 'epochs'))
    trainer.save_to_directory(folder)
    trainer.set_max_num_epochs(100)

    # bind loaders
    trainer.bind_loader('train', train_loader, num_inputs=1, num_targets=1)

    # bind callbacks
    trainer.register_callback(GarbageCollection())
    # trainer.register_callback(ShowMinimalConsoleInfo())

    if load:
        trainer.load()
    trainer.fit()

When calling train(load=True) I get the following error:

  File "main.py", line 104, in my_train
    trainer.fit()
  File "/data/l989o/anaconda3/envs/hemo/lib/python3.7/site-packages/inferno/trainers/basic.py", line 1336, in fit
    self.train_for(break_callback=lambda *args: self.stop_fitting(max_num_iterations,
  File "/data/l989o/anaconda3/envs/hemo/lib/python3.7/site-packages/inferno/trainers/basic.py", line 1410, in train_for
    batch = self.fetch_next_batch('train')
  File "/data/l989o/anaconda3/envs/hemo/lib/python3.7/site-packages/inferno/trainers/basic.py", line 1092, in fetch_next_batch
    self._loader_iters.update({from_loader: self._loaders[from_loader].__iter__()})
KeyError: 'train'

Any ideas how to fix it? Thanks.

Beautify the Usage page

This is the page we show most potential users, it shouldn't look shabby.

TensorboardLogger logs stale states for validation_*

The TensorboardLogger needs an end_of_validation_iteration method. Also, the _trainer_states_being_observed attribute needs to be split into two or more subsets (e.g. one for training and one for validation).
