Comments (6)

mansimov commented on June 26, 2024
  1. You can think of the pixels in MNIST as probabilities, so the cross-entropy loss measures the distance between the predicted and ground-truth probability distributions. You can try using MSE for MNIST; I'm not sure how that would work :) (A sketch of the two loss choices follows this list.)
  2. You are right. This paper http://arxiv.org/abs/1506.03099 addresses what you are saying.
  3. 10 million steps is an arbitrary number. I don't remember the exact number of steps we used, but it converged fast.
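
For reference, a minimal sketch of the two loss choices for [0, 1]-valued MNIST pixels (an illustration in plain NumPy, not code from the repo):

import numpy as np

def cross_entropy_loss(pred, target, eps=1e-7):
    # Treat each pixel intensity as a Bernoulli probability and measure the
    # cross-entropy between the predicted and ground-truth distributions.
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def mse_loss(pred, target):
    # The alternative: plain squared error on pixel intensities.
    return np.mean((pred - target) ** 2)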


b3nk4n commented on June 26, 2024

Regarding 2:

Right about what? That one should train directly on previously predicted frames, not ground-truth frames?
Because looking at your code, you are using ground-truth frames during training. See lines 79-89 in lstm_combo.py:

# Fprop through future predictor.
for t in xrange(self.future_seq_length_):
  this_init_state = init_state if t == 0 else []
  if self.is_conditional_fut_ and t > 0:
    if train:
      # At training time, condition on the ground-truth frame.
      t2 = self.enc_seq_length_ + t - 1
      input_frame = self.v_.col_slice(t2 * self.num_dims_, (t2 + 1) * self.num_dims_)
    else:
      # Instead of conditioning on the true frame, condition on the
      # generated frame at test time.
      t2 = t - 1
      input_frame = self.v_fut_.col_slice(t2 * self.num_dims_, (t2 + 1) * self.num_dims_)
      if self.binary_data_:
        input_frame.apply_sigmoid()
      elif self.relu_data_:
        input_frame.lower_bound(0)
  else:
    input_frame = None
  self.lstm_stack_fut_.Fprop(input_frame=input_frame, init_state=this_init_state,
                             output_frame=self.v_fut_.col_slice(t * self.num_dims_, (t + 1) * self.num_dims_),
                             copy_init_state=self.future_copy_init_state_)

In the paper, you write on page 6:

Next, we change the future predictor by making it conditional. We can see that this model makes sharper predictions.

But there is no hint whether it conditions on ground-truth frames or on previously predicted frames.

EDIT
I implemented a network similar to yours in TensorFlow to predict future frames (without the reconstruction branch, so no combo model; additionally, I'm using LSTMConv2D cells without peephole connections and squared error as the loss function). I'm getting roughly the same results: when I condition on the ground-truth frame during training, the model seems to learn no motion at all. But it works quite well when I condition on the previously predicted frame during training.
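
For context, the rough shape of such a model (a hedged sketch using today's Keras ConvLSTM2D, which has no peephole connections; hyperparameters are illustrative and this is not the original code):

import tensorflow as tf

# One-layer convolutional LSTM future predictor, no reconstruction branch;
# input is a sequence of 10 frames of 64x64 MovingMNIST.
model = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(64, kernel_size=3, padding="same",
                               return_sequences=True,
                               input_shape=(10, 64, 64, 1)),
    tf.keras.layers.Conv3D(1, kernel_size=3, padding="same",
                           activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")  # squared-error loss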

Check out these two videos:
videos.tar.gz

My personal guess is that when we train on ground-truth frames, the network only ever sees sharp edges, because all images in MovingMNIST have high contrast and sharp edges. When we validate/test this model, the first predicted image looks very good and is only slightly blurry. But from there on, the future predictor receives blurry images it has never seen before, so it cannot predict these frames correctly.
In contrast, when we condition on previously predicted frames during training as well, the model also learns how to handle and predict from blurry input images.
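
To make the two regimes concrete, here is a schematic rollout loop (a plain-Python sketch of the idea, not the repo's code; cell, targets, etc. are placeholders):

def rollout(cell, state, frame, num_steps, targets=None, teacher_force=False):
    # cell(frame, state) -> (next_frame, next_state)
    outputs = []
    for t in range(num_steps):
        frame, state = cell(frame, state)   # predict the next frame
        outputs.append(frame)
        if teacher_force and targets is not None:
            frame = targets[t]              # condition on the ground-truth frame
        # otherwise the next step conditions on the model's own
        # (possibly blurry) prediction, matching test-time behavior
    return outputs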

What do you think about that?


mansimov commented on June 26, 2024

But there is no hint whether it conditions on ground-truth frames or on previously predicted frames.

As far as I remember, we conditioned on ground-truth frames. Yes, the difference between the distributions of ground-truth and predicted frames is causing this issue. I also suggest conditioning on ground truth at the beginning of training and then slowly switching to previously predicted frames, as in http://arxiv.org/abs/1506.03099 (scheduled sampling).
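
A minimal sketch of that curriculum (the inverse-sigmoid decay is from the scheduled sampling paper; the function names are illustrative):

import math
import random

def ground_truth_probability(step, k=1000.0):
    # Inverse-sigmoid decay from http://arxiv.org/abs/1506.03099:
    # starts near 1 (always feed ground truth) and decays toward 0
    # (always feed the model's own predictions).
    return k / (k + math.exp(step / k))

def next_input(step, ground_truth_frame, predicted_frame):
    # Per step, feed ground truth with the current probability,
    # otherwise feed the previously predicted frame.
    if random.random() < ground_truth_probability(step):
        return ground_truth_frame
    return predicted_frame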

Btw, how far into the future are you predicting? It looks like more than 10 frames.


b3nk4n commented on June 26, 2024

Btw, how far into the future are you predicting? It looks like more than 10 frames.

During training, I predicted 10 frames using 1-layer LSTMConv2D cells. I trained for 50k iterations, and the batch size was, I guess, 24 on each of the 4 Titan X GPUs, so an effective batch size of 96.

After the model had more or less converged, I created this video on the test set with the future predictor extended to 50 frames, just to see how it behaves beyond its learned range.
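
In terms of the rollout sketch above, that just means running the loop longer at test time than during training (illustrative only):

# Trained with num_steps=10; at test time simply extend the horizon, so
# every frame beyond the first is conditioned on a generated frame.
test_outputs = rollout(cell, state, first_frame, num_steps=50)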

I think you are doing the same on your (old) website and predicting into the future for a very long time (100 frames?):
http://www.cs.toronto.edu/~nitish/unsupervised_video/ (the gif at the top)
I'm just trying to get roughly the same results as yours, but I cannot reproduce them with my own model in TensorFlow, nor with your code.

I'll try another run of your code, and in case the validation loss does not converge again, I'll post a screenshot right here...

Last but not least: Thank you for your time! :)

Best regards from Munich


b3nk4n commented on June 26, 2024

As promised, here is the screenshot:
[Screenshot: combo_1layer_40k]

The screenshot was taken after 40k iterations. I used the 1-layer combo model. All parameters are unchanged from the repository.

As you can see, the validation loss is 2600+. I know 40k iterations might not be enough training, but the last time I ran the code for about 650k iterations, the loss was about 2595. It seems to get stuck there somehow.

Edit:
Another one after 114.5k iterations:
[Screenshot: combo_1layer_114k]


b3nk4n commented on June 26, 2024

Yes, the difference between the distributions of ground-truth and predicted frames is causing this issue. I also suggest conditioning on ground truth at the beginning of training and then slowly switching to previously predicted frames, as in http://arxiv.org/abs/1506.03099 (scheduled sampling).

Thank you so much for suggesting this paper. I just read it, and it is exactly what I was looking for; it's highly valuable for my thesis! :)

