Comments (6)

mansimov commented on June 26, 2024
  1. You can think of the pixels in MNIST as probabilities, so the cross-entropy loss measures the distance between the predicted and ground-truth probability distributions. You can try using MSE for MNIST; I'm not sure how that would work :) (A sketch of the two loss choices follows this list.)
  2. You are right. This paper http://arxiv.org/abs/1506.03099 addresses what you are saying.
  3. 10 million steps is an arbitrary number. I don't remember the exact number of steps we used, but it converged fast.
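
For reference, a minimal sketch of the two loss choices for [0, 1]-valued MNIST pixels (an illustration in plain NumPy, not code from the repo):

import numpy as np

def cross_entropy_loss(pred, target, eps=1e-7):
    # Treat each pixel intensity as a Bernoulli probability and measure the
    # cross-entropy between the predicted and ground-truth distributions.
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def mse_loss(pred, target):
    # The alternative: plain squared error on pixel intensities.
    return np.mean((pred - target) ** 2)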


b3nk4n commented on June 26, 2024

Regarding 2:

Right about what? That one should train directly on previously predicted frames, not ground-truth frames?
Because looking at your code, you are using ground-truth frames during training. See lines 79-89 in lstm_combo.py:

# Fprop through future predictor.
for t in xrange(self.future_seq_length_):
  this_init_state = init_state if t == 0 else []
  if self.is_conditional_fut_ and t > 0:
    if train:
      # At training time, condition on the ground-truth frame.
      t2 = self.enc_seq_length_ + t - 1
      input_frame = self.v_.col_slice(t2 * self.num_dims_, (t2 + 1) * self.num_dims_)
    else:
      # Instead of conditioning on the true frame, condition on the
      # generated frame at test time.
      t2 = t - 1
      input_frame = self.v_fut_.col_slice(t2 * self.num_dims_, (t2 + 1) * self.num_dims_)
      if self.binary_data_:
        input_frame.apply_sigmoid()
      elif self.relu_data_:
        input_frame.lower_bound(0)
  else:
    input_frame = None
  self.lstm_stack_fut_.Fprop(input_frame=input_frame, init_state=this_init_state,
                             output_frame=self.v_fut_.col_slice(t * self.num_dims_, (t + 1) * self.num_dims_),
                             copy_init_state=self.future_copy_init_state_)

In the paper, you write on page 6:

Next, we change the future predictor by making it conditional. We can see that this model makes sharper predictions.

But there is no hint whether it conditions on ground-truth frames or on previously predicted frames.

EDIT
I implemented a network similar to yours in TensorFlow to predict future frames (without the reconstruction branch, so no combo model; additionally, I'm using LSTMConv2D cells without peephole connections and squared error as the loss function). I'm getting roughly the same results: when I condition on the ground-truth frame during training, the model seems to learn no motion at all. But it works quite well when I condition on the previously predicted frame during training.
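
For context, the rough shape of such a model (a hedged sketch using today's Keras ConvLSTM2D, which has no peephole connections; hyperparameters are illustrative and this is not the original code):

import tensorflow as tf

# One-layer convolutional LSTM future predictor, no reconstruction branch;
# input is a sequence of 10 frames of 64x64 MovingMNIST.
model = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(64, kernel_size=3, padding="same",
                               return_sequences=True,
                               input_shape=(10, 64, 64, 1)),
    tf.keras.layers.Conv3D(1, kernel_size=3, padding="same",
                           activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")  # squared-error loss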

Check out these two videos:
videos.tar.gz

My personal guess is that when we train on ground-truth frames, the network only ever sees sharp edges, because all images in MovingMNIST have high contrast and sharp edges. When we validate/test this model, the first predicted image looks very good and is only slightly blurry. But from there on, the future predictor receives blurry images it has never seen before, so it cannot predict these frames correctly.
In contrast, when we condition on previously predicted frames during training as well, the model also learns how to handle and predict from blurry input images.
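
To make the two regimes concrete, here is a schematic rollout loop (a plain-Python sketch of the idea, not the repo's code; cell, targets, etc. are placeholders):

def rollout(cell, state, frame, num_steps, targets=None, teacher_force=False):
    # cell(frame, state) -> (next_frame, next_state)
    outputs = []
    for t in range(num_steps):
        frame, state = cell(frame, state)   # predict the next frame
        outputs.append(frame)
        if teacher_force and targets is not None:
            frame = targets[t]              # condition on the ground-truth frame
        # otherwise the next step conditions on the model's own
        # (possibly blurry) prediction, matching test-time behavior
    return outputs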

What do you think about that?


mansimov commented on June 26, 2024

But there is no hint whether it conditions on ground-truth frames or on previously predicted frames.

As far as I remember, we conditioned on ground-truth frames. Yes, the difference between the distributions of ground-truth and predicted frames is causing this issue. I also suggest conditioning on ground truth at the beginning of training and then slowly switching to previously predicted frames, as in http://arxiv.org/abs/1506.03099 (scheduled sampling).
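
A minimal sketch of that curriculum (the inverse-sigmoid decay is from the scheduled sampling paper; the function names are illustrative):

import math
import random

def ground_truth_probability(step, k=1000.0):
    # Inverse-sigmoid decay from http://arxiv.org/abs/1506.03099:
    # starts near 1 (always feed ground truth) and decays toward 0
    # (always feed the model's own predictions).
    return k / (k + math.exp(step / k))

def next_input(step, ground_truth_frame, predicted_frame):
    # Per step, feed ground truth with the current probability,
    # otherwise feed the previously predicted frame.
    if random.random() < ground_truth_probability(step):
        return ground_truth_frame
    return predicted_frame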

Btw, how far into the future are you predicting? It looks like more than 10 frames.


b3nk4n commented on June 26, 2024

Btw, how far into the future are you predicting? It looks like more than 10 frames.

During training, I predicted 10 frames using 1-layer LSTMConv2D cells. I trained for 50k iterations, and the batch size was, I guess, 24 on each of the 4 Titan X GPUs, so an effective batch size of 96.

After the model had more or less converged, I created this video on the test set with the future predictor extended to 50 frames, just to see how it behaves beyond its learned range.
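
In terms of the rollout sketch above, that just means running the loop longer at test time than during training (illustrative only):

# Trained with num_steps=10; at test time simply extend the horizon, so
# every frame beyond the first is conditioned on a generated frame.
test_outputs = rollout(cell, state, first_frame, num_steps=50)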

I think you are doing the same on your (old) website and predicting into the future for a very long time (100 frames?):
http://www.cs.toronto.edu/~nitish/unsupervised_video/ (the gif at the top)
I'm just trying to get roughly the same results as yours, but I cannot reproduce them with my own model in TensorFlow, nor with your code.

I'll try another run of your code, and in case the validation loss does not converge again, I'll post a screenshot right here...

Last but not least: Thank you for your time! :)

Best regards from Munich


b3nk4n commented on June 26, 2024

As promised, here is the screenshot:
[Screenshot: combo_1layer_40k]

The screenshot was taken after 40k iterations. I used the 1-layer combo model. All parameters are unchanged from the repository.

As you can see, the validation loss is 2600+. I know 40k iterations might not be enough training, but the last time I ran the code for about 650k iterations, the loss was about 2595. It seems to get stuck there somehow.

Edit:
Another one after 114.5k iterations:
[Screenshot: combo_1layer_114k]


b3nk4n commented on June 26, 2024

Yes, the difference between the distributions of ground-truth and predicted frames is causing this issue. I also suggest conditioning on ground truth at the beginning of training and then slowly switching to previously predicted frames, as in http://arxiv.org/abs/1506.03099 (scheduled sampling).

Thank you so much for suggesting this paper. I just read it, and it is exactly what I was looking for; it's highly valuable for my thesis! :)

