chnsh / dcrnn_pytorch
Diffusion Convolutional Recurrent Neural Network Implementation in PyTorch
License: MIT License
I was looking at the implementation of the DCGRUCell and spotted something that looks wrong. If we use 'fc' for the U and R gates, the sigmoid is applied twice (lines 121 and 96). Is this how it is intended to work, or is it a bug?
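To make the concern concrete, here is a minimal sketch of what a doubly applied sigmoid does to a gate value. It does not reproduce the repository's code; the helper names are hypothetical and it only assumes the double application described above. Because the first sigmoid already maps into (0, 1), the second one can only return values between 0.5 and about 0.731, so a gate could never fully close or open.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def double_sigmoid(x: float) -> float:
    """Sigmoid applied twice, as in the suspected code path (hypothetical helper)."""
    return sigmoid(sigmoid(x))


# Compare the reachable gate values for strongly negative, neutral and
# strongly positive pre-activations.
for x in (-10.0, 0.0, 10.0):
    print(f"x={x:+.1f}  once={sigmoid(x):.4f}  twice={double_sigmoid(x):.4f}")
```

If the gates are meant to switch information fully on or off, the narrow range of the second column is what would make the double application look like a bug rather than a design choice.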
Hi @chnsh,
Hope you're doing well. While testing your PyTorch implementation of DCRNN, I stumbled across a weird result: turning off the diffusion convolution with max_diffusion_step=0 gave a result that was considerably better than the one reported in the original DCRNN paper.
I tested this on the PeMS dataset with the following configuration:
link to file: model_config_issue.txt. In short, I simplified the model a bit: 15-minute forecast, 'laplacian' filter, no curriculum learning, and only 60 epochs. Of course, max_diffusion_step=0 is also set, to disable the convolution.
This resulted in val_mae: 1.2388 at the 60th epoch, as can be seen in the snippet or the full info.log. This is better than the full-blown published DCRNN, which reported val_mae: 1.38. The fact that even a simpler model without convolution beats the original DCRNN should raise concern about the soundness of this implementation. This is much the same problem as issue #3.
I'm not familiar with TensorFlow, which is why your implementation has helped me so much with my thesis. Because this observation could become a bottleneck for me down the road, I would like to pin down the reason for this behaviour as early as possible. I think you have more insight into the workings of the original TensorFlow implementation of DCRNN, so I would like to ask you to take another look at the problem. I have a gut feeling that it lies somewhere in the calculation of the error/loss.
Hope you can find the time to look into this issue. Thanks in advance.
Could you tell me why "_setup_graph()" must be called before loading the model?
I am a PyTorch beginner and hope to get your answer.
The code is here:
`
def load_model(self):
    self._setup_graph()
    assert os.path.exists('models/epo%d.tar' % self._epoch_num), 'Weights at epoch %d not found' % self._epoch_num
    checkpoint = torch.load('models/epo%d.tar' % self._epoch_num, map_location='cpu')
    self.dcrnn_model.load_state_dict(checkpoint['model_state_dict'])
    self._logger.info("Loaded model at {}".format(self._epoch_num))

def _setup_graph(self):
    with torch.no_grad():
        self.dcrnn_model = self.dcrnn_model.eval()
        val_iterator = self._data['val_loader'].get_iterator()
        for _, (x, y) in enumerate(val_iterator):
            x, y = self._prepare_data(x, y)
            output = self.dcrnn_model(x)
            break
`
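One possible explanation, offered as an assumption rather than a confirmed fact about this repository: if a model creates some of its parameters lazily, on the first forward pass, then those parameters do not exist yet when the checkpoint is loaded, and a warm-up forward pass such as the one in _setup_graph() would be needed first. The toy class below is hypothetical, uses no PyTorch, and only illustrates that mechanism.

```python
class LazyLayer:
    """Toy layer that allocates its weight only on the first forward call."""

    def __init__(self):
        self.params = {}

    def forward(self, x):
        if "weight" not in self.params:
            # The shape is only known once real data arrives.
            self.params["weight"] = [0.0] * len(x)
        return [w * v for w, v in zip(self.params["weight"], x)]

    def load_state_dict(self, state):
        # Mimics a strict load: keys that the layer does not own are an error.
        unexpected = set(state) - set(self.params)
        if unexpected:
            raise KeyError(f"unexpected keys: {sorted(unexpected)}")
        self.params.update(state)


layer = LazyLayer()
try:
    layer.load_state_dict({"weight": [0.5, 0.5]})
except KeyError as err:
    print("before warm-up:", err)

layer.forward([1.0, 2.0])  # warm-up pass creates the parameter
layer.load_state_dict({"weight": [0.5, 0.5]})
print("after warm-up:", layer.forward([1.0, 2.0]))
```

Under that assumption, running a single validation batch through the model is simply the cheapest way to make every parameter exist before the weights are restored.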
I have a question.
Why does the model proposed in this paper need to be pre-trained?
Can you tell me the difference between run_demo and dcrnn_train?
Thank you very much!
The decoder of the DCRNNModel in model/pytorch/dcrnn_model.py seems to be fed with its own output when use_curriculum_learning is set to False:
    for t in range(self.decoder_model.horizon):
        decoder_output, decoder_hidden_state = self.decoder_model(decoder_input,
                                                                  decoder_hidden_state)
        decoder_input = decoder_output
        outputs.append(decoder_output)
        if self.training and self.use_curriculum_learning:
            c = np.random.uniform(0, 1)
            if c < self._compute_sampling_threshold(batches_seen):
                decoder_input = labels[t]
However, I think it should be fed with the ground-truth labels, shouldn't it?
cf. DCRNN paper, page 4, 1st line
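For readers following the discussion, the loop above can be sketched in plain Python. This is not the repository's code: `step_fn`, `decode`, the `decay_steps` value and the inverse-sigmoid form of the threshold are assumptions made for illustration. It shows the two regimes in question: pure feedback of the decoder's own output, and scheduled sampling, where the ground truth replaces that output with a probability that decays as training progresses.

```python
import math
import random


def sampling_threshold(batches_seen: int, decay_steps: float) -> float:
    """Inverse-sigmoid decay k / (k + exp(i / k)); assumed form, for illustration."""
    return decay_steps / (decay_steps + math.exp(batches_seen / decay_steps))


def decode(first_input, labels, step_fn, training, use_curriculum,
           batches_seen=0, decay_steps=2000.0, rng=random):
    """Roll a one-step decoder forward over len(labels) steps."""
    outputs = []
    decoder_input = first_input
    for t in range(len(labels)):
        output = step_fn(decoder_input)
        outputs.append(output)
        decoder_input = output  # default: feed back the model's own prediction
        if training and use_curriculum:
            # With probability `threshold`, feed the ground truth instead.
            if rng.uniform(0, 1) < sampling_threshold(batches_seen, decay_steps):
                decoder_input = labels[t]
    return outputs


# Toy one-step "decoder" that adds 1 to its input.
print(decode(0, [10, 20, 30], lambda v: v + 1, training=True, use_curriculum=False))
```

With `use_curriculum=False` the ground truth is never consulted inside the loop, which is the behaviour the question is about. At inference time labels are unavailable, so feeding back predictions is the only option there in any case.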
Hi,
Thanks for the great work.
I am somewhat confused about whether I need additional files to run on the PEMS-BAY dataset. Specifically, data/sensor_graph seems to be present for the LA dataset but not for the BAY dataset. What additional steps are needed to train on the BAY dataset? Thank you.
I was wondering how you reversed the DCRNN normalization in METRLADatasetLoader to plot the speed on the y-axis in your figures: https://github.com/chnsh/DCRNN_PyTorch/blob/d92490b808ba5c5be2f23d427d96e9a56b066d7f/README.md#pytorch-results
I'm using the following PyTorch notebook: https://colab.research.google.com/drive/132hNQ0voOtTVk3I4scbD3lgmPTQub0KR?usp=sharing#scrollTo=EzrkqXPxFwIx
This is the plot from the notebook, where I want the y-axis to look like yours:
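As a rough sketch of what reversing the normalization could look like, assuming the loader applies a plain z-score (subtract the mean, divide by the standard deviation) and that both statistics are still available: the helper names and the speed values below are made up for illustration and are not taken from METRLADatasetLoader.

```python
def fit_scaler(values):
    """Return the mean and (population) standard deviation of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return mean, std


def normalize(values, mean, std):
    """Z-score: (x - mean) / std."""
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std):
    """Inverse of the z-score: x * std + mean."""
    return [v * std + mean for v in values]


speeds = [60.0, 65.0, 0.0, 70.0]  # made-up speeds in mph
mean, std = fit_scaler(speeds)
scaled = normalize(speeds, mean, std)
print(scaled)                          # values below the mean come out negative
print(denormalize(scaled, mean, std))  # back on the original speed scale
```

Applying `denormalize` to both predictions and targets before plotting would put the y-axis back in speed units, provided the same mean and standard deviation that were used for normalization are reused.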
I'm confused by the speed data, because some of the values are negative.
May I ask where the result diagrams of your program are drawn? I can't seem to find the corresponding code; running your program only prints a few lines of data.
When I run python dcrnn_train_pytorch.py --config_filename=data/model/dcrnn_la.yaml, I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.80 GiB total capacity; 4.36 GiB already allocated; 6.69 MiB free; 4.63 GiB reserved in total by PyTorch)
I was running:
python run_demo_pytorch.py --config_filename=data/model/pretrained/METR-LA/config.yaml
This is what I got:
Traceback (most recent call last):
  File "run_demo_pytorch.py", line 8, in <module>
    from model.pytorch.dcrnn_supervisor import DCRNNSupervisor
  File "C:\Users\mi\Desktop\DCRNN_PyTorch-pytorch_scratch\model\pytorch\dcrnn_supervisor.py", line 6, in <module>
    from torch.utils.tensorboard import SummaryWriter
ImportError: No module named 'torch.utils.tensorboard'
Hi, thank you very much for the implementation, it helps me a lot! BTW, I am really confused about the performance improvement of the PyTorch implementation over the original TensorFlow implementation. I would be very grateful if you could give me some guidance. Thank you in advance!
Hello, first of all I would like to thank you for the implementation. I've been trying to use it and noticed a big difference: when evaluating during training (e.g. every 10 steps), you only report the MAE (over all 12 time steps), while DCRNN reports MAE/MAPE/RMSE for every time step. I would be interested in seeing those numbers during training, or at least at the end of training, so that I can compare them with other models. Do you have any suggestions for how I could do this?
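One way this could be approached, sketched here without reference to the repository's actual evaluation code: collect the de-normalized predictions and targets per forecast step, then compute the three metrics for each step separately, skipping missing targets. The function name, the data layout (`preds[t]` is a flat list of values for step `t`) and the use of 0 as the missing-value marker are assumptions.

```python
import math


def horizon_metrics(preds, targets, null_val=0.0):
    """Masked MAE / RMSE / MAPE for each forecast step.

    preds[t] and targets[t] are flat lists holding every value for step t.
    Targets equal to null_val are treated as missing and skipped.
    """
    results = []
    for step_preds, step_targets in zip(preds, targets):
        pairs = [(p, y) for p, y in zip(step_preds, step_targets) if y != null_val]
        if not pairs:
            nan = float("nan")
            results.append({"mae": nan, "rmse": nan, "mape": nan})
            continue
        n = len(pairs)
        abs_err = [abs(p - y) for p, y in pairs]
        results.append({
            "mae": sum(abs_err) / n,
            "rmse": math.sqrt(sum(e * e for e in abs_err) / n),
            "mape": sum(e / abs(y) for e, (_, y) in zip(abs_err, pairs)) / n,
        })
    return results


# Two forecast steps, two values each; the 0.0 target is treated as missing.
for step, m in enumerate(horizon_metrics([[1.0, 2.0], [3.0, 5.0]],
                                         [[2.0, 0.0], [4.0, 4.0]]), start=1):
    print(f"step {step}: mae={m['mae']:.3f} rmse={m['rmse']:.3f} mape={m['mape']:.3f}")
```

Reporting, say, steps 3, 6 and 12 from such a list would correspond to the 15-, 30- and 60-minute horizons when the data is sampled every 5 minutes.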
I tried to use DCRNNSupervisor.evaluate() to calculate the test MAE, but found that the results differ substantially depending on the test batch size. I later found that the current implementation calculates the (masked) MAE per batch and then simply averages those values. This is not correct: the masked average is not a linear operation, so it cannot be computed per batch and then averaged across batches. The correct approach is to first collect all predictions (and the corresponding targets) and then calculate the (masked) MAE over the full set.
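A small numeric sketch of the effect described above, using made-up numbers and a hypothetical `masked_mae` helper (targets equal to 0 are treated as missing). When batches contain different numbers of valid targets, the mean of per-batch masked MAEs weights every batch equally, whereas the global masked MAE weights every valid value equally.

```python
def masked_mae(preds, targets, null_val=0.0):
    """Mean absolute error over the entries whose target is not null_val."""
    errors = [abs(p - y) for p, y in zip(preds, targets) if y != null_val]
    return sum(errors) / len(errors) if errors else 0.0


batches = [
    ([1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 0.0, 0.0]),  # 1 valid target, error 1
    ([1.0, 1.0, 1.0, 1.0], [4.0, 4.0, 4.0, 4.0]),  # 4 valid targets, error 3 each
]

# Per-batch MAE, then a plain average over batches.
mean_of_batch_maes = sum(masked_mae(p, y) for p, y in batches) / len(batches)

# All predictions and targets pooled first, then a single masked MAE.
all_preds = [v for p, _ in batches for v in p]
all_targets = [v for _, y in batches for v in y]
global_mae = masked_mae(all_preds, all_targets)

print(mean_of_batch_maes)  # 2.0
print(global_mae)          # 2.6
```

Changing the batch size changes how valid values are grouped, which is why the per-batch average moves while the pooled value stays the same.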
I was running the script of run_demo_pytorch.py using the command:
python run_demo_pytorch.py --config_filename=data/model/pretrained/METR-LA/config.yaml
This is what I got:
Traceback (most recent call last):
  File "run_demo_pytorch.py", line 33, in <module>
    run_dcrnn(args)
  File "run_demo_pytorch.py", line 18, in run_dcrnn
    supervisor = DCRNNSupervisor(adj_mx=adj_mx, **supervisor_config)
  File "/home/cyd/DCRNN_PyTorch/model/pytorch/dcrnn_supervisor.py", line 50, in __init__
    self.load_model()
  File "/home/cyd/DCRNN_PyTorch/model/pytorch/dcrnn_supervisor.py", line 93, in load_model
    assert os.path.exists('models/epo%d.tar' % self._epoch_num), 'Weights at epoch %d not found' % self._epoch_num
AssertionError: Weights at epoch 64 not found
Could you please upload 'models/epo64.tar' to the repo? I hope to reproduce the MAE results shown in the README. Thanks!
I test-ran the code in Google Colab, and so far I got the following output:
2024-02-06 18:09:14,584 - INFO - Log directory: data/model/dcrnn_DR_2_h_12_64-64_lr_0.01_bs_192_0206180913/
INFO:model.pytorch.dcrnn_supervisor:Log directory: data/model/dcrnn_DR_2_h_12_64-64_lr_0.01_bs_192_0206180913/
2024-02-06 18:09:35,626 - INFO - Model created
INFO:model.pytorch.dcrnn_supervisor:Model created
2024-02-06 18:09:38,948 - INFO - Loaded model at 50
INFO:model.pytorch.dcrnn_supervisor:Loaded model at 50
2024-02-06 18:09:40,199 - INFO - Start training ...
INFO:model.pytorch.dcrnn_supervisor:Start training ...
2024-02-06 18:09:40,204 - INFO - num_batches:125
INFO:model.pytorch.dcrnn_supervisor:num_batches:125
2024-02-06 18:18:39,040 - INFO - epoch complete
INFO:model.pytorch.dcrnn_supervisor:epoch complete
2024-02-06 18:18:39,045 - INFO - evaluating now!
INFO:model.pytorch.dcrnn_supervisor:evaluating now!
2024-02-06 18:19:24,359 - INFO - Epoch [50/100] (6375) train_mae: 1.9753, val_mae: 2.9198, lr: 0.010000, 584.1s
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:432: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
warnings.warn("To get the last learning rate computed by the scheduler, "
INFO:model.pytorch.dcrnn_supervisor:Epoch [50/100] (6375) train_mae: 1.9753, val_mae: 2.9198, lr: 0.010000, 584.1s
2024-02-06 18:19:24,384 - INFO - Saved model at 50
INFO:model.pytorch.dcrnn_supervisor:Saved model at 50
2024-02-06 18:19:24,391 - INFO - Val loss decrease from inf to 2.9198, saving to models/epo50.tar
INFO:model.pytorch.dcrnn_supervisor:Val loss decrease from inf to 2.9198, saving to models/epo50.tar
2024-02-06 18:28:27,688 - INFO - epoch complete
INFO:model.pytorch.dcrnn_supervisor:epoch complete
2024-02-06 18:28:27,692 - INFO - evaluating now!
INFO:model.pytorch.dcrnn_supervisor:evaluating now!
...
2024-02-06 21:27:25,031 - INFO - Epoch [69/100] (8750) train_mae: 1.9429, val_mae: 2.9616, lr: 0.000100, 589.0s
INFO:model.pytorch.dcrnn_supervisor:Epoch [69/100] (8750) train_mae: 1.9429, val_mae: 2.9616, lr: 0.000100, 589.0s
2024-02-06 21:28:55,730 - INFO - Epoch [69/100] (8750) train_mae: 1.9429, test_mae: 3.2499, lr: 0.000100, 589.0s
INFO:model.pytorch.dcrnn_supervisor:Epoch [69/100] (8750) train_mae: 1.9429, test_mae: 3.2499, lr: 0.000100, 589.0s
2024-02-06 21:37:59,490 - INFO - epoch complete
INFO:model.pytorch.dcrnn_supervisor:epoch complete
2024-02-06 21:37:59,494 - INFO - evaluating now!
INFO:model.pytorch.dcrnn_supervisor:evaluating now!
2024-02-06 21:38:44,803 - INFO - Epoch [70/100] (8875) train_mae: 1.9318, val_mae: 2.9033, lr: 0.001000, 589.1s
INFO:model.pytorch.dcrnn_supervisor:Epoch [70/100] (8875) train_mae: 1.9318, val_mae: 2.9033, lr: 0.001000, 589.1s
2024-02-06 21:38:44,823 - INFO - Saved model at 70
INFO:model.pytorch.dcrnn_supervisor:Saved model at 70
2024-02-06 21:38:44,827 - INFO - Val loss decrease from 2.9198 to 2.9033, saving to models/epo70.tar
INFO:model.pytorch.dcrnn_supervisor:Val loss decrease from 2.9198 to 2.9033, saving to models/epo70.tar
2024-02-06 21:47:48,164 - INFO - epoch complete
INFO:model.pytorch.dcrnn_supervisor:epoch complete
2024-02-06 21:47:48,169 - INFO - evaluating now!
INFO:model.pytorch.dcrnn_supervisor:evaluating now!
2024-02-06 21:48:33,495 - INFO - Epoch [71/100] (9000) train_mae: 1.9262, val_mae: 2.9057, lr: 0.001000, 588.7s
INFO:model.pytorch.dcrnn_supervisor:Epoch [71/100] (9000) train_mae: 1.9262, val_mae: 2.9057, lr: 0.001000, 588.7s
2024-02-06 21:57:36,690 - INFO - epoch complete
INFO:model.pytorch.dcrnn_supervisor:epoch complete
2024-02-06 21:57:36,698 - INFO - evaluating now!
INFO:model.pytorch.dcrnn_supervisor:evaluating now!
...
2024-02-06 23:57:38,667 - INFO - Epoch [84/100] (10625) train_mae: 1.9336, val_mae: 2.9073, lr: 0.000100, 588.8s
INFO:model.pytorch.dcrnn_supervisor:Epoch [84/100] (10625) train_mae: 1.9336, val_mae: 2.9073, lr: 0.000100, 588.8s
2024-02-07 00:06:42,161 - INFO - epoch complete
INFO:model.pytorch.dcrnn_supervisor:epoch complete
2024-02-07 00:06:42,165 - INFO - evaluating now!
INFO:model.pytorch.dcrnn_supervisor:evaluating now!
2024-02-07 00:07:27,510 - INFO - Epoch [85/100] (10750) train_mae: 1.9430, val_mae: 2.9063, lr: 0.000100, 588.8s
INFO:model.pytorch.dcrnn_supervisor:Epoch [85/100] (10750) train_mae: 1.9430, val_mae: 2.9063, lr: 0.000100, 588.8s
The validation and test results don't seem to be improving, and the lr stayed unchanged. Is this expected, and will they get better before epoch 100? Or is something wrong?
Thanks!
Hi, many thanks for converting the TensorFlow implementation of DCRNN to PyTorch, it helped me a lot.
I noticed that when resuming a model, the learning-rate scheduling is being reset. Any idea why this happens? I uploaded a picture of a test case below.
If you need any further details, let me know.
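A possible mechanism, sketched with made-up numbers and no claim about what this repository actually does: a multi-step schedule is a pure function of the epoch counter, so if that counter restarts at 0 when training resumes, the learning rate jumps back to its initial value. The base rate, milestones and resume epoch below are illustrative only, and the helper names are hypothetical.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Multi-step decay in closed form: one factor of gamma per milestone passed."""
    return base_lr * gamma ** bisect_right(milestones, epoch)


BASE_LR, MILESTONES, GAMMA = 0.01, [20, 30, 40], 0.1  # illustrative values only
RESUME_EPOCH = 50


def lr_after_resume(epochs_since_resume, restore_counter):
    """Learning rate seen after resuming, with or without restoring the epoch counter."""
    start = RESUME_EPOCH if restore_counter else 0
    return multistep_lr(BASE_LR, MILESTONES, GAMMA, start + epochs_since_resume)


print(lr_after_resume(0, restore_counter=False))  # schedule starts over at the base rate
print(lr_after_resume(0, restore_counter=True))   # schedule continues where it stopped
```

In PyTorch terms, the usual remedies are to save and reload the scheduler's `state_dict()` together with the model and optimizer, or to construct the scheduler with `last_epoch` set to the resumed epoch; whether either applies here depends on how the checkpoint is written.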
Which part of the code generates the figures in the results section? Any help on how to generate the plots would be appreciated.
Hi, thank you for sharing the pytorch code.
Is it possible to release the implementation details of Figure 9?
Thank you
Thanks for your great work.
Compared to the original TensorFlow results, this PyTorch version performs better.
Could you explain the reason for this boost?