chnsh / dcrnn_pytorch

432.0 432.0 110.0 207.98 MB

Diffusion Convolutional Recurrent Neural Network Implementation in PyTorch

License: MIT License

Python 100.00%

cnn graph-neural-networks

dcrnn_pytorch's People

victorsoda karldenken wfccross lijiangguo abel0828 mcdragon anupam312nwd nikolausn tonylibing buffoon-n helen9977 liu6zijian sakastlord masterwall vahidrostami gm3g11 bczhu ibulu 2017wxyzwxyz bzp92 asfeng dawnywu feifan456 yang-x-c ckj-123 xiaoye77 shuowang-ai tacchan7412 yuanhaitao acjidi zay113 zhaoyuanm xhfei1224 yachuan baobunuo cylin-gn liweiowl jinyuanliu23 akashshah59 drownfish19 chaoshangcs qop2018 aishik-rakshit dongxiaw z223i vgsatorras saipenamakuri leonardyoung haowenlin xutete wangjie-97 zzw-zwzhang p-patil jnasasira amirunpri2018 wisdo0 techthiyanes bbrangeo ysun57 oovertone ninec-9 achendian1 tanwimallick gabrieleborg26 kvmduc xiaotrong hero-han skang132 chris-hzc asclepiusinformatica selected997 drwxyh minajwsy zhenyuanjin xiaohesoft tony2016edu mengfei25 shadow1999k qsimeon katiexin89 climbworm fabiofarina calonca tvhead98 szpal00 xiong-make zzxfaith mehulbhuradia whoa611 saeedrahmani ligong9527 littlebird-maker suigenerisloong max-chen2020 aiyaf kong102455 silviaaasong ryaneli l1uw3n jerronl

dcrnn_pytorch's Issues

Problem with curriculum learning?

The decoder of DCRNNModel in model/pytorch/dcrnn_model.py seems to be fed its own output when use_curriculum_learning is set to False:

for t in range(self.decoder_model.horizon):
    decoder_output, decoder_hidden_state = self.decoder_model(decoder_input,
                                                              decoder_hidden_state)
    decoder_input = decoder_output
    outputs.append(decoder_output)
    if self.training and self.use_curriculum_learning:
        c = np.random.uniform(0, 1)
        if c < self._compute_sampling_threshold(batches_seen):
            decoder_input = labels[t]

However, I think it should be fed the ground-truth labels, shouldn't it?
Cf. the DCRNN paper, page 4, first line.
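
The choice the loop above makes at each decoding step can be sketched in isolation. This is a minimal, framework-free sketch of scheduled sampling with an inverse-sigmoid decay. The function names, the `cl_decay_steps` default, and the exact decay formula are assumptions for illustration and are not taken from the repository.

```python
import math
import random


def sampling_threshold(batches_seen, cl_decay_steps=2000):
    """Probability of feeding the ground truth; decays as training progresses.

    Inverse-sigmoid decay; cl_decay_steps is a hypothetical default.
    """
    return cl_decay_steps / (cl_decay_steps + math.exp(batches_seen / cl_decay_steps))


def next_decoder_input(prediction, label, batches_seen, training,
                       use_curriculum_learning, rng=random):
    """Return the label with the scheduled probability, else the model's own prediction."""
    if training and use_curriculum_learning:
        if rng.random() < sampling_threshold(batches_seen):
            return label
    # Evaluation mode, or curriculum learning disabled: always use the prediction.
    return prediction


early = sampling_threshold(0)        # close to 1: mostly ground truth
late = sampling_threshold(100_000)   # close to 0: mostly own predictions
```

Under this sketch, disabling curriculum learning means the decoder consumes only its own predictions, which matches the behavior described in the question.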

Reason of huge improvement in pytorch version

Thanks for your great work.
Compared to the original TensorFlow results, this PyTorch version performs better.
Could you explain the reason for this boost?

About the performance improvement compared with Tensorflow implementation

Hi, thank you very much for the implementation, it helps me a lot! BTW, I am really confused about the performance improvement of the PyTorch implementation over the original TensorFlow implementation. I would be very grateful if you could give me some guidance. Thank you in advance!

CUDA out of memory error

When I run python dcrnn_train_pytorch.py --config_filename=data/model/dcrnn_la.yaml I get the error
RuntimeError:CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.80 GiB total capacity; 4.36 GiB already allocated; 6.69 MiB free; 4.63 GiB reserved in total by PyTorch)

Weights at epoch 64 not found

I was running run_demo_pytorch.py with the command:
python run_demo_pytorch.py --config_filename=data/model/pretrained/METR-LA/config.yaml

This is what I got:
Traceback (most recent call last):
  File "run_demo_pytorch.py", line 33, in <module>
    run_dcrnn(args)
  File "run_demo_pytorch.py", line 18, in run_dcrnn
    supervisor = DCRNNSupervisor(adj_mx=adj_mx, **supervisor_config)
  File "/home/cyd/DCRNN_PyTorch/model/pytorch/dcrnn_supervisor.py", line 50, in __init__
    self.load_model()
  File "/home/cyd/DCRNN_PyTorch/model/pytorch/dcrnn_supervisor.py", line 93, in load_model
    assert os.path.exists('models/epo%d.tar' % self._epoch_num), 'Weights at epoch %d not found' % self._epoch_num
AssertionError: Weights at epoch 64 not found

Could you please upload the 'models/epo64.tar' to the repo? I hope to reproduce the MAE results demonstrated in README. Thx!

Node connectivity or sensors interaction

Hi, thank you for sharing the pytorch code.
Is it possible to release the implementation details of Figure 9?
Thank you

About Missing Data (0 values)

There is a lot of missing data (entries with value 0) in both datasets (METR-LA and PEMS-BAY). Did you adopt any interpolation approach (such as linear interpolation) to fill in these missing values?

Double sigmoid inside RNNCell

I was looking at the implementation of the DCGRUCell, and I've spotted something out of order. If we are using 'fc' for U and R gates we are going to apply the sigmoid twice (line 121 and 96). Is this how it's intended to work, or is there a bug?
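
Why a double sigmoid would matter can be shown numerically. The sketch below is illustrative only and does not reproduce the repository's cell; whether the referenced lines really apply the activation twice is not verified here. It shows that a gate squashed twice can only take values between sigmoid(0) = 0.5 and sigmoid(1) ≈ 0.731, so it can never fully close or fully open.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


inputs = [-10.0, -1.0, 0.0, 1.0, 10.0]

# A single sigmoid spans almost the whole (0, 1) range.
single = [sigmoid(x) for x in inputs]

# A second sigmoid receives inputs in (0, 1), so its outputs stay in (0.5, 0.731).
double = [sigmoid(sigmoid(x)) for x in inputs]

for x, s, d in zip(inputs, single, double):
    print("x=%6.1f  sigmoid=%.4f  sigmoid(sigmoid)=%.4f" % (x, s, d))
```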

Run the Pre-trained Model on METR-LA

I have a question.
For the model proposed in this paper, why do we need a pre-trained model?
Can you tell me the difference between run_demo and dcrnn_train?
Thank you very much!

Using no convolution better than DCRNN paper result?

Hi @chnsh,

Hope you're doing well. While testing your pytorch implementation of the DCRNN code, I stumbled across a weird result. When turning off convolution with max_diffusion_step=0 the result was greatly improved over the original dcrnn paper result.

I tested this on the PeMS dataset, configurations are:

link to file: model_config_issue.txt. In short, I simplified the model a bit by doing the 15min forecast, using 'laplacian' filter, no curriculum learning and only 60 epochs. Of course also max_diffusion_step=0 is used to discard using a convolution.

This resulted in val_mae: 1.2388 at 60th epoch as can be seen in the snippet or full info.log. This result is better than the full blown published DCRNN which reported val_mae: 1.38. The fact that even a simpler model without convolution is better than the original DCRNN should raise concern about the soundness of this implementation. This is kind of the same problem as issue #3.

I'm not familiar with TensorFlow, which is why your implementation has helped me so much with my thesis. Because this observation could become a bottleneck for me down the road, I would like to pin down the reason for this behaviour as early as possible. I think you have more insight into the workings of the original TensorFlow implementation of DCRNN, so I would like to ask you to take another look to find the problem. I have a gut feeling the problem lies somewhere in the calculation of the error/loss.

Hope you can find the time to look into this issue. Thanks in advance.

PEMS-BAY

Hi,

Thanks for the great work.

I am somewhat confused about whether I need more files to run the PEMS-BAY dataset. Specifically, data/sensor_graph seems to be present for the LA dataset but not for the BAY dataset. What additional steps are needed to train on the BAY dataset? Thank you.

How were the figures generated?

I was wondering how you reversed the DCRNN normalization in METRLADatasetLoader to plot the speed on the y-axis in your figures: https://github.com/chnsh/DCRNN_PyTorch/blob/d92490b808ba5c5be2f23d427d96e9a56b066d7f/README.md#pytorch-results

I'm using the following PyTorch notebook: https://colab.research.google.com/drive/132hNQ0voOtTVk3I4scbD3lgmPTQub0KR?usp=sharing#scrollTo=EzrkqXPxFwIx

Your chart:

The one in the notebook where I want the y-axis to be like yours:
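
For readers with the same question, here is a minimal sketch of undoing a z-score normalization, assuming the data was scaled as (value - mean) / std. The class below and its statistics are hypothetical; the mean and std must be the ones actually used to normalize the data, which this sketch does not look up.

```python
class StandardScaler:
    """Z-score scaler; mean and std are the statistics used to normalize the data."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values):
        # Undo the scaling so predictions are back in the original unit (speed).
        return [v * self.std + self.mean for v in values]


scaler = StandardScaler(mean=54.0, std=20.0)  # hypothetical statistics
speeds = [65.0, 30.0, 0.0]
normalized = scaler.transform(speeds)
restored = scaler.inverse_transform(normalized)
```

Plotting `restored` instead of `normalized` puts the y-axis back in speed units.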

A problem in gcrnn_train_pytorch

I ran into this problem when running the file named gcrnn_train_pytorch. Could you tell me how to deal with this issue?

ImportError

I was running:

python run_demo_pytorch.py --config_filename=data/model/pretrained/METR-LA/config.yaml

This is what I got:

Traceback (most recent call last):
  File "run_demo_pytorch.py", line 8, in <module>
    from model.pytorch.dcrnn_supervisor import DCRNNSupervisor
  File "C:\Users\mi\Desktop\DCRNN_PyTorch-pytorch_scratch\model\pytorch\dcrnn_supervisor.py", line 6, in <module>
    from torch.utils.tensorboard import SummaryWriter
ImportError: No module named 'torch.utils.tensorboard'
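
`torch.utils.tensorboard` was added in PyTorch 1.1, so this error typically points to an older PyTorch install. Upgrading PyTorch is the direct fix. As a hedged alternative, the sketch below looks for a `SummaryWriter` in several candidate modules. The helper name `find_summary_writer` is hypothetical, and falling back to the separate `tensorboardX` package is an assumption, not something the repository does.

```python
import importlib


def find_summary_writer(module_names=("torch.utils.tensorboard", "tensorboardX")):
    """Return the first importable SummaryWriter class, or None if none is available."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Module (or one of its dependencies) is missing; try the next candidate.
            continue
        writer = getattr(module, "SummaryWriter", None)
        if writer is not None:
            return writer
    return None


SummaryWriter = find_summary_writer()
if SummaryWriter is None:
    print("No SummaryWriter available; upgrade PyTorch or install tensorboardX.")
```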

Where is the resulting graph of the program running

May I ask where the result figure of your program is drawn? I can't seem to find the corresponding code; running the program only prints a few lines of data.

Lr scheduler resets when resuming model

Hi, many thanks for converting the tensorflow implementation of DCRNN to pytorch, it helped me a lot.

I noticed that when resuming a model, the learning-rate scheduling is being reset. Any idea why this happens? I uploaded a picture of a test case below.
If you need any further details, let me know.
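
One possible explanation, not confirmed by the source, is that the scheduler is rebuilt on resume while its epoch counter starts again at zero. The sketch below uses the closed form of a multi-step decay schedule to show the effect. The base rate, milestones, and decay factor are hypothetical values.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate after `epoch` epochs: one decay by gamma per milestone passed."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


base_lr, milestones, gamma = 0.01, [20, 30, 40], 0.1  # hypothetical schedule
resume_epoch = 35

# What the learning rate should be when training resumes at epoch 35.
expected = multistep_lr(base_lr, milestones, gamma, resume_epoch)

# What it becomes if the scheduler's epoch counter restarts at 0.
after_reset = multistep_lr(base_lr, milestones, gamma, 0)
```

In PyTorch, schedulers expose `state_dict()` and `load_state_dict()`, and accept a `last_epoch` argument, so the counter can be saved in the checkpoint and restored along with the model and optimizer state.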

Figure generation

Which part of the code generates the figures in the results section? Any help on how to generate the plots would be appreciated.

Testing results during training

Hello, first I would like to thank you for the implementation. While using it I noticed a big difference from the original: when evaluating during training (e.g. every 10 steps), you report only the MAE (over all 12 time steps), while DCRNN reports MAE/MAPE/RMSE for every time step. I would like to see those numbers during training, or at least at the end of training, so I can compare them with other models. Do you have any suggestions on how I could do this?
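
A minimal sketch of per-horizon reporting follows, assuming predictions and targets have already been converted back to speed units and are arranged as one list per forecast step. The function names and the choice of 0 as the missing-value marker are assumptions for illustration, not the repository's API.

```python
import math


def masked_metrics(preds, labels, null_val=0.0):
    """Return (MAE, MAPE, RMSE) over entries whose label is not the missing marker."""
    pairs = [(p, y) for p, y in zip(preds, labels) if y != null_val]
    if not pairs:
        return float("nan"), float("nan"), float("nan")
    n = len(pairs)
    mae = sum(abs(p - y) for p, y in pairs) / n
    mape = sum(abs(p - y) / abs(y) for p, y in pairs) / n
    rmse = math.sqrt(sum((p - y) ** 2 for p, y in pairs) / n)
    return mae, mape, rmse


def per_horizon_metrics(preds, labels, null_val=0.0):
    """preds and labels are indexed as [horizon step][sample]."""
    return [masked_metrics(p, y, null_val) for p, y in zip(preds, labels)]


# Two forecast steps, three samples each; the 0.0 label is treated as missing.
preds = [[60.0, 50.0, 40.0], [58.0, 45.0, 42.0]]
labels = [[62.0, 50.0, 0.0], [60.0, 50.0, 40.0]]
report = per_horizon_metrics(preds, labels)

for step, (mae, mape, rmse) in enumerate(report, start=1):
    print("Horizon %02d, MAE: %.2f, MAPE: %.4f, RMSE: %.2f" % (step, mae, mape, rmse))
```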

About the function "_setup_graph()"

Could you tell me why _setup_graph() must be called before loading the model?
I am a PyTorch beginner and hope to get your answer.
The code is here:

def load_model(self):
    self._setup_graph()
    assert os.path.exists('models/epo%d.tar' % self._epoch_num), 'Weights at epoch %d not found' % self._epoch_num
    checkpoint = torch.load('models/epo%d.tar' % self._epoch_num, map_location='cpu')
    self.dcrnn_model.load_state_dict(checkpoint['model_state_dict'])
    self._logger.info("Loaded model at {}".format(self._epoch_num))

def _setup_graph(self):
    with torch.no_grad():
        self.dcrnn_model = self.dcrnn_model.eval()

        val_iterator = self._data['val_loader'].get_iterator()

        for _, (x, y) in enumerate(val_iterator):
            x, y = self._prepare_data(x, y)
            output = self.dcrnn_model(x)
            break
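
One plausible reason, stated here as an assumption rather than a confirmed fact about this repository, is that the model creates some of its parameters lazily during the first forward pass. In that case the parameter names do not exist until one batch has been run, and loading a checkpoint beforehand fails. The toy class below is hypothetical and only illustrates that mechanism.

```python
class LazyModel:
    """Toy stand-in for a network that creates its weights on the first forward pass."""

    def __init__(self):
        self.params = {}  # empty until forward() has run once

    def forward(self, x):
        if "weight" not in self.params:
            # The parameter shape depends on the input, so it is created lazily.
            self.params["weight"] = [0.0] * len(x)
        return sum(w * v for w, v in zip(self.params["weight"], x))

    def load_state_dict(self, state):
        unexpected = sorted(set(state) - set(self.params))
        if unexpected:
            raise KeyError("unexpected keys: %s" % unexpected)
        self.params.update(state)


checkpoint = {"weight": [0.5, -0.5]}

fresh = LazyModel()
try:
    fresh.load_state_dict(checkpoint)
    loaded_without_forward = True
except KeyError:
    loaded_without_forward = False  # parameters do not exist yet

warmed = LazyModel()
warmed.forward([1.0, 2.0])          # one dummy batch registers the parameters
warmed.load_state_dict(checkpoint)  # now the keys line up
result = warmed.forward([1.0, 2.0])
```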

The formulation of Diffusion Convolution is wrong

In DCRNN, there is no diffusion process in the graph, because we cannot find any restart probability alpha in the code. Here, I offer a correct Diffusion Matrix as below:

it doesn't seem to improve in the test run

I test-ran the code in Google Colab and so far I got the following output:



2024-02-06 18:09:14,584 - INFO - Log directory: data/model/dcrnn_DR_2_h_12_64-64_lr_0.01_bs_192_0206180913/

INFO:model.pytorch.dcrnn_supervisor:Log directory: data/model/dcrnn_DR_2_h_12_64-64_lr_0.01_bs_192_0206180913/

2024-02-06 18:09:35,626 - INFO - Model created

INFO:model.pytorch.dcrnn_supervisor:Model created

2024-02-06 18:09:38,948 - INFO - Loaded model at 50

INFO:model.pytorch.dcrnn_supervisor:Loaded model at 50

2024-02-06 18:09:40,199 - INFO - Start training ...

INFO:model.pytorch.dcrnn_supervisor:Start training ...

2024-02-06 18:09:40,204 - INFO - num_batches:125

INFO:model.pytorch.dcrnn_supervisor:num_batches:125

2024-02-06 18:18:39,040 - INFO - epoch complete

INFO:model.pytorch.dcrnn_supervisor:epoch complete

2024-02-06 18:18:39,045 - INFO - evaluating now!

INFO:model.pytorch.dcrnn_supervisor:evaluating now!

2024-02-06 18:19:24,359 - INFO - Epoch [50/100] (6375) train_mae: 1.9753, val_mae: 2.9198, lr: 0.010000, 584.1s

/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:432: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
INFO:model.pytorch.dcrnn_supervisor:Epoch [50/100] (6375) train_mae: 1.9753, val_mae: 2.9198, lr: 0.010000, 584.1s

2024-02-06 18:19:24,384 - INFO - Saved model at 50

INFO:model.pytorch.dcrnn_supervisor:Saved model at 50

2024-02-06 18:19:24,391 - INFO - Val loss decrease from inf to 2.9198, saving to models/epo50.tar

INFO:model.pytorch.dcrnn_supervisor:Val loss decrease from inf to 2.9198, saving to models/epo50.tar

2024-02-06 18:28:27,688 - INFO - epoch complete

INFO:model.pytorch.dcrnn_supervisor:epoch complete

2024-02-06 18:28:27,692 - INFO - evaluating now!

INFO:model.pytorch.dcrnn_supervisor:evaluating now!
...
2024-02-06 21:27:25,031 - INFO - Epoch [69/100] (8750) train_mae: 1.9429, val_mae: 2.9616, lr: 0.000100, 589.0s

INFO:model.pytorch.dcrnn_supervisor:Epoch [69/100] (8750) train_mae: 1.9429, val_mae: 2.9616, lr: 0.000100, 589.0s

2024-02-06 21:28:55,730 - INFO - Epoch [69/100] (8750) train_mae: 1.9429, test_mae: 3.2499,  lr: 0.000100, 589.0s

INFO:model.pytorch.dcrnn_supervisor:Epoch [69/100] (8750) train_mae: 1.9429, test_mae: 3.2499,  lr: 0.000100, 589.0s

2024-02-06 21:37:59,490 - INFO - epoch complete

INFO:model.pytorch.dcrnn_supervisor:epoch complete

2024-02-06 21:37:59,494 - INFO - evaluating now!

INFO:model.pytorch.dcrnn_supervisor:evaluating now!

2024-02-06 21:38:44,803 - INFO - Epoch [70/100] (8875) train_mae: 1.9318, val_mae: 2.9033, lr: 0.001000, 589.1s

INFO:model.pytorch.dcrnn_supervisor:Epoch [70/100] (8875) train_mae: 1.9318, val_mae: 2.9033, lr: 0.001000, 589.1s

2024-02-06 21:38:44,823 - INFO - Saved model at 70

INFO:model.pytorch.dcrnn_supervisor:Saved model at 70

2024-02-06 21:38:44,827 - INFO - Val loss decrease from 2.9198 to 2.9033, saving to models/epo70.tar

INFO:model.pytorch.dcrnn_supervisor:Val loss decrease from 2.9198 to 2.9033, saving to models/epo70.tar

2024-02-06 21:47:48,164 - INFO - epoch complete

INFO:model.pytorch.dcrnn_supervisor:epoch complete

2024-02-06 21:47:48,169 - INFO - evaluating now!

INFO:model.pytorch.dcrnn_supervisor:evaluating now!

2024-02-06 21:48:33,495 - INFO - Epoch [71/100] (9000) train_mae: 1.9262, val_mae: 2.9057, lr: 0.001000, 588.7s

INFO:model.pytorch.dcrnn_supervisor:Epoch [71/100] (9000) train_mae: 1.9262, val_mae: 2.9057, lr: 0.001000, 588.7s

2024-02-06 21:57:36,690 - INFO - epoch complete

INFO:model.pytorch.dcrnn_supervisor:epoch complete

2024-02-06 21:57:36,698 - INFO - evaluating now!

INFO:model.pytorch.dcrnn_supervisor:evaluating now!
...
2024-02-06 23:57:38,667 - INFO - Epoch [84/100] (10625) train_mae: 1.9336, val_mae: 2.9073, lr: 0.000100, 588.8s

INFO:model.pytorch.dcrnn_supervisor:Epoch [84/100] (10625) train_mae: 1.9336, val_mae: 2.9073, lr: 0.000100, 588.8s

2024-02-07 00:06:42,161 - INFO - epoch complete

INFO:model.pytorch.dcrnn_supervisor:epoch complete

2024-02-07 00:06:42,165 - INFO - evaluating now!

INFO:model.pytorch.dcrnn_supervisor:evaluating now!

2024-02-07 00:07:27,510 - INFO - Epoch [85/100] (10750) train_mae: 1.9430, val_mae: 2.9063, lr: 0.000100, 588.8s

INFO:model.pytorch.dcrnn_supervisor:Epoch [85/100] (10750) train_mae: 1.9430, val_mae: 2.9063, lr: 0.000100, 588.8s

The validation and test results don't seem to be improving and the lr stayed unchanged. Is this expected, and will they get better before epoch 100? Or is something wrong?
Thanks!

Why are some speed data negative?

I'm confused by the speed data: some of the values are negative.

Test error calculation is not correct

I tried to use DCRNNSupervisor.evaluate() to calculate the test MAE, but found that the results differ substantially with different test batch sizes. The current implementation calculates the (masked) MAE per batch and then simply averages the per-batch values. This is not correct: the masked average is not a linear operation, so it cannot be computed per batch and then averaged. The correct approach is to first collect all predictions (and the corresponding targets) and then calculate the (masked) MAE over the full set.
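
The effect described above can be reproduced with a small pure-Python sketch. The function names and the use of 0 as the missing-value marker are assumptions for illustration and do not mirror the repository's loss code. When batches contain different numbers of valid entries, the mean of per-batch MAEs no longer equals the MAE over all valid entries.

```python
def masked_mae(preds, labels, null_val=0.0):
    """MAE over the entries whose label is not the missing-value marker."""
    errors = [abs(p - y) for p, y in zip(preds, labels) if y != null_val]
    return sum(errors) / len(errors) if errors else float("nan")


def mean_of_batch_maes(preds, labels, batch_size):
    """Compute the masked MAE per batch, then average the per-batch values."""
    scores = [
        masked_mae(preds[i:i + batch_size], labels[i:i + batch_size])
        for i in range(0, len(preds), batch_size)
    ]
    return sum(scores) / len(scores)


preds = [10.0, 20.0, 30.0, 40.0]
labels = [12.0, 0.0, 33.0, 44.0]  # the second label is missing

global_mae = masked_mae(preds, labels)              # errors 2, 3, 4 over 3 valid entries
batched_mae = mean_of_batch_maes(preds, labels, 2)  # batches have 1 and 2 valid entries

print("global: %.2f, mean of per-batch: %.2f" % (global_mae, batched_mae))
```

With these numbers the global value is 3.00 while the per-batch average is 2.75, because the first batch's single valid entry is weighted as heavily as the two valid entries in the second batch.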
