Code of NeurIPS paper: arxiv.org/abs/2302.08224

Home Page: https://arxiv.org/abs/2302.08224

License: MIT License


difusco's Introduction

DIFUSCO: Graph-based Diffusion Solvers for Combinatorial Optimization

See "DIFUSCO: Graph-based Diffusion Solvers for Combinatorial Optimization" for the paper associated with this codebase.

Figure: DIFUSCO overview (Gaussian and Bernoulli diffusion variants).

Setup

conda env create -f environment.yml
conda activate difusco

Running TSP experiments requires building the additional Cython extension used to merge the diffusion heatmap results:

cd difusco/utils/cython_merge
python setup.py build_ext --inplace
cd -

Codebase Structure

  • difusco/pl_meta_model.py: the meta PyTorch Lightning model used for training and evaluation.
  • difusco/pl_tsp_model.py: the model for the TSP problem.
  • difusco/pl_mis_model.py: the model for the MIS problem.
  • difusco/train.py: the entry point for training and evaluation.

Data

Please check the data folder.

Reproduction

Please check the reproducing_scripts for more details.

Pretrained Checkpoints

Please download the pretrained model checkpoints from here.

Reference

If you find this codebase useful, please consider citing the paper:

@inproceedings{
    sun2023difusco,
    title={{DIFUSCO}: Graph-based Diffusion Solvers for Combinatorial Optimization},
    author={Zhiqing Sun and Yiming Yang},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=JV8Ff0lgVV}
}


difusco's Issues

TSP Validation Datasets

Dear authors, thank you for your excellent research. I am trying to reproduce your TSP experiments, but it is unclear how to obtain the validation datasets. I followed the README.md in the data folder, downloaded the test datasets, and generated the train datasets with your code, so I now have tsp{50,100,500,1000,10000}_{train,test}_concorde.txt, but I am missing the validation splits. I couldn't figure out the exact process from your reproducing_scripts.md. Could you please help me with this?

The download address of the dataset

Hi, I admire your work very much, but I have run into some trouble while trying to implement the algorithm. Could you provide a download address for the datasets (tsp500 to tsp10000)?

Question of GPU Memory

Hi Edward,

First of all, I appreciate your contribution to this area. I recently read your paper carefully and have been trying to reproduce your algorithm.

While doing so, I found that my GPUs do not have enough memory.
I tried to reproduce the TSP-1000 case on an RTX 3090 (24 GB) or an A40 (48 GB); for reference, a Tesla V100 has 32 GB.

In both cases, the GPU memory cannot accommodate the documented per-GPU batch size of 8 (the largest batch size that fits is 1 on the RTX 3090 and 3 on the A40 in my setup).

As I understand it, the model has 12 layers with 256 hidden units per layer, and the fp16 option is not enabled in the reproduction code.
Is my understanding of the reproduction setup correct?
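
(For reference, a minimal sketch of how fp16 mixed precision could be enabled in a PyTorch Lightning Trainer; this is an illustration only, not the repo's confirmed training configuration, and the model and dataloader names are placeholders.)

import pytorch_lightning as pl

# Hypothetical sketch: enable fp16 mixed precision to reduce GPU memory usage.
# `model` and `train_loader` are placeholders, not objects defined in this repo.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,     # fp16 mixed precision; the default is 32-bit
    max_epochs=25,
)
# trainer.fit(model, train_loader)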

Best regards,
Hyungseok Song.

--save_numpy_heatmap and MCTS Evaluation

Hello Edward,

I'm encountering a problem when using the --save_numpy_heatmap flag in the evaluate script, especially when trying to solve with MCTS. Here's a detailed description of the issues I'm facing:

  1. For solve-500.sh: When I run this script, it throws a ValueError during the reshaping of the adjacency matrix. The error message is:

adj_matrix = np.load(file_name).reshape(num_nodes, num_nodes)
ValueError: cannot reshape array of size 25000 into shape (500,500)

This suggests that the numpy array size doesn't match the expected dimensions for a 500-node problem.

  2. For solve-1000.sh: A different error occurs in the convert_numpy_to_txt.py script:

File "convert_numpy_to_txt.py", line 22, in main
adj_matrix = adj_matrix + 0.01 * (1.0 - dists)
ValueError: operands could not be broadcast together with shapes (100000,) (1000,1000)

This error seems to indicate a mismatch in the dimensions of the operands for matrix operations, implying an issue with the shape of the saved numpy array.

Upon reviewing the pl_tsp_model.py script, I suspect that the saved adj_matrix might be a sparse matrix, which could be causing these shape inconsistencies. Could you please clarify whether this is the case and, if so, suggest a way to correctly reshape or convert the matrix for proper use in the scripts mentioned above?
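
For illustration, if the saved heatmap is indeed stored as one value per sparse edge, a densification along the following lines might work (a sketch only; the edge-index arrays edge_src and edge_dst are assumptions about the sparse format, not files this repo is known to produce):

import numpy as np

# Hypothetical sketch: turn a flat per-edge heatmap of length num_nodes * k
# into a dense (num_nodes, num_nodes) adjacency matrix.
# edge_src[i] and edge_dst[i] are assumed to be the endpoints of the i-th sparse edge.
def densify_heatmap(heatmap_flat, edge_src, edge_dst, num_nodes):
    adj = np.zeros((num_nodes, num_nodes), dtype=heatmap_flat.dtype)
    adj[edge_src, edge_dst] = heatmap_flat   # scatter edge scores into the dense matrix
    return np.maximum(adj, adj.T)            # symmetrize for an undirected tour graph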

Thank you for your assistance and for your work on this project.

Best regards,
Yifan

Questions about Table 1

Hello Edward,

I saw that in Table 1 your model achieved a negative drop, e.g., -0.01% for TSP-50 and -0.01% for TSP-100, taking Concorde as the baseline. But Concorde is the exact solver, right? So how could any model achieve a better solution than Concorde? Is the negative drop due to some round-off errors?
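
For context, the Drop metric is usually computed as the relative gap to the reference tour length, so a tiny negative value can arise purely from rounding of the reported lengths. A toy computation (illustrative numbers, not values from the paper):

# Illustrative computation of the Drop metric relative to a reference solver.
# The tour lengths are made-up examples, not results from the paper.
reference_length = 5.6953   # e.g., Concorde
model_length = 5.6950       # e.g., a learned solver, after rounding
drop_percent = (model_length - reference_length) / reference_length * 100
print(f"Drop: {drop_percent:.2f}%")   # prints -0.01%, driven entirely by rounding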

Best regards,
Yifan

Code for MCTS.

Hi, authors. This repo has been very helpful for my research, but it doesn't seem to include the code for MCTS, which prevents me from reproducing the results of the paper directly. I have tried MCTS code from previous works but failed to reproduce the results in the paper, which I suspect may be due to my improper post-processing of the heatmaps. Would it be possible to include this part of the code in the repo? I sincerely appreciate any help you can provide.

Conda environment failure related to pyconcorde and Python 3.8

I'm using conda version 24.1.2 in Ubuntu 22.04. When I attempt the command:

conda env create -f environment.yml

I get the following error, with the full output attached:

ERROR: Could not find a version that satisfies the requirement pyconcorde==0.1.0 (from versions: none)
ERROR: No matching distribution found for pyconcorde==0.1.0

I don't understand much about Anaconda, but it seems the environment uses Python 3.7 while pyconcorde requires Python 3.8. If pyconcorde is taken from here, its README states that Python 3.7 is enough. Is this a problem with pyconcorde or with some other module? Could someone please suggest the best fix so I can run the code?

output_conda_env_difusco.txt

Request for Additional Information on Time Metrics and Heatmaps in Tables 1 and 2

Hello Edward,

I am currently reproducing the results presented in your paper, and I have encountered a couple of issues that I'm hoping you can help me with to ensure an accurate comparison with other works.

Firstly, while Table 1 in the paper provides details on Length and Drop, there is no mention of the Time taken for each dataset. Although Table 2 includes the Time, it is not explicitly divided into the time taken for generating the Heatmap and the time taken for the subsequent search process. For the sake of a fair comparison with other papers, could you kindly share the Time details for Table 1 and Table 2, as well as the breakdown of the time consumed for each part?

Secondly, I appreciate that you have shared the checkpoints, which are immensely helpful. However, to fully reproduce your paper's results, could you also share the heatmaps generated for the datasets in Tables 1 and 2? Access to these heatmaps would be incredibly valuable for my reproduction effort.

Thank you for your time and assistance, and for your contributions to the field. I look forward to your response.

Best regards,
Yifan

Discrepancies in Reproducing Results for TSP Diffusion Method

Hi Edward,

I hope this message finds you well. I'm reaching out regarding your recent paper on the diffusion method for TSP. First and foremost, I'd like to commend you on your innovative approach and the clarity of your paper. Your work has certainly contributed significantly to this field.

I've been attempting to replicate your results using the scripts provided in your GitHub repository. While I've been able to closely match most of your reported outcomes, there are a few discrepancies I've encountered, especially in the larger TSP instances.

For example, in the TSP-500 and TSP-1000 instances, my results were very close to yours:

  • SL+MCTS for TSP-500: Your result - 16.63, My result - 16.64
  • SL+MCTS for TSP-1000: Your result - 23.39, My result - 23.41
  • SL+S+2-opt for TSP-500: Your result - 16.65, My result - 16.70
  • SL+S+2-opt for TSP-1000: Your result - 23.45, My result - 23.42
  • SL+S+2-opt for TSP-10000: Your result - 73.89, My result - 73.83

These variations are minor and within acceptable ranges. However, I am facing a significant challenge with the SL+MCTS method for TSP-10000. According to your paper, the result for this instance is 73.62. In contrast, my replication yields a result of 73.92, which not only deviates notably from your reported outcome but is also less effective than the SL+S+2-opt method, which is unexpected. Despite using the provided script and following the instructions accurately, my results diverge notably from those in your paper. This is quite perplexing, and I'm unsure what might be causing this discrepancy.

For reference, here's the exact command I used:
python -u difusco/train.py --task "tsp" --wandb_logger_name "tsp_diffusion_graph_categorical_tsp10000" --diffusion_type "categorical" --do_test --learning_rate 0.0002 --weight_decay 0.0001 --lr_scheduler "cosine-decay" --storage_path "./data" --training_split "tsp10000_test_concorde.txt" --validation_split "tsp10000_test_concorde.txt" --test_split "tsp10000_test_concorde.txt" --sparse_factor 100 --batch_size 1 --num_epochs 25 --validation_examples 8 --inference_schedule "cosine" --inference_diffusion_steps 50 --ckpt_path "./checkpoint/tsp10000_categorical.ckpt" --resume_weight_only --save_numpy_heatmap

I then used the provided MCTS code for post-processing, but the results were not as expected. Despite multiple attempts at rerunning the scripts for generating the heatmap and conducting the MCTS search, my results are consistently not comparable to those presented in your paper.

I would greatly appreciate any insights or suggestions you might have that could help me align my results more closely with those reported in your paper. Your guidance in this matter would be invaluable.

Thank you for your time and consideration, and I look forward to your response.

Best regards,
Yifan

code review

pl_tsp_model.py, lines 61-63

before:

xt = self.diffusion.sample(adj_matrix_onehot, t)
xt = xt * 2 - 1  # rescale {0, 1} samples to {-1, +1}
xt = xt * (1.0 + 0.05 * torch.rand_like(xt))  # add small multiplicative noise

after:

xt = self.diffusion.sample(adj_matrix_onehot, t)
xt = xt * (1.0 + 0.05 * torch.rand_like(xt))  # keep samples in {0, 1}; apply only the multiplicative noise

I think this change is more appropriate for the categorical diffusion training step.
Thank you for reading.

Clarification on MCTS Runtime Comparison in Table 2

Hello Edward,

In Table 2, for TSP1000, I noticed that for the DIFUSCO model, the SL+G's runtime is 11.86 minutes, and the SL+MCTS's runtime is 24.47 minutes. It can be estimated that the MCTS component alone takes about 12 minutes. However, when comparing this to the ATT-GCN and DIMES models, which also utilize MCTS, I observed that their total runtime (including both inference and MCTS) is only 4.1 minutes and 6.87 minutes, respectively.

This leads me to question the fairness of this comparison. The MCTS runtime for DIFUSCO seems significantly longer than that for the other two models. Given that providing more time to MCTS generally yields higher quality solutions, I am curious whether this difference in allocated runtime might affect the fairness of the comparison.

Could you please clarify whether the runtime for MCTS is consistent across these models, or if there are other factors contributing to this apparent discrepancy?

Thank you for your attention to this matter.

Best regards,
Yifan
