initzhang / ducati_sigmod
Accepted paper of SIGMOD 2023, DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
While running the run_allocate script for the Friendster dataset, the output shows a discrepancy in the number of nodes: the script reports 124 million nodes, whereas the dataset website documents about 65 million.
Here's the relevant output from the run_allocate script:
Graph(num_nodes=124836180, num_edges=1806067135, ndata_schemes={} edata_schemes={})
Inspecting the raw dataset, I found that node IDs range from 1 to about 124M, but only about 65M distinct IDs actually appear, so I renumbered the nodes to a contiguous range. Even after this preprocessing step, run_allocate still reports 124M as num_nodes.
I would appreciate some help in verifying the functional correctness of the code in this case; any pointers are welcome.
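For reference, here is a minimal relabelling sketch (not code from this repo; the edge-list file name is hypothetical) that compacts the sparse Friendster IDs before building the DGL graph. Note that num_nodes has to be passed explicitly, otherwise DGL infers it from the largest ID it sees:

import torch
import dgl

# Hypothetical preprocessing sketch: compact sparse node ids (1..124M, of which
# only ~65M are used) into a dense 0..N-1 range, then build the graph with the
# compacted node count.
src, dst = torch.load('friendster_edges.pt')               # hypothetical raw edge list
uniq, inv = torch.unique(torch.cat([src, dst]), return_inverse=True)
new_src, new_dst = inv[:src.numel()], inv[src.numel():]
g = dgl.graph((new_src, new_dst), num_nodes=uniq.numel())
print(g)  # num_nodes should now be ~65M; if 124M persists, run_allocate is likely
          # still loading the original graph or a cached preprocessed file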
Hello, following the instructions in the README I was able to set up the environment. However, when I try to run PA (ogbn-papers100M) with the commands below, something seems to go wrong. Could you tell me where the problem lies, or whether the parameters I passed are at fault?
$ CUDA_VISIBLE_DEVICES=0 python run_allocate.py --dataset ogbn-papers100M --fanouts 10,25 --fake-dim 128
2023-09-23 14:24:59,560 Namespace(adj_budget=0, adj_slope=1, batches=1000, bs=8000, dataset='ogbn-papers100M', fake_dim=128, fanouts='10,25', nfeat_budget=0, nfeat_slope=1, pre_batches=100, pre_epochs=2, runs=4, total_budget=1)
2023-09-23 14:25:00,598 loading raw dataset of ogbn-papers100M
2023-09-23 14:25:38,354 finish loading raw dataset, time elapsed: 37.76s
2023-09-23 14:26:09,762 finish preprocessing, time elapsed: 31.41s
2023-09-23 14:28:15,314 finish generating random features with dim=128, time elapsed: 124.61s
2023-09-23 14:28:16,243 Graph(num_nodes=111059956, num_edges=1615685872,
ndata_schemes={}
edata_schemes={})
2023-09-23 14:28:16,454 get 1000 seeds, 0.06GB on cuda:0
2023-09-23 14:28:16,455 start profiling and calculating slope
2023-09-23 14:28:42,581 finish calculating slope: adj(2.55) nfeat(13.45), time elapsed: 26.13s
2023-09-23 14:28:42,581 total cache budget: 1GB
2023-09-23 14:28:42,581 total adj size: 12.865GB, total nfeat size: 53.785GB
2023-09-23 14:28:44,675 finish constructing density and size array
2023-09-23 14:28:57,285 find the separate point 4565543
2023-09-23 14:28:57,325 nfeat entries: 1770741, adj entries: 2794802
2023-09-23 14:28:57,325 nfeat size: 0.858 GB, adj size: 0.142 GB
2023-09-23 14:28:57,684 dual cache allocation done, time_elapsed: 15.10s
2023-09-23 14:28:58,128 current allocation plan: 0.142GB adj cache & 0.858GB nfeat cache
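(As a sanity check on these totals, my own reading of the log rather than anything the repo prints: the sizes are consistent with 520 bytes of node data per node, i.e. 128 float32 features plus an int64 label, and an int64 CSR adjacency.)

# Back-of-the-envelope check of the logged totals (assumptions: sizes are GiB,
# an nfeat entry is fake_dim float32 values plus an int64 label, the adjacency
# is CSR with int64 neighbours and offsets).
GiB = 2**30
num_nodes, num_edges, fake_dim = 111_059_956, 1_615_685_872, 128
nfeat_bytes_per_node = fake_dim * 4 + 8
adj_bytes = num_edges * 8 + num_nodes * 8
print(num_nodes * nfeat_bytes_per_node / GiB)    # ~53.79 (log: total nfeat size 53.785GB)
print(adj_bytes / GiB)                           # ~12.87 (log: total adj size 12.865GB)
print(1_770_741 * nfeat_bytes_per_node / GiB)    # ~0.858 (log: nfeat cache 0.858 GB)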
Then I ran:
$ CUDA_VISIBLE_DEVICES=0 python run_ducati.py
2023-09-23 14:42:06,607 Namespace(adj_budget=0, batches=1024, bs=8000, dataset='ogbn-papers100M', dropout=0.5, fake_dim=128, fanouts='10,25', lr=0.003, nfeat_budget=0, num_hidden=256, pre_batches=100, pre_epochs=2, runs=10)
2023-09-23 14:42:07,654 loading raw dataset of ogbn-papers100M
2023-09-23 14:42:45,337 finish loading raw dataset, time elapsed: 37.68s
2023-09-23 14:43:16,913 finish preprocessing, time elapsed: 31.58s
2023-09-23 14:45:22,367 finish generating random features with dim=128, time elapsed: 124.49s
2023-09-23 14:45:23,257 Graph(num_nodes=111059956, num_edges=1615685872,
ndata_schemes={}
edata_schemes={})
2023-09-23 14:45:23,485 get 1024 seeds, 0.06GB on cuda:0
gpu_flag None
gpu_map None
all_cache [None, None]
2023-09-23 14:45:23,882 buffer size: 0.185 GB
Traceback (most recent call last):
  File "run_ducati.py", line 109, in <module>
    entry(args, graph, all_data, seeds_list, counts)
  File "run_ducati.py", line 63, in entry
    run_one_list(seeds_list)
  File "run_ducati.py", line 49, in run_one_list
    cur_nfeat = nfeat_loader.load(input_nodes, nfeat_buf) # fetch nfeat
  File "/home/bear/workspace/DUCATI_SIGMOD/NfeatLoader.py", line 9, in load
    gpu_mask = self.gpu_flag[idx]
TypeError: 'NoneType' object is not subscriptable
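A likely cause, from my reading of the log rather than anything confirmed by the authors: run_ducati.py was launched with its default adj_budget=0 and nfeat_budget=0 (see the Namespace line above), so no nfeat cache is built and gpu_flag stays None (hence "gpu_flag None" and "all_cache [None, None]"), yet NfeatLoader.load indexes gpu_flag unconditionally. Re-running with non-zero budgets, e.g. the 0.142 GB adj / 0.858 GB nfeat plan reported by run_allocate, should avoid this path. Alternatively, a defensive loader could fall back to a plain host gather when nothing is cached; a minimal sketch of that guard, with a hypothetical constructor whose fields merely mirror the names in the traceback:

import torch

class SafeNfeatLoader:
    # Hypothetical loader illustrating the guard; not the repo's NfeatLoader.
    def __init__(self, cpu_nfeat, gpu_nfeat=None, gpu_flag=None, gpu_map=None):
        self.cpu_nfeat = cpu_nfeat   # full feature matrix in host memory
        self.gpu_nfeat = gpu_nfeat   # rows cached on the GPU (None if no cache)
        self.gpu_flag = gpu_flag     # bool mask: is this node id cached?
        self.gpu_map = gpu_map       # node id -> row index in gpu_nfeat

    def load(self, idx, out_buf):
        n = idx.shape[0]
        if self.gpu_flag is None:
            # nfeat_budget=0: nothing is cached, gather every row from the host
            out_buf[:n] = self.cpu_nfeat[idx.cpu()].to(out_buf.device, non_blocking=True)
            return out_buf[:n]
        gpu_mask = self.gpu_flag[idx]
        out_buf[:n][gpu_mask] = self.gpu_nfeat[self.gpu_map[idx[gpu_mask]]]
        cpu_idx = idx[~gpu_mask]
        out_buf[:n][~gpu_mask] = self.cpu_nfeat[cpu_idx.cpu()].to(out_buf.device, non_blocking=True)
        return out_buf[:n]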
File: /data/DUCATI_SIGMOD/run_allocate.py
I'm encountering an IndexError when running the code with the UK-Union dataset at batch sizes of 8192 and 4096, with the total budget set to 15-20 GB and above. Despite having 49.14 GB of available GPU memory, the run fails intermittently.
P.S. I ran into a similar issue with the Twitter dataset, but there it happened during adjacency cache allocation.
We observed that the above issue only appears when the allocated adj cache size is larger than the total adj size. By increasing the fake_dim parameter we effectively reduced the adj budget, so the whole adjacency no longer fits in the adj cache; even so, the issue still shows up for the adj cache. For UK-Union, the failure is in the nfeat cache allocation, even though the total nfeat size is far larger than either the total budget or the allocated nfeat cache.
I would really appreciate it if someone could shed some light on this. Thanks.
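One hedged guess at the failure mode, based only on the symptom described above: if the allocator converts a byte budget into a number of cache entries and then slices its sorted candidate arrays without bounding that number, a budget larger than the whole structure will index past the end. A clamp of the following kind (a hypothetical helper, not DUCATI's actual allocator code) would cap it:

def plan_entries(budget_bytes, bytes_per_entry, total_entries):
    # Hypothetical helper: never request more cache entries than actually exist.
    want = int(budget_bytes // bytes_per_entry)
    return min(want, total_entries)

# e.g. a 20 GB adj budget on a graph whose whole adjacency needs only ~12 GB
# should simply cache everything instead of indexing past the candidate array
assert plan_entries(20 * 2**30, 512, 3_000_000) == 3_000_000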
I would like to ask why randomly generated feature vectors (i.e., fake input) are needed. If I have misunderstood, could you explain the purpose of the function DUCATI.CacheConstructor.separate_features_idx? The following is your original code:
def separate_features_idx(args, graph):
    separate_tic = time.time()
    train_idx = torch.nonzero(graph.ndata.pop("train_mask")).reshape(-1)
    adj_counts = graph.ndata.pop('adj_counts')
    nfeat_counts = graph.ndata.pop('nfeat_counts')

    # cleanup
    graph.ndata.clear()
    graph.edata.clear()

    # we prepare fake input for all datasets
    fake_nfeat = dgl.contrib.UnifiedTensor(torch.rand((graph.num_nodes(), args.fake_dim), dtype=torch.float), device='cuda')
    fake_label = dgl.contrib.UnifiedTensor(torch.randint(args.n_classes, (graph.num_nodes(), ), dtype=torch.long), device='cuda')
    mlog(f'finish generating random features with dim={args.fake_dim}, time elapsed: {time.time()-separate_tic:.2f}s')

    return graph, [fake_nfeat, fake_label], train_idx, [adj_counts, nfeat_counts]
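My understanding, hedged since the README/paper would be authoritative: the fake features and labels let the benchmarks control the feature dimension via --fake-dim without materializing each dataset's very large real feature matrix, and wrapping them in dgl.contrib.UnifiedTensor lets the GPU read them from host memory directly. If you want the real features instead, a sketch along these lines (the root path is hypothetical; this needs on the order of 60 GB of host RAM, and papers100M labels contain NaN for unlabeled nodes) could replace the fake tensors:

import torch
import dgl
from ogb.nodeproppred import DglNodePropPredDataset

# Hedged sketch: load the real ogbn-papers100M features/labels in place of the
# random ones generated by separate_features_idx.
dataset = DglNodePropPredDataset(name='ogbn-papers100M', root='/data/ogb')
graph, labels = dataset[0]
real_nfeat = dgl.contrib.UnifiedTensor(graph.ndata.pop('feat').float(), device='cuda')
labels = torch.nan_to_num(labels.view(-1), nan=-1)   # unlabeled nodes -> -1
real_label = dgl.contrib.UnifiedTensor(labels.long(), device='cuda')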
When running the PA dataset and trying to verify the model's correctness, I ran into some issues. I save the model at the end of the entry function in run_ducati.py and then evaluate it, but the accuracy I obtain does not look right. I would like to know how to set the parameters to reproduce results similar to the paper; if you could share updated testing code I would greatly appreciate it. My settings are fanouts [15, 15, 15] and 20 epochs.
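One thing worth checking first: as the code quoted above shows, run_ducati.py trains on randomly generated fake_nfeat and fake_label, so a model saved from that run cannot yield meaningful accuracy regardless of fanouts or epochs; reproducing paper-like numbers would need the real features and labels. For the evaluation loop itself, here is a minimal sketch of my own, assuming the DGL 0.8/0.9-era APIs that the repo's UnifiedTensor usage suggests and a standard model with a (blocks, features) forward signature; it is not the repo's own test code:

import torch
import dgl

@torch.no_grad()
def evaluate(model, graph, nfeat, labels, test_idx, fanouts=(15, 15, 15), device='cuda'):
    # Hedged evaluation sketch: neighbour-sample the test nodes and measure accuracy.
    model.eval()
    sampler = dgl.dataloading.MultiLayerNeighborSampler(list(fanouts))
    loader = dgl.dataloading.NodeDataLoader(
        graph, test_idx, sampler, batch_size=1000, shuffle=False, drop_last=False)
    correct = total = 0
    for input_nodes, output_nodes, blocks in loader:
        blocks = [b.to(device) for b in blocks]
        x = nfeat[input_nodes].to(device)        # move the gathered input features to the GPU
        y = labels[output_nodes].to(device)
        pred = model(blocks, x).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total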