initzhang / ducati_sigmod

Code for the accepted SIGMOD 2023 paper, DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU

Python 100.00%

ducati_sigmod's People

Contributors

initzhang


ducati_sigmod's Issues

Mismatch in Number of Nodes for Friendster Dataset

While running the run_allocate script for the Friendster dataset, the output shows a discrepancy in the number of nodes: the script reports 124 million nodes, whereas the actual count should be 65 million, as documented on the dataset's website.

Here's the relevant output from the run_allocate script:

Graph(num_nodes=124836180, num_edges=1806067135, ndata_schemes={} edata_schemes={})

Initially, I inspected the dataset and observed that node IDs range from 1 to 124M while the number of unique nodes is only 65M, so I renumbered the nodes. Even after this preprocessing step, run_allocate still reports 124M as num_nodes.

I would appreciate some help verifying the functional correctness of the code in this case; any pointers would be welcome.
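
The 124,836,180 reported by run_allocate suggests the graph object passed to it still spans the original sparse ID range. As a point of comparison only, a minimal renumbering sketch (my own assumption about the preprocessing, not the repo's code) that compacts the edge list to contiguous IDs before building the DGL graph would look like this:

# Hypothetical renumbering sketch (not the repo's preprocessing): compact the
# sparse Friendster ids (1..124M, ~65M distinct) to 0..N-1 so that dgl.graph
# reports the true node count.
import torch
import dgl

def build_compact_graph(src: torch.Tensor, dst: torch.Tensor) -> dgl.DGLGraph:
    all_ids = torch.cat([src, dst])
    uniq, inv = torch.unique(all_ids, return_inverse=True)  # uniq.numel() ~= 65M
    new_src, new_dst = inv[:src.numel()], inv[src.numel():]
    return dgl.graph((new_src, new_dst), num_nodes=uniq.numel())

If the renumbered edges are written to a new file but run_allocate still loads a previously cached graph object, the old 124M count would persist, which may be worth checking.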

How to resolve 'TypeError: 'NoneType' object is not subscriptable'?

Hello, following the instructions in the README, I successfully configured the environment. However, when I ran PA (ogbn-papers100M) with the commands below, an error occurred. I would like to know where the problem lies, or whether the parameters I passed are wrong.

$ CUDA_VISIBLE_DEVICES=0 python run_allocate.py --dataset ogbn-papers100M --fanouts 10,25 --fake-dim 128
2023-09-23 14:24:59,560 Namespace(adj_budget=0, adj_slope=1, batches=1000, bs=8000, dataset='ogbn-papers100M', fake_dim=128, fanouts='10,25', nfeat_budget=0, nfeat_slope=1, pre_batches=100, pre_epochs=2, runs=4, total_budget=1)
2023-09-23 14:25:00,598 loading raw dataset of ogbn-papers100M
2023-09-23 14:25:38,354 finish loading raw dataset, time elapsed: 37.76s
2023-09-23 14:26:09,762 finish preprocessing, time elapsed: 31.41s
2023-09-23 14:28:15,314 finish generating random features with dim=128, time elapsed: 124.61s
2023-09-23 14:28:16,243 Graph(num_nodes=111059956, num_edges=1615685872,
      ndata_schemes={}
      edata_schemes={})
2023-09-23 14:28:16,454 get 1000 seeds, 0.06GB on cuda:0
2023-09-23 14:28:16,455 start profiling and calculating slope
2023-09-23 14:28:42,581 finish calculating slope: adj(2.55) nfeat(13.45), time elapsed: 26.13s
2023-09-23 14:28:42,581 total cache budget: 1GB
2023-09-23 14:28:42,581 total adj size: 12.865GB, total nfeat size: 53.785GB
2023-09-23 14:28:44,675 finish constructing density and size array
2023-09-23 14:28:57,285 find the separate point 4565543
2023-09-23 14:28:57,325 nfeat entries: 1770741, adj entries: 2794802
2023-09-23 14:28:57,325 nfeat size: 0.858 GB, adj size: 0.142 GB
2023-09-23 14:28:57,684 dual cache allocation done, time_elapsed: 15.10s
2023-09-23 14:28:58,128 current allocation plan: 0.142GB adj cache & 0.858GB nfeat cache

Then

$ CUDA_VISIBLE_DEVICES=0 python run_ducati.py

2023-09-23 14:42:06,607 Namespace(adj_budget=0, batches=1024, bs=8000, dataset='ogbn-papers100M', dropout=0.5, fake_dim=128, fanouts='10,25', lr=0.003, nfeat_budget=0, num_hidden=256, pre_batches=100, pre_epochs=2, runs=10)
2023-09-23 14:42:07,654 loading raw dataset of ogbn-papers100M
2023-09-23 14:42:45,337 finish loading raw dataset, time elapsed: 37.68s
2023-09-23 14:43:16,913 finish preprocessing, time elapsed: 31.58s
2023-09-23 14:45:22,367 finish generating random features with dim=128, time elapsed: 124.49s
2023-09-23 14:45:23,257 Graph(num_nodes=111059956, num_edges=1615685872,
      ndata_schemes={}
      edata_schemes={})
2023-09-23 14:45:23,485 get 1024 seeds, 0.06GB on cuda:0
gpu_flag None
gpu_map None
all_cache [None, None]
2023-09-23 14:45:23,882 buffer size: 0.185 GB
Traceback (most recent call last):
  File "run_ducati.py", line 109, in <module>
    entry(args, graph, all_data, seeds_list, counts)
  File "run_ducati.py", line 63, in entry
    run_one_list(seeds_list)
  File "run_ducati.py", line 49, in run_one_list
    cur_nfeat = nfeat_loader.load(input_nodes, nfeat_buf) # fetch nfeat
  File "/home/bear/workspace/DUCATI_SIGMOD/NfeatLoader.py", line 9, in load
    gpu_mask = self.gpu_flag[idx]
TypeError: 'NoneType' object is not subscriptable
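
The printed Namespace shows nfeat_budget=0 and adj_budget=0, and the log prints gpu_flag None / gpu_map None / all_cache [None, None], so it appears no GPU cache was constructed before training; passing the budgets produced by run_allocate (0.142 GB adj and 0.858 GB nfeat in the log above) when launching run_ducati.py is presumably the intended usage. A defensive variant of NfeatLoader.load along these lines (a sketch under my assumptions about the field names, not the repo's code) would fall back to the host-side features instead of raising:

# Hypothetical defensive variant of NfeatLoader.load; only gpu_flag, load and
# idx come from the traceback above, the other names are assumptions.
import torch

class NfeatLoader:
    def __init__(self, cpu_nfeat, gpu_nfeat=None, gpu_flag=None, gpu_map=None):
        self.cpu_nfeat = cpu_nfeat  # full feature tensor (host / UnifiedTensor)
        self.gpu_nfeat = gpu_nfeat  # cached rows on GPU, or None when budget is 0
        self.gpu_flag = gpu_flag    # bool mask over node ids: cached on GPU?
        self.gpu_map = gpu_map      # node id -> row index inside gpu_nfeat

    def load(self, idx, out_buf):
        n = idx.shape[0]
        if self.gpu_flag is None:
            # no GPU cache allocated (e.g. nfeat_budget=0): gather everything
            # from the host-side tensor instead of indexing a None mask
            out_buf[:n] = self.cpu_nfeat[idx]
            return out_buf[:n]
        gpu_mask = self.gpu_flag[idx]
        out_buf[:n][gpu_mask] = self.gpu_nfeat[self.gpu_map[idx[gpu_mask]]]
        out_buf[:n][~gpu_mask] = self.cpu_nfeat[idx[~gpu_mask]]
        return out_buf[:n]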

IndexError while running run_allocate for the UK-Union dataset

File: /data/DUCATI_SIGMOD/run_allocate.py

I'm encountering an IndexError when running the code with the UK-Union dataset at batch sizes of 8192 and 4096, with total budgets of 15-20 GB and above passed as parameters. Despite having 49.14 GB of available GPU memory, the code fails intermittently.

[screenshot of the IndexError traceback]

P.S. I ran into a similar issue while using the Twitter dataset, but for the adjacency cache allocation.
[screenshot of the similar error from the Twitter run]
We observed that the above issue only popped up when the allocated adj cache size was larger than the total adj size. By increasing the fake_dim parameter, we effectively reduced the adj budget so that the total adj no longer fits within the adj cache, but the issue still exists for the adj cache as well. For UK-Union, the problem is with the nfeat cache allocation, where the total nfeat size is far larger than both the total budget and the allocated nfeat cache.

  1. Twitter: total adj size: 11.251GB, total nfeat size: 59.274GB
  2. UK union: total adj size: 42.031GB, total nfeat size: 64.717GB

I would really appreciate it if someone could shed some light on this. Thanks.
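
For what it is worth, the failure mode described above (a cache budgeted for more entries than actually exist) could in principle be avoided by clamping the entry count before indexing the sorted-count arrays. A generic sketch of that guard, purely my own illustration and not DUCATI's allocation code:

# Hypothetical guard: never select more cache entries than exist, even when
# the per-cache budget exceeds the total adj / nfeat size.
import torch

def top_entries_within_budget(counts: torch.Tensor, budget_bytes: int,
                              bytes_per_entry: int) -> torch.Tensor:
    """Return ids of the hottest entries that fit into budget_bytes."""
    max_entries = budget_bytes // bytes_per_entry
    n_entries = min(max_entries, counts.numel())  # clamp to the total size
    return torch.argsort(counts, descending=True)[:n_entries]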

Question about the function DUCATI.CacheConstructor.separate_features_idx

I would like to ask: why do we need randomly generated feature vectors (i.e., fake input)? If I have misunderstood, could you please explain what the function DUCATI.CacheConstructor.separate_features_idx is meant to do? The following is your raw code:

def separate_features_idx(args, graph):
    separate_tic = time.time()
    train_idx = torch.nonzero(graph.ndata.pop("train_mask")).reshape(-1)
    adj_counts = graph.ndata.pop('adj_counts')
    nfeat_counts = graph.ndata.pop('nfeat_counts')

    # cleanup
    graph.ndata.clear()
    graph.edata.clear()

    # we prepare fake input for all datasets
    fake_nfeat = dgl.contrib.UnifiedTensor(torch.rand((graph.num_nodes(), args.fake_dim), dtype=torch.float), device='cuda')
    fake_label = dgl.contrib.UnifiedTensor(torch.randint(args.n_classes, (graph.num_nodes(), ), dtype=torch.long), device='cuda')

    mlog(f'finish generating random features with dim={args.fake_dim}, time elapsed: {time.time()-separate_tic:.2f}s')
    return graph, [fake_nfeat, fake_label], train_idx, [adj_counts, nfeat_counts]
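
My reading (not the author's answer): the random features simply give every dataset node features of a configurable width fake_dim for benchmarking, and wrapping them in dgl.contrib.UnifiedTensor lets the GPU gather rows directly from host memory via zero-copy access. A minimal usage sketch, assuming a DGL release that still ships dgl.contrib.UnifiedTensor (e.g. 0.8/0.9) and with made-up sizes:

# UnifiedTensor usage sketch: the tensor stays in host memory and the GPU
# gathers the requested rows on demand (zero-copy access).
import torch
import dgl

host_feat = torch.rand(1_000_000, 128)                # kept in host memory
unified = dgl.contrib.UnifiedTensor(host_feat, device='cuda:0')

# mini-batch node ids already on the GPU can index the host tensor directly
idx = torch.randint(0, 1_000_000, (8_000,), device='cuda:0')
batch_feat = unified[idx]                             # gather executed by the GPU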

How to obtain test results for accuracy?

While running the PA dataset and preparing to test the model's accuracy, I encountered some issues. I save the model at the end of the entry function in run_ducati.py and then evaluate the trained model, but my evaluation seems to have problems and the accuracy I obtain is not correct. I would like to know how to set the parameters to reproduce results similar to those in the paper. If you could share an updated version of the testing code, I would greatly appreciate it. The parameters I set are fanouts [15, 15, 15] and 20 epochs.
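
One caveat first: as shown in separate_features_idx above, run_ducati.py appears to train on randomly generated features and labels, so accuracy measured against those fake labels will be meaningless; reproducing the paper's accuracy numbers presumably requires loading the real ogbn-papers100M features and labels. A generic evaluation sketch, with my own assumptions about the model signature and graph fields rather than the repo's testing code:

# Hypothetical evaluation loop: assumes a trained `model` with the standard
# DGL mini-batch signature model(blocks, x) and a graph whose ndata holds the
# real 'feat' and 'label' tensors.
import torch
import dgl

@torch.no_grad()
def evaluate(model, g, test_idx, fanouts=(15, 15, 15), batch_size=8000,
             device='cuda:0'):
    model.eval()
    sampler = dgl.dataloading.NeighborSampler(
        list(fanouts), prefetch_node_feats=['feat'], prefetch_labels=['label'])
    loader = dgl.dataloading.DataLoader(
        g, test_idx, sampler, device=device,
        batch_size=batch_size, shuffle=False, drop_last=False)
    correct = total = 0
    for input_nodes, output_nodes, blocks in loader:
        x = blocks[0].srcdata['feat']
        y = blocks[-1].dstdata['label']
        logits = model(blocks, x)
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += y.shape[0]
    return correct / total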
