juliagast / tgb2

This project forked from shenyanghuang/tgb


Temporal Graph Benchmark project repo


tgb2's Introduction

TGB logo

Temporal Graph Benchmark for Machine Learning on Temporal Graphs (NeurIPS 2023 Datasets and Benchmarks Track)

TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs (preprint)

Overview of the Temporal Graph Benchmark (TGB) pipeline:

  • TGB includes large-scale and realistic datasets from five different domains with both dynamic link prediction and node property prediction tasks.
  • TGB automatically downloads datasets and processes them into numpy, PyTorch and PyG compatible TemporalData formats.
  • Novel TG models can be easily evaluated on TGB datasets via reproducible and realistic evaluation protocols.
  • TGB provides public and online leaderboards to track recent developments in temporal graph learning domain.
  • TGB now supports temporal homogeneous graph, temporal knowledge graph, and temporal heterogeneous graph datasets.

TGB dataloading and evaluation pipeline

To submit to the TGB leaderboard, please fill in this Google form

See all version differences and update notes here

Announcements

Excited to announce TGB 2.0, expanding TGB to Temporal Knowledge Graphs and Temporal Heterogeneous Graphs

See our preprint here for details. Please install locally first. We welcome your feedback and suggestions.

Excited to announce TGX, a companion package for analyzing temporal graphs in WSDM 2024 Demo Track

TGX supports all TGB datasets and provides numerous temporal graph visualization plots and statistics out of the box. See our paper: Temporal Graph Analysis with TGX and TGX website.

Excited to announce that TGB has been accepted to NeurIPS 2023 Datasets and Benchmarks Track

Thanks to everyone for your help in improving TGB! We will continue to improve TGB based on your feedback and suggestions.

Please update to version 0.9.2

version 0.9.2

This update includes the fix for tgbl-flight: the Unix timestamps are now provided directly in the dataset. If you had issues with tgbl-flight, please remove TGB/tgb/datasets/tgbl_flight and redownload the dataset for a clean install.

Pip Install

You can install TGB via pip. Requires Python >= 3.9.

pip install py-tgb

Links and Datasets

The project website can be found here.

The API documentation can be found here.

All dataset download links can be found in info.py.

The TGB dataloader automatically downloads each dataset, as well as the negative samples for the link property prediction datasets.

If the website is inaccessible, please use this link instead.

Running Example Methods

  • For the dynamic link property prediction task, see the examples/linkproppred folder for example scripts to run TGN, DyRep and EdgeBank on TGB datasets.
  • For the dynamic node property prediction task, see the examples/nodeproppred folder for example scripts to run TGN, DyRep and EdgeBank on TGB datasets.
  • For all other baselines, please see the TGB_Baselines repo.
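EdgeBank, used as a baseline above, simply memorizes previously seen edges. A minimal sketch of its unlimited-memory variant (an illustration, not the repo's implementation):

```python
import numpy as np

class EdgeBank:
    """Minimal sketch of EdgeBank (unlimited memory): score 1.0 for any
    previously observed (source, destination) pair, else 0.0."""

    def __init__(self):
        self.memory = set()

    def update(self, sources, destinations):
        # Remember every observed edge
        self.memory.update(zip(sources.tolist(), destinations.tolist()))

    def predict(self, sources, destinations):
        # Score queried edges by membership in memory
        return np.array([
            float((s, d) in self.memory)
            for s, d in zip(sources.tolist(), destinations.tolist())
        ])

bank = EdgeBank()
bank.update(np.array([0, 1]), np.array([1, 2]))
scores = bank.predict(np.array([0, 5]), np.array([1, 6]))  # [1.0, 0.0]
```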

Acknowledgments

We thank the OGB team for their support throughout this project and sharing their website code for the construction of TGB website.

Citation

If code or data from this repo is useful for your project, please consider citing our paper:

@article{huang2023temporal,
  title={Temporal graph benchmark for machine learning on temporal graphs},
  author={Huang, Shenyang and Poursafaei, Farimah and Danovitch, Jacob and Fey, Matthias and Hu, Weihua and Rossi, Emanuele and Leskovec, Jure and Bronstein, Michael and Rabusseau, Guillaume and Rabbany, Reihaneh},
  journal={Advances in Neural Information Processing Systems},
  year={2023}
}

tgb2's People

Contributors: shenyanghuang, juliagast, pitmonticone, fpour, alip67, erfanloghmani, emalgorithm


tgb2's Issues

rename modules

Rename modules and adapt all imports to avoid a clash with the ray dependency.

Negative Sampling

  • Discuss what Negative Sampling approach to use
  • Integrate Negative Sampling (preliminary first version)

Can we have dataset.num_nodes and dataset.num_rels

It would be great to have these two properties for each dataset.

  • num_nodes: number of distinct node ids in the train+valid+test set
  • num_rels: number of distinct edge types in the train+valid+test set

Otherwise I have to compute them for every method (also fine):

num_rels = len(set(dataset.edge_type))

num_nodes = len(set(np.concatenate((dataset.full_data['sources'], dataset.full_data['destinations']))))
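For illustration, a self-contained sketch of the two computations on toy data (full_data here is a hypothetical stand-in for the dataset's dict; note that numpy arrays must be joined with np.concatenate, since + would add them element-wise):

```python
import numpy as np

# Hypothetical stand-in for dataset.full_data on a toy graph
full_data = {
    "sources": np.array([0, 1, 2, 0]),
    "destinations": np.array([1, 2, 3, 2]),
    "edge_type": np.array([0, 1, 0, 2]),
}

num_rels = len(set(full_data["edge_type"]))  # 3 distinct edge types
# Concatenate (not '+', which adds numpy arrays element-wise) before deduplicating
num_nodes = len(set(np.concatenate((full_data["sources"],
                                    full_data["destinations"]))))  # 4 distinct nodes
```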

needed: 1-vs-all time-aware filtered for thgl-github

I think the negative sampling for thgl-github is not great.
Reason: the recurrency baseline should not perform well on this dataset. It has a rather low recurrency degree, and applying the relaxed recurrency baseline does not seem to lead to good results either.
Still, it achieves a rather decent MRR of 0.17.
In the examples I checked, a given node-relation combination had 2000-6000 different objects, so 1000 negatives is a small subsample of the possible nodes.
Given that this is our smallest dataset, I strongly suggest that we do not do any negative sampling here.
We have a very small recurrency degree, so I expect this dataset will be very hard.
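For reference, a toy sketch of how MRR is computed from a positive score and its sampled negative scores (assuming higher scores are better; this is an illustration, not the repo's evaluator):

```python
import numpy as np

def mrr(pos_scores, neg_scores):
    """Mean reciprocal rank: each positive is ranked against its own negatives;
    rank = 1 + number of negatives scoring strictly higher."""
    ranks = 1 + (neg_scores > pos_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())

pos = np.array([0.9, 0.2])                # scores of the true objects
neg = np.array([[0.1, 0.5], [0.4, 0.1]])  # sampled negatives per query
# ranks are [1, 2], so MRR = (1 + 0.5) / 2 = 0.75
```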

dataset.py inverse triples

  • create a method that creates an inverse triple for each triple in triples. The inverse triple swaps subject and object, and increases the relation id by num_rels

Statistics to report for each dataset and plots

Statistics to report, brainstorming

  • surprise index/recurrency index
  • relation density / density?
  • measure to compute how long a fact stays true (to differentiate between fact based and event based)

Figures, brainstorming

  • distribution of repeating vs non repeating triples?
  • average duration of facts?
  • number of different nodes, relations, triples
  • number of triples over time?
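The recurrency index mentioned above could be sketched as the fraction of quadruples whose (subject, relation, object) triple already appeared at an earlier point in the stream; a toy example on synthetic data (an illustrative sketch, not the repo's statistic code):

```python
import numpy as np

# Toy quadruples (subject, relation, object, timestamp)
quads = np.array([
    [0, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 0, 1, 1],  # repeats the (0, 0, 1) triple from t=0
    [2, 1, 3, 1],
])

seen = set()
repeats = 0
# Process quadruples in temporal order and count re-occurring triples
for s, r, o, t in quads[np.argsort(quads[:, 3], kind="stable")]:
    if (s, r, o) in seen:
        repeats += 1
    seen.add((s, r, o))

recurrency_index = repeats / len(quads)  # 1 of 4 facts was seen before -> 0.25
```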

Integration of CEN model

CEN

ongoing

Tasks:

  • restructure code s.t. it fits tgb
  • implement data loading and make sure it is in correct format for CEN
  • integrate test()
  • store and load models at correct location
  • compute scores for negative samples only
  • compute and log mrr
  • compute and log runtime
  • hyperparameter tuning

Some more utility functions for dataset.py?

I need the following methods for recurrencybaseline.py; they will also be needed by other methods later. Does it make sense to move them somewhere else, e.g. to utils?
(you can also find them here: https://github.com/JuliaGast/TGB2/blob/julia_new/examples/linkproppred/tkgl-polecat/recurrencybaseline.py)

from itertools import groupby
from operator import itemgetter

import numpy as np

def group_by(data: np.ndarray, key_idx: int) -> dict:
    """
    Group rows of a numpy array into a dict keyed by the value at key_idx
    (e.g. group elements of the array by relation).
    :param data: [np.ndarray] data to be grouped
    :param key_idx: [int] column index to group by
    returns data_dict: dict mapping each value at key_idx to all rows with that value
    """
    data_dict = {}
    data_sorted = sorted(data, key=itemgetter(key_idx))
    for key, group in groupby(data_sorted, key=itemgetter(key_idx)):
        data_dict[key] = np.array(list(group))
    return data_dict
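A quick usage sketch of the same grouping logic on toy quadruples (grouping by the relation column, index 1):

```python
import numpy as np
from itertools import groupby
from operator import itemgetter

quads = np.array([[0, 1, 2, 0], [3, 1, 4, 0], [0, 0, 5, 1]])

# Same logic as group_by above: sort by the key column, then group
data_sorted = sorted(quads, key=itemgetter(1))
by_rel = {key: np.array(list(group))
          for key, group in groupby(data_sorted, key=itemgetter(1))}
# by_rel maps relation id -> all quadruples with that relation
```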

def add_inverse_quadruples(triples: np.ndarray, num_rels: int) -> np.ndarray:
    """
    Creates an inverse triple for each triple in triples. The inverse triple swaps
    subject and object, and increases the relation id by num_rels.
    :param triples: [np.ndarray] dataset triples
    :param num_rels: [int] number of relations that we have originally
    returns all_triples: [np.ndarray] triples including inverse triples
    """
    inverse_triples = triples[:, [2, 1, 0, 3]]
    inverse_triples[:, 1] = inverse_triples[:, 1] + num_rels  # shift relation ids for inverses
    all_triples = np.concatenate((triples[:, 0:4], inverse_triples))

    return all_triples
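A usage sketch of the inverse-quadruple construction on a single toy quadruple:

```python
import numpy as np

num_rels = 2
triples = np.array([[0, 1, 5, 10]])  # one (subject, relation, object, timestamp) row

# Swap subject and object, shift the relation id by num_rels
inverse = triples[:, [2, 1, 0, 3]]   # fancy indexing returns a copy
inverse[:, 1] += num_rels
all_triples = np.concatenate((triples, inverse))
# all_triples is [[0, 1, 5, 10], [5, 3, 0, 10]]
```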

def reformat_ts(timestamps):
    """Reformat timestamps so that they start at 0 and have step size 1.
    Assumes the unique timestamps are (roughly) evenly spaced.
    :param timestamps: np.ndarray of timestamps
    returns: np.ndarray of remapped timestamps
    """
    all_ts = sorted(set(timestamps))
    ts_min = all_ts[0]
    ts_dist = all_ts[1] - all_ts[0]  # spacing between the first two unique timestamps
    return ((np.asarray(timestamps) - ts_min) // ts_dist).astype(int)
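A quick check of the remapping on toy data (assuming uniform 15-unit spacing between unique timestamps):

```python
import numpy as np

timestamps = np.array([100, 130, 115, 100])  # raw timestamps, step size 15

# Same remapping as reformat_ts above: shift to 0, divide by the step size
all_ts = sorted(set(timestamps))
ts_min = all_ts[0]
ts_dist = all_ts[1] - all_ts[0]  # 15 here
remapped = ((timestamps - ts_min) // ts_dist).astype(int)
# remapped is [0, 2, 1, 0]
```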

Add Timetraveler

Timetraveler

ongoing

Tasks:

  • restructure code s.t. it fits tgb
  • implement data loading and make sure it is in correct format for Timetraveler
  • add preprocessing
  • add dirichlet distribution step
  • check if the relation ids still make sense
  • integrate test()
  • store and load models at correct location
  • compute scores for negative samples only
  • compute and log mrr
  • compute and log runtime

Integration of TLogic Model

TLogic

  • data loading
  • set random seed
  • learn rules
  • apply rules
  • evaluate
  • what to do with hyperparameters?
  • log results
  • fix the window bug from the original TLogic by switching to the apples-oranges version
  • provide file relation2id.txt/.json?
  • compute valid mrr @shenyangHuang do we want to report validation mrr?

Integration of RE-GCN model

Zixuan Li, Xiaolong Jin, Wei Li, Saiping Guan, Jiafeng Guo, Huawei Shen, Yuanzhuo Wang and Xueqi Cheng. Temporal Knowledge Graph Reasoning Based on Evolutional Representation Learning. SIGIR 2021.

RE-GCN

Tasks:

  • restructure code s.t. it fits tgb
  • implement data loading and make sure it is in the correct format for RE-GCN
  • integrate test()
  • store and load models at correct location
  • compute scores for negative samples only
  • compute and log mrr
  • compute and log runtime
  • hyperparameter tuning
  • check if I can reproduce the RE-GCN results from the apples-oranges paper

Dataset split in YAGO

The split needs to be modified for the automatically downloaded dataset:

dataset.py

        if ("yago" in self.name):

            _train_mask, _val_mask, _test_mask = self.generate_splits(full_data, val_ratio=0.1, test_ratio=0.10) 
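For illustration, a sketch of what a ratio-based chronological split could look like (hypothetical; the actual generate_splits in dataset.py may differ):

```python
import numpy as np

def generate_splits_sketch(timestamps, val_ratio=0.1, test_ratio=0.1):
    """Chronological split sketch: earliest edges -> train, then val, then test.
    Illustrative only; not the repo's implementation."""
    val_time, test_time = np.quantile(
        timestamps, [1.0 - val_ratio - test_ratio, 1.0 - test_ratio]
    )
    train_mask = timestamps <= val_time
    val_mask = (timestamps > val_time) & (timestamps <= test_time)
    test_mask = timestamps > test_time
    return train_mask, val_mask, test_mask

ts = np.arange(100)
train_mask, val_mask, test_mask = generate_splits_sketch(ts)
# 80 / 10 / 10 edges land in train / val / test
```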

Bug in Neg Sampler for THG

I think it should be
neg_d_arr = filtered_dst[(pos_t, pos_s, e_type)] in l. 123 of thg_negative_sampler.py

instead of
neg_d_arr = filtered_dst

right?

With the current implementation I get incorrect negative destinations (see the screenshots attached to the original issue).
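The proposed fix can be illustrated with a toy dict; here filtered_dst is a hypothetical stand-in keyed by (timestamp, source, edge_type) tuples:

```python
import numpy as np

# Hypothetical filtered-destination dict keyed by (timestamp, source, edge_type)
filtered_dst = {
    (5, 0, 1): np.array([2, 3]),
    (5, 1, 0): np.array([4]),
}

pos_t, pos_s, e_type = 5, 0, 1
# Per-query lookup (the proposed fix), rather than using the whole dict
neg_d_arr = filtered_dst[(pos_t, pos_s, e_type)]
```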
