pathpy / pathpyg

GPU-accelerated Next-Generation Network Analytics and Graph Learning for Time Series Data on Complex Networks.

Home Page: https://www.pathpy.net

License: GNU Affero General Public License v3.0

Dockerfile 0.29% Python 92.34% CSS 0.10% HTML 0.19% JavaScript 6.80% TeX 0.28%
graph-learning network-analysis deep-learning

pathpyg's Introduction

pathpyG

GPU-accelerated Next-Generation Network Analytics and Graph Learning for Time Series Data on Dynamic Networks.

Documentation

Online documentation is available at pathpy.net.

The docs include a tutorial, an API reference, and other useful information.

Dependencies

pathpyG supports Python 3.7+.

Installation requires numpy, scipy, torch, and torch-geometric.

Installation

The development version can be installed from Github as follows:

pip install git+https://github.com/pathpy/pathpyg.git

Testing

To test pathpyG, run pytest in the root directory.

This will exercise both the unit tests and the docstring examples.

Development

pathpyG development takes place on Github: https://github.com/pathpy/pathpyG

Please submit any reproducible bugs you encounter to the issue tracker.

pathpyg's People

Contributors

franziskheeg, hackl, ingoscholtes, lisiq, lisiwue, m-lampert, vinsrr

pathpyg's Issues

Unintended Boolean Conversions and Ordering Changes in `IndexMap`

Due to the new internal representation of node_ids in IndexMap as a numpy.array, integers are sometimes unintentionally converted to booleans. I also noticed that the ordering of strings sometimes differs from the order in which they were inserted.

This was first encountered by @lisiq and @chrisbloecker. Maybe they have some more details on this.
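
The issue does not pin down the cause. Purely as an illustration, the snippet below shows three well-known NumPy and Python behaviours that produce similar symptoms: dtype coercion of mixed inputs, sorting by np.unique, and the equality of 1 and True as dictionary keys. Whether any of them is what actually happens inside IndexMap is an assumption.

```python
import numpy as np

# Mixed int/str input is coerced to one common (string) dtype.
mixed = np.array([1, "a"])

# np.unique returns the values sorted, not in insertion order.
uniq = np.unique(np.array(["c", "a", "b"]))

# 1 and True compare and hash equal, so they collide as dict keys.
lookup = {1: "int id", True: "bool id"}

print(mixed.dtype.kind, uniq.tolist(), len(lookup))
```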

Write Documentation

With #35 now merged, the first steps to having a complete documentation are done.
In the next steps, everybody can help to write some parts so that we can complete the documentation soon. Open ToDos:

  • Check content (I suggest having at least one person check what another person wrote (Four Eyes Principle))
    • Check home screen (Especially check the "hero" image and see if someone has a better idea of visualizing PathPyG?)
    • Check Getting started (Maybe someone who hasn't done the setup can try if it works)
    • Check Contributing page (and try the setup here as well)
  • Complete Code reference via docstrings
  • Complete tutorials via Jupyter notebooks
  • Check legal stuff (Copyright? Impressum? ...)

Bug in `Graph.is_edge`

The current implementation raises an IndexError when testing for the presence of a directed edge, where the source node has outdegree zero in the underlying network.

Example:

import pathpyG as pp

g = pp.Graph.from_edge_list([['a','b'], ['a','c'], ['a','a']])
g.is_edge('b', 'a')  # raises IndexError because 'b' has outdegree zero
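
A minimal sketch of an edge test that cannot fail for nodes without outgoing edges is shown below. It is not the pathpyG implementation: the function is_edge, its arguments, and the plain dictionary used as ID-to-index mapping are hypothetical, and the edge index is a NumPy array rather than a torch tensor.

```python
import numpy as np

def is_edge(edge_index, mapping, v, w):
    """Return True if the directed edge (v, w) exists (hypothetical helper)."""
    if v not in mapping or w not in mapping:
        return False
    src, dst = edge_index
    # A boolean mask over all edges is simply empty for nodes with
    # outdegree zero, so no index lookup can go out of bounds.
    return bool(np.any((src == mapping[v]) & (dst == mapping[w])))

mapping = {"a": 0, "b": 1, "c": 2}
# Edges a->b, a->c, a->a from the example above.
edge_index = np.array([[0, 0, 0],
                       [1, 2, 0]])

print(is_edge(edge_index, mapping, "b", "a"))
print(is_edge(edge_index, mapping, "a", "b"))
```

The mask scans all edges, so it trades speed for robustness; a sorted edge index with an explicit empty-range check would be the faster variant.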

Refactor PathData to split implementations for Walks and DAGs

Currently, the PathData object can simultaneously hold DAGs and Walks, which is handy for certain temporal graphs but also complicates the code. It also comes at a performance penalty, as we (i) have to store the type of each path (walk or DAG) and (ii) treat each path differently when calculating the k-th order edge index. This prevents vector operations and is likely to eat up the performance gain that we get from the joint storage.

I propose to split those implementations into two classes WalkData and DAGData that both inherit from a common PathData base class.
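
One way the proposed split could look is sketched below. Only the class names come from the proposal; the storage formats (walks as node sequences, DAGs as edge lists) and the method edge_index are assumptions made for illustration.

```python
from abc import ABC, abstractmethod

class PathData(ABC):
    """Common base class: storage and bookkeeping shared by both types."""

    def __init__(self):
        self.paths = []

    def append(self, path):
        self.paths.append(path)

    @property
    def num_paths(self):
        return len(self.paths)

    @abstractmethod
    def edge_index(self, path):
        """Return [sources, destinations] for a single stored path."""

class WalkData(PathData):
    # Assumption: a walk is stored as a sequence of nodes.
    def edge_index(self, path):
        return [list(path[:-1]), list(path[1:])]

class DAGData(PathData):
    # Assumption: a DAG is stored as a list of (source, destination) pairs.
    def edge_index(self, path):
        return [[u for u, _ in path], [v for _, v in path]]

walks = WalkData()
walks.append([0, 1, 2])
dags = DAGData()
dags.append([(0, 1), (0, 2)])
```

With one type per container, no per-path type flag is needed and each subclass can vectorize its own case.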

Improve CUDA performance of `PathData` class

A major performance bottleneck in the current implementation is the use of a dictionary to store path tensors in the class PathData. This creates a huge performance overhead when using CUDA instead of the CPU, as each individual tensor in the dictionary must be copied to the GPU separately.

A better solution could possibly be implemented based on the new nested tensor class available in PyTorch 2.1. This would allow us to store all paths with varying lengths in a single nested tensor, i.e. we would only need a single copy operation when moving the data to the GPU.
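
The idea can be sketched with torch.nested.nested_tensor, which PyTorch still marks as a prototype feature. The toy paths are made up, and how PathData would wrap such a tensor is left open.

```python
import torch

# Three paths of different lengths, given as node-index tensors.
paths = [
    torch.tensor([0, 1, 2]),
    torch.tensor([1, 2]),
    torch.tensor([2, 0, 1, 3]),
]

# Pack all paths into one nested tensor instead of a dict of tensors.
nt = torch.nested.nested_tensor(paths)

# A single .to() call moves every path at once.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
nt = nt.to(device)

# The individual paths can be recovered with unbind().
lengths = [p.numel() for p in nt.unbind()]
print(lengths)
```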

Extend `IndexMap.to_idxs(...)` to any shape

The method IndexMap.to_idxs(...) is meant as a shortcut for converting an iterable of ids to the corresponding indices. In some cases one might want to pass a numpy.array or a torch.Tensor with a specific shape. Currently we only allow conversions for one-dimensional iterables, but it would be great if inputs of any shape were accepted and the output had the same shape and potentially also the same type. The same would make sense for the other direction (i.e. to_ids(...)).
Example for the desired behaviour:

>>> mapping = IndexMap(["a", "b", "c"])
>>> print(mapping.to_idxs(np.array([
...         ["a", "a", "b"],
...         ["b", "c", "c"]])))
np.array([
   [0, 0, 1],
   [1, 2, 2]])
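
A shape-preserving conversion can be sketched by flattening the input, looking up each id, and restoring the original shape. The free function to_idxs and the plain dictionary below are hypothetical stand-ins for the IndexMap method and its internal mapping.

```python
import numpy as np

def to_idxs(ids, id_to_idx):
    """Map ids of arbitrary shape to indices of the same shape (sketch)."""
    arr = np.asarray(ids)
    # Flatten, look up every id, then restore the input shape.
    flat = [id_to_idx[x] for x in arr.ravel().tolist()]
    return np.array(flat, dtype=np.int64).reshape(arr.shape)

id_to_idx = {"a": 0, "b": 1, "c": 2}
idxs = to_idxs(np.array([["a", "a", "b"],
                         ["b", "c", "c"]]), id_to_idx)
print(idxs)
```

The Python-level loop keeps the sketch short; for large inputs a vectorized lookup (for example via np.unique with return_inverse) would likely be preferable.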

Use IndexMapping as node labels by default

I think it would be a good idea if the plot function used the ID-Index mapping stored in Graph.mapping as node labels by default (if such a mapping is defined). If no such mapping is defined, the plot function should use the node indices as labels by default.

This would allow users to omit passing the plot argument node_label=g.mapping.node_id.

Conversion of notebooks of ML4Nets course

To test whether the functionality needed for the ML4Nets course has been implemented in pathpyG, we should start converting the practice notebooks, which are available here.

  • Week 01: Introduction to python and pathpyG [@IngoScholtes]
  • Week 02: Shortest paths, Components, Spectral Clustering [@IngoScholtes]
  • Week 03: Random Graphs and Molloy-Reed Model [@IngoScholtes]
  • Week 04: Stochastic Block Model [@chrisbloecker]
  • Week 05: Entropy and Huffman Coding [@chrisbloecker]
  • Week 06: Random Walks and InfoMap [@FranziskHeeg]
  • Week 07: Similarity Scores and Link Prediction [@IngoScholtes]
  • Week 08: Dimensionality Reduction and Laplacian Eigenmaps [@VincenzoPerri]
  • Week 09: Logistic Regression and Neural Networks [does not use pathpyG]
  • Week 10: DeepWalk and node2vec [@lisiq]
  • Week 11: Graph Neural Networks [@lisiq]

Edge Weight Representation

In the previous versions, we saved the number of times each walk or DAG appeared as a count/weight. This would translate to a graph feature for each PyG Data object. For the path statistics, it is probably more convenient to have the count/weight as an edge weight. At first we added it in both ways for all representations where this is possible, but that leads to an error when using the PyG DataLoader in MultiOrderModel.from_DAGs. This is fixed for now (#154), but we should think about which representation is best, also with the order detection (#107) in mind.

Idea: Time-Resolved Line Graph Transformations

Continuing the discussion I had with @IngoScholtes today, where we were thinking about ways to speed up algorithms.temporal.temporal_graph_to_event_dag, which creates a DAG from a temporal graph given a specific time window delta. Since I got this idea on the bike ride home today, I thought I would write it down before I forget it again. We can discuss it in more detail in the upcoming sprint:

I know that we need the event DAG to construct the higher-order graphs for the temporal interactions. Do we need this anywhere else?
Because if not, then I may have an idea that would skip this step and directly construct the higher-order graph from the temporal graph. The procedure would be as follows:

  1. Construct a graph where each timestamped edge corresponds to one edge in the graph where the time stamp is the edge weight. This might lead to duplicate edges but with different edge weights (time stamps).
  2. Use the indexing-based lift order function inspired by PyG's Line Graph transformation (#132). This will create a 2nd order graph with all possible walks of length 2 without respecting the time. Note that since the edges now correspond to nodes, the edge weights now correspond to node features.
  3. Subtract the node feature (edge timestamp) of the source node from that of the destination node in the higher-order edge index. This can be done with message passing, but a simple diff = edge_weight[edge_index[1]] - edge_weight[edge_index[0]] should also do the trick.
  4. Create a mask that filters the higher-order edges 0 < diff < delta and filter the higher-order edge index correspondingly. (Potentially remove isolated nodes.)
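
The four steps above can be sketched in NumPy as follows. Note one deliberate substitution: instead of the indexing-based lift order function from #132, the candidate second-order edges are found with a dense pairwise comparison, which is quadratic in the number of events and only suitable for small examples. The function name and the toy data are hypothetical.

```python
import numpy as np

def time_respecting_second_order(src, dst, t, delta):
    """Second-order edge index over time-stamped edges (sketch)."""
    src, dst, t = (np.asarray(x) for x in (src, dst, t))
    # Steps 1 and 2: every time-stamped edge is a node; connect e -> f
    # whenever edge e ends in the node where edge f starts.
    e, f = np.nonzero(dst[:, None] == src[None, :])
    # Step 3: time difference between destination and source event.
    diff = t[f] - t[e]
    # Step 4: keep only pairs with 0 < diff < delta.
    mask = (diff > 0) & (diff < delta)
    return np.stack([e[mask], f[mask]])

# Events: 0->1 at t=1, 1->2 at t=2, 1->2 at t=10, 2->0 at t=3
ho_edge_index = time_respecting_second_order(
    src=[0, 1, 1, 2], dst=[1, 2, 2, 0], t=[1, 2, 10, 3], delta=5
)
print(ho_edge_index)
```

Each column of the result refers to a pair of event indices, so duplicate edges with different time stamps stay distinct, as described in the first step.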

Benefits:

  • This will enable the use of continuous time stamps and does not require discretization.
  • Can be done on GPU

Potential Problems:

  • By using the time stamps as edge weights, we need to allow multiple edges between a pair of nodes. We need a sorted edge index (by source) for the line graph transformation to work. This might cause problems because coalesce in PyG removes all duplicate edges and aggregates the edge weights. I am not sure how or if there is sorting implemented for TemporalData but there might already be solutions there.
  • I am almost, but not 100%, certain that this will work for any $k$ -> $k+1$. It should work because each node in the second-order graph corresponds to a time-stamped edge in the first-order graph. So we should only need to filter once from order $1$ -> $2$ and then all further transformations will respect the time automatically. But this leads to the next problem:
  • The higher-order representations will have multiple nodes for the same edge (at different time steps). Is this the higher-order representation that we need? I think it might be the same representation that we would get from the event DAG. If not, can we maybe map easily to the representation we need?
  • This might be very memory intensive since this will at first create a new graph with as many nodes as there are time-stamped edges and even more edges.

Refactor `lift_order`-logic from `MultiOrderModel` to `algorithms`

The methods

  • lift_order_edge_index
  • lift_order_edge_index_weighted
  • aggregate_edge_index
potentially have uses outside of MultiOrderModel, e.g. for the centrality calculations. Since they are static methods anyway, it would make sense to separate them from the class altogether and put them into utils or algorithms.

`HigherOrderGraph.predecessors` returns unexpected values

I suspect that the predecessors method returns wrong values for HigherOrderGraphs.
Right now, it inherits the implementation from Graph.

A self-contained example:

import pathpyG as pp
paths = pp.PathData.from_csv('../data/tube_paths_train.ngram')

k = 2
higher_order = pp.HigherOrderGraph(paths, order=k, node_id=paths.node_id)

nodes = list(higher_order.nodes)

node = nodes[2]

print(node)
print(list(higher_order.successors(node)))      
# -> seems to be okay
print(list(higher_order.predecessors(node)))    
# -> wrong, expected [('Embankment', 'Waterloo') , ('Westminster', 'Waterloo'), ('Kennington', 'Waterloo'), ...]

results in

('Waterloo', 'Southwark')
[('Southwark', 'London Bridge')]
[('Southwark', 'Waterloo'), ('Southwark', 'Waterloo'), ('Southwark', 'Waterloo'), ('Southwark', 'Waterloo')]

You can copy the example to a notebook in the tutorials folder.
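
For reference, the expected semantics can be sketched in plain Python: in a second-order graph, the predecessors of a node (u, v) are nodes of the form (x, u), and its successors are nodes of the form (v, y). The helper and the two toy paths below are hypothetical and only reuse station names from the output above.

```python
from collections import defaultdict

def second_order_neighbors(paths):
    """Collect successors and predecessors of second-order nodes."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for path in paths:
        # Every three consecutive nodes a, b, c yield (a, b) -> (b, c).
        for a, b, c in zip(path, path[1:], path[2:]):
            succ[(a, b)].add((b, c))
            pred[(b, c)].add((a, b))
    return succ, pred

paths = [
    ["Embankment", "Waterloo", "Southwark", "London Bridge"],
    ["Westminster", "Waterloo", "Southwark"],
]
succ, pred = second_order_neighbors(paths)
node = ("Waterloo", "Southwark")
print(sorted(succ[node]))
print(sorted(pred[node]))
```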

Refactor core classes `Graph` and `PathData`

  • check use of pyG.Data (consider use of HeteroData)
  • implement more convenient uid-index mapping
  • implement support for string-type node/edge/graph features
  • simplify interface to access to attributes

Switch from Global `device`-`config` to `tensor`-wise configuration

As we now have almost all of the core functions implemented in torch operations that can utilize the GPU, we have fixed most runtime issues and now run into the next bottleneck namely memory (GPU-RAM). It thus might become necessary to give the user more control over what parts of a Graph object should be stored on CPU or GPU. Although it is more convenient if this is controlled via a global configuration, it might become necessary to add to(device) methods to enable batch-wise computations.

Implement plot functions for pathpyG

  • Transfer code from pathpy3
  • Modify plot code for pathpyG
    • matplotlib
    • d3js
    • tikz
  • Add plot function for temporal graphs
  • Write and implement test cases
  • Document plot function
  • Write tutorial for new plot functions
  • Merge changes and close issue

notebook crashing when plotting weighted graph

The plots of weighted (higher-order) graphs in the dbgnn tutorial currently lead to the notebook crashing. The resulting HTML output has a lot of NaN errors, which are likely due to the layout algorithm.

Minimal example to reproduce the issue:

import pathpyG as pp
pp.config['torch']['device'] = 'cpu'

# read paths
paths = pp.PathData.from_csv('../data/temporal_clusters.ngram')

ho = pp.HigherOrderGraph(paths, order=1)

pp.plot(ho, filename='test.html')

The resulting HTML file does not render and the console logs a large number of errors.

Fix file dependencies in Jupyter Tutorials

With the recent merges #110 and #109, we now provide links to check out our tutorials directly via Google Colab. But there are some errors on Colab due to the file dependencies in some of the notebooks.
Open Tasks:

  • Fix FileNotFoundError in temporal_graphs.ipynb
  • Come up with a solution to missing files when using Google Colab (temporal_graphs.ipynb, dbgnn.ipynb and paths_higher_order.ipynb)
    Since only the Notebook is opened via Colab, the accompanying files from the repository are not copied over and, thus, an exception is thrown. This could be either fixed by using datasets from Netzschleuder instead or by manually downloading the files from our GitHub Repo in the notebook if they are not available.
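
The second option (downloading a file only if it is missing) could look roughly like the sketch below. The helper ensure_file is hypothetical, and the URL is deliberately left as a parameter because the exact raw-file location in the repository is not stated here.

```python
import urllib.request
from pathlib import Path

def ensure_file(path, url):
    """Return `path`, downloading it from `url` first if it is missing."""
    path = Path(path)
    if not path.exists():
        # Create the target folder (e.g. ../data) before downloading.
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(path))
    return path
```

A notebook would call ensure_file once at the top with the local path it expects and the corresponding download URL, so the same cell works both in a local checkout and on Colab.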

Cleanup pyproject.toml

  • Remove references to Microsoft
  • use poetry for dependency management
  • Add linting rules
  • Add formatting rules

CUDA vs CPU

Add code to detect if CUDA is available; otherwise, use CPU as default.
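
A minimal sketch of such a check, using torch.cuda.is_available(). The helper default_device is hypothetical; how its result would be written into the pathpyG configuration is not shown.

```python
import torch

def default_device():
    """Pick CUDA if a usable GPU is present, otherwise fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = default_device()
# Tensors created with this device work on machines with and without a GPU.
x = torch.zeros(3, device=device)
print(device)
```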

Performance issue with HTML-based plots

In the latest version of the plot function, interactive HTML-based plots of larger networks are very slow.

For example, the plots of second-order networks for the example in the dbgnn.ipynb tutorial do not work anymore and render the notebook unusable. Changing the network to an undirected network does not help either. Plotting of the same networks worked fine with the previous version.

Implement temporal Cartesian layout

Time-varying node positions with an interpolation between the specified points in time, i.e. a "temporal layout" with time-stamped positions.
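
As a sketch of what such a layout could compute, the function below linearly interpolates one node's position between time-stamped key positions using np.interp. The function name and the data format (sorted time stamps plus one coordinate row per time stamp) are assumptions.

```python
import numpy as np

def interpolate_position(times, positions, t):
    """Linearly interpolate a node position at time t (sketch).

    times:     increasing time stamps, shape (k,)
    positions: coordinates at those time stamps, shape (k, dims)
    """
    times = np.asarray(times, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # Interpolate each coordinate separately; np.interp holds the first
    # or last position for t outside the given time range.
    return np.array([
        np.interp(t, times, positions[:, d])
        for d in range(positions.shape[1])
    ])

pos = interpolate_position([0, 10], [[0, 0], [10, 20]], t=5)
print(pos)
```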

Change filenames

Currently files containing a class are named in CamelCase. We should follow PEP8 convention and name them in lowercase because this can otherwise lead to confusion during imports.
Example:
You would think that by importing the following, you would have the Graph class.

from pathpyG.core import Graph

Instead you now have the module Graph and would need to instantiate the class Graph as follows:

g = Graph.Graph(...)

or

from pathpyG.core.Graph import Graph

I think we should rename all files containing classes and import the classes in the module's __init__.py so that we could import e.g. Graph in all of the following ways:

from pathpyG import Graph
from pathpyG.core import Graph
from pathpyG.core.graph import Graph

I hope this will not lead to any circular imports! XD

`TemporalGraph.get_window(...)` and `TemporalGraph.get_snapshot` behaviour not as expected

As I understand both methods, get_snapshot should return a TemporalGraph that contains all edges from the original graph that occur in the time window from start to end. This means that the returned edge index can potentially have very different sizes. See the following example for the expected behaviour:

>>> t = TemporalGraph(TemporalData(
...    src=[0,0,1,0],
...    dst=[1,2,2,1],
...    t=[1,1,1,2]))
>>> t.get_snapshot(start=1, end=2)
TemporalGraph containing the first 3 edges

>>> t.get_snapshot(start=2, end=3)
TemporalGraph containing only the last edge

get_window(...) instead should return a window of fixed size end-start that contains the edges starting at index start and ending (non-inclusive) at index end in an edge index sorted by time. Thus, e.g. get_window(0,5) would return the first 5 events. The current implementation of get_window does exactly this but get_snapshot(...) also does this.

ALSO: For get_window to consistently work, we have to ensure to keep an edge_index sorted by time. This is not the case after applying shuffle_time in its current implementation.
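
The intended difference between the two methods can be sketched on plain NumPy arrays. The functions below are hypothetical free-standing versions, not the TemporalGraph methods, and the half-open interval [start, end) for snapshots is inferred from the example above.

```python
import numpy as np

def get_snapshot(src, dst, t, start, end):
    """All edges with start <= time < end; the result size varies."""
    mask = (t >= start) & (t < end)
    return src[mask], dst[mask], t[mask]

def get_window(src, dst, t, start, end):
    """Edges at positions start..end-1 in the time-sorted edge list."""
    # Sorting here keeps the result correct even if the stored order
    # was shuffled; a stable sort preserves ties in their given order.
    order = np.argsort(t, kind="stable")
    idx = order[start:end]
    return src[idx], dst[idx], t[idx]

src = np.array([0, 0, 1, 0])
dst = np.array([1, 2, 2, 1])
t = np.array([1, 1, 1, 2])

snap_a = get_snapshot(src, dst, t, start=1, end=2)
snap_b = get_snapshot(src, dst, t, start=2, end=3)
win = get_window(src, dst, t, 0, 2)
print(len(snap_a[0]), len(snap_b[0]), len(win[0]))
```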

PyG backwards compatibility 2.3.1 <- 2.4.0

This already came up in one of our meetings: self.data.keys works with or without () depending on the installed version. The cause is that PyG turned keys() into a method in version 2.4.0, while it was a property in version 2.3.1.
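
A version-independent access could be sketched as follows. The helper data_keys is hypothetical, and the two small classes only mimic the old (property) and new (method) behaviour so that the snippet runs without PyG.

```python
def data_keys(data):
    """Return the attribute names of `data` for both PyG behaviours."""
    keys = data.keys
    # 2.4.0 style: keys is a method; 2.3.1 style: keys is a property.
    return list(keys() if callable(keys) else keys)

class OldStyleData:
    @property
    def keys(self):
        return ["edge_index", "edge_weight"]

class NewStyleData:
    def keys(self):
        return ["edge_index", "edge_weight"]

print(data_keys(OldStyleData()), data_keys(NewStyleData()))
```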

Extend `append_DAG` to work with nodes appearing at multiple points in time

@VincenzoPerri raised the issue that we currently cannot add a DAG that should contain the same node at multiple time stamps. We already fixed this issue for append_walk by reindexing all nodes contained in the walk. It is not as trivial for DAGs since a node can have multiple incoming and outgoing edges so we cannot use a new index for each occurrence in the edge index.
One solution could be that we use the combination of node index and time stamp to reindex the nodes.
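
The suggested reindexing can be sketched in plain Python: every distinct (node, time stamp) pair receives its own index, so the same node at two points in time becomes two different DAG nodes. The function name and the input format are assumptions.

```python
def reindex_by_node_and_time(edges):
    """Map (node, time) pairs to consecutive indices (sketch).

    edges: iterable of ((u, t_u), (v, t_v)) pairs.
    Returns the edge index as [sources, destinations] and the mapping.
    """
    index = {}
    edge_index = [[], []]
    for u, v in edges:
        for row, key in zip(edge_index, (u, v)):
            # Assign the next free index on first occurrence of a pair.
            row.append(index.setdefault(key, len(index)))
    return edge_index, index

edges = [
    (("a", 1), ("b", 2)),
    (("b", 2), ("a", 3)),  # node "a" appears again at a later time
]
edge_index, index = reindex_by_node_and_time(edges)
print(edge_index, index)
```

Unlike the per-occurrence reindexing used for walks, a node with several incoming or outgoing edges at the same time stamp keeps a single index here.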

Fix switch between versions 0.0.1 and 0.0.1-dev on web page

Currently, the version shown for the code reference is preset to 0.0.1, while the latest version from the main branch is under 0.0.1-dev. Moreover, switching from 0.0.1 to 0.0.1-dev fails for most of the pages, ending in a 404 error (e.g. from the Tutorial page).

I would suggest removing the version tags for now and keeping only the dev version, which always refers to the latest version in the main branch. We can later add support for version tags defined in GitHub.
