
Comments (6)

dhimmel commented on August 11, 2024

Thanks @semihsalihoglu-uw for your questions. For those who don't know, @semihsalihoglu-uw is currently conducting a survey on graph database usage, which I believe anyone is free to take. They have previously explored this topic in a work titled "The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing". For some reason, the DOI for that article has not been registered (shame on the publisher).

Anyways, here are my answers to the questions:

What kind of queries and graph computations do you run on hetionet?

The primary computation we run is to compute degree-weighted path counts (DWPCs, initially described here). DWPCs measure the extent of connectivity between two nodes along a given type of path (metapath). They are related to path counts (the number of paths), but have an adjustment for node degree to downweight paths through high degree nodes.
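To make the definition concrete, here is a minimal sketch of a DWPC computed by enumerating paths on a toy network. It is not the hetio implementation: the function name `dwpc`, the edge-list input format, the toy compound/gene/disease edges, and the damping exponent of 0.5 are all assumptions chosen for illustration. Each path contributes the product of the degrees along it (counted per relationship type) raised to the negative damping exponent, and the DWPC is the sum over paths.

```python
from collections import defaultdict


def dwpc(edges_by_type, metapath, source, target, damping=0.5):
    """Sum the degree-damped weights of all paths from source to target.

    edges_by_type maps a relationship name to (source, target) pairs;
    metapath is the sequence of relationship names to follow in order.
    """
    out_nbrs = {}  # relationship -> node -> nodes reached by following it
    in_nbrs = {}   # relationship -> node -> nodes that reach it
    for rel, edges in edges_by_type.items():
        out_nbrs[rel] = defaultdict(set)
        in_nbrs[rel] = defaultdict(set)
        for u, v in edges:
            out_nbrs[rel][u].add(v)
            in_nbrs[rel][v].add(u)

    def extend(node, step, visited, degree_product):
        if step == len(metapath):
            # A finished path counts only if it ends on the target.
            return degree_product ** -damping if node == target else 0.0
        rel = metapath[step]
        total = 0.0
        for nxt in out_nbrs[rel][node]:
            if nxt in visited:  # paths may not revisit a node
                continue
            degrees = len(out_nbrs[rel][node]) * len(in_nbrs[rel][nxt])
            total += extend(nxt, step + 1, visited | {nxt},
                            degree_product * degrees)
        return total

    return extend(source, 0, frozenset([source]), 1)


# Hypothetical toy hetnet: compounds bind genes, genes associate with diseases.
toy_edges = {
    "binds": [("C1", "G1"), ("C1", "G2"), ("C2", "G2")],
    "associates": [("G1", "D1"), ("G2", "D1")],
}
score = dwpc(toy_edges, ["binds", "associates"], "C1", "D1")
```

Here C1 reaches D1 through two paths, and the one through the higher-degree gene G2 contributes less to the score than the one through G1.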

We also ran some one-of-a-kind queries to investigate specific questions. These usually rely on Cypher queries to Hetionet in Neo4j (see examples).

What kind of software do you use to run these queries?

We have three implementations for computing the DWPC. Here they are in order of both date created and sophistication:

  1. Using a function in the hetio python package which takes a hetio.hetnet.Graph object. This requires the whole graph to be read into memory.

  2. Using a Cypher implementation (background) that computes DWPCs from a Neo4j database. This has the advantages that the graph can be stored on disk and concurrent queries are possible. Generally, we still use functions from the hetio package to template these queries.

  3. Matrix multiplication approaches we're currently developing for the hetmech project. This approach stores hetnets as matrices (one adjacency matrix for each relationship type). We can achieve massive efficiency gains by computing DWPCs with matrix multiplication. The two downsides are that this method doesn't track which paths connect nodes (just how many) and that excluding duplicate nodes in a path is tricky. We have built considerable python infrastructure to do this. The hetmech infrastructure still uses parts of the hetio package.
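As a rough illustration of the matrix approach in item 3, the sketch below damps each adjacency matrix by its row and column degrees and then multiplies the matrices along the metapath. This is not the hetmech code: the function names, the binary toy matrices, and the damping exponent of 0.5 are assumptions. It also counts walks rather than paths, so it matches the DWPC only when a node cannot repeat, as in this toy metapath where every step lands on a different node type.

```python
import numpy as np


def degree_damped(adjacency, damping=0.5):
    """Scale each entry by (row degree * column degree) ** -damping."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_degree = adjacency.sum(axis=1)
    col_degree = adjacency.sum(axis=0)
    # Nodes with no edges get a scale of 0 instead of dividing by zero.
    row_scale = np.where(row_degree > 0,
                         np.maximum(row_degree, 1.0) ** -damping, 0.0)
    col_scale = np.where(col_degree > 0,
                         np.maximum(col_degree, 1.0) ** -damping, 0.0)
    return adjacency * row_scale[:, None] * col_scale[None, :]


def degree_weighted_walk_counts(adjacency_chain, damping=0.5):
    """Multiply the damped matrices in metapath order."""
    result = degree_damped(adjacency_chain[0], damping)
    for adjacency in adjacency_chain[1:]:
        result = result @ degree_damped(adjacency, damping)
    return result


# Hypothetical toy hetnet: rows/columns are (C1, C2) x (G1, G2) and (G1, G2) x (D1).
binds = np.array([[1, 1],
                  [0, 1]])
associates = np.array([[1],
                       [1]])
scores = degree_weighted_walk_counts([binds, associates])
```

The output has one row per compound and one column per disease, so every source-target pair is scored at once, but nothing records which genes the contributing walks passed through.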

So as you can see, each new implementation builds off the previous ones and often depends on parts of the existing codebases.

I was curious if you extract simpler, more homogeneous graphs out of hetionet, say of only gene-gene interactions?

We implement a get_subgraph method for hetio.hetnet.Graphs. However, we have mostly used this to generate sub-hetnets (usually to create testing networks) rather than homogeneous networks. Since I feel that hetnets are underutilized compared to homonets, I don't spend much time working on approaches for homonets.

Somewhat related, in hetmech, we've created a HetMat data structure that stores hetnets on disk. Each adjacency matrix is a different file (exported from numpy or scipy). In this way, users only interested in certain parts of the hetnet don't have to read all relationship matrices.
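The one-file-per-relationship idea can be sketched with plain numpy as follows. The directory layout, file naming, and the helper names `save_relationships` and `load_relationship` are hypothetical and are not the actual HetMat format; the point is only that reading one relationship type does not touch the others.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_relationships(directory, matrices):
    """Write each adjacency matrix to its own .npy file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, matrix in matrices.items():
        np.save(directory / f"{name}.npy", matrix)


def load_relationship(directory, name):
    """Read a single relationship type without loading the others."""
    return np.load(Path(directory) / f"{name}.npy")


with tempfile.TemporaryDirectory() as tmp:
    save_relationships(tmp, {
        "binds": np.array([[1, 1], [0, 1]]),
        "associates": np.array([[1], [1]]),
    })
    stored_files = sorted(path.name for path in Path(tmp).iterdir())
    binds_only = load_relationship(tmp, "binds")
```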

Were there any features that you think were missing from the software that you were using, or things that were difficult to do?

I think visualization of hetnets is still a pain point, especially visualizing large numbers of nodes and relationships. Of course, visualizing 50 thousand nodes and millions of relationships won't tell you much about specific nodes or relationships, but these views help communicate the network generally. We've used Cytoscape here, but even this became unwieldy and was very manual.

Feel free to follow up with any additional questions. Or if you have nothing else to ask, you can close the issue.

from hetionet.

semihsalihoglu-uw commented on August 11, 2024

Thank you very much for the detailed response. The Cypher queries here are especially useful. Two follow-up questions:

  1. Are you building the hetmech for performance reasons only? Or were there computations you thought were simply much easier to express as matrix multiplications?
  2. There are several well-developed graph libraries that provide adjacency matrix representations and operations of graphs, such as networkx. Would these serve your needs or did you choose to build hetmech from scratch because there were specific computations they would not satisfy?

And I think VLDB might be registering the DOIs in September when the conference is held. They always do but I'm not sure of the exact timeline.


dhimmel commented on August 11, 2024

Are you building the hetmech for performance reasons only?

Hetmech and the related HetMat data structure are motivated primarily by performance. Personally I don't find matrices and their dot products an intuitive data structure for hetnets. To me, it's much more intuitive to use a data structure that more closely resembles a network and that more easily allows nodes/edges to be annotated with properties. However, the performance improvement from calculating path counts via matrix multiplication turns out to be too compelling to ignore. The matrix multiplication is faster than path traversal algorithms in two important ways:

  1. Computation time scales linearly with path length because matrix multiplication does not track which paths arrive at a certain destination. Path traversal methods blow up with increasing path length.
  2. The matrix multiplication approaches compute DWPCs for all source-target node pairs. We usually are interested in all pairs, so it's a huge bonus to get all the DWPCs in a single output matrix.

Together these factors lead to a several orders of magnitude efficiency improvement.
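The two points above can be illustrated with a small sketch that counts walks of a fixed length both ways. This is a simplified stand-in, not the hetmech or hetio code: it uses a single relationship type, counts walks rather than degree-weighted paths, and the function names and the four-node complete graph are assumptions for illustration.

```python
import numpy as np


def walk_counts_by_traversal(adjacency, length):
    """Count walks by visiting every one; also report how many were visited."""
    n = adjacency.shape[0]
    counts = np.zeros((n, n), dtype=int)
    walks_visited = 0

    def extend(start, node, remaining):
        nonlocal walks_visited
        if remaining == 0:
            counts[start, node] += 1
            walks_visited += 1
            return
        for nxt in np.flatnonzero(adjacency[node]):
            extend(start, nxt, remaining - 1)

    for start in range(n):
        extend(start, start, length)
    return counts, walks_visited


def walk_counts_by_matmul(adjacency, length):
    """Count walks for all node pairs with one matrix product per step."""
    result = np.eye(adjacency.shape[0], dtype=int)
    for _ in range(length):
        result = result @ adjacency
    return result


# Complete graph on four nodes, no self-loops.
complete = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
traversal_counts, walks_visited = walk_counts_by_traversal(complete, 5)
matmul_counts = walk_counts_by_matmul(complete, 5)
```

Both functions return the same all-pairs count matrix, but the traversal has to visit 4 × 3^5 = 972 individual walks while the matrix version performs five products, one per step.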

There are several well-developed graph libraries that provide adjacency matrix representations and operations of graphs, such as networkx. Would these serve your needs or did you choose to build hetmech from scratch because there were specific computations they would not satisfy?

We are always on the lookout for larger projects that provide functionality for hetnets and did consider alternatives before creating HetMat. networkx isn't a good option as its hetnet support is mediocre: the MultiGraph supports relationship types but not node types and doesn't really provide first-class type support. While networkx can export to adjacency matrices, it doesn't really have the functionality we needed to perform computations on them. Building the HetMat data structure from scratch allowed us to do some really cool things:

  1. implement an on-disk data structure for hetnets
  2. enable types of on-disk caching
  3. enable additional types of in-memory caching and optimizations

Currently, our hetnet stack consists of the following tools:

  1. hetio.hetnet.MetaGraph objects for storing the hetnet schema (metagraph)
  2. hetmech.hetmat.HetMat for storing hetnets as matrices
  3. neo4j for enabling custom cypher queries, interactive visualization, and path traversal operations

So as our research has progressed, it seems like we're using more tools, since we're finding where each tool excels and using it for just those applications. For some projects, we do use networkx (see obonet for example), just less so for our hetnet work.


semihsalihoglu-uw commented on August 11, 2024

Great, thank you very much for the detailed answer again.

One final question: When integrating data to the hetnet, or even after that when using it for research, did you ever have to do "graph cleaning"? For example, you noticed that some nodes or edges looked suspicious and then had to remove them? Or was each dataset that you integrated already clean?


dhimmel commented on August 11, 2024

When integrating data to the hetnet, or even after that when using it for research, did you ever have to do "graph cleaning"?

We released an initial version of Hetionet quite a bit before we released version 1.0, which had several additions, changes, and improvements. First, every relationship type required preprocessing, which is where the cleaning occurred. In general, each resource had its own repository where most of the preprocessing took place. Then we had a single notebook that integrated the data from all of the source repositories.

Graph cleaning was an iterative approach. Neo4j was super helpful here because it provided a visual way for us to quickly explore and sanity check the networks. Of course, you often will notice bugs or possible improvements. For example, certain metadata may be missing, certain things may be misspelled, or additional processing may be required. Cleaning the data as well as mapping everything to common standardized identifiers was a laborious process, as were the legal issues surrounding the data reuse.

See this table with all of the resources we integrated and citations to the related supplementary materials.


semihsalihoglu-uw commented on August 11, 2024

Great, thank you very much! I'm closing the issue.
