twitter / cassovary

Cassovary is a simple big graph processing library for the JVM

Home Page: http://twitter.com/cassovary

License: Apache License 2.0

Scala 98.33% Java 1.07% Shell 0.60%

cassovary's People

Contributors

adamkozuch, adelbertc, agrewal, alanbato, anishshah, bartekkalinka, bmckown, caniszczyk, drbild, fizx, jcccf, juliaferraioli, ningtwitter, pankajb64, pankajgupta, plofgren, shreyas24, szymonm, tlasica, travisbrown, vinodkumarlogan, wmiel, youngbink, zjpjack

cassovary's Issues

Running examples should use 'sbt run'

There are a few examples in examples/scala (and in examples/java) that demonstrate how to use the cassovary library. These are built and run using a hacky bash script examples/scala/build_run_example.sh (which uses examples/build_run_example.sh). Instead we should be using the run command in sbt to build and run these examples.

Alternatively, we could build a separate sbt project just for the examples and use the cassovary library as a dependency from Maven Central (as explained in README.md).

[enhancement] Explain impacts of Node ID scheme

Noticed an interesting issue when first experimenting with Cassovary. When trying to do PageRank on a 25,000 node graph I kept experiencing many OutOfMemory errors, even with 4G allocated to the java heap.

I realized one of the problems was due to the ID scheme being used by my nodes. Basically, some nodes had an ID similar to 9080256 and so the PageRank algorithm was iterating as many times as the maximum node ID, even though 9 million node IDs were not valid. It could be that I didn't properly read the source/docs but I can see this being a design flaw. It might be more helpful to explain the purpose of node IDs or to change the node ID scheme altogether.

Allow info to be stored per node in an array of node ids

Motivated by #138, where a centrality score is computed per node. It can sometimes be most efficient to keep per-node information in an array.

Currently we have graph.tourist.InfoKeeper that keeps info in a mutable.Map which precludes storing infoPerNode in an array.

How about declaring a new trait that has get/put, and allowing infoPerNode to return a collection of this trait type? This could be backed by either a mutable.Map (as now) or an Array, with implicit conversions from Map and Array to this trait.
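A minimal sketch of what such a trait could look like. The names NodeInfoStore, fromMap and fromArray are hypothetical and not part of Cassovary; the array variant assumes node ids are small non-negative integers and uses Option to mark missing entries.

```scala
import scala.collection.mutable
import scala.language.implicitConversions

// Hypothetical trait (not an existing Cassovary type): per-node info with get/put.
trait NodeInfoStore[T] {
  def get(nodeId: Int): Option[T]
  def put(nodeId: Int, info: T): Unit
}

object NodeInfoStore {
  // Map-backed store: works for arbitrary, sparse node ids.
  implicit def fromMap[T](m: mutable.Map[Int, T]): NodeInfoStore[T] =
    new NodeInfoStore[T] {
      def get(nodeId: Int): Option[T] = m.get(nodeId)
      def put(nodeId: Int, info: T): Unit = m.update(nodeId, info)
    }

  // Array-backed store: assumes ids lie in [0, a.length); None marks "no info yet".
  implicit def fromArray[T](a: Array[Option[T]]): NodeInfoStore[T] =
    new NodeInfoStore[T] {
      def get(nodeId: Int): Option[T] =
        if (nodeId >= 0 && nodeId < a.length) a(nodeId) else None
      def put(nodeId: Int, info: T): Unit = a(nodeId) = Some(info)
    }
}
```

With this shape, code that only needs get/put does not have to know whether a Map or an Array sits underneath.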

Comments @szymonm ?

Build for scala 2.10

It should be cross-built so that both Scala 2.9 and 2.10 users can use it.

Improve reading Int graphs

Currently, both graph readers use regex matching to read graphs. This is very inefficient for the Int case. It could be done more efficiently by avoiding regex in favor of something like readInt from an InputStream.
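To illustrate the idea, here is a rough sketch of regex-free parsing. It scans one line of text character by character rather than reading from an InputStream, and it assumes non-negative decimal ids with no overflow checking. IntLineParser and parseInts are hypothetical names, not Cassovary's reader API.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: extracts non-negative integers from a line without regex.
object IntLineParser {
  private def isDigit(c: Char): Boolean = c >= '0' && c <= '9'

  // Any run of non-digit characters is treated as a separator.
  def parseInts(line: String): Array[Int] = {
    val out = ArrayBuffer.empty[Int]
    val n = line.length
    var i = 0
    while (i < n) {
      // Skip separators.
      while (i < n && !isDigit(line.charAt(i))) i += 1
      if (i < n) {
        // Accumulate one integer digit by digit (no overflow check in this sketch).
        var value = 0
        while (i < n && isDigit(line.charAt(i))) {
          value = value * 10 + (line.charAt(i) - '0')
          i += 1
        }
        out += value
      }
    }
    out.toArray
  }
}
```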

Examples don't run

If I try running the examples I get:

Compiling HelloGraph ...
Running HelloGraph...
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.augmentString(Ljava/lang/String;)Ljava/lang/String;
        at HelloGraph$.main(HelloGraph.scala:24)
        at HelloGraph.main(HelloGraph.scala)

I've tried on both Linux and OS X with the same result. It seems to be an issue with library dependencies: compiling with one Scala version and trying to run with another. However, I'm just starting to learn Scala, so I'm not sure what to do here.

SBT update not working

When running ./sbt update, I have the following issues:

[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-javadoc.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.pom
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.

Adding cassovary as dependency does not work

When adding cassovary as dependency in another project (according to Readme):
libraryDependencies += "com.twitter" %% "cassovary" % "4.0.0"
in build.sbt
I get compile error "object cassovary is not a member of package com.twitter".

I checked the cassovary jar in the Ivy repository cache and it does not contain any classes (just a META-INF directory). It's the same when I compile cassovary with sbt package: the jar in the target directory is empty, and there are no classes in the main target directory.

Workaround is:
libraryDependencies += "com.twitter" %% "cassovary-core" % "4.0.0"

Should we modify Readme or fix the build to match Readme?

Build for newer Scala versions (2.9.x)

The Cassovary tree, as shipped, only builds against Scala 2.8.1.

Local experiments with building for 2.9.1 and 2.9.2 work (compile and pass all tests), but specs has to be updated to 1.6.9 compiled for 2.9.1.

I would submit a pull request, but my current changes break 2.8.1 builds for some reason I have yet to debug. I redefine 'specs' as follows:

val specs = buildScalaVersion match {
  case "2.8.1" =>
    "org.scala-tools.testing" % "specs_2.8.1" % "1.6.6" % "test" withSources()
  case "2.9.1" | "2.9.2" =>
    "org.scala-tools.testing" % "specs_2.9.1" % "1.6.9" % "test" withSources()
}

and then set build.scala.versions to “2.8.1 2.9.1”. 2.9.1 builds & passes all tests, as does 2.9.2 (++2.9.2 test), but 2.8.1 does not for some reason.

Speed up random graph generation

Currently the methods generateRandomGraph() and generateRandomUndirectedGraph() in TestGraph.scala are slow as they iterate over each of the potential O(n^2) edges to determine whether that edge should be picked or not.
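One possible speed-up, sketched under assumptions: instead of testing every potential edge, jump between selected edges using geometrically distributed skips, so the work is proportional to the number of edges produced. This sketch assumes each directed edge (no self-loops) is picked independently with probability p, which may differ from what TestGraph.scala does today. FastRandomGraph and randomDirectedEdges are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical sketch: sample a directed G(n, p) graph by skipping over unpicked edges.
object FastRandomGraph {
  def randomDirectedEdges(n: Int, p: Double, rng: Random): Array[(Int, Int)] = {
    require(n >= 2 && p > 0.0 && p < 1.0, "sketch assumes n >= 2 and 0 < p < 1")
    val total = n.toLong * (n - 1)      // number of potential edges without self-loops
    val logQ = math.log(1.0 - p)
    val edges = ArrayBuffer.empty[(Int, Int)]
    var k = -1L                         // index of the last picked potential edge
    var done = false
    while (!done) {
      // Number of potential edges skipped before the next pick (geometric distribution).
      val skip = math.floor(math.log(1.0 - rng.nextDouble()) / logQ)
      if (skip >= (total - k - 1).toDouble) done = true
      else {
        k += 1 + skip.toLong
        // Map the linear index back to (u, v), stepping over the self-loop slot.
        val u = (k / (n - 1)).toInt
        val j = (k % (n - 1)).toInt
        val v = if (j >= u) j + 1 else j
        edges += ((u, v))
      }
    }
    edges.toArray
  }
}
```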

Parametrize Graph by NodeType

As @szymonm mentioned in #141, something like:

trait Graph[NodeType <: Node] {
  def getNodeById(id: Int): Option[NodeType]
  ...
}

That allows Node to be subclassed and used in the respective Graph. For example, in #141 we would do:

trait UndirectedGraph extends Graph[UndirectedNode] {
  def getNodeById(id: Int): Option[UndirectedNode]
}

Parallelize Pagerank to run on multiple cores

The current PageRank implementation in com.twitter.cassovary.algorithms is single-threaded. Implement a multi-threaded version to make it run faster. Also, report benchmark results comparing the parallel version with the current version (see PagerankBenchmark in cassovary-benchmarks).
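As a starting point, here is a rough sketch of one way to parallelize the iteration. It is not based on the existing Cassovary code: it assumes the graph is given as plain arrays (inNeighbors and outDegree, both hypothetical), ignores dangling nodes, and uses a pull-based update so that each task writes only its own slice of the rank array.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical sketch: pull-based PageRank with node ranges processed in parallel.
object ParallelPageRankSketch {
  // inNeighbors(v) lists the nodes u with an edge u -> v; outDegree(u) is u's out-degree.
  def pageRank(inNeighbors: Array[Array[Int]], outDegree: Array[Int],
               damping: Double, iterations: Int, numChunks: Int): Array[Double] = {
    require(numChunks > 0, "numChunks must be positive")
    val n = inNeighbors.length
    var ranks = Array.fill(n)(1.0 / n)
    val chunkSize = math.max(1, (n + numChunks - 1) / numChunks)
    for (_ <- 0 until iterations) {
      val prev = ranks
      val next = new Array[Double](n)
      // One task per contiguous node range; tasks read prev and write disjoint slots of next.
      val tasks = (0 until n by chunkSize).map { start =>
        Future {
          val end = math.min(n, start + chunkSize)
          var v = start
          while (v < end) {
            var sum = 0.0
            for (u <- inNeighbors(v)) sum += prev(u) / outDegree(u)
            next(v) = (1.0 - damping) / n + damping * sum
            v += 1
          }
        }
      }
      // Wait for all ranges before starting the next iteration.
      Await.result(Future.sequence(tasks), Duration.Inf)
      ranks = next
    }
    ranks
  }
}
```

Pulling rank from in-neighbors avoids locks because no two tasks write the same array slot; a push-based version would need atomic updates or per-thread buffers.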

Benchmark performance of dynamic graph implementation on a synthetic update stream on a large graph

There are currently two implementations of dynamic directed graphs: src/main/scala/com/twitter/cassovary/graph/{DynamicDirectedGraphHashMap.scala, SynchronizedDynamicGraph.scala}. Generate a synthetic update stream (adds/deletes) on a large graph and benchmark the performance of an algorithm such as calculatePersonalizedReputation() or bfsWalk() in src/main/scala/com/twitter/cassovary/graph/GraphUtils.scala while the graph is handling updates.

examples/build_run_example.sh not working

Hi, I encountered a problem when I tried to run the example.

The error is as follows:

Compiling HelloGraph ...
error: error while loading TestGraphs, Scala signature TestGraphs has wrong version
expected: 4.1
found: 5.0
one error found
Running HelloGraph...
Exception in thread "main" java.lang.NoClassDefFoundError: HelloGraph
Caused by: java.lang.ClassNotFoundException: HelloGraph
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: HelloGraph. Program will exit.

Is it due to an sbt and Scala version mismatch?
The Scala versions used for building and for running are different:

$ ./sbt
[info] Standard project rules 0.12.7 loaded (2011-05-24).
[warn] No .svnrepo file; no svn repo will be configured.
[info] Building project cassovary 1.0.1-SNAPSHOT against Scala 2.8.1
[info] using CassovaryProject with sbt 0.7.4 and Scala 2.7.7

Implement some similarity algorithms

Such as cosine similarity and Jaccard similarity. Given
(1) two nodes N1 and N2, find the similarity value between them in a given direction (Out or In)
(2) one node N1, find the top K nodes most similar to N1

Cosine(N1, N2) = |neighbors(N1) ∩ neighbors(N2)| / (sqrt(numNeighbors(N1)) * sqrt(numNeighbors(N2)))

and Jaccard(N1, N2) = |neighbors(N1) ∩ neighbors(N2)| / |neighbors(N1) ∪ neighbors(N2)|

Please add the algorithms in com.twitter.cassovary.algorithms
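The two measures and the top-K query could be sketched as follows. This works on plain Set[Int] neighbor sets rather than on Cassovary's graph types, and SimilaritySketch, cosine, jaccard and topK are hypothetical names. Returning 0.0 for empty neighbor sets is a choice made for this sketch.

```scala
// Hypothetical sketch: neighbor-set similarity on plain Scala sets.
object SimilaritySketch {
  // |A ∩ B| / (sqrt(|A|) * sqrt(|B|)); defined as 0.0 when either set is empty.
  def cosine(a: Set[Int], b: Set[Int]): Double =
    if (a.isEmpty || b.isEmpty) 0.0
    else (a intersect b).size / (math.sqrt(a.size.toDouble) * math.sqrt(b.size.toDouble))

  // |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty.
  def jaccard(a: Set[Int], b: Set[Int]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // Top k nodes most similar to `node`; ties are broken by smaller node id.
  def topK(node: Int, neighbors: Map[Int, Set[Int]], k: Int,
           sim: (Set[Int], Set[Int]) => Double): Seq[(Int, Double)] = {
    val own = neighbors.getOrElse(node, Set.empty[Int])
    neighbors.toSeq
      .collect { case (other, ns) if other != node => (other, sim(own, ns)) }
      .sortWith { (x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1) }
      .take(k)
  }
}
```

This topK scores every other node, which is only practical for small graphs; a real implementation would likely restrict candidates to nodes that share at least one neighbor.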

Remove library eviction warnings with sbt

When using sbt, we are currently getting:

[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] * com.google.guava:guava:13.0.1 -> 16.0.1
[warn] Run 'evicted' to see detailed eviction warnings

Clarify support for Undirected Graphs

I have an application where I needed an undirected graph. I didn't realize until today that "Mutual" in StoredGraphDir meant undirected, and I was starting to write my own UndirectedGraph wrapper class when I noticed it. How should we make this clearer? For example, I could add a comment to StoredGraphDir along the lines of
Mutual, // In a mutual graph, the outbound and inbound neighbors of each node are equal, and space is typically saved by only storing the outbound neighbors of each node
and/or add to the documentation of DirectedGraph that it actually also supports undirected graphs. Or should there be an UndirectedGraph wrapper in Cassovary?
@pankajgupta

Allow node ids to be Long

This is related to #36 but different. We want node ids to be able to be Long. The use case is storing a portion of a huge graph in memory on one machine. There can be too many nodes overall to map to Int: even though the number of nodes on a single machine is small enough, they can refer to arbitrary remote nodes in their adjacency lists.

This can be done by giving Node and Graph a type parameter [@specialized(Int, Long) V], making 'id' of type V, and refactoring accordingly. When the node id is of type Long, a hash can be supplied to map node id to Node, as required by getNodeById().

Create a graph which keeps neighboring nodes sorted by id

In some applications one needs fast intersections of adjacency lists and fast search of whether a node v is a neighbor of a node u. In such cases, a graph that keeps the adjacency list in sorted order would be great.

This should work for both in and out directions in both ArrayBasedDirectedGraph and SharedArrayBasedDirectedGraph.
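To show why sorted adjacency lists help, here is a small sketch of the two operations on plain sorted Int arrays. It is independent of ArrayBasedDirectedGraph and SharedArrayBasedDirectedGraph, and SortedNeighbors is a hypothetical name. Both methods assume their inputs are sorted in ascending order with no duplicates.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: neighbor queries on adjacency lists sorted by node id.
object SortedNeighbors {
  // Membership test in O(log n) via binary search.
  def contains(sorted: Array[Int], target: Int): Boolean =
    java.util.Arrays.binarySearch(sorted, target) >= 0

  // Intersection in O(|a| + |b|) with a two-pointer merge.
  def intersect(a: Array[Int], b: Array[Int]): Array[Int] = {
    val out = ArrayBuffer.empty[Int]
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      if (a(i) < b(j)) i += 1
      else if (a(i) > b(j)) j += 1
      else { out += a(i); i += 1; j += 1 }
    }
    out.toArray
  }
}
```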

Remove the concept of internal/external id in BipartiteGraph

Simplify it so that both LeftNode and RightNode are the same BipartiteNode, with the choice of which ids belong to the LHS and which to the RHS left to the user. The constructor of BipartiteGraph can take a function parameter isLeftNode: Int => Boolean that tells whether an id is on the left or right side of the graph. This should be checked for the neighbors of each BipartiteNode, and a runtime exception thrown if a left node has another left node as a neighbor (or a right node has another right node as a neighbor).
CC @AnishShah
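The proposed check could look roughly like this. The sketch validates a plain adjacency Map rather than a real BipartiteGraph, and BipartiteCheck and validate are hypothetical names; only the isLeftNode: Int => Boolean parameter comes from the proposal above.

```scala
// Hypothetical sketch: verify that every edge crosses between the two sides.
object BipartiteCheck {
  // Throws IllegalArgumentException if a node has a neighbor on its own side.
  def validate(neighbors: Map[Int, Seq[Int]], isLeftNode: Int => Boolean): Unit =
    for ((id, ns) <- neighbors; n <- ns)
      if (isLeftNode(id) == isLeftNode(n)) {
        val side = if (isLeftNode(id)) "left" else "right"
        throw new IllegalArgumentException(
          s"$side node $id has $side node $n as a neighbor")
      }
}
```

For example, with isLeftNode = _ > 0, positive ids form the left side and all other ids the right side.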

Extend com.twitter.cassovary.util.NodeRenumberer to use Long and String node ids

NodeRenumberer is a utility that converts the supplied node ids to more compact, sequentially increasing integer node ids used internally. However, the current implementation only accepts integer node ids as input. It would be great to allow Long and String node ids as well (the internal node ids would still be integers).
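A minimal sketch of a renumberer that is generic in the external id type. GenericRenumberer and its method names are hypothetical and do not reflect the existing NodeRenumberer API. The sketch is not thread-safe and keeps both directions of the mapping in memory.

```scala
import scala.collection.mutable

// Hypothetical sketch: map external ids of any type K to dense internal Int ids.
class GenericRenumberer[K] {
  private val toInternal = mutable.HashMap.empty[K, Int]
  private val toExternal = mutable.ArrayBuffer.empty[K]

  // Returns the existing internal id, or assigns the next sequential one (starting at 0).
  def externalToInternal(key: K): Int =
    toInternal.getOrElseUpdate(key, {
      toExternal += key
      toExternal.length - 1
    })

  // Reverse lookup; throws IndexOutOfBoundsException for an unassigned internal id.
  def internalToExternal(id: Int): K = toExternal(id)

  // Number of distinct external ids seen so far.
  def count: Int = toExternal.length
}
```

Used with K = Long or K = String, the internal ids stay in the range 0 until count, so per-node arrays can be sized by the number of nodes rather than by the largest external id.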
