twitter / cassovary Goto Github PK
View Code? Open in Web Editor NEWCassovary is a simple big graph processing library for the JVM
Home Page: http://twitter.com/cassovary
License: Apache License 2.0
Cassovary is a simple big graph processing library for the JVM
Home Page: http://twitter.com/cassovary
License: Apache License 2.0
There are a few examples in examples/scala (and in examples/java) that demonstrate how to use the cassovary library. These are built and run using a hacky bash script examples/scala/build_run_example.sh (which uses examples/build_run_example.sh). Instead we should be using the run command in sbt to build and run these examples.
Another alternative is to build a separate sbt project just for examples and use cassovary library as a dependency from maven central (as explained in README.md)
Cassovary uses sbt (see http://www.scala-sbt.org) and has its build definition in build.sbt. It would be nicer to port that to a scala based build definition.
Noticed an interesting issue when first experimenting with Cassovary. When trying to do PageRank on a 25,000 node graph I kept experiencing many OutOfMemory errors, even with 4G allocated to the java heap.
I realized one of the problems was due to the ID scheme being used by my nodes. Basically, some nodes had an ID similar to 9080256 and so the PageRank algorithm was iterating as many times as the maximum node ID, even though 9 million node IDs were not valid. It could be that I didn't properly read the source/docs but I can see this being a design flaw. It might be more helpful to explain the purpose of node IDs or to change the node ID scheme altogether.
Motivated by #138 where a centrality score is being computed per node and it might be sometimes most efficient to keep information per node stored in an array.
Currently we have graph.tourist.InfoKeeper that keeps info in a mutable.Map which precludes storing infoPerNode in an array.
How about declaring a new trait that has get/put. and allowing infoPerNode to return a collection of this trait type. This can be either a mutable.Map (as now) or an Array with implicit conversions from the Map and Array to this trait?
Comments @szymonm ?
It should be cross-built so as to allow both scala 2.9 and 2.10 users to be use it.
Compiling in scala 2.10 gives a few warnings. To look at these warnings do: "./sbt clean update ++2.10.3 test"
Use datasets on http://snap.stanford.edu/data/index.html and benchmark performance of a couple of algorithms for very big graphs, such as (1) Global pagerank, (2) Personalized pagerank for every node in the graph.
Now, both graph readers use Regex matching to read graphs. This is very inefficient for the Int case. Could be performed better avoiding regex for something like readInt from InputStream.
Currently it only reports speed.
If I try running the examples I get:
Compiling HelloGraph ...
Running HelloGraph...
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.augmentString(Ljava/lang/String;)Ljava/lang/String;
at HelloGraph$.main(HelloGraph.scala:24)
at HelloGraph.main(HelloGraph.scala)
I've tried on both Linux and OSX with same result. Seems to be a issue with library dependencies and compiling with one scala version and trying to run with another... However I'm just starting to learn Scala so not sure what to do here...
Currently for Traversals you have to specify the direction of edges that will be used. This is to allow walks on directed graphs as if they were undirected.
The randomly generated graph does not consistently provide a triangle count of 0 +/- 20.
When running ./sbt update
, I have the following issues:
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-javadoc.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.pom
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.
When adding cassovary as dependency in another project (according to Readme):
libraryDependencies += "com.twitter" %% "cassovary" % "4.0.0"
in build.sbt
I get compile error "object cassovary is not a member of package com.twitter".
I checked cassovary jar in Ivy repository cache and it does not contain any classes (just metainf directory). It's the same, when I compile cassovary with sbt package - jar in target directory is empty, and there's no classes in main target directory.
Workaround is:
libraryDependencies += "com.twitter" %% "cassovary-core" % "4.0.0"
Should we modify Readme or fix the build to match Readme?
The Cassovary tree, as shipped, only builds against Scala 2.8.1.
Local experiments with building for 2.9.1 and 2.9.2 work (compile and pass all tests), but specs has to be updated to 1.6.9 compiled for 2.9.1.
I would submit a pull request, but my current changes break 2.8.1 builds for some reason I have yet to debug. I redefine 'specs' as follows:
val specs = buildScalaVersion match {
case "2.8.1" =>
"org.scala-tools.testing" % "specs_2.8.1" % "1.6.6" % "test" withSources()
case "2.9.1" | "2.9.2" =>
"org.scala-tools.testing" % "specs_2.9.1" % "1.6.9" % "test" withSources()
}
and then set build.scala.versions
to “2.8.1 2.9.1”. 2.9.1 builds & passes all tests, as does 2.9.2 (++2.9.2 test
), but 2.8.1 does not for some reason.
If you start working on this, please leave a comment here.
Currently the methods generateRandomGraph() and generateRandomUndirectedGraph() in TestGraph.scala are slow as they iterate over each of the potential O(n^2) edges to determine whether that edge should be picked or not.
Documentation contains a lot of errors - esp. param names.
A violation of an assert() statement results in a stack trace. The asserts should be removed and replaced by "shouldEqual" type statements.
cc @szymonm
As @szymonm mentioned in #141 something like:
trait Graph[NodeType <: Node] {
getNodeById: Option[NodeType]
...
}
That allows Node to be subclassed and used in the respective Graph. e.g., in #141 we would do:
trait UndirectedGraph extends Graph[UndirectedNode] {
getNodeById: Option[UndirectedNode]
}
https://github.com/twitter/ostrich is currently used to maintain internal Stats in Cassovary. Ostrich does not seem to be actively maintained. Instead, should move to https://github.com/twitter/commons/tree/master/src/java/com/twitter/common/metrics as is used in Finagle-stats (http://twitter.github.io/finagle/guide/Metrics.html) as well.
Current pagerank implementation in com.twitter.cassovary.algorithms is single-threaded. Implement a multi-thread version to make it run faster. Also, report benchmark results comparing the parallel version with the current version (see PagerankBenchmark in cassovary-benchmarks)
There are currently two implementations for dynamic directed graphs. src/main/scala/com/twitter/cassovary/graph/{DynamicDirectedGraphHashMap.scala, SynchronizedDynamicGraph.scala}. Generate a synthetic update stream (adds/deletes) on a large graph and benchmark the performance of an algorithm such as calculatePersonalizedReputation() or bfsWalk() in src/main/scala/com/twitter/cassovary/graph/GraphUtils.scala while the graph is handling updates.
CC @AnishShah
In the future, please use git-tag for releases.
It only seems to keep the out edges in a shared array.
When running HelloLoadGraph the program does not finish. There is something wrong with reading graph from file.
Right now, the graph data structure can not have labels on nodes or edges. Allow arbitrary labels in a new class. As a special case, allow only one int to be associated with each edge and each node.
Hi, i encountered the problem when i tried to run the example.
The error is as following:
Compiling HelloGraph ...
error: error while loading TestGraphs, Scala signature TestGraphs has wrong version
expected: 4.1
found: 5.0
one error found
Running HelloGraph...
Exception in thread "main" java.lang.NoClassDefFoundError: HelloGraph
Caused by: java.lang.ClassNotFoundException: HelloGraph
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: HelloGraph. Program will exit.
Is it due to the sbt and scala version problem?
The version of scala are different for building and using.
$ ./sbt
[info] Standard project rules 0.12.7 loaded (2011-05-24).
[warn] No .svnrepo file; no svn repo will be configured.
[info] Building project cassovary 1.0.1-SNAPSHOT against Scala 2.8.1
[info] using CassovaryProject with sbt 0.7.4 and Scala 2.7.7
We use fastutil 6.4.4 while 6.5.16 is the current.
for example, the randomized algorithm in http://arxiv.org/abs/1212.2264 or one of the other algorithms mentioned in this paper.
Such as cosine similarity and jaccard similarity when given
(1) Two nodes N1 and N2, find the similarity values between them in a given direction (Out or In)
(2) One node N1, find the top K similar nodes to N1
Cosine(N1, N2) = neighbors(N1) ∩ neighbors(N2) divided by (sqrt(numNeighbors(N1)) * sqrt(numNeighbors(N2))
and Jaccard(N1,N2) = neighbors(N1) ∩ neighbors(N2) divided by neighbors(N1) ∪ neighbors(N2)
Please add the algorithms in com.twitter.cassovary.algorithms
i.a. some examples would be useful.
Now that cassovary is open sourced, it should be published to maven central every release.
To do this with SBT, see this link: http://www.scala-sbt.org/using_sonatype.html
If you need help doing this, consult with the Scalding project who does this...
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22scalding_2.8.1%22
When using sbt, we are currently getting:
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] * com.google.guava:guava:13.0.1 -> 16.0.1
[warn] Run 'evicted' to see detailed eviction warnings
It doesn't seem to me that this is necessary. The original idea I guess was to ensure that the files are read during graph construction and not upfront. But even if we just have Seq[Iterator[NodeIdEdgesMaxId]] only the iterators are constructed and not actually traversed. Hence it seems like the extra empty parentheses are un-necessary.
Would you agree @szymonm ?
I have an application where I needed an undirected graph. I actually didn't realize until today that "Mutual" in StoredGraphDir meant undirected and was starting to write my own UndirectedGraph wrapper class when I noticed it. How should we make this more clear? For example I could add a comment to StoredGraphDir along the lines of
Mutual, // In a mutual graph, the outbound and inbound neighbors of each node are equal, and space is typically saved by only storing the outbound neighbors of each node
and/or add to the documentation of DirectedGraph that it actually also supports undirected graphs. Or should there be an UndirectedGraph wrapper in Cassovary?
@pankajgupta
This is related to #36 but different. We want the ability of node ids to be Long. The use case is when storing a portion of a huge graph in memory on one machine. The nodes can be too many to be able to map to int. Even though the number of nodes on a single machine are small enough, they can refer to arbitrary remote nodes in their adjacency list.
This can be done by making Node and Graph to be of type[@specialized(Int, Long) V], making 'id' to be of type V and doing suitable refactoring. When the node is of type Long, a hash can be supplied to map node id to Node as required by getNodeById()
Would be nice to have some more specific error message.
In some applications one needs fast intersections of adjacency lists and fast search of whether a node v is a neighbor of a node u. In such cases, a graph that keeps the adjacency list in sorted order would be great.
This should work for both in and out directions in both ArrayBasedDirectedGraph and SharedArrayBasedDirectedGraph.
'org.specs' has been deprecated and the tests should move to using org.scalatest. An example of the latter is in com.twitter.cassovary.graph.TestGraphSpec.
Simplify it to both LeftNode and RightNode being the same BipartiteNode and the choice of which ids belong to LHS and which ids belong to RHS left to the user. The constructor of BipartiteGraph can take in a function parameter isLeftNode: Int => Boolean that tells whether an id is for left or right side of the graph, which should be checked for neighbors of a BipartiteNode and a runtime exception thrown if a left node has another left node has a neighbor (or right node has another right node as a neighbor).
CC @AnishShah
Please could you release a build for Scala 2.11?
After the node renumbering, there is currently no way to read graphs with long / string ids. Assume a specified single char separator (i.e., similar to the use of space right now) in both cases.
CC @AnishShah
Currently this object uses a couple of hand-rolled methods "parallelForEach" and "parallelWork" in com.twitter.cassovary.util.ExecutorUtils. It will be more elegant to use scala futures and/or promises instead.
NodeRenumberer is a utility to convert the node ids supplied to a more compact, sequentially increasing integer node ids used internally. However, the current implementation only allows integer node ids to be supplied at the input. Would be great to allow Long and String node ids as well (the internal node ids would still be integer).
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.