twitter / cassovary Goto Github PK

View Code? Open in Web Editor NEW

1.0K 1.0K 154.0 51.54 MB

Cassovary is a simple big graph processing library for the JVM

Home Page: http://twitter.com/cassovary

License: Apache License 2.0

Scala 98.33% Java 1.07% Shell 0.60%

cassovary's People

Contributors

Stargazers

Watchers

Forkers

chenhaot pankajgupta everpeace jsalim zeedark lahi niedfelj delip d6y gdevanla ningliang saurabhk julienledem aneeshs amisarca write2munish agrewal caniszczyk jcccf kaychaks rodzyn0688 jdyaberi akyrolatwitter ganeshgitrepo erwink tusharbihani dariobottazzi mailmahee pankajb64 hellcoderz moh-yakoub nod3x ahmed26 codejitsu halfnhav4 forschnix shrikrishnaholla dfuhry shreyas24 szymonm romilbansal ducnguyen1911 evan hengzhang bdacode akshay-bhardwaj he-yunlong tchen0123 junzhusecurity lpand data-processing alisheikh huitseeker emaxerrno icomputational smazumder05 hcchen 9cc9 edwardt bartekkalinka adamkozuch melin plofgren xianghang travisbrown bmckown anamariamocanu juggernautchetu09 cloudcalvin nextlane daniharsh28 wmiel joliny 54chenjian fysoft2006 tlasica d13sl0w zjpjack constantineg1 codeaudit xorlev vivacevivo tmp2 hardiku shree-shubham anukat2015 chandansaha2014 zxh19800626 lintool fizx tinggalklik emmanuj youngbink yeshwanth-reddy bedeedidiong giserh alok-saini malterb wiltonlazary rjsen

cassovary's Issues

Running examples should use 'sbt run'

There are a few examples in examples/scala (and in examples/java) that demonstrate how to use the cassovary library. These are built and run using a hacky bash script examples/scala/build_run_example.sh (which uses examples/build_run_example.sh). Instead we should be using the run command in sbt to build and run these examples.

Another alternative is to build a separate sbt project just for examples and use cassovary library as a dependency from maven central (as explained in README.md)

Port build file (build.sbt) to Build.scala

Cassovary uses sbt (see http://www.scala-sbt.org) and has its build definition in build.sbt. It would be nicer to port that to a scala based build definition.

[enhancement] Explain impacts of Node ID scheme

Noticed an interesting issue when first experimenting with Cassovary. When trying to do PageRank on a 25,000 node graph I kept experiencing many OutOfMemory errors, even with 4G allocated to the java heap.

I realized one of the problems was due to the ID scheme being used by my nodes. Basically, some nodes had an ID similar to 9080256 and so the PageRank algorithm was iterating as many times as the maximum node ID, even though 9 million node IDs were not valid. It could be that I didn't properly read the source/docs but I can see this being a design flaw. It might be more helpful to explain the purpose of node IDs or to change the node ID scheme altogether.

Allow info to be stored per node in an array of node ids

Motivated by #138 where a centrality score is being computed per node and it might be sometimes most efficient to keep information per node stored in an array.

Currently we have graph.tourist.InfoKeeper that keeps info in a mutable.Map which precludes storing infoPerNode in an array.

How about declaring a new trait that has get/put. and allowing infoPerNode to return a collection of this trait type. This can be either a mutable.Map (as now) or an Array with implicit conversions from the Map and Array to this trait?

Comments @szymonm ?

Build for scala 2.10

It should be cross-built so as to allow both scala 2.9 and 2.10 users to be use it.

Fix warnings from scala 2.10 compilation

Compiling in scala 2.10 gives a few warnings. To look at these warnings do: "./sbt clean update ++2.10.3 test"

Benchmark current performance on public datasets

Use datasets on http://snap.stanford.edu/data/index.html and benchmark performance of a couple of algorithms for very big graphs, such as (1) Global pagerank, (2) Personalized pagerank for every node in the graph.

Improve reading Int graphs

Now, both graph readers use Regex matching to read graphs. This is very inefficient for the Int case. Could be performed better avoiding regex for something like readInt from InputStream.

Enhance PerformanceBenchmark in cassovary-benchmarks to also report java memory in use

Currently it only reports speed.

Examples don't run

If I try running the examples I get:

Compiling HelloGraph ...
Running HelloGraph...
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.augmentString(Ljava/lang/String;)Ljava/lang/String;
        at HelloGraph$.main(HelloGraph.scala:24)
        at HelloGraph.main(HelloGraph.scala)

I've tried on both Linux and OSX with same result. Seems to be a issue with library dependencies and compiling with one scala version and trying to run with another... However I'm just starting to learn Scala so not sure what to do here...

allow traversals to use both in/out edges

Currently for Traversals you have to specify the direction of edges that will be used. This is to allow walks on directed graphs as if they were undirected.

TriangleCountSpec does not consistently pass

The randomly generated graph does not consistently provide a triangle count of 0 +/- 20.

https://travis-ci.org/twitter/cassovary/jobs/48922273#L1352

SBT update not working

When running ./sbt update, I have the following issues:

[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-javadoc.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.pom
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.

Adding cassovary as dependency does not work

When adding cassovary as dependency in another project (according to Readme):
libraryDependencies += "com.twitter" %% "cassovary" % "4.0.0"
in build.sbt
I get compile error "object cassovary is not a member of package com.twitter".

I checked cassovary jar in Ivy repository cache and it does not contain any classes (just metainf directory). It's the same, when I compile cassovary with sbt package - jar in target directory is empty, and there's no classes in main target directory.

Workaround is:
libraryDependencies += "com.twitter" %% "cassovary-core" % "4.0.0"

Should we modify Readme or fix the build to match Readme?

Build for newer Scala versions (2.9.x)

The Cassovary tree, as shipped, only builds against Scala 2.8.1.

Local experiments with building for 2.9.1 and 2.9.2 work (compile and pass all tests), but specs has to be updated to 1.6.9 compiled for 2.9.1.

I would submit a pull request, but my current changes break 2.8.1 builds for some reason I have yet to debug. I redefine 'specs' as follows:

val specs = buildScalaVersion match {
  case "2.8.1" =>
    "org.scala-tools.testing" % "specs_2.8.1" % "1.6.6" % "test" withSources()
  case "2.9.1" | "2.9.2" =>
    "org.scala-tools.testing" % "specs_2.9.1" % "1.6.9" % "test" withSources()
}

and then set build.scala.versions to “2.8.1 2.9.1”. 2.9.1 builds & passes all tests, as does 2.9.2 (++2.9.2 test), but 2.8.1 does not for some reason.

Fix compilation warnings post move to scala 2.11 in cassovary version 4.0.0+

If you start working on this, please leave a comment here.

Speed up random graph generation

Currently the methods generateRandomGraph() and generateRandomUndirectedGraph() in TestGraph.scala are slow as they iterate over each of the potential O(n^2) edges to determine whether that edge should be picked or not.

fix graphUtils documentation

Documentation contains a lot of errors - esp. param names.

Change asserts in GraphBehaviours to ShoudMatchers for more graceful test failures

A violation of an assert() statement results in a stack trace. The asserts should be removed and replaced by "shouldEqual" type statements.

cc @szymonm

Parametrize Graph by NodeType

As @szymonm mentioned in #141 something like:

trait Graph[NodeType <: Node] {
getNodeById: Option[NodeType]
...
}

That allows Node to be subclassed and used in the respective Graph. e.g., in #141 we would do:

trait UndirectedGraph extends Graph[UndirectedNode] {
getNodeById: Option[UndirectedNode]
}

Deprecate Ostrich and use Metrics

https://github.com/twitter/ostrich is currently used to maintain internal Stats in Cassovary. Ostrich does not seem to be actively maintained. Instead, should move to https://github.com/twitter/commons/tree/master/src/java/com/twitter/common/metrics as is used in Finagle-stats (http://twitter.github.io/finagle/guide/Metrics.html) as well.

Parallelize Pagerank to run on multiple cores

Current pagerank implementation in com.twitter.cassovary.algorithms is single-threaded. Implement a multi-thread version to make it run faster. Also, report benchmark results comparing the parallel version with the current version (see PagerankBenchmark in cassovary-benchmarks)

Benchmark performance of dynamic graph implementation on a synthetic update stream on a large graph

There are currently two implementations for dynamic directed graphs. src/main/scala/com/twitter/cassovary/graph/{DynamicDirectedGraphHashMap.scala, SynchronizedDynamicGraph.scala}. Generate a synthetic update stream (adds/deletes) on a large graph and benchmark the performance of an algorithm such as calculatePersonalizedReputation() or bfsWalk() in src/main/scala/com/twitter/cassovary/graph/GraphUtils.scala while the graph is handling updates.

Add bipartite graph tests that were removed from PR #148

CC @AnishShah

git-tag releases

In the future, please use git-tag for releases.

SharedArrayBasedDirectedGraph does not keep in edges in shared array

It only seems to keep the out edges in a shared array.

HelloLoadGraph does not finish

When running HelloLoadGraph the program does not finish. There is something wrong with reading graph from file.

Create labeled graph in com.twitter.cassovary.graph

Right now, the graph data structure can not have labels on nodes or edges. Allow arbitrary labels in a new class. As a special case, allow only one int to be associated with each edge and each node.

examples/build_run_example.sh not working

Hi, i encountered the problem when i tried to run the example.

The error is as following:

Compiling HelloGraph ...
error: error while loading TestGraphs, Scala signature TestGraphs has wrong version
expected: 4.1
found: 5.0
one error found
Running HelloGraph...
Exception in thread "main" java.lang.NoClassDefFoundError: HelloGraph
Caused by: java.lang.ClassNotFoundException: HelloGraph
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: HelloGraph. Program will exit.

Is it due to the sbt and scala version problem?
The version of scala are different for building and using.

$ ./sbt
[info] Standard project rules 0.12.7 loaded (2011-05-24).
[warn] No .svnrepo file; no svn repo will be configured.
[info] Building project cassovary 1.0.1-SNAPSHOT against Scala 2.8.1
[info] using CassovaryProject with sbt 0.7.4 and Scala 2.7.7

Add check for negative numbers in getNodeById()

bump fastutils to the newest version

We use fastutil 6.4.4 while 6.5.16 is the current.

Implement a triangle counting algorithm

for example, the randomized algorithm in http://arxiv.org/abs/1212.2264 or one of the other algorithms mentioned in this paper.

Implement some similarity algorithms

Such as cosine similarity and jaccard similarity when given
(1) Two nodes N1 and N2, find the similarity values between them in a given direction (Out or In)
(2) One node N1, find the top K similar nodes to N1

Cosine(N1, N2) = neighbors(N1) ∩ neighbors(N2) divided by (sqrt(numNeighbors(N1)) * sqrt(numNeighbors(N2))

and Jaccard(N1,N2) = neighbors(N1) ∩ neighbors(N2) divided by neighbors(N1) ∪ neighbors(N2)

Please add the algorithms in com.twitter.cassovary.algorithms

Improve SharedArrayBasedDirectedGraph documentation

i.a. some examples would be useful.

Publish cassovary to maven central

Now that cassovary is open sourced, it should be published to maven central every release.

To do this with SBT, see this link: http://www.scala-sbt.org/using_sonatype.html

If you need help doing this, consult with the Scalding project who does this...
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22scalding_2.8.1%22

Remove library eviction warnings with sbt

When using sbt, we are currently getting:

[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] * com.google.guava:guava:13.0.1 -> 16.0.1
[warn] Run 'evicted' to see detailed eviction warnings

Remove the unnecessary "() =>" in `def iteratorSeq` from com.twitter.cassovary.util.io.GraphReader

It doesn't seem to me that this is necessary. The original idea I guess was to ensure that the files are read during graph construction and not upfront. But even if we just have Seq[Iterator[NodeIdEdgesMaxId]] only the iterators are constructed and not actually traversed. Hence it seems like the extra empty parentheses are un-necessary.

Would you agree @szymonm ?

Clarify support for Undirected Graphs

I have an application where I needed an undirected graph. I actually didn't realize until today that "Mutual" in StoredGraphDir meant undirected and was starting to write my own UndirectedGraph wrapper class when I noticed it. How should we make this more clear? For example I could add a comment to StoredGraphDir along the lines of
Mutual, // In a mutual graph, the outbound and inbound neighbors of each node are equal, and space is typically saved by only storing the outbound neighbors of each node
and/or add to the documentation of DirectedGraph that it actually also supports undirected graphs. Or should there be an UndirectedGraph wrapper in Cassovary?
@pankajgupta

Allow node ids to be Long

This is related to #36 but different. We want the ability of node ids to be Long. The use case is when storing a portion of a huge graph in memory on one machine. The nodes can be too many to be able to map to int. Even though the number of nodes on a single machine are small enough, they can refer to arbitrary remote nodes in their adjacency list.

This can be done by making Node and Graph to be of type[@specialized(Int, Long) V], making 'id' to be of type V and doing suitable refactoring. When the node is of type Long, a hash can be supplied to map node id to Node as required by getNodeById()

AdjacencyListGraphReader throws NPE when dir does not exist

Would be nice to have some more specific error message.

Create a graph which keeps neighboring nodes sorted by id

In some applications one needs fast intersections of adjacency lists and fast search of whether a node v is a neighbor of a node u. In such cases, a graph that keeps the adjacency list in sorted order would be great.

This should work for both in and out directions in both ArrayBasedDirectedGraph and SharedArrayBasedDirectedGraph.

move *Spec tests in src/test (one by one) from 'specs' to scalatest

'org.specs' has been deprecated and the tests should move to using org.scalatest. An example of the latter is in com.twitter.cassovary.graph.TestGraphSpec.

Remove the concept of internal/external id in BipartiteGraph

Simplify it to both LeftNode and RightNode being the same BipartiteNode and the choice of which ids belong to LHS and which ids belong to RHS left to the user. The constructor of BipartiteGraph can take in a function parameter isLeftNode: Int => Boolean that tells whether an id is for left or right side of the graph, which should be checked for neighbors of a BipartiteNode and a runtime exception thrown if a left node has another left node has a neighbor (or right node has another right node as a neighbor).
CC @AnishShah

Build for Scala 2.11

Please could you release a build for Scala 2.11?

Extend AdjacencyListG..R... and ListOfEdgesG...R... to work on Long and String Node ids

After the node renumbering, there is currently no way to read graphs with long / string ids. Assume a specified single char separator (i.e., similar to the use of space right now) in both cases.

Make all directed graph implementations parametrized by NodeType post PR #148

CC @AnishShah

Analyze PersonalizedPageRankBenchmark on a graph and identify opportunities for speed and storage optimization

Profile using YourKit profiler
Identify biggest contributors to time and storage
Figure out ways to optimize the biggest contributors

Cross publish Scala 2.10/2.11 releases

Use scala futures/promises in com.twitter.cassovary.graph.ArrayBasedDirectedGraph

Currently this object uses a couple of hand-rolled methods "parallelForEach" and "parallelWork" in com.twitter.cassovary.util.ExecutorUtils. It will be more elegant to use scala futures and/or promises instead.

Extend com.twitter.cassovary.util.NodeRenumberer to use Long and String node ids

NodeRenumberer is a utility to convert the node ids supplied to a more compact, sequentially increasing integer node ids used internally. However, the current implementation only allows integer node ids to be supplied at the input. Would be great to allow Long and String node ids as well (the internal node ids would still be integer).

twitter / cassovary Goto Github PK

cassovary's People

Contributors

Stargazers

Watchers

Forkers

cassovary's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs