Home Page: http://graphframes.github.io/graphframes

License: Apache License 2.0

GraphFrames: DataFrame-based Graphs

This is a package for DataFrame-based graphs on top of Apache Spark. Users can write highly expressive queries by leveraging the DataFrame API, combined with a new API for motif finding. The user also benefits from DataFrame performance optimizations within the Spark SQL engine.
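To illustrate the motif-finding semantics only (this is not the GraphFrames API itself, which runs on Spark DataFrames): a pattern such as "(a)-[e]->(b); (b)-[e2]->(a)" matches pairs of vertices connected by edges in both directions. A minimal pure-Python sketch of that matching, over a small hypothetical edge list:

```python
# Hypothetical edge list (src, dst); in GraphFrames this would be an
# edges DataFrame with "src" and "dst" columns.
edges = [("a", "b"), ("b", "a"), ("b", "c")]

def find_bidirectional(edges):
    """Return (src, dst) pairs for which the reverse edge also exists,
    i.e. matches of the motif (a)-[e]->(b); (b)-[e2]->(a)."""
    edge_set = set(edges)
    return [(src, dst) for (src, dst) in edges if (dst, src) in edge_set]

print(find_bidirectional(edges))  # [('a', 'b'), ('b', 'a')]
```

In GraphFrames the same idea is expressed declaratively and executed by the Spark SQL engine, so it scales beyond in-memory edge lists.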

You can find the user guide and API docs at https://graphframes.github.io/graphframes.

Building and running unit tests

To compile this project, run build/sbt assembly from the project home directory. This will also run the Scala unit tests.

To run the Python unit tests, run the run-tests.sh script from the python/ directory. You will need to set SPARK_HOME to your local Spark installation directory.
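The steps above can be summarized as follows (the SPARK_HOME path is an example; adjust it to your local installation):

```shell
# From the project home directory: build the assembly JAR.
# This also runs the Scala unit tests.
build/sbt assembly

# Run the Python unit tests. SPARK_HOME must point at a local
# Spark installation directory.
export SPARK_HOME=/path/to/spark
cd python
./run-tests.sh
```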

Spark version compatibility

This project is compatible with Spark 2.4+. However, significant speed improvements have been made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest Spark version.
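When adding GraphFrames to a Spark session, the package coordinate must match the Spark and Scala versions of your installation; mismatched binaries typically surface at runtime as java.lang.NoSuchMethodError. A hedged example invocation (the version string is illustrative; check the releases page for the artifact matching your setup):

```shell
# Illustrative only: pick the artifact built for your Spark/Scala version.
# "0.8.2-spark3.0-s_2.12" means GraphFrames 0.8.2 built for Spark 3.0
# and Scala 2.12.
pyspark --packages graphframes:graphframes:0.8.2-spark3.0-s_2.12
```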

Contributing

GraphFrames is a collaborative effort among UC Berkeley, MIT, and Databricks. We welcome open source contributions as well!

Releases:

See the release notes.


graphframes's Issues

Py4JJavaError: An error occurred while calling o57.find.

I am using graphframes:graphframes:0.1.0-spark1.6 from the PySpark interface with a current master build of Spark. I get the following error when trying to use g.find and other functions, as shown in the example notebook at: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-python.html

In [16]: motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-16-ac1d920bb1a7> in <module>()
----> 1 motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in find(self, pattern)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o57.find.
: java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
        at org.graphframes.GraphFrame.findSimple(GraphFrame.scala:370)
        at org.graphframes.GraphFrame.find(GraphFrame.scala:263)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

In [17]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:paths = g.find("(a)-[e]->(b)")\
:  .filter("e.relationship = 'follow'")\
:  .filter("a.age < b.age")
:# The `paths` variable contains the vertex information, which we can extract:
:e2 = paths.select("e.src", "e.dst", "e.relationship")
:
:# In Spark 1.5+, the user may simplify the previous call to:
:# val e2 = paths.select("e.*")
:
:# Construct the subgraph
:g2 = GraphFrame(g.vertices, e2)
:--
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-17-7411759ff966> in <module>()
----> 1 paths = g.find("(a)-[e]->(b)")  .filter("e.relationship = 'follow'")  .filter("a.age < b.age")
      2 # The `paths` variable contains the vertex information, which we can extract:
      3 e2 = paths.select("e.src", "e.dst", "e.relationship")
      4 
      5 # In Spark 1.5+, the user may simplify the previous call to:

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in find(self, pattern)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o57.find.
: java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
        at org.graphframes.GraphFrame.findSimple(GraphFrame.scala:370)
        at org.graphframes.GraphFrame.find(GraphFrame.scala:263)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

Also, when using the bfs, connectedComponents, labelPropagation, triangleCount, shortestPaths, pageRank, and stronglyConnectedComponents functions, I get the following errors about methods not being found.

In [18]: paths = g.bfs("name = 'Esther'", "age < 32")
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-18-90070d1d699a> in <module>()
----> 1 paths = g.bfs("name = 'Esther'", "age < 32")

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in bfs(self, fromExpr, toExpr, edgeFilter, maxPathLength)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o101.run.
: java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
        at org.graphframes.GraphFrame.findSimple(GraphFrame.scala:370)
        at org.graphframes.GraphFrame.find(GraphFrame.scala:263)
        at org.graphframes.lib.BFS$.org$graphframes$lib$BFS$$run(BFS.scala:159)
        at org.graphframes.lib.BFS.run(BFS.scala:126)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


In [19]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:filteredPaths = g.bfs(
:  fromExpr = "name = 'Esther'",
:  toExpr = "age < 32",
:  edgeFilter = "relationship != 'friend'",
:  maxPathLength = 3)
:--
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-19-9a217d29ca2a> in <module>()
      3   toExpr = "age < 32",
      4   edgeFilter = "relationship != 'friend'",
----> 5   maxPathLength = 3)

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in bfs(self, fromExpr, toExpr, edgeFilter, maxPathLength)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o122.run.
: java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
        at org.graphframes.GraphFrame.findSimple(GraphFrame.scala:370)
        at org.graphframes.GraphFrame.find(GraphFrame.scala:263)
        at org.graphframes.lib.BFS$.org$graphframes$lib$BFS$$run(BFS.scala:159)
        at org.graphframes.lib.BFS.run(BFS.scala:126)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


In [20]: result = g.connectedComponents()
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-20-7eb76cabdc93> in <module>()
----> 1 result = g.connectedComponents()

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in connectedComponents(self)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o141.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrame.map(Lscala/Function1;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/RDD;
        at org.graphframes.GraphFrame.toGraphX(GraphFrame.scala:136)
        at org.graphframes.GraphFrame.cachedGraphX$lzycompute(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedGraphX(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedTopologyGraphX$lzycompute(GraphFrame.scala:432)
        at org.graphframes.GraphFrame.cachedTopologyGraphX(GraphFrame.scala:431)
        at org.graphframes.lib.ConnectedComponents$.run(ConnectedComponents.scala:50)
        at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:43)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


In [21]: g.vertices
Out[21]: DataFrame[id: string, name: string, age: bigint]

In [22]: g.vertices.show()
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
|  d|  David| 29|
|  e| Esther| 32|
|  f|  Fanny| 36|
|  g|  Gabby| 60|
+---+-------+---+


In [23]: result = g.stronglyConnectedComponents(maxIter=10)
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-23-9cbad8f66c11> in <module>()
----> 1 result = g.stronglyConnectedComponents(maxIter=10)

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in stronglyConnectedComponents(self, maxIter)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o163.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrame.map(Lscala/Function1;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/RDD;
        at org.graphframes.GraphFrame.toGraphX(GraphFrame.scala:136)
        at org.graphframes.GraphFrame.cachedGraphX$lzycompute(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedGraphX(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedTopologyGraphX$lzycompute(GraphFrame.scala:432)
        at org.graphframes.GraphFrame.cachedTopologyGraphX(GraphFrame.scala:431)
        at org.graphframes.lib.StronglyConnectedComponents$.org$graphframes$lib$StronglyConnectedComponents$$run(StronglyConnectedComponents.scala:52)
        at org.graphframes.lib.StronglyConnectedComponents.run(StronglyConnectedComponents.scala:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


In [24]: result = g.labelPropagation(maxIter=5)
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-24-d841648f0e34> in <module>()
----> 1 result = g.labelPropagation(maxIter=5)

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in labelPropagation(self, maxIter)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o185.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrame.map(Lscala/Function1;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/RDD;
        at org.graphframes.GraphFrame.toGraphX(GraphFrame.scala:136)
        at org.graphframes.GraphFrame.cachedGraphX$lzycompute(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedGraphX(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedTopologyGraphX$lzycompute(GraphFrame.scala:432)
        at org.graphframes.GraphFrame.cachedTopologyGraphX(GraphFrame.scala:431)
        at org.graphframes.lib.LabelPropagation$.org$graphframes$lib$LabelPropagation$$run(LabelPropagation.scala:63)
        at org.graphframes.lib.LabelPropagation.run(LabelPropagation.scala:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


In [25]: results = g.pageRank(resetProbability=0.15, tol=0.01)
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-25-7ba4099c0dbc> in <module>()
----> 1 results = g.pageRank(resetProbability=0.15, tol=0.01)

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in pageRank(self, resetProbability, sourceId, maxIter, tol)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o208.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrame.map(Lscala/Function1;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/RDD;
        at org.graphframes.GraphFrame.toGraphX(GraphFrame.scala:136)
        at org.graphframes.GraphFrame.cachedGraphX$lzycompute(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedGraphX(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedTopologyGraphX$lzycompute(GraphFrame.scala:432)
        at org.graphframes.GraphFrame.cachedTopologyGraphX(GraphFrame.scala:431)
        at org.graphframes.lib.PageRank$.runUntilConvergence(PageRank.scala:153)
        at org.graphframes.lib.PageRank.run(PageRank.scala:102)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


In [26]: results = g.shortestPaths(landmarks=["a", "d"])
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-26-05cc91bf2d89> in <module>()
----> 1 results = g.shortestPaths(landmarks=["a", "d"])

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in shortestPaths(self, landmarks)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o231.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrame.map(Lscala/Function1;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/RDD;
        at org.graphframes.GraphFrame.toGraphX(GraphFrame.scala:136)
        at org.graphframes.GraphFrame.cachedGraphX$lzycompute(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedGraphX(GraphFrame.scala:438)
        at org.graphframes.GraphFrame.cachedTopologyGraphX$lzycompute(GraphFrame.scala:432)
        at org.graphframes.GraphFrame.cachedTopologyGraphX(GraphFrame.scala:431)
        at org.graphframes.lib.ShortestPaths$.org$graphframes$lib$ShortestPaths$$run(ShortestPaths.scala:69)
        at org.graphframes.lib.ShortestPaths.run(ShortestPaths.scala:59)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


In [27]: results = g.triangleCount()
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-27-8e965378aa62> in <module>()
----> 1 results = g.triangleCount()

/content/tmp/spark-e344f1b3-a40f-488a-9cef-57049b7b3a04/userFiles-54d9528e-17fa-4a53-907e-dc4eca1da328/graphframes_graphframes-0.1.0-spark1.6.jar/graphframes/graphframe.pyc in triangleCount(self)

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    833         answer = self.gateway_client.send_command(command)
    834         return_value = get_return_value(
--> 835             answer, self.gateway_client, self.target_id, self.name)
    836 
    837         for temp_arg in temp_args:

/content/SOFTWARE/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/content/SOFTWARE/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    308                 raise Py4JJavaError(
    309                     "An error occurred while calling {0}{1}{2}.\n".
--> 310                     format(target_id, ".", name), value)
    311             else:
    312                 raise Py4JError(

Py4JJavaError: An error occurred while calling o252.run.
: java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
        at org.graphframes.GraphFrame.findSimple(GraphFrame.scala:370)
        at org.graphframes.GraphFrame.find(GraphFrame.scala:263)
        at org.graphframes.lib.TriangleCount$.org$graphframes$lib$TriangleCount$$run(TriangleCount.scala:58)
        at org.graphframes.lib.TriangleCount.run(TriangleCount.scala:39)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:290)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)


Copy general graph docs from GraphX

This issue is for copying text from GraphX's user guide to GraphFrames' user guide for graphs in general. There is a separate issue for copying text related to specific algorithms.

Create a simple Java example in the documentation

The documentation currently only provides Scala and Python examples, but I can't find any Java examples online.

For instance these two links work fine:
http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-scala.html
http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-python.html

But what I assume should be the Java examples gives a 403 Forbidden response:
http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-java.html

When I attempt to create a GraphFrame instance in Java the compiler tells me the constructor has only 'private access':
DataFrame v = sqlContext.createDataFrame(list1, schema1);
DataFrame e = sqlContext.createDataFrame(list2, schema2);
GraphFrame g = new GraphFrame(v, e);

invalid dependency when converting from Graphx

scala version 2.11.7
spark-2.0.0-bin-hadoop2.7
graphframes-0.2.0-spark2.0-s_2.11.jar

$ ./bin/spark-shell --master local[4] --jars /Downloads/graphframes-0.2.0-spark2.0-s_2.11.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/14 09:29:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/14 09:29:38 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://10.0.1.26:4040
Spark context available as 'sc' (master = local[4], app id = local-1473870577895).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.graphframes._
import org.graphframes._

scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._

scala> import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.graphx.util.GraphGenerators

scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> val myVertices = sc.makeRDD(Array((1L, "Ann"),(2L, "Bill"),(3L, "Charles"),(4L, "Diane"),(5L, "Went to gym this morning")))
myVertices: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[0] at makeRDD at <console>:32

scala> val myEdges = sc.makeRDD(Array(Edge(1L, 2L, "is-friends-with"),Edge(2L, 3L, "is-friends-with"),Edge(3L, 4L, "is-friends-with"),Edge(4L, 5L, "Likes-status"),Edge(3L, 5L, "Wrote-status")))
myEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[1] at makeRDD at <console>:32

scala> val myGraph = Graph(myVertices, myEdges)
myGraph: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@38f3dbbf

scala> val gf = GraphFrame.fromGraphX(myGraph)
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term typesafe in package com,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term scalalogging in value com.typesafe,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.typesafe.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access type LazyLogging in value com.slf4j,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.slf4j.

Graph Partitioning

How can a graph be partitioned in GraphFrames, similar to the partitionBy feature in GraphX? Can we use the DataFrame repartition feature in 1.6 to provide graph partitioning in GraphFrames?

wrap power iteration clustering

MLlib implements power iteration clustering. We can add a wrapper in GraphFrames. For example, g.powerIterationClustering.k(10).maxIter(5).run() returns a vertex DataFrame with cluster assignments. Note that we fixed a bug in PIC recently. So we might need to copy the implementation from Spark master before the next Spark release, as we did for PageRank.

Will GraphFrames become part of Spark?

DataFrames are becoming increasingly central to Spark, so it raises the question: Will GraphFrames become part of the main Spark project, alongside GraphX, or will it continue as a separate library?

I think a line in the README commenting on this would be helpful.

(another?) invalid dependency

Environment

  • scala 2.11.8
  • spark-core 2.0.0
  • spark-sql 2.11
  • graphframes 0.2.0-spark2.0-s_2.11

In addition to the logging dependency error in #109, we're getting

Error:scalac: missing or invalid dependency detected while loading class file 'GraphFrame.class'.
Could not access type DataFrame in package org.apache.spark.sql.package,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'GraphFrame.class' was compiled against an incompatible version of org.apache.spark.sql.package.

Any ideas? It appears that our DataFrame class is in org.apache.spark.sql, not org.apache.spark.sql.package (which doesn't appear to exist).

Copy algorithm docs from GraphX

Current state: We copied the Scala docstrings from GraphX. We did not copy text from the user guide.

This issue is for copying text from the user guide for the standard library of graph algorithms.

Api for graph operations is inconsistent with PageRank

The PageRank algorithm returns a graph, while the other similar graph operations (degrees, connected components, etc.) all return a DataFrame with two columns: the vertices and the corresponding statistic for each vertex (i.e., its component, degree, etc.).

It seems odd that similar algorithms don't have a consistent return type. Is there a reason for this behavior?

SVD++ should support source DataFrame of other column types

It seems GraphFrames SVD++ fails if edge columns are not exactly of types "long", "long", "double" (a scala.MatchError in pattern matching on this line).

Perhaps we could support other variations, such as:

  • "long", "long", "float"
  • "int", "int", "float"
  • "int", "int", "double"

Clean up SVDPlusPlus API

We should make SVDPlusPlus easier to use. This means improving the API:

  • parameter names
  • returned model parameters
  • making predictions with the learned model

We should also improve the documentation.

Subgraph selection helper method

It would be nice to have a helper method for selecting subgraphs. I'm imagining something using the stdlib API:

def selectVertexSubgraph(expr): GraphFrame
def selectEdgeSubgraph(expr): GraphFrame
def selectSubgraph(vertexExpr, edgeExpr): GraphFrame

where expr is a Column or String for filtering.

These methods should ensure consistency between the resulting vertex and edge DataFrames. E.g., if a subset of vertices is selected, then any edges connected to a dropped vertex should be dropped.

When selecting subsets of edges, it might also be nice to have options for choosing what to do with the vertices: Drop any vertices not connected to a selected edge, or keep all vertices?
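The intended semantics can be sketched in plain Python (dicts standing in for DataFrame rows; `select_vertex_subgraph` is the hypothetical helper proposed above, not an existing API):

```python
def select_vertex_subgraph(vertices, edges, keep):
    """Keep vertices matching `keep`; drop edges touching a dropped vertex."""
    kept = [v for v in vertices if keep(v)]
    ids = {v["id"] for v in kept}
    consistent_edges = [e for e in edges if e["src"] in ids and e["dst"] in ids]
    return kept, consistent_edges

vertices = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
edges = [{"src": "a", "dst": "b"}, {"src": "b", "dst": "c"}]

# Dropping vertex "c" must also drop the edge b -> c.
vs, es = select_vertex_subgraph(vertices, edges, lambda v: v["id"] != "c")
```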

Scala documentation out of date

The Scala example seems a little out-of-date, as it calls the 'numIter()' method on PageRank, when in fact that method seems to be now called 'maxIter()', i.e. change this:

val results = g.pageRank.resetProbability(0.01).numIter(20).run()

...to this:

val results = g.pageRank.resetProbability(0.01).maxIter(20).run()

Also, when running this example by cutting and pasting into 'spark-shell', I needed to manually import the GraphFrame class before instantiating 'g', i.e.:

import org.graphframes.GraphFrame

Get all neighbors

Is there any method similar to degrees(), accessible from the Python API, that returns not only the number of neighboring edges but also their ids?

Thank you!


Number of Connected Components

Hello,

how large can a GraphFrame be in order to calculate its connected components? I need to scale to thousands of edges on very powerful machines.

Thanks

SLF4J Logging Error

Hi,
I tried creating a GraphFrame, but it gave me this NoClassDefFoundError:

java.lang.NoClassDefFoundError: com/typesafe/scalalogging/slf4j/LazyLogging

I am using GraphFrames 0.2.0 for Spark 2.0 and Scala 2.11. Any ideas what could be causing this? Thanks!

release graphframes for scala 2.11

Currently graphframes is only released for Scala 2.10

It would be cool if, like for the other Spark components, there could also be a release for Scala 2.11.

Thanks a lot.

some problems when using graphframe API find()

When I use the GraphFrames find API, I run into a problem. For example, given a GraphFrame g, var motif = g.find("(a)-[e1]->(b); (a)-[e2]->(d); (c)-[e3]->(b); (c)-[e4]->(d)") returns a DataFrame as a result. But in the result, vertex a and vertex c may be the same vertex. If I want all differently named vertices to be distinct vertices, is there any method I can use other than motif = motif.filter("XXX")?
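Short of a built-in option, distinctness can be enforced by filtering out rows in which two differently named vertices resolve to the same vertex. A plain-Python sketch of that post-filter (the rows are hypothetical stand-ins for the motif result; on a real DataFrame this would be filters such as a.id != c.id for each pair of names):

```python
rows = [
    {"a": "v1", "b": "v2", "c": "v1", "d": "v3"},  # a and c are the same vertex
    {"a": "v1", "b": "v2", "c": "v4", "d": "v3"},  # all four are distinct
]

def all_distinct(row, names=("a", "b", "c", "d")):
    # Keep the row only if every named vertex binds to a different id.
    ids = [row[n] for n in names]
    return len(ids) == len(set(ids))

distinct_rows = [r for r in rows if all_distinct(r)]
```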

Consistency checks upon construction

When GraphFrames are constructed, we do not check that the vertices DataFrame contains all vertices from the edges DataFrame. We should do that somehow (ideally lazily).

This will likely be a problem for subgraph selection.

I'll create a separate issue for better subgraph selection methods.
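Such a check amounts to verifying that every edge endpoint appears among the vertex ids. A plain-Python sketch of the invariant (sets standing in for the two DataFrames; a lazy DataFrame version would be an anti-join):

```python
vertex_ids = {"a", "b", "c"}
edges = [("a", "b"), ("b", "c"), ("c", "x")]  # "x" has no matching vertex row

# Endpoints referenced by edges but missing from the vertices DataFrame.
dangling = {v for src, dst in edges for v in (src, dst)} - vertex_ids
```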

Graphframes => Apache Spark 2.0 Compatible

I'm playing around with getting this set up. It seems like there are some pretty significant code changes required, not just removing .map calls on DataFrames.

Questions:
[ ] What framework should we use for logging internally since org.apache.spark.Logging no longer exists?

Tasks:
[ ] Fix LogInfo
[ ] Fix Logging
[ ] Fix .map calls
[ ] Fix callUDF calls.

How to use graphframes in a Jupyter notebook by referencing graphframes.jar

I'd like to use it locally in a Jupyter notebook. I've downloaded graphframes.jar and created a PYSPARK_SUBMIT_ARGS variable that references the jar.
The import from graphframes import * works, but it fails on the call g = GraphFrame(v, e) with:
Py4JJavaError: An error occurred while calling o57.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI

Operating system: Windows
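For comparison, a common way to wire this up is to let Spark resolve the package (including its Python half) rather than pointing at a bare jar; the version coordinate below is illustrative, not prescriptive:

```shell
export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
jupyter notebook
```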

Heterogeneous vertices?

val v = sqlContext.createDataFrame(List(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30)
)).toDF("id", "name", "age")

What if I wanted to say, for example, that these people work at a company? Say we didn't care about the "age" of the company, just who the employees are.

So how would this data be added as a Vertex, or is it even possible?

("companyA", "Foobar, Inc.")

The edge is straightforward

("a", "companyA", "works_at")

To clarify, this isn't possible because the company has no "age", so the schema can't be applied:

val v = sqlContext.createDataFrame(List(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("companyA", "FooBar, Inc.")
)).toDF("id", "name", "age")

SLF4J error

I am running Scala 2.10.4 with Spark 1.5.0-cdh5.5.2 and am getting the following error when running a GraphFrames job:

> val g = GraphFrame(v, e)
error: bad symbolic reference. A signature in Logging.class refers to type LazyLogging
in package com.typesafe.scalalogging.slf4j which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling Logging.class.

I am starting my spark-shell with the following command:

spark-shell --jars /data/spark-jars/scalalogging-slf4j_2.10-1.1.0.jar,/data/spark-jars/graphframes-0.2.0-spark1.5-s_2.10.jar

I have tried different versions of scalalogging, but nothing seems to work.

Thanks for the help.

log4j.properties file ignored

When GraphFrames jar/package is loaded, Spark will ignore log4j.properties in SPARK_CONF_DIR. Perhaps src/main/resources/log4j.properties is overriding it?

@thunterdb seems to be working on a solution in #55.

Sort columns from motif finding

Motif finding outputs columns in an arbitrary order, but we should sort the columns to match the order of vertices and edges specified in the motif. That way, if a user writes "(a)-[e]->(b); (b)-[e2]->(c)", then the output columns will be ordered as expected: a, e, b, e2, c.
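The desired ordering can be recovered from the motif string itself: take element names in order of first appearance. A plain-Python sketch (a hypothetical helper for illustration, not the GraphFrames implementation):

```python
import re

def motif_column_order(motif):
    # Names appear as (a) for vertices and [e] for edges; keep first-seen order.
    names = re.findall(r"[(\[]([A-Za-z_]\w*)[)\]]", motif)
    seen, order = set(), []
    for n in names:
        if n not in seen:
            seen.add(n)
            order.append(n)
    return order

motif_column_order("(a)-[e]->(b); (b)-[e2]->(c)")  # ['a', 'e', 'b', 'e2', 'c']
```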

connectedComponents() raises lots of warnings that say "block locks were not released by TID = ..."

Trying to run a simple connectedComponents() analysis on an example dataset, even the one from the quick start, yields a flurry of warnings (several dozen?) like this:

16/10/17 22:45:40 WARN Executor: 1 block locks were not released by TID = 358:
[rdd_95_5]
16/10/17 22:45:40 WARN Executor: 1 block locks were not released by TID = 353:
[rdd_95_0]
16/10/17 22:45:40 WARN Executor: 1 block locks were not released by TID = 359:
[rdd_95_6]
...

And this is for a graph with literally 3-4 vertices and edges.

Is this an issue? Would it cause performance issues at scale? (Here's a related question on Stack Overflow.)

I'm running Python 2.7, Spark 2.0.1, and GraphFrames 0.2.

Clarify special columns in API docs

We should make it clear what happens when a vertex or edge DataFrame contains a column with a special name. E.g., what happens when you call PageRank on a graph whose vertices already have a "pagerank" column? We currently throw an error, but we should be more explicit about this, both in runtime checks in the code and in the API docs.
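The runtime side could be a simple guard that rejects a clashing output column before any work is done. A plain-Python sketch of the idea (the column list stands in for the vertex schema, with "pagerank" as the example reserved name):

```python
def check_output_column(columns, output_col):
    # Fail fast if the algorithm's output column already exists on the vertices.
    if output_col in columns:
        raise ValueError(
            "column '%s' already exists on the vertices; "
            "rename or drop it before running" % output_col
        )

check_output_column(["id", "name"], "pagerank")  # fine: no clash
```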

More scalable connected components implementation

There have been many reports of the connected components algorithm in GraphX and GraphFrames not scaling. @mengxr has a prototype of a better algorithm. This issue is for tracking adding it to GraphFrames master.

Subtasks:

  • Implementation: #119
  • Improved unit tests: #121
  • GraphX legacy support: #122
  • Python API: #123
  • (optional) checkpoint interval param: #124
  • handle skewness in assigning long IDs

How to use is with python3?

When I run this import in pyspark with Python 3.5.1, it errors:

from graphframes import *
ZipImportError: can't find module 'graphframes'

but it succeeds in pyspark with Python 2.7.1.

Motif resulting in empty DataFrame

Hi,
I am trying to use the motif functionality within GraphFrames. I tried a bunch of different motifs, including those in the user guide, on a variety of different GraphFrames, but to no avail. I even tried the simple motif "(a)-[e]->(b)", but the resulting DataFrame was empty. If I understand this correctly, it should return all a, e, and b such that a has an edge e to b. This is the command I used.

g.find("(a)-[e]->(b)")

Thanks!
