SparkGIS

SparkGIS adds GIS functionality to SparkSQL through:

  • a user-defined type (UDT): GeometryType
  • a class representing values of such type: Geometry
  • a set of user-defined functions (UDFs) and predicates operating on one or two Geometry values

Creating Geometry values

To create values, use factory methods in the Geometry object:

  import Geometry.WGS84
  val pt = Geometry.point(12.5, 14.6)
  val ln = Geometry.line((0,0), (20,50))
  val collection = Geometry.collection(pt, ln)

Each factory method has two argument lists:

  • the first one is the set of 2D coordinates describing the geometry (a single coordinate pair, several pairs, or several sequences of pairs, depending on the geometry type)
  • the second one is the coordinate reference system id; an implicit value Geometry.WGS84 is provided for this
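
The two-argument-list pattern can be illustrated without SparkGIS at all. The following is a minimal sketch in plain Scala: Srid, Point, and the point method are hypothetical stand-ins rather than SparkGIS classes, and the numeric ids are EPSG codes chosen for illustration (the value SparkGIS actually uses for WGS84 is an assumption here).

```scala
// Hypothetical stand-ins for illustration only; these are not SparkGIS classes.
object ImplicitSridSketch {
  final case class Srid(id: Int)
  final case class Point(x: Double, y: Double, srid: Srid)

  // Default reference system, resolved implicitly when it is in scope.
  // 4326 is the EPSG code for WGS84 (assumed, not taken from SparkGIS).
  implicit val WGS84: Srid = Srid(4326)

  // First argument list: the coordinates.
  // Second argument list: the implicit coordinate reference system id.
  def point(x: Double, y: Double)(implicit srid: Srid): Point =
    Point(x, y, srid)

  // The implicit WGS84 value fills the second argument list.
  val p1: Point = point(12.5, 14.6)

  // The second argument list can still be passed explicitly.
  val p2: Point = point(12.5, 14.6)(Srid(3857))
}
```

Keeping the reference system in its own implicit argument list means the common case stays short, while a different id can be supplied per call when needed.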

You can also create Geometry values from WKB (Well Known Binary), WKT (Well Known Text), and GeoJSON formats:

  val mp = Geometry.fromString("MULTIPOINT ((1 1), (2 2))")
  val ml = Geometry.fromGeoJSON("{\"type\":\"MultiLineString\",\"coordinates\":[[[12,13],[15,20]],[[7,9],[11,17]]]}")

Defining table schemas

Use GeometryType.Instance as the data type of a field:

  val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("geo", GeometryType.Instance)
  ))

Creating RDDs

GeometryType can produce Geometry values from any supported serialization format (WKB, WKT, GeoJSON) as well as from schema-less JSON RDDs. Load your data and apply the schema as shown below:

  // using GeoJSON
  val data = Seq(
    "{\"id\":1,\"geo\":{\"type\":\"Point\",\"coordinates\":[1,1]}}",
    "{\"id\":2,\"geo\":{\"type\":\"LineString\",\"coordinates\":[[12,13],[15,20]]}}",
    "{\"id\":3,\"geo\":{\"type\":\"MultiLineString\",\"coordinates\":[[[12,13],[15,20]],[[7,9],[11,17]]]}}",
    ...
  )
  val rdd = sc.parallelize(data)
  val df = sqlContext.jsonRDD(rdd, schema)

  // or, alternatively, from rows of Geometry values built in code
  val data = Seq(
    Row(1, Geometry.point(1,1)),
    Row(2, Geometry.fromString("MULTIPOINT ((1 1), (2 2))")),
    ...
  )
  val rdd = sc.parallelize(data)
  val df = sqlContext.createDataFrame(rdd, schema)

Using functions

Each function is defined as a method of the Functions object and can be called directly from Scala code. The functions can also be registered in the SQLContext and used inside SparkSQL queries:

  Functions.register(sqlContext)
  df.registerTempTable("features")
  val result = sqlContext.sql("SELECT ST_Length(geo) FROM features")

Using geometry methods

The GIS functions are aliases for methods of the class hierarchy rooted at GisGeometry. This hierarchy wraps GeoTools classes to provide a consistent interface: for example, methods return Option values instead of magic values, which also models the absence of a property for some geometry types.
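
To make the "options instead of magic values" idea concrete, here is a minimal sketch in plain Scala. Shape, Line, and Point are hypothetical types invented for this illustration and do not reflect the actual GisGeometry hierarchy; the only point is that a property that does not apply to a geometry type comes back as None.

```scala
// Hypothetical types for illustration only; not the GisGeometry hierarchy.
object OptionSketch {
  sealed trait Shape {
    // None means "this property does not exist for this geometry".
    def startPoint: Option[(Double, Double)]
  }

  final case class Line(points: Seq[(Double, Double)]) extends Shape {
    // A line starts at its first vertex; an empty line has no start point.
    def startPoint: Option[(Double, Double)] = points.headOption
  }

  final case class Point(x: Double, y: Double) extends Shape {
    // A single point has no start point in this sketch.
    def startPoint: Option[(Double, Double)] = None
  }

  // Callers must handle both cases explicitly; no null or sentinel checks.
  def describe(s: Shape): String = s.startPoint match {
    case Some((x, y)) => s"starts at ($x, $y)"
    case None         => "no start point"
  }
}
```

Because the return type is an Option, the compiler pushes callers to handle the missing case, which a null or a sentinel coordinate would not do.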

An instance of GisGeometry is wrapped by the SparkSQL Geometry type; the easiest way to access it and invoke its methods is by importing Geometry.ImplicitConversions:

  import Geometry.ImplicitConversions._

  val l = Geometry.line((10.0,10.0), ...)
  if (!l.isEmpty) {
    val p: Option[Geometry] = l.startPoint
    ...
  }

Some methods are also aliased as operators by the GeometryOperators implicit class:

  import GeometryOperators._
  import Geometry.ImplicitConversions._

  val l1 = Geometry.line((10.0,10.0), (20.0,20.0), (10.0,30.0))
  val l2 = Geometry.line((20.0,20.0), (30.0,30.0), (40.0,40.0))
  if ((l1 <-> l2) < 50.0) { // distance less than 50
    ...
  }
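
For readers unfamiliar with the mechanism, the sketch below shows how an implicit class can add an operator alias to an existing type in plain Scala. Pt and PtOperators are hypothetical names for this illustration; how GeometryOperators is actually implemented is not shown in this document.

```scala
// Hypothetical types for illustration only; not the SparkGIS implementation.
object OperatorSketch {
  final case class Pt(x: Double, y: Double) {
    // Euclidean distance between two points.
    def distance(other: Pt): Double =
      math.hypot(other.x - x, other.y - y)
  }

  // The implicit class makes `<->` available on any Pt in scope,
  // as an alias for the `distance` method.
  implicit class PtOperators(self: Pt) {
    def <->(other: Pt): Double = self.distance(other)
  }

  val d: Double = Pt(0.0, 0.0) <-> Pt(3.0, 4.0)

  // Parentheses keep the comparison readable, as in the example above.
  val near: Boolean = (Pt(0.0, 0.0) <-> Pt(3.0, 4.0)) < 50.0
}
```

The operator only exists where the implicit class is imported, so code that prefers the named method is unaffected.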

Build, test and doc

The project uses Maven as its build system. If you are not familiar with it, install Maven 3, cd into your SparkGIS directory, and run:

  mvn package -DskipTests
  mvn test
  mvn scala:doc

After these commands, the jar is under the target directory, all available tests have run, and the generated documentation is under target/site/scaladocs.

Credits

The Geometry value class is written on top of the GeoTools library.

The UDFs aim to adhere to the OGC Simple Feature Access specification. The names and documentation of the GIS functions have been copied from PostGIS.

Remarks

Using GeometryType with jsonRDD requires Spark >= 1.4.

Changelog

0.3.0

  • Abandoned ESRI Geometry library in favor of GeoTools
  • Moved to Scala 2.11
  • Moved to Spark 1.6
  • Added ST_Perimeter
  • Added tolerance argument to ST_Simplify
  • SRID in factory methods is now an implicit argument

Contributors

drubbo, estoianovici