GithubHelp home page GithubHelp logo

fagan2888 / geopandas_opt Goto Github PK

View Code? Open in Web Editor NEW

This project forked from gboeing/geopandas_opt

0.0 0.0 0.0 7.96 MB

Example demonstrating current slowdowns in Geopandas

Python 10.35% Jupyter Notebook 89.65%

geopandas_opt's Introduction

Improving spatial operations

In this doc I will be using as reference a generic ~30k row csv. All that is relevant is that there is a geometry column and any number of additional columns containing floats. Aggregations such as sums, means, etc. are performed on these columns after a spatial operation has been performed to return a subset of the data frame. See the example geometries csv as well as the run.py (Python3) file to see a stripped down example dataset and operation logic.

Current situation

All geospatial operations in Geopandas are costly, this includes but is not limited to: unary unions, intersections, within booleans, distance calculations. The underlying functions that Geopandas is calling are those from GEOS, which is itself a C++ port of the Java Topo Suite. All of this is to say, it is fast and reliable. Parts of PostGIS work with GEOS, too.

So what is causing the slowdown? I can think of two possibilities right now:

  1. The bottleneck here is Python itself.
  2. There are optimizations that should be implemented that I've failed to identify.

Example

In the included example Pythons script, I load in a dataset containing geometries and a value column (named “value”). The goal is to generate a sum of all values within a certain buffer (in this case, 500 units) of the reference geometry. This operation is to be performed on each row’s geometry.

Presently, the operation performs two tasks intended to speed up the operation. First, it creates a spatial index from the GeoDataFrame based on its geometry column. Second, it preprocesses a buffer column.

The summarize operation applied to each row performs a comparative buffer (buffering the reference geometry and getting all intersections to that, and then doing the inverse and buffering the other geometries and seeing if the normal geometry intersects in that situation). This methodology replicates the same for ST_DWithin from PostGIS (reference in their repo).

Results

Running run.py will result in the following output (time in seconds before the colon):

$ python3 run.py
0.244s: loaded in reference csv

1.787s: updated geom column to MultiPolygon type

0.008s: converted dataframe to geodataframe

3.345s: created spatial indices

5.472s: Created buffered geometry column

563.553s: completed summarize() application on geodataframe

          value
0  10319.032630
1     32.595417
2      3.991619
3     14.131560
4     32.533223

Note: Times will vary on each run through, naturally.

The apply operation takes a long time; about 9.4 minutes. That's nearly 0.2 seconds per row, which seems terribly slow. You can imagine how a series of spatial operations can quickly amount to cumbersome execution times.

Question

What, if any, additional optimizations could be used to make geometric operations faster in Python? For example, I've found additional prepartion steps that could potentially be used, such as shapely.prepared() (and asked about them on SO), but from what I can tell they provide no substantive improvements over what is already achieved via the spatial index.

A Final Note

Tom Augspurger makes a strong case for vectorization and the utilization of Numpy list comprehension in lieu of the costly .apply() operation. I'm very open to exploring how to operationalize this concept with large datasets but have concerns over its feasibility with larger datasets. Our example is similar to his "origin-destination" matrix calculation. With this methodology, we would need to create a subindex of each of the 30k geometries for each "origin" index. Dealing with a matrix that is (n rows)^2 long becomes unreasonable for in-memory operations and brings into question how "really" operational the vector methodology is.

geopandas_opt's People

Contributors

gboeing avatar kuanb avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.