Polygon Index / Linestring Index for Fast Spatial Aggregations
Relating points to polygons and points to lines at 1 million points / second.
Geospatial problems are hard, especially when you have to relate a set of geometries (lines or polygons) to a substantial set of points; traditional methodologies are essentially infeasible beyond 100k+ points without losing huge amounts of computing time. The solution is pretty simple: remove geometry entirely from the operation. Like many things in CS, this trades the precomputation time of the indexes for the performance boost of using said indexes.
An example webmap showing the kind of output it creates and how it works can be found here.
Pictures
The picture above gives some insight into how the lines algorithm works internally.
What is ult?
Ult solves traditional geospatial problems in a way that relies heavily on the fastest localization method I'm aware of for 2-d objects: geohashing.
Geohashing is a way of taking a 2-d object (lat/long) and turning it into a 1-d string that, importantly, conveys hierarchy. Ult uses the following geohash repo, a Python package written in C++, that can geohash approximately 400k points / second, and geospatially relates a point to one or more specific objects, either lines or polygons.
While the geohashing process itself may be slowish, this type of localization allows multiple indexes that use geohashing to work together seamlessly, and as fast as possible.
In other words, assuming no memory constraints, any ultindex applied to a set of points should run at about the same speed no matter how many polygons or lines are within the index. For example, you could have an ultindex for the US states (50 polygons) and another one for zipcodes (35k polygons), and applying each of these indexes to a set of 1 million points would return results in about 1 second per ultindex. You could also apply an index for roads or other lines at basically the same speed.
What is an ultindex object?
An ultindex is an HDF5 (pandas dataframe-like) file that contains the following data structures:
- alignmentdf - a pandas dataframe that was the input used to make the ultindex; this object can easily be output to geojson
- ultindex - a flat dictionary that exploits the geohash's hierarchy, assembled slightly differently depending on whether the index represents lines or polygons
- areamask - a dictionary mask that saves bytes on the large ultindex by mapping area or line ids to a hexadecimal range, so a short hexadecimal number can be used in the heavily repeated ultindex instead of a long id or area name
- metadata - contains information about the type of ultindex being used
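To make this concrete, here's an illustrative sketch of how the pieces fit together for a polygon index (the keys and values are invented for illustration; this is not the actual file layout):

# illustrative only -- invented keys/values, not the real file layout
index = {
    'metadata': {'type': 'polygon'},      # what kind of ultindex this is
    'ultindex': {
        'dr5r': 'na',                     # 'na': areas exist deeper, keep drilling
        'dr5ru': '0001',                  # hex mask value for this geohash cell
    },
    'areamask': {'0001': 'New Jersey'},   # short hex id -> full area id / name
    # 'alignmentdf': the original input dataframe, ready for geojson output
}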
To get the ultindex into memory, simply read in the h5 file with the readh() function. The output of this function can be sent directly into a localization algorithm, which determines how the localization needs to be applied based on the metadata. It can also be used to easily output/style the data structure you're aggregating on and then write it to a geojson file.
You may be asking why you would need a wrapper function for simply reading in h5. Well, the dataframes I place in the outputs of these files are actually all JSON strings in a column value in pandas, with an index that mirrors the intended JSON structure. Again, you can definitely do this yourself; it's just more annoying, so I wrote a dedicated function to unpack it into its desired structure.
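For the curious, a minimal sketch of that packing scheme (the helper names here are my own, not ult's API): each non-dataframe structure is serialized to a JSON string and parked as a single cell of a one-row dataframe inside the h5 file.

import json
import pandas as pd

# hypothetical helpers sketching the json-in-a-column packing described above
def write_json_frame(path, key, obj):
    # serialize the structure and store it in a single dataframe cell
    pd.DataFrame({'json': [json.dumps(obj)]}).to_hdf(path, key)

def read_json_frame(path, key):
    # read the one-row dataframe back and unpack the json string
    return json.loads(pd.read_hdf(path, key)['json'].iloc[0])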
Benchmarks / Reasons to Use
Polygon Benchmarks
You can run the benchmarks on your own dataset or the one I provided here.
Polygon benchmarks can be performed here. Basically I had 3 factors to measure: how long it takes to create the index, how fast it is to get into workable memory, and how fast the operation is that we're applying to points using this object.
The benchmarks below distinguish between the two output types. The output type 'regions' splits the dictionary up into smaller sub-dictionaries instead of one combined one, which matters if your indexes exceed the amount of RAM on your personal computer. So I wrote a method that allows you to get at least some performance when your index is too large, and it's still probably better than a lot of pip packages.
Total is the total time it took to make the index and write it out to the output h5 file. The first two, make_set() and make_h5_output(), are subsets of the make_polygon_index() process. readh() is the time it takes to get the index into memory where it can be used and applied to datasets. 1M INDEXS is the time it took to complete 1 million point-in-polygon queries from randomly generated points; this assumes the points have already been geohashed, which is a completely linear process and wouldn't be conducive to accurate benchmarks if profiled within the indexing algorithm.
To map a table with lat/lng fields, simply use the function ult.map_table(data,12,map_only=True) and it will take the lat/lng fields and hash each point in the table, adding the column for you as well.
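For example, assuming a csv with LAT / LONG fields like the one used in the examples later in this document:

import pandas as pd
import ult

# table with LAT / LONG fields ('points.csv' is a stand-in name)
data = pd.read_csv('points.csv')
# hashes every point at precision 12 and adds the GEOHASH column
data = ult.map_table(data, 12, map_only=True)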
single | single_values (s) | regions | regions_values (s) |
---|---|---|---|
make_set("single.h5") | 17.6731444597 | make_set("regions.h5") | 109.75850451 |
make_h5_output("single.h5") | 0.542840123177 | make_h5_output("regions.h5") | 4.00887596607 |
total("single.h5") | 18.2159904242 | total("regions.h5") | 113.767390013 |
readh("single.h5") | 0.0348734855652 | readh("regions.h5") | 0.275012969971 |
1M INDEXS("single.h5") | 0.5987585783 | 1M INDEXS("regions.h5") | 3.49416255951 |
Line Index Benchmarks
TBA
Shapely Comparisons
This comparison was all 50 US states vs one state in shapely. I don't know how to use collection methods on geometry collections in shapely, and I don't feel like learning how to make a collection in the first place; I doubt it would get faster anyway, so I just did 1.
It's worth noting that shapely would just iterate through each polygon doing ray-casting, so it would only get slower.
The index also contains every single area given, instead of what would be a point comparison against 50 different geometries in a typical geometric comparison. Benchmarks in comparison against shapely:
points | SHAPELY (s) | ULT (s) | speedup |
---|---|---|---|
100 | 0.027808 | 0.001157 | 24.0287388881x |
10000 | 2.367857 | 0.009810 | 241.381324357x |
100000 | 24.440322 | 0.095100 | 256.996006301x |
Random Benchmarks on some Example Philly Data
This data was actually collected from a Philly bus API, so it serves as a real-world test case. I didn't have any modules that I know of that I could really benchmark this against. (sorry)
1.92039895058 seconds to add to dataframe
0.608725070953 seconds to map the points to geohashes
597722 points
The difference here comes from the fact that a dataframe has to be created to add the columns to the input table, as we return 3 fields per mapped field.
Reasons to use ult
Ult has several advantages that I think make it worth looking at.
- It's fast, faster than anything I've seen at least
- It's relatively simple in implementation but yields an index that still represents areas or lines.
- The lines output, being the lineid, the distance along the line, as well as the percentage along the line, is something I've never seen done in an algorithm like this.
- Higher-level polygon indexes can easily be related to the lowest, smallest area in the hierarchy, so that one area_index can yield 6 different polygon indexes from a hierarchy that can easily be created.
Video of me creating a polygon hierarchy - this is going to be awesome later; I still haven't figured out how to best implement it within the API.
To create a hierarchy, simply apply your lower geohash-hierarchy ultindex to your next-step-up index until you get to the end (see the sketch below). This actually has an added benefit I didn't even think of: many libraries that utilize ray-casting and then represent hierarchy to drill down find themselves in situations where they have to clip a lot of areas to complete their hierarchy correctly. That operation in and of itself is super complex and annoying; my indexes completely negate needing to do this by mapping the previous index to the next one up. In other words, I go completely backwards from traditional ray-casting hierarchy algorithms but can still yield every single output corresponding to all the higher-level geohashes at that lowest-layer hierarchy. I guess hierarchy isn't something that's required like in other algorithms; it just happens.
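A hedged sketch of one way to read this (the file names and column renames are hypothetical, not ult's API): map your points against the most granular index first, then apply each coarser ultindex in turn, keeping every level's result.

import pandas as pd
import ult

# hypothetical file names; map_table / read_h5 / area_index are as used
# elsewhere in this document
data = pd.read_csv('points.csv')
data = ult.map_table(data, 12, map_only=True)
for level, h5 in [('ZIPCODE', 'zips.h5'), ('COUNTY', 'counties.h5'), ('STATE', 'states.h5')]:
    data = ult.area_index(data, ult.read_h5(h5))
    # keep this level's result before the next index overwrites AREA
    data = data.rename(columns={'AREA': level})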
That being said, I would be remiss if I didn't tell you the biggest problem with ult currently: large dictionary objects on an 8 GB MacBook Pro. While these algorithms are sort of intended for servers where memory constrictions aren't really a thing, I still exceed my memory by the time I get to something as granular as zipcodes.
I'm currently experimenting with cache systems to cache part of the index without reading in the entire thing. Things like extrema slicing could remove a large majority of the areas if you're only working in a certain locality.
How is the ultindex created?
Polygons
The polygon index creation process is complex, but I'll try to be as explicit as possible about how it's created. make_polygon_index() accepts a dataframe that's structured in a very specific manner (see example) that accurately depicts the hierarchies within each area. I may create a repo that does this for all geospatial filetypes, but currently I only have one for kmls here, although it shouldn't be hard to replicate for geojson or any other file type. Essentially the output is a df table that contains ring hierarchy about specific areas; nothing crazy, just strictly required to create an ultindex.
So the question now is: how do you create an index for each area? This relies on a property I read in some white paper a while back, which states that an area's complexity is proportional to its perimeter, meaning we can get a specific amount of complexity or resolution for a particular area's index by requiring a minimum number of blocks/squares or geohashes within the perimeter's alignment. So by increasing our geohash precision (making smaller blocks) and then looking at the size of the geohashes that make up the alignment, we are effectively ensuring a minimum resolution for the index. Hopefully the pictures below convey the concept I'm trying to explain.
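A minimal sketch of that heuristic, assuming a geohash library with an encode(lat, lng, precision) call (e.g. python-geohash) and a made-up block threshold:

import geohash  # e.g. the python-geohash package

def pick_precision(ring_points, min_blocks=100, max_precision=9):
    # raise geohash precision until the ring's points occupy at least
    # min_blocks distinct cells, ensuring a minimum index resolution
    for precision in range(1, max_precision + 1):
        cells = set(geohash.encode(lat, lng, precision) for lat, lng in ring_points)
        if len(cells) >= min_blocks:
            return precision
    return max_precision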
So now that we have a minimum geohash size, we can get the bulk of the algorithm going: taking the top-left geohash block of the extrema and the bottom-right, and using a function I created from the repo's geohash logic, we can produce a grid of all geohashes within an extrema.
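The repo derives this grid from its own geohash logic; a simplified analogue (not the actual function) just steps through the extrema at the lat/lng size of one cell:

import geohash  # python-geohash, as above

def geohash_grid(south, west, north, east, precision):
    # the lat/lng size of one cell at this precision, from any sample hash
    cell = geohash.bbox(geohash.encode(south, west, precision))
    dlat, dlng = cell['n'] - cell['s'], cell['e'] - cell['w']
    hashes = set()
    lat = south
    while lat <= north + dlat:
        lng = west
        while lng <= east + dlng:
            hashes.add(geohash.encode(lat, lng, precision))
            lng += dlng
        lat += dlat
    return hashes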
This is where the most complex components of the process begin, so I'll just summarize: a collision algorithm is applied to remove the low-hanging fruit of geohashes that are definitely outside of the area, then ray-casting is applied to every remaining geohash to remove any geohashes that are hidden by complex geometries, yielding the minimum geohashes within the desired area. Again, a very abridged version of what's happening.
After this we can do a simple groupby, truncating the last byte off each string: if the size of that group is 32 (the size of a block hierarchy), we can add just the truncated geohash instead of all 32; if not, all geohashes are added. This is compiled into a dataframe and appended to a progress h5 file.
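In pandas terms the compaction step looks roughly like this (a sketch, not ult's internals):

import pandas as pd

def compact(geohashes):
    # group level-n geohashes by their level-(n-1) parent; a parent with all
    # 32 children present (geohash uses a base-32 alphabet) is stored as the
    # single truncated geohash instead of all 32 children
    df = pd.DataFrame({'GEOHASH': geohashes})
    out = []
    for parent, group in df.groupby(df['GEOHASH'].str[:-1]):
        if len(group) == 32:
            out.append(parent)
        else:
            out.extend(group['GEOHASH'].tolist())
    return out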
After all the areas have been compiled, the minimum and maximum geohash sizes are found and all the resultant dataframes are consolidated about the minimum size.
If this is confusing, I understand, but taking a look at the mainline function below for relating a geohash to a polygon might help.
# maps one geohash to the area it corresponds to ('' if none)
def one_polygon_index(ghash):
    global minsize
    global maxsize
    global ultindex
    global areamask

    current = minsize
    while current < maxsize:
        output = ultindex.get(ghash[:current], '')
        # logic for continuing or not
        if output == 'na':
            # areas exist deeper in the hierarchy: drill down one level
            current += 1
        elif output == '':
            # no area in the ultindex corresponds to this geohash
            return ''
        else:
            # resolve the hex mask value to the actual area id
            return areamask.get(output, '')
    # fell through the hierarchy without a match
    return ''
As you can see, a geohash is sent in; if the result is 'na', that means the geohash has areas lower in the hierarchy, and we drill down another level until either an area is found or the '' string is returned, indicating no area within the ultindex corresponds to the given geohash. Global instances of the variables are used instead of inputs because this function is mapped over a dataframe column, as sketched below.
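i.e. something along these lines (column names follow the examples later in this document):

# map the lookup over every geohashed point in the table
data['AREA'] = data['GEOHASH'].map(one_polygon_index)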
Linestring
The linestring index is a bit harder to explain: instead of the algorithm being straightforward, judgement and preference about how you would like to apply geohashes to the different linestrings have to be considered. For example, the lines index has probably been rewritten 5 times with widely varying implementations, some of which you can still see the remnants of via functions. I decided on the current iteration of the line index based on a few factors that I observed / realized.
At first I was just doing geohashes at size 8 (which is probably 120 ft) and letting the chips fall where they may for intersections. Obviously, that was lacking, so I began considering other implementations; however, I was always concerned about the size in memory blowing up using level-9 geohashes. (I still am.) This got me thinking about a system in which the intersecting size-8 geohashes are drilled down into, while non-colliding size-8s could remain that size. I also decided to use a level-8 key with a '-u' suffix to represent all the points that made up most of the line. With this little manipulation we saved space in the index and ensured all level-9 geohashes within the level-8 square would be occupied, as the sketch below illustrates.
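An illustrative sketch of the resulting key scheme (the hashes and mask values here are invented):

# invented example entries showing the shape of a line ultindex
ultindex = {
    'dr5ru1e2-u': '0001',  # uncontested level-8 cell: one '-u' key stands in
                           # for all 32 of its level-9 children
    'dr5ru1e3x': '0001',   # cells near an intersection are drilled down to
    'dr5ru1e3z': '0002',   # level 9, each resolving its own line id mask
}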
There were some other things I tried to do, like controlling the hierarchy of pure alignments and neighbor alignments (i.e. a pure alignment should always get precedence over a neighbor, obviously). None of this stuff was super hard to figure out, but when it was all compounded together it was a lot of complexity. I also got a little too zealous trying to fill in level-8 geohashes that were very close to being along the alignment but weren't currently covered by my algorithm.
The conclusion slowly became apparent after I observed a few basic things I felt to be true about GPS alignments. For one, GPS alignments are generally pretty clean; more importantly, the greater the lengths you have to go to acquire a point, the higher the likelihood it has negative value or isn't a point for the line we wanted. In other words, the harder you try to catch them all, the less valuable the points you catch become. Points we know are on alignments are much more indicative of the traffic patterns we're trying to observe than the outliers that may or may not be a part of the line we're looking at. So a focus on quality data was made, after building algorithms that contained some frankly insane dataframe operations to get the blocks the way I desired.
Another factor heavily influenced the current design of the line_index algorithm. Kyoto, a project initially intended to localize points along a line (i.e. distance along a line), worked, but it was rather slow and used a weird dictionary structure that had to be compiled just to get the distances. Considering that I was already pretty sure I was going to take a more conservative approach, I had a realization that I could derive distance directly through the geohashes now on the same index! The granularity of a level-9 geohash is something like 15 ft, more than we'd probably ever want for measuring traffic flow, so getting each geohash's distance along the alignment was entirely feasible if not logical. So the final line_index currently returns 3 fields: distance along the line, the lineid, and percentage along the line, losing little if any speed.
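A sketch of that realization under my own assumptions (none of this is ult's actual code): walk the line's geohash cells in order, accumulate great-circle distance between cell centroids, and store the running total so a later lookup is just a dict get.

from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lng1, lat2, lng2):
    # great-circle distance in miles between two lat/lng points
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 3956.0 * 2.0 * asin(sqrt(a))

def distances_along(cells):
    # cells: ordered list of (geohash, (lat, lng) centroid) along the line
    out, running, prev = {}, 0.0, None
    for cell, (lat, lng) in cells:
        if prev is not None:
            running += haversine(prev[0], prev[1], lat, lng)
        out[cell] = running
        prev = (lat, lng)
    total = running if running else 1.0
    # distance and percentage along the line fall out of the same pass
    return dict((c, (d, 100.0 * d / total)) for c, d in out.items())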
If you're wondering, the final geohash schema I came up with was absolute alignment and absolute neighbor, of course giving priority to the alignment over the neighbor. I have a feeling this won't be the last time I'll change this; things are just much simpler with 1 index. Say I had a mixed set of geohashes that couldn't give us continuous distances along the alignments: that would require another entire index just for distances.
As far as implementation and creation are concerned, the functions are mirrors of the polygon section (i.e. make_line_index and line_index(data,ultindex) to map line ids to points) and hover around the same performance as the polygon index, if not better.
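Assumed usage, mirroring the polygon examples below (the input field requirements follow the Usage Information section; all file names are hypothetical):

import pandas as pd
import ult

# build the line ultindex from a table of alignments
lines = pd.read_csv('roads.csv')
ult.make_line_index(lines, 'roads.h5')

# geohash a table of gps points and map them against the line index
points = pd.read_csv('gps_points.csv')
points = ult.map_table(points, 12, map_only=True)
points = ult.line_index(points, ult.read_h5('roads.h5'))
# the result carries the lineid, distance along the line, and percentage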
Kyoto will now be used to transform the point data that is output into the most useful data structure I can think of for traffic / infrastructure problems. These algorithms, assuming lines would have their own instance for a given city, are pretty close to scaling for real-time traffic.
Examples
Polygon Usage Example
Writing an h5 output
The following example uses the 'states_total.csv' which can be found in this repo.
import pandas as pd
import ult
# reading in the example states csv file
# with the correct hierarchy syntax
data = pd.read_csv('states_total.csv')
# creating the ultindex h5 file
ult.make_polygon_index(data,'states.h5')
Relating the output Ultindex to points
The next example takes the h5 file created and a csv file (random_points.csv) containing a dataframe of randomly generated points (lat/long) and maps it against the created index.
import ult
import pandas as pd
# reading in random point csv file
# this df has two fields: [LAT and LONG]
data = pd.read_csv('random_points.csv')
# adding the geohash column to the dataframe
# the precision of the geohash is consequential to performance
# in this case we're using a precision of 12
# in reality it doesn't matter as long as your precision
# is greater than or equal to the maxsize of the index
data = ult.map_table(data,12,map_only=True)
# reading the h5 file into memory
ultindex = ult.read_h5('states.h5')
# now mapping the ultindex using the information
# contained in the data dataframe
data = ult.area_index(data,ultindex)
# slicing only the point areas that were found
data = data[data['AREA'].astype(str).str.len() > 0]
print data
'''
OUTPUT:
LAT LONG GEOHASH AREA
0 48.089478 -93.750921 cbt3ddktt0qk Minnesota
1 41.379246 -102.953335 9xqg3hzet6te Nebraska
9 38.548906 -122.784367 9qbemcx5suuk California
13 36.818091 -100.050876 9y8c7q90vtef Oklahoma
20 37.413120 -94.186330 9ythbdze5rn4 Missouri
21 31.053591 -97.402089 9vd8xt1uqbf5 Texas
24 38.206359 -93.041011 9yvc6g385m9c Missouri
25 45.455005 -94.609907 cbhdx4g4t7hh Minnesota
28 46.870968 -117.590168 c2k6sht7nmxe Washington
31 38.144311 -117.411059 9qu8bpzysnbu Nevada
34 26.836894 -99.089117 9uc89jt80pxk Texas
35 41.297582 -108.241780 9x74bvuxzer1 Wyoming
37 48.329405 -109.694491 c89fzy4gjn5k Montana
39 33.392247 -96.339963 9vgmzwz7unj7 Texas
40 46.189429 -110.379100 c81wb282ut97 Montana
41 41.284224 -92.552133 9zq4vgtd4cpx Iowa
43 33.941764 -105.590139 9whcj4zt57mw New Mexico
45 30.025959 -82.955948 djm4b613fk3y Florida
51 44.436350 -108.419526 9xfuuc7k9bn3 Wyoming
52 38.544225 -107.390838 9wgek2b2kh71 Colorado
53 37.161598 -81.516280 dnw53j19uzzy Virginia
54 45.059698 -105.257990 c8j0kfc7z0d3 Montana
56 35.134067 -78.141553 dq0rvggncbg5 North Carolina
58 43.789761 -99.213492 9zc3n7dvy546 South Dakota
62 33.539471 -89.381152 djbqy1dts4w1 Mississippi
67 43.246302 -92.687529 9zwn4bepbp14 Iowa
69 45.177513 -116.541798 c2j1h02f56bb Idaho
74 45.228663 -99.014039 cb196cshj9v4 South Dakota
80 43.500666 -110.228819 9x9x7xjcdrg5 Wyoming
81 40.523553 -83.452706 dphwscpdg1b5 Ohio
... ... ... ... ...
499947 35.421860 -81.665577 dnmct8kgq1k4 North Carolina
499952 28.083702 -98.556564 9uczv23r3te0 Texas
499957 38.334317 -77.553863 dqbf54x18486 Virginia
499958 39.703135 -80.271515 dpncv7spd0uh West Virginia
499961 31.590770 -91.625442 9vwgenbms6q3 Louisiana
499964 33.090380 -88.253559 djchpxztrjbu Alabama
499965 29.758773 -95.962625 9v7c236jkv2d Texas
499966 37.459543 -117.099734 9qstp769dkvu Nevada
499968 42.820795 -108.558834 9xdg9emc5ydr Wyoming
499969 45.995796 -68.105952 f2ptdjqbbmch Maine
499974 37.802177 -96.779052 9yepj9xtsss8 Kansas
499976 46.637673 -104.575035 c8m9k64u1h3f Montana
499977 39.991365 -109.969346 9x1g984y8db8 Utah
499978 41.834374 -108.453019 9x6vupdyvkx7 Wyoming
499979 30.095319 -85.659142 dj754y8vvmcz Florida
499980 31.872310 -102.342257 9txjr4htst4j Texas
499983 46.408355 -90.466860 cbr8j2kh327e Wisconsin
499984 40.862529 -115.929773 9rm83yy92rrv Nevada
499985 41.278137 -92.395107 9zq6c6rytfw2 Iowa
499987 36.780333 -76.100634 dq9chr9hywyn Virginia
499988 29.312756 -97.970709 9v4qf85zjj8y Texas
499989 33.345653 -100.084029 9vbvdw3h905p Texas
499990 45.729267 -106.524539 c8hhpuz0q34u Montana
499991 34.240690 -100.781937 9y06f9kf6xjp Texas
499992 30.274349 -102.445957 9trhhz30u5wn Texas
499993 34.624113 -109.412432 9w4hyr0k1d99 Arizona
499994 40.431791 -80.862371 dpnqpbmu60jt Ohio
499996 46.079738 -120.169030 c24w1kx9ftsq Washington
499997 41.527151 -86.241824 dp6sjxb95vgn Indiana
499999 35.673768 -120.603379 9q64zw5qc86b California
[215820 rows x 4 columns]
'''
Usage Information
Currently, lines may accept an st_asewkt string field in a dataframe, or a string representation of coords as they would appear in a geojson alignment as a coords field. It will support more alignment types later, but these are the ways I treat alignments personally; especially the string geojson, which is awesome for nlgeojson.
Spark parallelization can be used in the make_polygon_index algorithm. To enable a spark split, simply pass in a SQL context, sc, as a kwarg (sc=sc), and the splits on the remaining progress or the entire table will be divided evenly.
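For example (grounded only in the note above; sc is an existing SQL context):

# passing the SQL context enables the spark split inside make_polygon_index
ult.make_polygon_index(data, 'states.h5', sc=sc)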
Outputs for area_index can take two forms: 'single', the default, which compiles all sub-dfs into one dict; and 'regions', which splits the dict into sub-dictionaries based on the level-2 precision of the geohash. The latter is useful if you're running into memory walls when mapping your output.