thinkingmachines / geowrangler
🌏 A python package for wrangling geospatial datasets
Home Page: https://geowrangler.thinkingmachin.es/
License: MIT License
This might be pedantic, but what do you think about distinguishing between "grids" and "grid tiles" for added clarity? This will imply some renaming in the GridGenerator class.
Download and cache data from the Geofabrik website given a desired country.
I think there's room for further optimization, especially for generate_grids in the GridGenerator class. Right now the grid tiles are first generated across the entire span of xrange and yrange and then filtered out after. While this isn't an issue for very coarse grids, it can easily run into runtime and memory issues for fine grids.
Instead of generating all the tiles and then filtering after, we can generate only the grid tiles we need. To determine which grid tiles to generate in the first place, we can use the cheapest possible geometric operations. Generating tiles across the full xrange and yrange and intersecting them with the gdf's unary_union can be expensive, since the unary_union is a single geometry that most likely has a large number of points. Can make a PR for this too!
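A minimal sketch of the per-geometry idea, assuming tiles aligned to a global (x_min, y_min) origin with a fixed cell_size; the function and parameter names here are hypothetical, not geowrangler's API. Only tiles within each geometry's own bounds are generated before the exact intersects check:

import numpy as np
from shapely.geometry import box

def tiles_for_geometry(geom, x_min, y_min, cell_size):
    # Snap the geometry's own bounds to the global grid origin (cheap
    # arithmetic), so tile origins stay aligned with the full grid.
    gminx, gminy, gmaxx, gmaxy = geom.bounds
    x0 = x_min + np.floor((gminx - x_min) / cell_size) * cell_size
    y0 = y_min + np.floor((gminy - y_min) / cell_size) * cell_size
    tiles = []
    for x in np.arange(x0, gmaxx, cell_size):
        for y in np.arange(y0, gmaxy, cell_size):
            tile = box(x, y, x + cell_size, y + cell_size)
            # Exact check against a single geometry, not the whole unary_union
            if tile.intersects(geom):
                tiles.append(tile)
    return tiles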
Add to docs
Download and cache from Ookla’s S3 bucket given parameters (wired/wireless, year, quarter).
PackageInfo: Invalid constraint (black>='19.3' ; python_version >= "3.6") found in nb-black-1.0.7 dependencies, skipping
Colab notebook for testing:
https://colab.research.google.com/drive/147HWUgaBztsZuBPrI_HTckBrz_vl9l1l#scrollTo=wvLenjgDUgod
Scenario:
Error: Colab crashes due to exceeding the RAM limit.
Just creating this issue to check if there are straightforward ways to optimize. Otherwise, are there workarounds for handling relatively large vector datasets like this?
We can use the gist provided here to implement the conversion of the quadkey to its geometry so it can be used by the raster zonal stats module.
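For reference, here is a minimal sketch of that conversion based on the standard Bing Maps tile system math (not geowrangler's own API; the function name is hypothetical):

import math
from shapely.geometry import box

def quadkey_to_bbox(quadkey: str):
    # Decode the quadkey into tile x/y and zoom level: each digit packs
    # one x bit (least significant bit) and one y bit (bit 1).
    x = y = 0
    zoom = len(quadkey)
    for digit in quadkey:
        x = (x << 1) | (int(digit) & 1)
        y = (y << 1) | (int(digit) >> 1)

    def tile_corner(tx, ty):
        # Standard Web Mercator tile-to-lon/lat formula.
        n = 2 ** zoom
        lon = tx / n * 360.0 - 180.0
        lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * ty / n))))
        return lon, lat

    west, north = tile_corner(x, y)
    east, south = tile_corner(x + 1, y + 1)
    return box(west, south, east, north)

quadkey_to_bbox("1323") returns the tile's polygon in EPSG:4326, which the raster zonal stats module could then consume as a zone geometry.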
Calculate DHS Wealth Index given specified data. Useful when re-calculating the wealth index across multiple countries. Or when applying the same procedure to non-DHS surveys (e.g. Indonesia Susenas).
Originally posted by mosesckim June 30, 2022
Noticed the default is to fill NaNs with zero after aggregation; this might make it difficult to identify original NaNs if there are actual zeros in the aggregation.
https://github.com/thinkingmachines/geowrangler/blob/master/geowrangler/vector_zonal_stats.py#L205
Also, a suggestion: in the case the fillna option is set to True, make the replacement value (currently 0) a variable users can input (e.g. -1, etc.).
Thanks!
Moses
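A minimal sketch of what the suggestion might look like, using a hypothetical fillna_value parameter and helper function (the current code always fills with 0 when fillna is True):

import pandas as pd

# Hypothetical helper illustrating the suggested option; not geowrangler's API.
def fill_aggregation_nans(df: pd.DataFrame, columns, fillna=True, fillna_value=0):
    if fillna:
        # Users could pass e.g. fillna_value=-1 to keep original NaNs
        # distinguishable from genuine zeros in the aggregation.
        df[columns] = df[columns].fillna(fillna_value)
    return df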
This could be a feature to consider for Geowrangler Geometry Validation
I encountered an error when I tried to upload a geopandas dataframe to BigQuery:
GenericGBQException: Reason: 400 Error while reading data, error message: Invalid geography value for column 'geometry', error: Polygon loop should have at least 3 unique vertices, but only had 2; in WKB geography
It turns out there was a "polygon" that was actually a line. I verified it by computing the area, which was indeed 0.
'POLYGON ((122.95320551089915 11.473736609261481, 122.952381 11.4737421, 122.95320551089915 11.47373660926148, 122.95320551089915 11.473736609261481))'
The weird thing is it's not caught by is_valid on the epsg:4326 GeoSeries, but it is caught by is_valid when the GeoSeries is projected to epsg:3123. I expected is_valid to return False even if the polygon was not projected.
Perhaps this can be something geowrangler's geometry validation can also catch?
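One possible check (a sketch, not geowrangler's API): flag effectively-zero-area polygons as degenerate regardless of what is_valid reports in the geographic CRS.

from shapely import wkt

poly = wkt.loads(
    "POLYGON ((122.95320551089915 11.473736609261481, "
    "122.952381 11.4737421, 122.95320551089915 11.47373660926148, "
    "122.95320551089915 11.473736609261481))"
)

# A polygon that collapses to a line has (effectively) zero area even when
# is_valid passes; a small tolerance avoids floating-point surprises.
is_degenerate = poly.area < 1e-12
print(poly.is_valid, poly.area, is_degenerate)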
Combination of data across multiple DHS files into cluster-level data
Hello, noticed that the OSM data download does not support any kind of caching.
It would be nice to support this natively so users don't have to keep writing their own file existence checks when they need to re-run cells in a Jupyter notebook or re-run scripts. Otherwise, the line of code would download the file again, resulting in long runtimes.
Maybe we can add an overwrite parameter to the function, e.g. geofabrik.download_geofabrik_region("laos", "../test_dir", overwrite=False), with overwrite set to False by default, so that caching is enabled by default.
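A rough sketch of the caching behavior (the filename pattern and URL layout below are assumptions for illustration, not geofabrik's actual internals):

from pathlib import Path
import urllib.request

def download_geofabrik_region(region, directory=".", overwrite=False):
    # Assumed filename pattern and URL layout, for illustration only.
    filepath = Path(directory) / f"{region}-latest-free.shp.zip"
    if filepath.exists() and not overwrite:
        return filepath  # cached copy found, skip the network call
    filepath.parent.mkdir(parents=True, exist_ok=True)
    url = f"https://download.geofabrik.de/{region}-latest-free.shp.zip"
    urllib.request.urlretrieve(url, filepath)
    return filepath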
@ncdejito created additional broken geometries to add to the testing:
{
"type": "FeatureCollection",
"name": "broken2",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "description": "correct" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343802163610064, 16.376524213304915 ], [ 120.343822299718241, 16.37652526707825 ], [ 120.343823581106932, 16.376508933591005 ], [ 120.343802895832169, 16.376508406704307 ], [ 120.343802163610064, 16.376524213304915 ] ] ] } },
{ "type": "Feature", "properties": { "description": "counterclockwise coordinates" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343795024444461, 16.376520173840436 ], [ 120.343777268058162, 16.376519295695978 ], [ 120.343779647780039, 16.376504894126214 ], [ 120.343797038055257, 16.376506123528557 ], [ 120.343795024444461, 16.376520173840436 ] ] ] } },
{ "type": "Feature", "properties": { "description": "self-intersecting polygons (e.g. twirled edges)" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343746880840385, 16.376508231075398 ], [ 120.34376646778199, 16.376510689880028 ], [ 120.343768481392758, 16.376492424473494 ], [ 120.343748528340186, 16.376490141297555 ], [ 120.343759511671877, 16.376513148684623 ], [ 120.343746880840385, 16.376508231075398 ] ] ] } },
{ "type": "Feature", "properties": { "description": "slither polygons" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343721070010858, 16.376486453090216 ], [ 120.343736263619718, 16.376488736266193 ], [ 120.34373571445316, 16.376486804348058 ], [ 120.343721070010858, 16.376486453090216 ] ] ] } },
{ "type": "Feature", "properties": { "description": "coordinates outside of -180,180" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 71.163398107805307, -88.300673651538673 ], [ 77.641630757327448, -88.362103634964285 ], [ 75.050337697518586, -88.738011928477775 ], [ 69.867751577900847, -88.612832998982199 ], [ 71.163398107805307, -88.300673651538673 ] ] ] } },
{ "type": "Feature", "properties": { "description": "holes" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343828569370118, 16.376482940511725 ], [ 120.343827196453617, 16.376463972586816 ], [ 120.343804589095839, 16.376465289803889 ], [ 120.34380632812335, 16.376483818656357 ], [ 120.343828569370118, 16.376482940511725 ] ], [ [ 120.343810080761713, 16.376468538939285 ], [ 120.343823352287544, 16.376467397351178 ], [ 120.343824725204001, 16.376478901046394 ], [ 120.343810263817204, 16.376479252304257 ], [ 120.343810080761713, 16.376468538939285 ] ] ] } },
{ "type": "Feature", "properties": { "description": "non-closed polygon" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343799646596622, 16.37644096519395 ], [ 120.343815206316521, 16.376441140822909 ], [ 120.343814840205468, 16.376428319908158 ], [ 120.343800012707675, 16.376429022424055 ] ] ] } },
{ "type": "Feature", "properties": { "description": "multipolygon" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 120.343776581599954, 16.376478374159603 ], [ 120.343789395487008, 16.376478549788533 ], [ 120.343789578542498, 16.376465553247304 ], [ 120.343776947711007, 16.376466607020941 ], [ 120.343776581599954, 16.376478374159603 ] ] ], [ [ [ 120.343764499935091, 16.376454137365769 ], [ 120.343777313822116, 16.376453786107874 ], [ 120.343778046044221, 16.376441843338764 ], [ 120.343765781323782, 16.376442370225647 ], [ 120.343764499935091, 16.376454137365769 ] ] ], [ [ [ 120.343738872161055, 16.376437979501546 ], [ 120.343751136881494, 16.376438155130508 ], [ 120.343752601325718, 16.376426036731463 ], [ 120.343739970494241, 16.376425685473517 ], [ 120.343738872161055, 16.376437979501546 ] ] ] ] } },
{ "type": "Feature", "properties": { "description": "polygon z" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343707386610134, 16.376456069284234, 0.0 ], [ 120.343727888829335, 16.376455718026328, 0.0 ], [ 120.343720200497131, 16.376444653402149, 0.0 ], [ 120.343707752721187, 16.376445531546949, 0.0 ], [ 120.343707386610134, 16.376456069284234, 0.0 ] ] ] } },
{ "type": "Feature", "properties": { "description": "complex self-intersecting polygon" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343830354161554, 16.376432886261462 ], [ 120.343846096936986, 16.376432008116602 ], [ 120.343845547770428, 16.376417782169376 ], [ 120.34382962193942, 16.376418309056326 ], [ 120.343834564438708, 16.376425685473521 ], [ 120.343839140826901, 16.376425685473521 ], [ 120.343838591660344, 16.376421821635979 ], [ 120.343834747494199, 16.376421646006996 ], [ 120.343830354161554, 16.376432886261462 ] ] ] } }
]
}
In
geowrangler/geowrangler/grids.py
Lines 29 to 36 in a918c66
there is a case where the resulting x_mask is all false: when the span between x_min and x_max, the bounds of the AOI, is less than the cell size.
self.x_min is 12621582.219997052
self.x_max is 14243844.181000795
cell_size = 100
x_min = 13762392.958057601
x_max = 13762473.616812669
in this scenario, the following returns an empty array
xrange = np.arange(self.x_min, self.x_max, cell_size)
np.nonzero(x_mask)
The solution is to add a buffer to x_max:
x_mask = (xrange >= x_min) & (xrange <= x_max + cell_size)
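A minimal reproduction with the values above (assuming the mask is computed against the AOI bounds as in lines 29 to 36):

import numpy as np

grid_x_min = 12621582.219997052   # self.x_min
grid_x_max = 14243844.181000795   # self.x_max
cell_size = 100
x_min = 13762392.958057601        # AOI bounds
x_max = 13762473.616812669

xrange = np.arange(grid_x_min, grid_x_max, cell_size)
x_mask = (xrange >= x_min) & (xrange <= x_max)
print(np.nonzero(x_mask))  # empty: the ~81 m wide AOI falls between 100 m grid steps

x_mask = (xrange >= x_min) & (xrange <= x_max + cell_size)
print(np.nonzero(x_mask))  # non-empty with the buffer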
This will improve the doc site aesthetics (see new nbdev)
Caveats:
I made poetry install run every time in commit b9397f5.
This is because GitHub Actions did not have pytest when using the cache and skipping poetry install.
https://github.com/thinkingmachines/geowrangler/actions/runs/2432724743/attempts/1
https://github.com/thinkingmachines/geowrangler/actions/runs/2432724743/attempts/2
The following errors out if no polygons/cells are within the AOI:
geowrangler/geowrangler/grids.py
Lines 107 to 117 in a918c66
if polygons:
    dest = GeoDataFrame(polygons, geometry="geometry", crs=self.grid_projection)
    dest_reproject = dest.to_crs(self.gdf.crs)
    final = dest_reproject[dest_reproject.intersects(self.gdf.unary_union)]
    return final
else:
    return GeoDataFrame({"x": [], "y": [], "geometry": []}, geometry="geometry", crs=self.gdf.crs)
Currently we are installing a bunch of things within the notebooks just so they work in Colab:
geowrangler/notebooks/02_vector_zonal_stats.ipynb
Lines 20 to 25 in a918c66
Is this something we should keep doing?
It's a common feature to calculate the distance to the nearest X, e.g. in poverty mapping, the distance to the nearest hospital, road, etc. We should also support this eventually.
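For reference, GeoPandas (>= 0.10) already exposes a nearest join that an implementation could build on; a minimal sketch with hypothetical file names, run in a projected CRS so distances are in meters:

import geopandas as gpd

aois = gpd.read_file("aois.geojson").to_crs("EPSG:3857")
hospitals = gpd.read_file("hospitals.geojson").to_crs("EPSG:3857")

# Adds a dist_to_hospital column with the distance (in CRS units)
# from each AOI to its nearest hospital.
joined = gpd.sjoin_nearest(aois, hospitals, how="left", distance_col="dist_to_hospital")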
Originally posted by mosesckim July 12, 2022
Would like to request a progress bar for aggregation, using tqdm's pandas integration (ref: https://github.com/tqdm/tqdm#pandas-integration).
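A minimal sketch of that integration (the dataframe here is hypothetical):

import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

df = pd.DataFrame({"zone": [1, 1, 2], "value": [3.0, 4.0, 5.0]})
# progress_apply behaves like apply but renders a tqdm progress bar.
stats = df.groupby("zone")["value"].progress_apply(lambda s: s.mean())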
Hi! I'm currently using the grid generation tutorial and would like to clarify whether the output grids in the example are in degrees or meters.
In the create grids section, the tutorial mentions that the units of the grids are dependent on the units of the projection.
Create a grid generator with a size of 50000. The units of the grid size are dependent on the projection of the geodataframe, in this case, EPSG:4326.
When printing out the CRS of region3_gdf, it shows that the units are in decimal degrees.
>> region3_gdf.crs # CRS info
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
However, the output grids in the tutorial are described as being at kilometer scale:
grid_generator5k = grids.SquareGridGenerator(
region3_gdf, 5000
) # 5 km x 5 km square cells
Let me know also if this is the correct place for this question or if I should place this in Discussions instead. Thanks! :D
We should leverage the benefits of nbdev to lower the friction of trying out Geowrangler's modules by making it easy to download and open the notebooks as jupyter notebooks.
Here's a gist of the notebook running on Colab.
The error message is
ValueError: 'box_aspect' and 'fig_aspect' must be positive
Upon initial investigation, it seems that there might be something wrong with the generated h3 tile coordinates.
The total bounds of the source gdf and the generated h3 gdfs seem to have the lat/long coordinates interchanged.
Right now, the create_raster_zonal_stats method of the raster zonal stats module only supports a single band per call. Being able to handle multiple bands might be a nice enhancement, to minimize having to do multiple passes on the same raster file across the same zones.
The underlying rasterstats module has some suggestions on how to handle this (which I'm documenting below so that it may serve as a future reference for an implementation).
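For instance, a minimal sketch looping over bands with rasterstats' band parameter (file names and band count here are hypothetical):

import rasterstats

zones = "zones.geojson"   # vector file of zone polygons
raster = "multiband.tif"  # raster with 3 bands

# One zonal_stats pass per band over the same zones; a native multiband
# API could avoid re-reading the zone geometries each time.
per_band_stats = {
    band: rasterstats.zonal_stats(zones, raster, band=band, stats=["mean", "count"])
    for band in (1, 2, 3)
}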
enhancement request: add asset index computation result as additional column to original data frame
https://github.com/thinkingmachines/geowrangler/blob/master/geowrangler/dhs.py#L144
I ran poetry install and got the following error:
• Updating nbdev (1.2.8 -> 1.2.9 c151342): Failed
CalledProcessError
Command '['git', '--git-dir', '/home/jt/.cache/pypoetry/virtualenvs/geowrangler-U9oiUrW5-py3.9/src/nbdev/.git', '--work-tree', '/home/jt/.cache/pypoetry/virtualenvs/geowrangler-U9oiUrW5-py3.9/src/nbdev', 'checkout', 'c15134220b4d6b96dd67952e27a57a8e5c1bf4c3']' returned non-zero exit status 128.
at ~/.poetry/lib/poetry/utils/_compat.py:217 in run
213│ process.wait()
214│ raise
215│ retcode = process.poll()
216│ if check and retcode:
→ 217│ raise CalledProcessError(
218│ retcode, process.args, output=stdout, stderr=stderr
219│ )
220│ finally:
221│ # None because our context manager __exit__ does not use them.
This error also came up once we merged #30 to master (https://github.com/thinkingmachines/geowrangler/runs/6833093902), but not when the PR was reviewed: https://github.com/thinkingmachines/geowrangler/actions/runs/2453771035
In the GridGenerator class here, the overall bounding box (minx, miny, maxx, maxy) is automatically derived from the projected gdf. Perhaps it would be good to make the overall bounding box an optional parameter. If None, it could automatically compute the bounding box from the gdf; otherwise it uses the user-defined bounding box.
Having a user-defined bounding box is useful for making consistently defined grids, especially in cases where our AOIs do not necessarily encompass the entire country. We can define our overall bounding box based on the country admin boundaries, and regardless of what AOI we supply as a gdf input, the x and y coordinates of the grid tiles remain consistent.
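A sketch of the suggested behavior (the bounds parameter and helper function are hypothetical, not the current API):

import geopandas as gpd

def resolve_grid_bounds(gdf: gpd.GeoDataFrame, bounds=None):
    # If no bounds are supplied, fall back to deriving them from the AOI,
    # as the GridGenerator does today.
    if bounds is None:
        return tuple(gdf.total_bounds)  # (minx, miny, maxx, maxy)
    # Otherwise use the user-defined box (e.g. country admin boundaries),
    # so tile origins stay consistent across different AOIs.
    return bounds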
From https://github.com/thinkingmachines/geowrangler/runs/7194843190?check_suite_focus=true
converting: /home/runner/work/geowrangler/geowrangler/notebooks/index.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/02_vector_zonal_stats.ipynb
An error occurred while executing the following cell:
------------------
show_doc(_fix_agg)
------------------
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 show_doc(_fix_agg)
NameError: name '_fix_agg' is not defined
NameError: name '_fix_agg' is not defined
converting: /home/runner/work/geowrangler/geowrangler/notebooks/tutorial.geometry_validation.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/00_validation.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/tutorial.grids.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/00_grids.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/tutorial.vector_zonal_stats.ipynb
Conversion failed on the following:
02_vector_zonal_stats.ipynb
Tried refactoring my code to generate Ookla vector zonal stats using Bing Tile quadkeys instead of the regular spatial joins.
In theory, it should be faster. But in practice, regular vector zonal stats takes a few seconds (~15-30s) while the Bing tile version takes too long (I tried running it for ~13 minutes and interrupted it). Not sure if I'm doing something wrong.
Here's a Colab notebook to replicate it:
https://colab.research.google.com/drive/1IdwTu2oQjL6fBPwe-Kgk1KReeIc0n8NW#scrollTo=3FYr2rJu-wwj
The notebook needs this file: phl_tiles.csv
A current hunch is that doing raw spatial joins benefits from spatial indexes, but the implementation of matching/aggregating by quadkeys does not.
Related to Feature #42
Having conditionally enabled extensions is a code smell. Extensions should be optional for users. We can add a config file that enables them for dev and disables them for end users:
https://ipython.readthedocs.io/en/stable/config/extensions/#using-extensions
https://jupyter-notebook.readthedocs.io/en/stable/config.html
Colab notebook for testing:
Scenario:
Could it be a typo, since the key error mentions "GeoWrangleer_aoi_index"?
In case it's relevant, noting that the function works when I passed in grid tiles instead of the Subang Regency gdf itself.
Ran into this error when trying to do bing tile vector zonal stats. The root cause was loading a dataframe from a CSV: the quadkeys were being automatically interpreted as ints by pandas.
On the user side, it's easy enough to just manually fix this and change the column dtype to string. But wondering if, in the bing tile vector zonal stats function, we can (and should) just do this auto-conversion for convenience (at least for the purposes of joining based on quadkeys)?
Related to Feature #42
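For reference, the user-side fix is a one-liner ("quadkey" is the assumed column name):

import pandas as pd

# Read quadkeys as strings up front; casting to str after the fact cannot
# recover leading zeros that int parsing already dropped.
df = pd.read_csv("phl_tiles.csv", dtype={"quadkey": str})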
Creating an issue to track that Vector Zonal Stats currently supports primarily point geometries.
There is a workaround where you simply convert these non-point geometries into points by getting their centroids; from there you can use the existing VZS feature. But that might cause some imprecision.
A popular dataset example that doesn't come in the form of points is Ookla, where data is in the form of tiles.
Case 1: AOI Tile is smaller than data tile (Ookla)
For example, if your AOI tiles are 1m x 1m, but Ookla tiles are 30m x 30m, and assuming they are aligned (all AOI tiles are within an Ookla tile), then ideally, the average download speed of the Ookla tile should be attributed to all 900 AOI tiles within it. However, if we convert to point geometries, only one of the AOI tiles will intersect with the Ookla tile centroid and get the right attributes; the rest will be null.
Case 2: AOI Tile is bigger than data tile (Ookla)
In the reverse case where the AOI tile is bigger than the Ookla tile (which is likely the majority case), this should be more tolerable because it is likely that each AOI tile will get multiple Ookla tile centroids anyway, leading to a reasonable approximation.
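For Case 1, a polygon-to-polygon join would attribute each data tile's values to every intersecting AOI tile; a minimal sketch with hypothetical GeoDataFrames:

import geopandas as gpd

aoi_tiles = gpd.read_file("aoi_tiles.geojson")      # e.g. 1 m x 1 m tiles
ookla_tiles = gpd.read_file("ookla_tiles.geojson")  # e.g. 30 m x 30 m tiles

# Every AOI tile intersecting an Ookla tile inherits its attributes
# (e.g. avg_d_kbps), instead of only the tile containing the centroid.
joined = gpd.sjoin(aoi_tiles, ookla_tiles, how="left", predicate="intersects")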