thinkingmachines / geowrangler
🌏 A python package for wrangling geospatial datasets
Home Page: https://geowrangler.thinkingmachin.es/
License: MIT License
This might be pedantic, but what do you think about distinguishing between "grids" and "grid tiles" for added clarity? This will imply some renaming in the GridGenerator class.
Download and cache data from the Geofabrik website given a desired country.
I think there's room for further optimization, especially for generate_grids in the GridGenerator class. Right now the grid tiles are first generated across the entire span of xrange and yrange and then filtered out after. While this isn't an issue for very coarse grids, it can easily run into runtime and memory issues for fine grids.
Instead of generating all the tiles and then filtering after, we can generate only the grid tiles we need. To determine which grid tiles to generate in the first place, we can use the cheapest possible geometric operations. Generating tiles across the full xrange and yrange and intersecting them with the gdf's unary_union can be expensive, since the unary_union is a single geometry that most likely has a large number of points. Can make a PR for this too!
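A minimal sketch of the per-geometry idea, assuming tiles aligned to a global (x_min, y_min) origin with a fixed cell_size; the function and parameter names here are hypothetical, not geowrangler's API. Only tiles within each geometry's own bounds are generated before the exact intersects check:

import numpy as np
from shapely.geometry import box

def tiles_for_geometry(geom, x_min, y_min, cell_size):
    # Snap the geometry's own bounds to the global grid origin (cheap
    # arithmetic), so tile origins stay aligned with the full grid.
    gminx, gminy, gmaxx, gmaxy = geom.bounds
    x0 = x_min + np.floor((gminx - x_min) / cell_size) * cell_size
    y0 = y_min + np.floor((gminy - y_min) / cell_size) * cell_size
    tiles = []
    for x in np.arange(x0, gmaxx, cell_size):
        for y in np.arange(y0, gmaxy, cell_size):
            tile = box(x, y, x + cell_size, y + cell_size)
            # Exact check against a single geometry, not the whole unary_union
            if tile.intersects(geom):
                tiles.append(tile)
    return tiles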
Add to docs
Download and cache from Ookla’s S3 bucket given parameters (wired/wireless, year, quarter).
PackageInfo: Invalid constraint (black>='19.3' ; python_version >= "3.6") found in nb-black-1.0.7 dependencies, skipping
Colab notebook for testing:
https://colab.research.google.com/drive/147HWUgaBztsZuBPrI_HTckBrz_vl9l1l#scrollTo=wvLenjgDUgod
Scenario:
Error: Colab crashes due to exceeding the RAM limit.
Just creating this issue to check if there are straightforward ways to optimize. Otherwise, are there workarounds for handling relatively large vector datasets like this?
We can use the gist provided here to implement the conversion of the quadkey to its geometry so it can be used by the raster zonal stats module.
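For reference, here is a minimal sketch of that conversion based on the standard Bing Maps tile system math (not geowrangler's own API; the function name is hypothetical):

import math
from shapely.geometry import box

def quadkey_to_bbox(quadkey: str):
    # Decode the quadkey into tile x/y and zoom level: each digit packs
    # one x bit (least significant bit) and one y bit (bit 1).
    x = y = 0
    zoom = len(quadkey)
    for digit in quadkey:
        x = (x << 1) | (int(digit) & 1)
        y = (y << 1) | (int(digit) >> 1)

    def tile_corner(tx, ty):
        # Standard Web Mercator tile-to-lon/lat formula.
        n = 2 ** zoom
        lon = tx / n * 360.0 - 180.0
        lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * ty / n))))
        return lon, lat

    west, north = tile_corner(x, y)
    east, south = tile_corner(x + 1, y + 1)
    return box(west, south, east, north)

quadkey_to_bbox("1323") returns the tile's polygon in EPSG:4326, which the raster zonal stats module could then consume as a zone geometry.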
Calculate DHS Wealth Index given specified data. Useful when re-calculating the wealth index across multiple countries. Or when applying the same procedure to non-DHS surveys (e.g. Indonesia Susenas).
Originally posted by mosesckim June 30, 2022
Noticed the default is to fill NaNs with zero after aggregation; this might make it difficult to identify original NaNs if there are actual zeros in the aggregation.
https://github.com/thinkingmachines/geowrangler/blob/master/geowrangler/vector_zonal_stats.py#L205
Also, a suggestion: in the case the fillna option is set to True, make the replacement value (currently 0) a variable users can input (e.g. -1, etc.).
Thanks!
Moses
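A minimal sketch of what the suggestion might look like, using a hypothetical fillna_value parameter and helper function (the current code always fills with 0 when fillna is True):

import pandas as pd

# Hypothetical helper illustrating the suggested option; not geowrangler's API.
def fill_aggregation_nans(df: pd.DataFrame, columns, fillna=True, fillna_value=0):
    if fillna:
        # Users could pass e.g. fillna_value=-1 to keep original NaNs
        # distinguishable from genuine zeros in the aggregation.
        df[columns] = df[columns].fillna(fillna_value)
    return df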
This could be a feature to consider for Geowrangler Geometry Validation
I encountered an error when I tried to upload a geopandas dataframe to BigQuery:
GenericGBQException: Reason: 400 Error while reading data, error message: Invalid geography value for column 'geometry', error: Polygon loop should have at least 3 unique vertices, but only had 2; in WKB geography
It turns out there was a "polygon" that was actually a line. I verified it by computing the area, which was indeed 0.
'POLYGON ((122.95320551089915 11.473736609261481, 122.952381 11.4737421, 122.95320551089915 11.47373660926148, 122.95320551089915 11.473736609261481))'
The weird thing is it's not caught by is_valid on the epsg:4326 GeoSeries, but it is caught by is_valid when the GeoSeries is projected to epsg:3123. I expected is_valid to return False even if the polygon was not projected.
Perhaps this can be something geowrangler's geometry validation can also catch?
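One possible check (a sketch, not geowrangler's API): flag effectively-zero-area polygons as degenerate regardless of what is_valid reports in the geographic CRS.

from shapely import wkt

poly = wkt.loads(
    "POLYGON ((122.95320551089915 11.473736609261481, "
    "122.952381 11.4737421, 122.95320551089915 11.47373660926148, "
    "122.95320551089915 11.473736609261481))"
)

# A polygon that collapses to a line has (effectively) zero area even when
# is_valid passes; a small tolerance avoids floating-point surprises.
is_degenerate = poly.area < 1e-12
print(poly.is_valid, poly.area, is_degenerate)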
Combination of data across multiple DHS files into cluster-level data
Hello, noticed that the OSM data download does not support any kind of caching.
It would be nice to support this natively so users don't have to keep writing their own file existence checks when they need to re-run cells in a Jupyter notebook or re-run scripts. Otherwise, the line of code would download the file again, resulting in long runtimes.
Maybe we can add an overwrite parameter to the function, e.g. geofabrik.download_geofabrik_region("laos", "../test_dir", overwrite=False), with overwrite set to False by default, so that caching is enabled by default.
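A rough sketch of the caching behavior (the filename pattern and URL layout below are assumptions for illustration, not geofabrik's actual internals):

from pathlib import Path
import urllib.request

def download_geofabrik_region(region, directory=".", overwrite=False):
    # Assumed filename pattern and URL layout, for illustration only.
    filepath = Path(directory) / f"{region}-latest-free.shp.zip"
    if filepath.exists() and not overwrite:
        return filepath  # cached copy found, skip the network call
    filepath.parent.mkdir(parents=True, exist_ok=True)
    url = f"https://download.geofabrik.de/{region}-latest-free.shp.zip"
    urllib.request.urlretrieve(url, filepath)
    return filepath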
@ncdejito created additional broken geometries to add to the testing:
{
"type": "FeatureCollection",
"name": "broken2",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "description": "correct" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343802163610064, 16.376524213304915 ], [ 120.343822299718241, 16.37652526707825 ], [ 120.343823581106932, 16.376508933591005 ], [ 120.343802895832169, 16.376508406704307 ], [ 120.343802163610064, 16.376524213304915 ] ] ] } },
{ "type": "Feature", "properties": { "description": "counterclockwise coordinates" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343795024444461, 16.376520173840436 ], [ 120.343777268058162, 16.376519295695978 ], [ 120.343779647780039, 16.376504894126214 ], [ 120.343797038055257, 16.376506123528557 ], [ 120.343795024444461, 16.376520173840436 ] ] ] } },
{ "type": "Feature", "properties": { "description": "self-intersecting polygons (e.g. twirled edges)" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343746880840385, 16.376508231075398 ], [ 120.34376646778199, 16.376510689880028 ], [ 120.343768481392758, 16.376492424473494 ], [ 120.343748528340186, 16.376490141297555 ], [ 120.343759511671877, 16.376513148684623 ], [ 120.343746880840385, 16.376508231075398 ] ] ] } },
{ "type": "Feature", "properties": { "description": "slither polygons" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343721070010858, 16.376486453090216 ], [ 120.343736263619718, 16.376488736266193 ], [ 120.34373571445316, 16.376486804348058 ], [ 120.343721070010858, 16.376486453090216 ] ] ] } },
{ "type": "Feature", "properties": { "description": "coordinates outside of -180,180" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 71.163398107805307, -88.300673651538673 ], [ 77.641630757327448, -88.362103634964285 ], [ 75.050337697518586, -88.738011928477775 ], [ 69.867751577900847, -88.612832998982199 ], [ 71.163398107805307, -88.300673651538673 ] ] ] } },
{ "type": "Feature", "properties": { "description": "holes" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343828569370118, 16.376482940511725 ], [ 120.343827196453617, 16.376463972586816 ], [ 120.343804589095839, 16.376465289803889 ], [ 120.34380632812335, 16.376483818656357 ], [ 120.343828569370118, 16.376482940511725 ] ], [ [ 120.343810080761713, 16.376468538939285 ], [ 120.343823352287544, 16.376467397351178 ], [ 120.343824725204001, 16.376478901046394 ], [ 120.343810263817204, 16.376479252304257 ], [ 120.343810080761713, 16.376468538939285 ] ] ] } },
{ "type": "Feature", "properties": { "description": "non-closed polygon" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343799646596622, 16.37644096519395 ], [ 120.343815206316521, 16.376441140822909 ], [ 120.343814840205468, 16.376428319908158 ], [ 120.343800012707675, 16.376429022424055 ] ] ] } },
{ "type": "Feature", "properties": { "description": "multipolygon" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 120.343776581599954, 16.376478374159603 ], [ 120.343789395487008, 16.376478549788533 ], [ 120.343789578542498, 16.376465553247304 ], [ 120.343776947711007, 16.376466607020941 ], [ 120.343776581599954, 16.376478374159603 ] ] ], [ [ [ 120.343764499935091, 16.376454137365769 ], [ 120.343777313822116, 16.376453786107874 ], [ 120.343778046044221, 16.376441843338764 ], [ 120.343765781323782, 16.376442370225647 ], [ 120.343764499935091, 16.376454137365769 ] ] ], [ [ [ 120.343738872161055, 16.376437979501546 ], [ 120.343751136881494, 16.376438155130508 ], [ 120.343752601325718, 16.376426036731463 ], [ 120.343739970494241, 16.376425685473517 ], [ 120.343738872161055, 16.376437979501546 ] ] ] ] } },
{ "type": "Feature", "properties": { "description": "polygon z" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343707386610134, 16.376456069284234, 0.0 ], [ 120.343727888829335, 16.376455718026328, 0.0 ], [ 120.343720200497131, 16.376444653402149, 0.0 ], [ 120.343707752721187, 16.376445531546949, 0.0 ], [ 120.343707386610134, 16.376456069284234, 0.0 ] ] ] } },
{ "type": "Feature", "properties": { "description": "complex self-intersecting polygon" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 120.343830354161554, 16.376432886261462 ], [ 120.343846096936986, 16.376432008116602 ], [ 120.343845547770428, 16.376417782169376 ], [ 120.34382962193942, 16.376418309056326 ], [ 120.343834564438708, 16.376425685473521 ], [ 120.343839140826901, 16.376425685473521 ], [ 120.343838591660344, 16.376421821635979 ], [ 120.343834747494199, 16.376421646006996 ], [ 120.343830354161554, 16.376432886261462 ] ] ] } }
]
}
In
geowrangler/geowrangler/grids.py
Lines 29 to 36 in a918c66
there is a case where the resulting x_mask is all false: when the span between x_min and x_max, the bounds of the AOI, is less than the cell size.
self.x_min is 12621582.219997052
self.x_max is 14243844.181000795
cell_size = 100
x_min = 13762392.958057601
x_max = 13762473.616812669
in this scenario, the following returns an empty array
xrange = np.arange(self.x_min, self.x_max, cell_size)
np.nonzero(x_mask)
The solution is to add a buffer to x_max:
x_mask = (xrange >= x_min) & (xrange <= x_max + cell_size)
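A minimal reproduction with the values above (assuming the mask is computed against the AOI bounds as in lines 29 to 36):

import numpy as np

grid_x_min = 12621582.219997052   # self.x_min
grid_x_max = 14243844.181000795   # self.x_max
cell_size = 100
x_min = 13762392.958057601        # AOI bounds
x_max = 13762473.616812669

xrange = np.arange(grid_x_min, grid_x_max, cell_size)
x_mask = (xrange >= x_min) & (xrange <= x_max)
print(np.nonzero(x_mask))  # empty: the ~81 m wide AOI falls between 100 m grid steps

x_mask = (xrange >= x_min) & (xrange <= x_max + cell_size)
print(np.nonzero(x_mask))  # non-empty with the buffer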
This will improve the doc site aesthetics (see new nbdev)
Caveats:
I made poetry install run every time in commit b9397f5.
This is because GitHub Actions did not have pytest when using the cache and skipping poetry install.
https://github.com/thinkingmachines/geowrangler/actions/runs/2432724743/attempts/1
https://github.com/thinkingmachines/geowrangler/actions/runs/2432724743/attempts/2
The following errors out if no polygons/cells are within the AOI:
geowrangler/geowrangler/grids.py
Lines 107 to 117 in a918c66
if polygons:
    dest = GeoDataFrame(polygons, geometry="geometry", crs=self.grid_projection)
    dest_reproject = dest.to_crs(self.gdf.crs)
    final = dest_reproject[dest_reproject.intersects(self.gdf.unary_union)]
    return final
else:
    return GeoDataFrame({"x": [], "y": [], "geometry": []}, geometry="geometry", crs=self.gdf.crs)
Currently we are installing a bunch of things within the notebooks just so they work in Colab:
geowrangler/notebooks/02_vector_zonal_stats.ipynb
Lines 20 to 25 in a918c66
Is this something we should keep doing?
It's a common feature to calculate the distance to the nearest X, e.g. in poverty mapping, the distance to the nearest hospital, road, etc. We should also support this eventually.
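For reference, GeoPandas (>= 0.10) already exposes a nearest join that an implementation could build on; a minimal sketch with hypothetical file names, run in a projected CRS so distances are in meters:

import geopandas as gpd

aois = gpd.read_file("aois.geojson").to_crs("EPSG:3857")
hospitals = gpd.read_file("hospitals.geojson").to_crs("EPSG:3857")

# Adds a dist_to_hospital column with the distance (in CRS units)
# from each AOI to its nearest hospital.
joined = gpd.sjoin_nearest(aois, hospitals, how="left", distance_col="dist_to_hospital")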
Originally posted by mosesckim July 12, 2022
Would like to request a progress bar for aggregation, using tqdm's pandas integration (ref: https://github.com/tqdm/tqdm#pandas-integration).
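A minimal sketch of that integration (the dataframe here is hypothetical):

import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

df = pd.DataFrame({"zone": [1, 1, 2], "value": [3.0, 4.0, 5.0]})
# progress_apply behaves like apply but renders a tqdm progress bar.
stats = df.groupby("zone")["value"].progress_apply(lambda s: s.mean())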
Hi! I'm currently using the grid generation tutorial and would like to clarify whether the output grids in the example are in degrees or meters.
In the create grids section, the tutorial mentions that the units of the grids are dependent on the units of the projection.
Create a grid generator with a size of 50000. The units of the grid size are dependent on the projection of the geodataframe, in this case, EPSG:4326.
When printing out the CRS of region3_gdf, it shows that the units are in decimal degrees.
>> region3_gdf.crs # CRS info
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
However, the output grids in the tutorial are described as being at kilometer scale:
grid_generator5k = grids.SquareGridGenerator(
region3_gdf, 5000
) # 5 km x 5 km square cells
Let me know also if this is the correct place for this question or if I should place this in Discussions instead. Thanks! :D
We should leverage the benefits of nbdev to lower the friction of trying out Geowrangler's modules by making it easy to download and open the notebooks as jupyter notebooks.
Here's a gist of the notebook running on Colab.
The error message is
ValueError: 'box_aspect' and 'fig_aspect' must be positive
Upon initial investigation, it seems that there might be something wrong with the generated h3 tile coordinates.
The total bounds of the source gdf and the generated h3 gdfs seem to have the lat/long coordinates interchanged.
Right now, the create_raster_zonal_stats method of the raster zonal stats module only supports a single band per call. Being able to handle multiple bands might be a nice enhancement, to minimize having to do multiple passes on the same raster file across the same zones.
The underlying rasterstats module has some suggestions on how to handle this (which I'm documenting below so that it may serve as a future reference for an implementation).
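For instance, a minimal sketch looping over bands with rasterstats' band parameter (file names and band count here are hypothetical):

import rasterstats

zones = "zones.geojson"   # vector file of zone polygons
raster = "multiband.tif"  # raster with 3 bands

# One zonal_stats pass per band over the same zones; a native multiband
# API could avoid re-reading the zone geometries each time.
per_band_stats = {
    band: rasterstats.zonal_stats(zones, raster, band=band, stats=["mean", "count"])
    for band in (1, 2, 3)
}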
enhancement request: add asset index computation result as additional column to original data frame
https://github.com/thinkingmachines/geowrangler/blob/master/geowrangler/dhs.py#L144
I ran poetry install and got the following error:
• Updating nbdev (1.2.8 -> 1.2.9 c151342): Failed
CalledProcessError
Command '['git', '--git-dir', '/home/jt/.cache/pypoetry/virtualenvs/geowrangler-U9oiUrW5-py3.9/src/nbdev/.git', '--work-tree', '/home/jt/.cache/pypoetry/virtualenvs/geowrangler-U9oiUrW5-py3.9/src/nbdev', 'checkout', 'c15134220b4d6b96dd67952e27a57a8e5c1bf4c3']' returned non-zero exit status 128.
at ~/.poetry/lib/poetry/utils/_compat.py:217 in run
213│ process.wait()
214│ raise
215│ retcode = process.poll()
216│ if check and retcode:
→ 217│ raise CalledProcessError(
218│ retcode, process.args, output=stdout, stderr=stderr
219│ )
220│ finally:
221│ # None because our context manager __exit__ does not use them.
This error also came up once we merged #30 to master (https://github.com/thinkingmachines/geowrangler/runs/6833093902), but not when the PR was reviewed: https://github.com/thinkingmachines/geowrangler/actions/runs/2453771035
In the GridGenerator class here, the overall bounding box (minx, miny, maxx, maxy) is automatically derived from the projected gdf. Perhaps it would be good to make the overall bounding box an optional parameter. If None, it could automatically compute the bounding box from the gdf; otherwise it uses the user-defined bounding box.
Having a user-defined bounding box is useful for making consistently defined grids, especially in cases where our AOIs do not necessarily encompass the entire country. We can define our overall bounding box based on the country admin boundaries, and regardless of what AOI we supply as a gdf input, the x and y coordinates of the grid tiles remain consistent.
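A sketch of the suggested behavior (the bounds parameter and helper function are hypothetical, not the current API):

import geopandas as gpd

def resolve_grid_bounds(gdf: gpd.GeoDataFrame, bounds=None):
    # If no bounds are supplied, fall back to deriving them from the AOI,
    # as the GridGenerator does today.
    if bounds is None:
        return tuple(gdf.total_bounds)  # (minx, miny, maxx, maxy)
    # Otherwise use the user-defined box (e.g. country admin boundaries),
    # so tile origins stay consistent across different AOIs.
    return bounds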
From https://github.com/thinkingmachines/geowrangler/runs/7194843190?check_suite_focus=true
converting: /home/runner/work/geowrangler/geowrangler/notebooks/index.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/02_vector_zonal_stats.ipynb
An error occurred while executing the following cell:
------------------
show_doc(_fix_agg)
------------------
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 show_doc(_fix_agg)
NameError: name '_fix_agg' is not defined
NameError: name '_fix_agg' is not defined
converting: /home/runner/work/geowrangler/geowrangler/notebooks/tutorial.geometry_validation.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/00_validation.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/tutorial.grids.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/00_grids.ipynb
converting: /home/runner/work/geowrangler/geowrangler/notebooks/tutorial.vector_zonal_stats.ipynb
Conversion failed on the following:
02_vector_zonal_stats.ipynb
Tried refactoring my code to generate Ookla vector zonal stats using Bing Tile quadkeys instead of the regular spatial joins.
In theory, it should be faster. But in practice, regular vector zonal stats takes a few seconds (~15-30s) while the Bing tile version takes too long (I tried running it for ~13 minutes and interrupted it). Not sure if I'm doing something wrong.
Here's a Colab notebook to replicate it:
https://colab.research.google.com/drive/1IdwTu2oQjL6fBPwe-Kgk1KReeIc0n8NW#scrollTo=3FYr2rJu-wwj
The notebook needs this file: phl_tiles.csv
A current hunch is that doing raw spatial joins benefits from spatial indexes, but the implementation of matching/aggregating by quadkeys does not.
Related to Feature #42
Having conditionally enabled extensions is a code smell. Extensions should be optional for users. We can add a config file that enables them for dev and disables them for end users:
https://ipython.readthedocs.io/en/stable/config/extensions/#using-extensions
https://jupyter-notebook.readthedocs.io/en/stable/config.html
Colab notebook for testing:
Scenario:
Could it be a typo, since the key error mentions "GeoWrangleer_aoi_index"?
In case it's relevant, noting that the function works when I passed in grid tiles instead of the Subang Regency gdf itself.
Ran into this error when trying to do bing tile vector zonal stats. The root cause was loading a dataframe from a CSV: the quadkeys were being automatically interpreted as ints by pandas.
On the user side, it's easy enough to just manually fix this and change the column dtype to string. But wondering if, in the bing tile vector zonal stats function, we can (and should) just do this auto-conversion for convenience (at least for the purposes of joining based on quadkeys)?
Related to Feature #42
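For reference, the user-side fix is a one-liner ("quadkey" is the assumed column name):

import pandas as pd

# Read quadkeys as strings up front; casting to str after the fact cannot
# recover leading zeros that int parsing already dropped.
df = pd.read_csv("phl_tiles.csv", dtype={"quadkey": str})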
Creating an issue to track that Vector Zonal Stats currently supports primarily point geometries.
There is a workaround where you simply convert these non-point geometries into points by getting their centroids; from there you can use the existing VZS feature. But that might cause some imprecision.
A popular dataset example that doesn't come in the form of points is Ookla, where data is in the form of tiles.
Case 1: AOI Tile is smaller than data tile (Ookla)
For example, if your AOI tiles are 1m x 1m, but Ookla tiles are 30m x 30m, and assuming they are aligned (all AOI tiles are within an Ookla tile), then ideally, the average download speed of the Ookla tile should be attributed to all 900 AOI tiles within it. However, if we convert to point geometries, only one of the AOI tiles will intersect with the Ookla tile centroid and get the right attributes; the rest will be null.
Case 2: AOI Tile is bigger than data tile (Ookla)
In the reverse case where the AOI tile is bigger than the Ookla tile (which is likely the majority case), this should be more tolerable because it is likely that each AOI tile will get multiple Ookla tile centroids anyway, leading to a reasonable approximation.
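For Case 1, a polygon-to-polygon join would attribute each data tile's values to every intersecting AOI tile; a minimal sketch with hypothetical GeoDataFrames:

import geopandas as gpd

aoi_tiles = gpd.read_file("aoi_tiles.geojson")      # e.g. 1 m x 1 m tiles
ookla_tiles = gpd.read_file("ookla_tiles.geojson")  # e.g. 30 m x 30 m tiles

# Every AOI tile intersecting an Ookla tile inherits its attributes
# (e.g. avg_d_kbps), instead of only the tile containing the centroid.
joined = gpd.sjoin(aoi_tiles, ookla_tiles, how="left", predicate="intersects")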