tudelftgeodesy / stmtools

Xarray extension for Space-Time Matrix

Home Page: https://tudelftgeodesy.github.io/stmtools/

License: Apache License 2.0

Python 36.97% Jupyter Notebook 63.03%
insar interferometry psi radar scatterer space-time

stmtools's Introduction

STMtools: Space Time Matrix Toolbox

[Badges: DOI | PyPI | Build | Quality Gate Status | OpenSSF Best Practices | License]

STMTools (Space-Time Matrix Tools) is an Xarray extension for Space-Time Matrix (Bruna et al., 2021; van Leijen et al., 2021). It provides tools to read, write, enrich, and manipulate a Space-Time Matrix (STM).

An STM is a data array containing data with a space (point, location) component and a time (epoch) component, as well as contextual data. STMTools uses Xarray's multi-dimensional labeling and Zarr's chunked storage to efficiently read and write large Space-Time Matrices.

The contextual data enrichment functionality is implemented with Dask, so it can be performed in parallel on High-Performance Computing (HPC) systems.

At this stage, STMTools focuses specifically on radar interferometry measurements, e.g. Persistent Scatterers and Distributed Scatterers, with the possibility of being extended to other measurements with space and time attributes.

Installation

STMtools can be installed from PyPI:

pip install stmtools

or from the source:

git clone git@github.com:TUDelftGeodesy/stmtools.git
cd stmtools
pip install .

Note that Python version >=3.10 is required for STMtools.

Documentation

For more information on usage and examples of STMTools, please refer to the documentation.

References

[1] Bruna, M. F. D., van Leijen, F. J., & Hanssen, R. F. (2021). A Generic Storage Method for Coherent Scatterers and Their Contextual Attributes. In 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) Proceedings (pp. 1970-1973). IEEE. https://doi.org/10.1109/IGARSS47720.2021.9553453

[2] van Leijen, F. J., van der Marel, H., & Hanssen, R. F. (2021). Towards the Integrated Processing of Geodetic Data. In 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) Proceedings (pp. 3995-3998). IEEE. https://doi.org/10.1109/IGARSS47720.2021.9554887

stmtools's People

Contributors

cpranav93, rogerkuou, sarahalidoost, vanlankveldthijs


stmtools's Issues

Extract time information in `from_csv` function

Following the discussion in #39, the from_csv function should extract time coordinates by assuming a time string pattern in the header of the space-time columns. The following default behavior should be applied (see the sketch after this list):

  1. Assume the string pattern yyyymmdd for time information.
  2. Assume the same pattern is shared by all space-time columns.
  3. The time coordinates should be stored in np.datetime64 format, which is the format adopted by Xarray.
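
A minimal sketch of that default behavior (hypothetical column names; not the actual stmtools implementation):

import re
import pandas as pd

# Hypothetical space-time column headers carrying a yyyymmdd suffix.
columns = ["amplitude_20210101", "amplitude_20210112", "amplitude_20210123"]

# 1) assume the yyyymmdd pattern; 2) assume all space-time columns share it
dates = [re.search(r"\d{8}", c).group() for c in columns]

# 3) store the time coordinate as np.datetime64, the format adopted by Xarray
time = pd.to_datetime(dates, format="%Y%m%d").values
print(time.dtype)  # datetime64[ns]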

Implement attributes in STM based on MATLAB code

We need to define the attributes in STM according to the existing implementation in MATLAB.
In the docs of stmread.m, the Space Time Matrix structure ST has the following fields

%      .datasetId       Solution Identifier             char string (free)
%      .techniqueId     Technique Identifier            char string (reserved)
%      .datasetAttrib   Dataset Attributes              struct
%      .techniqueAttrib Technique Attributes            struct 
%      .globalAttrib    Global Attributes               struct
%      .numPoints       Number of Points                int scalar
%      .numEpochs       Number of Epochs                int scalar
%      .pntName         Point name                      cell array [numPoints] 
%      .pntCrd          Point Coordinates (lat,lon,h)   double [numPoints,3] matrix
%      .pntAttrib       Point Attributes ....           table  [numPoints,* ] 
%      .epochDyear      Decimal year                    double [numEpochs] matrix 
%      .epochAttrib     Epoch Attributes                table [numEpochs,*] 
%      .parTypes        Parameter Types                 cell array  (reserved)
%      .obsTypes        Observation Types               cell array  (reserved)
%      .auxTypes        Auxiliary data Types            cell array (free)
%      .obsData         Observation Space Time Matrix   single [numPoints,numEpochs,dimObs]
%      .auxData         Auxiliary data Space Time Matrix  single [numPoints,numEpoch,dimAux]
%      .sensitivityMatrix  Sensitivity Matrix           single [numPoints,dimPar,dimObs]
%      .stochModel      Stochastic model description    cell array of strings
%      .stochData       Stochastic model data           single [numPoints,numEpochs,dimStochData]
%      .inputDatasets() Structure array with for each input dataset  struct array 

We would like to have this implemented as similarly as possible in the following:

  • Extended Xarray Dataset object, i.e. xr.Dataset.stm
  • Zarr output.

Things to discuss:

  • Go through all the attributes; each of the above attributes should be either:
    • one of the following under Dataset: 1) data_vars; 2) coordinates; 3) attrs, or
    • a function returning it under .stm (one possible mapping is sketched after this list)
  • Should we create a hierarchy in Xarray? E.g. xr.Dataset.stm.obsData.phase or just xr.Dataset.stm.phase?
  • Should we create a hierarchy in Zarr?
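
For discussion, a minimal sketch of one possible mapping (purely illustrative; the variable names and the split over data_vars/coords/attrs are assumptions):

import numpy as np
import xarray as xr

num_points, num_epochs = 4, 3

# One way the MATLAB fields could land in an xr.Dataset:
#   obsData / auxData      -> data_vars on (space, time)
#   pntCrd / epochDyear    -> coords on space / time
#   datasetId, techniqueId -> attrs
stm = xr.Dataset(
    data_vars={
        "phase": (("space", "time"), np.zeros((num_points, num_epochs), dtype="float32")),
    },
    coords={
        "space": np.arange(num_points),
        "time": np.arange(num_epochs),
        "lat": ("space", np.zeros(num_points)),
        "lon": ("space", np.zeros(num_points)),
    },
    attrs={"datasetId": "example_id", "techniqueId": "InSAR"},
)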

Add performance test: should we add it?

Add profiling tests on:

  1. STM subset/enrichment with a multi-polygon on a sorted STM should be faster than on a non-sorted STM.
  2. The time to load a CSV should scale roughly linearly with the number of rows.

The profiling should be executed only locally.

We can recommend running this test locally in CONTRIBUTE.md.

Dask warning when applying polygon subsetting

/home/caroline-oku/miniconda3/envs/jupyter_dask/lib/python3.11/site-packages/distributed/worker.py:2988: UserWarning: Large object of size 1.53 MiB detected in task graph: 
  ([('points',), <xarray.IndexVariable 'points' (poi ... 799999]), {}],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  warnings.warn(

Unit test for duplicated points in an STM

Issue coming from a discussion in PR #66

A duplicated coordinate (lat, lon, time) will cause the spatio-temporal query enrich_from_dataset to fail. We need to create a check function to validate that there are no duplicated 3D coordinates in an STM (see the sketch after the quote below).

Also quoting Sarah's comments here, which are worth considering:

@rogerkuou to check if the points are unique, the test np.unique(ds['lat'].values).shape == ds['lat'].values.shape is not enough, because it only checks for duplicates in one dimension (here lat); points can, for example, be located on one line.
Instead, we need a test for cases where (lat, lon, time) are duplicated. Functions like xarray.Dataset.drop_duplicates and pandas.DataFrame.duplicated can be used to write a test, but these functions only work on dims and not coords. In our case, lat and lon are coords and space is the dim, so we might need to use unstack, which leads to memory problems.
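
A possible check along these lines, sketched with pandas (assuming lat and lon are coords on the space dim; a time column could be added the same way if needed):

import pandas as pd
import xarray as xr

def has_duplicated_points(ds: xr.Dataset) -> bool:
    """Return True if any (lat, lon) pair occurs more than once along the space dim."""
    coords = pd.DataFrame({"lat": ds["lat"].values, "lon": ds["lon"].values})
    return bool(coords.duplicated().any())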

A reordering function for stmtools

We would like to have a reordering function for stmtools, so that spatially close points are also close in the point order. This will benefit the enrichment function (a possible approach is sketched after the example below).

requirements:

  • It only loads 2D coordinates needed for reordering.
  • The reordering only needs to be applied on the point dimension.
  • The other data variables and coordinates should remain delayed.

Example application:

import xarray as xr
stmat = xr.open_zarr('stm.zarr')

# Reorder stmat
stmat_reorder = stmat.stm.reorder(xlabel='X', ylabel='Y')

An example dataset can be retrieved from here.
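
A possible approach, only a sketch (the point dimension name and the coarse-grid ordering are assumptions, not the eventual implementation): compute an order index from the 2D coordinates only, then apply it with isel so the other variables stay lazy.

import numpy as np
import xarray as xr

def reorder(stm: xr.Dataset, xlabel="X", ylabel="Y", cell=500.0) -> xr.Dataset:
    """Sort points by coarse grid cell so nearby points become neighbours (sketch)."""
    # Only the 2D coordinates are loaded into memory.
    x = np.asarray(stm[xlabel].values)
    y = np.asarray(stm[ylabel].values)
    col = np.floor((x - x.min()) / cell)
    row = np.floor((y - y.min()) / cell)
    order = np.lexsort((col, row))  # row-major walk over the coarse grid
    # isel only reorders the point dimension; other variables remain dask-backed.
    return stm.isel(points=order)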

How to handle the time reference system

Although most of the time coordinates are in UTC, we need to come up with a way to handle time in different reference systems (one existing option is shown after the questions below).

  • Is there a Python implementation storing time references and doing the conversion?
  • Is there an existing Xarray extension doing this?
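
One existing option is astropy.time, which stores the time scale explicitly and converts between scales on demand; a small example (not a recommendation for stmtools yet):

from astropy.time import Time

t_utc = Time("2021-07-01T12:00:00", scale="utc")  # a UTC epoch
print(t_utc.tai.isot)  # the same instant expressed in TAI
print(t_utc.gps)       # seconds in the GPS time scale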

Controlling output chunk size and type

I am loading the test full-pixel_psi_amsterdam_tsx_asc_t116_v4_ampl_std_H_c16643.csv.part CSV file (246 MB) with the from_csv function and writing it out as a Zarr store.

When I am loading the STM data from the CSV file I get the following dataset:

stm = stmtools.from_csv(CSV_STM_PATH, output_chunksize={'space': 25_000, 'time': -1})
stm = stm.persist()
stm
<xarray.Dataset>
Dimensions:                (space: 50000, time: 198)
Coordinates:
  * space                  (space) int64 0 1 2 3 4 ... 49996 49997 49998 49999
  * time                   (time) int64 0 1 2 3 4 5 ... 192 193 194 195 196 197
    lat                    (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    lon                    (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
Data variables: (12/25)
    pnt_id                 (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_flags              (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_line               (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_pixel              (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_height             (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight          (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    ...                     ...
    pnt_std_linear         (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_quadratic      (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_seasonal       (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    deformation            (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>
    amplitude              (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>
    h2ph                   (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>

So all variables and coordinates (dimension coordinates excluded) have object dtype, and the chunk size is correctly set as specified in the input.

Only after writing the dataset to Zarr is the data type resolved. However, the chunks of the variables that depend on both space and time are now modified (see e.g. amplitude below):

stm.to_zarr(ZARR_STM_PATH, mode='w')
stm_ = xr.open_zarr(ZARR_STM_PATH)
print(stm_)
<xarray.Dataset>
Dimensions:                (space: 50000, time: 198)
Coordinates:
    lat                    (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    lon                    (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
  * space                  (space) int64 0 1 2 3 4 ... 49996 49997 49998 49999
  * time                   (time) int64 0 1 2 3 4 5 ... 192 193 194 195 196 197
Data variables: (12/25)
    amplitude              (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    deformation            (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    h2ph                   (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    pnt_ampconsist         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight          (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight_highres  (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    ...                     ...
    pnt_seasonal_sin       (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_defo           (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_height         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_linear         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_quadratic      (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_seasonal       (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>

My questions:

  1. Why is the chunk size modified when writing to Zarr?
  2. Can one figure out the correct data type already when loading the CSV file?
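
For question 1, one workaround (not necessarily the intended fix) is to pin the on-disk chunks explicitly through the encoding argument of to_zarr, so the writer keeps the Dask chunking instead of choosing its own:

# Pin the Zarr chunks of the space-time variables to the (25000, 198) Dask chunks.
encoding = {var: {"chunks": (25_000, 198)} for var in ["amplitude", "deformation", "h2ph"]}
stm.to_zarr(ZARR_STM_PATH, mode="w", encoding=encoding)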

Density subsetting improvements

Freek mentioned there is an existing algorithm implemented in SkyGeo. It could be a valuable inspiration for improving the density option in the subset function.

Potential implementation:

  • x and y coordinates: from pixel coords + pixel size
  • x and y coordinates: from georeferenced coords, lat lon need conversion

Make a draft flowchart: user selection on the contextual data

Make a flowchart to demonstrate the data flow of the STM contextual data query.

Current idea: (function -> metadata -> data)

  1. Function: xarray.stm.query(name_data_type)
  2. Metadata: JSON files with href to the data location
  3. Data: hosted on our own file system or remotely

Investigate catalog method for contextual data archiving

Make a workflow for cataloging contextual datasets. The purpose is to make contextual data easy to use for enrichment by STMTools.

  1. Download the following datasets and related metadata:
     • BAG (building registrations; represents large-volume polygons, 3 GB zipped, 7 GB as GeoPackage)
     • BRP bodem (soil map; represents small-volume polygons, 140 MB GeoPackage)
     • A static raster? (is this needed? AHN?)
     • A NetCDF file with spatio-temporal data from ECMWF (ERA5 / KNMI weather stations)?
  2. Preprocessing:
     • CRS conversion
     • Digest metadata: can it be easily read in Python?
  3. Cloud-optimized saving:
     • What is the best file format? Should we already use chunked storage like Zarr/GeoParquet?
  4. Create a catalog:
     • Is a STAC catalog the best choice?

From CSV to Zarr

I am loading the test full-pixel_psi_amsterdam_tsx_asc_t116_v4_ampl_std_H_c16643.csv.part CSV file (246 MB) with the from_csv function and writing it out as a Zarr store. Everything runs locally with the default LocalCluster.

Doing the following takes a long time (3 min 55 s) and prints a warning repeatedly (27 times), from which I gather that the file is maybe being read multiple times?

%%time
stm = stmtools.from_csv(CSV_STM_PATH, blocksize=100e6)
stm.to_zarr(ZARR_STM_PATH, mode='w')
Output
<timed exec>:2: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
(the same SerializationWarning is printed 27 times in total)
CPU times: user 24.5 s, sys: 5.73 s, total: 30.2 s
Wall time: 3min 55s
<xarray.backends.zarr.ZarrStore at 0x16b550740>

If I instead persist the data before writing it out, I only get the warning once and the full load/write-out is considerably faster (33.1 s):

%%time
stm = stmtools.from_csv(CSV_STM_PATH, blocksize=100e6)
stm = stm.persist()
stm.to_zarr(ZARR_STM_PATH, mode='w')
Output
<timed exec>:3: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
CPU times: user 10.3 s, sys: 2.3 s, total: 12.6 s
Wall time: 33.1 s
<xarray.backends.zarr.ZarrStore at 0x1685ebd80>
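
Following the warning's own suggestion, coercing the object-typed variables to fixed-size dtypes before writing should avoid both the repeated warning and the extra passes over the CSV; a sketch (which variables are numeric, and the target dtype, are assumptions here):

stm = stmtools.from_csv(CSV_STM_PATH, blocksize=100e6)

# Coerce object-typed numeric variables to a fixed-size dtype, as the
# SerializationWarning suggests; string-like columns (e.g. pnt_id) would need
# a fixed-width string dtype such as "U32" instead.
numeric_vars = [name for name in stm.data_vars if name != "pnt_id"]
stm = stm.assign({name: stm[name].astype("float64") for name in numeric_vars})
stm.to_zarr(ZARR_STM_PATH, mode="w")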

Performance warning in re-order

The demo notebook on reordering gives a performance warning when calling reorder:

# Time the reordering operation.
time_ordering = %timeit -o stmat.copy().stm.reorder(xlabel="azimuth", ylabel="range")
time_ordering
/storage/miniforge3/envs/mbl_stmtools/lib/python3.11/site-packages/xarray/core/indexing.py:1430: PerformanceWarning: Slicing with an out-of-order index is generating 230 times more chunks
  return self.array[key]

This seems to be inevitable since we are performing a reorder. It seems we can:

  1. Suppress the warning (see the sketch below)
  2. Check some Dask ordering options, e.g. https://docs.dask.org/en/stable/order.html
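
If we go with option 1, the warning can be filtered on its message so that other performance warnings still surface (just a sketch):

import warnings

# Silence only the out-of-order-slicing warning raised during reorder.
warnings.filterwarnings(
    "ignore",
    message="Slicing with an out-of-order index is generating",
)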

Querying temporal attributes to an STM

Add an enrich_from_stm() function to enrich an STM with spatio-temporal data (another STM).

This function should take another xr.Dataset with the same space/time dimension and CRS.

The enrichment should support block-wise operation, like enrich_from_polygon.

An example usage adapted from the documentation example:

from pathlib import Path

import xarray as xr

# Read temperature data to query
ds_temperature = xr.open_dataset('temperature.nc')

# Read existing stm data
path_stm = Path('./stm.zarr')
stmat = xr.open_zarr(path_stm)

# Enrich
stmat.stm.enrich_from_stm(ds_temperature, space_label=('lat', 'lon'), time_label='time')

Working example: enrich the STM (Zarr) over Amsterdam in the demo with the KNMI daily mean temperature.

A "from_csv" function in STMtools

In some situations, researchers would like to build an STM object from an existing CSV file.

One example can be found in this notebook, where Cell 11 and Cell 14 give the skeleton of from_csv. Dask DataFrame is used to handle the large CSV in a delayed way. However, due to the text nature of the CSV file, the chunk size needs to be computed before performing lazy operations (as implemented in Cell 11). Cell 14 contains an implementation that walks through all columns and separates them according to the column names. A rough sketch of the approach follows below.
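
A rough sketch of that skeleton (hypothetical column names; not the notebook's exact code): read the CSV lazily with Dask DataFrame, then split the columns by name into per-point and per-epoch variables.

import dask.dataframe as dd
import xarray as xr

ddf = dd.read_csv("stm.csv", blocksize=100e6)

# Columns whose names carry an epoch suffix are space-time variables; the rest
# are per-point attributes (the "amplitude_" naming pattern is an assumption).
st_cols = [c for c in ddf.columns if c.startswith("amplitude_")]
pnt_cols = [c for c in ddf.columns if c not in st_cols]

# lengths=True computes the partition sizes up front, which is the
# "chunk size must be computed first" step for a text-based format.
stm = xr.Dataset(
    {
        "amplitude": (("space", "time"), ddf[st_cols].to_dask_array(lengths=True)),
        **{c: ("space", ddf[c].to_dask_array(lengths=True)) for c in pnt_cols},
    }
)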

software release

  • Tests
    • already there
    • lint test
  • Documentation ---> @rogerkuou
    • add logo
    • mkdoc rendering
    • Function Examples
    • Add tutorial
    • Add contributing or developer guide
  • GitHub actions ---> @SarahAlidoost
    • Add a pyproject.toml file
      - Update doc link
      - Update change log link
      - Update author names
    • Build package and documentation on push and pull request
    • Deploy docs on release
    • Test using sonarcloud!
      - set up sonar (one-time)
      - setup sonar and github secrets token
      - add sonar properties file
  • Update citations and author names
    • cff
    • zenodo json
  • README ---> @SarahAlidoost
    • Update description
    • Installation example
    • Checklist Badge: OpenSSF Best Practices
    • Other badges:
      • Sonarcloud
  • Publish to PyPI
    • GH workflow

Space-Time matrix query from polygon fields

Define a member function of the SpaceTimeMatrix class, which queries a field from a polygon file with the coordinates of each point, then returns the query results as a new data variable (a possible implementation is sketched after the example below).

Expected result:

import geopandas as gpd
import xarray as xr
import stmat

# Read polygon
polygon = gpd.read_file('example_polygon.shp')

# Initiate a dataset, ds contains both data and coordinates
ds = xr.Dataset(...)

# This will add a new data variable "field_name_str" to ds
ds = ds.stm.query_polygon(polygon, "field_name_str")
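
Under the hood this could be a point-in-polygon join with geopandas; a sketch with hypothetical coordinate names (lat, lon) and the assumption that the polygons do not overlap (so the left join yields one row per point):

import geopandas as gpd
import xarray as xr

def query_polygon(ds: xr.Dataset, polygon: gpd.GeoDataFrame, field: str) -> xr.Dataset:
    """Attach `field` from the polygons to every point of the STM (sketch)."""
    points = gpd.GeoDataFrame(
        geometry=gpd.points_from_xy(ds["lon"].values, ds["lat"].values),
        crs=polygon.crs,
    )
    joined = gpd.sjoin(points, polygon[[field, "geometry"]], how="left", predicate="within")
    return ds.assign({field: ("space", joined[field].values)})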

stm.subset(method='polygon', polygon=path_polygon) expects points and time dimensions without checking

It just crashes at the end (a possible guard is sketched after the traceback):

File ~/Workspace/stmtools/stm/stm.py:152, in SpaceTimeMatrix.subset(self, method, **kwargs)
144 case other:
145 raise NotImplementedError(
146 "Method: {} is not implemented.".format(method)
147 )
148 chunks = {
149 "points": min(
150 self._obj.chunksizes["points"][0], data_xr_subset.points.shape[0]
151 ),
--> 152 "time": min(self._obj.chunksizes["time"][0], data_xr_subset.time.shape[0]),
153 }
155 data_xr_subset = data_xr_subset.chunk(chunks)
157 return data_xr_subset

File ~/mambaforge/envs/jupyter_dask/lib/python3.11/site-packages/xarray/core/utils.py:455, in Frozen.__getitem__(self, key)
454 def __getitem__(self, key: K) -> V:
--> 455 return self.mapping[key]

KeyError: 'time'
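
One possible guard (just an illustration, not the actual fix) would be to re-apply the original chunking only for dimensions the subset actually has:

import xarray as xr

def _rechunk_like(original: xr.Dataset, subset: xr.Dataset) -> xr.Dataset:
    """Re-apply the original chunking to a subset, skipping dims it no longer has."""
    chunks = {
        dim: min(original.chunksizes[dim][0], subset.sizes[dim])
        for dim in original.chunksizes
        if dim in subset.sizes
    }
    return subset.chunk(chunks)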

Example Python production script

Now that we have a good Jupyter notebook demoing sarxarray and stm, it would be good to also have it as Python scripts, with instructions.

Expected deliverables:

  1. Executable Python script working with a Dask-SLURM cluster (a minimal sketch follows below).
  2. README file with instructions for executing the data library.
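
A minimal sketch of how such a script could set up the Dask-SLURM cluster with dask_jobqueue (queue and resource values are placeholders):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder SLURM resources; adjust to the target HPC system.
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=4)  # request 4 SLURM jobs
client = Client(cluster)

# ... open the STM, run subsetting/enrichment, write results to Zarr ...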

Warnings in Unit Test

======================================= warnings summary ========================================
tests/test_io.py::TestFromCSV::test_readcsv_dims
tests/test_io.py::TestFromCSV::test_readcsv_vars
tests/test_io.py::TestFromCSV::test_readcsv_output_chunksize
tests/test_io.py::TestFromCSV::test_readcsv_custom_var_name
tests/test_io.py::TestFromCSV::test_readcsv_list_coords
tests/test_io.py::TestFromCSV::test_readcsv_custom_coords
tests/test_io.py::TestFromCSV::test_readcsv_timevalues
tests/test_io.py::TestFromCSV::test_readcsv_timevalues
tests/test_io.py::TestFromCSV::test_readcsv_dtypes
  /home/oku/miniforge3/envs/mobyle/lib/python3.11/site-packages/dask/dataframe/core.py:3828: UserWarning: Dask currently has limited support for converting pandas extension dtypes to arrays. Converting string to object dtype.

Query from AHN DEM data

The AHN data can be one format of the contextual data. It is neither raster (if we are not using the GeoTIFF version) nor vector data (it would be too big to fit in a GeoPandas dataframe), but a large point cloud. Besides, one may need to query AHN with 3D coordinates instead of 2D.

Future warning `Dataset.dims`

In: https://github.com/TUDelftGeodesy/stmtools/pull/71/files#diff-0b3352e5838067dc55b6b083d1385268ce032917d9d0d7cac6f5d7bdf1c09913

%%timeit -o
# Time subset operation on chunked STM.
stmat_chunked_subset = stmat_chunked.stm.subset(method='polygon', polygon=path_polygon)
phase = stmat_chunked_subset['phase'].compute()
/storage/MobyLe/stmtools/stmtools/stm.py:134: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  if dim not in self._obj.dims.keys():
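
The warning itself points at the fix: the check at stmtools/stm.py line 134 can test membership against the `Dataset.sizes` mapping (e.g. `if dim not in self._obj.sizes:`) instead of `self._obj.dims.keys()`.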

Update notebooks

  1. Publish the Zarr data in the [example] to Figshare.
  2. Use the second part of the STM example for a stand-alone notebook.
  3. README for the example directory on setup.
  4. Another TU Delft dedicated version for large-scale data.

chunk size calculation error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 5
      2 stack = sarxarray.from_binary(list_slcs, shape, dtype=dtype)
      4 # Load coordinates
----> 5 lat = sarxarray.from_binary(f_lat, shape, vlabel="lat", dtype=np.float32)
      6 lon = sarxarray.from_binary(f_lon, shape, vlabel="lon", dtype=np.float32)
      7 stack = stack.assign_coords(lat = (("azimuth", "range"), lat.squeeze().lat.data), lon = (("azimuth", "range"), lon.squeeze().lon.data))

File ~/miniconda3/envs/jupyter_dask/lib/python3.11/site-packages/sarxarray/_io.py:61, in from_binary(slc_files, shape, vlabel, dtype, chunks, ratio)
     59 # Calculate appropriate chunk size if not user-defined
     60 if -1 in chunks:
---> 61     chunks = _calc_chunksize(shape, dtype, chunks, ratio)
     63 # Read in all SLCs
     64 slcs = None

File ~/miniconda3/envs/jupyter_dask/lib/python3.11/site-packages/sarxarray/_io.py:212, in _calc_chunksize(shape, dtype, chunks, ratio)
    190 def _calc_chunksize(shape, dtype, chunks, ratio):
    191     """
    192     Calculate an optimal chunking size in the azimuth and range direction for
    193     reading with dask and store it in variable `chunks`
   (...)
    210         Default value of [-1, -1] when unmodified activates this function.
    211     """
--> 212     n_elements = 100*1024*1024/dtype.itemsize # Optimal number of elements for a memory size of 200mb (first number)
    213     chunks_az = int(math.ceil((n_elements*ratio)**0.5/1000.0)) * 1000 # Chunking size in azimuth direction up to nearest thousand
    214     chunks_ra = int(math.ceil(n_elements/chunks_az/1000.0)) * 1000 # Chunking size in range direction up to nearest thousand

TypeError: unsupported operand type(s) for /: 'int' and 'getset_descriptor'
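
The traceback suggests that dtype arrives as the NumPy scalar type itself (np.float32), so dtype.itemsize resolves to the class attribute descriptor rather than an integer. A hedged sketch of a fix (not the actual sarxarray patch) is to normalise to a dtype instance first:

import numpy as np

def _element_count(dtype, budget_bytes=100 * 1024 * 1024):
    # np.dtype() accepts np.float32, "float32" or an existing dtype instance,
    # and its itemsize is a plain integer.
    itemsize = np.dtype(dtype).itemsize
    return budget_bytes / itemsize

print(_element_count(np.float32))  # 26214400.0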

Space-time matrix: Subset function

Define a member function of class SpaceTimeMatrix, to return a subset of itself based on a specific criterion.

Possible criteria are (priority high to low):

  • Within a spatial polygon
  • Below/above a certain threshold of a temporal attribute. At this stage, we can implement amplitude dispersion as a start.
  • Below a specific density, e.g. 1 point in every 100m.

A placeholder has been added to stmat.py

An example result:

# Load polygon/multi-polygon using geopandas
import geopandas as gpd
polygon = gpd.read_file('example_polygon.shp')

# Assuming an `xarray.Dataset` exists as `xr_stm`

# Subset with polygon
stm_xr_subset = xr_stm.stm.subset(polygon) # Return all the points within the polygon 

# Subset with amplitude dispersion
stm_xr_subset = xr_stm.stm.subset('amplitude_dispersion', '>0.8') # Return all the points with amplitude dispersion >0.8, can discuss syntax

# Subset with density
stm_xr_subset = xr_stm.stm.subset('density', '200m') # Return points with a 200m/point density

A function to initiate an STM from scratch

Suggestion by Freek:

We currently have functions to read an STM from Zarr and CSV. It would also be easier to have a function (from_dict(), for example) to initiate an STM directly in memory, making assumptions about dimensions (space, time) and pre-filling some parts.

This can be a wrapper creating an xr.Dataset.
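
A minimal sketch of what such a wrapper could look like (names, signature, and defaults are assumptions, not a proposed API):

import numpy as np
import xarray as xr

def from_dict(data: dict, n_space: int, n_time: int) -> xr.Dataset:
    """Initiate an STM-like xr.Dataset in memory, pre-filling the space/time dims."""
    coords = {"space": np.arange(n_space), "time": np.arange(n_time)}
    data_vars = {}
    for name, values in data.items():
        values = np.asarray(values)
        dims = ("space", "time") if values.ndim == 2 else ("space",)
        data_vars[name] = (dims, values)
    return xr.Dataset(data_vars=data_vars, coords=coords)

# Usage: a 2D variable lands on (space, time), a 1D variable on (space,)
stm = from_dict({"phase": np.zeros((10, 5)), "pnt_height": np.zeros(10)}, n_space=10, n_time=5)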

Deprecation warning for Jupyter in mkdocs

warning when executing

mkdocs serve
INFO    -  DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
           given by the platformdirs library.  To remove this warning and
           see the appropriate new directories, set the environment variable
           `JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
           The use of platformdirs will be the default in `jupyter_core` v6
             File "/home/oku/miniforge3/envs/mobyle/lib/python3.11/site-packages/jupyter_core/utils/__init__.py", line 90, in
           deprecation
               warnings.warn(message, DeprecationWarning, stacklevel=stacklevel + 1)
             File "/home/oku/miniforge3/envs/mobyle/lib/python3.11/site-packages/jupyter_client/connect.py", line 22, in
               from jupyter_core.paths import jupyter_data_dir, jupyter_runtime_dir, secure_write

Version check:

mamba list mkdocs
# packages in environment at /home/oku/miniforge3/envs/mobyle:
#
# Name                    Version                   Build  Channel
mkdocs                    1.5.3              pyhd8ed1ab_0    conda-forge
mkdocs-autorefs           0.5.0                    pypi_0    pypi
mkdocs-gen-files          0.5.0                    pypi_0    pypi
mkdocs-jupyter            0.24.6                   pypi_0    pypi
mkdocs-material           9.5.4                    pypi_0    pypi
mkdocs-material-extensions 1.3.1                    pypi_0    pypi
mkdocstrings              0.24.0                   pypi_0    pypi
mkdocstrings-python       1.8.0                    pypi_0    pypi
