tudelftgeodesy / stmtools

Xarray extension for Space-Time Matrix

Home Page: https://tudelftgeodesy.github.io/stmtools/

License: Apache License 2.0

Python 36.97% Jupyter Notebook 63.03%
insar interferometry psi radar scatterer space-time

stmtools's Introduction

STMtools: Space Time Matrix Toolbox

[Badges: DOI | PyPI | Build | Quality Gate Status | OpenSSF Best Practices | License]

STMTools (Space-Time Matrix Tools) is an Xarray extension for Space-Time Matrix (Bruna et al., 2021; van Leijen et al., 2021). It provides tools to read, write, enrich, and manipulate a Space-Time Matrix (STM).

An STM is a data array containing data with a space (point, location) component and a time (epoch) component, as well as contextual data. STMTools uses Xarray's multi-dimensional labeling and Zarr's chunked storage to efficiently read and write large Space-Time Matrices.

The contextual data enrichment functionality is implemented with Dask, so it can be performed in parallel on High-Performance Computing (HPC) systems.

At this stage, STMTools focuses specifically on radar interferometry measurements, e.g. Persistent Scatterers and Distributed Scatterers, with the possibility of being extended to other measurements with space and time attributes.

Installation

STMtools can be installed from PyPI:

pip install stmtools

or from the source:

git clone git@github.com:TUDelftGeodesy/stmtools.git
cd stmtools
pip install .

Note that Python version >=3.10 is required for STMtools.

Documentation

For more information on usage and examples of STMTools, please refer to the documentation.

References

[1] Bruna, M. F. D., van Leijen, F. J., & Hanssen, R. F. (2021). A Generic Storage Method for Coherent Scatterers and Their Contextual Attributes. In 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) Proceedings (pp. 1970-1973). IEEE. https://doi.org/10.1109/IGARSS47720.2021.9553453

[2] van Leijen, F. J., van der Marel, H., & Hanssen, R. F. (2021). Towards the Integrated Processing of Geodetic Data. In 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) Proceedings (pp. 3995-3998). IEEE. https://doi.org/10.1109/IGARSS47720.2021.9554887

stmtools's People

Contributors

cpranav93, rogerkuou, sarahalidoost, vanlankveldthijs


stmtools's Issues

Extract time information in `from_csv` function

Following the discussion in #39, the from_csv function should extract time coordinates by assuming a time string pattern in the header of the space-time columns. The following default behavior should be applied (see the sketch after this list):

  1. Assume the string pattern yyyymmdd for time information.
  2. Assume the same pattern is shared by all space-time columns.
  3. The time coordinates should be stored in np.datetime64 format, which is the format adopted by Xarray.
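
A minimal sketch of that default behavior (hypothetical column names; not the actual stmtools implementation):

import re
import pandas as pd

# Hypothetical space-time column headers carrying a yyyymmdd suffix.
columns = ["amplitude_20210101", "amplitude_20210112", "amplitude_20210123"]

# 1) assume the yyyymmdd pattern; 2) assume all space-time columns share it
dates = [re.search(r"\d{8}", c).group() for c in columns]

# 3) store the time coordinate as np.datetime64, the format adopted by Xarray
time = pd.to_datetime(dates, format="%Y%m%d").values
print(time.dtype)  # datetime64[ns]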

Implement attributes in STM based on MATLAB code

We need to define the attributes in STM according to the existing implementation in MATLAB.
In the docs of stmread.m, the Space Time Matrix structure ST has the following fields

%      .datasetId       Solution Identifier             char string (free)
%      .techniqueId     Technique Identifier            char string (reserved)
%      .datasetAttrib   Dataset Attributes              struct
%      .techniqueAttrib Technique Attributes            struct 
%      .globalAttrib    Global Attributes               struct
%      .numPoints       Number of Points                int scalar
%      .numEpochs       Number of Epochs                int scalar
%      .pntName         Point name                      cell array [numPoints] 
%      .pntCrd          Point Coordinates (lat,lon,h)   double [numPoints,3] matrix
%      .pntAttrib       Point Attributes ....           table  [numPoints,* ] 
%      .epochDyear      Decimal year                    double [numEpochs] matrix 
%      .epochAttrib     Epoch Attributes                table [numEpochs,*] 
%      .parTypes        Parameter Types                 cell array  (reserved)
%      .obsTypes        Observation Types               cell array  (reserved)
%      .auxTypes        Auxiliary data Types            cell array (free)
%      .obsData         Observation Space Time Matrix   single [numPoints,numEpochs,dimObs]
%      .auxData         Auxiliary data Space Time Matrix  single [numPoints,numEpoch,dimAux]
%      .sensitivityMatrix  Sensitivity Matrix           single [numPoints,dimPar,dimObs]
%      .stochModel      Stochastic model description    cell array of strings
%      .stochData       Stochastic model data           single [numPoints,numEpochs,dimStochData]
%      .inputDatasets() Structure array with for each input dataset  struct array 

We would like to have this implemented as similarly as possible in the following:

  • Extended Xarray Dataset object, i.e. xr.Dataset.stm
  • Zarr output.

Things to discuss:

  • Go through all the attributes; each of the above attributes should be either:
    • one of the following under Dataset: 1) data_vars; 2) coordinates; 3) attrs, or
    • a function returning it under .stm (one possible mapping is sketched after this list)
  • Should we create a hierarchy in Xarray? E.g. xr.Dataset.stm.obsData.phase or just xr.Dataset.stm.phase?
  • Should we create a hierarchy in Zarr?
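
For discussion, a minimal sketch of one possible mapping (purely illustrative; the variable names and the split over data_vars/coords/attrs are assumptions):

import numpy as np
import xarray as xr

num_points, num_epochs = 4, 3

# One way the MATLAB fields could land in an xr.Dataset:
#   obsData / auxData      -> data_vars on (space, time)
#   pntCrd / epochDyear    -> coords on space / time
#   datasetId, techniqueId -> attrs
stm = xr.Dataset(
    data_vars={
        "phase": (("space", "time"), np.zeros((num_points, num_epochs), dtype="float32")),
    },
    coords={
        "space": np.arange(num_points),
        "time": np.arange(num_epochs),
        "lat": ("space", np.zeros(num_points)),
        "lon": ("space", np.zeros(num_points)),
    },
    attrs={"datasetId": "example_id", "techniqueId": "InSAR"},
)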

Add performance test: should we add it?

Add profiling tests on:

  1. STM subset/enrichment with a multi-polygon on a sorted STM should be faster than on a non-sorted STM.
  2. The time to load a CSV should scale roughly linearly with the number of rows.

The profiling should be executed only locally.

We can recommend running this test locally in CONTRIBUTE.md.

Dask warning when applying polygon subsetting

/home/caroline-oku/miniconda3/envs/jupyter_dask/lib/python3.11/site-packages/distributed/worker.py:2988: UserWarning: Large object of size 1.53 MiB detected in task graph: 
  ([('points',), <xarray.IndexVariable 'points' (poi ... 799999]), {}],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  warnings.warn(

Unit test for duplicated points in an STM

Issue coming from a discussion in PR #66

A duplicated coordinate (lat, lon, time) will cause the spatio-temporal query enrich_from_dataset to fail. We need to create a check function to validate that there are no duplicated 3D coordinates in an STM (see the sketch after the quote below).

Also quoting Sarah's comments here, which are worth considering:

@rogerkuou to check if the points are unique, the test np.unique(ds['lat'].values).shape == ds['lat'].values.shape is not enough, because it only checks for duplicates in one dimension (here lat); points can, for example, be located on one line.
Instead, we need a test for cases where (lat, lon, time) are duplicated. Functions like xarray.Dataset.drop_duplicates and pandas.DataFrame.duplicated can be used to write a test, but these functions only work on dims and not coords. In our case, lat and lon are coords and space is the dim, so we might need to use unstack, which leads to memory problems.
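
A possible check along these lines, sketched with pandas (assuming lat and lon are coords on the space dim; a time column could be added the same way if needed):

import pandas as pd
import xarray as xr

def has_duplicated_points(ds: xr.Dataset) -> bool:
    """Return True if any (lat, lon) pair occurs more than once along the space dim."""
    coords = pd.DataFrame({"lat": ds["lat"].values, "lon": ds["lon"].values})
    return bool(coords.duplicated().any())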

A reordering function for stmtools

We would like to have a reordering function for stmtools, so that spatially close points are also close in the point order. This will benefit the enrichment function (a possible approach is sketched after the example below).

requirements:

  • It only loads 2D coordinates needed for reordering.
  • The reordering only needs to be applied on the point dimension.
  • The other data variables and coordinates should remain delayed.

Example application:

import xarray as xr
stmat = xr.open_zarr('stm.zarr')

# Reorder stmat
stmat_reorder = stmat.stm.reorder(xlabel='X', ylabel='Y')

An example dataset can be retrieved from here.
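
A possible approach, only a sketch (the point dimension name and the coarse-grid ordering are assumptions, not the eventual implementation): compute an order index from the 2D coordinates only, then apply it with isel so the other variables stay lazy.

import numpy as np
import xarray as xr

def reorder(stm: xr.Dataset, xlabel="X", ylabel="Y", cell=500.0) -> xr.Dataset:
    """Sort points by coarse grid cell so nearby points become neighbours (sketch)."""
    # Only the 2D coordinates are loaded into memory.
    x = np.asarray(stm[xlabel].values)
    y = np.asarray(stm[ylabel].values)
    col = np.floor((x - x.min()) / cell)
    row = np.floor((y - y.min()) / cell)
    order = np.lexsort((col, row))  # row-major walk over the coarse grid
    # isel only reorders the point dimension; other variables remain dask-backed.
    return stm.isel(points=order)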

How to handle the time reference system

Although most of the time coordinates are in UTC, we need to come up with a way to handle time in different reference systems (one existing option is shown after the questions below).

  • Is there a Python implementation storing time references and doing the conversion?
  • Is there an existing Xarray extension doing this?
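
One existing option is astropy.time, which stores the time scale explicitly and converts between scales on demand; a small example (not a recommendation for stmtools yet):

from astropy.time import Time

t_utc = Time("2021-07-01T12:00:00", scale="utc")  # a UTC epoch
print(t_utc.tai.isot)  # the same instant expressed in TAI
print(t_utc.gps)       # seconds in the GPS time scale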

Controlling output chunk size and type

I am loading the test full-pixel_psi_amsterdam_tsx_asc_t116_v4_ampl_std_H_c16643.csv.part CSV file (246 MB) with the from_csv function and writing it out as a Zarr store.

When I am loading the STM data from the CSV file I get the following dataset:

stm = stmtools.from_csv(CSV_STM_PATH, output_chunksize={'space': 25_000, 'time': -1})
stm = stm.persist()
stm
<xarray.Dataset>
Dimensions:                (space: 50000, time: 198)
Coordinates:
  * space                  (space) int64 0 1 2 3 4 ... 49996 49997 49998 49999
  * time                   (time) int64 0 1 2 3 4 5 ... 192 193 194 195 196 197
    lat                    (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    lon                    (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
Data variables: (12/25)
    pnt_id                 (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_flags              (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_line               (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_pixel              (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_height             (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight          (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    ...                     ...
    pnt_std_linear         (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_quadratic      (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_seasonal       (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    deformation            (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>
    amplitude              (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>
    h2ph                   (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>

So all variables and coordinates (dimension coordinates excluded) have object dtype, and the chunk size is correctly set as specified in the input.

Only after writing the dataset to Zarr is the data type resolved. However, the chunks of the variables that depend on both space and time are now modified (see e.g. amplitude below):

stm.to_zarr(ZARR_STM_PATH, mode='w')
stm_ = xr.open_zarr(ZARR_STM_PATH)
print(stm_)
<xarray.Dataset>
Dimensions:                (space: 50000, time: 198)
Coordinates:
    lat                    (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    lon                    (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
  * space                  (space) int64 0 1 2 3 4 ... 49996 49997 49998 49999
  * time                   (time) int64 0 1 2 3 4 5 ... 192 193 194 195 196 197
Data variables: (12/25)
    amplitude              (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    deformation            (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    h2ph                   (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    pnt_ampconsist         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight          (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight_highres  (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    ...                     ...
    pnt_seasonal_sin       (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_defo           (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_height         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_linear         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_quadratic      (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_seasonal       (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>

My questions:

  1. Why is the chunk size modified when writing to Zarr?
  2. Can one figure out the correct data type already when loading the CSV file?
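
For question 1, one workaround (not necessarily the intended fix) is to pin the on-disk chunks explicitly through the encoding argument of to_zarr, so the writer keeps the Dask chunking instead of choosing its own:

# Pin the Zarr chunks of the space-time variables to the (25000, 198) Dask chunks.
encoding = {var: {"chunks": (25_000, 198)} for var in ["amplitude", "deformation", "h2ph"]}
stm.to_zarr(ZARR_STM_PATH, mode="w", encoding=encoding)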

Density subsetting improvements

Freek mentioned there is an existing algorithm implemented in SkyGeo. It could be a valuable inspiration for improving the density option in the subset function.

Potential implementation:

  • x and y coordinates: from pixel coords + pixel size
  • x and y coordinates: from georeferenced coords, lat lon need conversion

Make a draft flowchart: user selection on the contextual data

Make a flowchart to demonstrate the data flow of the STM contextual data query.

Current idea: (function -> metadata -> data)

  1. Function: xarray.stm.query(name_data_type)
  2. Metadata: JSON files with href to the data location
  3. Data: hosted on our own file system or remotely

Investigate catalog method for contextual data archiving

Make a workflow for cataloging contextual datasets. The purpose is to make contextual data easy to use for enrichment by STMTools.

  1. Download the following datasets and related metadata:
     • BAG (building registrations; represents large-volume polygons, 3 GB zipped, 7 GB as GeoPackage)
     • BRP bodem (soil map; represents small-volume polygons, 140 MB GeoPackage)
     • A static raster? (is this needed? AHN?)
     • A NetCDF file with spatio-temporal data from ECMWF (ERA5 / KNMI weather stations)?
  2. Preprocessing:
     • CRS conversion
     • Digest metadata: can it be easily read in Python?
  3. Cloud-optimized saving:
     • What is the best file format? Should we already use chunked storage like Zarr/GeoParquet?
  4. Create a catalog:
     • Is a STAC catalog the best choice?

From CSV to Zarr

I am loading the test full-pixel_psi_amsterdam_tsx_asc_t116_v4_ampl_std_H_c16643.csv.part CSV file (246 MB) with the from_csv function and writing it out as a Zarr store. Everything runs locally with the default LocalCluster.

Doing the following takes a long time (3 min 55 s) and prints a warning repeatedly (27 times), from which I gather that the file is maybe being read multiple times?

%%time
stm = stmtools.from_csv(CSV_STM_PATH, blocksize=100e6)
stm.to_zarr(ZARR_STM_PATH, mode='w')
Output
<timed exec>:2: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
(the same SerializationWarning is printed 27 times in total)
CPU times: user 24.5 s, sys: 5.73 s, total: 30.2 s
Wall time: 3min 55s
<xarray.backends.zarr.ZarrStore at 0x16b550740>

If I instead persist the data before writing it out, I only get the warning once and the full load/write-out is considerably faster (33.1 s):

%%time
stm = stmtools.from_csv(CSV_STM_PATH, blocksize=100e6)
stm = stm.persist()
stm.to_zarr(ZARR_STM_PATH, mode='w')
Output
<timed exec>:3: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
CPU times: user 10.3 s, sys: 2.3 s, total: 12.6 s
Wall time: 33.1 s
<xarray.backends.zarr.ZarrStore at 0x1685ebd80>
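
Following the warning's own suggestion, coercing the object-typed variables to fixed-size dtypes before writing should avoid both the repeated warning and the extra passes over the CSV; a sketch (which variables are numeric, and the target dtype, are assumptions here):

stm = stmtools.from_csv(CSV_STM_PATH, blocksize=100e6)

# Coerce object-typed numeric variables to a fixed-size dtype, as the
# SerializationWarning suggests; string-like columns (e.g. pnt_id) would need
# a fixed-width string dtype such as "U32" instead.
numeric_vars = [name for name in stm.data_vars if name != "pnt_id"]
stm = stm.assign({name: stm[name].astype("float64") for name in numeric_vars})
stm.to_zarr(ZARR_STM_PATH, mode="w")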

Performance warning in re-order

The demo notebook on reordering gives a performance warning when calling reorder:

# Time the reordering operation.
time_ordering = %timeit -o stmat.copy().stm.reorder(xlabel="azimuth", ylabel="range")
time_ordering
/storage/miniforge3/envs/mbl_stmtools/lib/python3.11/site-packages/xarray/core/indexing.py:1430: PerformanceWarning: Slicing with an out-of-order index is generating 230 times more chunks
  return self.array[key]

This seems to be inevitable since we are performing a reorder. It seems we can:

  1. Suppress the warning (see the sketch below)
  2. Check some Dask ordering options, e.g. https://docs.dask.org/en/stable/order.html
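
If we go with option 1, the warning can be filtered on its message so that other performance warnings still surface (just a sketch):

import warnings

# Silence only the out-of-order-slicing warning raised during reorder.
warnings.filterwarnings(
    "ignore",
    message="Slicing with an out-of-order index is generating",
)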

Querying temporal attributes to an STM

Add an enrich_from_stm() function to enrich an STM with spatio-temporal data (another STM).

This function should take another xr.Dataset with the same space/time dimension and CRS.

The enrichment should support block-wise operation, like enrich_from_polygon.

An example usage adapted from the documentation example:

from pathlib import Path

import xarray as xr

# Read temperature data to query
ds_temperature = xr.open_dataset('temperature.nc')

# Read existing stm data
path_stm = Path('./stm.zarr')
stmat = xr.open_zarr(path_stm)

# Enrich
stmat.stm.enrich_from_stm(ds_temperature, space_label=('lat', 'lon'), time_label='time')

Working example: enrich the STM (Zarr) over Amsterdam in the demo with the KNMI daily mean temperature.

A "from_csv" function in STMtools

In some situations, researchers would like to build an STM object from an existing CSV file.

One example can be found in this notebook, where Cell 11 and Cell 14 give the skeleton of from_csv. Dask DataFrame is used to handle the large CSV in a delayed way. However, due to the text nature of the CSV file, the chunk size needs to be computed before performing lazy operations (as implemented in Cell 11). Cell 14 contains an implementation that walks through all columns and separates them according to the column names. A rough sketch of the approach follows below.
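
A rough sketch of that skeleton (hypothetical column names; not the notebook's exact code): read the CSV lazily with Dask DataFrame, then split the columns by name into per-point and per-epoch variables.

import dask.dataframe as dd
import xarray as xr

ddf = dd.read_csv("stm.csv", blocksize=100e6)

# Columns whose names carry an epoch suffix are space-time variables; the rest
# are per-point attributes (the "amplitude_" naming pattern is an assumption).
st_cols = [c for c in ddf.columns if c.startswith("amplitude_")]
pnt_cols = [c for c in ddf.columns if c not in st_cols]

# lengths=True computes the partition sizes up front, which is the
# "chunk size must be computed first" step for a text-based format.
stm = xr.Dataset(
    {
        "amplitude": (("space", "time"), ddf[st_cols].to_dask_array(lengths=True)),
        **{c: ("space", ddf[c].to_dask_array(lengths=True)) for c in pnt_cols},
    }
)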

software release

  • Tests
    • already there
    • lint test
  • Documentation ---> @rogerkuou
    • add logo
    • mkdoc rendering
    • Function Examples
    • Add tutorial
    • Add contributing or developer guide
  • GitHub actions ---> @SarahAlidoost
    • Add a pyproject.toml file
      - Update doc link
      - Update change log link
      - Update author names
    • Build package and documentation on push and pull request
    • Deploy docs on release
    • Test using sonarcloud!
      - set up sonar (one-time)
      - setup sonar and github secrets token
      - add sonar properties file
  • Update citations and author names
    • cff
    • zenodo json
  • README ---> @SarahAlidoost
    • Update description
    • Installation example
    • Checklist Badge: OpenSSF Best Practices
    • Other badges:
      • Sonarcloud
  • Publish to PyPI
    • GH workflow

Space-Time matrix query from polygon fields

Define a member function of the SpaceTimeMatrix class, which queries a field from a polygon file with the coordinates of each point, then returns the query results as a new data variable (a possible implementation is sketched after the example below).

Expected result:

import geopandas as gpd
import xarray as xr
import stmat

# Read polygon
polygon = gpd.read_file('example_polygon.shp')

# Initiate a dataset, ds contains both data and coordinates
ds = xr.Dataset(...)

# This will add a new data variable "field_name_str" to ds
ds = ds.stm.query_polygon(polygon, "field_name_str")
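
Under the hood this could be a point-in-polygon join with geopandas; a sketch with hypothetical coordinate names (lat, lon) and the assumption that the polygons do not overlap (so the left join yields one row per point):

import geopandas as gpd
import xarray as xr

def query_polygon(ds: xr.Dataset, polygon: gpd.GeoDataFrame, field: str) -> xr.Dataset:
    """Attach `field` from the polygons to every point of the STM (sketch)."""
    points = gpd.GeoDataFrame(
        geometry=gpd.points_from_xy(ds["lon"].values, ds["lat"].values),
        crs=polygon.crs,
    )
    joined = gpd.sjoin(points, polygon[[field, "geometry"]], how="left", predicate="within")
    return ds.assign({field: ("space", joined[field].values)})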

stm.subset(method='polygon', polygon=path_polygon) expects points and time dimensions without checking

It just crashes at the end (a possible guard is sketched after the traceback):

File ~/Workspace/stmtools/stm/stm.py:152, in SpaceTimeMatrix.subset(self, method, **kwargs)
144 case other:
145 raise NotImplementedError(
146 "Method: {} is not implemented.".format(method)
147 )
148 chunks = {
149 "points": min(
150 self._obj.chunksizes["points"][0], data_xr_subset.points.shape[0]
151 ),
--> 152 "time": min(self._obj.chunksizes["time"][0], data_xr_subset.time.shape[0]),
153 }
155 data_xr_subset = data_xr_subset.chunk(chunks)
157 return data_xr_subset

File ~/mambaforge/envs/jupyter_dask/lib/python3.11/site-packages/xarray/core/utils.py:455, in Frozen.__getitem__(self, key)
454 def __getitem__(self, key: K) -> V:
--> 455 return self.mapping[key]

KeyError: 'time'
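
One possible guard (just an illustration, not the actual fix) would be to re-apply the original chunking only for dimensions the subset actually has:

import xarray as xr

def _rechunk_like(original: xr.Dataset, subset: xr.Dataset) -> xr.Dataset:
    """Re-apply the original chunking to a subset, skipping dims it no longer has."""
    chunks = {
        dim: min(original.chunksizes[dim][0], subset.sizes[dim])
        for dim in original.chunksizes
        if dim in subset.sizes
    }
    return subset.chunk(chunks)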

Example Python production script

Now that we have a good Jupyter notebook demoing sarxarray and stm, it would be good to also have it as Python scripts, with instructions.

Expected deliverables:

  1. Executable Python script working with a Dask-SLURM cluster (a minimal sketch follows below).
  2. README file with instructions for executing the data library.
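
A minimal sketch of how such a script could set up the Dask-SLURM cluster with dask_jobqueue (queue and resource values are placeholders):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder SLURM resources; adjust to the target HPC system.
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=4)  # request 4 SLURM jobs
client = Client(cluster)

# ... open the STM, run subsetting/enrichment, write results to Zarr ...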

Warnings in Unit Test

======================================= warnings summary ========================================
tests/test_io.py::TestFromCSV::test_readcsv_dims
tests/test_io.py::TestFromCSV::test_readcsv_vars
tests/test_io.py::TestFromCSV::test_readcsv_output_chunksize
tests/test_io.py::TestFromCSV::test_readcsv_custom_var_name
tests/test_io.py::TestFromCSV::test_readcsv_list_coords
tests/test_io.py::TestFromCSV::test_readcsv_custom_coords
tests/test_io.py::TestFromCSV::test_readcsv_timevalues
tests/test_io.py::TestFromCSV::test_readcsv_timevalues
tests/test_io.py::TestFromCSV::test_readcsv_dtypes
  /home/oku/miniforge3/envs/mobyle/lib/python3.11/site-packages/dask/dataframe/core.py:3828: UserWarning: Dask currently has limited support for converting pandas extension dtypes to arrays. Converting string to object dtype.

Query from AHN DEM data

The AHN data can be one format of the contextual data. It is neither raster (if we are not using the GeoTIFF version) nor vector data (it would be too big to fit in a GeoPandas dataframe), but a large point cloud. Besides, one may need to query AHN with 3D coordinates instead of 2D.

Future warning `Dataset.dims`

In: https://github.com/TUDelftGeodesy/stmtools/pull/71/files#diff-0b3352e5838067dc55b6b083d1385268ce032917d9d0d7cac6f5d7bdf1c09913

%%timeit -o
# Time subset operation on chunked STM.
stmat_chunked_subset = stmat_chunked.stm.subset(method='polygon', polygon=path_polygon)
phase = stmat_chunked_subset['phase'].compute()
/storage/MobyLe/stmtools/stmtools/stm.py:134: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  if dim not in self._obj.dims.keys():
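
The warning itself points at the fix: the check at stmtools/stm.py line 134 can test membership against the `Dataset.sizes` mapping (e.g. `if dim not in self._obj.sizes:`) instead of `self._obj.dims.keys()`.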

Update notebooks

  1. Publish the Zarr data in the [example] to Figshare.
  2. Use the second part of the STM example for a stand-alone notebook.
  3. README for the example directory on setup.
  4. Another TU Delft dedicated version for large-scale data.

chunk size calculation error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 5
      2 stack = sarxarray.from_binary(list_slcs, shape, dtype=dtype)
      4 # Load coordinates
----> 5 lat = sarxarray.from_binary(f_lat, shape, vlabel="lat", dtype=np.float32)
      6 lon = sarxarray.from_binary(f_lon, shape, vlabel="lon", dtype=np.float32)
      7 stack = stack.assign_coords(lat = (("azimuth", "range"), lat.squeeze().lat.data), lon = (("azimuth", "range"), lon.squeeze().lon.data))

File ~/miniconda3/envs/jupyter_dask/lib/python3.11/site-packages/sarxarray/_io.py:61, in from_binary(slc_files, shape, vlabel, dtype, chunks, ratio)
     59 # Calculate appropriate chunk size if not user-defined
     60 if -1 in chunks:
---> 61     chunks = _calc_chunksize(shape, dtype, chunks, ratio)
     63 # Read in all SLCs
     64 slcs = None

File ~/miniconda3/envs/jupyter_dask/lib/python3.11/site-packages/sarxarray/_io.py:212, in _calc_chunksize(shape, dtype, chunks, ratio)
    190 def _calc_chunksize(shape, dtype, chunks, ratio):
    191     """
    192     Calculate an optimal chunking size in the azimuth and range direction for
    193     reading with dask and store it in variable `chunks`
   (...)
    210         Default value of [-1, -1] when unmodified activates this function.
    211     """
--> 212     n_elements = 100*1024*1024/dtype.itemsize # Optimal number of elements for a memory size of 200mb (first number)
    213     chunks_az = int(math.ceil((n_elements*ratio)**0.5/1000.0)) * 1000 # Chunking size in azimuth direction up to nearest thousand
    214     chunks_ra = int(math.ceil(n_elements/chunks_az/1000.0)) * 1000 # Chunking size in range direction up to nearest thousand

TypeError: unsupported operand type(s) for /: 'int' and 'getset_descriptor'
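
The traceback suggests that dtype arrives as the NumPy scalar type itself (np.float32), so dtype.itemsize resolves to the class attribute descriptor rather than an integer. A hedged sketch of a fix (not the actual sarxarray patch) is to normalise to a dtype instance first:

import numpy as np

def _element_count(dtype, budget_bytes=100 * 1024 * 1024):
    # np.dtype() accepts np.float32, "float32" or an existing dtype instance,
    # and its itemsize is a plain integer.
    itemsize = np.dtype(dtype).itemsize
    return budget_bytes / itemsize

print(_element_count(np.float32))  # 26214400.0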

Space-time matrix: Subset function

Define a member function of class SpaceTimeMatrix, to return a subset of itself based on a specific criterion.

Possible criteria are (priority high to low):

  • Within a spatial polygon
  • Below/above a certain threshold of a temporal attribute. At this stage, we can implement amplitude dispersion as a start.
  • Below a specific density, e.g. 1 point in every 100m.

A placeholder has been added to stmat.py

An example result:

# Load polygon/multi-polygon using geopandas
import geopandas as gpd
polygon = gpd.read_file('example_polygon.shp')

# Assuming an `xarray.Dataset` exists as `xr_stm`

# Subset with polygon
stm_xr_subset = xr_stm.stm.subset(polygon) # Return all the points within the polygon 

# Subset with amplitude dispersion
stm_xr_subset = xr_stm.stm.subset('amplitude_dispersion', '>0.8') # Return all the points with amplitude dispersion >0.8, can discuss syntax

# Subset with density
stm_xr_subset = xr_stm.stm.subset('density', '200m') # Return points with a 200m/point density

A function to initiate an STM from scratch

Suggestion by Freek:

We currently have functions to read an STM from Zarr and CSV. It would also be easier to have a function (from_dict(), for example) to initiate an STM directly in memory, making assumptions about dimensions (space, time) and pre-filling some parts.

This can be a wrapper creating an xr.Dataset.
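
A minimal sketch of what such a wrapper could look like (names, signature, and defaults are assumptions, not a proposed API):

import numpy as np
import xarray as xr

def from_dict(data: dict, n_space: int, n_time: int) -> xr.Dataset:
    """Initiate an STM-like xr.Dataset in memory, pre-filling the space/time dims."""
    coords = {"space": np.arange(n_space), "time": np.arange(n_time)}
    data_vars = {}
    for name, values in data.items():
        values = np.asarray(values)
        dims = ("space", "time") if values.ndim == 2 else ("space",)
        data_vars[name] = (dims, values)
    return xr.Dataset(data_vars=data_vars, coords=coords)

# Usage: a 2D variable lands on (space, time), a 1D variable on (space,)
stm = from_dict({"phase": np.zeros((10, 5)), "pnt_height": np.zeros(10)}, n_space=10, n_time=5)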

Deprecation warning for Jupyter in mkdocs

warning when executing

mkdocs serve
INFO    -  DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
           given by the platformdirs library.  To remove this warning and
           see the appropriate new directories, set the environment variable
           `JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
           The use of platformdirs will be the default in `jupyter_core` v6
             File "/home/oku/miniforge3/envs/mobyle/lib/python3.11/site-packages/jupyter_core/utils/__init__.py", line 90, in
           deprecation
               warnings.warn(message, DeprecationWarning, stacklevel=stacklevel + 1)
             File "/home/oku/miniforge3/envs/mobyle/lib/python3.11/site-packages/jupyter_client/connect.py", line 22, in
               from jupyter_core.paths import jupyter_data_dir, jupyter_runtime_dir, secure_write

Version check:

mamba list mkdocs
# packages in environment at /home/oku/miniforge3/envs/mobyle:
#
# Name                    Version                   Build  Channel
mkdocs                    1.5.3              pyhd8ed1ab_0    conda-forge
mkdocs-autorefs           0.5.0                    pypi_0    pypi
mkdocs-gen-files          0.5.0                    pypi_0    pypi
mkdocs-jupyter            0.24.6                   pypi_0    pypi
mkdocs-material           9.5.4                    pypi_0    pypi
mkdocs-material-extensions 1.3.1                    pypi_0    pypi
mkdocstrings              0.24.0                   pypi_0    pypi
mkdocstrings-python       1.8.0                    pypi_0    pypi
