cate's Introduction

Cate: ESA CCI Toolbox

ESA CCI Toolbox (Cate) Python package, API and CLI.

Installation

Cate can be installed into a new or existing Python 3.7 Miniconda or Anaconda environment as follows:

$ conda install -c ccitools cate-cli

Installation from Sources

Cate's sources (this repository) are organised as follows:

  • setup.py - main build script to be run with Python 3.6+
  • cate/ - main package and production code
  • test/ - test package and test code
  • doc/ - documentation in Sphinx/RST format

We recommend installing Cate into an isolated Python 3 environment, because this approach avoids clashes with existing versions of Cate's 3rd-party Python package requirements. Using Miniconda or Anaconda will usually avoid platform-specific issues caused by module native binaries.

The first step is to clone the latest Cate code and step into the checkout directory:

$ git clone https://github.com/CCI-Tools/cate.git
$ cd cate

Using Conda

Conda is the package manager used by the Miniconda or Anaconda Python distributions.

Creating a new Python environment for Cate will require around 2.2 GB of disk space on Linux/Darwin and 1.2 GB on Windows. To create a new Conda environment cate-env in your Anaconda/Miniconda installation directory, type:

$ conda env create

If you want the environment to be installed in another location, e.g. due to disk space limitations, type:

$ conda env create --prefix some/other/location/for/cate

The next step is to activate the new environment:

$ conda activate cate-env

You can now safely install Cate sources into the new cate-env environment.

(cate-env) $ python setup.py install

Using Docker

You can also use pre-built Docker images that contain a Python environment with the cate package already installed. The images are named quay.io/bcdev/cate:<version>, e.g.

$ docker run -it -v ${my_local_dir}:/home/cate quay.io/bcdev/cate:2.1.1 bash
(cate-env) $ cate -h  

where ${my_local_dir} refers to any directory on your computer that you may want to access from within the running Docker container.

Getting started

To test the installation, first run the Cate command-line interface. Type

$ cate -h

IPython notebooks for various Cate use cases are on the way; they will appear in the project's notebooks folder.

To use them interactively, you'll need to install Jupyter and run its Notebook app:

$ conda install jupyter
$ jupyter notebook

Open the notebooks folder and select a use case.

Running Cate App in Stand-Alone mode

To run the graphical user interface Cate App in stand-alone mode, you'll need to start a Cate Web API service. To do so, first install the cate Python package as described above, then start the Cate Web API service from the command line. To run the service on port 9090 on your local computer, type:

$ cate-webapi-start --port 9090 

Then open Cate App in a browser and enter the URL http://localhost:9090. Press the "Cate Stand-Alone Mode" button above. This will launch the Cate App in stand-alone mode. If you wish to run a service with limited file system access (sandboxed), you can specify the root option that defines a new file system root:

$ cate-webapi-start --port 9090 --root /home/fritz

Use CTRL+C or the command

$ cate-webapi-stop --port 9090

to stop the service.

To run the service from the docker image, type:

$ docker run -it -v ${my_local_dir}:/home/cate -p 9090:4000 quay.io/bcdev/cate:2.1.1 bash
(cate-env) $ cate-webapi-start --port 4000 --root /home/cate

Note that inside the container the mounted directory appears as /home/cate, hence that path is passed as the root.

Conda Deployment

There is a dedicated repository, cate-conda, which provides scripts and configuration files to build Cate's Conda packages and a stand-alone installer.

Development

Contributors

Contributors are asked to read and adhere to our Developer Guide.

Unit-testing

For unit testing we use pytest and its coverage plugin pytest-cov.

To run the unit-tests with coverage, type

$ export NUMBA_DISABLE_JIT=1
$ py.test --cov=cate test

We need to set the environment variable NUMBA_DISABLE_JIT to disable JIT compilation by Numba, so that coverage reaches the actual Python code. We use Numba's JIT compilation to speed up numerically intensive Python code.
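
For illustration, a hypothetical Numba-accelerated function: with NUMBA_DISABLE_JIT=1 the @jit decorator becomes a no-op, so pytest-cov can trace the plain Python body:

from numba import jit

@jit(nopython=True)
def column_sum(values):
    # With NUMBA_DISABLE_JIT=1 this runs as ordinary Python,
    # so the coverage tracer sees every line of the loop.
    total = 0.0
    for v in values:
        total += v
    return total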

Other recognized environment variables to customize the unit-level tests are

CATE_DISABLE_WEB_TESTS=1
CATE_DISABLE_PLOT_TESTS=1
CATE_DISABLE_GEOPANDAS_TESTS=1
CATE_DISABLE_CLI_UPDATE_TESTS=1

Generating the Documentation

We use the wonderful Sphinx tool to generate Cate's documentation on ReadTheDocs. If there is a need to build the docs locally, first create a Conda environment:

$ cd cate
$ conda env create -f environment-rtd.yml

To regenerate the HTML docs, type

$ cd doc
$ make html

License

The CCI Toolbox is distributed under terms and conditions of the MIT license.

cate's Issues

CLI should offer interactive plotting

The CLI should offer some basic, interactive plotting so that users can inspect intermediate results when they work with workspace resources:

ect res set RES1 OP1 ...
ect res set RES2 OP1 ...
ect res plot RES1 RES2 --var X1,X2

Allow proxy configuration for remote CCI ODP access

Some firewall configurations may prevent user computers from directly accessing the CCI ODP services.

In such cases the ECT shall allow for the configuration of a proxy server. The user manual shall include a section on how to configure the proxy.

  • Network connection settings should be configurable through $HOME/.cate/conf.py; see also the cate.conf package and the sketch below.
  • Document the value
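
A minimal sketch of what such a configuration might look like; the http_proxy key is an assumption for illustration, not a confirmed cate.conf setting:

# $HOME/.cate/conf.py -- hypothetical example
# 'http_proxy' is an assumed key name; check the cate.conf package
# for the actual setting recognized by Cate.
http_proxy = 'http://user:password@proxy.example.com:8080'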

Python API docs shall include linked type annotations

Although we use Sphinx's sphinx_autodoc_annotation option, the generated type information is not linked with the type's API docs. For example, if a type is xarray.Dataset, the type annotation shall link to the xarray.Dataset API documentation.

This is very important, as we use xarray.Dataset as our common data model.
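
One common way to achieve such linking in Sphinx is intersphinx; a sketch of a conf.py excerpt (the inventory URLs are assumptions):

# doc/conf.py (excerpt) -- a sketch, not the actual Cate configuration
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.intersphinx',
    'sphinx_autodoc_annotation',
]
# Map external projects so that type names such as xarray.Dataset
# resolve to links into their published API docs.
intersphinx_mapping = {
    'python': ('https://docs.python.org/3/', None),
    'xarray': ('https://docs.xarray.dev/en/stable/', None),
}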

Standard operations should have short names

In the CLI, and when writing their own Workflow JSON files, users must currently use fully qualified type names to identify an operation. A fully qualified type name comprises the (Python) package path and the actual function name. Instead of

$ ect op info ect.ops.coregistration.coregister

users should be able to simply type

$ ect op info coregister

For the standard ECT operations - the ones defined within the ect-core repository - we should define aliases or allow for using the function name only (without the package path prefix).

Allow CCI ODP data store to use OPeNDAP

Opening datasets via OPeNDAP would free ECT (software and users) from downloading individual data files from ESA's CCI ODP.

The real power of OPeNDAP lies in its ability to access data subsets.

  • Advantage: No local data files to be stored. Option to implement subsets comprising only individual variables. Option to implement spatial, temporal data subsets.
  • Disadvantage: No caching takes place; therefore each analysis requires re-downloading the same data.

Interactive CLI using WebAPI service

The following CLI command sequence represents a typical ECT session:

ect ws init
ect res read X input.nc
ect res set Y func1 ds=X p1=53.2
ect res set Z func2 ds=Y p2='nearest'
ect res write Z output.nc

While the ect res read and ect res set commands just alter the workspace workflow, ect res write will finally execute the workflow and write the results to disk.

It is highly desirable to also be able to inspect intermediate results such as X and Y in the example above, e.g. by plotting and printing them. However, this means that the workflow would be executed for each such inspection command, possibly taking a lot of data retrieval and processing time. To get around this limitation, intermediate resource values should somehow be retained between CLI calls so that they don't need to be recomputed.

This feature is directly related to #25 because a cached state would be needed to validate against cached, intermediate resource values whose types would otherwise remain unknown.

Issue #27 is a part of this issue.

Implementation idea

All ect res commands could be executed by delegating them to a running ECT WebAPI service that could hold intermediate resource values in a two-level cache: The first cache maps a resource name to a resource value, a Python in-memory object. If it does not exist, a second cache could map a resource name to a file which has been previously written by a ect res write command. The resource value would be read from the file and then put into the first cache. Cached values would be removed as soon as any of the providing workflow steps are modified or removed.
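
A minimal sketch of that two-level lookup (all names and the pickle-based file format are illustrative assumptions):

import os
import pickle

class TwoLevelResourceCache:
    """Sketch only. First level: in-memory objects. Second level: files
    previously written by an 'ect res write' command."""

    def __init__(self, file_dir: str):
        self._memory = {}      # resource name -> Python object
        self._file_dir = file_dir

    def get(self, name: str):
        if name in self._memory:
            return self._memory[name]
        path = os.path.join(self._file_dir, name + '.pickle')
        if os.path.isfile(path):
            with open(path, 'rb') as fp:
                value = pickle.load(fp)
            self._memory[name] = value
            return value
        return None

    def invalidate(self, name: str):
        # Called when a providing workflow step is modified or removed.
        self._memory.pop(name, None)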

One problem remains: the ECT WebAPI service should start transparently once it is required by an ect command. However, it should also stop if it is no longer needed; otherwise a (Python) process would linger in the background, presumably allocating lots of RAM. How can we tell that the service is no longer needed? Here are some suggestions:

  1. Service may stop automatically after a (configurable) duration of user inactivity.
  2. Service may stop once all previously open workspaces are closed. Commands like ect ws init and a new ect ws open would start the WebAPI service if it's not already running. A new ect ws close would close a workspace, while ect ws closeall (or ect ws close --all) would close them all at once.
  3. Service may be stopped explicitly. ect ws exit would close all workspaces and directly stop the service.

Note, the ECT WebAPI service is also an essential building block of the ECT GUI.

Unions can not be used with isinstance

(ect_env)ccitbx@ccitbx-VirtualBox:~/Development/ect-core$ ect res set cc_tot ect.ops.select.select_var ds=cl07 var=cc_total
Using FSWorkspaceManager
ect res: error: Unions cannot be used with isinstance().

Apparently ect res does not like my new select_var implementation with

var: Union[None, str, List[str]] = None
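
The underlying problem is that typing.Union types cannot be passed to isinstance(). A sketch of a workaround using typing introspection (typing.get_origin/get_args, available in newer Python versions):

import typing
from typing import List, Union

def matches_annotation(value, annotation) -> bool:
    # Unions must be unfolded into their member types first.
    if typing.get_origin(annotation) is Union:
        return any(matches_annotation(value, arg)
                   for arg in typing.get_args(annotation))
    if annotation is type(None):
        return value is None
    if typing.get_origin(annotation) is list:
        # Element types are not checked in this sketch.
        return isinstance(value, list)
    return isinstance(value, annotation)

assert matches_annotation('cc_total', Union[None, str, List[str]])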

Let CLI display variable names of a dataset

The CLI shall provide a command which displays the names of dataset variables. For example

ect ds info DS_ID -v

Output shall include the netCDF variable name and, if available, also the units, CF standard name and description (i.e. long name).

This is essential because otherwise there is no chance for users to pass variable references, which may be required when setting up a workflow JSON or by some other CLI commands such as ect run.

ops.select.select_variables should not take a list

select_variables takes a list, which is not ideal for CLI operation. It would be better to use a single regex string and always use regex for searching the dataset. Then one could select two or more precisely named variables by simply passing the following string, which uses the regex OR operator:

'variable1|variable2'
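
A minimal sketch of what such a regex-based select_variables could look like (the body is illustrative, not the actual implementation):

import re
import xarray as xr

def select_variables(ds: xr.Dataset, var: str) -> xr.Dataset:
    # Always treat 'var' as a regular expression; 'variable1|variable2'
    # then selects exactly the two named variables.
    pattern = re.compile(var)
    names = [name for name in ds.data_vars if pattern.fullmatch(name)]
    return ds[names]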

Derive temporal dataset coverage from ODP services

We are currently using the ESGF service at CEDA of the ESA CCI ODP.

Its responses do not include the temporal dataset coverage, which is urgently required when we e.g. perform CLI commands such as

ect ds sync DATASET_ID START_DATE STOP_DATE

This information is also urgently required for the GUI, e.g. for displaying a dashboard that shows availability of data in time.

Progress monitors may prompt users

Progress monitors may prompt users, before progress observation begins, as to whether they want to continue. For example, some process may detect that large amounts of data would need to be downloaded before some task can be completed. In such a situation it would be nice if the ECT asked the user whether it is OK to do so.

This new API feature is a consequence of issue #21.

Hint: We could enhance the ConsoleMonitor implementation so that it conditionally halts in its start() method and displays a command-line prompt to be answered with Do you really want ... bla ... [Y]es/No?; if answered with No, an InterruptedError may be raised.
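
A sketch of that hint; the class name and all details are assumptions:

class ConfirmingConsoleMonitor:
    """Sketch only: a console monitor that conditionally asks the user
    for confirmation before progress observation begins."""

    def __init__(self, confirm_message: str = None):
        self._confirm_message = confirm_message

    def start(self, label: str, total_work: float = None):
        if self._confirm_message:
            answer = input(self._confirm_message + ' [Y]es/No? ')
            if answer.strip().lower() in ('n', 'no'):
                raise InterruptedError('Operation cancelled by user')
        # ... begin normal progress observation here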

Let CLI display locally available datasets

All CCI data is provided remotely by the ESA ODP. The ECT stores data downloaded from ESA ODP in a cache directory in the user's local file system. Users need a CLI command to identify the locally available datasets so that they can perform operations that won't imply fetching new data from the ESA ODP.

For example, the command

ect ds info --local

would list all locally cached datasets with their temporal coverage.

Introduce new Workspace concept

A Workspace shall be a collection of named operation executions and related data(set) files. A workspace preserves the order of operations a user has executed, where the first operations are usually loading of datasets from given data sources and the last ones are storing data to files using a given format or mime-type. Read more in Workspaces.

Use ESA CCI ODP web services

ECT shall make direct use of the ESA CCI ODP web services and the provided datasets shall serve as ECT's primary data source. The current ECT implementation is based on the ESA CCI ODP FTP server where we create a local index from scanning all files in the FTP tree (slow!).

ECT shall use the ESA CCI ODP web services, namely the ESGF Search RESTful API, to update meta-information about available (CCI) datasets. That is, the CLI shall print index info from the ESGF index when users type ect ds list and ect ds info DS_ID (DS_ID = dataset_id of the index). The file download required by the ect ds sync DS_ID command shall use a suitable URL provided by the index for a given DS_ID.

Unfortunately, the ESGF index does not provide any time coverage information about its datasets. Using FTP, we extracted time info from filenames.

Operations should recognize auxiliary variables

Selected ECT operations should be able to correctly recognize auxiliary variables associated with a measurement variable. For example

  • uncertainty variables provided for a measurement may be considered when deriving measurement statistics
  • uncertainty variables may be resampled using a different methodology
  • when extracting a measurement variable from a dataset, its auxiliary variables may be automatically extracted too
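
A sketch of how such associations could be discovered, assuming CF conventions, where a variable lists its auxiliary variables in an ancillary_variables attribute:

import xarray as xr

def auxiliary_variables(ds: xr.Dataset, var_name: str):
    # CF conventions store auxiliary variable names in the
    # 'ancillary_variables' attribute, separated by blanks.
    names = ds[var_name].attrs.get('ancillary_variables', '').split()
    return [name for name in names if name in ds]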

Pearson correlation only calculates positive correlation

Pearson correlation now uses abs() in its calculation, always resulting in a positive correlation. This was done because, for some data, e.g.:

esacci.CLOUD.mon.L3C.CLD_PRODUCTS.AVHRR.NOAA-15.AVHRR_NOAA.1-0.r1 2007-01-01 2007-12-30
esacci.OZONE.mon.L3.NP.multi-sensor.multi-platform.MERGED.fv0002.r1 2007-01-01 2007-12-30

ect res set cc_tot select_var ds=cl07 var=cc_total
ect res set oz_tot select_var ds=oz07 var=O3_du_tot

the calculation results in sqrt(-x).

It has to be determined how to calculate the correlation properly in this situation, e.g. whether abs() can be used.
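
A sign-preserving formulation avoids the problem entirely, as sketched below: the sign comes from the covariance in the numerator, while only non-negative sums of squares appear under the square root, so no sqrt(-x) can occur and abs() becomes unnecessary:

import numpy as np

def pearson_correlation(x: np.ndarray, y: np.ndarray) -> float:
    xd = x - x.mean()
    yd = y - y.mean()
    # The numerator carries the sign; the radicand is a product of
    # sums of squares and is therefore never negative.
    return float((xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum()))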

Allow workflow execution to use cached values

The runtime performance of subsequently executed workflows could be drastically increased if a value cache could be used so that some steps don't need to be executed again. Subsequent workflow execution will occur in the ECT WebAPI service, which is an implementation detail of issue #32.

Caching shall occur only if allowed by a certain workflow step. Actually, caching shall occur only for steps whose operation can be said to be a pure function: the function output depends solely on the function input, with no side effects and no random values generated.

Therefore, an operation's meta-info header should include information on whether its output values are allowed to be cached.
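
A sketch of such gating; the cacheable meta-info key and the step attributes are illustrative assumptions:

_value_cache = {}

def execute_step(step, args: tuple):
    # 'step.op_name', 'step.meta_info' and 'step.invoke' are assumed
    # attributes; 'cacheable' is a hypothetical meta-info key.
    # 'args' must be hashable to serve as a cache key.
    cacheable = step.meta_info.get('cacheable', False)
    key = (step.op_name, args)
    if cacheable and key in _value_cache:
        return _value_cache[key]
    value = step.invoke(*args)
    if cacheable:
        _value_cache[key] = value
    return value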

Coregistration should be able to handle datasets with more dimensions than lat/lon/time

Currently coregistration works well only if both datasets to be coregistered have exactly the lat/lon/time dimensions. This limitation should not exist, as it should be possible to bring two datasets onto the same spatial lat/lon grid irrespective of how many other dimensions there are.

For example, given two datasets with the dimensions

lat/lon/time

lat/lon/time/layer/pressure,

it should be possible to resample the second dataset to the lat/lon resolution of the first one such that it still preserves the other dimensions, e.g. grouping not only by time but somehow by the other dimensions as well.
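
With xarray, the desired behaviour can be sketched as follows: interpolating over lat/lon only leaves all other dimensions untouched (function and parameter names are illustrative):

import xarray as xr

def coregister(ds_slave: xr.Dataset, ds_master: xr.Dataset) -> xr.Dataset:
    # Interpolation is restricted to lat/lon; all other dimensions
    # (time, layer, pressure, ...) pass through unchanged.
    return ds_slave.interp(lat=ds_master.lat, lon=ds_master.lon,
                           method='linear')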

Analysis report as HTML or PDF

Users performing e.g. an analysis as defined by use case #9 would like a visually appealing report that can be displayed in a browser or written as PDF. The report should include the ECVs used and the operations performed, and shall contain any analysis results, such as plots and statistics tables. The need comes from users who are unsatisfied with the plot image output and who want to be supported in publishing their results.

CLI must validate operation arguments

The CLI already validates operation arguments when they are passed to the ect run OP ... command. It must also validate operation arguments when calling

ect res set OP ...

even though the actual OP execution is deferred until some data is written using ect res write FILE. Otherwise users can create an invalid workspace that is later very hard to debug.

Allow opening dataset comprising multiple local files

Currently ect res open allows for opening datasets from a registered data store only. It should also allow for accessing a dataset comprising multiple local files. Alternatively, we could also introduce an additional command, say ect res mfopen (multi-file open).

Another idea is to add a new data source globally to ECT, in order to make it permanently available:

ect ds add [-r, --recursive] DS_NAME FILE_OR_DIR...

The new data source DS_NAME can then be used like any other (remote) data source. FILE_OR_DIR... is a list of file names and/or directory names which may include wildcards (on Windows).

This issue is a blocker as long as issue #40 is not solved.

Harmonize data schema of ingested datasets ("CDM")

When ECT opens a dataset, e.g. from the CCI ODP, it shall make sure that the resulting (xarray.Dataset) in-memory representation conforms to a unique schema that we can expect in all our data operations. This is also what we refer to as the Common Data Model (CDM).

Harmonization shall address:

  • Different names used for latitude and longitude for both dimensions and coordinate variables
  • Different interpretations of which actual location a given lat/lon coordinate value refers to within a grid cell's boundaries
  • Different names used for time and time boundaries, as well as time data types
  • Variable interpretation, e.g. distinguishing between primary, auxiliary (e.g. uncertainty), and coordinate variables (e.g. latitude)
  • Minimum set of global metadata attributes, e.g. the data source name from which the dataset has been opened, the processing history in the form of an ECT Workflow JSON dump
  • Not-a-value management: missing_value, _FillValue, NaN
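
A sketch of one such harmonization step, normalizing coordinate names (the alias lists are illustrative):

import xarray as xr

_LAT_ALIASES = ('latitude', 'Latitude', 'LAT', 'y')
_LON_ALIASES = ('longitude', 'Longitude', 'LON', 'x')

def harmonize_coord_names(ds: xr.Dataset) -> xr.Dataset:
    # Rename any recognized alias to the canonical 'lat'/'lon' names.
    renames = {}
    for name in list(ds.coords):
        if name in _LAT_ALIASES:
            renames[name] = 'lat'
        elif name in _LON_ALIASES:
            renames[name] = 'lon'
    return ds.rename(renames) if renames else ds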

Use a liberal OSS license

Currently our LICENSE file contains the GPL v3.
However, the GPL is restrictive regarding redistribution of source code, hard to understand (~10 pages of text), incompatible with a number of other, more liberal OSS licenses, and often creates problems when building commercial software.

The proposal is to move to a simple license that is compatible with the majority of other OSS licenses, e.g. the MIT license.

CLI workspace and resource commands must execute asynchronously

Issues #25, #27, and finally #32 led to a software design which requires an ECT WebAPI service running as a second process for all workspace and resource commands submitted by the ECT CLI.

Currently all commands are delegated from the ECT CLI to the ECT WebAPI service, which executes the requested tasks on the main thread of the (Tornado) web service; hence they are executed synchronously. This means that long-running tasks will cause the CLI client to time out. Furthermore, the service cannot respond to new requests, e.g. from another shell or, later, the GUI.

Therefore all CLI workspace and resource commands must be executed asynchronously by the WebAPI service.
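
A sketch of the idea, assuming Tornado 5+: offload long-running work to a thread pool so the I/O loop stays free to accept new requests (the handler name and response payload are illustrative):

from concurrent.futures import ThreadPoolExecutor

import tornado.ioloop
import tornado.web

_executor = ThreadPoolExecutor(max_workers=4)

class ResourceCommandHandler(tornado.web.RequestHandler):
    async def post(self):
        # Run the (potentially long) command on a worker thread so the
        # Tornado I/O loop can keep serving other requests.
        loop = tornado.ioloop.IOLoop.current()
        result = await loop.run_in_executor(_executor, self._run_command)
        self.write({'status': 'ok', 'result': result})

    def _run_command(self):
        ...  # execute the requested workspace/resource command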

How to properly handle datetime in operations?

Given an operation that expects to get a string-formatted datetime, such as '2007-30-10', the CLI apparently translates that to an int and complains about it.

Any ideas how to solve this?

(Screenshot: expected_str_got_int)
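
One possible direction, sketched below: convert CLI strings based on the parameter's annotated type instead of letting a generic literal parser guess (the names are illustrative):

from datetime import datetime

def parse_cli_value(raw: str, annotation):
    # Convert only when the operation declares a datetime parameter;
    # otherwise keep the raw string.
    if annotation is datetime:
        return datetime.strptime(raw, '%Y-%m-%d')
    return raw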

Clean-up after UC9 merge

  • Use @op_return instead of @op_output
  • Change 'filter_dataset' to 'select_variables'
  • Use @op with tags
  • Use 'ds' as the default name for xarray datasets
  • Ensure operations take CLI passable inputs, e.g., use single values where possible
  • Don't use the 'required' keyword
  • Update the 'timeseries' operation so that it checks that the time dimension is there and that data is aggregated along other dimensions.
  • Try to make slow tests faster.

Fully annotate and document public operations

Make sure all public ECT operations are tagged, have type annotations, comprehensive usage and background information, and clear input/output descriptions.

Verify by listing all operations by

ect op list

and then printing information like so

ect op info OP
  • coregister
  • pearson_correlation
  • harmonize
  • open_dataset
  • read_json
  • read_netcdf
  • read_object
  • read_text
  • save_dataset
  • write_json
  • write_netcdf3
  • write_netcdf4
  • write_object
  • write_text
  • plot_map
  • select_var
  • subset_spatial
  • subset_temporal
  • subset_temporal_index
  • timeseries
  • timeseries_mean

Create installer

So far, ECT has no installer which would include a deeply frozen Conda Python environment and provide Desktop/Start menu shortcuts for the CLI (and later also include the frontend binaries and shortcuts).

See also wiki page Software Installation Approach.

Let CLI display supported I/O data formats

CLI must offer a command or option that lists all supported input/output formats. Otherwise users don't know what values to pass to the various CLI commands that offer a format name option.

CLI must lock workspace in use

When the CLI is performing operations within a workspace (ect ws init, ect res set, ...), the workspace should be locked so that another CLI instance cannot accidentally modify files and resources currently in use.

EDIT:

The same applies to the WebAPI; there, the lock must be used in the Workspace class.
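
A sketch of a simple cross-process lock based on exclusive file creation; the lock file location is an assumption, and the .ect-workspace directory is assumed to exist:

import os

def acquire_workspace_lock(workspace_dir: str) -> str:
    # O_EXCL makes creation atomic: a second CLI instance trying to
    # lock the same workspace gets a FileExistsError.
    lock_path = os.path.join(workspace_dir, '.ect-workspace', '.lock')
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return lock_path

def release_workspace_lock(lock_path: str) -> None:
    os.remove(lock_path)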

CLI should be able to clean a workspace

After

ect ws init

we currently have no means to

ect ws clean

which would delete ./.ect-workspace/workflow.json, while

ect ws delete

would delete the entire ./.ect-workspace directory.

Retrieving data from CCI ODP should occur implicitly

Currently, users of the CLI are forced to explicitly "synchronize" a data source so that it is available for computation, processing and analysis:

ect ds sync DS --time START[,END]

It would be far more user-friendly if the ECT CLI (and later the GUI) would implicitly download any required data from the CCI ODP. A cancellable progress monitor shall be displayed in this case. Ideally, the user would be asked before ECT automatically initiates the data download. When prompting, it would also be very helpful if ECT could give an indication of how long the download will take.

Harmonize operation names, their parameters and return values

The names of ECT operations shall be catchy and generally understandable. Ideally, we would use only verbs, because every operation does something with its inputs and returns the result of the operation, which is ideally a pure function.

Different ECT operations that have a parameter for a similar purpose should all use the same parameter name, same data type, and same description, and ideally use the same argument position.

Examples:

  • If an operation transforms an xarray.Dataset, that dataset should always be the first parameter and should always be called ds. Other parameters of type xarray.Dataset should have names using a ds_ prefix.
  • For operations that have a parameter selecting variables from a given dataset, the parameter may be called var and its value may be either None, a str, or a sequence of str. Other rules may apply, e.g. a str may contain comma-separated names or use wildcards etc.; that's fine, but this shall apply to all operations that have the var parameter.
  • For operations that have a parameter identifying a file to read from or write to, the parameter may be called file and its value may be either None, a str (= file path), or a Python file object (see the signature sketch after the list below).
  • coregister
  • pearson_correlation
  • harmonize
  • open_dataset
  • read_json
  • read_netcdf
  • read_object
  • read_text
  • save_dataset
  • write_json
  • write_netcdf3
  • write_netcdf4
  • write_object
  • write_text
  • plot_map
  • select_var
  • subset_spatial
  • subset_temporal
  • subset_temporal_index
  • timeseries
  • timeseries_mean
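
A signature sketch following these conventions (the body is omitted):

from typing import List, Union

import xarray as xr

def select_var(ds: xr.Dataset,
               var: Union[None, str, List[str]] = None) -> xr.Dataset:
    # 'ds' is the dataset being transformed and comes first;
    # 'var' follows the common variable-selection convention.
    ...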
