
noaa-owp / hydrotools


Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.

License: Other

Topics: python, pandas, noaa, hydrology, evaluation, verification, validation, forecasting, modeling, simulation

hydrotools's Introduction


OWPHydroTools

Documentation

OWPHydroTools GitHub pages documentation

Motivation

We developed OWPHydroTools with data scientists in mind. We attempted to ensure that the simplest methods, such as get, both accept and return data structures frequently used in scientific Python. Specifically, this means that pandas.DataFrames, geopandas.GeoDataFrames, and numpy.arrays are the most frequently encountered data structures when using OWPHydroTools. Most methods include sensible defaults that cover common use-cases, but allow customization if required.

We also attempted to adhere to organizational (NOAA-OWP) data standards where they exist. This means pandas.DataFrames will contain column labels like usgs_site_code, start_date, value_date, and measurement_unit, which are consistent with organization-wide naming conventions. Our intent is to make retrieving, evaluating, and exporting data as easy and reproducible as possible for scientists, practitioners, and other hydrological experts.
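For example, retrieving observations with the nwis_client subpackage (detailed later in this document) yields one of these canonical pandas.DataFrames directly:

from hydrotools.nwis_client.iv import IVDataService

# Retrieve one month of streamflow observations as a pandas.DataFrame
# (parameterCd "00060" is USGS streamflow)
data_service = IVDataService(value_time_label="value_time")
observations = data_service.get(
    sites=["01646500"],
    startDT="2019-08-01",
    endDT="2019-09-01",
    parameterCd="00060"
)
print(observations.head())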

What's here?

We've taken a grab-and-go approach to installation and usage of OWPHydroTools. In line with a standard toolbox, you typically install just the tool or tools that get your job done, without having to install all the other tools available. This means a lighter installation load, and new tools can be added to the toolbox without affecting your workflows!

Note that we commonly refer to individual tools in OWPHydroTools as subpackages or by their name (e.g. nwis_client). You will find this lingo in both issues and documentation.

Currently the repository has the following subpackages:

  • events: Variety of methods used to perform event-based evaluations of hydrometric time series

  • nwm_client: Provides methods for retrieving National Water Model data from various sources including Google Cloud Platform and NOMADS

  • metrics: Variety of methods used to compute common evaluation metrics

  • nwis_client: Provides easy to use methods for retrieving data from the USGS NWIS Instantaneous Values (IV) Web Service

  • svi_client: Provides programmatic access to the Centers for Disease Control and Prevention's (CDC) Social Vulnerability Index (SVI)

  • _restclient: A generic REST client with a built-in cache that makes the construction and retrieval of GET requests painless

  • caches: Provides a variety of object caching utilities

UTC Time

Note: the canonical pandas.DataFrames used by OWPHydroTools use timezone-naive datetimes that assume UTC time. In general, do not assume methods are compatible with timezone-aware datetimes or timestamps. Expect methods to transform timezone-aware datetimes and timestamps into their timezone-naive counterparts at UTC time.
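If your data are timezone-aware, a minimal sketch of the expected normalization (convert to UTC, then drop the timezone) looks like this:

import pandas as pd

# A timezone-aware timestamp (US Eastern, daylight time: UTC-04:00)
aware = pd.Timestamp("2021-06-01T12:00", tz="US/Eastern")

# Convert to UTC, then drop the timezone information; for a Series,
# use .dt.tz_convert("UTC").dt.tz_localize(None) instead
naive_utc = aware.tz_convert("UTC").tz_localize(None)
print(naive_utc)  # 2021-06-01 16:00:00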

Usage

Refer to each subpackage's README.md or documentation for examples of how to use each tool.

Installation

In accordance with Python community recommendations, we support and advise the use of virtual environments in any Python workflow. In the following installation guide, we use Python's built-in venv module to create a virtual environment in which the tools will be installed. Note this is just a preference; any Python virtual environment manager should work just fine (conda, pipenv, etc.).

# Create and activate python environment, requires python >= 3.8
$ python3 -m venv venv
$ source venv/bin/activate
$ python3 -m pip install --upgrade pip

# Install all tools
$ python3 -m pip install hydrotools

# Alternatively you can install a single tool
#  This installs the NWIS Client tool
$ python3 -m pip install hydrotools.nwis_client

OWPHydroTools Canonical Format

"Canonical" labels are protected and part of a fixed lexicon. Canonical labels are shared among all hydrotools subpackages. Subpackage methods should avoid changing or redefining these columns where they appear to encourage cross-compatibility. Existing canonical labels are listed below:

  • value [float32]: Indicates the real value of an individual measurement or simulated quantity.
  • value_time [datetime64[ns]]: formerly value_date, this indicates the valid time of value.
  • variable_name [category]: string category that indicates the real-world type of value (e.g. streamflow, gage height, temperature).
  • measurement_unit [category]: string category indicating the measurement unit (SI or standard) of value
  • qualifiers [category]: string category that indicates any special qualifying codes or messages that apply to value
  • series [integer32]: Used to disambiguate multiple coincident time series returned by a data source.
  • configuration [category]: string category used as a label for a particular time series, often used to distinguish types of model runs (e.g. short_range, medium_range, assimilation)
  • reference_time [datetime64[ns]]: formerly start_date, some reference time for a particular model simulation. Could be considered an issue time, start time, end time, or other meaningful reference time. Interpretation is simulation or forecast specific.
  • longitude [category]: float32 category, WGS84 decimal longitude
  • latitude [category]: float32 category, WGS84 decimal latitude
  • crs [category]: string category, Coordinate Reference System, typically "EPSG:4326"
  • geometry [geometry]: GeoPandas compatible GeoSeries used as the default "geometry" column
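As a minimal sketch (with illustrative values, not real data), a DataFrame using a subset of the canonical labels and dtypes above could be constructed like so:

import pandas as pd

df = pd.DataFrame({
    "value": pd.Series([1.2, 1.5], dtype="float32"),
    "value_time": pd.to_datetime(["2021-01-01T00:00", "2021-01-01T01:00"]),
    "variable_name": pd.Categorical(["streamflow", "streamflow"]),
    "measurement_unit": pd.Categorical(["m3/s", "m3/s"]),
    "qualifiers": pd.Categorical(["P", "P"]),
})
print(df.dtypes)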

Non-Canonical Column Labels

"Non-Canonical" labels are subpackage specific extensions to the canonical standard. Packages may share these non-canonical lables, but cross-compatibility is not guaranteed. Examples of non-canonical labels are given below.

  • usgs_site_code [category]: string category indicating the USGS Site Code/gage ID
  • nwm_feature_id [integer32]: indicates the NWM reach feature ID/ComID
  • nws_lid [category]: string category indicating the NWS Location ID/gage ID
  • usace_gage_id [category]: string category indicating the USACE gage ID
  • start [datetime64[ns]]: datetime returned by event_detection that indicates the beginning of an event
  • end [datetime64[ns]]: datetime returned by event_detection that indicates the end of an event

Categorical Data Types

OWPHydroTools uses pandas.DataFrames that contain pandas.Categorical values to increase memory efficiency. Depending upon your use-case, these values may require special consideration. To see if a DataFrame returned by an OWPHydroTools subpackage contains pandas.Categorical values, you can use pandas.DataFrame.info like so:

print(my_dataframe.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706954 entries, 0 to 5706953
Data columns (total 7 columns):
 #   Column            Dtype         
---  ------            -----         
 0   value_date        datetime64[ns]
 1   variable_name     category      
 2   usgs_site_code    category      
 3   measurement_unit  category      
 4   value             float32       
 5   qualifiers        category      
 6   series            category      
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 141.5 MB
None

Columns with Dtype category are pandas.Categorical. In most cases, the behavior of these columns is indistinguishable from their primitive types (in this case, str). However, there are times when use of categories can lead to unexpected behavior, such as when using pandas.DataFrame.groupby, as documented in the pandas documentation. pandas.Categorical values are also incompatible with fixed-format HDF files (you must use format="table") and may cause unexpected behavior when attempting to write to geospatial formats using geopandas.

Possible solutions include:

Cast Categorical to str

Casting to str will resolve all of the aforementioned issues, including writing to geospatial formats.

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].apply(str)

Remove unused categories

This will remove categories from the Series for which no values are actually present.

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].cat.remove_unused_categories()

Use observed option with groupby

This limits groupby operations to category values that actually appear in the Series or DataFrame.

mean_flow = my_dataframe.groupby('usgs_site_code', observed=True).mean()

American Geophysical Union 2021 Fall Meeting Poster

OWPHydroTools_AGU2021.pdf

hydrotools's People

Contributors

aaraney, groutr, hankherr-noaa, hellkite500, jarq6c


hydrotools's Issues

Document "gotchas" when dealing with Categorical dtype

@aaraney @hellkite500
I think it's a good idea to explicitly document some peculiarities of dealing with pandas.Categorical, which are quite common in evaluation_tools canonical pandas.DataFrames. Bare minimum, I'll add something like this to README.md. Thoughts?

Note about pandas.Categorical data types

evaluation_tools uses pandas.DataFrames that contain pandas.Categorical values to increase memory efficiency. Depending upon your use-case, these values may require special consideration. To see if a DataFrame returned by evaluation_tools contains pandas.Categorical values, you can use pandas.DataFrame.info like so:

print(my_dataframe.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706954 entries, 0 to 5706953
Data columns (total 7 columns):
 #   Column            Dtype         
---  ------            -----         
 0   value_date        datetime64[ns]
 1   variable_name     category      
 2   usgs_site_code    category      
 3   measurement_unit  category      
 4   value             float32       
 5   qualifiers        category      
 6   series            category      
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 141.5 MB
None

Columns with Dtype category are pandas.Categorical. It's important to note that these categories persist even if your DataFrame does not contain corresponding values. A possible consequence of this can be found in this Stack Overflow question.

Three possible solutions to this issue include:

Casting to string

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].astype(str)

Remove unused categories

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].cat.remove_unused_categories()

Use observed option with groupby

mean_flow = my_dataframe.groupby('usgs_site_code', observed=True).mean()

Document Canonical Format

As I'm revisiting the nwis_client and gcp_client tools, I think this is a good time to be daring and document the fixed vocabulary that describes hydrotools canonical dataframes. I'm open to discussion on this topic, but so far I'm leaning toward the column definitions given below. These definitions are not exactly compatible with our internal services, but our internal services are not exactly compatible with each other. My motivation was to establish a vocabulary for hydrotools that was consistent and sufficiently descriptive. Not all of these columns will be present or relevant to all dataframes, but where present definitions should be consistent.

There are two possible breaking changes, value_date is now valid_time and start_date is now reference_time.

These definitions also lean a lot on categorical types to avoid the use of multi-indexes. However, nothing prevents users from recasting these dataframes to use multi-indexes or from adding new columns to their data (like a custom_site_identifier column for example). These column labels are just meant to cover and define what the various client tools might return.

@aaraney @hellkite500

HydroTools Canonical DataFrame Column Definitions

value [float32]: Indicates the real value of an individual measurement or simulated quantity.
valid_time [datetime64[ns]]: formerly value_date, this indicates the valid time of value.
variable_name [category]: string category that indicates the real-world type of value (e.g. streamflow, gage height, temperature).
usgs_site_code [category]: string category indicating the USGS Site Code/gage ID
nwm_feature_id [category]: string category indicating the NWM reach feature ID/ComID
nws_lid [category]: string category indicating the NWS Location ID/gage ID
usace_gage_id [category]: string category indicating the USACE gage ID
measurement_unit [category]: string category indicating the measurement unit (SI or standard) of value
qualifiers [category]: string category that indicates any special qualifying codes or messages that apply to value
series [integer32]: Used to disambiguate multiple coincident time series returned by a data source.
configuration [category]: string category used as a label for a particular model simulation configuration (e.g. short_range, medium_range)
reference_time [datetime64[ns]]: formerly, start_date, some reference time for a particular model simulation. Could be considered an issue time, start time, end time, or other meaningful reference time. Interpretation is simulation or forecast specific.
longitude [category]: float32 category, WGS84 decimal longitude
latitude [category]: float32 category, WGS84 decimal latitude
geometry [geometry]: GeoPandas compatible GeoSeries

NWIS Client: startDT and endDT get shifted an hour

When retrieving data using the nwis_client tool, something is happening when specifying the startDT and endDT options where the returned data is shifted forward in time by 1 hour. We may be able to clean up the date handling and hand off a lot of it to pandas.

As user, I would like event detection to consider when a station is temporarily discontinued

Good morning.
I am trying to create a list of events for the past 10 days at station FLRV2

https://water.weather.gov/ahps2/hydrograph.php?wfo=rnk&gage=flrv2

https://waterdata.usgs.gov/nwis/uv?cb_00060=on&cb_00065=on&format=gif_default&site_no=02064000&period=10&begin_date=2021-05-03&end_date=2021-05-10

There has been a gage relocation due to a bridge construction. The function rolling_minimum is failing.

Thanks,
Alex

NWIS IVDataService Failing Unit Test

Pytest output:

============================= test session starts ==============================
platform linux -- Python 3.7.9, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/runner/work/evaluation_tools/evaluation_tools, configfile: pytest.ini
collected 96 items / 11 deselected / 85 selected

python/_restclient/tests/test_restclient.py .................
python/events/tests/test_decomposition.py .
python/gcp_client/tests/test_gcp.py ..
python/gcp_client/tests/test_utils.py ......
python/nwis_client/tests/test_nwis.py ......F....................................................

=================================== FAILURES ===================================
____________________________ test_get_throw_warning ____________________________

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fb338867a10>

    def test_get_throw_warning(monkeypatch):
        def wrapper(*args, **kwargs):
            return []
    
        # Monkey patch get_raw method to return []
>       monkeypatch.setattr(IVDataService, "get_raw", wrapper)
E       NameError: name 'IVDataService' is not defined

python/nwis_client/tests/test_nwis.py:87: NameError
=============================== warnings summary ===============================
python/nwis_client/evaluation_tools/nwis_client/iv.py:910
  /home/runner/work/evaluation_tools/evaluation_tools/python/nwis_client/evaluation_tools/nwis_client/iv.py:910: DeprecationWarning: invalid escape sequence \d
    pattern = "^(-?)P(?=\d|T\d)(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)([DW]))?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"

python/nwis_client/tests/test_nwis.py::test_handle_dates[2020-08-10T04:15-05:00-2020-08-10T09:15+0000]
python/nwis_client/tests/test_nwis.py::test_handle_dates[test5-2020-08-10T09:15+0000]
  /home/runner/work/evaluation_tools/evaluation_tools/python/nwis_client/evaluation_tools/nwis_client/iv.py:883: DeprecationWarning: parsing timezone aware datetimes is deprecated;
  the date has been converted to UTC and the tz information has been dropped, ergo the date is now considered `naive` UTC.
  See https://github.com/NOAA-OWP/evaluation_tools/issues/46
    warnings.warn(warning_message, DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED python/nwis_client/tests/test_nwis.py::test_get_throw_warning - NameEr...
=========== 1 failed, 84 passed, 11 deselected, 3 warnings in 1.68s ============

No way to install all evaluation tools at once

The top level namespace package evaluation_tools does not have a setup.py script. This means there is no way to grab the whole toolbox. Some users may prefer to install all tools at once.

stateCd does not work with nwis_client

I had not realized certain options were non-functional with the current version of nwis_client. These options appear to work with the version on the restclient_transition branch. So, I would say that makes getting this branch merged into main a high priority.

Limit nwm_client.gcp memory usage

The default settings limit retrieval to channel reaches that coincide with USGS gage sites. However, things quickly spiral out of control when attempting to conduct large scale analyses that require all 2.7+ million NWM reaches. This submodule needs a way to limit the maximum number of values held in memory. This likely means a bit of a redesign and a change to the default cache.

Please update 'About' description with more details

In response to Andy's request to make individual tools more discoverable, please add more details to the repository's description.

For example,

"Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data."

Is there a better description?

Google Colab bug affecting nwis_client and _restclient. Throws RuntimeError: This event loop is already running

In Google Colab, instantiating a RestClient (this is implicitly done by nwis_client.IVDataService) will cause a RuntimeError: This event loop is already running. This issue is well documented in the jupyter notebook repo. In that thread, a workaround using nest_asyncio was mentioned. The problem and a solution are shown below.

Reproduce Problem

!pip install hydrotools.nwis_client

from hydrotools import nwis_client

client = nwis_client.IVDataService()

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-10-d37c2bf4ee70> in <module>()
----> 1 service = nwis_client.IVDataService()

4 frames

/usr/lib/python3.7/asyncio/base_events.py in _check_runnung(self)
    521     def _check_runnung(self):
    522         if self.is_running():
--> 523             raise RuntimeError('This event loop is already running')
    524         if events._get_running_loop() is not None:
    525             raise RuntimeError(

RuntimeError: This event loop is already running

Solution

!pip install hydrotools.nwis_client


import nest_asyncio
nest_asyncio.apply()

from hydrotools import nwis_client
client = nwis_client.IVDataService()

IMO the best way to get around this is to try/except where the error propagates from, then try to import nest_asyncio and call nest_asyncio.apply(). If nest_asyncio is not installed, throw a ModuleNotFoundError referencing this issue and noting to install nest_asyncio. Given that this is such an edge case, and nest_asyncio is required by nbclient, which is required by nbconvert, which is required by jupyter notebook, it is unlikely that a user will ever lack nest_asyncio and run into this issue. Before I open a PR to resolve this, I'd like to hear your thoughts @jarq6c.
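A rough sketch of that proposal (a hypothetical helper; the name, placement, and error message are illustrative, not the eventual implementation):

import asyncio

def ensure_nestable_loop():
    # If an event loop is already running (e.g. inside Jupyter or
    # Colab), patch it with nest_asyncio so nested run_until_complete
    # calls do not raise RuntimeError
    try:
        running = asyncio.get_event_loop().is_running()
    except RuntimeError:
        running = False
    if running:
        try:
            import nest_asyncio
        except ModuleNotFoundError as error:
            raise ModuleNotFoundError(
                "an event loop is already running; install nest_asyncio "
                "to use this client in notebook environments"
            ) from error
        nest_asyncio.apply()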

Update nwis client to canonical format

nwis_client returns a value_date column. We will add a value_time_label="value_date" option to continue this behavior.

Update 1: Add value_time_label="value_date" to __init__ that will raise a deprecation warning. Default behavior will return value_date. value_time_label="value_time" will return value_time.

Update 2: Default behavior is value_time_label="value_time". Warn if this option is not explicitly set by user.

Update 3: Default behavior is value_time_label="value_time" with no warning.
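A minimal sketch of the Update 1 stage (illustrative only, not the actual implementation):

import warnings

class IVDataService:
    def __init__(self, value_time_label="value_date"):
        # Default behavior returns the legacy value_date column, but
        # warns that the canonical value_time label is coming
        if value_time_label == "value_date":
            warnings.warn(
                "value_date is deprecated; pass value_time_label='value_time' "
                "to adopt the canonical column label",
                DeprecationWarning,
            )
        self.value_time_label = value_time_label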

Jupyter Notebook RuntimeError: This event loop is already running

Duplicate

from hydrotools.nwis_client.iv import IVDataService

data_service = IVDataService(value_time_label="value_time")
stage_df = data_service.get(
    sites=["01646500", "01013500"],
    startDT="2019-08-01",
    endDT="2019-09-01",
    parameterCd="00065"
)

Tagging @aaraney for advice. It seems #100 fixed the issue in Google Colab, but it still persists in Jupyter Notebook and Spyder (which may have further issues with asyncio).

Running the code above inside a Jupyter Notebook or from Spyder resulted in RuntimeError: This event loop is already running.

Fix

You can resolve the error by adding

import nest_asyncio
nest_asyncio.apply()

Error in calling the nwis function for some of the catchments/station numbers

When running my python code, which extracts USGS data for hydrographs, it works for most of the station sites. But occasionally it fails for some of the station sites and produces the following errors. The station sites that cause failure are:
station #:
01189000
01010070
01017000
01015800
start date: "2015-12-01 00:00:00", end date: "2015-12-30 23:00:00" are all the same.

obs = nexus._hydro_location.get_data("2015-12-01 00:00:00", "2015-12-30 23:00:00")
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/hypy/hydrolocation/nwis_location.py", line 50, in get_data
    return self._nwis.get(self._station_id, startDT=start, endDT=end)
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/hydrotools/nwis_client/iv.py", line 262, in get
    dfs.loc[:, "value"] = pd.to_numeric(dfs["value"], downcast="float")
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'value'

Request: evaluation_tools.nwis_client CLI

This is a low priority, but CLIs for different evaluation tools may be useful to non-Python users and various old wizards who like doing everything through bash and system calls. The workflows of these users might benefit from evaluation_tools features like standardized data formats, efficient data retrieval, auto-parsing, caching, canned evaluations, etc. A CLI is a way to meet them halfway.

I can imagine something like:

$ evaluation_tools nwis_client --sites 02146600 --output USGS_site_data_02146600.csv

I had a good experience using Click to implement simple CLIs for my personal workflows. I'd like to explore this more.
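As a starting point, a minimal Click-based sketch might look like the following (option names, defaults, and output handling are illustrative assumptions, not an agreed interface):

# file: cli.py
import click
from hydrotools.nwis_client.iv import IVDataService

@click.command()
@click.option("--sites", multiple=True, required=True, help="USGS site codes")
@click.option("--start-dt", "startDT", default="2021-01-01", help="Start datetime")
@click.option("--end-dt", "endDT", default="2021-01-02", help="End datetime")
@click.option("--output", type=click.Path(), required=True, help="Output CSV path")
def nwis_client(sites, startDT, endDT, output):
    """Retrieve NWIS IV observations and write them to a CSV file."""
    observations = IVDataService().get(sites=list(sites), startDT=startDT, endDT=endDT)
    observations.to_csv(output, index=False)

if __name__ == "__main__":
    nwis_client()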

Events package structure question

I noticed while looking around the package today that the event_detection module is nested one level deeper than typical and does not have a parent subsubpackage (directory above with __init__.py).

More concretely, on the path hydrotools/python/events/src/hydrotools/events/event_detection/, there is an __init__.py in the event_detection directory but not in the events directory. @jarq6c can you clarify why this is the case? I assume there should be an __init__.py in the events directory as well.

Thanks!

As a developer, I would like to specify development dependencies at the subpackage level

At the moment, all development deps are installed at the namespace package level, namely pytest. I would like to come to an agreed-upon standard for specifying development deps to setuptools. Below are two example setup.py implementations that solve this problem:

# file: setup.py

from setuptools import setup
from setuptools.command.develop import develop
import subprocess

DEVELOPMENT_REQUIREMENTS = ["pytest"]

# Development installation
class Develop(develop):
    def run(self):
        # Install development requirements
        for dev_requirement in DEVELOPMENT_REQUIREMENTS:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", dev_requirement]
            )
        develop.run(self)

setup(
    name="mypackage",
    description="some package that does something",
    install_requires=["pandas"],
    cmdclass={"develop": Develop},
)

The above is used as such: python setup.py develop. This will install all the deps, development deps, and the package in "editable" form (i.e. like pip install -e .).

# file: setup.py

from setuptools import setup

DEVELOPMENT_REQUIREMENTS = ["pytest"]

setup(
    name="mypackage",
    description="some package that does something",
    install_requires=["pandas"],
    extras_require={"develop" : DEVELOPMENT_REQUIREMENTS},
)

The second is used as follows: pip install -e ".[develop]" (the quotes are not required in bash, but are required in zsh and likely other shells).

Personally, I prefer the second option over the first. The second option uses pip, and thus files like pyproject.toml are regarded, whereas in the first they are not. Likewise, this functionality should work on PyPI; I am unsure at this moment how you would specify to include tests in the package, though, so that may be a moot point. The second solution is also far less code and more maintainable across subpackages IMO.

Automate gh-pages deployment evaluation_tools version tracking

Evaluation tools 1.0.0 -- this is the incorrect version.

As shown, right now the version number requires manual updating. I need to explore how to improve this.

Tangentially, I think we could improve version bumping for this package and the included sub-packages. I know that bump2version is commonly used, but I think it makes sense to do some research to find the best option.

Evaluate the influence of NaN values on event detection

Possible bug. The event detection methods validate the index, but not the values. The presence of NaN in the value series may produce undefined behavior given the recursive nature of the filters. I want to see if this case needs handling.

Update GitFlow documentation

Further elaborate the GitFlow used in this repo. Elucidate the use of the develop branch, forking, pull requests, and how to pull upstream changes into your fork to continue work.

nwis-client latest builds broken when installed on Python 3.6

Importing the latest nwis-client using python 3.6

from hydrotools.nwis_client.iv import IVDataService

fails with

TypeError: 'type' object is not subscriptable

The error may be more ubiquitous than just the IVDataService import, but that is where we have had trouble with it.

A brief conversation with the developers suggested that a downstream dependency forces a bump to python 3.7 as the minimum requirement, and they are considering workarounds to allow backward compatibility. The newer version is worth the effort, with order-of-magnitude faster retrieval speeds from the NWIS service. We do not have any real reason to continue using 3.6, so we may look at an upgrade path.

Workaround

For now, uninstalling the nwis_client, _restclient, and events modules

pip uninstall hydrotools.nwis_client
pip uninstall hydrotools._restclient
pip uninstall hydrotools.events

then reinstalling the following versions allowed us to continue using the service in the meantime on python 3.6.8

pip install hydrotools._restclient==2.0.0a0
pip install hydrotools.nwis_client==2.0.0a0

Document best practices to use event detection.

Update event detection docstring with some suggested parameters, something like:

Guide to applying event detection using time series decomposition

The evaluation_tools.events.event_detection.decomposition method has two main parameters: halflife and window. These parameters are passed directly to underlying filters used to remove noise and model the underlying trend (AKA baseflow) in a streamflow time series. Significant contiguous deviations from this trend are flagged as "events". This method was originally conceived to detect rainfall-driven runoff events in small watersheds from records of volumetric discharge or total runoff. Before using decomposition you will want to have some idea of the event timescales you hope to detect in your original time series. A minimal usage sketch follows the advice below.

Specific Advice

  1. Ensure your time series is monotonically increasing, with a sampling interval significantly shorter than the timescale of the events you hope to detect (this could/should be checked on the module side)
  2. Use pandas.Timedelta-compatible str to specify halflife and window
  3. Specify a halflife larger than the expected frequency of noise, but smaller than the event frequency/timescale
  4. Specify a window larger than the event frequency/timescale, but at least 4 to 8 times smaller than the entire length of the time series.
  5. Filter/Refine the final list of events to remove false events from especially noisy signals
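A minimal usage sketch under these assumptions (synthetic data; the list_events function name is an assumption here):

import pandas as pd
from hydrotools.events.event_detection import decomposition

# Synthetic hourly streamflow with a timezone-naive DatetimeIndex
index = pd.date_range("2021-01-01", periods=24 * 30, freq="H")
flow = pd.Series(1.0, index=index)
flow["2021-01-10":"2021-01-12"] += 10.0  # inject a single "event"

# halflife larger than the noise timescale, window larger than the
# event timescale, per the advice above
events = decomposition.list_events(flow, halflife="6H", window="7D")
print(events)  # expect a table of event start and end times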

Cache GCP client responses to reduce repeated operations and network calls

Each time the gcp client is used, it hits GCP to get the requested data. Given the size of the data and the repeated retrieval process, it only makes sense to implement some kind of cache. I propose that we use a file db (i.e. sqlite) to accomplish this, for simplicity and broad support in python.

High level logic

  1. check if db cache has desired data (stored as df)
    1. yes - return data
    2. no - continue
  2. get data from gcp
  3. create a df from the gcp data
  4. cache this df in an sqlite db using the URL path as the key
  5. return the df

Requirements

  • sqlite lib must support multiprocessing and batch commits

The sqlitedict library is mature, maintained, and seems to fit the bill for this feature. The lib lets you create/connect to a db and use it like you would a python dictionary. Most importantly, it supports multiprocessing. A sketch of this pattern is given below.
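A minimal sketch of the high-level logic above using sqlitedict (the key scheme and fetch callable are illustrative assumptions):

import pandas as pd
from sqlitedict import SqliteDict

def get_with_cache(url, fetch, cache_file="gcp_cache.sqlite"):
    with SqliteDict(cache_file) as cache:
        if url in cache:      # 1. cache hit: return the stored df
            return cache[url]
        df = fetch(url)       # 2-3. retrieve from GCP and build a df
        cache[url] = df       # 4. cache the df keyed by URL path
        cache.commit()
        return df             # 5. return the df

# Example with a dummy fetcher standing in for the GCP retrieval
frame = get_with_cache("gcp/path/to/file", lambda url: pd.DataFrame({"value": [1.0]}))
print(frame)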

Return empty df to avoid pd.concat Value Error in iv get method

@hellkite500 found an edge case where an nwis_client get request can throw a ValueError when pd.concat has nothing to concatenate. This often occurs when a user asks for data from a site that is listed as 'active' even though the gage is not actively reporting. This should just return an empty df with the typical field headers present.

https://github.com/NOAA-OWP/evaluation_tools/blob/1e3b0701f68977d16371740df802fc180445024f/python/nwis_client/evaluation_tools/nwis_client/iv.py#L141

Update CONTRIBUTING.md

The contents of CONTRIBUTING.md, specifically the guide to creating a new subpackage, are a little outdated. This is just a placeholder for when someone finds time to update the CONTRIBUTING.md document.

Implement interface to disable caching for _restclient and nwis_client

Currently, nwis_client.iv.IVDataService sets up caching on initialization. As near as I can tell _restclient.RestClient also decides whether to cache at initialization. The result is that there is no native way to disable caching prior to or when requesting data.

Previously, I've successfully used the requests_cache context manager to temporarily disable caching, but we might want to offer an interface through the tools themselves.
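For reference, a minimal sketch of that requests_cache workaround (cache name and URL are illustrative):

import requests
import requests_cache

# Globally cache requests made through the requests library
requests_cache.install_cache("hydrotools_cache")

# Temporarily bypass the cache for a single block of calls; note the
# thread-safety caveat raised in a later issue below
with requests_cache.disabled():
    response = requests.get("https://httpbin.org/ip")
print(getattr(response, "from_cache", False))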

nwis_client: Retrieving large amounts of data takes too long

Case study: A user needs to conduct a long evaluation and wants 25 years of streamflow data from 2000+ USGS gage locations.

Obstacles:

  1. We currently only parallelize by site, can we parallelize by chunking in time?
  2. How do we get around limited memory?
  3. Will this break requests_cache?
  4. Can a DataFrame cache help us here?

Based on a use-case from @hellkite500

@aaraney not necessarily looking for solutions here yet. Just wanted to start the discussion. The nwis_client tool was really the first tool we fleshed out. So, it seems fitting to start discussions about scaling evaluation_tools here.

Disabling requests-cache is not thread safe. Discussion of the future of _restclient's usage of requests-cache

Due to the way requests-cache (used in _restclient) is implemented, if the cache has been "installed" and a downstream package uses requests, the downstream package's requests will be cached. In other words, requests_cache implicitly changes the behavior of the requests package for all downstream callers.

Per requests-cache's documentation, they do provide a context manager to "disable" the cache. However, as @christophertubbs pointed out their implementation is not thread safe:

@contextmanager
def cache_disabled(self):
    """
    Context manager for temporary disabling cache
    ::
        >>> s = CachedSession()
        >>> with s.cache_disabled():
        ...     s.get('http://httpbin.org/ip')
    """
    self._is_cache_disabled = True
    try:
        yield
    finally:
        self._is_cache_disabled = False

In discussing this issue with @hellkite500, he mentioned the (actively developed) project CacheControl. Before actively developing a fix for this behavior, it's worth exploring a bit and determining what long-term solution to implement.

Add examples and procedure for adding future examples to repo

For the benefit of users, it makes sense to demonstrate example usages of evaluation_tools sub-packages. As of now, examples are provided in docstrings and READMEs. This form of documentation is suitable for quick reference; however, there is a need for more complete example-style documentation.

@jarq6c has written a great example detailing peak flow analysis for Little Hope Creek. This example is expected to be the first added to the repo and, in doing so, to pave the way for future examples. That being said, once added to the repo, the Little Hope example should serve as a template for future example additions.

Outcomes of this addition

  • Maintain multiple ways for a user to find example content as well as interact with it
  • Require that examples take the form of a jupyter notebook as well as a standalone python script. This should allow a user to complete the example in the browser via a Google Colab notebook, locally in jupyter, or locally via python.
  • In support of users who are not as git-savvy or who just want the example material, each example should be downloadable in zipped as well as unzipped form. This will likely be handled by a GitHub action, meaning the example developer does not have to worry about maintaining/versioning zipped examples.
  • Display completed notebook examples in gh-pages deployment with the ability to open the example in google colab.

Example structure expectation

  1. Each example should sit under /examples/<name-of-example>
  2. Contain a README (/examples/<name-of-example>/README.md) that:
    1. briefly gives an overview of the example
    2. mentions the ways to interact with the example (Google Colab, local jupyter, local standalone) and links to Google Colab
    3. outlines the local installation process
    4. outlines how to run the standalone version
  3. Includes a jupyter notebook with the same name as the example.
  4. Includes a standalone python script with the same name as the example.
  5. Includes a requirements.txt file for installing reproducible dependencies.
  6. (this may change in the future) contains a zipped file of all the above.
