
noaa-owp / hydrotools


Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.

License: Other

Topics: python, pandas, noaa, hydrology, evaluation, verification, validation, forecasting, modeling, simulation

hydrotools's Introduction


OWPHydroTools

Documentation

OWPHydroTools GitHub pages documentation

Motivation

We developed OWPHydroTools with data scientists in mind. We attempted to ensure that the simplest methods, such as get, both accept and return data structures frequently used in scientific Python. Specifically, this means that pandas.DataFrames, geopandas.GeoDataFrames, and numpy.arrays are the most frequently encountered data structures when using OWPHydroTools. Most methods include sensible defaults that cover common use-cases, but allow customization if required.

We also attempted to adhere to organizational (NOAA-OWP) data standards where they exist. This means pandas.DataFrames will contain column labels like usgs_site_code, start_date, value_date, and measurement_unit, which are consistent with organization-wide naming conventions. Our intent is to make retrieving, evaluating, and exporting data as easy and reproducible as possible for scientists, practitioners, and other hydrological experts.
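For example, retrieving observations with the nwis_client subpackage (detailed later in this document) yields one of these canonical pandas.DataFrames directly:

from hydrotools.nwis_client.iv import IVDataService

# Retrieve one month of streamflow observations as a pandas.DataFrame
# (parameterCd "00060" is USGS streamflow)
data_service = IVDataService(value_time_label="value_time")
observations = data_service.get(
    sites=["01646500"],
    startDT="2019-08-01",
    endDT="2019-09-01",
    parameterCd="00060"
)
print(observations.head())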

What's here?

We've taken a grab-and-go approach to installation and usage of OWPHydroTools. In line with a standard toolbox, you typically install just the tool or tools that get your job done, without having to install all the other tools available. This means a lighter installation load, and new tools can be added to the toolbox without affecting your workflows!

Note that we commonly refer to individual tools in OWPHydroTools as subpackages or by their name (e.g. nwis_client). You will find this lingo in both issues and documentation.

Currently the repository has the following subpackages:

  • events: Variety of methods used to perform event-based evaluations of hydrometric time series

  • nwm_client: Provides methods for retrieving National Water Model data from various sources including Google Cloud Platform and NOMADS

  • metrics: Variety of methods used to compute common evaluation metrics

  • nwis_client: Provides easy to use methods for retrieving data from the USGS NWIS Instantaneous Values (IV) Web Service

  • svi_client: Provides programmatic access to the Centers for Disease Control and Prevention's (CDC) Social Vulnerability Index (SVI)

  • _restclient: A generic REST client with a built-in cache that makes the construction and retrieval of GET requests painless

  • caches: Provides a variety of object caching utilities

UTC Time

Note: the canonical pandas.DataFrames used by OWPHydroTools use timezone-naive datetimes that assume UTC time. In general, do not assume methods are compatible with timezone-aware datetimes or timestamps. Expect methods to transform timezone-aware datetimes and timestamps into their timezone-naive counterparts at UTC time.
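If your data are timezone-aware, a minimal sketch of the expected normalization (convert to UTC, then drop the timezone) looks like this:

import pandas as pd

# A timezone-aware timestamp (US Eastern, daylight time: UTC-04:00)
aware = pd.Timestamp("2021-06-01T12:00", tz="US/Eastern")

# Convert to UTC, then drop the timezone information; for a Series,
# use .dt.tz_convert("UTC").dt.tz_localize(None) instead
naive_utc = aware.tz_convert("UTC").tz_localize(None)
print(naive_utc)  # 2021-06-01 16:00:00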

Usage

Refer to each subpackage's README.md or documentation for examples of how to use each tool.

Installation

In accordance with Python community recommendations, we support and advise the use of virtual environments in any Python workflow. In the following installation guide, we use Python's built-in venv module to create a virtual environment in which the tools will be installed. Note this is just a preference; any Python virtual environment manager should work just fine (conda, pipenv, etc.).

# Create and activate python environment, requires python >= 3.8
$ python3 -m venv venv
$ source venv/bin/activate
$ python3 -m pip install --upgrade pip

# Install all tools
$ python3 -m pip install hydrotools

# Alternatively you can install a single tool
#  This installs the NWIS Client tool
$ python3 -m pip install hydrotools.nwis_client

OWPHydroTools Canonical Format

"Canonical" labels are protected and part of a fixed lexicon. Canonical labels are shared among all hydrotools subpackages. Subpackage methods should avoid changing or redefining these columns where they appear to encourage cross-compatibility. Existing canonical labels are listed below:

  • value [float32]: Indicates the real value of an individual measurement or simulated quantity.
  • value_time [datetime64[ns]]: formerly value_date, this indicates the valid time of value.
  • variable_name [category]: string category that indicates the real-world type of value (e.g. streamflow, gage height, temperature).
  • measurement_unit [category]: string category indicating the measurement unit (SI or standard) of value
  • qualifiers [category]: string category that indicates any special qualifying codes or messages that apply to value
  • series [integer32]: Used to disambiguate multiple coincident time series returned by a data source.
  • configuration [category]: string category used as a label for a particular time series, often used to distinguish types of model runs (e.g. short_range, medium_range, assimilation)
  • reference_time [datetime64[ns]]: formerly start_date, some reference time for a particular model simulation. Could be considered an issue time, start time, end time, or other meaningful reference time. Interpretation is simulation or forecast specific.
  • longitude [category]: float32 category, WGS84 decimal longitude
  • latitude [category]: float32 category, WGS84 decimal latitude
  • crs [category]: string category, Coordinate Reference System, typically "EPSG:4326"
  • geometry [geometry]: GeoPandas compatible GeoSeries used as the default "geometry" column
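As a minimal sketch (with illustrative values, not real data), a DataFrame using a subset of the canonical labels and dtypes above could be constructed like so:

import pandas as pd

df = pd.DataFrame({
    "value": pd.Series([1.2, 1.5], dtype="float32"),
    "value_time": pd.to_datetime(["2021-01-01T00:00", "2021-01-01T01:00"]),
    "variable_name": pd.Categorical(["streamflow", "streamflow"]),
    "measurement_unit": pd.Categorical(["m3/s", "m3/s"]),
    "qualifiers": pd.Categorical(["P", "P"]),
})
print(df.dtypes)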

Non-Canonical Column Labels

"Non-Canonical" labels are subpackage specific extensions to the canonical standard. Packages may share these non-canonical lables, but cross-compatibility is not guaranteed. Examples of non-canonical labels are given below.

  • usgs_site_code [category]: string category indicating the USGS Site Code/gage ID
  • nwm_feature_id [integer32]: indicates the NWM reach feature ID/ComID
  • nws_lid [category]: string category indicating the NWS Location ID/gage ID
  • usace_gage_id [category]: string category indicating the USACE gage ID
  • start [datetime64[ns]]: datetime returned by event_detection that indicates the beginning of an event
  • end [datetime64[ns]]: datetime returned by event_detection that indicates the end of an event

Categorical Data Types

OWPHydroTools uses pandas.DataFrames that contain pandas.Categorical values to increase memory efficiency. Depending upon your use-case, these values may require special consideration. To see if a DataFrame returned by an OWPHydroTools subpackage contains pandas.Categorical values, you can use pandas.DataFrame.info like so:

print(my_dataframe.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706954 entries, 0 to 5706953
Data columns (total 7 columns):
 #   Column            Dtype         
---  ------            -----         
 0   value_date        datetime64[ns]
 1   variable_name     category      
 2   usgs_site_code    category      
 3   measurement_unit  category      
 4   value             float32       
 5   qualifiers        category      
 6   series            category      
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 141.5 MB
None

Columns with Dtype category are pandas.Categorical. In most cases, the behavior of these columns is indistinguishable from their primitive types (in this case, str). However, there are times when use of categories can lead to unexpected behavior, such as when using pandas.DataFrame.groupby, as documented in the pandas documentation. pandas.Categorical values are also incompatible with fixed-format HDF files (you must use format="table") and may cause unexpected behavior when attempting to write to geospatial formats using geopandas.

Possible solutions include:

Cast Categorical to str

Casting to str will resolve all of the aforementioned issues, including writing to geospatial formats.

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].apply(str)

Remove unused categories

This will remove categories from the Series for which no values are actually present.

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].cat.remove_unused_categories()

Use observed option with groupby

This limits groupby operations to category values that actually appear in the Series or DataFrame.

mean_flow = my_dataframe.groupby('usgs_site_code', observed=True).mean()

American Geophysical Union 2021 Fall Meeting Poster

OWPHydroTools_AGU2021.pdf

hydrotools's People

Contributors

aaraney, groutr, hankherr-noaa, hellkite500, jarq6c


hydrotools's Issues

Document "gotchas" when dealing with Categorical dtype

@aaraney @hellkite500
I think it's a good idea to explicitly document some peculiarities of dealing with pandas.Categorical, which are quite common in evaluation_tools canonical pandas.DataFrames. Bare minimum, I'll add something like this to README.md. Thoughts?

Note about pandas.Categorical data types

evaluation_tools uses pandas.DataFrames that contain pandas.Categorical values to increase memory efficiency. Depending upon your use-case, these values may require special consideration. To see if a DataFrame returned by evaluation_tools contains pandas.Categorical values, you can use pandas.DataFrame.info like so:

print(my_dataframe.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706954 entries, 0 to 5706953
Data columns (total 7 columns):
 #   Column            Dtype         
---  ------            -----         
 0   value_date        datetime64[ns]
 1   variable_name     category      
 2   usgs_site_code    category      
 3   measurement_unit  category      
 4   value             float32       
 5   qualifiers        category      
 6   series            category      
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 141.5 MB
None

Columns with Dtype category are pandas.Categorical. It's important to note that these categories persist even if your DataFrame does not contain corresponding values. A possible consequence of this can be found in this Stack Overflow question.

Three possible solutions to this issue include:

Casting to string

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].astype(str)

Remove unused categories

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].cat.remove_unused_categories()

Use observed option with groupby

mean_flow = my_dataframe.groupby('usgs_site_code', observed=True).mean()

Document Canonical Format

As I'm revisiting the nwis_client and gcp_client tools, I think this is a good time to be daring and document the fixed vocabulary that describes hydrotools canonical dataframes. I'm open to discussion on this topic, but so far I'm leaning toward the column definitions given below. These definitions are not exactly compatible with our internal services, but our internal services are not exactly compatible with each other. My motivation was to establish a vocabulary for hydrotools that was consistent and sufficiently descriptive. Not all of these columns will be present or relevant to all dataframes, but where present definitions should be consistent.

There are two possible breaking changes, value_date is now valid_time and start_date is now reference_time.

These definitions also lean a lot on categorical types to avoid the use of multi-indexes. However, nothing prevents users from recasting these dataframes to use multi-indexes or from adding new columns to their data (like a custom_site_identifier column for example). These column labels are just meant to cover and define what the various client tools might return.

@aaraney @hellkite500

HydroTools Canonical DataFrame Column Definitions

value [float32]: Indicates the real value of an individual measurement or simulated quantity.
valid_time [datetime64[ns]]: formerly value_date, this indicates the valid time of value.
variable_name [category]: string category that indicates the real-world type of value (e.g. streamflow, gage height, temperature).
usgs_site_code [category]: string category indicating the USGS Site Code/gage ID
nwm_feature_id [category]: string category indicating the NWM reach feature ID/ComID
nws_lid [category]: string category indicating the NWS Location ID/gage ID
usace_gage_id [category]: string category indicating the USACE gage ID
measurement_unit [category]: string category indicating the measurement unit (SI or standard) of value
qualifiers [category]: string category that indicates any special qualifying codes or messages that apply to value
series [integer32]: Used to disambiguate multiple coincident time series returned by a data source.
configuration [category]: string category used as a label for a particular model simulation configuration (e.g. short_range, medium_range)
reference_time [datetime64[ns]]: formerly, start_date, some reference time for a particular model simulation. Could be considered an issue time, start time, end time, or other meaningful reference time. Interpretation is simulation or forecast specific.
longitude [category]: float32 category, WGS84 decimal longitude
latitude [category]: float32 category, WGS84 decimal latitude
geometry [geometry]: GeoPandas compatible GeoSeries

NWIS Client: startDT and endDT get shifted an hour

When retrieving data using the nwis_client tool, something is happening when specifying the startDT and endDT options where the returned data is shifted forward in time by 1 hour. We may be able to clean up the date handling and hand off a lot of it to pandas.

As user, I would like event detection to consider when a station is temporarily discontinued

Good morning.
I am trying to create a list of events for the past 10 days at station FLRV2

https://water.weather.gov/ahps2/hydrograph.php?wfo=rnk&gage=flrv2

https://waterdata.usgs.gov/nwis/uv?cb_00060=on&cb_00065=on&format=gif_default&site_no=02064000&period=10&begin_date=2021-05-03&end_date=2021-05-10

There has been a gage relocation due to a bridge construction. The function rolling_minimum is failing.

Thanks,
Alex

NWIS IVDataService Failing Unit Test

Pytest output:

============================= test session starts ==============================
platform linux -- Python 3.7.9, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/runner/work/evaluation_tools/evaluation_tools, configfile: pytest.ini
collected 96 items / 11 deselected / 85 selected

python/_restclient/tests/test_restclient.py .................
python/events/tests/test_decomposition.py .
python/gcp_client/tests/test_gcp.py ..
python/gcp_client/tests/test_utils.py ......
python/nwis_client/tests/test_nwis.py ......F....................................................

=================================== FAILURES ===================================
____________________________ test_get_throw_warning ____________________________

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fb338867a10>

    def test_get_throw_warning(monkeypatch):
        def wrapper(*args, **kwargs):
            return []
    
        # Monkey patch get_raw method to return []
>       monkeypatch.setattr(IVDataService, "get_raw", wrapper)
E       NameError: name 'IVDataService' is not defined

python/nwis_client/tests/test_nwis.py:87: NameError
=============================== warnings summary ===============================
python/nwis_client/evaluation_tools/nwis_client/iv.py:910
  /home/runner/work/evaluation_tools/evaluation_tools/python/nwis_client/evaluation_tools/nwis_client/iv.py:910: DeprecationWarning: invalid escape sequence \d
    pattern = "^(-?)P(?=\d|T\d)(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)([DW]))?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"

python/nwis_client/tests/test_nwis.py::test_handle_dates[2020-08-10T04:15-05:00-2020-08-10T09:15+0000]
python/nwis_client/tests/test_nwis.py::test_handle_dates[test5-2020-08-10T09:15+0000]
  /home/runner/work/evaluation_tools/evaluation_tools/python/nwis_client/evaluation_tools/nwis_client/iv.py:883: DeprecationWarning: parsing timezone aware datetimes is deprecated;
  the date has been converted to UTC and the tz information has been dropped, ergo the date is now considered `naive` UTC.
  See https://github.com/NOAA-OWP/evaluation_tools/issues/46
    warnings.warn(warning_message, DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED python/nwis_client/tests/test_nwis.py::test_get_throw_warning - NameEr...
=========== 1 failed, 84 passed, 11 deselected, 3 warnings in 1.68s ============

No way to install all evaluation tools at once

The top level namespace package evaluation_tools does not have a setup.py script. This means there is no way to grab the whole toolbox. Some users may prefer to install all tools at once.

stateCd does not work with nwis_client

I had not realized certain options were non-functional with the current version of nwis_client. These options appear to work with the version on the restclient_transition branch. So, I would say that makes getting this branch merged into main a high priority.

Limit nwm_client.gcp memory usage

The default settings limit retrieval to channel reaches that coincide with USGS gage sites. However, things quickly spiral out of control when attempting to conduct large scale analyses that require all 2.7+ million NWM reaches. This submodule needs a way to limit the maximum number of values held in memory. This likely means a bit of a redesign and a change to the default cache.

Please update 'About' description with more details

In response to Andy's request to make individual tools more discoverable, please add more details to the repository's description.

For example,

"Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data."

Is there a better description?

Google Colab bug affecting nwis_client and _restclient. Throws RuntimeError: This event loop is already running

In Google Colab, instantiating a RestClient (this is implicitly done by nwis_client.IVDataService) will cause a RuntimeError: This event loop is already running. This issue is well documented in the jupyter notebook repo. In that thread, a workaround using nest_asyncio was mentioned. The problem and a solution are shown below.

Reproduce Problem

!pip install hydrotools.nwis_client

from hydrotools import nwis_client

client = nwis_client.IVDataService()

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-10-d37c2bf4ee70> in <module>()
----> 1 service = nwis_client.IVDataService()

4 frames

/usr/lib/python3.7/asyncio/base_events.py in _check_runnung(self)
    521     def _check_runnung(self):
    522         if self.is_running():
--> 523             raise RuntimeError('This event loop is already running')
    524         if events._get_running_loop() is not None:
    525             raise RuntimeError(

RuntimeError: This event loop is already running

Solution

!pip install hydrotools.nwis_client


import nest_asyncio
nest_asyncio.apply()

from hydrotools import nwis_client
client = nwis_client.IVDataService()

IMO the best way to get around this is to try/except where the error propagates from, then try to import nest_asyncio and call nest_asyncio.apply(). If nest_asyncio is not installed, throw a ModuleNotFoundError referencing this issue and noting to install nest_asyncio. Given that this is such an edge case, and nest_asyncio is required by nbclient, which is required by nbconvert, which is required by jupyter notebook, it is unlikely that a user will ever lack nest_asyncio and run into this issue. Before I open a PR to resolve this, I'd like to hear your thoughts @jarq6c.
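A rough sketch of that proposal (a hypothetical helper; the name, placement, and error message are illustrative, not the eventual implementation):

import asyncio

def ensure_nestable_loop():
    # If an event loop is already running (e.g. inside Jupyter or
    # Colab), patch it with nest_asyncio so nested run_until_complete
    # calls do not raise RuntimeError
    try:
        running = asyncio.get_event_loop().is_running()
    except RuntimeError:
        running = False
    if running:
        try:
            import nest_asyncio
        except ModuleNotFoundError as error:
            raise ModuleNotFoundError(
                "an event loop is already running; install nest_asyncio "
                "to use this client in notebook environments"
            ) from error
        nest_asyncio.apply()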

Update nwis client to canonical format

nwis_client returns a value_date column. We will add a value_time_label="value_date" option to continue this behavior.

Update 1: Add value_time_label="value_date" to __init__ that will raise a deprecation warning. Default behavior will return value_date. value_time_label="value_time" will return value_time.

Update 2: Default behavior is value_time_label="value_time". Warn if this option is not explicitly set by user.

Update 3: Default behavior is value_time_label="value_time" with no warning.
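A minimal sketch of the Update 1 stage (illustrative only, not the actual implementation):

import warnings

class IVDataService:
    def __init__(self, value_time_label="value_date"):
        # Default behavior returns the legacy value_date column, but
        # warns that the canonical value_time label is coming
        if value_time_label == "value_date":
            warnings.warn(
                "value_date is deprecated; pass value_time_label='value_time' "
                "to adopt the canonical column label",
                DeprecationWarning,
            )
        self.value_time_label = value_time_label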

Jupyter Notebook RuntimeError: This event loop is already running

Duplicate

from hydrotools.nwis_client.iv import IVDataService

data_service = IVDataService(value_time_label="value_time")
stage_df = data_service.get(
    sites=["01646500", "01013500"],
    startDT="2019-08-01",
    endDT="2019-09-01",
    parameterCd="00065"
)

Tagging @aaraney for advice. It seems #100 fixed the issue in Google Colab, but it still persists in Jupyter Notebook and Spyder (which may have further issues with asyncio).

Running the code above inside a Jupyter Notebook or from Spyder resulted in RuntimeError: This event loop is already running.

Fix

You can resolve the error by adding

import nest_asyncio
nest_asyncio.apply()

Error in calling the nwis function for some of the catchments/station numbers

When running my python code, which extracts USGS data for hydrographs, it works for most of the station sites. But occasionally it fails for some of the station sites and produces the following errors. The station sites that cause failure are:
station #:
01189000
01010070
01017000
01015800
start date: "2015-12-01 00:00:00", end date: "2015-12-30 23:00:00" are all the same.

obs = nexus._hydro_location.get_data("2015-12-01 00:00:00", "2015-12-30 23:00:00")
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/hypy/hydrolocation/nwis_location.py", line 50, in get_data
    return self._nwis.get(self._station_id, startDT=start, endDT=end)
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/hydrotools/nwis_client/iv.py", line 262, in get
    dfs.loc[:, "value"] = pd.to_numeric(dfs["value"], downcast="float")
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'value'

Request: evaluation_tools.nwis_client CLI

This is a low priority, but CLIs for different evaluation tools may be useful to non-Python users and various old wizards who like doing everything through bash and system calls. The workflows of these users might benefit from evaluation_tools features like standardized data formats, efficient data retrieval, auto-parsing, caching, canned evaluations, etc. A CLI is a way to meet them halfway.

I can imagine something like:

$ evaluation_tools nwis_client --sites 02146600 --output USGS_site_data_02146600.csv

I had a good experience using Click to implement simple CLIs for my personal workflows. I'd like to explore this more.
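As a starting point, a minimal Click-based sketch might look like the following (option names, defaults, and output handling are illustrative assumptions, not an agreed interface):

# file: cli.py
import click
from hydrotools.nwis_client.iv import IVDataService

@click.command()
@click.option("--sites", multiple=True, required=True, help="USGS site codes")
@click.option("--start-dt", "startDT", default="2021-01-01", help="Start datetime")
@click.option("--end-dt", "endDT", default="2021-01-02", help="End datetime")
@click.option("--output", type=click.Path(), required=True, help="Output CSV path")
def nwis_client(sites, startDT, endDT, output):
    """Retrieve NWIS IV observations and write them to a CSV file."""
    observations = IVDataService().get(sites=list(sites), startDT=startDT, endDT=endDT)
    observations.to_csv(output, index=False)

if __name__ == "__main__":
    nwis_client()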

Events package structure question

I noticed while looking around the package today that the event_detection module is nested one level deeper than typical and does not have a parent subsubpackage (directory above with __init__.py).

More concretely, on the path hydrotools/python/events/src/hydrotools/events/event_detection/, there is an __init__.py in the event_detection directory but not in the events directory. @jarq6c can you clarify why this is the case? I assume there should be an __init__.py in the events directory as well.

Thanks!

As a developer, I would like to specify development dependencies at the subpackage level

At the moment, all development deps are installed at the namespace package level, namely pytest. I would like to come to an agreed-upon standard for specifying development deps to setuptools. Below are two example setup.py implementations that solve this problem:

# file: setup.py

from setuptools import setup
from setuptools.command.develop import develop
import subprocess

DEVELOPMENT_REQUIREMENTS = ["pytest"]

# Development installation
class Develop(develop):
    def run(self):
        # Install development requirements
        for dev_requirement in DEVELOPMENT_REQUIREMENTS:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", dev_requirement]
            )
        develop.run(self)

setup(
    name="mypackage",
    description="some package that does something",
    install_requires=["pandas"],
    cmdclass={"develop": Develop},
)

The above is used as such: python setup.py develop. This will install all the deps, development deps, and the package in "editable" form (i.e. like pip install -e .).

# file: setup.py

from setuptools import setup

DEVELOPMENT_REQUIREMENTS = ["pytest"]

setup(
    name="mypackage",
    description="some package that does something",
    install_requires=["pandas"],
    extras_require={"develop" : DEVELOPMENT_REQUIREMENTS},
)

The second is used as follows: pip install -e ".[develop]" (the quotes are not required in bash, but are required in zsh and likely other shells).

Personally, I prefer the second option over the first. The second option uses pip, and thus files like pyproject.toml are regarded, whereas in the first they are not. Likewise, this functionality should work on PyPI; I am unsure at this moment how you would specify to include tests in the package, though, so that may be a moot point. The second solution is also far less code and more maintainable across subpackages IMO.

Automate gh-pages deployment evaluation_tools version tracking

Evaluation tools 1.0.0 -- this is the incorrect version.

As shown, right now the version number requires manual updating. I need to explore how to improve this.

Tangentially, I think we could improve version bumping for this package and the included sub-packages. I know that bump2version is commonly used, but I think it makes sense to do some research to find the best option.

Evaluate the influence of NaN values on event detection

Possible bug. The event detection methods validate the index, but not the values. The presence of NaN in the value series may produce undefined behavior given the recursive nature of the filters. I want to see if this case needs handling.

Update GitFlow documentation

Further elaborate the GitFlow used in this repo. Elucidate the use of the develop branch, forking, pull requests, and how to pull upstream changes into your fork to continue work.

nwis-client latest builds broken when installed on Python 3.6

Importing the latest nwis-client using python 3.6

from hydrotools.nwis_client.iv import IVDataService

fails with

TypeError: 'type' object is not subscriptable

The error may be more ubiquitous than just the IVDataService import, but that is where we have had trouble with it.

A brief conversation with the developers suggested that a downstream dependency forces a bump to python 3.7 as the minimum requirement, and they are considering workarounds to allow backward compatibility. The newer version is worth the effort, with order-of-magnitude faster retrieval speeds from the NWIS service. We do not have any real reason to continue using 3.6, so we may look at an upgrade path.

Workaround

For now, uninstalling the nwis_client, _restclient, and events modules

pip uninstall hydrotools.nwis_client
pip uninstall hydrotools._restclient
pip uninstall hydrotools.events

then reinstalling the following versions allowed us to continue using the service in the meantime on python 3.6.8

pip install hydrotools._restclient==2.0.0a0
pip install hydrotools.nwis_client==2.0.0a0

Document best practices to use event detection.

Update event detection docstring with some suggested parameters, something like:

Guide to applying event detection using time series decomposition

The evaluation_tools.events.event_detection.decomposition method has two main parameters: halflife and window. These parameters are passed directly to underlying filters used to remove noise and model the underlying trend (AKA baseflow) in a streamflow time series. Significant contiguous deviations from this trend are flagged as "events". This method was originally conceived to detect rainfall-driven runoff events in small watersheds from records of volumetric discharge or total runoff. Before using decomposition you will want to have some idea of the event timescales you hope to detect in your original time series. A minimal usage sketch follows the advice below.

Specific Advice

  1. Ensure your time series is monotonically increasing, with a sampling interval significantly shorter than the timescale of the events you hope to detect (this could/should be checked on the module side)
  2. Use pandas.Timedelta-compatible str to specify halflife and window
  3. Specify a halflife larger than the expected frequency of noise, but smaller than the event frequency/timescale
  4. Specify a window larger than the event frequency/timescale, but at least 4 to 8 times smaller than the entire length of the time series.
  5. Filter/Refine the final list of events to remove false events from especially noisy signals
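A minimal usage sketch under these assumptions (synthetic data; the list_events function name is an assumption here):

import pandas as pd
from hydrotools.events.event_detection import decomposition

# Synthetic hourly streamflow with a timezone-naive DatetimeIndex
index = pd.date_range("2021-01-01", periods=24 * 30, freq="H")
flow = pd.Series(1.0, index=index)
flow["2021-01-10":"2021-01-12"] += 10.0  # inject a single "event"

# halflife larger than the noise timescale, window larger than the
# event timescale, per the advice above
events = decomposition.list_events(flow, halflife="6H", window="7D")
print(events)  # expect a table of event start and end times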

Cache GCP client responses to reduce repeated operations and network calls

Each time the gcp client is used, it hits GCP to get the requested data. Given the size of the data and the repeated retrieval process, it only makes sense to implement some kind of cache. I propose that we use a file db (i.e. sqlite) to accomplish this, for simplicity and broad support in python.

High level logic

  1. check if db cache has desired data (stored as df)
    1. yes - return data
    2. no - continue
  2. get data from gcp
  3. create a df from the gcp data
  4. cache this df in an sqlite db using the URL path as the key
  5. return the df

Requirements

  • sqlite lib must support multiprocessing and batch commits

The sqlitedict library is mature, maintained, and seems to fit the bill for this feature. The lib lets you create/connect to a db and use it like you would a python dictionary. Most importantly, it supports multiprocessing. A sketch of this pattern is given below.
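A minimal sketch of the high-level logic above using sqlitedict (the key scheme and fetch callable are illustrative assumptions):

import pandas as pd
from sqlitedict import SqliteDict

def get_with_cache(url, fetch, cache_file="gcp_cache.sqlite"):
    with SqliteDict(cache_file) as cache:
        if url in cache:      # 1. cache hit: return the stored df
            return cache[url]
        df = fetch(url)       # 2-3. retrieve from GCP and build a df
        cache[url] = df       # 4. cache the df keyed by URL path
        cache.commit()
        return df             # 5. return the df

# Example with a dummy fetcher standing in for the GCP retrieval
frame = get_with_cache("gcp/path/to/file", lambda url: pd.DataFrame({"value": [1.0]}))
print(frame)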

Return empty df to avoid pd.concat Value Error in iv get method

@hellkite500 found an edge case where an nwis_client get request can throw a ValueError when pd.concat has nothing to concatenate. This often occurs when a user asks for data from a site that is listed as 'active' even though the gage is not actively reporting. This should just return an empty df with the typical field headers present.

https://github.com/NOAA-OWP/evaluation_tools/blob/1e3b0701f68977d16371740df802fc180445024f/python/nwis_client/evaluation_tools/nwis_client/iv.py#L141

Update CONTRIBUTING.md

The contents of CONTRIBUTING.md, specifically the guide to creating a new subpackage, are a little outdated. This is just a placeholder for when someone finds time to update the CONTRIBUTING.md document.

Implement interface to disable caching for _restclient and nwis_client

Currently, nwis_client.iv.IVDataService sets up caching on initialization. As near as I can tell _restclient.RestClient also decides whether to cache at initialization. The result is that there is no native way to disable caching prior to or when requesting data.

Previously, I've successfully used the requests_cache context manager to temporarily disable caching, but we might want to offer an interface through the tools themselves.
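For reference, a minimal sketch of that requests_cache workaround (cache name and URL are illustrative):

import requests
import requests_cache

# Globally cache requests made through the requests library
requests_cache.install_cache("hydrotools_cache")

# Temporarily bypass the cache for a single block of calls; note the
# thread-safety caveat raised in a later issue below
with requests_cache.disabled():
    response = requests.get("https://httpbin.org/ip")
print(getattr(response, "from_cache", False))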

nwis_client: Retrieving large amounts of data takes too long

Case study: A user needs to conduct a long evaluation and wants 25 years of streamflow data from 2000+ USGS gage locations.

Obstacles:

  1. We currently only parallelize by site, can we parallelize by chunking in time?
  2. How do we get around limited memory?
  3. Will this break requests_cache?
  4. Can a DataFrame cache help us here?

Based on a use-case from @hellkite500

@aaraney not necessarily looking for solutions here yet. Just wanted to start the discussion. The nwis_client tool was really the first tool we fleshed out. So, it seems fitting to start discussions about scaling evaluation_tools here.

Disabling requests-cache is not thread safe. Discussion of the future of _restclient's usage of requests-cache

Due to the way requests-cache (used in _restclient) is implemented, if the cache has been "installed" and a downstream package uses requests, the downstream package's requests will be cached. In other words, requests_cache implicitly changes the behavior of the requests package for all downstream callers.

Per requests-cache's documentation, they do provide a context manager to "disable" the cache. However, as @christophertubbs pointed out their implementation is not thread safe:

@contextmanager
def cache_disabled(self):
    """
    Context manager for temporary disabling cache
    ::
        >>> s = CachedSession()
        >>> with s.cache_disabled():
        ...     s.get('http://httpbin.org/ip')
    """
    self._is_cache_disabled = True
    try:
        yield
    finally:
        self._is_cache_disabled = False

In discussing this issue with @hellkite500, he mentioned the (actively developed) project CacheControl. Before actively developing a fix for this behavior, it's worth exploring a bit and determining what long-term solution to implement.

Add examples and procedure for adding future examples to repo

For the benefit of users, it makes sense to demonstrate example usages of evaluation_tools sub-packages. As of now, examples are provided in docstrings and READMEs. This form of documentation is suitable for quick reference; however, there is a need for more complete example-style documentation.

@jarq6c has written a great example detailing peak flow analysis for Little Hope Creek. This example is expected to be the first added to the repo and, in doing so, to pave the way for future examples. That being said, once added to the repo, the Little Hope example should serve as a template for future example additions.

Outcomes of this addition

  • Maintain multiple ways for a user to find example content as well as interact with it
  • Require that examples take the form of a jupyter notebook as well as a standalone python script. This should allow a user to complete the example in the browser via a Google Colab notebook, locally in jupyter, or locally via python.
  • In support of users who are not as git-savvy or who just want the example material, each example should be downloadable in zipped as well as unzipped form. This will likely be handled by a GitHub action, meaning the example developer does not have to worry about maintaining/versioning zipped examples.
  • Display completed notebook examples in gh-pages deployment with the ability to open the example in google colab.

Example structure expectation

  1. Each example should sit under /examples/<name-of-example>
  2. Contain a README (/examples/<name-of-example>/README.md) that:
    1. briefly gives an overview of the example
    2. mentions the ways to interact with the example (Google Colab, local jupyter, local standalone) and links to Google Colab
    3. outlines the local installation process
    4. outlines how to run the standalone version
  3. Includes a jupyter notebook with the same name as the example.
  4. Includes a standalone python script with the same name as the example.
  5. Includes a requirements.txt file for installing reproducible dependencies.
  6. (this may change in the future) contains a zipped file of all the above.
