ltelab / disdrodb
A global database of disdrometer measurements
Home Page: https://disdrodb.readthedocs.io/en/latest/
License: GNU General Public License v3.0
Hi guys,
I'll start keeping public track of some of the common errors I encounter; I guess it will be useful in the future!
When running the run_DELFT_processing.py
script on the files for the PAR002 instrument, I get a KeyError
when the program tries to convert the Parquet file to netCDF. Here is the log response:
There are the following metadata files without corresponding data: ['PAR001']
- L0 processing of station_id PAR002 has started.
- 79 files to process in /home/sguzzo/Parsivel/RAW_TELEGRAM/CABAUW
- 0 of 1 have been skipped.
- Conversion to Apache Parquet started.
- Conversion to Apache Parquet ended.
- L0 processing of station_id PAR002 ended in 1.34s
- L1 processing of station_id PAR002 has started.
- Reading L0 Apache Parquet file at /home/sguzzo/Parsivel/Processed/CABAUW/L0/CABAUW_sPAR002_20211008.parquet started
- Reading L0 Apache Parquet file at /home/sguzzo/Parsivel/Processed/CABAUW/L0/CABAUW_sPAR002_20211008.parquet ended
- Retrieval of L1 data matrix started.
- Retrieval of L1 data matrix finished.
Traceback (most recent call last):
File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 504, in <module>
main()
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 466, in main
write_L1_to_netcdf(ds, fpath=fpath, sensor_name=sensor_name)
File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/L1_proc.py", line 319, in write_L1_to_netcdf
encoding_dict = {k: encoding_dict[k] for k in ds.data_vars}
File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/L1_proc.py", line 319, in <dictcomp>
encoding_dict = {k: encoding_dict[k] for k in ds.data_vars}
KeyError: 'sensor_temperature_PCB'
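A defensive variant of that dict comprehension (a sketch; `subset_encodings` is a hypothetical helper, not part of disdrodb) would skip variables that have no entry in the encoding dictionary instead of raising a KeyError:

```python
def subset_encodings(encoding_dict, data_vars):
    """Return encodings only for the variables that actually have one.

    Variables missing from encoding_dict (e.g. 'sensor_temperature_PCB')
    are reported and fall back to default encodings instead of crashing.
    """
    missing = [var for var in data_vars if var not in encoding_dict]
    if missing:
        print(f"No encoding specified for: {missing}")
    return {var: encoding_dict[var] for var in data_vars if var in encoding_dict}
```

The proper fix is of course to add the missing key to the encoding standards, but this keeps the pipeline running in the meantime.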
Example of raw data file that can be used to reproduce the error: 20211007.zip
Python version: 3.9.*
This issue aims to propose the idea of creating a separate repository called disdrodb_archive
where to host the sample data and metadata for each station and reader.
The main goals are to:
Related to this point, in the DISDRODB full archive we could better enforce the format and structure of the raw files within each station directory, for example by banning heterogeneous archive/compression formats (tar, gzip, bz2, ...) or nested directory structures.
For portability of the DISDRODB raw archive, it might be useful to have all the files of a station zipped into a single directory ... especially for stations where the deployment has terminated and we do not expect a further data stream.
This feature request will require the development of code to synchronize between this repo metadata and the DISDRODB Full Archive.
In the context of this PR ... please deprecate the use of station_id in favor of station_name.
The station directory must have the same (hopefully self-explanatory) name as the corresponding station_name key in the metadata.
The naming of the branch is not clear to me in the contributors_guidelines.
It is currently defined as follows:
reader-<institute>-<campaign>
On the reader page it is defined as follows:
Guidelines for the name of the institution or country folder:
We use the institution name when campaign data spans more than one country.
We use the country name when all campaigns (or sensor networks) are inside a given country.
which is a bit clearer to me.
Can I update the contributors_guidelines with the explanations from the reader page?
I'm working on the Delft dataset. If I open one station YAML file, I see that campaign = DELFT
and country = Netherlands.
So I've named my branch reader-Netherland-delft. Am I right @ghiggi? I would rather consider Delft as the institution...
What if the Delft university works outside the Netherlands?
reader-delft-blabla
Raw
and under Raw/NETHERLANDS
Why not always have the structure Raw/Country/Institution/...?
To be honest, I find mixing country and institution a bit confusing.
In the documentation, specify the files that are expected when adding a new reader (only the reader parser_HOLLAND.py?).
What's your opinion on that @ghiggi?
Proposal for improvements of temp_parser.py
Modify Readers args to enforce folder structure
Replace the raw_dir and process_dir arguments of the readers with base_directory, data_source and campaign_name.
/!\ Must update reader_template.py + the tutorial notebook + all existing readers.
Is your feature request related to a problem? Please describe.
We need to ensure that the time epoch does not vary between netCDF files.
I suggest fixing it to the Unix EPOCH.
The code should be added either in the L0.standards.get_L0B_encodings_dict or in the
L0.L0B_processing.write_L0B function.
Describe the solution you'd like
EPOCH = "seconds since 1970-01-01 00:00:00"  # define once near the top of the encoding code
# ....
encoding = ds["time"].encoding
encoding["units"] = EPOCH
encoding["calendar"] = "proleptic_gregorian"
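As a self-contained sketch of the same idea (the helper name is mine, not disdrodb's), the fixed epoch could be applied to the time encoding just before writing:

```python
EPOCH = "seconds since 1970-01-01 00:00:00"  # Unix epoch

def fix_time_encoding(encoding):
    """Return a copy of a time variable's encoding pinned to the Unix epoch.

    Fixing units and calendar guarantees that the time reference does not
    vary between the produced netCDF files.
    """
    encoding = dict(encoding)  # avoid mutating the caller's dict
    encoding["units"] = EPOCH
    encoding["calendar"] = "proleptic_gregorian"
    return encoding

# Intended usage with xarray (not executed here):
#   ds["time"].encoding = fix_time_encoding(ds["time"].encoding)
#   ds.to_netcdf(fpath)
```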
Hi guys, I'm opening the issue here so we can have a public discussion!
Recently @mschleiss has experienced an issue while trying to use ncdump
on the netCDF files converted with the [latest version of] parser_RASPBERRY.py.
The error that Marc was receiving was
NetCDF: HDF error
Location: file ; line 1705
which was pretty uninformative.
The header dump he got (marc_headers.txt) was incomplete, with the first failing line being string weather_code_METAR_4678(time) ;
Also my colleague Rob faced a similar issue using Matlab.
I was able to open the "corrupted" files with any method (nco, xarray, etc.) on my machine.
When I finally ran either the nccopy operator (without any additional options) or ncks -4 -L 5 notworking.nc good.nc to create new files out of the corrupted ones, both Marc and Rob were finally able to use them.
Please find here PAR001.tar.gz a few example files, both not-working and working ones.
I'm curious to know what you guys think of this!
Thanks :)
Add the root folder as a path variable
We removed the get_L0_dtype_standards function and replaced it with disdrodb.l0.standards.get_L0A_dtype.
Yes, you should pass the sensor_name argument.
reader_preparation.ipynb
Notebook for visual data exploration: reader_preparation.ipynb
to use the error-handling config file (in data/.. issue ../ yml)
When using Parsivel2 as the entry for the metadata's sensor_name key, I receive a TypeError and the script stops executing.
It should ideally be possible to use any string as sensor_name's value: Parsivel completes execution, while Parsivel2 raises an error!
The full traceback:
There are the following metadata files without corresponding data: ['PAR002', '20', 'testconki', 'PAR003']
- L0 processing of station_id PAR001 has started.
- 1 files to process in /home/sguzzo/Parsivel/RAW_TELEGRAM
- Conversion to Apache Parquet started.
- Conversion to Apache Parquet ended.
Traceback (most recent call last):
File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 487, in <module>
main()
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 407, in main
check_L0_standards(fpath=fpath,
File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/check_standards.py", line 48, in check_L0_standards
if not df[column].between(*dict_field_value_range[column]).all():
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/series.py", line 5110, in between
lmask = self >= left
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/ops/common.py", line 69, in new_method
return method(self, other)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/arraylike.py", line 52, in __ge__
return self._cmp_method(other, operator.ge)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/series.py", line 5502, in _cmp_method
res_values = ops.comparison_op(lvalues, rvalues, op)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 284, in comparison_op
res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 73, in comp_method_OBJECT_ARRAY
result = libops.scalar_compare(x.ravel(), y, op)
File "pandas/_libs/ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>=' not supported between instances of 'str' and 'int'
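The crash happens because the column still holds strings when .between() compares it against integer bounds. A defensive sketch (a hypothetical helper, not the actual check_L0_standards code) would coerce the column to numeric before the range check:

```python
import pandas as pd

def column_in_range(df, column, vmin, vmax):
    """Check a value range robustly, even if the column has object/str dtype.

    Non-numeric entries become NaN and are excluded from the check, so the
    comparison never mixes str and int.
    """
    values = pd.to_numeric(df[column], errors="coerce")
    return bool(values.dropna().between(vmin, vmax).all())
```

Whether out-of-range or unparsable values should instead fail loudly is a design choice to discuss.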
Python version: 3.9.*
Rename / change location:
disdrodb/L0/dev_tools.py --> template_tools.py
disdrodb/L0/utils_cmd.py --> disdrodb/pipeline/utils_cmd.py
While working with the Jupyter notebook, the log created under data\DISDRODB\Processed\NETHERLANDS\DELFT blocks the deletion and recreation of the processed folder.
In raw_dir, processed_dir = check_directories(raw_dir, processed_dir, force=False)
If force=False --> no error should be raised if the folder already exists.
To reproduce, just run the cell under "2. Initialization" twice.
Solution
Function to automatically list all available readers.
@ghiggi Okay, we're ready for the big leap: now the notebook is a line-by-line replacement of reader_template.py; people can just copy it with their own data to avoid confusion. There should be no reason to keep this "templates" folder!
Let's be consistent with collaborative open-source good practices: if you want to see what people are doing, you can always go to their forks, but they shouldn't have to commit half-baked WIP reader developments into the main repo.
I'm going ahead and making the bold move, so we can then rename every "parser" as "reader" to avoid confusion (see #46).
Working on the reader for the Netherlands, I've noticed that some dates are empty in the Parquet file. The exact same data frame written as CSV outputs no empty dates. These lost dates seem to be formatted correctly.
The processed folder on the LTE NAS contains these empty dates, which is, in my opinion, wrong.
If the source file has correctly formatted dates, they should be replicated into the Parquet file.
Pickle file here:
import pandas as pd
df = pd.read_pickle(r"<your_path>\sample_df.pkl")
df.to_parquet(r"<your_path>\sample_df.parquet")
df.to_csv(r"<your_path>\sample_df.csv")
- OS: Windows
- Python: 3.8.10
@ghiggi please react if you are already aware of this behaviour. I'm working on it now.
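One plausible cause (an assumption on my side, to be verified against sample_df.pkl; the column name "time" is also assumed) is that the date column has object dtype: to_csv stringifies it fine, while the Parquet writer needs a proper datetime64 column to round-trip dates. A sketch of a pre-write check:

```python
import pandas as pd

def ensure_datetime_column(df, column="time"):
    """Cast a date column to datetime64 before writing to Parquet.

    Parquet round-trips dates reliably only when the column dtype is
    datetime64; object-dtype date strings may come back empty or mangled.
    """
    if not pd.api.types.is_datetime64_any_dtype(df[column]):
        df = df.assign(**{column: pd.to_datetime(df[column], errors="coerce")})
    return df
```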
Define settings to block direct commits on branches. PRs must be compulsory.
We should add to the docs a table detailing the meaning of each metadata key.
Also, based on the discussions with the community across various institutes, the default metadata dictionary must be updated.
Hi guys, I just wanted to mark this very first GitHub issue with a note of appreciation for this work!! Thanks!!!
@KimCandolfi @ghiggi
The import statements for the parser module need to be updated.
Current line (points to the wrong folder):
from disdrodb.utils.parser import get_parser_cmd
Replace with:
from disdrodb.pipeline.utils_cmd import get_parser_cmd
Is your feature request related to a problem? Please describe.
Installing the pip/conda environment on macOS does not seem to work properly.
Describe the solution you'd like
Define the correct installation process.
Describe alternatives you've considered
It will work
Additional context
None
The index at the beginning of https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst is outdated.
This doc section should be updated:
Before submitting your contribution, please make sure to take a moment and read through the following guidelines:
[Code of Conduct](https://github.com/ltelab/disdrodb/blob/main/CODE_OF_CONDUCT.md)
[Issue Reporting Guidelines](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#issue-reporting-guidelines)
[Pull Request Guidelines](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#pull-request-guidelines)
[Development Setup](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#development-setup)
[Project Structure](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#project-structure)
[Github Flow](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#github-flow)
[Commit Lint](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#commit-lint)
environment.yml with specific versions
Currently the doc here https://disdrodb-dev.readthedocs.io is listening to https://github.com/EPFL-ENAC/LTE-disdrodb/tree/docs
To replicate on the main repo.
When creating a netCDF file from an xarray object in L0B_processing.py, the code crashes as soon as you try to add compression on string variables.
Set the compression level to 0 for weather_code_metar_4678 and weather_code_nws in L0B_encoding.yml.
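Equivalently at the encoding-dict level (a sketch; the dict layout mirrors what xarray's to_netcdf accepts, not the exact L0B_encoding.yml schema), compression can be disabled for the string variables just before writing:

```python
def disable_string_compression(encoding_dict,
                               string_vars=("weather_code_metar_4678",
                                            "weather_code_nws")):
    """Turn off zlib compression for string variables.

    HDF5-backed netCDF4 files can fail when variable-length string
    variables are written with compression enabled.
    """
    for var in string_vars:
        if var in encoding_dict:
            encoding_dict[var]["zlib"] = False
            encoding_dict[var]["complevel"] = 0
    return encoding_dict
```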
Do you want us to implement automated tests for new readers provided by the community? If so, do we need contributors to add their sample data for a specific reader in order to test it? We could require them to upload a data sample with each new reader...
What do you think?
We should document how a single raw file can alternatively be processed to L0 netCDF.
There are people who might want to just exploit some functionality provided by disdrodb.
I think the following example could be useful to a lot of people:
# Import the relevant packages
from disdrodb.L0 import read_raw_data, cast_column_dtypes, create_L0B_from_L0A, set_encodings
# Specify the filepath of a single raw text file
filepath = "/file/path/to/your/raw/text/file.txt"
# Define the sensor type
sensor_name = "OTT_Parsivel"
# Define processing mode
lazy = False
# Define (dummy) attribute dictionary to enable further processing
# --> The attrs dictionary will be attached to the output xr.Dataset (and netCDF4)
attrs = {}
attrs["sensor_name"] = sensor_name
attrs["latitude"] = "-9999"
attrs["longitude"] = "-9999"
attrs["altitude"] = "-9999"
attrs["crs"] = "dummy"
# Specify here the required reader_kwargs, column_names and df_sanitizer_fun
# --> You can copy them from a specific reader
reader_kwargs = {}
column_names = []
def df_sanitizer_fun(df, lazy=False):
    return df
# Read the raw file
df = read_raw_data(filepath, column_names, reader_kwargs, lazy=lazy)
# Sanitize the dataframe to meet the DISDRODB standard columns
df = df_sanitizer_fun(df, lazy=lazy)
print(df)
# Cast the column dtypes to match the DISDRODB standards
df = cast_column_dtypes(df, sensor_name)
print(df)
# Derive the corresponding xr.Dataset
ds = create_L0B_from_L0A(df, attrs, lazy=lazy, verbose=False)
print(ds)
# Set dataset encodings
# - This also converts object dtype into string
# - This also chunks the array in blocks
ds_encoded = set_encodings(ds.copy(), sensor_name)
print(ds_encoded)
# Write your DISDRODB L0 netCDF4
ds_encoded.to_netcdf("/tmp/dummy.nc")
Before adding this example to the docs, we need to add the following imports to the disdrodb.L0.__init__ file:
from .L0A_processing import read_raw_data, cast_column_dtypes
from .L0B_processing import retrieve_L0B_arrays, create_L0B_from_L0A, set_encodings
disdrodb/
├── processing
├── L0
├── L1
├── L2
├── pipelines
├── api
├── utils
├── configs
├── data
├── docs
├── references
.gitignore
LICENSE
CONTRIBUTING.md
README.md
requirements.txt
Hey guys, I am running the library over some files and I was wondering if there is any option to create individual netCDF files, instead of a single one containing the data from all the raw data files.
Thanks!
Webhooks listening to Tagged releases
The names of the files produced by DISDRODB currently have the following structure:
<campaign_name>_s<station_id>_<optional_suffix>.<file_extension>
which results, for example, in EPFL_2011_s1.nc
I suggest changing the file name structure to something more appropriate and informative similar to the following:
DISDRODB.<product_level>.<product_name>.<campaign_name>.<station_name>.<sensor_name>.s<start_time>.e<end_time>.p<production_time>.<version>.<file_extension>.
DISDRODB.L0B.Raw.EPFL2011.Campus1.OTT_Parsivel2.s20220125000000.e20220130000000.p20220130000000.V01.nc
This file structure will enable relevant filtering operations without the need to open the files.
To implement such a file structure we might want to enforce and document well that:
- campaign_name, station_name and sensor_name cannot contain the delimiter .
- sensor_name should use - instead of _
For the time components we could choose between YYYYMMDDhhmmss and YYYYDOYhhmmss.
Suggestions are very welcome !!!
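A sketch of a builder for the proposed structure (the helper name and the ValueError policy are mine), which also enforces the no-delimiter rule on the name fields:

```python
DELIM = "."

def build_disdrodb_filename(product_level, product_name, campaign_name,
                            station_name, sensor_name, start_time, end_time,
                            production_time, version, extension):
    """Assemble a DISDRODB filename following the proposed convention."""
    name_fields = [product_level, product_name, campaign_name,
                   station_name, sensor_name]
    for field in name_fields:
        if DELIM in field:
            # Name fields must not contain the delimiter, per the proposal
            raise ValueError(f"Field {field!r} must not contain {DELIM!r}")
    return DELIM.join(["DISDRODB", *name_fields,
                       f"s{start_time}", f"e{end_time}", f"p{production_time}",
                       version, extension])
```

With the example values above, this reproduces DISDRODB.L0B.Raw.EPFL2011.Campus1.OTT_Parsivel2.s20220125000000.e20220130000000.p20220130000000.V01.nc.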
@charlottegiseleweil is it normal that we don't have the link to the Jupyter notebook in the doc?
https://disdrodb-dev.readthedocs.io/en/v0.0.6/readers.html#adding-a-new-reader-tutorial ?
Did you remove it on purpose?
The link should be here...
These imports and code lines are outdated and should be removed.
to interact as a package
Is your feature request related to a problem? Please describe.
I think the code readability of the L0 readers would improve if we manage to compact the list of click commands that are present above the reader function definition.
Describe the solution you'd like
@click.command() # options_metavar='<options>'
@click.argument('raw_dir', type=click.Path(exists=True), metavar='<raw_dir>')
@click.argument('processed_dir', metavar='<processed_dir>')
@click.option('-L0A', '--L0A_processing', type=bool, show_default=True, default=True, help="Perform L0A processing")
@click.option('-L0B', '--L0B_processing', type=bool, show_default=True, default=True, help="Perform L0B processing")
@click.option('-k', '--keep_L0A', type=bool, show_default=True, default=True, help="Whether to keep the L0A Parquet file")
@click.option('-f', '--force', type=bool, show_default=True, default=False, help="Force overwriting")
@click.option('-v', '--verbose', type=bool, show_default=True, default=False, help="Verbose")
@click.option('-d', '--debugging_mode', type=bool, show_default=True, default=False, help="Switch to debugging mode")
@click.option('-l', '--lazy', type=bool, show_default=True, default=True, help="Use dask if lazy=True")
@click.option('-s', '--single_netcdf', type=bool, show_default=True, default=True, help="Produce single netCDF")
def main(raw_dir,
processed_dir,
L0A_processing=True,
L0B_processing=True,
keep_L0A=False,
force=False,
verbose=False,
debugging_mode=False,
lazy=True,
single_netcdf = True,
):
would become
# Define this in some file
def readers_click_options(function):
function = click.argument('raw_dir', type=click.Path(exists=True), metavar='<raw_dir>')(function)
function = click.argument('processed_dir', metavar='<processed_dir>')(function)
function = click.option('-L0A', '--L0A_processing', type=bool, show_default=True, default=True, help="Perform L0A processing")(function)
function = click.option('-L0B', '--L0B_processing', type=bool, show_default=True, default=True, help="Perform L0B processing")(function)
function = click.option('-k', '--keep_L0A', type=bool, show_default=True, default=True, help="Whether to keep the L0A Parquet file")(function)
function = click.option('-f', '--force', type=bool, show_default=True, default=False, help="Force overwriting")(function)
function = click.option('-v', '--verbose', type=bool, show_default=True, default=False, help="Verbose")(function)
function = click.option('-d', '--debugging_mode', type=bool, show_default=True, default=False, help="Switch to debugging mode")(function)
function = click.option('-l', '--lazy', type=bool, show_default=True, default=True, help="Use dask if lazy=True")(function)
function = click.option('-s', '--single_netcdf', type=bool, show_default=True, default=True, help="Produce single netCDF")(function)
return function
## In each reader
# Add the import of readers_click_options
# And modify as follow
@click.command()
@readers_click_options
def main(raw_dir,
processed_dir,
L0A_processing=True,
L0B_processing=True,
keep_L0A=False,
force=False,
verbose=False,
debugging_mode=False,
lazy=True,
single_netcdf = True,
):
@regislon can you take care of that?
Related to this refactor ... do we maybe want to change the name of the function from main to reader?
Configuring tag protection rules on main repo
see step here : https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/managing-repository-settings/configuring-tag-protection-rules
Python version: 3.9
The current environment.yml contains lots of unused packages. The aim here is to keep only the packages needed to run DISDRODB and to generate the documentation.
Remove the warning and publish the first doc v0.0.1.
Please @ghiggi, could you confirm that we can publish this first version and remove the warning?