pik-primap / primap2

The next generation of the PRIMAP climate policy analysis suite

Home Page: https://primap2.readthedocs.io

License: Apache License 2.0

Languages: Python 99.38%, Makefile 0.62%

Topics: climate-change, climate-data, climate-policy

primap2's Introduction

PRIMAP2


PRIMAP2 is the next generation of the PRIMAP climate policy analysis suite. PRIMAP2 is free software; you are welcome to use it in your own research. The documentation can be found at https://primap2.readthedocs.io.

Structure

PRIMAP2 is:
  • A flexible and powerful data format built on xarray.
  • A collection of functions for common tasks when wrangling climate policy data, like aggregation and interpolation.
  • A format for data packages built on datalad, providing metadata extraction and search on a collection of data packages.

Status

PRIMAP2 is in active development, and not everything promised above is built yet.

License

Copyright 2020-2022, Potsdam-Institut für Klimafolgenforschung e.V.

Copyright 2023-2024, Climate Resource Pty Ltd

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

PRIMAP2 incorporates parts of xarray and pint_xarray, which are available under the Apache License, Version 2.0 as well. The full text of the xarray copyright statement is included in the licenses directory.

Citation

If you use this library and want to cite it, please cite it as:

Mika Pflüger and Johannes Gütschow. (2024-07-08). pik-primap/primap2: PRIMAP2 Version 0.11.1. Zenodo. https://doi.org/10.5281/zenodo.12683509

primap2's People

Contributors

jguetschow, mikapfl, pre-commit-ci[bot]


primap2's Issues

read in additional coordinates

At the moment, it is not possible to specify additional coordinates to be read in using the pm2io functions. We should add this functionality.

One problem to solve will be how to represent additional coordinates in the interchange format.
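
A hypothetical sketch of what the API could look like; the add_coords_cols parameter and its (column, dimension) mapping are assumptions here, not existing API:

import primap2 as pm2

data_if = pm2.pm2io.read_wide_csv_file_if(
    "data.csv",
    coords_cols={"area": "country", "category": "cat_code", "unit": "unit", "entity": "gas"},
    coords_defaults={"source": "EXAMPLE"},
    coords_terminologies={"area": "ISO3", "category": "IPCC2006"},
    # proposed: an additional coordinate "category_name", attached to the
    # "category" dimension and read from the column "name_column"
    add_coords_cols={"category_name": ("name_column", "category")},
)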

nir_convert_df_to_long removes nan values

Describe the bug
Currently the function primap2.pm2io.nir_convert_df_to_long removes values interpreted as nan. This includes e.g. the string "NA". This contradicts the option to specify a mapping to nan and 0 when converting the data to primap2 format, so it should be changed for consistency.

Expected behavior

Keep nan values, so the mapping can be made explicit.

It is easy to fix, just add 'dropna=False' to the stack command in line 167 of _GHG_inventory_reading.py.
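
A minimal sketch of the proposed fix (pandas' DataFrame.stack supports this keyword; the variable name is a stand-in):

# in _GHG_inventory_reading.py, around line 167:
df_long = df.stack(dropna=False)  # keep NaN-like values instead of silently dropping them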

Additional context

It's not worth its own branch and pull request, so this issue is just a reminder to fix it with the next set of changes.

decide the future of `source`

At the moment, we have source as a dim, but only allow a single source in a dataset. That actually gives us the worst of all worlds: only a single source, incompatibilities when doing arithmetic with different sources, and more dimensions, which always hurt somewhat.

We have three options for what to do instead:

A single source, in attrs.

Advantages:

  • One less dimension.
  • Direct arithmetic with different sources.

Disadvantages:

  • Multiple sources always have to be held in multiple datasets. If working with many sources, for example when plotting differences between sources, for loops have to be used.

One or more sources, in dim

Advantages:

  • Select for source like for area.
  • Explicit arithmetic with different sources (e.g. da1.loc[{'source': 'FAO'}] + da2.loc[{'source': 'Andrew'}]).
  • Multiple sources can be held in a single dataset. Working with many sources becomes fluent.

Disadvantages:

  • When working with a single source, the additional dimension makes representations larger, makes tabular display in pycharm more difficult, etc.
  • Explicit arithmetic with different sources (more to type in the "working with two different sources, I know what I'm doing" use case).

Hybrid, both allowed

Advantages:

  • When working with one source, all advantages of the attrs solution.
  • When working with multiple sources, all advantages of the dim solution.

Disadvantages:

  • Shared functions need to consider both cases.
  • Mixing Datasets using one style with Datasets using the other style leads to surprising results.
  • Explicit conversions necessary.

My gut reaction at the moment (especially now that we have da.pr.set() and therefore don't have to deal with da.loc[sel] = array[..., np.newaxis] anymore, the np.newaxis really pissed me off) would be to standardize on "one or more sources, in dim".

Interchange format

Is your feature request related to a problem? Please describe.

Reading data from csv files requires several steps to harmonize metadata and dimensions. It would be great to have an interchange format that has all metadata and dimension names as required for PRIMAP2 but is still in a table-like format, as in a csv file. This format could be used to exchange data with others in a specified format which is easy to export from and import to PRIMAP2 and other tools.

Describe the solution you'd like

For exporting, a function "to_interchange_format" should be added to the "pr" xarray accessor. For importing, a function "from_interchange_format" should be added to the "io" module, which is currently under development. Data reading from other formats should then first create an interchange format dataset (e.g. as a pandas dataframe), which can then be transformed into the PRIMAP2 xarray data format.
This would not only make our code easier to read but also enable better reuse of data reading functions outside of PRIMAP2.

The interchange format has every mandatory dimension + unit + entity + time points as columns. Optional dimensions can be present, but don't have to be. The column names follow the dimension names in PRIMAP2.

An open question is how to store the attrs in the interchange format. It is possible to store them in columns, but repeating information like the reference in every row is a waste of space. For storage in memory we could try to use the pandas dataframe attrs, though the feature is still experimental. For storage on disk we should find a format which consists of a csv file with an additional metadata file.
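
A minimal sketch of what such a table could look like in memory; the column names follow the conventions described above, and the concrete values are made up:

import pandas as pd

df = pd.DataFrame(
    {
        "area (ISO3)": ["DEU", "FRA"],
        "category (IPCC2006)": ["1.A", "1.A"],
        "entity": ["CO2", "CO2"],
        "unit": ["Gg CO2 / yr", "Gg CO2 / yr"],
        "2000": [1.0, 2.0],  # one column per time point
        "2001": [1.1, 2.1],
    }
)
df.attrs = {"references": "example reference"}  # experimental pandas feature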

CSV reading with non-comma separators

Is your feature request related to a problem? Please describe.

When reading data from csv files, only files with comma as separator can be read using the read...csv_file_if functions, because the sep parameter of pd.read_csv cannot be specified.

Describe the solution you'd like

Add a sep parameter to the primap2 csv reading routines.

Describe alternatives you've considered

Read the csv using a pandas function and convert in a second step.
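
The fix itself would just forward a sep argument to pd.read_csv; until then, a workaround sketch (file names are examples):

import pandas as pd

df = pd.read_csv("data_semicolon.csv", sep=";")
df.to_csv("data_comma.csv", index=False)  # now readable by the read...csv_file_if functions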

sparse data

We need more efficient handling of sparse data. Thinking about sparseness for every data set is worse than just having algorithms which handle sparse data.

In xarray (https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html?highlight=sparse#integrating-with-duck-arrays and https://github.com/pydata/xarray/blob/main/xarray/tests/test_sparse.py) this functionality is still experimental, but it could be pretty drop-in. Maybe we should just try to make a primap2 array sparse and see if our tests still pass; that would be great!
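
A minimal experiment sketch, assuming the sparse package (which xarray's duck-array support targets); the dimension names are examples:

import numpy as np
import sparse
import xarray as xr

dense = xr.DataArray(np.eye(100), dims=("area", "category"))
ducked = dense.copy(data=sparse.COO.from_numpy(dense.data))  # sparse-backed DataArray
print(ducked.data.density)  # 0.01, most entries need no memory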

Order of attrs fields not stable in interchange format

Describe the bug

The order of the attrs fields in the interchange format is not always the same, leading to differences in the same data saved by different users. In the yaml file this is directly visible, but since we have seen differences in checksums of binary files, there might be a similar problem there.

Failing Test

No built-in test is known to fail. We noticed this when re-reading a dataset version in the Andrew cement data repository; see this PR.
@crdanielbusch can you clone primap2 and run make test to see if anything fails for you?

Expected behavior

Dataset metadata (and actual data) should always be ordered in the same way such that when saving with DataLad only actual data differences are detected as new and not reordering of metadata or data.

System (please complete the following information):
Original data read on Linux Mint, python 3.10.12, pandas 1.2.1, primap2 0.9.7, xarray 2023.10.1
Conflicting read on macOS: @crdanielbusch can you add your package versions here?
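
A sketch of the direction a fix could take for the yaml part, using PyYAML's deterministic key ordering; the attrs dict here is a stand-in for the interchange format metadata:

import yaml

attrs = {"title": "example", "references": "example"}  # stand-in metadata
# stable key order, so re-saving unchanged data yields byte-identical files
with open("dataset.yaml", "w") as fd:
    yaml.dump(attrs, fd, sort_keys=True)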

Harmonize parameter naming for selections/filters

Is your feature request related to a problem? Please describe.

Parameters that limit functions to a selection of the dataset use different names: in the downscaling function we use sel, and in the aggregation functions we use filter.

Describe the solution you'd like

Parameters that fulfill the same purpose should have the same name, so users don't have to remember different names for different functions.

Additional context

We've not used the new aggregation functions a lot yet, so it should be an acceptable amount of work to change the parameter name for these functions and adapt the code that already uses them.
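
Purely hypothetical call shapes to illustrate the inconsistency, based on the description above (other arguments elided):

# given a primap2 Dataset ds; the same restriction is spelled two ways:
ds.pr.downscale_timeseries(..., sel={"area (ISO3)": ["COL", "MEX"]})
ds.pr.add_aggregates_variables(..., filter={"area (ISO3)": ["COL", "MEX"]})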

AR5 CCF GWPs missing

Is your feature request related to a problem? Please describe.

Data using AR5 GWPs with carbon cycle feedbacks (AR5CCFGWP100) cannot be read into primap2.

Describe the solution you'd like
Integrate AR5 CCF GWPs

CSG: next steps

Some possible steps for the CSG development, grouped in one issue for now. They can be split into separate issues when we start working on them.

  • Entity selection: currently there is no way to specify the entity in the priority_definition and strategy_definition. We could use different definitions for different entities, but this would lead to definitions which are mostly identical except for some special cases, especially for the strategy_definition. Thus I think it would be great to enable entity specification in the definitions.
  • Numerical operations: to enable purely numerical operations (interpolation, extrapolation) in the csg process, we could add special tags in the definitions and special filling strategies. This would enable the final numerical extrapolation and interpolation in the CSG for simple cases (all cases in current PRIMAP-hist). It would also help with some of the current exception cases where we use numerical methods to help with matching of inconsistent data. However, more advanced extrapolation methods, e.g. using proxy data, cannot be implemented using the CSG.
  • Wrapper: build a wrapper function for the csg that handles country and gas basket aggregation, can add regions, handles filling of priority coordinates etc. Most functionality already exists in the primap2 ecosystem but needs to be included in primap2 with proper tests etc.
  • Larger regression test: test with multiple time-series, comparable to PRIMAP-hist (run only on demand not in CI)
  • Filling algorithms: test basic filling algorithms and develop more advanced filling strategies based on the results

Validity checks for interchange format

Is your feature request related to a problem? Please describe.

When reading data into the primap2 interchange format, the rules are more lenient than for the native pm2 format, leading to errors on conversion from IF to native format. Examples:

  • Entity names can contain characters that confuse pint (e.g. HFC-32, where the '-' is interpreted as 'minus'); see the sketch after this list.
  • More examples to be added
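
For illustration, a sketch of the pint problem using openscm-units; the exact failure mode depends on the registry contents:

from openscm_units import unit_registry as ureg

ureg("Gg HFC32 / yr")   # works, HFC32 is a defined unit
ureg("Gg HFC-32 / yr")  # the '-' is parsed as a subtraction, not part of the name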

Describe the solution you'd like

Check for consistency with native format to make sure conversion is possible. Currently a validity check is only implemented for the native format.

Describe alternatives you've considered

Keep as is, to have fewer restrictions in the interchange format.


Test fail with current packages

Describe the bug

Tests fail after update-venv (all passed before).

Failing Test

make test had 14 failures. The errors seem to have something to do with the GWP contexts in openscm-units.

Expected behavior

All tests pass

System (please complete the following information):

  • OS: Linux mint
  • Python version: 3.8.10
  • package versions: only a few packages were updated, and most of them have nothing to do with the actual code (e.g. sphinx). However, it's not an obvious package like pandas or openscm-units itself. I'll add more info here when I have it.

Additional context

The error basically renders primap2 completely unusable, so fixing it (or freezing a package version) has the highest priority.

define/store terminology for entity names

Is your feature request related to a problem? Please describe.

So far, entities are stored as the data variable names in xarray Datasets, but the terminology is not stored anywhere. Therefore, the terminology is implied, and only a single terminology is possible for entities (like in primap1). This could lead to a rather large and confused "primap2_entity" terminology and prevents straightforward re-use of other terminologies.

Describe the solution you'd like

I think we should store the terminology for entity names in a primap2 dataset somewhere. I think there are two obvious places:

  1. In the dataset attrs. Then, all variables in a dataset share the same entity terminology. (terminology as metadata)
  2. In the variable names (in a format like CO2 (primap2) and NY.GDP.MKTP.KD (WB) or primap2|CO2 and WB|NY.GDP.MKTP.KD). Then, variables from different entity terminologies can be combined in the same dataset. (terminology as data)

The two solutions have different compromises:

  • readability is arguably better in the common primap2 case using solution number 1, no juggling of prefixes is necessary.
  • including the terminology in the data (solution 2) leads to a lot of repetition.
  • overlapping terminologies (whatever, maybe the primap2 terminology has one definition of F-gases and the FAOstat terminology has another) can lead to problems when using solution number 1, because it is possible to do calculations (e.g. summing two datasets with differing definitions of F-gases) which lead to wrong results.

In other places in primap2, we have chosen to include the terminology in the name of the dimension. That way, we don't have to repeat the terminology in each data point. Unfortunately, this solution is not possible for the entity, because there is no name of the entity dimension in xarray.

I think I would prefer solution number 1 (terminology as metadata) due to the readability and clear separation of data and metadata. The potential for confusion is acceptable for me because I expect that combining datasets from different sources (and therefore potentially different entity terminologies) is a rather uncommon operation which has to be done with care anyway (dealing with partially overlapping data and discontinuities etc.), so that also checking the entity terminology explicitly is not overly complicated.
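
A minimal sketch of solution number 1, given a primap2 Dataset ds; the attrs key name "entity_terminology" is an assumption here:

# terminology as metadata: one attrs key, variable names stay plain
ds.attrs["entity_terminology"] = "primap2"
co2 = ds["CO2"]  # interpreted in the primap2 entity terminology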

Describe alternatives you've considered

I considered a few other possibilities:

  • Store the terminology in the data variable's attrs. However, this repeats the terminology for every data variable and still doesn't protect from confusion when accidentally combining variables with different terminologies. For me it looks like it combines disadvantages of the two solutions I proposed.
  • Mandate a common terminology for all primap2 datasets, and enlarge this terminology as needed (the primap1 way of doing things). This is attractive because it will be most straight-forward in a strict primap2 context; however, the primap2 terminology will be sprawling and not very coherent, with ad-hoc added and single-use variable names. Examples for this problem from primap1: VariousGases means "CF4, C2F6, C3F8, C4F8, HFC-23, NF3, SF6", BYEMISS is poorly defined as "Base year emissions", VULN apparently means "vulnerability".

Implementation

To implement the change, if we go with the "terminology as metadata" solution, it is necessary to:

  • Change the detailed format description
  • Change the validation functions
  • Change the read* functions
  • Update examples
  • Adjust tests

Feature request: function to combine multiple datasets

A function which combines multiple, potentially overlapping datasets while properly treating overlapping information (e.g. raising an error or warning if the data in the different datasets diverges by more than a given percentage, etc.).
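
A rough sketch of such a function as a plain xarray helper; the name and the relative-tolerance check are assumptions, not primap2 API:

import numpy as np
import xarray as xr

def combine_checked(da1: xr.DataArray, da2: xr.DataArray, rtol: float = 0.01) -> xr.DataArray:
    """Combine two overlapping DataArrays, erroring if overlapping values diverge."""
    overlap = da1.notnull() & da2.notnull()
    rel_diff = (np.abs(da1 - da2) / np.abs(da1)).where(overlap)
    if bool((rel_diff > rtol).any()):
        raise ValueError(f"overlapping data diverges by more than {rtol:.1%}")
    return da1.combine_first(da2)  # da1 takes precedence where both have values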

Re-using columns in `read_wide_csv_file_if` while reading data

Describe the bug
Following up on the discussion in #82, read_wide_csv_file_if doesn't support re-using a source column:

Unfortunately, read_wide_csv_file_if doesn't support re-using a source column multiple times in coords_cols because behind the scenes it just does a renaming. Probably something it should support, so maybe worth opening a bug report - but I can't commit to when I will have time to fix it.

Example code:

file = "rcmip-emissions-annual-means-v5-1-0.csv"
coords_cols = {
    "unit": "Unit",
    "area": "Region",
    "model": "Model",
    "scenario": "Scenario",
    "entity": "Variable",
    "category": "Variable"
}
coords_defaults = {
    "source": "RCMIP",
}
coords_terminologies = {
    "area": "RCMIP",
    "category": "RCMIP",
}
coords_value_mapping = {
    "entity": map_variables
}
meta_data = {
    "rights": "CC BY 4.0 International",
}
data_if = pm2.pm2io.read_wide_csv_file_if(
    file,
    coords_cols=coords_cols,
    coords_defaults=coords_defaults,
    coords_terminologies=coords_terminologies,
    coords_value_mapping=coords_value_mapping,
    meta_data=meta_data,
    filter_keep={"f1": {
        "Model": "CEDS/UVA/GCP/PRIMAP",
    }}
)
data_if

Fails with KeyError.

Expected behavior

Allow re-using a column when reading the data.

Potential workaround is described in #82

extrapolation and other numerical / unary operations

compose generally does binary operations, i.e. a given timeseries is combined with
a second timeseries to yield a better result.

We also need support for unary operations, where a given timeseries is handled
(e.g. by interpolation or extrapolation) to yield a better result.

Possible unary operations:

  • extrapolation
  • interpolation
  • extrapolation with proxy data
  • smoothing
  • de-trending
  • replacing by NaNs

Some of these functions already exist in xarray or primap2 (e.g. replacing by NaNs
can be done using pr.set), others are special for our use case. The most urgent ones
are extrapolation and interpolation.

We expect that we will use these functions interactively / ad-hoc, so it would be
useful to have them as pr. functions first. In a second step, we likely also want
to have more complex configs (e.g. for full primap-hist processing) which might be
something we want to do in a similar fashion as the compose function.
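
For the two most urgent ones, a sketch of what is already reachable via plain xarray (pr.-accessor wrappers would look similar):

# given a DataArray da with a "time" dimension:
interpolated = da.interpolate_na(dim="time", method="linear")
# scipy-backed methods forward fill_value to scipy.interpolate.interp1d:
extrapolated = da.interpolate_na(dim="time", method="slinear", fill_value="extrapolate")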

Long format reading capabilities

Is your feature request related to a problem? Please describe.

We need to be able to read long-format csv files, e.g. FAOSTAT data or the output of the UNFCCC DI API.

Describe the solution you'd like

A function that converts long format to wide format, which can then be read by the read_wide_csv_file_if() function; a sketch follows below.
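
A minimal sketch of the conversion; the file and column names are hypothetical:

import pandas as pd

df_long = pd.read_csv("faostat_example.csv")  # columns: country, category, entity, unit, year, value
df_wide = df_long.pivot_table(
    index=["country", "category", "entity", "unit"],
    columns="year",
    values="value",
).reset_index()
df_wide.to_csv("wide.csv", index=False)  # now readable by read_wide_csv_file_if()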

Describe alternatives you've considered

A function that directly converts long format to interchange format. This would mean replicating code and is thus only an option if the preferred solution doesn't work for some files.

Additional context

Working code is available here: https://gitlab.pik-potsdam.de/PRIMAP/helper-tools/-/tree/master/data_converter
It just needs to be adapted to PRIMAP2.

Abandon `sec_cats`

Is your feature request related to a problem? Please describe.

Having to maintain sec_cats in the dataset attrs is work, and it seems pointless.

Describe the solution you'd like

Let's just not have sec_cats in the native data format (whether we want to continue including it in the IF is a second question).

run (x)doctests via pytest

It would be great to test the Examples sections in docstrings automatically. This is a bit of work because not all examples are self-contained at the moment.
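
A configuration sketch, assuming plain pytest doctest collection (the xdoctest pytest plugin would work similarly):

# pyproject.toml
[tool.pytest.ini_options]
addopts = "--doctest-modules"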

Fix docs for aggregation

Describe the bug

The docs for the new aggregation function in 0.11.0 are a bit messed up (e.g. the last paragraph doesn't fit the example above it).

Expected behavior
Clean and correct docs

Pandas DataFrame.append deprecated

DataFrame.append is deprecated and will likely be removed in pandas 2.0
pandas-dev/pandas#35407

We use it in some data reading functions, so we have to adapt the code. pd.concat is supposed to be used as an alternative, but it loses attrs, so we have to be careful when handling the interchange format data.
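
A sketch of the replacement pattern, given DataFrames df and new_rows, preserving the attrs by hand since pd.concat does not reliably propagate them:

import pandas as pd

# before (deprecated): df = df.append(new_rows)
attrs = df.attrs
df = pd.concat([df, new_rows], ignore_index=True)
df.attrs = attrs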

Add convert_to_gwp() to a dataset

Is your feature request related to a problem? Please describe.

We need to write a loop every time we want to convert the units of a dataset.

Describe the solution you'd like

We would like to add a Dataset accessor for convert_to_gwp().
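
The current loop, sketched with the existing DataArray accessor; the entity names and target units are examples:

# given a primap2 Dataset ds
for entity in ["CH4", "N2O"]:
    ds[f"{entity} (SARGWP100)"] = ds[entity].pr.convert_to_gwp(
        gwp_context="SARGWP100", units="Gg CO2 / year"
    )
# a Dataset-level ds.pr.convert_to_gwp() accessor would replace this loop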

fill_na_gas_basket_from_contents() is obsoleted by set()

Where possible, functionality should not be duplicated.
Plan:

  1. write fill_na_gas_basket_from_contents() using set()
  2. check if tests still pass
  3. re-write tests using set
  4. remove fill_na_gas_basket_from_contents
  5. adjust documentation
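
A sketch of step 1, assuming ds.pr.gas_basket_contents_sum() as the building block; whether plain fillna or set() is the right primitive here is exactly what steps 2 and 3 would check:

# given a primap2 Dataset ds
summed = ds.pr.gas_basket_contents_sum(
    basket="KYOTOGHG (SARGWP100)",
    basket_contents=["CO2", "CH4 (SARGWP100)", "N2O (SARGWP100)"],
)
ds["KYOTOGHG (SARGWP100)"] = ds["KYOTOGHG (SARGWP100)"].fillna(summed)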

fully type everything and add py.typed marker

Is your feature request related to a problem? Please describe.

If you import primap2 in a file and check the file with mypy, you get an error:

app.py:1: error: Skipping analyzing "primap2": module is installed, but missing library stubs or py.typed marker  [import]

That's of course stupid, because primap2 is in fact typed; we just haven't added the correct marker to tell the world about it.

Describe the solution you'd like

Add the py.typed marker file.
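
A packaging sketch (PEP 561), assuming setuptools with a setup.cfg:

# setup.cfg
[options.package_data]
primap2 = py.typed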

Additional context

I'm unfortunately not sure what will happen with the xarray accessor and whether typing works on it. We may have to investigate.

Generalize `pr.set`

Is your feature request related to a problem? Please describe.

Xarray lacks an easy-to-use solution for setting specific values. Unfortunately, pr.set also doesn't handle all use cases so far.

Describe the solution you'd like

It would be great to handle the following cases:

  • Like combine_first, but preserving NaNs in early objects before alignment.
  • Like set, but instead of dim and key, have a sel, so that it looks like da.pr.set(sel, value).
  • Support scalar NaN value in ds.set.
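
A hypothetical sketch of the sel-style call shape from the second bullet (da and value are stand-ins):

# given a DataArray da and a matching value to write:
da.pr.set({"area (ISO3)": "COL", "time": slice("2000", "2005")}, value)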

Entity names

I'm wondering if we want to have a standard for variable names. In PRIMAP1 it's all upper-case letters. For PRIMAP2 we have specified a way to add GWP information to variable names, but no convention for the variables themselves. All-uppercase is sometimes hard to read, and I think we should have a specification to simplify running code on different datasets.

csv / xls(x) reading for more formats

Is your feature request related to a problem? Please describe.

Currently, reading from wide-format csv files is implemented (long format under development). However, some datasets use different formats, e.g. a table with gas and category as dimensions where each year has an individual sheet or file.

Describe the solution you'd like

Function that reads csv files from a folder or sheets from an xls(x) file into the interchange format.

Describe alternatives you've considered

Processing using external tools like the primap1 data_converter helper scripts

Additional context

For an example, see e.g. the CEADs data for China: www.ceads.net/

improved nan treatment on summing

Is your feature request related to a problem? Please describe.

When using ds.pr.sum with skipna=True on data containing NaN, the result is currently 0 if all data points summed over are NaN. I usually need the result to be NaN when all inputs are NaN, but if there are one or more non-NaN values, the result should be the sum of the non-NaN values.

Describe the solution you'd like

Use min_count=1 in the call of ds.sum (in the definition of ds.pr.sum) when skipna=True.
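
For illustration, plain xarray already behaves as desired with min_count:

import numpy as np
import xarray as xr

da = xr.DataArray([np.nan, np.nan])
print(da.sum(skipna=True))               # 0.0, the current surprising result
print(da.sum(skipna=True, min_count=1))  # nan, since there is no valid input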

Alternatives
Always pass min_count=1 to ds.pr.sum when using it.

Would there be any problems in existing functionality when introducing this change @mikapfl ?

Unit handling in xarray fails for Python 3.12

Describe the bug

The function pr.sum removes the units in a dataset on Python 3.12.2.

Failing Test

github_issue_ds_1.csv

import primap2 as pm

test_data = pm.open_dataset("github_issue_ds_1.csv")

test_data = test_data.pr.sum(dim="category", skipna=True, min_count=1)

print(test_data.pr.to_interchange_format()['unit'])

Output for Python 3.10.13:

0      CH4 * gigagram / a
1       CO * gigagram / a
2      CO2 * gigagram / a
3      N2O * gigagram / a
4    NMVOC * gigagram / a
5      NOx * gigagram / a
6      SO2 * gigagram / a
Name: unit, dtype: object

Output for Python 3.12.2:

/Users/danielbusch/Documents/UNFCCC_non-AnnexI_data/venv/lib/python3.12/site-packages/xarray/namedarray/core.py:215: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return NamedArray(dims, np.asarray(data), attrs)
(the warning above is repeated 10 times in total)
0    no unit
1    no unit
2    no unit
3    no unit
4    no unit
5    no unit
6    no unit
Name: unit, dtype: object

Expected behavior

The sum function should calculate the sum over the given dimension and keep all the unit information.

System (please complete the following information):

  • OS: macOS
  • Python version 3.12.2

logging verbosity

At the moment, our logging is active by default and very verbose (prints all debug messages). Maybe we should change that. I usually find myself disabling most logging when I write a script.

loguru recommends that libraries disable their loggers completely by default so as not to annoy library users, who then enable the loggers selectively. But maybe that's also a bit annoying?

I'm not quite sure what we should do, but maybe enable logging by default at the info level? @JGuetschow opinions?
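
For reference, a sketch of the pattern loguru recommends for libraries:

from loguru import logger

# in primap2/__init__.py: silence the library's logger by default
logger.disable("primap2")

# in user code: opt back in explicitly
logger.enable("primap2")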

Add tests on logging for pr.merge

Is your feature request related to a problem? Please describe.

Currently logging for pr.merge is not subject to testing. It would be good to include it, as the information in the log files is part of the output of the merge function.

Describe the solution you'd like

Basic tests that check if the conflicting values are detected and logged.
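
A sketch of such a test, assuming loguru; the fixture names and the exact log message text are assumptions:

from loguru import logger

def test_merge_logs_conflicts(ds_a, ds_b):
    # capture log records in a list sink while merging
    records = []
    sink_id = logger.add(records.append, level="WARNING")
    try:
        ds_a.pr.merge(ds_b, tolerance=0.01)
    finally:
        logger.remove(sink_id)
    assert any("conflicting values" in str(r) for r in records)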

xls(x) reading

Is your feature request related to a problem? Please describe.

Allow reading of xls(x) files including selection of sheets to read.

Describe the solution you'd like

The pyCPA function that reads all csv files from a folder could be ported to primap2 and be enhanced such that it also reads a selection of sheets from xls(x) files.

Describe alternatives you've considered

xls(x) reading could also be implemented by converting the xls(x) files to a selection of csv files which are then read using the csv reading functions.
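
A sketch of the sheet selection with pandas; the file and sheet names are examples:

import pandas as pd

# sheet_name=None would read all sheets; a list selects specific ones
sheets = pd.read_excel("inventory.xlsx", sheet_name=["1990", "1995", "2000"])
for year, df in sheets.items():
    ...  # convert each sheet and concatenate into the interchange format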


Support filtering for `entity` in Dataset.pr.loc

Is your feature request related to a problem? Please describe.

To filter a dataset for the entity and other dimensions, you have to write:

ds[entity].pr.loc[{"other_dim": other_value}]

That's annoying because you have to treat entity differently from other dimensions.

Describe the solution you'd like

We enhance the dataset loc accessor such that

ds.pr.loc[{"entity": entity, "other_dim": other_value}]

is valid.

Describe alternatives you've considered

Maybe we shouldn't do this, because the return value of ds.pr.loc is then not clearly defined anymore: if entity is in the filter, the return value is a DataArray; if entity is not in the filter, it is a Dataset. That's annoying because static type inference doesn't work anymore. So maybe we shouldn't overcomplicate things and live with the fact that filtering for entities is different from filtering for anything else.

Question on interchange format data reading

I was trying to read the RCMIP emissions into the PRIMAP2 interchange format as a first step:

Available from https://gitlab.com/rcmip/rcmip/-/blob/master/data/protocol/rcmip-emissions-annual-means-v5-1-0.csv
or download https://gitlab.com/rcmip/rcmip/-/raw/master/data/protocol/rcmip-emissions-annual-means-v5-1-0.csv?inline=false

The code below works, but I couldn't get category (commented out) and entity to be read at the same time.
I build a dictionary from the CSV, e.g. to map Emissions|CO2 to CO2.

Is this the right way in general? Maybe I'm also missing something conceptually.

import pandas as pd
import primap2 as pm2

df = pd.read_csv("rcmip-emissions-annual-means-v5-1-0.csv")
df["Entity"] = df.Unit.apply(lambda x: x.split(" ")[1].split("/")[0])
map_variables = df[["Variable", "Entity"]].set_index("Variable")["Entity"].to_dict()

file = "rcmip-emissions-annual-means-v5-1-0.csv"
coords_cols = {
    "unit": "Unit",
    "area": "Region",
    "model": "Model",
    "scenario": "Scenario",
    "entity": "Variable",
    #"category": "Variable"
}
coords_defaults = {
    "source": "RCMIP",
}
coords_terminologies = {
    "area": "RCMIP",
    "category": "RCMIP",
}
coords_value_mapping = {
    "entity": map_variables
}
meta_data = {
    "url": "https://doi.org/10.5194/gmd-13-5175-2020",
    "rights": "CC BY 4.0 International",
}
data_if = pm2.pm2io.read_wide_csv_file_if(
    file,
    coords_cols=coords_cols,
    coords_defaults=coords_defaults,
    coords_terminologies=coords_terminologies,
    coords_value_mapping=coords_value_mapping,
    meta_data=meta_data,
    filter_keep={"f1": {
        "Model": "CEDS/UVA/GCP/PRIMAP",
    }}
)
data_if
