
Comments (26)

jarq6c commented on June 2, 2024

aaraney commented on June 2, 2024

My inclination is to cache processed dataframes.

Likewise.

GCP is pretty resilient, but there's a major bottleneck when it comes to processing the NetCDF files. I'd like to see if we can optimize that aspect first, then explore caching options.

Okay, I will look into optimizing that before moving on. Can you speculate on the bottleneck you mentioned?

jarq6c commented on June 2, 2024

aaraney commented on June 2, 2024

Alright, that's helpful. I'll fool around with it and see whether I notice any improvement in performance.

aaraney commented on June 2, 2024

I took a look at the performance issues on Friday and found three places where we can make substantial improvements. For an initial baseline, I timed retrieving an analysis and assimilation cycle without filtering the results (all 2.7 million reaches included). That took ~48 seconds. Not horrible, but also not great. To get a sense of what the bottleneck might be, I ran the same test filtered to just USGS gauges, and that took substantially less time, ~24 seconds.

Before I elaborate on the findings, I was able to get the time down to ~20 seconds for pulling and processing all 2.7 million reaches for a single AnA cycle. To me, that is a success, especially if we are to cache those results. Enough of the good news and on to the findings...

I found three places where we can substantially improve performance:

  1. Dropping variables that we are not concerned with at xarray dataset creation time.
  2. Not decoding the time fields at xarray dataset creation time.
  3. After creating the xarray dataset and the pandas DataFrame, not casting the nwm_feature_id field to a string. This had the largest effect on performance; it cut wall time roughly in half.
import pandas as pd
import xarray as xr


def NWM_bytes_to_DataFrame(
    path,
) -> pd.DataFrame:
    vars_to_drop = [
        "crs",
        "nudge",
        "velocity",
        "qSfcLatRunoff",
        "qBucket",
        "qBtmVertRunoff",
    ]
    # Load data as xarray DataSet
    with xr.load_dataset(
        path,
        engine="h5netcdf",
        mask_and_scale=False,
        drop_variables=vars_to_drop, #1
        decode_cf=False, #2
    ) as ds:

        # Extract streamflow data to pandas DataFrame
        df = pd.DataFrame(
            {
                "nwm_feature_id": ds["streamflow"].feature_id.values,
                "value": ds["streamflow"].values,
            }
        )

        # Scale data
        scale_factor = ds["streamflow"].scale_factor[0]
        value_date = pd.to_datetime(ds.time.values[0], unit="m")
        start_date = pd.to_datetime(ds.reference_time.values[0], unit="m")

    df.loc[:, "value"] = df["value"].mul(scale_factor)

    # 3. This had the biggest impact on wall time. With casting, for a full AnA cycle, goes to ~50 seconds
    # Convert feature IDs to strings
    # df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)

    # Extract valid datetime
    df["value_date"] = value_date

    # Extract reference datetime
    df["start_date"] = start_date

    return df

Now that these findings have been presented, my only question is: given that the cast from int to str of the nwm_feature_id field has the biggest impact on performance, what are your thoughts on leaving the field as an int? This would break with the canonical dataframe format.

jarq6c commented on June 2, 2024

At the model level, it makes sense to store nwm_feature_id as an int. However, for this use case my preference is that all location identifiers be str, then converted to the pandas category dtype.

Having said that, I believe the transformation from int to str can be more efficiently executed outside the NWM_bytes_to_DataFrame method (after calling pandas.concat elsewhere).

Additionally, I wonder if there is any performance gain to be had from using the xarray.Dataset.to_dataframe method (see the xarray docs). Hand-crafting a pandas.DataFrame apparently comes with significant overhead.
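
For illustration, a minimal sketch of that ordering, assuming a hypothetical netcdf_paths list and the NWM_bytes_to_DataFrame method above:

import pandas as pd

# Hypothetical: one frame per NetCDF file in an AnA cycle; netcdf_paths is assumed.
frames = [NWM_bytes_to_DataFrame(p) for p in netcdf_paths]

# Concatenate first, then perform the int -> str (-> category) conversion once.
df = pd.concat(frames, ignore_index=True)
df["nwm_feature_id"] = df["nwm_feature_id"].astype(str).astype("category")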

aaraney commented on June 2, 2024

I agree that, for the end user, storing them as strings makes more sense, and it is the standard that we have in place. Personally, I think it makes sense to cache results as ints. I would imagine that int comparisons are faster than str comparisons; then again, the real question is whether an int comparison plus a cast to str is faster than a str comparison alone. I may be changing my mind, but the overall size of the cache would also be much larger if we store nwm_feature_id as a string.
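
A rough, non-authoritative way to put numbers on that intuition (a membership-style filter similar to subsetting to USGS gauges; the sizes here are illustrative):

import timeit

import numpy as np
import pandas as pd

ids_int = pd.Series(np.arange(2_700_000, dtype="int64"))
ids_str = ids_int.astype(str)

# ~8,000 stand-in "gauged" reaches to filter on.
targets_int = ids_int.sample(8_000, random_state=0).to_numpy()
targets_str = targets_int.astype(str)

print("int isin:", timeit.timeit(lambda: ids_int.isin(targets_int), number=3))
print("str isin:", timeit.timeit(lambda: ids_str.isin(targets_str), number=3))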

Having said that, I believe the transformation from int to str can be more efficiently executed outside the NWM_bytes_to_DataFrame method (after calling pandas.concat elsewhere).

Potentially, but I am not so sure of that. Given that a copy of the data is made in memory when casting from int to str, I would think it would (in most cases) actually be slower to cast after concatenating the results. As of now, we have a kind of divide-and-conquer strategy that should be slightly faster.

Additionally, I wonder if there is any performance gain to be had from using the xarray.Dataset.to_dataframe method (see the xarray docs). Hand-crafting a pandas.DataFrame apparently comes with significant overhead.

I've tested that hypothesis, though I would not say robustly, and found that is actually not the case. My thinking is that passing the underlying numpy array to the pandas.DataFrame constructor does not copy the data in memory until you change it. But I could be wrong about that. I'll see what I can find in the source, for my own curiosity's sake. But to your point, this is definitely worth more investigation than I've done so far.
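
One quick way to probe that, for anyone curious, is to check whether the resulting column still shares memory with the source array (results may vary by pandas version, so treat this as a sketch rather than a conclusion):

import numpy as np
import pandas as pd

values = np.arange(2_700_000, dtype="int32")  # stand-in for ds["streamflow"].values
df = pd.DataFrame({"value": values})

# True -> the column is a view of `values`; False -> the constructor copied the data.
print(np.shares_memory(values, df["value"].to_numpy()))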

aaraney commented on June 2, 2024

Having looked into the source, I don't think that for this problem we will see any performance improvement using to_dataframe(). I've included the method xarray uses to create the dataframe below. Our code should be minutely faster since we also use the underlying numpy array to construct the df, just with less code.

# from xarray.core.dataset.Dataset
# https://github.com/pydata/xarray/blob/a0c71c1508f34345ad7eef244cdbbe224e031c1b/xarray/core/dataset.py#L4970
def _to_dataframe(self, ordered_dims: Mapping[Hashable, int]):
    columns = [k for k in self.variables if k not in self.dims]
    data = [
        self._variables[k].set_dims(ordered_dims).values.reshape(-1)
        for k in columns
    ]
    index = self.coords.to_index([*ordered_dims])
    return pd.DataFrame(dict(zip(columns, data)), index=index)

jarq6c commented on June 2, 2024

Potentially, but I am not so sure of that. Given that a copy of the data is made in memory when casting from int to str, I would think it would (in most cases) actually be slower to cast after concatenating the results. As of now, we have a kind of divide-and-conquer strategy that should be slightly faster.

As an experiment I ran a test on my personal laptop that compared creating and concatenating 10,000 dataframes each with 1000 rows and two columns (nwm_feature_id and data) using multiprocessing (5 processes). I found casting nwm_feature_id to str after concatenation was 50% slower than before concatenation. At least on my laptop, it seems your intuition is correct.
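
For reference, a rough sketch of that kind of comparison (names and sizes here are illustrative, not the exact script used):

import time
from multiprocessing import Pool

import numpy as np
import pandas as pd


def make_frame(seed: int, cast: bool) -> pd.DataFrame:
    # Build one 1000-row frame; optionally cast nwm_feature_id to str up front.
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "nwm_feature_id": rng.integers(1, 3_000_000, size=1000),
            "data": rng.random(1000),
        }
    )
    if cast:
        df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
    return df


def run(cast_before_concat: bool) -> float:
    start = time.perf_counter()
    with Pool(processes=5) as pool:
        frames = pool.starmap(
            make_frame, [(i, cast_before_concat) for i in range(10_000)]
        )
    df = pd.concat(frames, ignore_index=True)
    if not cast_before_concat:
        df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"cast before concat: {run(True):.1f} s")
    print(f"cast after concat:  {run(False):.1f} s")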

I understand the efficiency arguments for caching nwm_feature_id as int. My reservations are related to usgs_site_code which are also numbers, but should not be stored as int (due to the leading zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).
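
The leading-zero problem in miniature (the site code below is just an example value):

site = "01013500"      # a USGS site code with a leading zero
print(str(int(site)))  # "1013500" -- the leading zero is lost after an int round trip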

jarq6c commented on June 2, 2024

I also had notes on memory usage for the resulting DataFrame:
str: 647.4 MB
int: 152.6 MB
str as Category: 95.5 MB
int as Category: 95.4 MB
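
For anyone wanting to reproduce numbers like these, a sketch using pandas' deep memory accounting (df is assumed to be a full AnA-cycle DataFrame like the one built above):

# Per-dtype memory of the nwm_feature_id column, in MB.
candidates = {
    "str": df["nwm_feature_id"].astype(str),
    "int": df["nwm_feature_id"].astype("int64"),
    "str as Category": df["nwm_feature_id"].astype(str).astype("category"),
    "int as Category": df["nwm_feature_id"].astype("category"),
}
for name, col in candidates.items():
    print(name, round(col.memory_usage(deep=True) / 1e6, 1), "MB")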

aaraney commented on June 2, 2024

I understand the efficiency arguments for caching nwm_feature_id as int. My reservations are related to usgs_site_code which are also numbers, but should not be stored as int (due to the leading zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).

Right, I agree with you. My plan was never to include the usgs_site_code in a cached version, since crosswalking and merging are so cheap computationally. I share your concern about splitting from the canonical format; I'm just trying to figure out what makes the most sense moving forward.
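
A sketch of what that cheap crosswalk step could look like at read time (the frames and column values here are made up for illustration, not an existing hydrotools API):

import pandas as pd

# Assumed cached NWM output keyed by integer nwm_feature_id.
cached = pd.DataFrame({"nwm_feature_id": [101, 102, 103], "value": [1.0, 2.0, 3.0]})

# Assumed crosswalk mapping nwm_feature_id -> usgs_site_code (kept as str).
crosswalk = pd.DataFrame(
    {"nwm_feature_id": [101, 103], "usgs_site_code": ["01234567", "07654321"]}
)

# Attach site codes after reading the cache; ungauged reaches drop out of the inner join.
print(cached.merge(crosswalk, on="nwm_feature_id", how="inner"))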

aaraney commented on June 2, 2024

I was able to get the time down to ~34 seconds doing the following:

max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

jarq6c commented on June 2, 2024

I was able to get the time down to ~34 seconds doing the following:

max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

Yuck! But also, clever! πŸ˜† Nice!

aaraney commented on June 2, 2024

The only cause for concern is that nwm_feature_id is now a byte type, not a decoded string. Going back to the drawing board. I found a really thorough and well-explained Stack Overflow post comparing astype and apply(<T>) that may be useful in the future, or is just good to know about.
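
A small illustration of the byte-string issue (and, for comparison, the fixed-width unicode dtype, which keeps plain str values but was not benchmarked here):

import numpy as np

ids = np.array([101, 2500009, 3], dtype="int64")

as_bytes = ids.astype("|S8")  # b'101', b'2500009', b'3' -- encoded bytes
as_text = ids.astype("<U8")   # '101', '2500009', '3'    -- decoded strings

print(as_bytes[0], as_text[0])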

hellkite500 commented on June 2, 2024

Might try rewriting this as a vectorized copy and see if it helps performance:

    # 3. This had the biggest impact on wall time. With casting, for a full AnA cycle, goes to ~50 seconds
    # Convert feature IDs to strings
    # df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)

To something like

df["nwm_feature_id_str"] = str(df["nwm_feature_id"])
df.drop("nwm_feature_id")
df.rename({"nwm_feature_id_str":"nwm_feature_id"})

I'm curious if the type assignment from one column type to another is causing some significant slowdown on data that large. Essentially it has to create a temporary buffer, copy the new transformed data into it, and then reassign that index. Maybe something like the above can speed that up a bit, but YMMV.

aaraney commented on June 2, 2024

I tried something similar to that, but instead of casting with the built-in str, I used pandas astype. Still hovering at 50 seconds.

Edit: Remember English

aaraney commented on June 2, 2024

I've spent some more time working on this and I think I may have come to a fair compromise solution. Just to recap a few of the things I've tried since I last commented:

  1. Using cython to cast from an int to a string. I found that df.apply(str) was still slightly faster with a small array, and df.apply(str) became the clear winner as the array size increased.
  2. Using numpy.char.mod("%s", ds["streamflow"].feature_id.values) showed comparable speed to

max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

but also returned encoded bytes rather than unicode strings.

  3. Next, I tried np.char.array(ds["streamflow"].feature_id.values, copy=False, unicode=True). This was faster than df.apply(str) on small arrays, but as the array size grew and the string length increased, it became slower than the df.apply(str) approach: ~39 seconds when retrieving an EAnA cycle.
  4. Lastly, I tried pd.Categorical(ds["streamflow"].feature_id.values). This implementation took ~33 seconds for an entire EAnA cycle. I think this implementation is preferable to 1, 2, and 3 because the nwm_feature_id values end up in a type that is recognizable and does not require further casting by the user. Also, it's pretty fast. As noted in #25, there are known issues with categorical data, however I think these issues are outweighed by the functionality of the categorical datatype. Namely, you can compare integers to categorical types without issue (see the sketch below). To that same point, though, what do we gain from casting these fields to categorical values? If our only two main advantages are comparison to integer types and not being a byte string, I don't see what the benefit is over leaving the values as ints in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.
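
A tiny demonstration of that comparison behavior (toy values only):

import pandas as pd

s_int = pd.Series(pd.Categorical([101, 102, 103, 101]))
print((s_int == 102).tolist())    # integer-valued categories compare against plain ints

s_str = s_int.astype(str).astype("category")
print((s_str == "102").tolist())  # str categories require the string literal instead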

jarq6c commented on June 2, 2024

I don't see what the benefit is over leaving the values as ints in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.

The categorical dtype flexes its strength more on large dataframes. I found categories beat both str and int in terms of memory efficiency (and storage as HDF) by a considerable margin on large dataframes. I also found that str categories are comparable to int categories. The added benefit of str categories is that we can use a consistent dtype for all location IDs.

From a previous comment:

str: 647.4 MB
int: 152.6 MB
str as Category: 95.5 MB
int as Category: 95.4 MB

aaraney commented on June 2, 2024

Those memory savings are hard to argue against. Do you know if you can specify the category type (str or int) prior to the astype cast? That is, can you get a str-as-Category from an int without having to cast to str first? From my reading of pandas.Categorical I am not seeing a straight path forward.

jarq6c commented on June 2, 2024

It appears that you can be explicit about the CategoricalDtype. However, I don't think there's a way around performing the initial cast to str (via astype, apply, etc). There may be differences between chaining apply or astype or using a lambda expression.
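
A minimal illustration of that explicitness (toy ids; note the str cast still happens before the categorical cast):

import pandas as pd

ids = pd.Series([101, 102, 103, 101])

# Declare the categories as strings up front, but the values still pass through str.
dtype = pd.CategoricalDtype(categories=["101", "102", "103"])
as_cat = ids.astype(str).astype(dtype)
print(as_cat.dtype)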

Another alternative might be to use xarray.DataArray.astype or similar function on the original dataset.

aaraney commented on June 2, 2024

It appears that you can be explicit about the CategoricalDtype. However, I don't think there's a way around performing the initial cast to str (via astype, apply, etc). There may be differences between chaining apply or astype or using a lambda expression.

Right, that was my thinking as well.

I am giving your second suggestion a run at the moment.

aaraney commented on June 2, 2024

So I tried the following:

df = pd.DataFrame(
    {
        # "nwm_feature_id": pd.Categorical(
        #     ds["streamflow"].feature_id.values
        # ),
        "nwm_feature_id": ds["streamflow"].feature_id.astype(str),
        "value": ds["streamflow"].values,
    }
)

df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")

It took ~183 seconds. I did not report it earlier, but I found early on that casting the values while they are still in an xarray.DataArray is slow.

jarq6c commented on June 2, 2024

What about merging your above solution with the underlying numpy method? Something like:

"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}")

aaraney commented on June 2, 2024

I'll give that a go and see if it improves anything. I just tested this:

df = pd.DataFrame(
    {
        "nwm_feature_id": ds["streamflow"].feature_id.values,
        "value": ds["streamflow"].values,
    }
)

df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")
df["nwm_feature_id"].cat.rename_categories(
    df["nwm_feature_id"].cat.codes.apply(str), inplace=True
)

It was the slowest of the bunch so far, ~212 seconds, lol.

aaraney commented on June 2, 2024

Unfortunately,

"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}")

took ~148 seconds.

jarq6c commented on June 2, 2024

Basic caching was implemented for this subpackage with #66.

Closing this issue here. We can discuss the future of dataframe caching on another ticket.
