Support CSE on Python UDFs (polars, OPEN)

kszlim commented on June 27, 2024
Support CSE on python UDFs

from polars.

Comments (9)

deanm0000 commented on June 27, 2024

@avimallu I had the same thought about lru_cache, but it doesn't work: neither pl.Series nor np.ndarray is hashable, so the underlying cache cannot use them as keys.
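
To make the hashability point concrete, here is a minimal sketch that uses a plain Python list as a stand-in for pl.Series or np.ndarray (an assumption made only to keep the example dependency-free). The function name double and the helper try_cached_call are hypothetical. functools.lru_cache builds its key from the call arguments, so any unhashable argument is rejected before the function body runs.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def double(values):
    """Hypothetical 'expensive' UDF; the cache key is built from `values`."""
    return [v * 2 for v in values]


def try_cached_call(values):
    """Call the cached function and report a TypeError instead of raising."""
    try:
        return double(values)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A list is unhashable, like pl.Series or np.ndarray, so caching fails.
list_result = try_cached_call([1, 2, 3])
# A tuple is hashable, so the same call goes through the cache.
tuple_result = try_cached_call((1, 2, 3))

print(list_result)
print(tuple_result)
```

Converting the input to a hashable form first (for example a tuple or a bytes digest) would work around this, but that conversion has its own cost and is not something the thread proposes.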

ritchie46 commented on June 27, 2024

Ping me next week. I will see if I can put something behind an env var.

ritchie46 commented on June 27, 2024

I am not sure about that. We will not call into Python to check functions for equality, and pointer checking has failed in the past.
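
As an illustration of why function equality is awkward on the Python side, here is a small sketch. It is not how Polars compares UDFs; the Udf class is a hypothetical stand-in for an expression node that wraps a Python callable. Python functions compare by identity, so two lambdas with the same source text are still unequal.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Udf:
    """Hypothetical expression node wrapping an opaque Python callable."""
    func: Callable


f = lambda x: x * 2
g = lambda x: x * 2  # same source text, different function object

# Functions have no structural equality: only the same object is equal.
same_object_equal = f == f
identical_source_equal = f == g

# A node keyed on the callable inherits that behaviour.
nodes_same_func = Udf(f) == Udf(f)
nodes_identical_source = Udf(f) == Udf(g)

print(same_object_equal, identical_source_equal)
print(nodes_same_func, nodes_identical_source)
```

So any deduplication of UDF nodes would have to rely on object identity, or on the user explicitly declaring the expression safe, which is what the opt-in request in this thread is about.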

kszlim commented on June 27, 2024

Can these be special cased so that a user can say that it's safe to CSE this expression? This ends up being a pretty big annoyance in some cases and makes certain programming patterns ugly.

kszlim commented on June 27, 2024

At work, I essentially provide a framework where users pass me expressions and I apply them to a base table as well as adding an over to the user provided expressions. The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context. It's okay if the UDF is cheap, but some of them are quite expensive, so having CSE work would be great.

ritchie46 commented on June 27, 2024

I think this can create many bugs, which I don't want to open up at this point in time. We can look at enabling it for UDFs later.

kszlim commented on June 27, 2024

Even if it's completely opt-in? This is a bit of a blocker for me. Would it be possible to bring back the pl.Expr.cache method as an alternative?

avimallu commented on June 27, 2024

> The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.

Does lru_cache not work for your case?

kszlim commented on June 27, 2024

> > The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.
>
> Does lru_cache not work for your case?

@avimallu I don't think you understand the feature request.

import polars as pl

ldf = pl.LazyFrame({"a": [1, 2, 3]})
# A Python UDF that the two derived expressions below reuse as a sub-expression.
udf_expr = pl.col("a").map_batches(lambda x: x * 2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())  # print the optimized query plan

You'll notice that this prints:

 WITH_COLUMNS:
 [col("a").python_udf().alias("b"), [(col("a").python_udf()) * (2.cast(Unknown(Any)))].alias("c"), [(col("a").python_udf()) * (3.cast(Unknown(Any)))].alias("d")], []
  DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None

The python_udf() call appears three times in the plan, meaning that the UDF gets evaluated 3x.

Contrast it with:

import polars as pl

ldf = pl.LazyFrame({"a": [1, 2, 3]})
# Same query shape, but with a native expression instead of a Python UDF.
udf_expr = pl.col("a").mul(2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())

Which will print out:

 WITH_COLUMNS:
 [col("__POLARS_CSER_0xd39686281a38356a").alias("b"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (2)].alias("c"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (3)].alias("d")], [[(col("a")) * (2)].alias("__POLARS_CSER_0xd39686281a38356a")]
  DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None
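
For readers unfamiliar with CSE, the rewrite in that plan can be sketched with a toy expression tree. This is only an illustration under simplifying assumptions, not Polars's optimizer: the Col, Lit, and Mul classes and the common_subexpressions helper are hypothetical. Because the nodes are hashable and compare structurally, repeated sub-expressions can be found by counting them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, Mul]


def subexpressions(expr):
    """Yield expr itself and every nested sub-expression."""
    yield expr
    if isinstance(expr, Mul):
        yield from subexpressions(expr.left)
        yield from subexpressions(expr.right)


def common_subexpressions(exprs):
    """Return the Mul sub-expressions that occur more than once overall."""
    counts = Counter(
        sub for e in exprs for sub in subexpressions(e) if isinstance(sub, Mul)
    )
    return [sub for sub, n in counts.items() if n > 1]


# Mirrors the query above: b = a * 2, c = b * 2, d = b * 3.
b = Mul(Col("a"), Lit(2))
c = Mul(b, Lit(2))
d = Mul(b, Lit(3))

shared = common_subexpressions([b, c, d])
print(shared)
```

Here the only shared sub-expression is a * 2, which corresponds to the single __POLARS_CSER_... column in the plan. The approach depends on nodes being comparable and hashable, which is exactly what an opaque Python function does not provide.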

Getting the same result for the UDF requires changes on the Polars side; otherwise you have to explicitly write your query/code like:

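import polars as pl

ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x * 2).alias("b")
# Reference the UDF result by column name instead of reusing the expression.
derived_expr_0 = pl.col("b").mul(2).alias("c")
derived_expr_1 = pl.col("b").mul(3).alias("d")
# "b" must exist before it can be referenced, so two with_columns calls are needed.
ldf = ldf.with_columns(udf_expr)
ldf = ldf.with_columns(derived_expr_0, derived_expr_1)
print(ldf.explain())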

But in my use case, users build up trees of expressions that they pass to my framework to evaluate. Without CSE support, this becomes very ugly and breaks the abstraction.
