Comments (9)
@avimallu I had the same thought about lru_cache, but it doesn't work: neither pl.Series nor even np.ndarray is hashable, so the underlying cache can't key on them.
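The failure is easy to reproduce with the standard library alone: lru_cache builds its cache key by hashing the arguments, so any unhashable argument raises TypeError before the wrapped function even runs. A plain list is used below as a stand-in for pl.Series/np.ndarray, which fail the same way.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def double(values):
    return [v * 2 for v in values]

# lru_cache must hash its arguments to build a cache key.
# A list (like pl.Series or np.ndarray) is unhashable, so this fails:
try:
    double([1, 2, 3])
except TypeError as exc:
    print(exc)  # unhashable type: 'list'

# A hashable argument works, which is why lru_cache only helps
# if you can first convert the input to a tuple or similar:
print(double((1, 2, 3)))  # [2, 4, 6]
```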
Ping me next week. I will see if I can put something behind an env var.
I am not sure about that. We will not call into Python to check function equality, and comparing function pointers has failed in the past.
Can these be special-cased so that a user can declare an expression safe to CSE? This ends up being a pretty big annoyance in some cases and makes certain programming patterns ugly.
At work, I essentially provide a framework where users pass me expressions; I apply them to a base table and also add an over to the user-provided expressions. The only way to avoid UDFs being recomputed is to reference them by name in a later context. That's fine when the UDF is cheap, but some of them are quite expensive, so having CSE work here would be great.
I think this can create many bugs, which I don't want to open up at this point in time. We can look at enabling it for UDFs later.
Even if it's completely opt-in? This is a bit of a blocker for me. I'm curious whether it would be possible to bring back the pl.Expr.cache method as an alternative to this?
The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.

Does lru_cache not work for your case?
@avimallu I don't think you understand the feature request.
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
You'll notice that it prints:
WITH_COLUMNS:
[col("a").python_udf().alias("b"), [(col("a").python_udf()) * (2.cast(Unknown(Any)))].alias("c"), [(col("a").python_udf()) * (3.cast(Unknown(Any)))].alias("d")], []
DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None
meaning that the UDF gets evaluated 3x.
Contrast it with:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").mul(2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
Which will print out:
WITH_COLUMNS:
[col("__POLARS_CSER_0xd39686281a38356a").alias("b"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (2)].alias("c"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (3)].alias("d")], [[(col("a")) * (2)].alias("__POLARS_CSER_0xd39686281a38356a")]
DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None
This requires changes on the Polars side; otherwise, you have to explicitly write your query/code like:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = pl.col("b").mul(2).alias("c")
derived_expr_1 = pl.col("b").mul(3).alias("d")
ldf = ldf.with_columns(udf_expr)
ldf = ldf.with_columns(derived_expr_0, derived_expr_1)
print(ldf.explain())
But in my use case, users build up trees of expressions which they pass to my framework to evaluate. Without CSE support that becomes very ugly and breaks the abstraction.