Comments (6)
You can check the plan with df.explain. You should see the filter being pushed down into the scan as a pyarrow compute expression.
If it is correctly showing pushed-down pyarrow compute expressions, then this points to an issue in pyarrow, where the filters are not converted to partition filters.
from polars.
Yes, we just pass the predicates to pyarrow. So I think this should be taken upstream.
I don't think the issue is with pyarrow: calling to_table and passing in the compute expressions directly works as expected outside of Polars.
I suspect the predicates are not being passed to to_table as we would expect when using scan_pyarrow_dataset. See the screenshots above from my debug session: in the _scan_pyarrow_dataset_impl function I can see that no predicates are passed in as an argument, and thus no filter is provided to ds.to_table. The predicates seem to be getting lost in translation somewhere.
The query plan looks correct to me, however, judging from the output of explain():
data.explain()
'FILTER [([(col("underlier_id")) == (5135108)]) & ([(col("trade_date")) == (2016-01-04)])] FROM\n\n PYTHON SCAN \n PROJECT */7 COLUMNS'
So filtering on non-date/datetime columns works; see below. Run this code as-is:
import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [1, 2, 3],
        "baz": [1, 2, 3],
    },
    schema={"foo": pl.Int64, "bar": pl.Date, "baz": pl.Int64},
)
df.write_delta(
    "test_table_scan",
    mode="overwrite",
    delta_write_options={"partition_by": ["foo", "bar"], "engine": "rust"},
    overwrite_schema=True,
)
print(
    pl.scan_delta("test_table_scan").filter(pl.col("foo") == 2).collect()
)
However, a predicate that contains a date or datetime breaks the predicate pushdown into pyarrow; similar issue: #16248
import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [1, 2, 2],
        "baz": [1, 2, 3],
    },
    schema={"foo": pl.Int64, "bar": pl.Date, "baz": pl.Int64},
)
df.write_delta(
    "test_table_scan",
    mode="overwrite",
    delta_write_options={"partition_by": ["foo", "bar"], "engine": "rust"},
    overwrite_schema=True,
)
print(
    pl.scan_delta("test_table_scan")
    .filter(pl.col("foo") == 2, pl.col("bar") == pl.date(1970, 1, 3))
    .collect()
)
Seems like the pushdown is not working when the predicate includes dates/datetimes @ritchie46:
print(pl.scan_delta('test_table_scan').filter(pl.col('foo')==2, pl.col('bar')== pl.date(1970,1,3)).explain(optimized=True))
FILTER [([(col("foo")) == (2)]) & ([(col("bar")) == (dyn int: 1970.dt.datetime([dyn int: 1, dyn int: 3, dyn int: 0, dyn int: 0, dyn int: 0, dyn int: 0, String(raise)]).strict_cast(Date))])] FROM
PYTHON SCAN
PROJECT */3 COLUMNS
This issue is related: #11152
Thank you very much for the replies!
Out of curiosity, what exactly is it about dates that breaks the predicate pushdown? This would be a very nice feature to have, as it currently makes scan_pyarrow_dataset unusable on date-partitioned datasets, and it is a very powerful feature we'd love to take advantage of :)