Comments (6)
You can check the plan with df.explain. You should see the filter being pushed down into the scan as a pyarrow compute expression.
If it is correctly showing pushed-down pyarrow compute expressions, then this points to an issue in pyarrow, where the filters are not converted to partition filters.
from polars.
Yes, we just pass the predicates to pyarrow. So I think this should be taken upstream.
I don't think the issue is with pyarrow: calling to_table and passing in the compute expressions directly works as expected outside of Polars.
I suspect the predicates are not being passed to to_table as we would expect when using scan_pyarrow_dataset. See the screenshots above from my debug session: in the _scan_pyarrow_dataset_impl function I can see that no predicates are passed in as an argument, and thus no filter is provided to ds.to_table. The predicates seem to be getting lost in translation somewhere.
The query plan looks correct to me, however, judging from the output of explain():
data.explain()
'FILTER [([(col("underlier_id")) == (5135108)]) & ([(col("trade_date")) == (2016-01-04)])] FROM\n\n PYTHON SCAN \n PROJECT */7 COLUMNS'
So filtering on non-date/datetime columns works; see below. Run this code as-is:
import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [1, 2, 3],
        "baz": [1, 2, 3],
    },
    schema={"foo": pl.Int64, "bar": pl.Date, "baz": pl.Int64},
)
df.write_delta(
    "test_table_scan",
    mode="overwrite",
    delta_write_options={"partition_by": ["foo", "bar"], "engine": "rust"},
    overwrite_schema=True,
)
print(
    pl.scan_delta("test_table_scan").filter(pl.col("foo") == 2).collect()
)
However, a predicate that contains a date or datetime breaks the predicate pushdown into pyarrow; similar issue: #16248
import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [1, 2, 2],
        "baz": [1, 2, 3],
    },
    schema={"foo": pl.Int64, "bar": pl.Date, "baz": pl.Int64},
)
df.write_delta(
    "test_table_scan",
    mode="overwrite",
    delta_write_options={"partition_by": ["foo", "bar"], "engine": "rust"},
    overwrite_schema=True,
)
print(
    pl.scan_delta("test_table_scan")
    .filter(pl.col("foo") == 2, pl.col("bar") == pl.date(1970, 1, 3))
    .collect()
)
Seems like the pushdown is not working when the predicate includes dates/datetimes @ritchie46:
print(pl.scan_delta('test_table_scan').filter(pl.col('foo')==2, pl.col('bar')== pl.date(1970,1,3)).explain(optimized=True))
FILTER [([(col("foo")) == (2)]) & ([(col("bar")) == (dyn int: 1970.dt.datetime([dyn int: 1, dyn int: 3, dyn int: 0, dyn int: 0, dyn int: 0, dyn int: 0, String(raise)]).strict_cast(Date))])] FROM
PYTHON SCAN
PROJECT */3 COLUMNS
This issue is related: #11152
Thank you very much for the replies!
Out of curiosity, what exactly is it about dates that breaks the predicate pushdown? This would be a very nice feature to have, as it currently makes scan_pyarrow_dataset unusable on date-partitioned datasets, and it is a very powerful feature we'd love to take advantage of :)