apache / datafusion-python

Apache DataFusion Python Bindings

Home Page: https://datafusion.apache.org/python

License: Apache License 2.0

Python 30.36% Rust 64.01% Shell 4.61% Dockerfile 0.82% Batchfile 0.20%

datafusion-python's Introduction

DataFusion in Python


This is a Python library that binds to DataFusion, the Apache Arrow-based in-memory query engine.

DataFusion's Python bindings can be used as a foundation for building new data systems in Python. Here are some examples:

  • Dask SQL uses DataFusion's Python bindings for SQL parsing, query planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
  • DataFusion Ballista is a distributed SQL query engine that extends DataFusion's Python bindings for distributed use cases.

It is also possible to use these Python bindings directly for DataFrame and SQL operations, but you may find that Polars and DuckDB are more suitable for this use case, since they have more of an end-user focus and are more actively maintained than these Python bindings.

Features

  • Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources.
  • Queries are optimized using DataFusion's query optimizer.
  • Execute user-defined Python code from SQL (see the UDF sketch after this list).
  • Exchange data with Pandas and other DataFrame libraries that support PyArrow.
  • Serialize and deserialize query plans in Substrait format.
  • Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.
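
The "Execute user-defined Python code from SQL" feature above works by registering a scalar UDF on the context. Below is a minimal sketch, assuming the udf helper and the register_udf / register_record_batches methods exposed by these bindings; the table and function names are purely illustrative.

import pyarrow as pa
from datafusion import SessionContext, udf

ctx = SessionContext()

# Register a small in-memory table to query against
batch = pa.RecordBatch.from_arrays([pa.array([1, None, 3])], names=["a"])
ctx.register_record_batches("t", [[batch]])

# A scalar UDF receives PyArrow arrays (one batch at a time) and returns one
def is_null(array: pa.Array) -> pa.Array:
    return array.is_null()

is_null_udf = udf(is_null, [pa.int64()], pa.bool_(), "stable")
ctx.register_udf(is_null_udf)

# The UDF can now be referenced from SQL by its function name
ctx.sql("SELECT a, is_null(a) FROM t").show()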

Example Usage

The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.

The Parquet file used in this example can be downloaded from the following page:

from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute SQL
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to Pandas
pandas_df = df.to_pandas()

# create a chart
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')

This produces the following chart:

[Chart: Trip Count by Number of Passengers]

Configuration

It is possible to configure runtime (memory and disk) settings and session configuration when creating a context.

from datafusion import RuntimeConfig, SessionConfig, SessionContext

runtime = (
    RuntimeConfig()
    .with_disk_manager_os()
    .with_fair_spill_pool(10000000)
)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)

Refer to the API documentation for more information.

Printing the context will show the current configuration settings.

print(ctx)

More Examples

See examples for more information:

  • Executing Queries with DataFusion
  • Running User-Defined Python Code
  • Substrait Support

How to install

Pip

pip install datafusion
# or
python -m pip install datafusion

Conda

conda install -c conda-forge datafusion

You can verify the installation by running:

>>> import datafusion
>>> datafusion.__version__
'0.6.0'

How to develop

This assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin.

The Maturin tooling used in this workflow can be installed via either Conda or Pip. Both approaches offer the same experience; multiple approaches are provided simply to suit developer preference. Bootstrapping for Conda and Pip is as follows.

Bootstrap (Conda):

# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# create the conda environment for dev
conda env create -f ./conda/environments/datafusion-dev.yaml -n datafusion-dev
# activate the conda environment
conda activate datafusion-dev

Bootstrap (Pip):

# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# prepare development environment (used to build wheel / install in development)
python3 -m venv venv
# activate the venv
source venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies (for Python 3.8+)
python -m pip install -r requirements.in

The tests rely on test data in git submodules.

git submodule init
git submodule update

Whenever Rust code changes (your changes or via git pull):

# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest

Running & Installing pre-commit hooks

arrow-datafusion-python takes advantage of pre-commit to assist developers with code linting to help reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keeping PRs clean and concise.

Our pre-commit hooks can be installed by running pre-commit install, which will install the configurations in your ARROW_DATAFUSION_PYTHON_ROOT/.github directory. The hooks then run each time you perform a commit, and the commit fails if an offending lint is found, allowing you to make changes locally before pushing.

The pre-commit hooks can also be run ad hoc, without installing them, by simply running pre-commit run --all-files.

Running linters without using pre-commit

There are scripts in ci/scripts for running Rust and Python linters.

./ci/scripts/python_lint.sh
./ci/scripts/rust_clippy.sh
./ci/scripts/rust_fmt.sh
./ci/scripts/rust_toml_fmt.sh

How to update dependencies

To change test dependencies, edit requirements.in and run:

# install pip-tools (this can be done only once), also consider running in venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt

To update dependencies, run with -U

python -m piptools compile -U --generate-hashes -o requirements-310.txt

More details here


datafusion-python's Issues

ASF source release tarball has wrong directory name

Describe the bug

Directory should be arrow-datafusion-python-0.7.0 not arrow-datafusion-0.7.0

A    arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz
A    arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.asc
A    arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.sha256
A    arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.sha512


Build Python wheels for Mac ARM architecture (e.g. M1)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We only build Python wheels for Mac with x86_64 architecture, so we cannot support newer Macs with the M1 chip or later.

Describe the solution you'd like
Build wheels for Mac arm architecture

Describe alternatives you've considered
None

Additional context
None

Reading delta table as pyarrow dataset does not work

Describe the bug
I am unable to display the contents of delta tables stored locally

To Reproduce

[tool.poetry.dependencies]
python = "^3.10"
datafusion = "^0.7.0"
deltalake = "^0.6.4"

then run the following code:

import pyarrow as pa
import pyarrow.dataset as ds

from deltalake import DeltaTable
import datafusion

ctx = datafusion.SessionContext()

delta_table = DeltaTable("/local_delta_path/")
pa_dataset = delta_table.to_pyarrow_dataset()

ctx.register_dataset("pa_dataset", pa_dataset)

tmp = ctx.sql("SELECT * FROM pa_dataset limit 10")
tmp.show()

When executed in a notebook in VS Code, this script can run for more than 20 minutes and I am unable to interrupt the execution.

Expected behavior
Top rows displayed

Run Apache RAT (Release Audit Tool) in CI

Because of the history of this project, we are missing the usual ASF infrastructure such as RAT tests to ensure that all files contain an ASF header.

I don't know if we should consider rewriting the history so that we fork from DataFusion just before the Python bindings were removed, and then delete everything else, and then re-apply changes that were made in this repo ... or just copy the parts we need from DataFusion?

Release version 0.7.0

I would like to propose releasing version 0.7.0.

Checklist:

  • #31
  • #33
  • Publish release candidate
  • Start vote
  • Vote passes
  • Publish Python artifacts to PyPi
  • #39

Support fsspec based filesystems

Allow for using a Python fsspec filesystem as an ObjectStore which would allow DataFusion Python to support any filesystem supported by fsspec. I have this working for an internal project.

Would this contribution be valuable to this project? Fsspec supports many different filesystems and cloud providers so it could make using DataFusion easier if native Rust ObjectStores aren't available yet.

Supported built-in fsspec filesystems
Third-party implementations
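
One way to use an fsspec filesystem with the current bindings, before any native ObjectStore integration exists, is to go through a PyArrow dataset and the existing register_dataset method. A minimal sketch, assuming s3fs as the fsspec implementation; the bucket path is purely illustrative.

import pyarrow.dataset as ds
import s3fs  # any fsspec-compatible filesystem should work the same way
from datafusion import SessionContext

ctx = SessionContext()

# Wrap the fsspec filesystem in a PyArrow dataset (illustrative path)
fs = s3fs.S3FileSystem(anon=True)
dataset = ds.dataset("my-bucket/path/to/parquet/", format="parquet", filesystem=fs)

# DataFusion can scan any registered PyArrow dataset
ctx.register_dataset("remote_data", dataset)
ctx.sql("SELECT count(*) FROM remote_data").show()

This is only a workaround; the proposal above is about exposing fsspec filesystems as a first-class ObjectStore so that queries against object storage work without the PyArrow dataset indirection.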

support creating arrow-datafusion-python conda environment

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Development in arrow-datafusion-python is currently limited to pip environments. While nothing is wrong with that, lots of people use Conda these days. The problem is that if a developer/maintainer/user has Conda installed in their environment, they cannot run maturin develop with both a Python virtual environment and Conda on the path. This confuses maturin, leading to issues.

I think a simple solution is to also include an environment.yml file for Conda developers to use instead of relying on pip. This addition is easy enough, and there is tooling available for generating pip requirements files from conda environment files, which means keeping those environments in sync shouldn't be an issue.

Describe the solution you'd like
Introduce an environment.yml file for creating an arrow-datafusion-python environment and supporting documentation. Bonus for maintainer documentation about keeping the pip and anaconda environment files in sync.

Describe alternatives you've considered
Open an issue with Maturin to address the problem. The workaround of having a conda environment file seems worthwhile, however.

Additional context
None

Maturin build hangs on Linux ARM64

Describe the bug

Executing maturin build or maturin develop hangs on Linux ARM64 (openEuler 22.03 LTS)

To Reproduce
Steps to reproduce the behavior:

  • ssh to a Linux ARM64 system
  • maturin develop

Expected behavior

The build should pass successfully.

Additional context

N/A

Cannot install on Mac M1 from source tarball from testpypi

Describe the bug
I am trying to install from testpypi on an M1 mac.

pip3 install -i https://test.pypi.org/simple/ datafusion==0.7.0
Looking in indexes: https://test.pypi.org/simple/
Collecting datafusion==0.7.0
  Using cached https://test-files.pythonhosted.org/packages/a4/4f/5c588562ec6ab1651659ff35e34c197a7c1eaa7663360f2ea9d7d777547d/datafusion-0.7.0.tar.gz (150 kB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [3 lines of output]
      Looking in indexes: https://test.pypi.org/simple/
      ERROR: Could not find a version that satisfies the requirement maturin<0.14,>=0.11 (from versions: none)
      ERROR: No matching distribution found for maturin<0.14,>=0.11
      [end of output]

To Reproduce

pip3 install -i https://test.pypi.org/simple/ datafusion==0.7.0


ImportPathMismatchError when running pytest locally

Describe the bug

__________________________________________________________________________________________________ ERROR collecting test session __________________________________________________________________________________________________
venv/lib/python3.8/site-packages/_pytest/config/__init__.py:607: in _importconftest
    mod = import_path(conftestpath, mode=importmode, root=rootpath)
venv/lib/python3.8/site-packages/_pytest/pathlib.py:556: in import_path
    raise ImportPathMismatchError(module_name, module_file, path)
E   _pytest.pathlib.ImportPathMismatchError: ('datafusion.tests.conftest', '/home/andy/git/apache/arrow-datafusion-python/datafusion/tests/conftest.py', PosixPath('/home/andy/git/apache/arrow-datafusion-python/target/package/datafusion-python-0.7.0/datafusion/tests/conftest.py'))


[DISCUSS] arrow-datafusion-python versioning

Apologies if this isn't the correct venue for discussing this, but I want to discuss versioning for arrow-datafusion-python. With the arrow-datafusion 16.0.0 release in motion, I was thinking about the versioning for the upcoming arrow-datafusion-python release. If the pattern continues, the next version would be 0.8.0. This was confusing to me when first starting with arrow-datafusion-python, since I expected the version of the Python bindings to match the version of arrow-datafusion I was using, which isn't the case. So my proposal is: why don't we keep the release versions for arrow-datafusion-python in sync with arrow-datafusion and make the next release of arrow-datafusion-python 16.0.0?

I believe this helps make things much more clear to the end user about which underlying version of datafusion they are using. If there are bugs in arrow-datafusion-python itself that warrant a release then we could just append a post-release fix version to the number.

Curious about others' thoughts on this? I know software versioning can almost be a religious discussion at times, and this is just a very coarse layout.

Support parquet WriterProperties

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
To enable compression, I would like to customize parquet WriterProperties when using DataFrame.write_parquet.

Describe the solution you'd like

  • Support building parquet WriterProperties from Python.

Describe alternatives you've considered

  • For basic properties such as default compression applied to all columns, adding compression option as an argument to DataFrame.write_parquet would be sufficient.
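
A sketch of what that simpler alternative could look like from Python. Note that the compression keyword shown here is hypothetical and does not exist yet; only the plain write_parquet call works today.

from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS a, 'x' AS b")

# Works today, but with no control over the Parquet writer settings
df.write_parquet("out_dir")

# Hypothetical (not implemented): a per-call compression option applied to all
# columns, as described in the alternative above
# df.write_parquet("out_dir", compression="zstd")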

Build failing due to rust-toolchain file

Currently the build is failing due to the rust-toolchain file that is checked in:

error: failed to select a version for the requirement `parquet = "^18.0.0"`
candidate versions found which didn't match: 15.0.0, 14.0.0, 13.0.0, ...

Removing the rust-toolchain file fixes the issue. Do we need this file checked in? The original repo does not have it checked in.

Reading csv does not work

Describe the bug
ctx.read_csv(...) returns an empty dataset.

To Reproduce
import datafusion

ctx = datafusion.SessionContext()

blasted = ctx.read_csv(path="/localpath/a.tsv", has_header=True, delimiter="\t", schema_infer_max_records=100)
blasted.show()

Expected behavior
A few rows from the CSV should be printed out.

Expand unit tests for built-in functions

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Hi @andygrove,

First of all I wanted to thank you for creating and open-sourcing datafusion - I have been following the project for some time and find it super exciting to see the next-gen data engineering infrastructure being built on top of Arrow!

I have no experience with Rust but quite some with writing applications in Python & PySpark so I thought I could contribute to the Python language bindings.

Recently I saw @francis-du added lots of functions to the Python package with #73 - thanks for the significant effort in improving the package!

To improve the test coverage, I've opened pull request #129 with a few unit tests for the built-in functions. Hope the tests are helpful, look forward to getting some feedback.

Describe the solution you'd like
Improve test coverage by adding more unit tests.

Describe alternatives you've considered
n/a

Additional context
n/a

Register_parquet not working for pandas parquet files

Bug description
From pandas I am writing a parquet file (using gzip compression), and I am looking to query this file using datafusion. The same file exported as .csv works fine using this library, but the parquet version does not return anything.

Steps to reproduce

import datafusion
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3], 'col4': [3, 4, 4, 5, 6]})
df.to_csv('df.csv', compression=None)  
df.to_parquet('df.pq', compression=None)  

ctx = datafusion.SessionContext()
ctx.register_csv(name="example_csv", path="df.csv")
ctx.register_parquet(name="example_pq", path="df.pq")

# test csv
df = ctx.sql("SELECT * FROM example_csv")
result = df.collect()
res = result[0]

# test parquet
df = ctx.sql("SELECT * FROM example_pq")
result = df.collect()
res = result[0]

Expected behavior
The same result from both approaches

Additional context
It also seems that gzip-compressed files are not working. I am not sure why this is; please consider the following example:

df.to_csv('df.csv.gz') 
ctx.register_csv(name="example_csv_gz", path="df.csv.gz")

# test csv
df = ctx.sql("SELECT * FROM example_csv_gz")
result = df.collect()
res = result[0]

Build Python source distribution in GitHub workflow

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We do not currently build the Python source distribution automatically.

Describe the solution you'd like
Update GitHub workflow to create Python source distribution.

Describe alternatives you've considered
Keep doing it manually

Additional context
None

Bindings for CSV/JSON compression support

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support for reading compressed CSV/JSON files was recently added to DataFusion. We should add supporting bindings here so that the Python bindings can take advantage of that as well.

Describe the solution you'd like
Update context.rs::register_csv/json(...) to take advantage of the new CsvReadOptions which allows for specifying a file compression type.
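
A hypothetical sketch of the Python surface once the compression setting is passed through; the file_compression_type keyword is illustrative only and is not an existing parameter.

from datafusion import SessionContext

ctx = SessionContext()

# Works today for uncompressed files
ctx.register_csv("events", "events.csv")

# Hypothetical (not implemented): forward the compression type to DataFusion's
# CsvReadOptions so that compressed files can be registered directly
# ctx.register_csv("events_gz", "events.csv.gz", file_compression_type="gzip")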

Describe alternatives you've considered
None

Additional context
None

EPIC: Add all `SessionContext` and `DataFrame` methods to Python API

The Python bindings currently only expose a subset of functionality, and we want to expose as much as possible.

Here is a list of all available Rust methods. Note that there may be reasons why we don't want to expose some of these.

SessionContext

  • new
  • with_config
  • register_csv
  • register_parquet
  • sql
  • catalog
  • table_exist
  • table
  • tables
  • register_table
  • register_udf
  • register_udaf
  • #28
  • register_variable
  • read_table
  • #56
  • #57
  • #58
  • register_listing_table
  • register_json
  • register_avro
  • register_catalog
  • deregister_table
  • session_id
  • state

DataFrame

  • new
  • select
  • select_columns
  • with_column
  • with_column_renamed
  • explain
  • collect
  • show / show_limit
  • schema
  • filter
  • aggregate
  • sort
  • join
  • limit
  • #27
  • #25
  • #29
  • #30
  • #26
  • #12
  • collect_partitioned
  • to_logical_plan
  • create_physical_plan
  • registry
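
For reference, a short sketch exercising a few of the DataFrame methods that are already exposed (select, aggregate, limit, collect), assuming the column helper and the functions module from the Python package; the data is illustrative.

import pyarrow as pa
from datafusion import SessionContext, column, functions as f

ctx = SessionContext()
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([10, 20, 30])], names=["a", "b"]
)
df = ctx.create_dataframe([[batch]])

# Chain the already-exposed DataFrame methods
result = (
    df.select(column("a"), column("b"))
      .aggregate([column("a")], [f.sum(column("b"))])
      .limit(10)
      .collect()
)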

Add Python bindings for substrait module

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
arrow-datafusion recently accepted the donation of arrow-contrib/datafusion-substrait into its main branch. Now that the two repos are formally joined, we should add Python bindings for those new bits. There are effectively three places to write the bindings: producer, consumer, and serializer.

Describe the solution you'd like
Python bindings to interact with arrow-datafusion substrait module

Describe alternatives you've considered
None

Additional context
None

GitHub Actions produce a lot of warnings

Describe the bug

For example https://github.com/martin-g/arrow-datafusion-python/actions/runs/3591220901 produced:


generate-license
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: actions/checkout@v2, actions-rs/toolchain@v1, actions/upload-artifact@v2

generate-license
The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/


Manylinux
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: actions/checkout@v2, actions/download-artifact@v2, actions/upload-artifact@v2

Manylinux
The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/


...

To Reproduce

Check the Summary of any workflow in Github Actions.

Expected behavior

0 or close to 0 warnings.

Additional context

window_lead test appears to be non-deterministic

Describe the bug

The test works for me locally but fails in CI.

2023-01-19T00:06:25.0267445Z df = <datafusion.DataFrame object at 0x7ffbd80ccc70>
2023-01-19T00:06:25.0268026Z 
2023-01-19T00:06:25.0268336Z     def test_window_lead(df):
2023-01-19T00:06:25.0268610Z         df = df.select(
2023-01-19T00:06:25.0268869Z             column("a"),
2023-01-19T00:06:25.0269119Z             f.alias(
2023-01-19T00:06:25.0269347Z                 f.window(
2023-01-19T00:06:25.0269666Z                     "lead", [column("b")], order_by=[f.order_by(column("b"))]
2023-01-19T00:06:25.0269979Z                 ),
2023-01-19T00:06:25.0270215Z                 "a_next",
2023-01-19T00:06:25.0270448Z             ),
2023-01-19T00:06:25.0270671Z         )
2023-01-19T00:06:25.0270865Z     
2023-01-19T00:06:25.0271152Z         table = pa.Table.from_batches(df.collect())
2023-01-19T00:06:25.0271426Z     
2023-01-19T00:06:25.0271707Z         expected = {"a": [1, 2, 3], "a_next": [5, 6, None]}
2023-01-19T00:06:25.0272031Z >       assert table.to_pydict() == expected
2023-01-19T00:06:25.0272883Z E       AssertionError: assert {'a': [3, 1, ... [None, 5, 6]} == {'a': [1, 2, ... [5, 6, None]}
2023-01-19T00:06:25.0273211Z E         Differing items:
2023-01-19T00:06:25.0273583Z E         {'a_next': [None, 5, 6]} != {'a_next': [5, 6, None]}
2023-01-19T00:06:25.0273949Z E         {'a': [3, 1, 2]} != {'a': [1, 2, 3]}
2023-01-19T00:06:25.0274208Z E         Full diff:
2023-01-19T00:06:25.0274555Z E         - {'a': [1, 2, 3], 'a_next': [5, 6, None]}
2023-01-19T00:06:25.0274917Z E         ?            ---              ------
2023-01-19T00:06:25.0275260Z E         + {'a': [3, 1, 2], 'a_next': [None, 5, 6]}
2023-01-19T00:06:25.0275725Z E         ?        +++                      ++++++
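
One possible way to make the assertion independent of row order is sketched below (assuming a pyarrow version with Table.sort_by); this is just one option, not necessarily the fix the project will adopt.

import pyarrow as pa
from datafusion import column, functions as f

def test_window_lead(df):
    # df is the session fixture used by the existing test
    df = df.select(
        column("a"),
        f.alias(
            f.window("lead", [column("b")], order_by=[f.order_by(column("b"))]),
            "a_next",
        ),
    )
    # Sort the materialized table so the comparison does not depend on the
    # order in which partitions are returned
    table = pa.Table.from_batches(df.collect()).sort_by("a")
    assert table.to_pydict() == {"a": [1, 2, 3], "a_next": [5, 6, None]}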


EPIC: Add all functions to python binding `functions`

DataFusion

https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/built_in_function.rs
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/aggregate_function.rs
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/window_function.rs

DataFusion Python

https://github.com/apache/arrow-datafusion-python/blob/master/src/functions.rs

Scalar Functions:

  • abs
  • acos
  • asin
  • atan
  • atan2
  • ceil
  • cos
  • exp
  • floor
  • ln
  • log
  • log10
  • log2
  • power
  • pow
  • round
  • signum
  • sin
  • sqrt
  • tan
  • trunc
  • coalesce
  • make_array
  • array
  • ascii
  • bit_length
  • btrim
  • char_length
  • character_length
  • concat
  • concat_ws
  • chr
  • current_date
  • current_time
  • date_part
  • datepart
  • date_trunc
  • datetrunc
  • date_bin
  • initcap
  • left
  • length
  • lower
  • lpad
  • ltrim
  • md5
  • nullif
  • octet_length
  • random
  • regexp_replace
  • repeat
  • replace
  • reverse
  • right
  • rpad
  • rtrim
  • sha224
  • sha256
  • sha384
  • sha512
  • digest
  • split_part
  • starts_with
  • strpos
  • substr
  • to_hex
  • to_timestamp
  • to_timestamp_millis
  • to_timestamp_micros
  • to_timestamp_seconds
  • now
  • translate
  • trim
  • upper
  • uuid
  • regexp_match
  • struct
  • from_unixtime
  • arrow_typeof

Aggregate Functions

  • min
  • max
  • count
  • avg
  • mean
  • sum
  • median
  • approx_distinct
  • array_agg
  • var
  • var_samp
  • var_pop
  • stddev
  • stddev_samp
  • stddev_pop
  • covar
  • covar_samp
  • covar_pop
  • corr
  • approx_percentile_cont
  • approx_percentile_cont_with_weight
  • approx_median #32
  • grouping

Window Functions (Not yet verified)

  • row_number
  • rank
  • dense_rank
  • percent_rank
  • cume_dist
  • ntile
  • lag
  • lead
  • first_value
  • last_value
  • nth_value
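
For context, a tiny sketch of calling a few of these from the Python functions module, assuming that these particular functions (upper, abs, avg) are already bound; the data is illustrative.

import pyarrow as pa
from datafusion import SessionContext, column, functions as f

ctx = SessionContext()
batch = pa.RecordBatch.from_arrays(
    [pa.array(["a", "b", "c"]), pa.array([1.0, 2.5, -3.0])], names=["s", "v"]
)
df = ctx.create_dataframe([[batch]])

# Scalar functions compose like ordinary expressions
df.select(f.upper(column("s")), f.abs(column("v"))).show()

# Aggregate functions go through DataFrame.aggregate
df.aggregate([], [f.avg(column("v"))]).show()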

Integrate with the new `object_store` crate

I would like to be able to query data on S3 with datafusion-python and the new object_store crate. I haven't had the chance to play with that crate yet, but once I do I would like to get the necessary functionality added here.

Python bindings create duplicated qualified fields after joining

Describe the bug
I'm working on getting datafusion added to db-benchmark (#147). While putting the benchmarks together, I came across an error in the join benchmark that I wasn't expecting. Specifically, the error is:

Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 72, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'ce9f0daee780e4f2796b9953bd267457c.id1'")

The test code that produced that is here:

question = "small inner on int" # q1
gc.collect()
t_start = timeit.default_timer()
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
shape = ans_shape(ans)
print(shape)
t = timeit.default_timer() - t_start
t_start = timeit.default_timer()
df = ctx.create_dataframe([ans])
chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0]
chkt = timeit.default_timer() - t_start
m = memory_usage()
write_log(task=task, data=data_name, in_rows=x_data.num_rows, question=question, out_rows=shape[0], out_cols=shape[1], solution=solution, version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, chk=make_chk([chk]), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()

If I update the SQL to:

SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1

I get:

Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 73, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'cb53bcf8886f449c3bd2651571df185d4.id4'")

To me this looks like a bug, as I think I should be able to write the query without having to alias the overlapping columns (when I alias the overlapping columns it works). For example, below is the equivalent Spark query.

select * from x join small using (id1)
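
As noted above, aliasing the overlapping columns avoids the error. A sketch of that workaround, assuming the same x and small tables:

# Alias the columns that appear in both tables so the resulting schema has no
# duplicate qualified field names
ans = ctx.sql(
    "SELECT x.id1, x.id2, x.id3, x.id4 AS x_id4, small.id4 AS small_id4, "
    "x.id5, x.id6, x.v1, small.v2 "
    "FROM x INNER JOIN small ON x.id1 = small.id1"
).collect()
df = ctx.create_dataframe([ans])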

Expected behavior
I should be able to run either of the following

ans = ctx.sql("SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])


Release version 0.8.0

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We should release a new version based on DataFusion 16/17/18.

Describe the solution you'd like

Describe alternatives you've considered
None

Additional context
None

Add more maintainers in PyPi

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We should give more PMC members permission to manage this project in PyPI.

Describe the solution you'd like
PMC members should sign up for an account on the following sites and then post their usernames on this issue.

Describe alternatives you've considered

Additional context

Add script for Python linting

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
CI runs Python linters and fails the build if formatting is incorrect.

Describe the solution you'd like
Add script that I can run locally to format the Python code

Describe alternatives you've considered
None

Additional context
None
