
prrao87 / duckdb-study


Compare DuckDB, Polars and Pandas for generating an artificial dataset of persons and companies

License: MIT License



DuckDB, Polars and Pandas performance comparison

This repo contains code showing that pandas, despite its maturity and importance to the Python data science ecosystem, is far slower than either polars or duckdb on this task and uses memory less efficiently, because it eagerly materializes the entire dataset in memory. polars and duckdb, by contrast, can evaluate lazily and materialize data only when needed.

  • pandas is a popular Python DataFrame library for data manipulation and analysis, whose performance-critical internals are written in Cython and C on top of NumPy. The latest major version of pandas (2.x) is used in this benchmark.
  • polars is a newer DataFrame library for data manipulation and analysis, whose internals are in Rust, and is far more amenable to multi-threaded DataFrame operations than pandas.
  • duckdb is a high-performance embedded database that can be queried via a rich SQL dialect. Its core is written in C++, it is designed to be fast, reliable and easy to use, and its query results convert easily to either pandas or polars DataFrames.

Both polars and duckdb (being written in Rust and C++ respectively) have had ample opportunity to learn from the pain points of pandas and the general PyData ecosystem over the years, allowing them to leverage lessons from modern database theory and the power of the Apache Arrow ecosystem.

Setup

Create a Python virtual environment and then install the dependencies from requirements.txt as follows.

python -m venv .venv  # `python` must point to Python 3.10+
source .venv/bin/activate
python -m pip install -U pip wheel  # Upgrade pip and install the wheel package first
python -m pip install -r requirements.txt

Generate dataset

The benchmark task is to generate, with each of pandas, polars and duckdb in turn, an artificial dataset of persons, the companies at which they held positions, and their locations. The input dataset is the 7+ million companies dataset from Kaggle, which is preprocessed into a parquet file.

The dataset is generated by randomly sampling from the input dataset, and then randomly generating positions for each person with repetition. This produces a one-to-many relationship: the same person can hold positions at multiple companies.
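The sampling-with-repetition idea can be sketched in a library-agnostic way using only the standard library. This is a minimal illustration, not the repo's implementation: the function name `generate_positions`, the ID ranges, and the company labels are all hypothetical, and the sizes are scaled down from the real 1M persons and 10M positions.

```python
import random


def generate_positions(person_ids, company_ids, num_positions, seed=42):
    """Pair persons with companies by sampling both with replacement.

    Hypothetical helper for illustration; the repo's actual column
    names and sampling code may differ.
    """
    rng = random.Random(seed)
    # Sampling persons with replacement lets the same person appear in
    # several rows, which yields the one-to-many relationship.
    persons = rng.choices(person_ids, k=num_positions)
    companies = rng.choices(company_ids, k=num_positions)
    return list(zip(persons, companies))


# Scaled-down stand-ins for the 1M persons and the Kaggle companies
person_ids = list(range(1, 101))
company_ids = [f"company_{i}" for i in range(1, 51)]

positions = generate_positions(person_ids, company_ids, num_positions=1000)
```

With 1000 positions spread over 100 persons, each person necessarily appears in multiple rows, mirroring the 10:1 ratio of positions to persons in the benchmark.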

Benchmark

The example results and timing benchmarks to generate the full dataset are shown below. The benchmarks are run using the pytest-benchmark library which itself depends on pytest.

Conditions

  • Raw dataset being generated contains 1M persons and 10M positions at companies that these 1M persons have held
  • MacBook Pro M2, 16 GB RAM
  • Average of 3 runs (for each of pandas, polars and duckdb)
  • Garbage collection disabled during the benchmark (--benchmark-disable-gc)
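To make the conditions concrete, here is a rough standard-library sketch of what "average of 3 runs with the garbage collector disabled" means. It is only an approximation for illustration and is not how pytest-benchmark is implemented; `time_rounds` and `build_rows` are hypothetical names, and the workload is a trivial stand-in for the dataset generation.

```python
import gc
import statistics
import time


def time_rounds(func, rounds=3):
    """Return (mean, timings) for `rounds` timed calls of `func`."""
    timings = []
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the measurement
    try:
        for _ in range(rounds):
            start = time.perf_counter()
            func()
            timings.append(time.perf_counter() - start)
    finally:
        # Restore the collector to its previous state
        if gc_was_enabled:
            gc.enable()
    return statistics.mean(timings), timings


def build_rows():
    # Hypothetical stand-in workload, much smaller than the real task
    return [(i, i % 7) for i in range(100_000)]


mean_time, timings = time_rounds(build_rows, rounds=3)
```

In the actual benchmark, pytest-benchmark handles the rounds, the timer and the statistics, so none of this needs to be written by hand.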

Results

$ cd perf_study
$ pytest benchmark.py --benchmark-min-rounds=3 --benchmark-disable-gc

==================================================================================================== test session starts ====================================================================================================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=3 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /code/embedded-dbs/perf_study
plugins: benchmark-4.0.0
collected 3 items                                                                                                                                                                                                           

benchmark.py ...                                                                                                                                                                                                      [100%]


---------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------
Name (time in s)              Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_duckdb      3.6492 (1.0)       4.0477 (1.0)       3.8405 (1.0)      0.1998 (1.0)       3.8245 (1.0)      0.2989 (1.0)           1;0  0.2604 (1.0)           3           1
test_benchmark_polars      5.4167 (1.48)      6.4685 (1.60)      5.7704 (1.50)     0.6046 (3.03)      5.4260 (1.42)     0.7888 (2.64)          1;0  0.1733 (0.67)          3           1
test_benchmark_pandas     19.3788 (5.31)     19.8074 (4.89)     19.5716 (5.10)     0.2175 (1.09)     19.5287 (5.11)     0.3215 (1.08)          1;0  0.0511 (0.20)          3           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=============================================================================================== 3 passed in 144.33s (0:02:24) ===============================================================================================

Results

Average over 3 runs:

Approach    Time (sec)    Runtime relative to DuckDB
duckdb            3.84    1.0
polars            5.77    1.5
pandas           19.57    5.1

As can be seen, duckdb is the fastest 🔥, followed closely by polars at about 1.5x its runtime. pandas is the slowest, taking roughly 5x as long as duckdb.
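The relative runtimes can be reproduced from the Mean column of the pytest-benchmark output. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# Mean times in seconds, copied from the pytest-benchmark output
mean_seconds = {"duckdb": 3.8405, "polars": 5.7704, "pandas": 19.5716}

baseline = mean_seconds["duckdb"]
relative = {name: round(t / baseline, 1) for name, t in mean_seconds.items()}

print(relative)  # {'duckdb': 1.0, 'polars': 1.5, 'pandas': 5.1}
```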
