This repo contains code that highlights that pandas
, despite its maturity and importance to the Python data science ecosystem, is far slower than either polars
or duckdb
, while being less efficient with memory usage, because it materializes the entire data into memory (as opposed to polars
and duckdb
which are lazy and only materialize the data when needed).
pandas
is a popular Python DataFrame library for data manipulation and analysis, whose internals are in C++ -- the latest version ofpandas
(2.x) is used in this benchmark.polars
is a newer DataFrame library for data manipulation and analysis, whose internals are in Rust, and is far more amenable to multi-threading DataFrame ops thanpandas
.duckdb
is a high-performance embedded database that can be queried via a rich SQL dialect. Its core is written in C++, and is designed to be fast, reliable and easy to use, and very amenable to transformation to eitherpandas
orpolars
Both polars
and duckdb
(being written in Rust and C++ respectively) have had ample opportunity to learn from the pain points of pandas
and the general PyData ecosystem over the years, allowing them to leverage lessons from modern database theory and the power of the Apache Arrow ecosystem.
Install a Python virtual environment and then install the dependencies via requirements.txt
as follows.
python -m venv .venv # python -> python 3.10+
source .venv/bin/activate
python -m pip install -U pip wheel # Upgrade pip and install the wheel package first
python -m pip install -r requirements.txt
The benchmark task is basically to use any one of pandas
, polars
, or duckdb
to generate an artificial dataset of persons, the companies they held work positions in, and their locations. The input dataset we begin with is the 7+ million companies dataset from Kaggle, which is preprocessed into a parquet file.
The dataset is generated by randomly sampling from the input dataset, and then randomly generating positions for each person with repetition, allowing a one-to-many relationship (the same person and multiple company positions).
The example results and timing benchmarks to generate the full dataset are shown below. The benchmarks are run using the pytest-benchmark
library which itself depends on pytest
.
- Raw dataset being generated contains 1M persons and 10M positions at companies that these 1M persons have held
- Macbook Pro M2, 16 GB RAM
- Average of 3 runs (for each of
pandas
,polars
andduckdb
) - Garbage collector timing disabled during benchmark
cd perf_study
$ pytest benchmark.py --benchmark-min-rounds=3 --benchmark-disable-gc
==================================================================================================== test session starts ====================================================================================================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=3 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /code/embedded-dbs/perf_study
plugins: benchmark-4.0.0
collected 3 items
benchmark.py ... [100%]
---------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_duckdb 3.6492 (1.0) 4.0477 (1.0) 3.8405 (1.0) 0.1998 (1.0) 3.8245 (1.0) 0.2989 (1.0) 1;0 0.2604 (1.0) 3 1
test_benchmark_polars 5.4167 (1.48) 6.4685 (1.60) 5.7704 (1.50) 0.6046 (3.03) 5.4260 (1.42) 0.7888 (2.64) 1;0 0.1733 (0.67) 3 1
test_benchmark_pandas 19.3788 (5.31) 19.8074 (4.89) 19.5716 (5.10) 0.2175 (1.09) 19.5287 (5.11) 0.3215 (1.08) 1;0 0.0511 (0.20) 3 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
=============================================================================================== 3 passed in 144.33s (0:02:24) ===============================================================================================
Average over 3 runs:
Approach | Time (sec) | Runtime increase vs. DuckDB |
---|---|---|
duckdb |
3.84 | 1.0 |
polars |
5.77 | 1.5 |
pandas |
19.57 | 5.1 |
As can be seen, duckdb
is the fastest ๐ฅ, followed cloesly by polars
. pandas
is the slowest by a factor of 5x when compared to DuckDB.