tpch's People

Contributors

anmyachev, c-peters, chitralverma, cnpryer, indexseek, jbrockmendel, jeroenjanssens, leonb28, marcogorelli, r-brink, ritchie46, stinodego, thomasaarholt

tpch's Issues

Usage of Python subprocess.run

Each query is run through a Python subprocess (subprocess.run).
Doesn't this add a lot of overhead, since Python has to initialize its environment before running any logic?

run([sys.executable, "-m", f"{package_name}.q{i}"])
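One rough way to gauge this overhead is to time how long it takes to launch an interpreter that does nothing. The sketch below is only an illustration and is not part of the benchmark: interpreter_startup_seconds is a hypothetical helper, and it measures bare interpreter startup only, not the cost of importing a query module's dependencies.

```python
import subprocess
import sys
import time


def interpreter_startup_seconds(repeats: int = 5) -> float:
    """Average wall-clock time to start and exit an empty Python process."""
    start = time.perf_counter()
    for _ in range(repeats):
        # Launch via sys.executable, as the runner does, but with no work to do.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    print(f"startup per subprocess: {interpreter_startup_seconds():.3f} s")
```

Comparing this number with the reported per-query timings would show whether process startup is a significant share of the measurement.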

Missing parquet files

Following the README.md, we executed the code. However, we get errors saying that parquet files are missing, for example:
No such file or directory: '/tpch/tables_scale_1/supplier.parquet'

Note: tables_scale_1 folder does not exist.

As a result, I get empty plots.

Is there any further documentation on how to generate or populate the tables_scale folder, or any other required data?
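Before running the queries, it can help to check which files are actually missing. The following is a minimal sketch, not code from the repository: missing_parquet_files is a hypothetical helper, the table names are the eight tables of the TPC-H specification, and the directory name is taken from the error message above. The make tables log in another issue on this page suggests that target generates the data, although the directory layout shown there (data/tables/scale-10) differs from the path in this error.

```python
from pathlib import Path

# The eight tables defined by the TPC-H specification.
TPCH_TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]


def missing_parquet_files(table_dir: str) -> list[str]:
    """Return the expected .parquet file names that are absent from table_dir."""
    base = Path(table_dir)
    return [
        f"{name}.parquet"
        for name in TPCH_TABLES
        if not (base / f"{name}.parquet").is_file()
    ]


if __name__ == "__main__":
    # Directory name taken from the error message; adjust to your checkout.
    print(missing_parquet_files("tables_scale_1"))
```

An empty list means every expected file is present; a nonexistent directory simply reports all eight files as missing.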

`make tables` failed

~/tpch$ make tables
make -C tpch-dbgen all
make[1]: Entering directory '/home/anatoly/tpch/tpch-dbgen'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/anatoly/tpch/tpch-dbgen'
cd tpch-dbgen && ./dbgen -vf -s 10 && cd ..
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for suppliers table/
Preloading text ... 100%
done.
Generating data for customers tabledone.
Generating data for orders/lineitem tablesdone.
Generating data for part/partsupplier tablesdone.
Generating data for nation tabledone.
Generating data for region tabledone.
mkdir -p "data/tables/scale-10"
mv tpch-dbgen/*.tbl data/tables/scale-10/
.venv/bin/python scripts/prepare_data.py 10
Traceback (most recent call last):
  File "/home/anatoly/tpch/scripts/prepare_data.py", line 4, in <module>
    import polars as pl
ModuleNotFoundError: No module named 'polars'
make: *** [Makefile:39: tables] Error 1

It looks like the virtual environment is not being activated correctly.

make version: GNU Make 4.3

Perhaps the problem is in the name of the prerequisite .venv.
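To confirm that the package is missing from the environment itself rather than something going wrong at activation, one can ask the interpreter the Makefile invoked whether it can import polars. This is a small diagnostic sketch; can_import is a hypothetical helper and not part of the repository.

```python
import subprocess


def can_import(python_exe: str, module: str) -> bool:
    """Return True if `python_exe -c "import module"` exits successfully."""
    try:
        result = subprocess.run(
            [python_exe, "-c", f"import {module}"],
            capture_output=True,
        )
    except FileNotFoundError:
        # The interpreter path itself does not exist.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # .venv/bin/python is the interpreter used in the log above.
    print(can_import(".venv/bin/python", "polars"))
```

If this prints False while .venv/bin/python exists, the virtual environment was created but its requirements were probably never installed into it.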

Non-idiomatic usage in q7

In both the pandas and modin queries:

lineitem_filtered["l_year"] = lineitem_filtered["l_shipdate"].apply(
    lambda x: x.year
)

should be

lineitem_filtered["l_year"] = lineitem_filtered["l_shipdate"].dt.year

The polars_queries version uses the analogous idiom. This made a pretty big difference locally.
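As a quick check that the two forms are interchangeable, the sketch below applies both to a toy frame. The data is made up for illustration; only the column name l_shipdate comes from the query.

```python
import pandas as pd

# Toy stand-in for the filtered lineitem frame; only l_shipdate matters here.
lineitem_filtered = pd.DataFrame(
    {"l_shipdate": pd.to_datetime(["1995-03-15", "1996-11-02", "1995-12-31"])}
)

# Row-by-row Python call on each Timestamp.
slow = lineitem_filtered["l_shipdate"].apply(lambda x: x.year)

# Vectorized accessor: extracts the year for the whole column at once.
fast = lineitem_filtered["l_shipdate"].dt.year

# Compare values only; the integer dtype may differ between the two results.
assert slow.tolist() == fast.tolist() == [1995, 1996, 1995]
```

The .dt accessor avoids calling a Python function once per row, which is the likely source of the speedup observed locally.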

Generated Parquet files are extremely fragmented

Hi, I noticed that the generated Parquet files are extremely fragmented in terms of rowgroups. This likely indicates a bug/issue in the Polars Parquet writer, but definitely also affects the results of the benchmarks.

For a SCALE_FACTOR=10 table generation, the Parquet files have a staggering 20,000 rowgroups!
image

Each rowgroup only has about 3,400 rows and a size of 117kB. For reference, Parquet rowgroups are often suggested to be in the range of about 128MB. Because we have so many rowgroups, the Parquet metadata itself is 27MB and it likely introduces a ton of hops in the process of reading the file 😅
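The figures above can be put side by side with a little arithmetic. This sketch treats the approximate numbers from the issue as exact and assumes decimal units (1 MB = 1000 kB), so the results are only indicative.

```python
import math

# Figures quoted in the issue for SCALE_FACTOR=10 (approximate).
ROW_GROUPS = 20_000
ROWS_PER_GROUP = 3_400
GROUP_SIZE_KB = 117
METADATA_MB = 27
TARGET_GROUP_MB = 128  # commonly suggested rowgroup size

total_rows = ROW_GROUPS * ROWS_PER_GROUP                   # 68,000,000 rows
total_mb = ROW_GROUPS * GROUP_SIZE_KB / 1000               # 2,340 MB of rowgroup data
groups_at_target = math.ceil(total_mb / TARGET_GROUP_MB)   # 19 rowgroups
metadata_kb_per_group = METADATA_MB * 1000 / ROW_GROUPS    # 1.35 kB per rowgroup

print(total_rows, total_mb, groups_at_target, metadata_kb_per_group)
```

Under these assumptions, the same data at the commonly suggested size would need roughly 19 rowgroups instead of 20,000, and each existing rowgroup carries about 1.35 kB of metadata for 117 kB of data.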

Writing this instead with PyArrow (I amended the code in prepare_data.py), we get much more well-behaved rowgroups:

image

Still fairly small as rowgroups go, but I think it's much more reasonable and represents Parquet data in the wild a little better!

Get the results without having to run it

Hi,
Is there a way to get the results without cloning and launching the code :-)?

If not, could a GitHub page be deployed automatically after the code has been run with GitHub Actions?

Unexpected arguments

Trying this out, make run_polars shows the following errors:

TypeError: scan_csv() got an unexpected keyword argument 'sep'
[...]
TypeError: LazyFrame.sort() got an unexpected keyword argument 'reverse'
[...]
AttributeError: 'LazyFrame' object has no attribute 'with_column'. Did you mean: 'with_columns'?
[...]

Please advise. I'm a polars newbie.

Plot results failed - arr was deprecated

When using polars 0.8.3, plot_results failed.
I fixed it by changing line 65 from

.with_columns(pl.col("labels").arr.join(",\n"))

to

.with_columns(pl.col("labels").list.join(",\n"))

Are the TPCH benchmark results up to date?

Hi!

Polars' performance is very impressive. I would like to know if the results are up to date, because I did not find the library versions used. Do you have this information?

image

Query #2 Inaccurate Output

When query number 2 is run separately from the other queries, it fails with an error saying that the column s_acctbal cannot be found. I believe the cause is the order in which the result table is joined in the final query, which drops the desired columns from the final output. I think this also explains why the query #2 times are much lower than those of the other queries: the query throws an error before it finishes executing.

Add potential quick fix for macOS compilation error in README

When trying to compile the tpch-dbgen tool on macOS, users may encounter the documented error related to the malloc.h header file. This is a known issue and is mentioned in the README.

To save time and effort for macOS users who encounter this issue, would it be worth providing this command as a shortcut?

sed -i.bak 's/#include <malloc.h>/#include <sys\/malloc.h>/g' tpch-dbgen/bm_utils.c tpch-dbgen/varsub.c

This command replaces the #include <malloc.h> line with #include <sys/malloc.h> in the bm_utils.c and varsub.c files.
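For users who prefer not to rely on sed, the same substitution can be sketched in Python. This is an illustrative equivalent, not something from the repository: patch_malloc_include is a hypothetical helper, and unlike sed -i.bak it only writes a backup when the file actually changes.

```python
import shutil
from pathlib import Path

OLD = "#include <malloc.h>"
NEW = "#include <sys/malloc.h>"


def patch_malloc_include(path: str) -> bool:
    """Rewrite the include in place, keeping a .bak copy. Return True if changed."""
    source = Path(path)
    text = source.read_text()
    if OLD not in text:
        # Nothing to do (for example, the file was already patched).
        return False
    # Keep the original next to the patched file, like sed's -i.bak.
    shutil.copyfile(source, source.with_name(source.name + ".bak"))
    source.write_text(text.replace(OLD, NEW))
    return True
```

Calling patch_malloc_include on tpch-dbgen/bm_utils.c and tpch-dbgen/varsub.c would mirror the sed command above; running it a second time is a no-op.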

Explain tpch in readme

In the interest of clear communication around benchmarks, I'd suggest explaining what tpch stands for in the first line of the README.
