GithubHelp home page GithubHelp logo

Comments (7)

timsaucer avatar timsaucer commented on August 28, 2024 1

There are a couple of things we'll need to do that immediately come to mind:

  • A few of the examples have some differences with what the spec shows as the expected outcome. We need to check if these are a problem in my code or something to do with the size of the generated data set I used. EDIT: fixed. PR to be put up shortly.
  • Decide if we want to run against the answers file or against the one line output shown in the spec. If we want to go against the answer file, update the test to use the query file parameters instead of the values in the spec.
  • Pull the code for every example into a function that can return a dataframe
  • Write code to compare the output dataframe to the return value from the test. We can probably limit decimal precision to 2 decimal places since this is designed to report at the level of $0.01
  • Update CI to pull the dbgen docker image and generate a 1 Gb dataset, the smallest we can expect to get consistent results based on what I've read

I hadn't originally expected the examples to be used in this way, but it's a very good idea.

from datafusion-python.

Michael-J-Ward avatar Michael-J-Ward commented on August 28, 2024 1

After a little experimenting, I'm pretty sure this will work.

Create a ./example/tpch/_tests.py

def test_q01_pricing_summary_report():
    import q01_pricing_summary_report
    df = q01_pricing_summary_report.df
   # assertion code on `df`
  1. Keeps the examples clean
  2. python -m pytest ./examples/tpch/_tests.py will give us a nice test report as a bonus

from datafusion-python.

timsaucer avatar timsaucer commented on August 28, 2024 1

As an update, I've got the tests written as you describe. I removed the reference files from the repo. Now it's checking against the official answer files. There is one special case I mentioned in discord, but from my web searches it appears to be known issue (and my result matches the spec pdf).

All that's left is to work through the CI process. I expect to move my PR to ready on Monday or Tuesday.

from datafusion-python.

timsaucer avatar timsaucer commented on August 28, 2024

I have a branch ready that corrects the examples to match the spec output. Most of it was numerical errors between float and decimal representation, and converting 3 months to 90, 91, or 92 days. When the interval is fixed upstream, we can make that less prone to error. There was one or two queries that needed an algorithmic update. I will hold off on putting in that PR until we first answer if we want to go all in on data validation as part of the example or not. I have this example where I pull the query up into a function and add a validation function. The down side of this is that it makes the example less readable to new users in my opinion. But it does come at the point of adding in both validation and preparing the result to be used by CI. I've asked this question on the discord server, but putting it here for traceability.

from datafusion-python.

Michael-J-Ward avatar Michael-J-Ward commented on August 28, 2024

We could make it stupid simple by making these snapshot tests and using https://pypi.org/project/pytest-snapshot/

(Which is valid because you've already verified the behavior for current behavior, we just want to ensure upgrades don't change the behavior)

from datafusion-python.

timsaucer avatar timsaucer commented on August 28, 2024

Ok, made some good progress on this. Will try to wrap up tomorrow. https://github.com/timsaucer/datafusion-python/blob/tsaucer/prepare_tpch_examples_for_ci/examples/tpch/_tests.py

I do like your idea of switching this to use pytest-snapshot so I'll probably look at that.

I am adding a small set of reduced reference data, which is total of about 1.1 Mb. This will allow users to run the examples without requiring them to generate a full TPC-H data set, but if they want to reproduce the spec result they will have to run the data generator. This is also such a small data set it'll speed up CI by not having to run generator every time. Most importantly, if we don't have a reference data set in the repo it would require all users to run the generator just to get their pyunit tests to pass, which I don't think people would like.

from datafusion-python.

Michael-J-Ward avatar Michael-J-Ward commented on August 28, 2024

My default would be not to add the data to the repo, but I'll let @andygrove decide that. Maybe we could add it to the test-data submodule https://github.com/apache/arrow-testing?

If not, I can add whatever line/script to generate (and maybe cache) the test-data to the CI

from datafusion-python.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.