
Comments (16)

ericmjl commented on July 20, 2024

I'd be judicious with the use of Hypothesis, btw. For example, I don't know how to generate examples well for the biology functions. There, an example-based test may be the best kind of test.

Going forward, it's fine to use an example-based test (our original paradigm) rather than a property-based test (using Hypothesis) if that criterion is fulfilled (i.e. it doesn't make sense to generate lots of examples, or we don't know how to do that in a good way). Or, if it's a beginner contribution, an example-based test is fine as well.

To selectively run a test, use pytest.mark to label the test. You'll see a few examples in the test suite:

import pytest

@pytest.mark.hyp  # marks this test with the label "hyp"
def test_function():
    ...

Then, at the terminal:

$ pytest -m hyp  # selectively runs the tests marked with "hyp"

Finally, don't forget that in the top-level directory Makefile, you can do the mega-combo command make check!
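
(One small caveat: custom marks like hyp generally need to be registered, or recent pytest versions will warn about unknown markers. Here's a rough sketch of doing that in a conftest.py; the marker descriptions are made up, and pyjanitor's actual configuration may handle this elsewhere, e.g. in setup.cfg.)

# conftest.py -- sketch only; the real registration may live elsewhere
def pytest_configure(config):
    config.addinivalue_line("markers", "hyp: property-based tests that use Hypothesis")
    config.addinivalue_line("markers", "functions: tests for the dataframe functions")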


szuckerman commented on July 20, 2024

I really like the refactoring of the test suite (especially using the Hypothesis module; it looks awesome).

However, running the tests just took me about 30 seconds, whereas in the past I remember it taking around 7 seconds or so.

I think all the imports at the top of each file might be unnecessarily slowing it down. There are over 30 files, which means hypothesis and pandas could be loaded that many times.

I think a master file with just one import pandas as pd and a from tests import * might work.


szuckerman commented on July 20, 2024

@zbarry yes, but I think it depends on how the files are run. I pass the tests/ folder to pytest, which moves through each file. If pytest is running each file separately, then the imports may be an issue.


szuckerman commented on July 20, 2024

@zbarry, I don't think by much. It still takes about 30 seconds to run everything.

What usually leads to improvements is giving the fixtures session scope rather than function scope (i.e. just change the decorator to @pytest.fixture(scope='session')).

With function scope, the fixture gets recalculated for each test function. With session scope, it runs once and the same object gets passed to all the test functions. However, this doesn't really work for us, since most/all of the tests mutate the DataFrame, which would cause subsequent tests to fail.
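
Just to illustrate the difference (a throwaway sketch with made-up fixtures, not code from our suite):

import pandas as pd
import pytest

# Function scope (the default): the fixture body runs once per test function.
@pytest.fixture
def expensive_df():
    return pd.DataFrame({"a": range(100_000)})

# Session scope: the fixture body runs once per test session, and the same
# object is handed to every test that requests it.
@pytest.fixture(scope="session")
def shared_df():
    return pd.DataFrame({"a": range(100_000)})

def test_mutates(shared_df):
    # Any in-place change here would be visible to later tests that use the
    # same session-scoped object -- which is why this doesn't work for us.
    shared_df["a"] = 0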


szuckerman commented on July 20, 2024

Probably not. The way lru_cache works (as a basic understanding) is that it makes a dictionary out of the function arguments and the result.

For example, if you had:

import functools

@functools.lru_cache(maxsize=128)
def hi(name):
    return 'hi ' + name

If I ran hi('sam'), lru_cache would make something like a dictionary of {'sam': 'hi sam'} so it knows what to return the next time 'sam' is given as an argument. Every new argument adds another key to the dictionary, until you hit 128 entries (since that's the maxsize given), at which point the least recently used values start getting evicted.

The main ideas are that:

  1. The arguments must be hashable so they can be used as keys for the dictionary.
  2. If the arguments are changing all the time, it doesn't help much to cache the results.

For example, take the following pyjanitor test function:

@pytest.mark.functions
@given(df=categoricaldf_strategy(), iterable=names_strategy())
def test_filter_column_isin(df, iterable):
    assume(len(iterable) >= 1)
    df = df.filter_column_isin("names", iterable)
    assert set(df["names"]).issubset(iterable)

Since df is getting generated by categoricaldf_strategy() every time the test runs, the caching from lru_cache probably won't help. Similarly, I'm not sure what's returned from that strategy, so it might not be hashable anyway, in which case lru_cache couldn't use it at all.
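
To make the hashability point concrete (a throwaway example, not from the test suite):

import functools
import pandas as pd

@functools.lru_cache(maxsize=128)
def summarize(df):
    return df.sum()

try:
    summarize(pd.DataFrame({"a": [1, 2, 3]}))
except TypeError as err:
    # lru_cache tries to use the arguments as a dictionary key; a DataFrame
    # isn't hashable, so this fails before the function body even runs.
    print(err)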


zbarry commented on July 20, 2024

Hmm, I thought imports in Python were smart in that each module generally only gets imported once.


ericmjl commented on July 20, 2024

@szuckerman the slowdown in testing is a natural consequence of using Hypothesis. Previously, we were using only one example. Hypothesis is dynamically generating up to (I think) 50-200 examples, looking for a way to falsify each test.

I was able to find some functions for which Hypothesis discovered an example that I wasn't originally thinking of, but that could break the function and needed to be checked for inside it (e.g. a zero-length list being passed into the function).
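
As a toy illustration of the kind of edge case it digs up (not one of the actual pyjanitor functions):

from hypothesis import given, strategies as st

def mean(values):
    # Naive implementation: blows up on an empty list.
    return sum(values) / len(values)

@given(st.lists(st.integers()))
def test_mean(values):
    # Hypothesis quickly generates values=[], reports the resulting
    # ZeroDivisionError, and forces us to decide how mean() should handle it.
    result = mean(values)
    assert min(values) <= result <= max(values)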


zbarry commented on July 20, 2024

Gotcha. Is there a slick way to get Hypothesis to only operate on a subset of tests while developing, so we get fast iteration time, and then run the whole suite before pushing?


szuckerman commented on July 20, 2024

So, just a thought on the refactoring:

I just found out that you can group tests in classes using pytest.

I kind of liked the idea of all the tests being in one file (since that makes it easier to search for and copy other tests), and grouping tests in a class will logically keep the tests for a given method together.
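
For example, something along these lines (made-up tests, just to show the shape):

import pandas as pd
import janitor  # noqa: F401  (registers the DataFrame methods)

class TestCleanNames:
    # pytest collects methods named test_* on classes named Test* (with no
    # __init__), so all the tests for one pyjanitor method stay grouped.
    def test_spaces_become_underscores(self):
        df = pd.DataFrame(columns=["First Name"]).clean_names()
        assert list(df.columns) == ["first_name"]

    def test_names_are_lowercased(self):
        df = pd.DataFrame(columns=["AGE"]).clean_names()
        assert list(df.columns) == ["age"]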

Thoughts?


zbarry commented on July 20, 2024

Generally I'm not a super big fan of giant, monolithic source files, but I don't really have much of an opinion w.r.t. unit-testing sources.


ericmjl commented on July 20, 2024

@szuckerman thanks for that link! I learned something new today 😄.

I guess for me, this is a big ongoing experiment in seeing how much of a hassle it is to maintain a test suite for a growing library of functions. The current hypothesis is two-fold:

  • Hypothesis 1: it will be easier to maintain.
  • Hypothesis 2: it will be easier to onboard newcomers.

Are you ok with continuing this experiment for a while first? Happy to re-evaluate later.


zbarry commented on July 20, 2024

Re: #128, curious - does this speed things up for you when testing everything at once?


zbarry commented on July 20, 2024

What about a @copyinputs decorator, which the @fixture decorator would itself wrap, that copies the fixture data pytest provides, so you don't have to worry about mutation of the original dataframe? 🤔 🤔

(I've been having a lot of fun with decorators recently)
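
Something like this, maybe? (A rough sketch of one way to read the idea: here the decorator wraps the test rather than the fixture, and the names are all made up.)

import copy
import functools
import pytest

def copyinputs(test_func):
    # Deep-copy every argument pytest injects (fixtures included) before the
    # test body sees it, so in-place mutation never touches the shared object.
    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        args = tuple(copy.deepcopy(a) for a in args)
        kwargs = {k: copy.deepcopy(v) for k, v in kwargs.items()}
        return test_func(*args, **kwargs)
    return wrapper

@pytest.mark.functions
@copyinputs
def test_add_column(df):  # assumes some existing df fixture
    df["new"] = 1  # mutates the copy, not the original fixture object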


szuckerman commented on July 20, 2024

That might work, but it appears the real bottleneck is all the different Hypothesis strategies. If there were a way to copy/cache those results I think you would see a speedup, but I'm not sure how easy that is to do.


zbarry commented on July 20, 2024

With basically no knowledge of what you're talking about, would this help? https://docs.python.org/3/library/functools.html#functools.lru_cache


ericmjl commented on July 20, 2024

@szuckerman @zbarry I think this issue is irrelevant at this point. Going to close it.

