Comments (16)
I'd be judicious with the use of Hypothesis, btw. For example, I don't know how to generate examples well for the biology functions. There, an example-based test may be the best kind of test.
Going forward, it's fine to use an example-based test (our original paradigm) rather than a property-based test (using Hypothesis) if that criterion is fulfilled (i.e. it doesn't make sense to generate lots of examples, or we don't know how to do that in a good way). Or, if it's a beginner contribution, an example-based test is fine as well.
To selectively run a test, use `pytest.mark` to label the test. You'll see a few examples in the test suite:

```python
@pytest.mark.hyp  # marks a function with the label "hyp"
def test_function(...):
    ...
```

Then, at the terminal:

```bash
$ pytest -m hyp  # selectively runs the tests marked with "hyp"
```
Finally, don't forget that the top-level directory's Makefile lets you run the mega-combo command `make check`!
from pyjanitor.
I really like the refactoring of the test suite (especially using the Hypothesis module; it looks awesome).
However, running the tests just took me about 30 seconds, while in the past I remember it being around 7 seconds or so.
I think all the imports at the top might be unnecessarily slowing it down. There are over 30 files, which means `hypothesis` and `pandas` could be loaded that many times.
I think a master file with just one `import pandas as pd` and a `from tests import *` might work.
@zbarry yes, but I think it depends how the files are run. I pass the `tests/` folder to `pytest`, which moves through each file. If `pytest` is running each file separately, then the imports may be an issue.
@zbarry, I don't think by much. It still takes about 30 seconds to run everything.
What usually can lead to improvements is giving the fixtures session scope rather than function scope (i.e. just change the decorator to `@pytest.fixture(scope='session')`).
With function scope, the fixture gets recalculated for each test function. With session scope, it runs once and the object gets passed to all the test functions. However, this doesn't really work for us, since most/all of the tests mutate the DataFrame, which would cause subsequent tests to fail.
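A minimal sketch of both points, using hypothetical fixture/test names (not from the pyjanitor suite): the one-line decorator change, and why one shared object plus in-place mutation breaks later tests.

```python
import pandas as pd
import pytest

# Hypothetical fixture for illustration: scope="session" builds the
# DataFrame once and hands the *same object* to every test function.
@pytest.fixture(scope="session")
def names_df():
    return pd.DataFrame({"names": ["sam", "zach"], "x": [1, 2]})

# The mutation hazard, shown without the pytest machinery: two "tests"
# sharing one object, where the first mutates it in place.
shared = pd.DataFrame({"names": ["sam", "zach"], "x": [1, 2]})

def pseudo_test_one(df):
    df["x"] = 0                        # in-place mutation
    return bool((df["x"] == 0).all())

def pseudo_test_two(df):
    return bool((df["x"] == [1, 2]).all())  # expects fresh data

print(pseudo_test_one(shared))  # True
print(pseudo_test_two(shared))  # False: test one's mutation leaked through
```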
Probably not. The way `lru_cache` works is that it makes a dictionary out of the function arguments and the results (as a basic understanding).
For example, if you had:
```python
@functools.lru_cache(maxsize=128)
def hi(name):
    return 'hi ' + name
```

And if I ran `hi('sam')`, `lru_cache` would make something like a dictionary of `{'sam': 'hi sam'}` to know what to look up when `'sam'` is given as an argument. Every new argument adds another key to the dictionary until you hit 128 (since that's the threshold given), at which point the oldest values start getting deleted.
Some main ideas are that:
- The arguments must be hashable, so they can be used as keys in the dictionary.
- If the arguments are changing all the time, it doesn't help much to cache the results.
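Both points can be checked directly with the standard library (a small sketch reusing the `hi` function from above):

```python
import functools

@functools.lru_cache(maxsize=128)
def hi(name):
    return 'hi ' + name

print(hi('sam'))             # computed on the first call
print(hi('sam'))             # second call is served from the cache
print(hi.cache_info().hits)  # number of cache hits so far

# Unhashable arguments are rejected outright: a list (or a DataFrame)
# can't be used as a dictionary key, so lru_cache raises TypeError.
try:
    hi(['sam'])
except TypeError as exc:
    print('unhashable:', exc)
```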
For example, take the following `pyjanitor` test function:

```python
@pytest.mark.functions
@given(df=categoricaldf_strategy(), iterable=names_strategy())
def test_filter_column_isin(df, iterable):
    assume(len(iterable) >= 1)
    df = df.filter_column_isin("names", iterable)
    assert set(df["names"]).issubset(iterable)
```
Since `df` is getting generated by `categoricaldf_strategy()` every time the test runs, the caching from `lru_cache` probably won't help. Similarly, I'm not sure what's returned from that function, so it might not be hashable anyway, and therefore couldn't be used by `lru_cache`.
Hmm, I thought Python's package imports were smart, in that each module is generally only imported once.
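That's right: Python caches each imported module in `sys.modules`, so a module's top-level code runs only once per interpreter process, no matter how many test files import it. A quick stdlib check:

```python
import sys
import json            # first import: module code executes and is cached
import json as j2      # repeat import: fetched from sys.modules, not re-run

assert json is j2              # both names bind the exact same module object
assert 'json' in sys.modules   # the process-wide module cache
print('cached:', json is j2)
```

(The caveat is per *process*: if tests are split across worker processes, e.g. with pytest-xdist, each worker pays the import cost once.)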
@szuckerman the slowdown in testing is a natural consequence of using Hypothesis. Previously, we were using only one example. Hypothesis is dynamically generating up to (I think) 50-200 examples, looking for a way to falsify each test.
I was able to find some functions for which Hypothesis discovered an example I wasn't originally thinking of, but that could break the function and needed to be checked for inside it (e.g. a zero-length list being passed into the function).
Gotcha. Is there a slick way to get Hypothesis to operate on only a subset of tests during development, for fast iteration, and then run the whole suite at the end before pushing?
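One approach (my suggestion, not something already in the repo) is Hypothesis settings profiles. A sketch of a `conftest.py` fragment, assuming a current Hypothesis release; the profile is then chosen with the pytest plugin's `--hypothesis-profile` flag:

```python
# conftest.py (sketch): register a fast profile for local iteration and a
# thorough one for pre-push runs.
from hypothesis import settings

settings.register_profile("dev", max_examples=10)   # quick local feedback
settings.register_profile("ci", max_examples=200)   # exhaustive pre-push
settings.load_profile("dev")  # default when no flag is passed
```

Then `pytest --hypothesis-profile=ci` before pushing, optionally combined with `pytest -m hyp` to select only the Hypothesis-marked tests.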
So, just a thought on the refactoring:
I just found out that you can group tests in classes using pytest.
I kind of liked the idea of all the tests being in one file (since that makes it easier to search for and copy other tests), and grouping tests in a class will logically keep the tests for a given method together.
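A sketch of that grouping (hypothetical class and assertions, not pyjanitor's actual tests): pytest collects `test_*` methods on `Test*`-named classes with no extra setup.

```python
# pytest discovers these methods automatically; test classes must not
# define __init__.
class TestCleanNames:
    def test_lowercases(self):
        assert "A B".lower().replace(" ", "_") == "a_b"

    def test_strips(self):
        assert "  x  ".strip() == "x"
```

You can then run one group with `pytest path/to/test_file.py::TestCleanNames`, or a single test with `pytest path/to/test_file.py::TestCleanNames::test_lowercases`.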
Thoughts?
Generally not a super big fan of giant, monolithic source files, but don't really have much of an opinion w.r.t. unit testing sources.
from pyjanitor.
@szuckerman thanks for the link to there! I learned something new today 😄.
I guess for me, this is a big, massive ongoing experiment in seeing how much of a hassle it is to maintain a test suite for a growing library of functions. My current hypothesis is twofold:
- Hypothesis 1: it will be easier to maintain.
- Hypothesis 2: it will be easier to onboard newcomers.
Are you ok with continuing this experiment for a while first? Happy to re-evaluate later.
Re: #128, curious: does this speed things up for you when testing everything at once?
What about a `@copyinputs` decorator, which the `@fixture` decorator would itself wrap, that copies the fixture data pytest provides so you don't have to worry about mutation of the original DataFrame? 🤔 🤔
(I've been having a lot of fun with decorators recently)
That might work, but it appears the real bottleneck is all the different Hypothesis strategies. If there were a way to copy/cache those results, I think you would see a speedup, but I'm not sure how easy that is to do.
With basically no knowledge of what you're talking about, would this help? https://docs.python.org/3/library/functools.html#functools.lru_cache
@szuckerman @zbarry I think this issue is irrelevant at this point. Going to close it.