lukashedegaard / datasetops

Fluent dataset operations, compatible with your favorite libraries

Home Page: https://datasetops.readthedocs.io

License: MIT License

Python 96.69% Makefile 3.31%
dataset-combinations data-science multiple-datasets pytorch tensorflow data-munging data-wrangling data-cleaning data-processing deep-learning

datasetops's Introduction

👋 Hi, I'm Lukas Hedegaard

PostDoc at Aarhus University, Denmark researching Deep Learning, network acceleration and Transfer Learning applied to Computer Vision and Natural Language Processing. Recent works include:

Apart from doing ML research projects (like real-time Human Activity Recognition using CoX3D ☝️), I like to package code up nicely and open-source it whenever it may be of value. Open-source libraries include:

  • Continual Inference [🐍, C++] - Building blocks for Continual Inference Networks in PyTorch.
  • Ride [🐍] - Training wheels, side rails, and helicopter parent for your Deep Learning projects in PyTorch.
  • OpenDR [🐍, C++] - A modular, open and non-proprietary toolkit for core robotic functionalities by harnessing deep learning.
  • DatasetOps [🐍] - Fluent dataset operations, compatible with your favorite libraries.
  • PyTorch Benchmark [🐍] - Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption.
  • Co-Rider [🐍] - Tiny configuration library tailored for the Ride ecosystem.
  • Supers [🐍] - Call a function in all superclasses using supers(self).foo(42).
  • react-native-svg-pan-zoom [JS] - Pan-zoom via two-finger "Google Maps"-style pinch and drag gestures.
  • redux-maybe [JS] - Node.js package for attaching callback functions to redux messages.
  • redux-blabber [JS] - Redux store enhancer for synchronizing states and actions across store instances.

Find me 🌐

datasetops's People

Contributors

clegaard, iliiliiliili, lukashedegaard

Forkers

dsp6414

datasetops's Issues

Online dataset downloaders

Once in a while, people publish non-standard datasets online. This could be in the form of a direct download link, or as a file in Google Drive, for instance.
It would be a nice feature if we could make a loader for these situations.

Naming suggestions:
-from_online
-from_url
-from_link
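
As a rough illustration of the download step such a loader might perform, here is a minimal sketch using only the Python standard library. The name from_url is taken from the suggestions above; the cache_dir parameter and the hash-based file naming are assumptions made for this example and are not part of datasetops. It only handles direct links; services such as Google Drive would likely need extra handling.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def from_url(url: str, cache_dir: str = "~/.cache/datasets") -> Path:
    """Download `url` once and return the path of the cached local copy.

    Hypothetical helper: `cache_dir` and the file-naming scheme are assumptions.
    """
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)

    # Name the cached file after a hash of the URL, keeping the extension.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    target = cache / (digest + Path(urlparse(url).path).suffix)

    if not target.exists():
        # Download to a temporary name first, so an interrupted transfer
        # never leaves a truncated file under the final name.
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)
    return target
```

The returned path could then be handed to whichever existing loader matches the file type.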

Fix readthedocs autoapi issues

Currently, the documentation build fails because the autoapi extension is not available on readthedocs.

It seems it should be possible to make the required packages available by doing either of the following:

  1. pip install .
  2. pip install -r my_dev_requirements.txt

Update documentation

Many of the exposed functions have not been documented. If the library is supposed to be truly user-friendly, this should be done.

The README should also be updated with a usage example.

MXNet compatibility

We should consider adding support for MXNet, as it is currently the third most popular framework for machine learning.

Standard samplers

We should consider adding a few standard splitters with automatic shuffling:

  • split_train_test(ratio=[0.8,0.2])
  • split_train_val_test(ratio=[0.65,0.15,0.2])
  • split_k_fold(k=5)
  • split_balanced(key, num_per_class, comparison_fn)

And another sampler:
-sample_balanced(key, num_per_class, comparison_fn)
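
To make the proposal more concrete, the following is a minimal sketch of ratio-based splitting with automatic shuffling. It operates on plain index lists rather than on a datasetops dataset; the function name split_indices and the seed parameter are assumptions for illustration only.

```python
import random
from typing import List, Sequence


def split_indices(n: int, ratio: Sequence[float], seed: int = 0) -> List[List[int]]:
    """Shuffle range(n) and cut it into len(ratio) consecutive parts."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratio must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so splits are reproducible

    splits, start, cumulative = [], 0, 0.0
    for r in ratio[:-1]:
        cumulative += r
        stop = round(cumulative * n)
        splits.append(indices[start:stop])
        start = stop
    splits.append(indices[start:])  # the last split takes the remainder
    return splits


def split_k_fold(n: int, k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split range(n) into k shuffled folds of (nearly) equal size."""
    return split_indices(n, [1 / k] * k, seed)
```

With such a helper, split_train_test(ratio=[0.8,0.2]) and split_train_val_test(ratio=[0.65,0.15,0.2]) would reduce to thin wrappers that index into the dataset with the returned lists.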

Make `custom` private

The custom transform was originally intended to wrap a user-defined lambda to be applied to each item. However, the transform function has since been implemented in such a way that we can pass lambdas directly. custom is thus unnecessary and should be removed.

Difference between Dataset and Loader

What is the difference between a Dataset and a Loader? From a conceptual standpoint, a loader behaves exactly like a dataset. In terms of the implementation, it seems like the Loader simply wraps the dataset?

Naming scheme

We need to come up with a good name for the project.

Criteria

The name should translate well into package names and imports in Python:

Will be installed using pip or conda:

pip install mldatasets
conda install mldatasets

Imported in Python:

from mldatasets.loaders import load_dataset

Dataset.image should be callable by element names

Dataset.image can currently only be called without parameters, which converts all convertible data to images, or with flags such as (True, False, True). We should also be able to call it with names: ds.image("image_2")
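
One way to support names is to translate them into element positions before applying the conversion. The sketch below shows only that translation step; resolve_keys is a hypothetical helper, not an existing datasetops function, and it deliberately leaves the existing flag-based form untouched.

```python
from typing import List, Sequence, Union

Key = Union[int, str]


def resolve_keys(names: Sequence[str], keys: Sequence[Key]) -> List[int]:
    """Translate a mix of element names and positions into positions."""
    indices = []
    for key in keys:
        if isinstance(key, str):
            if key not in names:
                raise KeyError(f"Unknown element name: {key!r}")
            indices.append(list(names).index(key))
        else:
            if not 0 <= key < len(names):
                raise IndexError(f"Element index out of range: {key}")
            indices.append(key)
    return indices
```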

Write examples that demonstrate the desired use of API.

To drive the design of the API it would be useful to exemplify how the API may be used to load, transform and split data.

For example loading and transforming to grayscale:

import dataset_loader as dl
# Use a separate alias for the module so that `ds` can name the dataset.
ds = dl.load_dataset(myPath).to_grayscale()

Naming of dataset.py and declaration of transforms

Currently, most of the transforms are implemented in the dataset.py file.
Some are also defined in the compose.py file, related to taking the Cartesian product.

It might be reasonable to agree on where transforms should be defined.
What are your thoughts?

Add dataset.remove method

Currently, removing an element can be achieved using dataset.transform(lambda x: (x[0], x[2])) or dataset.reorder("name0", "name2"). We should have an explicit function for this.
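
The per-item logic of such a function could look roughly like the sketch below. remove_elements is a hypothetical name, and the function works on a single item tuple plus a list of element names; how it would be wired into the Dataset class (including updating the stored names) is left open.

```python
from typing import Sequence, Tuple, Union


def remove_elements(item: Tuple, names: Sequence[str], *keys: Union[int, str]) -> Tuple:
    """Return `item` without the elements selected by name or position."""
    # Resolve every key to a position; names are looked up in `names`.
    drop = {list(names).index(k) if isinstance(k, str) else k for k in keys}
    return tuple(x for i, x in enumerate(item) if i not in drop)
```

For example, remove_elements(("a", "b", "c"), ["name0", "name1", "name2"], "name1") keeps only the first and last element, matching the transform and reorder calls above.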

Add split_element to convert one element into a few other elements

Currently, to create new elements from an existing one, we can use dataset.transform(lambda x: (x[0], make_P(x[1]), make_R(x[1]), x[2])). This approach breaks the dataset names, so we have to call .named(...) afterwards.
Implementing a split_element method that takes (the name of the element, a function that returns a List of the created elements, a List[str] of names for the new elements) would let us process the data without touching the other elements, leaving fewer chances for the user to make an error.
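
A minimal sketch of the proposed behaviour on a single item follows. It is written as a free function over an item tuple and its names, which is an assumption made to keep the example self-contained; the actual method would live on the dataset and apply this lazily per item.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_element(
    item: Tuple,
    names: Sequence[str],
    key: str,
    fn: Callable[[Any], Sequence[Any]],
    new_names: Sequence[str],
) -> Tuple[Tuple, List[str]]:
    """Replace element `key` by the elements that `fn` creates from it."""
    idx = list(names).index(key)
    created = tuple(fn(item[idx]))
    if len(created) != len(new_names):
        raise ValueError("fn must return one value per new name")
    # Untouched elements keep their position and their name.
    new_item = tuple(item[:idx]) + created + tuple(item[idx + 1:])
    new_item_names = list(names[:idx]) + list(new_names) + list(names[idx + 1:])
    return new_item, new_item_names
```

Because the names are rebuilt alongside the item, no follow-up call to .named(...) would be needed.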

Remove `extend` from Loader

Currently, there may be issues if a Loader dataset has no elements. Methods that require the shape (named, for instance) will not work.
There is currently a temptation to create datasets as follows:

ds = Loader(getitems_fn).named("image", "label")
ds.extend(ids)

This fails because the named function is only valid after extend was called.

Removing extend from the Loader and instead passing ids via the constructor would avoid this scenario. It also conforms better to our otherwise functional API:

ds = Loader(getitems_fn, ids).named("image", "label")
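
To illustrate why the constructor-based form is safer, here is a toy stand-in class. MiniLoader is not the datasetops Loader; it only mimics the parts discussed here, under the assumption that named returns a new object rather than mutating the existing one.

```python
from typing import Any, Callable, Sequence, Tuple


class MiniLoader:
    """Toy stand-in for Loader; not the datasetops implementation."""

    def __init__(
        self,
        getitem_fn: Callable[[Any], Tuple],
        ids: Sequence[Any],
        names: Sequence[str] = (),
    ):
        self._getitem_fn = getitem_fn
        self._ids = list(ids)
        self.names = tuple(names)

    def __len__(self) -> int:
        return len(self._ids)

    def __getitem__(self, i: int) -> Tuple:
        return self._getitem_fn(self._ids[i])

    def named(self, *names: str) -> "MiniLoader":
        # The ids are known at construction time, so the number of
        # elements per item can be validated right here.
        if len(self) > 0 and len(names) != len(self[0]):
            raise ValueError("expected one name per element")
        return MiniLoader(self._getitem_fn, self._ids, names)
```

Since the ids are always present when named runs, the ordering problem described above cannot occur.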

Dataset shape property and debuggers

Currently, the shape property of a dataset is determined by loading a single sample from the dataset.
This has an unintended effect when the dataset is inspected by a debugger such as the one in vscode: the debugger evaluates the property, which may take several seconds if each sample is large.

@property
def shape(self) -> Sequence[Shape]:
    """Get the shape of a dataset item.

    Returns:
        Sequence[int] -- Item shapes
    """
    if len(self) == 0:
        return _DEFAULT_SHAPE
    item = self.__getitem__(0)
    if hasattr(item, "__getitem__"):
        item_shape = []
        for i in item:
            if hasattr(i, "shape"):  # numpy arrays
                item_shape.append(i.shape)
            elif hasattr(i, "size"):  # PIL.Image.Image
                item_shape.append(np.array(i).shape)
            else:
                item_shape.append(_DEFAULT_SHAPE)
        return tuple(item_shape)
    return _DEFAULT_SHAPE

This raises the question of whether or not properties should have side effects. In relation to the subsampling operator, this interferes with a caching mechanism. If logging were enabled, this could also cause unexpected log messages to be printed.

A solution could be caching the inferred shape, e.g. saving it to a private attribute _shape and having the property link to that value instead?
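
The caching idea could be sketched as follows. CachedShapeDataset is a toy class written for this example, not the datasetops Dataset; the empty-tuple value of _DEFAULT_SHAPE and the reads counter are assumptions, and the PIL-specific branch of the original property is omitted for brevity.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

_DEFAULT_SHAPE: Tuple[int, ...] = tuple()  # assumed placeholder value


class CachedShapeDataset:
    """Toy dataset that infers its item shape at most once (illustrative only)."""

    def __init__(self, getitem_fn: Callable[[Any], Sequence[Any]], ids: Sequence[Any]):
        self._getitem_fn = getitem_fn
        self._ids = list(ids)
        self._shape: Optional[Tuple] = None  # filled in lazily by `shape`
        self.reads = 0  # counts how often the underlying data is read

    def __len__(self) -> int:
        return len(self._ids)

    def __getitem__(self, i: int) -> Sequence[Any]:
        self.reads += 1
        return self._getitem_fn(self._ids[i])

    @property
    def shape(self) -> Tuple:
        if self._shape is None:
            self._shape = self._infer_shape()
        return self._shape

    def _infer_shape(self) -> Tuple:
        if len(self) == 0:
            return _DEFAULT_SHAPE
        # Elements without a `shape` attribute fall back to the default.
        return tuple(getattr(x, "shape", _DEFAULT_SHAPE) for x in self[0])
```

Note that the first access still reads one sample, so a debugger would trigger that single read; repeated inspections would then be free. Avoiding the read entirely would require filling _shape at a point where an item is loaded anyway.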

Add workflow for doctests

The docs support testing using doctest. We should add a GitHub workflow that automatically checks these.

standardize-transform performance

It seems the standardize function adds a significant amount of reads to the underlying dataset.
Calling the function on a dataset containing a single element seemingly causes 7 reads to be carried out.

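import cProfile
import io
import pstats
from pstats import SortKey

import numpy as np

# `ds` is the dataset under investigation; it is created elsewhere.
ds_one = ds.take(1)


def foo():
    # Standardize element 0 through datasetops, then read the single item.
    ds_center = ds_one.standardize(0, axis=1)
    s = ds_center[0]


def bar():
    # Baseline: read the item once and standardize it with scikit-learn.
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    s = ds[0]
    scaler.fit(s.data)
    scaled = scaler.transform(s.data)
    mu = np.mean(scaled)
    std = np.std(scaled)


def do_profile(func):
    # Profile a single call of `func` and print the stats sorted by internal time.
    print(f"######### PROFILING {func.__name__} #########")
    pr = cProfile.Profile(subcalls=False)
    pr.enable()
    func()
    pr.disable()
    s = io.StringIO()
    sortby = SortKey.TIME
    ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
    ps.print_stats()
    print(f"{s.getvalue()}\n")


do_profile(foo)
do_profile(bar)

######### PROFILING foo #########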
         40119 function calls (39755 primitive calls) in 18.561 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7   17.962    2.566   18.073    2.582 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
    94/65    0.102    0.001    0.283    0.004 {built-in method numpy.core._multiarray_umath.implement_array_function}
        7    0.098    0.014    0.098    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:86(_pad_simple)
        7    0.097    0.014    0.098    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1828(_stack_arrays)
      217    0.085    0.000    0.085    0.000 {built-in method numpy.array}
       36    0.036    0.001    0.036    0.001 {method 'reduce' of 'numpy.ufunc' objects}
        7    0.022    0.003    0.023    0.003 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1864(__init__)
        7    0.018    0.003   18.256    2.608 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:416(_read)
        7    0.018    0.003   18.377    2.625 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:491(get_data)
        1    0.009    0.009    0.039    0.039 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:1421(nanvar)
        2    0.007    0.003    0.036    0.018 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:68(_replace_nan)
        1    0.006    0.006    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:780(transform)
        1    0.006    0.006    0.076    0.076 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:710(_incremental_mean_and_var)
     7847    0.005    0.000    0.011    0.000 {built-in method builtins.isinstance}
     4018    0.005    0.000    0.006    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\generic.py:10(_check)
        1    0.005    0.005    5.368    5.368 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:238(item_stats)
        1    0.005    0.005    8.000    8.000 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1385(make_fn)
        7    0.004    0.001    0.004    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
     1148    0.003    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1708(_is_dtype_type)
     15/7    0.003    0.000   18.405    2.629 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:235(__getitem__)
        1    0.003    0.003    2.729    2.729 c:\users\clega\desktop\datasetops\src\datasetops\scaler.py:62(fit)
        1    0.003    0.003   13.269   13.269 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:648(transform)
        1    0.003    0.003   10.639   10.639 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1088(wrapped)
        1    0.003    0.003   15.911   15.911 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:854(standardize)
        1    0.002    0.002    0.017    0.017 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1194(fn)
      812    0.002    0.000    0.004    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1565(is_extension_array_dtype)
      665    0.002    0.000    0.009    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:542(is_categorical_dtype)
     1106    0.002    0.000    0.011    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\base.py:247(is_dtype)
      455    0.002    0.000    0.004    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:775(is_integer_dtype)
    49/21    0.002    0.000    0.012    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:276(__new__)
      819    0.002    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:75(find)
      273    0.001    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1401(is_float_dtype)
       56    0.001    0.000    0.001    0.000 {built-in method numpy.empty}
      539    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:216(<lambda>)
      609    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:208(<lambda>)
      609    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:206(classes)
      539    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:211(classes_and_not_datetimelike)
     6128    0.001    0.000    0.001    0.000 {built-in method builtins.getattr}
       63    0.001    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:2981(get_block_type)
    21/14    0.001    0.000    0.018    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:183(__init__)
       56    0.001    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:388(sanitize_array)
       91    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:321(_name_get)
  817/642    0.000    0.000    0.001    0.000 {built-in method builtins.len}
      234    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
      154    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:1124(is_dtype)
      182    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:506(is_interval_dtype)
      154    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:917(is_dtype)
      182    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:472(is_period_dtype)
        7    0.000    0.000    0.104    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1700(form_blocks)
      876    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      119    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:441(is_timedelta64_dtype)
      217    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1844(pandas_dtype)
      105    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:222(is_object_dtype)
        7    0.000    0.000    0.100    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:532(pad)
       14    0.000    0.000    0.001    0.000 {pandas._libs.lib.clean_index_list}
      161    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:403(is_datetime64tz_dtype)
      105    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:372(is_datetime64_dtype)
        7    0.000    0.000   18.359    2.623 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:448(read_single_csv)
     2827    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        7    0.000    0.000   18.074    2.582 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:2035(read)
    69/14    0.000    0.000    0.000    0.000 {built-in method _abc._abc_subclasscheck}
        7    0.000    0.000    0.102    0.015 c:\Users\clega\Desktop\vibration\sandbox.py:25(func_csv)
       98    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:987(is_datetime64_any_dtype)
       84    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5393(maybe_extract_name)
       49    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1088(maybe_castable)
       56    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1435(is_bool_dtype)
       42    0.000    0.000    0.096    0.002 <__array_function__ internals>:2(concatenate)
      157    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_asarray.py:14(asarray)
       56    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:506(_try_cast)
       63    0.000    0.000    0.013    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5293(ensure_index)
        7    0.000    0.000   18.209    2.601 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1131(read)
       91    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:307(_name_includes_bit_suffix)
       54    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:360(issubdtype)
       63    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:252(is_sparse)
    63/56    0.000    0.000    0.002    0.000 {built-in method builtins.all}
       84    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1672(_get_dtype)
       49    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\common.py:219(asarray_tuplesafe)
        7    0.000    0.000    0.134    0.019 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\construction.py:213(init_dict)
       21    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\urllib\parse.py:361(urlparse)
       35    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\generic.py:5276(__setattr__)
       21    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:3027(make_block)
        7    0.000    0.000    0.100    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1811(_multi_blockify)
        4    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\sre_parse.py:469(_parse)
      108    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:286(issubclass_)
       21    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:118(__init__)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\range.py:83(__new__)
        3    0.000    0.000    0.027    0.009 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:350(check_array)
      131    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\abc.py:137(__instancecheck__)
      119    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\inference.py:358(is_hashable)
        7    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:3857(_reduce)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:212(_rebuild_blknos_and_blklocs)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:457(_as_pairs)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\common.py:144(get_filepath_or_buffer)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:3911(__getitem__)
        7    0.000    0.000    0.023    0.003 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:792(__init__)
       84    0.000    0.000    0.001    0.000 {pandas._libs.lib.is_list_like}
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:939(_clean_options)
       63    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:339(is_categorical)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1831(_asarray_compat)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:251(mgr_locs)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\stride_tricks.py:114(_broadcast_to)
       21    0.000    0.000    0.000    0.000 {pandas._libs.lib.infer_dtype}
        7    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\construction.py:300(_homogenize)
        7    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\missing.py:225(_isna_ndarraylike)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:830(is_signed_integer_dtype)
        7    0.000    0.000   18.256    2.608 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:530(parser_f)
       14    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1209(maybe_cast_to_datetime)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\nanops.py:234(_get_values)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\common.py:40(is_url)
        7    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:122(__init__)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:887(is_unsigned_integer_dtype)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:885(_get_options_with_defaults)
       91    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:24(_kind_name)
       14    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:4046(equals)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\urllib\parse.py:412(urlsplit)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:376(_set_axis)
        7    0.000    0.000    0.000
######### PROFILING bar #########
         5890 function calls (5849 primitive calls) in 2.768 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.570    2.570    2.586    2.586 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
       41    0.042    0.001    0.042    0.001 {built-in method numpy.array}
       16    0.038    0.002    0.038    0.002 {method 'reduce' of 'numpy.ufunc' objects}
    27/16    0.021    0.001    0.138    0.009 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.017    0.017    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:176(_var)
        1    0.014    0.014    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:86(_pad_simple)
        1    0.014    0.014    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1828(_stack_arrays)
        1    0.009    0.009    0.038    0.038 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:1421(nanvar)
        2    0.007    0.003    0.036    0.018 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:68(_replace_nan)
        1    0.007    0.007    0.076    0.076 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:710(_incremental_mean_and_var)
        1    0.006    0.006    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:780(transform)
        1    0.004    0.004    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1864(__init__)
        1    0.003    0.003    2.613    2.613 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:416(_read)
        1    0.003    0.003    2.631    2.631 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:491(get_data)
        1    0.003    0.003    0.028    0.028 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:232(_std)
     1141    0.001    0.000    0.002    0.000 {built-in method builtins.isinstance}
      574    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\generic.py:10(_check)
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
      164    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1708(_is_dtype_type)
      116    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1565(is_extension_array_dtype)
      158    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\base.py:247(is_dtype)
       95    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:542(is_categorical_dtype)
       65    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:775(is_integer_dtype)
      117    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:75(find)
      7/3    0.000    0.000    0.002    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:276(__new__)
        1    0.000    0.000    2.768    2.768 c:\Users\clega\Desktop\vibration\sandbox.py:80(bar)
        8    0.000    0.000    0.000    0.000 {built-in method numpy.empty}
       39    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1401(is_float_dtype)
       87    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:208(<lambda>)
       77    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:216(<lambda>)
       87    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:206(classes)
        2    0.000    0.000    0.023    0.012 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:350(check_array)
       77    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:211(classes_and_not_datetimelike)
      876    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        9    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:2981(get_block_type)
        8    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:388(sanitize_array)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:321(_name_get)
       36    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
      144    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      3/2    0.000    0.000    0.002    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:183(__init__)
        4    0.000    0.000    0.075    0.019 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:681(_safe_accumulator_op)
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:1124(is_dtype)
       15    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:222(is_object_dtype)
   108/84    0.000    0.000    0.000    0.000 {built-in method builtins.len}
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:917(is_dtype)
       26    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:506(is_interval_dtype)
        1    0.000    0.000    2.628    2.628 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:448(read_single_csv)
       17    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:441(is_timedelta64_dtype)
       26    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:472(is_period_dtype)
        2    0.000    0.000    0.009    0.005 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:37(_assert_all_finite)
        1    0.000    0.000    0.015    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1700(form_blocks)
       31    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1844(pandas_dtype)
        1    0.000    0.000    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:532(pad)
       15    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:372(is_datetime64_dtype)
       23    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:403(is_datetime64tz_dtype)
       11    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:360(issubdtype)
        8    0.000    0.000    0.027    0.003 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:70(_wrapreduction)
        6    0.000    0.000    0.014    0.002 <__array_function__ internals>:2(concatenate)
      420    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.015    0.015 c:\Users\clega\Desktop\vibration\sandbox.py:25(func_csv)
        1    0.000    0.000    2.631    2.631 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:235(__getitem__)
        1    0.000    0.000    0.081    0.081 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:671(partial_fit)
        2    0.000    0.000    0.000    0.000 {pandas._libs.lib.clean_index_list}
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_ufunc_config.py:32(seterr)
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:286(issubclass_)
        7    0.000    0.000    0.027    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:2105(sum)
       24    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_asarray.py:14(asarray)
        2    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:183(_divide_by_count)
        1    0.000    0.000    2.586    2.586 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:2035(read)
        7    0.000    0.000    0.027    0.004 <__array_function__ internals>:2(sum)
       14    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:987(is_datetime64_any_dtype)
        6    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1831(_asarray_compat)
       12    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5393(maybe_extract_name)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1088(maybe_castable)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1435(is_bool_dtype)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:307(_name_includes_bit_suffix)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_ufunc_config.py:132(geterr)
        9    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:252(is_sparse)
        9    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5293(ensure_index)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:506(_try_cast)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\abc.py:137(__instancecheck__)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\common.py:219(asarray_tuplesafe)
        1    0.000    0.000    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:143(_mean)
       12    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1672(_get_dtype)
        1    0.000    0.000    0.015    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1811(_multi_blockify)
        1    0.000    0.000    2.613    2.613 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:530(parser_f)
        2    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:199(_is_single_block)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:457(_as_pairs)
        5    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\generic.py:5276(__setattr__)
        1    0.000    0.000    0.004    0.004 <__array_function__ internals>:2(mean)
       17    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\inference.py:358(is_hashable)
        1    0.000    0.000    2.605    2.605 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1131(read)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:212(_rebuild_blknos_and_blklocs)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:798(as_array)
       21    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
        1    0.000    0.000    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:3244(mean)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:615(__len__)
        3    0.000    0.000    0.000    0.000 C:\ProgramData\Min

Slow tests

Currently, a small number of tests take up a disproportionately large share of the execution time.
Running pytest --durations=0 reveals:

33.56s call     tests/datasetops_tests/test_caching.py::test_cache
4.22s call     tests/datasetops_tests/test_loaders.py::test_tfds
3.74s call     tests/datasetops_tests/test_datasets.py::test_to_tensorflow
2.19s call     tests/datasetops_tests/test_examples.py::test_readme_example_2
1.13s call     tests/datasetops_tests/test_stream_dataset.py::test_read_from_file
0.72s call     tests/datasetops_tests/test_datasets.py::test_to_pytorch
0.63s call     tests/datasetops_tests/test_examples.py::test_domain_adaptation
0.53s call     tests/datasetops_tests/test_transformation_graph.py::test_serialization_not_same
0.29s call     tests/datasetops_tests/test_datasets.py::TestSubsample::test_subsample
0.26s call     tests/datasetops_tests/test_transformation_graph.py::test_operation_origins
0.22s call     tests/datasetops_tests/test_transformation_graph.py::test_serialization_same
0.13s call     tests/datasetops_tests/test_datasets.py::test_image_to_tensorflow
0.06s call     tests/datasetops_tests/test_caching.py::test_cacheable
0.05s call     tests/datasetops_tests/test_transformation_graph.py::test_common_nodes_equality
0.04s call     tests/datasetops_tests/test_transformation_graph.py::test_roots_kitti
0.03s call     tests/datasetops_tests/test_loaders.py::test_mat_single_with_multi_data
0.02s call     tests/datasetops_tests/test_loaders.py::test_pytorch
0.02s call     tests/datasetops_tests/test_examples.py::test_readme_example_1
0.02s call     tests/datasetops_tests/test_loaders.py::TestLoadCSV::test_names_missing
0.02s call     tests/datasetops_tests/test_transformation_graph.py::test_roots_tfds
0.01s call     tests/datasetops_tests/test_datasets.py::test_image_resize
0.01s call     tests/datasetops_tests/test_scaler.py::test_center
0.01s call     tests/datasetops_tests/test_datasets.py::TestSubsample::test_caching
0.01s call     tests/datasetops_tests/test_scaler.py::test_item_stats
0.01s call     tests/datasetops_tests/test_loaders.py::test_folder_dataset_class_data
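
To decide which tests count as "slow", the durations report above could be filtered against a threshold. The following is only a sketch: the helper names `parse_durations` and `slow_tests` and the 1-second cutoff are assumptions made for illustration, not part of datasetops or pytest.

```python
import re
from typing import Dict, List, Tuple

# Matches report lines such as "33.56s call     tests/x.py::test_name".
DURATION_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)s\s+(call|setup|teardown)\s+(\S+)")


def parse_durations(report: str) -> Dict[str, float]:
    """Sum the reported seconds per test id, across all phases."""
    totals: Dict[str, float] = {}
    for line in report.splitlines():
        match = DURATION_LINE.match(line)
        if match:
            seconds, _phase, test_id = match.groups()
            totals[test_id] = totals.get(test_id, 0.0) + float(seconds)
    return totals


def slow_tests(report: str, threshold: float = 1.0) -> List[Tuple[str, float]]:
    """Return (test id, seconds) pairs at or above the threshold, slowest first."""
    totals = parse_durations(report)
    hits = [(test_id, secs) for test_id, secs in totals.items() if secs >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Feeding it the saved output of pytest --durations=0 gives a candidate list of tests to speed up or to mark as slow.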

I see two options:

  1. Make them faster
  2. Make it easy to run only the fast tests

If we go with option 2, we could do the following:

# content of conftest.py

import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "slow: mark test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        # --runslow given in cli: do not skip slow tests
        return
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
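
With the conftest.py above in place, individual tests would opt in through the `slow` marker. The sketch below shows roughly what that could look like; the test names and bodies are hypothetical placeholders, not existing tests in the repository.

```python
import time

import pytest


@pytest.mark.slow
def test_cache_roundtrip():
    # Hypothetical expensive test: skipped unless --runslow is given.
    time.sleep(0.01)  # stand-in for the real, long-running work
    assert True


def test_fast_path():
    # Unmarked tests always run.
    assert 1 + 1 == 2
```

Running plain pytest would then report the marked test as skipped with the reason "need --runslow option to run", while pytest --runslow would run everything.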

sample behaviour

Currently, if .sample is asked for more samples than the dataset contains, some items are returned multiple times. Should we raise an error instead?
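
One possible middle ground is to raise by default and make repetition an explicit opt-in. The sketch below illustrates that idea on a plain sequence; the function `sample_items` and its `replace` and `seed` parameters are assumptions for illustration and do not reflect the actual signature of .sample in datasetops.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_items(
    items: Sequence[T],
    num: int,
    replace: bool = False,
    seed: Optional[int] = None,
) -> List[T]:
    """Draw `num` items; repeats are only allowed when `replace` is True."""
    if num < 0:
        raise ValueError("num must be non-negative")
    rng = random.Random(seed)
    if replace:
        if num > 0 and len(items) == 0:
            raise ValueError("cannot sample from an empty sequence")
        # Sampling with replacement: the same item may appear several times.
        return [rng.choice(items) for _ in range(num)]
    if num > len(items):
        raise ValueError(
            f"requested {num} samples but only {len(items)} are available; "
            "pass replace=True to allow repeats"
        )
    # Sampling without replacement: every position is drawn at most once.
    return rng.sample(items, num)
```

Under this design, oversampling fails loudly unless the caller asks for it, which keeps the current behaviour available without making it the silent default.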
