lukashedegaard / datasetops

Fluent dataset operations, compatible with your favorite libraries

Home Page: https://datasetops.readthedocs.io

License: MIT License

Python 96.69% Makefile 3.31%

dataset-combinations data-science multiple-datasets pytorch tensorflow data-munging data-wrangling data-cleaning data-processing deep-learning

datasetops's Introduction

👋 Hi, I'm Lukas Hedegaard

PostDoc at Aarhus University, Denmark researching Deep Learning, network acceleration and Transfer Learning applied to Computer Vision and Natural Language Processing. Recent works include:

Hedegaard et al.: "Continual Spatio-Temporal Graph Convolutional Networks" (Pattern Recognition, 2023)
Hedegaard et al.: "Continual Transformers: Redundancy-Free Attention for Online Inference" (ICLR, 2023)
Hedegaard & Iosifidis: "Continual 3D Convolutional Neural Networks for Real-time Processing of Videos" (ECCV, 2022)
Hedegaard & Iosifidis: "Continual Inference: A Library for Efficient Online Inference with Deep Neural Networks in PyTorch" (ECCV workshop, 2022)
Hedegaard et al.: "Supervised Domain Adaptation: A Graph Embedding Perspective and a Rectified Experimental Protocol" (Transactions on Image Processing, 2021)
Hedegaard et al.: "Supervised Domain Adaptation using Graph Embedding" (ICPR, 2021)
Heidari et al.: "Graph Convolutional Networks" (Deep Learning for Robot Perception and Cognition, 2022)
Hedegaard et al.: "Human Activity Recognition" (Deep Learning for Robot Perception and Cognition, 2022)
Hedegaard et al.: "Structured Pruning Adapters" (preprint, 2023)

Apart from doing ML research projects (like real-time Human Activity Recognition using CoX3D ☝️), I like to package code up nicely and open-source whenever it may be of value. Open-source libraries include:

Continual Inference [🐍, C++] - Building blocks for Continual Inference Networks in PyTorch.
Ride [🐍] - Training wheels, side rails, and helicopter parent for your Deep Learning projects in PyTorch.
OpenDR [🐍, C++] - A modular, open and non-proprietary toolkit for core robotic functionalities by harnessing deep learning.
DatasetOps [🐍] - Fluent dataset operations, compatible with your favorite libraries.
PyTorch Benchmark [🐍] - Easily benchmark PyTorch model FLOPs, latency, throughput, allocated gpu memory and energy consumption.
Co-Rider [🐍] - Tiny configuration library tailored for the Ride ecosystem.
Supers [🐍] - Call a function in all superclasses using supers(self).foo(42).
react-native-svg-pan-zoom [JS] - Pan-zoom via two-finger "Google Maps"-style pinch and drag gestures.
redux-maybe [JS] - Nodejs Package for attaching callback functions to redux messages.
redux-blabber [JS] - Redux store enhancer for synchronizing states and actions across store instances.

datasetops's Issues

Extend GitHub Action to also build on Windows and macOS

Currently, the library is only built and tested on Linux.

It should be rather straightforward to extend the GitHub Action test matrix to accommodate this.

Online dataset downloaders

Once in a while, people supply non-standard datasets online. This could be in the form of a direct download link, or as a file in Google Drive, for instance.
It would be a nice feature if we could make a loader for these situations.

Naming suggestions:
-from_online
-from_url
-from_link
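
A minimal sketch of what such a loader could look like for direct links, using only the standard library. The name from_url follows one of the suggestions above; the target_dir parameter, the caching behaviour and the returned local path are assumptions made for illustration. Google Drive sharing links are not handled here.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def from_url(url: str, target_dir: str = ".") -> Path:
    """Download `url` into `target_dir` unless it is already there; return the local path."""
    # Hypothetical helper: the file name is taken from the last path segment of the URL.
    filename = Path(urlparse(url).path).name
    if not filename:
        raise ValueError(f"cannot derive a file name from {url!r}")
    target = Path(target_dir) / filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))
    return target
```

The returned path could then be handed to whichever existing loader matches the file type.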

Fix readthedocs autoapi issues

Currently, the documentation build fails because the autoapi extension is not available on readthedocs.

It seems it should be possible to make the required packages available by running either of the following:

pip install .
pip install -r my_dev_requirements.txt

Update documentation

Many of the exposed functions have not been documented. If the library is supposed to be truly user-friendly, this should be done.

The README should also be updated with a usage example.

MXNet compatibility

We should consider adding support for MXNet, as this is currently the third most popular framework for machine learning

Standard samplers

We should consider adding a few standard splitters with automatic shuffling:

split_train_test(ratio=[0.8,0.2])
split_train_val_test(ratio=[0.65,0.15,0.2])
split_k_fold(k=5)
split_balanced(key, num_per_class, comparison_fn)

And another sampler
-sample_balanced(key, num_per_class, comparison_fn)
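
As a rough illustration of the proposed splitters, the sketch below shuffles indices with a seeded generator and cuts them by ratio. The function names mirror the suggestions above, but working on indices, the n and seed parameters, and the return types are assumptions, not the library's API.

```python
import random
from typing import List, Sequence


def split_by_ratio(n: int, ratio: Sequence[float], seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and cut them into len(ratio) parts."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratio must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for i, r in enumerate(ratio):
        # The last part takes the remainder so that no index is lost to rounding.
        end = n if i == len(ratio) - 1 else min(n, start + int(round(r * n)))
        parts.append(indices[start:end])
        start = end
    return parts


def split_train_test(n, ratio=(0.8, 0.2), seed=0):
    return split_by_ratio(n, ratio, seed)


def split_k_fold(n, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs."""
    folds = split_by_ratio(n, [1.0 / k] * k, seed)
    return [
        ([i for j, fold in enumerate(folds) if j != f for i in fold], folds[f])
        for f in range(k)
    ]
```

A three-way split follows the same pattern, for example split_by_ratio(n, [0.65, 0.15, 0.2]).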

Make `custom` private

The custom transform was originally intended to wrap a user-defined lambda, to be applied to each item. However, the transform function has now been implemented in such a way that we can pass lambdas directly. custom is thus unnecessary and should be removed.

Difference between Dataset and Loader

What is the difference between a Dataset and a Loader? From a conceptual standpoint, a loader behaves exactly like a dataset. In terms of the implementation, it seems like the Loader simply wraps the dataset?

Naming scheme

We need to come up with a good name for the project.

Criteria

The name should translate well into package names and imports in Python:

Will be installed using pip or conda:

pip install mldatasets
conda install mldatasets

Imported in Python:

from mldatasets.loaders import load_dataset

Dataset.image should be callable by element names

Dataset.image can currently only be called without parameters, which converts all convertible data to images, or with boolean flags such as (True, False, True). We should also be able to call it with element names: ds.image("image_2")

Write examples that demonstrate the desired use of API.

To drive the design of the API it would be useful to exemplify how the API may be used to load, transform and split data.

For example, loading and transforming to grayscale:

import dataset_loader as dl
ds = dl.load_dataset(myPath).to_grayscale()

Add save function

We should add a save function. @iliiliiliili , will you update this description?

.named should return a new dataset object

Currently, it returns self and mutates names
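
The non-mutating behaviour asked for here can be sketched with a toy class. The Dataset below is a stand-in written for this illustration and is not the library's implementation; only the copy-then-rename pattern is the point.

```python
import copy


class Dataset:
    """Toy stand-in for the library's Dataset; only the naming logic is shown."""

    def __init__(self, items, names=()):
        self._items = items
        self._names = tuple(names)

    @property
    def names(self):
        return self._names

    def named(self, *names):
        """Return a new dataset with the given names; self is left untouched."""
        new = copy.copy(self)  # shallow copy: the underlying items are shared
        new._names = tuple(names)
        return new
```

With this pattern, ds.named("image", "label") leaves ds.names unchanged, so an unnamed dataset can be reused under several naming schemes.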

Naming of dataset.py and declaration of transforms

Currently, most of the transforms are implemented in the dataset.py file.
Some are also defined in the compose.py file, related to taking the Cartesian product.

It might be reasonable to agree on where transforms should be defined.
What are your thoughts?

Add dataset.remove method

Currently, removing an element can be achieved using dataset.transform(lambda x: (x[0], x[2])) or dataset.reorder("name0", "name2"). We should have an explicit function for this.

Converting to_tensorflow with variable-shape elements

If the same element has different shapes across the items of a dataset, the dataset produced by to_tensorflow won't be able to return data.

Expose core functions directly from package

Packaging with __init__.py files is largely unused right now. The relevant functions should be exposed to the end user there.

Add split_element to convert one element into a few other elements

Currently, new elements can be created from an existing one using dataset.transform(lambda x: (x[0], make_P(x[1]), make_R(x[1]), x[2])). This approach breaks the dataset names, so .named(...) has to be called afterwards.
A split_element method that takes the name of the element, a function returning a list of the created elements, and a List[str] with the names of the new elements would let us process the data without touching the other elements, leaving fewer chances for user error.
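
A minimal sketch of the proposed behaviour, written as a free function over a list of tuples and a list of names. The signature and the eager evaluation are assumptions for illustration; a method on the dataset would presumably apply the function lazily.

```python
from typing import Callable, List, Sequence, Tuple


def split_element(
    items: Sequence[tuple],
    names: Sequence[str],
    name: str,
    fn: Callable[[object], Sequence[object]],
    new_names: Sequence[str],
) -> Tuple[List[tuple], List[str]]:
    """Replace the element called `name` with the elements returned by `fn`."""
    idx = list(names).index(name)  # raises ValueError for an unknown name
    out_items = []
    for item in items:
        created = tuple(fn(item[idx]))
        if len(created) != len(new_names):
            raise ValueError("fn must return one value per new name")
        # Elements before and after `idx` are passed through untouched.
        out_items.append(tuple(item[:idx]) + created + tuple(item[idx + 1:]))
    out_names = list(names[:idx]) + list(new_names) + list(names[idx + 1:])
    return out_items, out_names
```

Because the names are rewritten alongside the items, no follow-up call to .named(...) is needed.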

Remove `extend` from Loader

Currently, there may be issues if a Loader dataset has no elements. Methods that require the shape (named, for instance) will not work.
There is currently a temptation to create datasets as follows:

ds = Loader(getitems_fn).named("image", "label")
ds.extend(ids)

This fails because named is only valid after extend has been called.

Removing extend from the Loader and instead passing the ids via the constructor would avoid this scenario. It also conforms better to our otherwise functional API:

ds = Loader(getitems_fn, ids).named("image", "label")
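
The constructor-based design can be sketched as follows. This toy Loader is an assumption written for illustration and does not reproduce the library's class; it only shows that once the ids are fixed at construction, named can validate against a real item.

```python
class Loader:
    """Toy loader: ids are fixed at construction, so an item always exists to inspect."""

    def __init__(self, getitem_fn, ids, names=()):
        self._getitem_fn = getitem_fn
        self._ids = list(ids)
        self.names = tuple(names)

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, i):
        return self._getitem_fn(self._ids[i])

    def named(self, *names):
        # Loads one item to check that the number of names matches the elements.
        if len(self) > 0 and len(names) != len(self[0]):
            raise ValueError("expected one name per element")
        return Loader(self._getitem_fn, self._ids, names)
```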

typo in tests/resourses (should be resources)

Docs should reflect the implementation

Currently, the docs do not reflect the state of the implementation.

Dataset shape property and debuggers

Currently, the shape property of a dataset is determined by loading a single sample from the dataset.
This has unintended effects when the dataset is inspected by a debugger such as the one in VS Code: the debugger evaluates the property, which may take several seconds if each sample is large.

datasetops/src/datasetops/dataset.py

Lines 276 to 299 in 45a6893

    @property
    def shape(self) -> Sequence[Shape]:
        """Get the shape of a dataset item.

        Returns:
            Sequence[int] -- Item shapes
        """
        if len(self) == 0:
            return _DEFAULT_SHAPE

        item = self.__getitem__(0)
        if hasattr(item, "__getitem__"):
            item_shape = []
            for i in item:
                if hasattr(i, "shape"):  # numpy arrays
                    item_shape.append(i.shape)
                elif hasattr(i, "size"):  # PIL.Image.Image
                    item_shape.append(np.array(i).shape)
                else:
                    item_shape.append(_DEFAULT_SHAPE)
            return tuple(item_shape)

        return _DEFAULT_SHAPE

This raises the question of whether properties should have side effects. In relation to the subsampling operator, this interferes with the caching mechanism. If logging were enabled, it could also cause unexpected log messages to be printed.

A solution could be caching the inferred shape, e.g. saving it to a private attribute _shape and having the property link to that value instead?
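
The caching idea could look roughly like the sketch below. The class, the reads counter and the empty-tuple value of _DEFAULT_SHAPE are assumptions for illustration, and the PIL branch of the original property is omitted for brevity.

```python
_DEFAULT_SHAPE = tuple()  # assumed value; the library defines its own constant


class CachedShapeDataset:
    def __init__(self, items):
        self._items = items
        self._shape = None  # filled in on first access
        self.reads = 0      # counts item loads, for demonstration only

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        self.reads += 1
        return self._items[i]

    @property
    def shape(self):
        if self._shape is None:
            self._shape = self._infer_shape()
        return self._shape

    def _infer_shape(self):
        if len(self) == 0:
            return _DEFAULT_SHAPE
        item = self[0]
        return tuple(getattr(i, "shape", _DEFAULT_SHAPE) for i in item)


class FakeArray:
    """Stands in for a numpy array in this sketch."""
    shape = (2, 3)


ds = CachedShapeDataset([(FakeArray(), "label")])
first, second = ds.shape, ds.shape
# ds.reads is 1 here: the second access used the cached value.
```

A debugger that evaluates the property repeatedly would then trigger at most one load per dataset object. The first evaluation still reads a sample, so this reduces the side effect rather than removing it.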

Add workflow for doctests

The docs support testing using doctest. We should add a GitHub workflow that automatically checks these.

Rename _downstream_getter -> _parent

There seems to be general agreement that this name is better.

standardize-transform performance

It seems the standardize function adds a significant amount of reads to the underlying dataset.
Calling the function on a dataset containing a single element seemingly causes 7 reads to be carried out.

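import cProfile
import io
import pstats
from pstats import SortKey

import numpy as np

# `ds` is the dataset under test; it is assumed to be defined earlier.
ds_one = ds.take(1)


def foo():
    # Standardize via datasetops, then read the single item.
    ds_center = ds_one.standardize(0, axis=1)
    s = ds_center[0]


def bar():
    # Baseline: read the item once and standardize it with scikit-learn.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    s = ds[0]
    scaler.fit(s.data)
    scaled = scaler.transform(s.data)
    mu = np.mean(scaled)
    std = np.std(scaled)


def do_profile(func):
    print(f"######### PROFILING {func.__name__} #########")
    pr = cProfile.Profile(subcalls=False)
    pr.enable()
    func()
    pr.disable()
    s = io.StringIO()
    # Sort by internal time so the most expensive calls are listed first.
    ps = pstats.Stats(pr, stream=s).sort_stats(SortKey.TIME)
    ps.print_stats()
    print(f"{s.getvalue()}\n")


do_profile(foo)
do_profile(bar)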

######### PROFILING foo #########
         40119 function calls (39755 primitive calls) in 18.561 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7   17.962    2.566   18.073    2.582 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
    94/65    0.102    0.001    0.283    0.004 {built-in method numpy.core._multiarray_umath.implement_array_function}
        7    0.098    0.014    0.098    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:86(_pad_simple)
        7    0.097    0.014    0.098    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1828(_stack_arrays)
      217    0.085    0.000    0.085    0.000 {built-in method numpy.array}
       36    0.036    0.001    0.036    0.001 {method 'reduce' of 'numpy.ufunc' objects}
        7    0.022    0.003    0.023    0.003 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1864(__init__)
        7    0.018    0.003   18.256    2.608 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:416(_read)
        7    0.018    0.003   18.377    2.625 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:491(get_data)
        1    0.009    0.009    0.039    0.039 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:1421(nanvar)
        2    0.007    0.003    0.036    0.018 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:68(_replace_nan)
        1    0.006    0.006    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:780(transform)
        1    0.006    0.006    0.076    0.076 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:710(_incremental_mean_and_var)
     7847    0.005    0.000    0.011    0.000 {built-in method builtins.isinstance}
     4018    0.005    0.000    0.006    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\generic.py:10(_check)
        1    0.005    0.005    5.368    5.368 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:238(item_stats)
        1    0.005    0.005    8.000    8.000 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1385(make_fn)
        7    0.004    0.001    0.004    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
     1148    0.003    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1708(_is_dtype_type)
     15/7    0.003    0.000   18.405    2.629 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:235(__getitem__)
        1    0.003    0.003    2.729    2.729 c:\users\clega\desktop\datasetops\src\datasetops\scaler.py:62(fit)
        1    0.003    0.003   13.269   13.269 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:648(transform)
        1    0.003    0.003   10.639   10.639 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1088(wrapped)
        1    0.003    0.003   15.911   15.911 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:854(standardize)
        1    0.002    0.002    0.017    0.017 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1194(fn)
      812    0.002    0.000    0.004    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1565(is_extension_array_dtype)
      665    0.002    0.000    0.009    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:542(is_categorical_dtype)
     1106    0.002    0.000    0.011    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\base.py:247(is_dtype)
      455    0.002    0.000    0.004    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:775(is_integer_dtype)
    49/21    0.002    0.000    0.012    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:276(__new__)
      819    0.002    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:75(find)
      273    0.001    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1401(is_float_dtype)
       56    0.001    0.000    0.001    0.000 {built-in method numpy.empty}
      539    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:216(<lambda>)
      609    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:208(<lambda>)
      609    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:206(classes)
      539    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:211(classes_and_not_datetimelike)
     6128    0.001    0.000    0.001    0.000 {built-in method builtins.getattr}
       63    0.001    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:2981(get_block_type)
    21/14    0.001    0.000    0.018    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:183(__init__)
       56    0.001    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:388(sanitize_array)
       91    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:321(_name_get)
  817/642    0.000    0.000    0.001    0.000 {built-in method builtins.len}
      234    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
      154    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:1124(is_dtype)
      182    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:506(is_interval_dtype)
      154    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:917(is_dtype)
      182    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:472(is_period_dtype)
        7    0.000    0.000    0.104    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1700(form_blocks)
      876    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      119    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:441(is_timedelta64_dtype)
      217    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1844(pandas_dtype)
      105    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:222(is_object_dtype)
        7    0.000    0.000    0.100    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:532(pad)
       14    0.000    0.000    0.001    0.000 {pandas._libs.lib.clean_index_list}
      161    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:403(is_datetime64tz_dtype)
      105    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:372(is_datetime64_dtype)
        7    0.000    0.000   18.359    2.623 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:448(read_single_csv)
     2827    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        7    0.000    0.000   18.074    2.582 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:2035(read)
    69/14    0.000    0.000    0.000    0.000 {built-in method _abc._abc_subclasscheck}
        7    0.000    0.000    0.102    0.015 c:\Users\clega\Desktop\vibration\sandbox.py:25(func_csv)
       98    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:987(is_datetime64_any_dtype)
       84    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5393(maybe_extract_name)
       49    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1088(maybe_castable)
       56    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1435(is_bool_dtype)
       42    0.000    0.000    0.096    0.002 <__array_function__ internals>:2(concatenate)
      157    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_asarray.py:14(asarray)
       56    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:506(_try_cast)
       63    0.000    0.000    0.013    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5293(ensure_index)
        7    0.000    0.000   18.209    2.601 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1131(read)
       91    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:307(_name_includes_bit_suffix)
       54    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:360(issubdtype)
       63    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:252(is_sparse)
    63/56    0.000    0.000    0.002    0.000 {built-in method builtins.all}
       84    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1672(_get_dtype)
       49    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\common.py:219(asarray_tuplesafe)
        7    0.000    0.000    0.134    0.019 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\construction.py:213(init_dict)
       21    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\urllib\parse.py:361(urlparse)
       35    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\generic.py:5276(__setattr__)
       21    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:3027(make_block)
        7    0.000    0.000    0.100    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1811(_multi_blockify)
        4    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\sre_parse.py:469(_parse)
      108    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:286(issubclass_)
       21    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:118(__init__)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\range.py:83(__new__)
        3    0.000    0.000    0.027    0.009 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:350(check_array)
      131    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\abc.py:137(__instancecheck__)
      119    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\inference.py:358(is_hashable)
        7    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:3857(_reduce)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:212(_rebuild_blknos_and_blklocs)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:457(_as_pairs)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\common.py:144(get_filepath_or_buffer)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:3911(__getitem__)
        7    0.000    0.000    0.023    0.003 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:792(__init__)
       84    0.000    0.000    0.001    0.000 {pandas._libs.lib.is_list_like}
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:939(_clean_options)
       63    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:339(is_categorical)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1831(_asarray_compat)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:251(mgr_locs)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\stride_tricks.py:114(_broadcast_to)
       21    0.000    0.000    0.000    0.000 {pandas._libs.lib.infer_dtype}
        7    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\construction.py:300(_homogenize)
        7    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\missing.py:225(_isna_ndarraylike)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:830(is_signed_integer_dtype)
        7    0.000    0.000   18.256    2.608 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:530(parser_f)
       14    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1209(maybe_cast_to_datetime)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\nanops.py:234(_get_values)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\common.py:40(is_url)
        7    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:122(__init__)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:887(is_unsigned_integer_dtype)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:885(_get_options_with_defaults)
       91    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:24(_kind_name)
       14    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:4046(equals)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\urllib\parse.py:412(urlsplit)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:376(_set_axis)
        7    0.000    0.000    0.000
######### PROFILING bar #########
         5890 function calls (5849 primitive calls) in 2.768 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.570    2.570    2.586    2.586 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
       41    0.042    0.001    0.042    0.001 {built-in method numpy.array}
       16    0.038    0.002    0.038    0.002 {method 'reduce' of 'numpy.ufunc' objects}
    27/16    0.021    0.001    0.138    0.009 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.017    0.017    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:176(_var)
        1    0.014    0.014    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:86(_pad_simple)
        1    0.014    0.014    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1828(_stack_arrays)
        1    0.009    0.009    0.038    0.038 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:1421(nanvar)
        2    0.007    0.003    0.036    0.018 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:68(_replace_nan)
        1    0.007    0.007    0.076    0.076 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:710(_incremental_mean_and_var)
        1    0.006    0.006    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:780(transform)
        1    0.004    0.004    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1864(__init__)
        1    0.003    0.003    2.613    2.613 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:416(_read)
        1    0.003    0.003    2.631    2.631 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:491(get_data)
        1    0.003    0.003    0.028    0.028 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:232(_std)
     1141    0.001    0.000    0.002    0.000 {built-in method builtins.isinstance}
      574    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\generic.py:10(_check)
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
      164    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1708(_is_dtype_type)
      116    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1565(is_extension_array_dtype)
      158    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\base.py:247(is_dtype)
       95    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:542(is_categorical_dtype)
       65    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:775(is_integer_dtype)
      117    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:75(find)
      7/3    0.000    0.000    0.002    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:276(__new__)
        1    0.000    0.000    2.768    2.768 c:\Users\clega\Desktop\vibration\sandbox.py:80(bar)
        8    0.000    0.000    0.000    0.000 {built-in method numpy.empty}
       39    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1401(is_float_dtype)
       87    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:208(<lambda>)
       77    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:216(<lambda>)
       87    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:206(classes)
        2    0.000    0.000    0.023    0.012 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:350(check_array)
       77    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:211(classes_and_not_datetimelike)
      876    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        9    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:2981(get_block_type)
        8    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:388(sanitize_array)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:321(_name_get)
       36    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
      144    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      3/2    0.000    0.000    0.002    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:183(__init__)
        4    0.000    0.000    0.075    0.019 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:681(_safe_accumulator_op)
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:1124(is_dtype)
       15    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:222(is_object_dtype)
   108/84    0.000    0.000    0.000    0.000 {built-in method builtins.len}
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:917(is_dtype)
       26    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:506(is_interval_dtype)
        1    0.000    0.000    2.628    2.628 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:448(read_single_csv)
       17    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:441(is_timedelta64_dtype)
       26    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:472(is_period_dtype)
        2    0.000    0.000    0.009    0.005 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:37(_assert_all_finite)
        1    0.000    0.000    0.015    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1700(form_blocks)
       31    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1844(pandas_dtype)
        1    0.000    0.000    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:532(pad)
       15    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:372(is_datetime64_dtype)
       23    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:403(is_datetime64tz_dtype)
       11    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:360(issubdtype)
        8    0.000    0.000    0.027    0.003 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:70(_wrapreduction)
        6    0.000    0.000    0.014    0.002 <__array_function__ internals>:2(concatenate)
      420    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.015    0.015 c:\Users\clega\Desktop\vibration\sandbox.py:25(func_csv)
        1    0.000    0.000    2.631    2.631 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:235(__getitem__)
        1    0.000    0.000    0.081    0.081 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:671(partial_fit)
        2    0.000    0.000    0.000    0.000 {pandas._libs.lib.clean_index_list}
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_ufunc_config.py:32(seterr)
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:286(issubclass_)
        7    0.000    0.000    0.027    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:2105(sum)
       24    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_asarray.py:14(asarray)
        2    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:183(_divide_by_count)
        1    0.000    0.000    2.586    2.586 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:2035(read)
        7    0.000    0.000    0.027    0.004 <__array_function__ internals>:2(sum)
       14    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:987(is_datetime64_any_dtype)
        6    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1831(_asarray_compat)
       12    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5393(maybe_extract_name)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1088(maybe_castable)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1435(is_bool_dtype)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:307(_name_includes_bit_suffix)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_ufunc_config.py:132(geterr)
        9    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:252(is_sparse)
        9    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5293(ensure_index)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:506(_try_cast)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\abc.py:137(__instancecheck__)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\common.py:219(asarray_tuplesafe)
        1    0.000    0.000    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:143(_mean)
       12    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1672(_get_dtype)
        1    0.000    0.000    0.015    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1811(_multi_blockify)
        1    0.000    0.000    2.613    2.613 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:530(parser_f)
        2    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:199(_is_single_block)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:457(_as_pairs)
        5    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\generic.py:5276(__setattr__)
        1    0.000    0.000    0.004    0.004 <__array_function__ internals>:2(mean)
       17    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\inference.py:358(is_hashable)
        1    0.000    0.000    2.605    2.605 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1131(read)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:212(_rebuild_blknos_and_blklocs)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:798(as_array)
       21    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
        1    0.000    0.000    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:3244(mean)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:615(__len__)
        3    0.000    0.000    0.000    0.000 C:\ProgramData\Min

Rename repeat times to copies and remove default value

Slow tests

Currently, a small number of tests take a disproportionately large share of the execution time.
Running pytest --durations=0 reveals:

33.56s call     tests/datasetops_tests/test_caching.py::test_cache
4.22s call     tests/datasetops_tests/test_loaders.py::test_tfds
3.74s call     tests/datasetops_tests/test_datasets.py::test_to_tensorflow
2.19s call     tests/datasetops_tests/test_examples.py::test_readme_example_2
1.13s call     tests/datasetops_tests/test_stream_dataset.py::test_read_from_file
0.72s call     tests/datasetops_tests/test_datasets.py::test_to_pytorch
0.63s call     tests/datasetops_tests/test_examples.py::test_domain_adaptation
0.53s call     tests/datasetops_tests/test_transformation_graph.py::test_serialization_not_same
0.29s call     tests/datasetops_tests/test_datasets.py::TestSubsample::test_subsample
0.26s call     tests/datasetops_tests/test_transformation_graph.py::test_operation_origins
0.22s call     tests/datasetops_tests/test_transformation_graph.py::test_serialization_same
0.13s call     tests/datasetops_tests/test_datasets.py::test_image_to_tensorflow
0.06s call     tests/datasetops_tests/test_caching.py::test_cacheable
0.05s call     tests/datasetops_tests/test_transformation_graph.py::test_common_nodes_equality
0.04s call     tests/datasetops_tests/test_transformation_graph.py::test_roots_kitti
0.03s call     tests/datasetops_tests/test_loaders.py::test_mat_single_with_multi_data
0.02s call     tests/datasetops_tests/test_loaders.py::test_pytorch
0.02s call     tests/datasetops_tests/test_examples.py::test_readme_example_1
0.02s call     tests/datasetops_tests/test_loaders.py::TestLoadCSV::test_names_missing
0.02s call     tests/datasetops_tests/test_transformation_graph.py::test_roots_tfds
0.01s call     tests/datasetops_tests/test_datasets.py::test_image_resize
0.01s call     tests/datasetops_tests/test_scaler.py::test_center
0.01s call     tests/datasetops_tests/test_datasets.py::TestSubsample::test_caching
0.01s call     tests/datasetops_tests/test_scaler.py::test_item_stats
0.01s call     tests/datasetops_tests/test_loaders.py::test_folder_dataset_class_data
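To decide which tests deserve attention, a report like the one above can be filtered by duration. The following is a minimal sketch: the helper name `slow_tests` and the 1-second threshold are assumptions made for illustration, and the sample report contains only a few lines copied from the listing.

```python
import re

# A few lines copied from the --durations report above.
REPORT = """\
33.56s call     tests/datasetops_tests/test_caching.py::test_cache
4.22s call     tests/datasetops_tests/test_loaders.py::test_tfds
3.74s call     tests/datasetops_tests/test_datasets.py::test_to_tensorflow
0.72s call     tests/datasetops_tests/test_datasets.py::test_to_pytorch
"""

# One report line: "<seconds>s <phase> <test id>".
LINE = re.compile(
    r"^(?P<seconds>\d+(?:\.\d+)?)s\s+(?P<phase>setup|call|teardown)\s+(?P<test>\S+)$"
)


def slow_tests(report, threshold=1.0):
    """Return (test id, seconds) pairs that took longer than `threshold` seconds.

    `threshold` is an arbitrary cut-off chosen for this sketch.
    """
    found = []
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match and float(match["seconds"]) > threshold:
            found.append((match["test"], float(match["seconds"])))
    return found


for test_id, seconds in slow_tests(REPORT):
    print(f"{seconds:6.2f}s  {test_id}")
```

With this cut-off, only the first three of the sample lines are reported; the threshold can be adjusted to whatever the project considers slow.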

I see two options:

1. Make them faster.
2. Make it easy to run only the fast tests.

If we go with option 2, we could do the following:

# content of conftest.py

import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "slow: mark test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        # --runslow given in cli: do not skip slow tests
        return
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
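With a conftest.py like the one above in place, individual tests opt in by carrying the slow marker. The test names and bodies below are hypothetical placeholders, not tests from the repository:

```python
# content of a test module (names are hypothetical)

import time

import pytest


@pytest.mark.slow
def test_expensive_placeholder():
    # Stand-in for a long-running test such as a caching round trip.
    time.sleep(0.01)
    assert True


def test_fast_placeholder():
    # Unmarked tests always run.
    assert 1 + 1 == 2
```

A plain pytest invocation would then skip test_expensive_placeholder with the reason "need --runslow option to run", while pytest --runslow would run both tests.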

sample behaviour

Currently, if .sample is asked for more samples than the dataset contains, some items are sampled more than once. Should we raise an error instead?
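The two behaviours under discussion can be compared side by side in a small standalone sketch. The function `sample_indices` and its `allow_repeats` flag are hypothetical and are not the datasetops implementation; they only illustrate the choice between repeating items and raising an error.

```python
import random


def sample_indices(n_items, num, allow_repeats=False, seed=None):
    """Pick `num` indices out of `n_items` (hypothetical helper).

    If `num` exceeds `n_items`, either raise (the strict behaviour) or
    fill up with repeated indices (the lenient behaviour).
    """
    if num < 0:
        raise ValueError("num must be non-negative")

    rng = random.Random(seed)
    if num <= n_items:
        # Enough items: sample without replacement.
        return rng.sample(range(n_items), num)

    if not allow_repeats or n_items == 0:
        raise ValueError(
            f"Requested {num} samples, but only {n_items} are available"
        )

    # Lenient behaviour: every index once, plus randomly repeated extras.
    extras = rng.choices(range(n_items), k=num - n_items)
    return rng.sample(range(n_items), n_items) + extras
```

Raising by default makes oversampling an explicit decision, while a flag such as `allow_repeats=True` would keep the lenient behaviour available to callers who want it.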

Scikit-learn compatibility

We should consider adding methods for converting datasets to and from scikit-learn.
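As a rough sketch of what such conversions might involve: scikit-learn estimators generally take a feature matrix X of shape (n_samples, n_features) together with a target sequence y. The helper names `to_sklearn_arrays` and `from_sklearn_arrays` are assumptions for illustration only, and the sketch assumes each dataset item is a (features, label) pair.

```python
def to_sklearn_arrays(items):
    """Split (features, label) items into X and y (hypothetical helper)."""
    X = [list(features) for features, _ in items]
    y = [label for _, label in items]
    return X, y


def from_sklearn_arrays(X, y):
    """Zip X and y back into (features, label) items (hypothetical helper)."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of samples")
    return [(tuple(row), label) for row, label in zip(X, y)]


items = [((0.1, 0.2), 0), ((0.3, 0.4), 1)]
X, y = to_sklearn_arrays(items)
print(X, y)
```

Plain nested lists are used here to keep the sketch dependency-free; an actual implementation would more likely return NumPy arrays.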

    @property
    def shape(self) -> Sequence[Shape]:
        """Get the shape of a dataset item.

        Returns:
            Sequence[int] -- Item shapes
        """
        if len(self) == 0:
            return _DEFAULT_SHAPE

        item = self.__getitem__(0)
        if hasattr(item, "__getitem__"):
            item_shape = []
            for i in item:
                if hasattr(i, "shape"):  # numpy arrays
                    item_shape.append(i.shape)
                elif hasattr(i, "size"):  # PIL.Image.Image
                    item_shape.append(np.array(i).shape)
                else:
                    item_shape.append(_DEFAULT_SHAPE)

            return tuple(item_shape)

        return _DEFAULT_SHAPE