
capitalone / dataprofiler

1.4K 21.0 154.0 33.46 MB

What's in your data? Extract schema, statistics and entities from datasets

Home Page: https://capitalone.github.io/DataProfiler

License: Apache License 2.0

Python 97.19% HTML 0.03% PureBasic 2.75% Makefile 0.02% Shell 0.01%
python privacy pii npi nlp data-science gdpr data-analysis data-labels avro

dataprofiler's People

Contributors

andrew-yin, anhtruong, az85252, chriswallace2020, dependabot[bot], gautomdas, gliptak, grant-eden, granteden, jakleh, jgsweets, joshuart, junholee6a, kshitijavis, ksneab7, lettergram, mend-for-github-com[bot], micdavis, misterpnp, neilkg, sagars729, sanketh7, scottiegarcia, stefanycoimbra, stevensecreti, ta7ar, taylorfturner, tmbjmu, tonywu315, vindhyanairlj

dataprofiler's Issues

duplicate_row_count -> duplicate_row_ratio

Is your feature request related to a problem? Please describe.

I think it makes more sense to use ratios or percent of overlap. I think count is good too, but since we're sampling, a raw count seems a bit strange.
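
A minimal sketch of what a ratio could look like, assuming rows are available as tuples from the sampled data. The function name `duplicate_row_ratio` is hypothetical and not part of the library.

```python
from collections import Counter


def duplicate_row_ratio(rows):
    """Return the fraction of sampled rows that repeat an earlier row."""
    rows = [tuple(row) for row in rows]
    if not rows:
        return 0.0
    counts = Counter(rows)
    # Every occurrence beyond the first of a given row counts as a duplicate.
    duplicate_count = sum(n - 1 for n in counts.values())
    return duplicate_count / len(rows)


sample = [(1, "a"), (2, "b"), (1, "a"), (1, "a")]
ratio = duplicate_row_ratio(sample)  # 2 duplicates out of 4 rows
```

Unlike a count, the ratio stays comparable when the sample size changes.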

Not all options parameters are validated for CSVData

General Information:

  • Library version: 0.3.4

Describe the bug:
The options dict is not validated beyond header. All possible parameters (delimiter, etc.) need to be validated.
Tests also need to be added to cover this.
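
A rough sketch of what such validation might look like. The function `validate_csv_options` is hypothetical, only the two option names mentioned in this issue are covered, and the accepted types (a non-empty string for delimiter, a non-negative row index for header) are assumptions rather than the library's actual rules.

```python
def validate_csv_options(options):
    """Raise ValueError for unknown or badly typed CSV reader options."""
    # Assumption: only the two options named in the issue are allowed here.
    allowed = {"delimiter", "header"}
    unknown = set(options) - allowed
    if unknown:
        raise ValueError("Unknown option(s): {}".format(sorted(unknown)))

    delimiter = options.get("delimiter")
    if delimiter is not None and not (isinstance(delimiter, str) and delimiter):
        raise ValueError("'delimiter' must be a non-empty string or None")

    header = options.get("header")
    # bool is a subclass of int, so it is rejected explicitly.
    if header is not None and (
        isinstance(header, bool) or not isinstance(header, int) or header < 0
    ):
        raise ValueError("'header' must be a non-negative int or None")
    return options
```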

Allow user to specify the default label for dp.train_structured_labeler

Is your feature request related to a problem? Please describe.
Currently, I have to change my dataset in order to train on it if none of its column names contains the default label.

Describe the outcome you'd like:
The ability to specify my own default label when training on my data.

Additional context:
potential function: def dp.train_structured_labeler(data, default_label=None, save_dirpath=None, epochs=2)
Also might suggest switching save path and epochs.

Add training and extended training with an example dataset

Please provide the issue you face regarding the documentation
When using extended training and passing a custom label with custom data, the following error appears:
error_while transfer_learning

When I add "BACKGROUND" to the labels, the labeler trains. When I then predict entities, the prediction is as follows:
err1

To reproduce the error, please use the following data:
new _data_label.zip

Please update the documentation with example data so that it is more helpful.

Thank you

Separate "optional" requirements

Is your feature request related to a problem? Please describe.

Many people don't care about the data labels (entity recognition); they should be able to install the library without TensorFlow. This would also help people attempting to use Python 3.9, for instance.

I'm not 100% sold this is a great idea, but I think it's worth a discussion.

Describe the outcome you'd like:

I'd like to remove the labeling requirements from requirements.txt and separate them into requirements-labeling.txt. Then add them as dataprofiler[extras] or dataprofiler[labeler] or something to that effect. This can easily be done in setup.py.

The way you would install the labeler would be:

$ pip install dataprofiler[labeler] --user

Without the labeler would simply be:

$ pip install dataprofiler --user

https://stackoverflow.com/questions/6237946/optional-dependencies-in-distutils-pip

In the output report & when executing we can warn the users to install the labeler, if desired.
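
A small sketch of the warning side of this idea, assuming the optional dependency is detected by module name. The helper `labeler_available` is hypothetical, and the install hint reuses the `dataprofiler[labeler]` extra proposed above.

```python
import importlib.util
import warnings


def labeler_available(module_name="tensorflow"):
    """Return True if the optional dependency is importable; warn otherwise."""
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is not None:
        return True
    warnings.warn(
        "Data labeling is disabled because '{}' is not installed. "
        "Install it with: pip install dataprofiler[labeler]".format(module_name),
        stacklevel=2,
    )
    return False
```

On the packaging side, setuptools supports this through the `extras_require` argument of `setup()`, keyed by the extra's name.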

Additional context:

This was a recommendation on /r/statistics

Add data.length

Is your feature request related to a problem? Please describe.

I think the data class should have an easy-to-call property which gets the length of the given dataset, i.e. data.length instead of len(data.data)
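
A minimal sketch of the requested property. The `Data` class here is a hypothetical stand-in that only holds a `data` attribute, not the library's actual class.

```python
class Data:
    """Hypothetical stand-in for the library's data class."""

    def __init__(self, data):
        self.data = data

    @property
    def length(self):
        """Number of records in the loaded dataset."""
        return len(self.data)

    def __len__(self):
        # Lets callers write len(data) as well as data.length.
        return self.length
```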

Column level Detection While using labeler

Is your feature request related to a problem? Please describe.

When we predict using labeler.predict(data), we get cell-level labels.

Describe the outcome you'd like:
How can we get column-level labels as output?

Additional context:
Also, when we train a new labeler with custom data, how do we include the new labeler model in the profiler so that profiling produces Data_Label in the JSON output?

Thank you
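
One possible way to go from cell-level to column-level labels is a majority vote per column. This is only a sketch: the input format (a dict mapping column names to lists of cell labels) is an assumption and not the actual output format of labeler.predict, the function name `column_labels` is hypothetical, and the label names other than BACKGROUND are illustrative.

```python
from collections import Counter


def column_labels(cell_labels_by_column, ignore=("BACKGROUND",)):
    """Pick the most frequent cell label per column, skipping ignored labels."""
    result = {}
    for column, labels in cell_labels_by_column.items():
        counts = Counter(label for label in labels if label not in ignore)
        # Fall back to the first ignored label when nothing else was predicted.
        result[column] = counts.most_common(1)[0][0] if counts else ignore[0]
    return result


cells = {
    "contact": ["EMAIL_ADDRESS", "EMAIL_ADDRESS", "BACKGROUND"],
    "notes": ["BACKGROUND", "BACKGROUND"],
}
labels = column_labels(cells)
```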

Allow user to specify null values

Is your feature request related to a problem? Please describe.
I cannot specify what values should be considered null in my dataset

Describe the outcome you'd like:
In options, I want to specify what represents a null in my dataset.
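
A sketch of how user-specified null values could be applied, assuming a case-insensitive match on the stripped text of each cell. The function `count_nulls` and its default null strings are hypothetical.

```python
def count_nulls(values, null_values=("", "nan", "none", "null")):
    """Count entries that are None or whose text matches a null marker."""
    nulls = {str(v).lower() for v in null_values}
    return sum(
        1 for value in values
        if value is None or str(value).strip().lower() in nulls
    )


# A dataset where "N/A" marks missing values:
n_missing = count_nulls(["1", "N/A", None, " null "], null_values=("n/a", "null"))
```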

Training on new data

While training on new data in Colab, the following error appears:

default_label_error

When I rename any one of the columns to "BACKGROUND", the labeler trains and gives the following output:

gives_error_but_model_saves

To reproduce the errors, please use the following CSV files:
datasets.zip

Also, when predicting, the labeler gives a prediction for each cell. How can these be aggregated to the column level? Please help me with this.
Thank you

Consider renaming some of report variables

Consider converting:

  • total_samples -> data_object_count or samples_total or row_count
  • BACKGROUND -> UNKNOWN

Remove from pretty:

  • data_label_representation -> remove
  • avg_predictions -> remove
  • times -> remove

Remove from all:

  • data_label_probability -> remove
  • covariance -> remove

Possibly remove (always null):

  • median -> remove - can be approximated
  • data_classification -> remove - could easily be implemented for PII, NPI, etc

Data labeler doesn't allow TextData as the input for the predict

General Information:

  • OS:
  • Python version:
  • Library version:

Describe the bug:
Currently, data labeler allows all returned objects from data reader except TextData. This can be fixed by modifying the check_and_validate_data_format function.

To Reproduce:
Run this code

import dataprofiler as dp

# data_labeler is an already-instantiated data labeler
data = dp.Data('some_text_file')
predictions = data_labeler.predict(data)

Expected behavior:

Screenshots:

Additional context:

Create the TextProfiler for unstructured profiling

Is your feature request related to a problem? Please describe.
Can't profile unstructured text

Describe the outcome you'd like:
Need class for text profiling of unstructured data which mimics the formats of the structured profiling.
Include the following statistics:
word counts
vocab counts
line_lengths (min,max,...)

Additional context:
Starting point, final code may change:

class TextProfiler(object):
    
    type = 'general_info'

    def __init__(self, options):

        self.sample_size = 0
        self.times = defaultdict(float)
        self.vocab = set()
        self.words = defaultdict(int)
        self.line_length = {'max': None, 'min': None, ...} # should use numeric stats mixin?

        # options values

        # these stop words are from nltk
        self._stop_words = {
            'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
            "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
            'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
            'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
            'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
            'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
            'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
            'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
            'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
            'by', 'for', 'with', 'about', 'against', 'between', 'into',
            'through', 'during', 'before', 'after', 'above', 'below', 'to',
            'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
            'again', 'further', 'then', 'once', 'here', 'there', 'when',
            'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
            'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
            'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
            'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
            'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
            "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't",
            'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
            'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
            'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
            'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
        }

        self.__calculations = {
            "vocab": TextProfiler._update_vocab,
            "words": TextProfiler._update_words,
        }
        self._filter_properties_w_options(self.__calculations, options)        
        
    def __add__(self, other):
        """
        Merges the properties of two TextProfiler profiles
        
        :param self: first profile
        :param other: second profile
        :type self: TextProfiler
        :type other: TextProfiler
        :return: New TextProfiler merged profile
        """
        if not isinstance(other, TextProfiler):
            raise TypeError("Unsupported operand type(s) for +: "
                            "'TextProfiler' and '{}'".format(
                            other.__class__.__name__))
        merged_profile = TextProfiler(None)
        
        self._merge_calculations(merged_profile.__calculations,
                                 self.__calculations,
                                 other.__calculations)
                                 
        raise NotImplementedError()
        return merged_profile
        
    @property
    def profile(self):
        """
        Property for profile. Returns the profile of the column.
        
        :return:
        """
        profile = dict(
            vocab=self.vocab,
            words=self.words,
            word_count=self.word_count,
            times=self.times,
        )
        return profile
        
    @BaseProfiler._timeit(name='vocab')
    def _update_vocab(self, data, prev_dependent_properties=None,
                      subset_properties=None):
        raise NotImplementedError()
        
    @BaseProfiler._timeit(name='words')
    def _update_words(self, data, prev_dependent_properties=None,
                      subset_properties=None):
        raise NotImplementedError()
        
    def _update_helper(self, data, profile):
        """
        Method for updating the column profile properties with a cleaned
        dataset and the known null parameters of the dataset.
        
        :param df_series_clean: df series with nulls removed
        :type df_series_clean: pandas.core.series.Series
        :param profile: text profile dictionary
        :type profile: dict
        :return: None
        """
        BaseColumnProfiler._perform_property_calcs(
            self, self.__calculations, data=data,
            prev_dependent_properties={}, subset_properties=profile)
        
        self._update_base_properties(profile)
        raise NotImplementedError()

    def update(self, data):
        """
        Updates the column profile.
        
        :param df_series: df series
        :type df_series: pandas.core.series.Series
        :return: None
        """
        len_data = len(data)
        if len_data == 0:
            return self
        
        profile = dict(sample_size=len_data)
        self._update_helper(data, profile)

        return self
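
For the statistics themselves, a standalone sketch of the three requested values might look like the following. It is independent of the class skeleton above; the function `text_stats` is hypothetical, and treating "vocab" as the set of distinct characters is an assumption, since the issue does not define it.

```python
from collections import Counter


def text_stats(lines, stop_words=frozenset()):
    """Collect word counts, character vocabulary and line-length stats."""
    words = Counter()
    vocab = set()
    lengths = []
    for line in lines:
        vocab.update(line)  # assumption: vocab = distinct characters
        lengths.append(len(line))
        for word in line.lower().split():
            if word not in stop_words:
                words[word] += 1
    line_length = {
        "min": min(lengths) if lengths else None,
        "max": max(lengths) if lengths else None,
        "mean": sum(lengths) / len(lengths) if lengths else None,
    }
    return {"words": dict(words), "vocab": vocab, "line_length": line_length}


stats = text_stats(["the cat sat", "the dog"], stop_words={"the"})
```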

Report Concatenation fails with sparse data

General Information:

  • OS: MacOS Catalina 10.15.7
  • Python version: 3.7.9
  • Library version: 0.4.2

Describe the bug:
Report Concatenation fails when missing datetime values in data.

To Reproduce:

import pandas as pd
from dataprofiler import Profiler
from datetime import datetime

dates = [None] * 20
dates[15] = datetime.strptime("2014-12-18", "%Y-%m-%d").date()
dates[16] = datetime.strptime("2015-07-21", "%Y-%m-%d").date()
dates[19] = datetime.strptime("2018-09-01", "%Y-%m-%d").date()

df = pd.DataFrame({"date": dates})

df_1 = df[:10]
df_2 = df[10:]

profiles_1 = Profiler(data=df_1)

profiles_2 = Profiler(data=df_2)

profiles = profiles_1 + profiles_2 

Expected behavior:
To be able to handle None values in statistics when concatenating reports.

Additional context:
This only seems to occur with datetime.date objects where the null values in the column remain None instead of the numpy NaT value. Might be reasonable to expect user to prep data better beforehand, but better handling of None values is probably in order.

Stack trace below

Traceback (most recent call last):
  File "bug2.py", line 19, in <module>
    profiles = profiles_1 + profiles_2
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/profile_builder.py", line 454, in __add__
    self._profile[profile_name] + other._profile[profile_name]
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/profile_builder.py", line 147, in __add__
    self.profiles[profile_name] + other.profiles[profile_name]
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/column_profile_compilers.py", line 99, in __add__
    self._profiles[profile_name] + other._profiles[profile_name]
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/datetime_column_profile.py", line 79, in __add__
    if other._dt_obj_min is None or self._dt_obj_min < other._dt_obj_min:
TypeError: '<' not supported between instances of 'NoneType' and 'Timestamp'  
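
The failing comparison in the trace could be guarded with a None-aware merge. A minimal sketch, where `merge_bound` is a hypothetical helper and not the fix that was actually applied:

```python
from datetime import datetime


def merge_bound(a, b, pick):
    """Combine two optional bounds with pick (min or max), ignoring None."""
    if a is None:
        return b
    if b is None:
        return a
    return pick(a, b)


# One profile saw no dates at all, the other saw one:
earliest = merge_bound(None, datetime(2014, 12, 18), min)
latest = merge_bound(datetime(2015, 7, 21), datetime(2018, 9, 1), max)
```

Both sides being None yields None, so a merged profile of two all-null chunks stays empty instead of raising.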

Inaccurate profile formats listed in README

Please provide the issue you face regarding the documentation

In this section of the README, the types for min and max are float, but they are strings for datetime columns.

Additionally, I'm getting simply an int for precision instead of the dictionary defined in the readme.

Add examples of adding new models

Is your feature request related to a problem? Please describe.

Not a problem, but would like to see some examples related to creating / replacing the current models.

Describe the outcome you'd like:

Examples should be easy to follow and swappable in the current library.

Additional context:

Save & Load profiles from disk

Is your feature request related to a problem? Please describe.

I'd like to be able to do: profile.save(filename=<optional>) and Profiler.load(filename=<profile_filename>), after which I should be able to do profile.update(data) or profile.report().

This would be an amazing feature, as it would enable distributed profile generation and merging.

multi-character separators

General Information:

  • OS: Arch Linux
  • Python version: 3.9
  • Library version: 0.4.2

Describe the bug:

I have a file (sparse-first-and-last-column.txt) containing multiple character separators: , . The system should be able to detect this appropriately and return reasonable results.

Create an unstructured Compiler which combines the TextProfiler and the Unstructured Data Labeling

Is your feature request related to a problem? Please describe.
Need Unstructured profiling

Describe the outcome you'd like:
Compiler which combines: TextProfiler and UnstructuredDataLabelerProfile to return a profile.

Additional context:
starting code, may not be exact same in the end.

class UnstructuredCompiler(BaseColumnProfileCompiler):
    
    # NOTE: these profilers are ordered. Test functionality if changed.
    _profilers = [
        TextProfiler,
        UnstructuredDataLabelerProfile
    ]
    
    @property
    def profile(self):
        profile = {}
        for profiler in self._profiles.values():
            # create the entry on first use to avoid a KeyError on the empty dict
            profile.setdefault(profiler.profile_name, {}).update(profiler.profile)
        return profile

Constant memory for CSVData Match / Header check

Currently the CSVData object reads in X rows, so the number of bytes read is unbounded and depends on the row lengths.

Instead, we should have a max bytes / max rows limit for the header/CSV check. This can be done by reading bytes incrementally until either X rows or Y bytes have been read.
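
A sketch of the bounded read described above, assuming the file is opened in binary mode. The function `read_bounded` and its default limits are hypothetical.

```python
import io


def read_bounded(stream, max_rows=1000, max_bytes=65536):
    """Read lines from a binary stream until max_rows or max_bytes is reached."""
    rows = []
    bytes_read = 0
    while len(rows) < max_rows and bytes_read < max_bytes:
        # readline(n) returns at most n bytes, so memory use stays bounded
        # even when a single row is very long.
        line = stream.readline(max_bytes - bytes_read)
        if not line:
            break
        bytes_read += len(line)
        rows.append(line)
    return rows


sample = read_bounded(io.BytesIO(b"a,b\n1,2\n3,4\n"), max_rows=2)
```

When the byte limit is hit first, the last row may be truncated, so a caller would probably want to drop it before running the header check.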

Possibly add correlated columns

It would be nice to have meta data showing which columns are likely correlated. I'm not sure on the difficulty here, but it probably isn't super difficult to calculate some estimations. Just a thought.

Min, Max & Avg precision for float & integer columns

Is your feature request related to a problem? Please describe.

There are cases where users may know the given dataset contains measurements from a device with a given precision. If that's the case, our measured precision is likely highly inaccurate. Take the case of the following array [15000, 39023, 94201401], if we know the measurement 15000 is accurate - the significant figures are not 2, they are really 5. We should take that into account and provide both to the users.

Describe the outcome you'd like:

I'd like to see the minimum measured precision and the maximum measured precision.

Precision should also be shown if it's an integer column.
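
A sketch of how min, max and average measured precision could be computed from the text form of each value. Both function names are hypothetical, the input is assumed to be in plain decimal notation (no exponents), and trailing zeros of integers are treated as not significant, which is exactly the ambiguity described above.

```python
def significant_figures(text):
    """Count significant figures in a plain decimal string."""
    digits = text.strip().lstrip("+-")
    if "." in digits:
        # Trailing zeros after a decimal point are significant.
        digits = digits.replace(".", "").lstrip("0")
    else:
        # Trailing zeros of an integer are ambiguous; not counted here.
        digits = digits.lstrip("0").rstrip("0")
    return len(digits)


def precision_summary(values):
    """Return min, max and average significant figures for the values."""
    figures = [significant_figures(str(v)) for v in values]
    return {
        "min": min(figures),
        "max": max(figures),
        "avg": sum(figures) / len(figures),
    }


summary = precision_summary([15000, 39023, 94201401])
```

For the array from the issue this measures 2, 5 and 8 significant figures, so reporting min and max alongside the average makes the spread visible.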

Add inplace options in the process function in the CharPostprocessor class in the data_processing.py file

Is your feature request related to a problem? Please describe.

The process match_sentence_lengths modifies the results string. That's fine, unless you need it in another data_processing step. To make it safe, you need to deepcopy (which is a very time intensive and memory intensive process).

Adding an "inplace" option for the processing function would enable users to either deepcopy or shallow copy.

Tests would need to be written to evaluate this as well.

PR #85 created this potential issue, but currently the code should function correctly (until a new data processing pipeline is built).
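
A sketch of what an inplace flag could look like on a processing step. The function `pad_results` and the `"pred"` key are hypothetical and only illustrate the copy-or-mutate choice, not the actual match_sentence_lengths logic.

```python
import copy


def pad_results(results, target_length, inplace=False):
    """Hypothetical post-processing step that pads each prediction list."""
    if not inplace:
        # The caller's object stays untouched, at the cost of a full copy.
        results = copy.deepcopy(results)
    for preds in results["pred"]:
        preds.extend([0] * (target_length - len(preds)))
    return results


original = {"pred": [[1], [1, 2]]}
padded = pad_results(original, 3)               # original is unchanged
same = pad_results(original, 3, inplace=True)   # original is modified
```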

Investigate refactor of histogram_to_array for better accuracy

Currently, the _histogram_to_array function does not use the midpoint of the bins to recreate the original dataset.
Investigate accuracy of the _histogram_to_array function currently in comparison to (which uses the midpoint):

def _histogram_to_array(self):
    # Extend histogram to array format
    bin_counts = self._stored_histogram['histogram']['bin_counts']
    bin_edges = self._stored_histogram['histogram']['bin_edges']
    is_bin_non_zero = bin_counts > 0
    bin_midpoints = (bin_edges[1:][is_bin_non_zero]
                     + bin_edges[:-1][is_bin_non_zero]) / 2
    hist_to_array = [
        [midpoint] * count for midpoint, count
        in zip(bin_midpoints, bin_counts[is_bin_non_zero])
    ]
    array_flatten = np.concatenate(hist_to_array)

    # the min/max must be preserved
    array_flatten[0] = bin_edges[0]
    array_flatten[-1] = bin_edges[-1]

    # If we know they are integers, we can limit the data to be as such
    # during conversion
    if self.__class__.__name__ != 'FloatColumn':
        array_flatten = np.round(array_flatten)

    return array_flatten

CSV Header Detection Errors

General Information:

  • OS: Arch Linux
  • Python version: 3.7.6
  • Library version: 0.3.4

Describe the bug:

Header detection fails on the following files.

blogposts.csv

Blog Post,Date,Subject,Field
Abstract Libraries in Go,3/7/2014,Programming,Computer Science
Virtual Memory and You,3/9/2013,Systems,Computer Science
Mutex - Process Synchronization,3/14/2014,Programming,Computer Science
Newtons Method and Fractals,3/16/2014,Programming,Mathematics
Saint Patrick,3/16/2014,World,History
Cache Optimizing,3/24/2014,Systems,Computer Science
The Cache and Multithreading,3/24/2014,Systems,Computer Science
Counting Sort in C,3/25/2014,Programming,Computer Science
Bilingualism and Pattern Recognition,3/28/2014,Learning,Life
Quadrature - Numerical Integration Comparison,4/1/2014,Algorithms,Mathematics
Basic Book Reader,4/9/2014,Programming,Computer Science
C++ Inheritance - Virtual Functions,4/10/2014,Programming,Computer Science
"Monty Hall, meet Game Theory",4/13/2014,Statistics,Mathematics
Gaussian Quadrature,4/13/2014,Algorithms,Mathematics
Introduction to Markov Processes,4/20/2014,Statistics,Mathematics
Theoretically Determing the Man Made in C++,4/26/2014,Programming,Computer Science
Introduction to Monte Carlo Methods,4/30/2014,Statistics,Mathematics
"Are Decisions Governed by ""Free Will"" or Algorithms",5/1/2014,Learning,Life
"CT-Afferents, Emotions, and Autism",5/3/2014,Learning,Life
Multithreading: Semaphores,5/4/2014,Systems,Computer Science
Multithreading: Producer-Consumer Problem,5/4/2014,Systems,Computer Science
Multithreading: Dining Philosophers Problem,5/4/2014,Systems,Computer Science
Multithreading: Common Pitfalls,5/5/2014,Systems,Computer Science
Generating Graph Images in Golang,5/6/2014,Programming,Computer Science
Intro to IPC | Interprocess Communication,5/8/2014,Systems,Computer Science

length on TextData works, but text data loads everything as a single string

Describe the bug:
Fix TextData to allow different text formats, then add tests for the length of TextData.
TextData could have the following formats:
samples per # of lines
samples per # of characters
samples per # of words

Otherwise, currently everything is read as a single string and the length of the data is always 1.
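
A sketch of the three sample formats listed above. The function `split_samples` and its `unit` and `size` parameters are hypothetical names.

```python
def split_samples(text, unit="line", size=1):
    """Split text into samples of `size` lines, words or characters."""
    if unit == "line":
        parts, joiner = text.splitlines(), "\n"
    elif unit == "word":
        parts, joiner = text.split(), " "
    elif unit == "character":
        parts, joiner = list(text), ""
    else:
        raise ValueError("unit must be 'line', 'word' or 'character'")
    # Group consecutive parts into chunks of `size`; the last may be shorter.
    return [joiner.join(parts[i:i + size]) for i in range(0, len(parts), size)]


samples = split_samples("one\ntwo\nthree")  # one sample per line
```

With samples defined this way, the length of the data is the number of samples rather than always 1.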

Report issue - Error message shown when all options except data labeler are disabled

General Information:

  • OS:
  • Python version:
  • Library version:

Describe the bug:
Get the following error with the report

Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data

To Reproduce:
Use the following code with a csv file including 16 columns

import json

import dataprofiler as dp

# set option to run only data labeler
profile_options = dp.ProfilerOptions()
profile_options.set({"text.is_enabled": False,
                     "int.is_enabled": False,
                     "float.is_enabled": False,
                     "order.is_enabled": False,
                     "category.is_enabled": False,
                     "datetime.is_enabled": False})

# data is the loaded csv file with 16 columns
profile = dp.Profiler(data, profiler_options=profile_options)

results = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(results, indent=4))

Expected behavior:

Screenshots:

Additional context:

Mock the DataLabeler during tests, decreasing test runtime

Describe the bug:
Originally, tests mocked the DataLabelerColumnCompiler to avoid instantiating the DataProfiler and TensorFlow. However, with the recent change that now instantiates a profiler inside the Profiler, these mocks no longer protect against this long load time.

Show progress of profiling

Is your feature request related to a problem? Please describe.

It's unclear where the Data Profiler is currently at in terms of processing

Describe the outcome you'd like:

A progress bar (number of columns or percent of data) would be super useful.

Null count/ Null rows, potentially other info does not get shared if the primitive type of the column is disabled in options

General Information:

  • Library version: 0.4.3

Describe the bug:
If you disable a primitive data type in options (e.g. TEXT, because it is the catch-all) and the column would have been profiled as that type, information is dropped from the report; for example, null count and null rows do not show up.

To Reproduce:

import dataprofiler as dp

data = dp.Data(...)
options = dp.ProfilerOptions()
options.set({'text.is_enabled': False})

profiler = dp.Profiler(data, profiler_options=options)
report = profiler.report() # will be missing null info

Delimiter detection uses digits in a number

General Information:

  • OS: Arch Linux
  • Python version: 3.9
  • Library version: 0.3.4

Describe the bug:

File:

-123
234
345
534
231

Delimiter detected: 3

Delimiter should probably not detect inside numbers
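
A sketch of a detector that never considers letters or digits as candidates. The function `guess_delimiter` is hypothetical and much simpler than real delimiter detection: it ignores quoting and only accepts a character that occurs the same number of times on every line.

```python
from collections import Counter


def guess_delimiter(lines):
    """Return a non-alphanumeric character that is equally frequent on every line."""
    lines = [line for line in lines if line]
    if not lines:
        return None
    # Count only non-alphanumeric characters, so digits can never win.
    per_line = [Counter(ch for ch in line if not ch.isalnum()) for line in lines]
    candidates = [
        ch for ch, count in per_line[0].items()
        if all(counts[ch] == count for counts in per_line[1:])
    ]
    # Prefer the candidate that occurs most often per line.
    return max(candidates, key=per_line[0].get) if candidates else None


numbers_only = guess_delimiter(["-123", "234", "345", "534", "231"])
comma_file = guess_delimiter(["a,b,c", "1,2,3"])
```

For the file in this report no delimiter is returned, because the only non-alphanumeric character, the minus sign, appears on a single line.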

Null rows are being incorrectly calculated when sampling

General Information:

  • OS: Arch Linux
  • Python version: 3.7
  • Library version: 0.4.2

Describe the bug:

Indexes are randomly generated per column:

sample_ind_generator = utils.shuffle_in_chunks(
    len_df, chunk_size=sample_size)

Those indexes are random and may not align with one another, but they have to align to give correct results. The affected values are row_has_null and row_is_null.

To Reproduce:

Every time the data profiler is run.

Expected behavior:

The solution is to:

  1. Generate all the random indices for sampling first.
  2. Each profile stops after it obtains the number of samples (min_true_samples) needed (returning / storing said location).
  3. Do an intersection up to the smallest number of samples needed for a given column; this should give you the row_is_null calculation.

As part of this PR, tests should be written to check the row_has_null and row_is_null counts.
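
A simplified sketch of step 1 only: drawing one set of indices and sharing it across all columns so that rows stay aligned. It does not model the per-column early stop or the intersection from steps 2 and 3. The function `null_row_counts` is hypothetical, and columns are assumed to be equal-length lists where None marks a null.

```python
import random


def null_row_counts(columns, sample_size, seed=0):
    """Count rows with any null and rows that are all null in one shared sample."""
    n_rows = len(next(iter(columns.values())))
    rng = random.Random(seed)
    # One index sample shared by every column keeps the rows aligned.
    indices = rng.sample(range(n_rows), min(sample_size, n_rows))
    row_has_null = row_is_null = 0
    for i in indices:
        nulls = [column[i] is None for column in columns.values()]
        row_has_null += any(nulls)
        row_is_null += all(nulls)
    return {"row_has_null_count": row_has_null, "row_is_null_count": row_is_null}


columns = {"a": [1, None, None, 4], "b": [None, None, 3, 4]}
counts = null_row_counts(columns, sample_size=4)
```

A test along these lines, with the sample size equal to the row count, gives deterministic values for checking row_has_null and row_is_null.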

Deepcopy doesn't appear necessary, can we remove?

There are a significant number of "deepcopy" calls, and they are currently what slows the library down, such as the line below:

results = self.match_sentence_lengths(data, copy.deepcopy(results),

After removing deepcopy, 3 tests fail when testing data processing:

pytest dataprofiler/tests/labelers/test_data_processing.py

I suspect that's due to issues with either the tests OR a function that needs to not manipulate its input. After removing deepcopy, the function(s) all still worked fine in practice (as far as I could tell).

In either case, a shallow copy likely would work fine here. When tested it did pass all tests:

results = self.match_sentence_lengths(data, dict(results), flatten_separator)

This resulted in a 10-15% reduction in profiling runtime (tested on test file diamonds.csv)

Used the following code to cProfile: https://gist.github.com/lettergram/d8f7d9f3d19856d4a0187462445382a0

Master repo (sort by tottime):

         33022296 function calls (30312047 primitive calls) in 20.389 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.849    0.037    1.851    0.037 {method 'tolist' of 'numpy.ndarray' objects}
2262585/575    1.294    0.000    2.548    0.004 copy.py:132(deepcopy)
    12733    0.517    0.000    0.517    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
 85312/40    0.496    0.000    2.452    0.061 copy.py:210(_deepcopy_list)
    23976    0.402    0.000    0.557    0.000 numerical_column_stats.py:386(_get_percentile)
  4755217    0.341    0.000    0.342    0.000 {method 'get' of 'dict' objects}
     1100    0.332    0.000    0.332    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.323    0.000    0.323    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.318    0.000    0.324    0.000 _collections_abc.py:742(__iter__)
  990/330    0.285    0.000    0.285    0.001 version_utils.py:98(swap_class)
     1100    0.270    0.000    0.473    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.270    0.000    0.272    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
    31850    0.237    0.000    1.217    0.000 ops.py:1880(__init__)
     1500    0.233    0.000    0.233    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    29420    0.225    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485219/1480389    0.215    0.000    0.260    0.000 {built-in method builtins.isinstance}
       10    0.212    0.021    2.541    0.254 character_level_cnn_model.py:698(predict)
    29200    0.191    0.000    0.191    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.182    0.000    0.472    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.178    0.000    0.178    0.000 {built-in method marshal.loads}

After removing deepcopy on results, there was a 15% speedup (fails 3 tests):

         19387413 function calls (18978038 primitive calls) in 17.416 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.784    0.036    1.786    0.036 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.512    0.000    0.512    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.422    0.000    0.571    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.321    0.000    0.321    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.310    0.000    0.310    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.307    0.000    0.314    0.000 _collections_abc.py:742(__iter__)
  990/330    0.283    0.000    0.283    0.001 version_utils.py:98(swap_class)
     1100    0.263    0.000    0.461    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.254    0.000    0.256    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.227    0.000    0.227    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.225    0.000    1.160    0.000 ops.py:1880(__init__)
    29420    0.213    0.000    0.249    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485055/1480225    0.209    0.000    0.251    0.000 {built-in method builtins.isinstance}
       10    0.202    0.020    2.446    0.245 character_level_cnn_model.py:698(predict)
    29200    0.184    0.000    0.184    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.178    0.000    0.460    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.152    0.000    0.152    0.000 {built-in method marshal.loads}
     3190    0.144    0.000    0.144    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.137    0.000    0.137    0.000 {method 'search' of 're.Pattern' objects}
       80    0.137    0.002    0.359    0.004 {pandas._libs.lib.map_infer_mask}

Shallow copy implementation (passes all tests):

         19389622 function calls (18980246 primitive calls) in 17.991 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.887    0.038    1.889    0.038 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.522    0.000    0.522    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.378    0.000    0.530    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.338    0.000    0.338    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.326    0.000    0.326    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.321    0.000    0.328    0.000 _collections_abc.py:742(__iter__)
  990/330    0.287    0.000    0.288    0.001 version_utils.py:98(swap_class)
     1100    0.275    0.000    0.480    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.269    0.000    0.271    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.242    0.000    0.242    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.236    0.000    1.224    0.000 ops.py:1880(__init__)
    29420    0.224    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485818/1480988    0.214    0.000    0.257    0.000 {built-in method builtins.isinstance}
     2389    0.201    0.000    0.201    0.000 {built-in method marshal.loads}
       10    0.199    0.020    2.559    0.256 character_level_cnn_model.py:698(predict)
    29200    0.190    0.000    0.190    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.181    0.000    0.474    0.000 function_deserialization.py:481(_list_function_deps)
     3190    0.150    0.000    0.150    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.139    0.000    0.139    0.000 {method 'search' of 're.Pattern' objects}
    37515    0.139    0.000    0.140    0.000 {built-in method tensorflow.python._tf_stack.extract_stack}

Remove TensorFlow Addons (TFA) and utilize TF nightly to support python 3.9

TensorFlow addons (TFA) are the primary reason the library cannot be upgraded.

TFA is utilized in a single location in the code:

# Use TFA to add f1 score to output

It should be possible to create an F1 score metric function that doesn't require TFA:

https://stackoverflow.com/questions/64474463/custom-f1-score-metric-in-tensorflow

Once removed, tensorflow nightly would likely suffice and the library could then work on python 3.9.
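
As a rough illustration of the arithmetic such a metric would have to compute, here is a minimal plain-Python sketch of a binary F1 score. It is not a Keras metric and not the library's code; the name `f1_score` and its arguments are assumptions made for this example. A TFA-free metric would compute the same counts with TensorFlow ops.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for two equal-length label sequences (illustrative only)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); return 0.0 when no positives appear at all.
    return 2 * tp / denom if denom else 0.0
```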

Would we like to do this?

Possible Optimization Bugs Since There Are No Tests

General Information:
The repo has been optimized to reduce the number of shuffles and to add multiprocessing. Both changes were merged without any tests.

There need to be tests to ensure the repo shuffles properly both when it maintains the shuffle indices and when it does not. The new utils function for shuffling needs to be tested.

In multiprocessing, there are no tests to make sure any of the multithreading is working appropriately. Most importantly, single threading needs to be tested to make sure that option still works.
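
A test for the shuffling utility could start from a property like the one sketched here: whatever the chunk size, the shuffled indices must cover every index exactly once. `shuffle_in_chunks_stub` is a hypothetical stand-in written for this sketch, not the library's function, and its signature is an assumption.

```python
import random

def shuffle_in_chunks_stub(data_length, chunk_size, seed=None):
    """Hypothetical stand-in: yield shuffled indices chunk by chunk."""
    rng = random.Random(seed)
    indices = list(range(data_length))
    rng.shuffle(indices)
    for start in range(0, data_length, chunk_size):
        yield indices[start:start + chunk_size]

def check_is_permutation(data_length, chunk_size):
    """Return True if the chunks together cover every index exactly once."""
    seen = [i
            for chunk in shuffle_in_chunks_stub(data_length, chunk_size, seed=0)
            for i in chunk]
    return sorted(seen) == list(range(data_length))
```

The same check could be run once with a single worker and once with several, comparing the results, to cover the single-threaded option.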

Issue With DataLabeler

When we load the default data labeler and predict on a dataset, the output is as follows:

base_model
base_model_prediction

Here I want the model to learn a new label, so I used transfer learning. The predictions after transfer learning are not good:
transfer_learning
new_model_predictions

Compared with the original predictions, all the values are disturbed. How can I make the model learn a new label without losing its original behavior? Please suggest how to do this, or tell me if I am doing anything wrong here.

To reproduce the results, use this dataset:
data.zip

Thank you.

Profiling is slow

General Information:

  • OS: Arch Linux
  • Python version: 3.7
  • Library version: 0.3.4

Describe the bug:

Profiling diamonds.csv (2.5Mb), or really any dataset, is much slower than expected, even with the data labeler disabled.

To Reproduce:

The following code takes:

  • Base run: 38 seconds and ~785Mb

  • No labeler: 24.48 seconds to execute and ~67Mb

  • No labeler, no histogram: 23.47 seconds to execute and ~67Mb.

  • No labeler, no datetime: 23.35 seconds and ~67Mb

import sys
import json
import time
import dataprofiler as dp

filename = sys.argv[1]

def profile_test(filename):
    data = dp.Data(filename)

    profile_options = dp.ProfilerOptions()
    profile_options.structured_options.data_labeler.is_enabled = False                       
    profile_options.set({"histogram_and_quantiles.is_enabled": False})                       
    profile_options.set({"datetime.is_enabled": False})  

    profile = dp.Profiler(data, profiler_options=profile_options)

    human_readable_report = profile.report(report_options={"output_format":"pretty"})

    print(json.dumps(human_readable_report, indent=4))

start_time = time.time()
profile_test(filename)
end_time = time.time()

print("Profile runtime for "+filename, end_time-start_time, 'seconds')

Nothing in the "times" entries of the output explains why this is occurring:

{
    "global_stats": {
        "samples_used": 10788,
        "column_count": 10,
        "unique_row_ratio": 0.9973,
        "row_has_null_ratio": 0.0,
        "duplicate_row_count": 146,
        "file_type": "csv",
        "encoding": "utf-8",
        "data_classification": null,
        "covariance": null
    },
    "data_stats": {
        "carat": {
            "column_name": "carat",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['2.52', '1', '0.55', '1.75', '2.01']",
            "statistics": {
                "min": 0.2,
                "max": 4.01,
                "mean": 0.7976,
                "median": null,
                "variance": 0.2244,
                "stddev": 0.4738,
                "histogram": {
                    "bin_counts": "[  70,  205,  606, 1224,  443, ... , 0, 0, 0, 0, 1]",
                    "bin_edges": "[0.2       , 0.23663462, ... , 3.97336538, 4.01      ]"
                },
                "quantiles": {
                    "0": 0.3832,
                    "1": 0.6762,
                    "2": 1.006
                },
                "times": {
                    "precision": 0.0027,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0263
                },
                "precision": 2,
                "unique_count": 230,
                "unique_ratio": 0.0213,
                "categories": "['0.29', '0.31', '0.26', ... , '0.21', '0.57', '0.69']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0336,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "cut": {
            "column_name": "cut",
            "data_type": "string",
            "categorical": true,
            "order": "random",
            "samples": "['Ideal', 'Very Good', 'Ideal', 'Premium', 'Good']",
            "statistics": {
                "min": 4.0,
                "max": 9.0,
                "mean": 6.2944,
                "median": null,
                "variance": 3.1061,
                "stddev": 1.7624,
                "histogram": {
                    "bin_counts": "[1340,    0,    0,    0, ... ,    0,    0,    0, 2454]",
                    "bin_edges": "[4.        , 4.04807692, ... , 8.95192308, 9.        ]"
                },
                "quantiles": {
                    "0": 4.9615,
                    "1": 4.9615,
                    "2": 6.9808
                },
                "vocab": "['P', 'r', 'e', 'm', 'i', ... , 'd', 'I', 'a', 'l', 'F']",
                "times": {
                    "vocab": 0.0239,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0304
                },
                "unique_count": 5,
                "unique_ratio": 0.0005,
                "categories": "['Premium', 'Very Good', 'Ideal', 'Good', 'Fair']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0,
                    "float": 0.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "color": {
            "column_name": "color",
            "data_type": "string",
            "categorical": true,
            "order": "random",
            "samples": "['D', 'D', 'E', 'G', 'D']",
            "statistics": {
                "min": 1.0,
                "max": 1.0,
                "mean": 1.0,
                "median": null,
                "variance": 0.0,
                "stddev": 0.0,
                "histogram": {
                    "bin_counts": "[10788]",
                    "bin_edges": "[1., 1.]"
                },
                "quantiles": {
                    "0": 1.0,
                    "1": 1.0,
                    "2": 1.0
                },
                "vocab": "['I', 'E', 'J', 'H', 'D', 'G', 'F']",
                "times": {
                    "vocab": 0.0147,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0113
                },
                "unique_count": 7,
                "unique_ratio": 0.0006,
                "categories": "['I', 'E', 'J', 'H', 'D', 'G', 'F']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0,
                    "float": 0.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "clarity": {
            "column_name": "clarity",
            "data_type": "string",
            "categorical": true,
            "order": "random",
            "samples": "['VS1', 'SI2', 'SI1', 'VS1', 'SI1']",
            "statistics": {
                "min": 2.0,
                "max": 4.0,
                "mean": 3.1212,
                "median": null,
                "variance": 0.1989,
                "stddev": 0.446,
                "histogram": {
                    "bin_counts": "[498,   0,   0,   0,   0, ... ,    0,    0,    0,    0, 1806]",
                    "bin_edges": "[2.        , 2.01923077, ... , 3.98076923, 4.        ]"
                },
                "quantiles": {
                    "0": 3.0,
                    "1": 3.0,
                    "2": 3.0
                },
                "vocab": "['V', 'S', '1', 'I', '2', 'F']",
                "times": {
                    "vocab": 0.0149,
                    "min": 0.0001,
                    "max": 0.0001,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0322
                },
                "unique_count": 8,
                "unique_ratio": 0.0007,
                "categories": "['VS1', 'SI1', 'VVS2', 'VS2', ... , 'SI2', 'VVS1', 'IF', 'I1']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0,
                    "float": 0.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "depth": {
            "column_name": "depth",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['62.5', '62.4', '63.9', '63.4', '61']",
            "statistics": {
                "min": 43.0,
                "max": 79.0,
                "mean": 61.7492,
                "median": null,
                "variance": 2.0652,
                "stddev": 1.4371,
                "histogram": {
                    "bin_counts": "[1, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 1]",
                    "bin_edges": "[43.        , 43.13533835, ... , 78.86466165, 79.        ]"
                },
                "quantiles": {
                    "0": 61.0,
                    "1": 61.6767,
                    "2": 62.4887
                },
                "times": {
                    "precision": 0.0026,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0653
                },
                "precision": 1,
                "unique_count": 136,
                "unique_ratio": 0.0126,
                "categories": "['61.5', '62', '63.4', '61', ... , '54.3', '55.3', '57', '79']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.099,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "table": {
            "column_name": "table",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['54', '59', '56', '57', '56']",
            "statistics": {
                "min": 51.0,
                "max": 95.0,
                "mean": 57.4651,
                "median": null,
                "variance": 5.0778,
                "stddev": 2.2534,
                "histogram": {
                    "bin_counts": "[ 5,  0,  0, 16,  0,  0,  0, ... , 0, 0, 0, 0, 0, 0, 1]",
                    "bin_edges": "[51.        , 51.26993865, ... , 94.73006135, 95.        ]"
                },
                "quantiles": {
                    "0": 55.7239,
                    "1": 56.8037,
                    "2": 58.6933
                },
                "times": {
                    "precision": 0.0019,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0296
                },
                "precision": 1,
                "unique_count": 91,
                "unique_ratio": 0.0084,
                "categories": "['58', '57', '61', '54', ... , '58.9', '62.5', '60.7', '61.6']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.9819,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "price": {
            "column_name": "price",
            "data_type": "int",
            "categorical": false,
            "order": "random",
            "samples": "['2655', '3587', '1341', '991', '16215']",
            "statistics": {
                "min": 334.0,
                "max": 18795.0,
                "mean": 3903.6178,
                "median": null,
                "variance": 15978757.826,
                "stddev": 3997.3438,
                "histogram": {
                    "bin_counts": "[436, 959, 965, 757, 481, ... , 13, 14, 10, 11, 14]",
                    "bin_edges": "[334.        , 511.50961538, ... , 18617.49038462, 18795.        ]"
                },
                "quantiles": {
                    "0": 866.5288,
                    "1": 2286.6058,
                    "2": 5126.7596
                },
                "times": {
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0257
                },
                "unique_count": 5349,
                "unique_ratio": 0.4958,
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 1.0,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "x": {
            "column_name": "x",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['5.7', '4.74', '7.3', '4.27', '6.88']",
            "statistics": {
                "min": 0.0,
                "max": 10.02,
                "mean": 5.7337,
                "median": null,
                "variance": 1.2481,
                "stddev": 1.1172,
                "histogram": {
                    "bin_counts": "[2, 0, 0, 0, 0, 0, 0, 0, ... , 2, 1, 1, 0, 0, 0, 0, 1]",
                    "bin_edges": "[0.        , 0.09634615, ... ,  9.92365385, 10.02      ]"
                },
                "quantiles": {
                    "0": 4.6246,
                    "1": 5.6844,
                    "2": 6.4552
                },
                "times": {
                    "precision": 0.0026,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0385
                },
                "precision": 2,
                "unique_count": 503,
                "unique_ratio": 0.0466,
                "categories": "['3.87', '3.93', '4.21', ... , '5.54', '3.92', '3.9']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.005,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "y": {
            "column_name": "y",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['5.13', '4.18', '6.37', '6.18', '6.41']",
            "statistics": {
                "min": 3.71,
                "max": 31.8,
                "mean": 5.724,
                "median": null,
                "variance": 1.3022,
                "stddev": 1.1412,
                "histogram": {
                    "bin_counts": "[   8,   93,  157,  840, 1031, ... , 0, 0, 0, 0, 1]",
                    "bin_edges": "[3.71      , 3.87426901, ... , 31.63573099, 31.8       ]"
                },
                "quantiles": {
                    "0": 4.6135,
                    "1": 5.5991,
                    "2": 6.4204
                },
                "times": {
                    "precision": 0.0027,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0262
                },
                "precision": 2,
                "unique_count": 496,
                "unique_ratio": 0.046,
                "categories": "['3.96', '3.78', '3.9', ... , '3.82', '3.92', '31.8']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0056,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "z": {
            "column_name": "z",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['4.23', '2.69', '3.17', '3.11', '4.25']",
            "statistics": {
                "min": 0.0,
                "max": 31.8,
                "mean": 3.54,
                "median": null,
                "variance": 0.554,
                "stddev": 0.7443,
                "histogram": {
                    "bin_counts": "[4, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 1]",
                    "bin_edges": "[0.        , 0.10127389, ... , 31.69872611, 31.8       ]"
                },
                "quantiles": {
                    "0": 2.8357,
                    "1": 3.4433,
                    "2": 3.9497
                },
                "times": {
                    "precision": 0.0027,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0342
                },
                "precision": 2,
                "unique_count": 328,
                "unique_ratio": 0.0304,
                "categories": "['2.43', '2.31', '2.53', ... , '3.12', '3.13', '31.8']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0184,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        }
    }
}
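
Since the per-column "times" entries above add up to well under a second, a function-level profile may help locate the remaining time. The following is a minimal sketch using the standard-library `cProfile` and `pstats` modules; `work` is a hypothetical stand-in for `profile_test(filename)`, and `profile_top` is a helper name invented for this example.

```python
import cProfile
import io
import pstats

def work():
    """Hypothetical stand-in for profile_test(filename)."""
    return sum(i * i for i in range(10000))

def profile_top(func, limit=20):
    """Run func under cProfile and return the top `limit` rows by internal time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is the column pstats labels as "internal time".
    stats.sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()

report = profile_top(work)
print(report)
```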

Expected behavior:

Screenshots:

Additional context:

Identifying & Loading JSON files

General Information:

  • OS: Arch Linux
  • Python version: 3.7
  • Library version: 0.4.1

Describe the bug:

JSON files are not loading as I would expect. The library does well with a list of objects, but it does not handle embedded objects well. Further, given a JSON file such as { "data": [ { }, { }, { } ] }, the library treats [ { }, { }, { } ] as one giant string.

To Reproduce:

JSON files come in a variety of formats, some datasets I tried:

[
  {
    "_id": "605d673b20b4132093890d7f",
    "index": 0,
    "guid": "4582d945-a7c7-4605-a335-255c04fb701d",
    "isActive": false,
    "balance": "$3,822.52",
    "picture": "http://placehold.it/32x32",
    "age": 26,
    "eyeColor": "green",
    "name": "Cobb Bonner",
    "gender": "male",
    "company": "SCENTRIC",
    "email": "[email protected]",
    "phone": "+1 (887) 582-3501",
    "address": "712 Ferris Street, Marysville, New York, 1006",
    "about": "Velit aliquip duis id ut officia culpa cillum labore elit do ad. Esse cillum dolor sunt anim ex elit ullamco qui enim eu. Cupidatat fugiat ea dolore do fugiat et minim occaecat laboris culpa. Cupidatat nostrud dolor deserunt in irure pariatur ut labore anim consequat. Dolor in anim culpa adipisicing cillum occaecat proident cupidatat voluptate occaecat ullamco amet. Laboris fugiat tempor ullamco non non commodo dolore officia deserunt sint cupidatat ea. Culpa qui excepteur duis ea voluptate irure deserunt do quis anim fugiat commodo aute laborum.\r\n",
    "registered": "2014-08-24T12:11:05 +05:00",
    "latitude": -2.632515,
    "longitude": -17.492363,
    "tags": [
      "commodo",
      "aliqua",
      "et",
      "velit",
      "excepteur",
      "deserunt",
      "culpa"
    ],
    "friends": [
      {
        "id": 0,
        "name": "Finch Russell"
      },
      {
        "id": 1,
        "name": "Stephanie Buckner"
      },
      {
        "id": 2,
        "name": "Rachelle Cox"
      }
    ],
    "greeting": "Hello, Cobb Bonner! You have 4 unread messages.",
    "favoriteFruit": "banana"
  },{
     ....
  }
]

Expected behavior:

  1. If there's a top-level "data" entry, load that.

  2. Internal objects or lists should be evaluated and not be represented as strings.

Example JSON:

{
    "data": [
         {
              "id": 1,
              "tags": [ "test 1", "test 2", "test 3" ]
         },{
              "id": 2,
              "tags": [ "test 4", "test 5", "test 6" ]
         }
    ]
}

Example column / data object:

  • data.id [1, 2]
  • data.tags [ "test 1", "test 2", "test 3", "test 4", "test 5", "test 6" ]
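
The expected behavior above can be sketched with the standard library alone. This is only an illustration of the requested flattening, not how the library loads JSON; `flatten_records` and its `root_key` parameter are hypothetical names.

```python
import json

def flatten_records(document, root_key="data"):
    """Collect dotted column names mapped to flat lists of values."""
    # Unwrap a top-level {"data": [...]} wrapper when present.
    if isinstance(document, dict) and isinstance(document.get(root_key), list):
        records, prefix = document[root_key], root_key
    elif isinstance(document, list):
        records, prefix = document, ""
    else:
        records, prefix = [document], ""
    columns = {}

    def visit(value, name):
        if isinstance(value, dict):
            # Nested objects extend the dotted column name.
            for key, child in value.items():
                visit(child, f"{name}.{key}" if name else key)
        elif isinstance(value, list):
            # Nested lists contribute their items to the same column.
            for child in value:
                visit(child, name)
        else:
            columns.setdefault(name, []).append(value)

    for record in records:
        visit(record, prefix)
    return columns

raw = ('{"data": [{"id": 1, "tags": ["test 1", "test 2", "test 3"]},'
       ' {"id": 2, "tags": ["test 4", "test 5", "test 6"]}]}')
columns = flatten_records(json.loads(raw))
```

Here `columns` holds one entry for `data.id` and one for `data.tags`, matching the two bullets above.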

Chardet has trouble identifying some file encodings

General Information:

  • Python version: 3.8.7
  • Library version: 0.3.4

Describe the bug:
Currently the library uses chardet to determine file encodings. However, chardet was relatively unmaintained until recently (Dec 2020), and it has trouble detecting some file encodings.

For example:
UTF-8 being detected as windows-1254: chardet/chardet#148

Additional context:
Potential fixes include:
https://hackernoon.com/how-i-used-python-to-solve-declareless-encoding-madness-hk1k42d3o
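
One possible mitigation for the UTF-8 misdetection case, sketched here under assumptions: try a strict decode against a short list of candidate encodings before falling back to a statistical detector. The function name `guess_encoding` and the candidate list are illustrative choices, not anything the library does today.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    # latin-1 maps every byte to a code point, so it never fails to decode.
    return "latin-1"
```

A successful strict decode does not prove the encoding is right, so this is best treated as a first pass rather than a replacement for detection.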

Detecting CSV integer header w/ description above it

General Information:

  • OS: Ubuntu 18.04
  • Python version: 3.8.7
  • Library version: 0.3.4

Describe the bug:
Currently, if there's a description above a header made of numbers, CSVData will not detect the integer header set.

Expected behavior:
The header should be detected at the proper line, despite a description being above it.

Reference to discussion: #56 (comment)
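
A simple heuristic for this case, sketched under assumptions and not taken from CSVData: treat the first row whose field count equals the most common field count as the header, so that a one-field description line is skipped even when the header is made of numbers. `guess_header_row` is a hypothetical helper.

```python
import csv
from collections import Counter

def guess_header_row(text, delimiter=","):
    """Return the index of the first row with the most common field count."""
    rows = list(csv.reader(text.splitlines(), delimiter=delimiter))
    counts = Counter(len(row) for row in rows if row)
    if not counts:
        return None
    typical_width = counts.most_common(1)[0][0]
    for index, row in enumerate(rows):
        if len(row) == typical_width:
            return index
    return None

sample = "Measurements taken in 2020\n1,2,3\n10,20,30\n40,50,60\n"
header_index = guess_header_row(sample)
```

The heuristic breaks down when the description itself contains as many delimiters as the data rows.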

Suppress / Remove / Fix warnings from tensorflow

Currently, I am seeing:

WARNING:tensorflow:5 out of the last 5 calls to <function recreate_function..restored_function_body at 0x7f27803a6790> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.

This appears every time a column is profiled / labeled. It would be nice if the warnings were suppressed and / or the underlying issue were fixed.
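
For the suppression half, a minimal sketch: the `WARNING:tensorflow:` prefix indicates the message goes through the standard Python logger named "tensorflow", so raising that logger's level should hide it. This only silences the message and does not address the retracing itself.

```python
import logging

# TensorFlow's Python-side messages use the standard logger named "tensorflow".
tf_logger = logging.getLogger("tensorflow")
tf_logger.setLevel(logging.ERROR)  # hide WARNING and below, keep errors
```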

null ratio, rows ingested may not accurately reflect the sampling providing an incorrect

Describe the bug:
self.rows_ingested = len(data) overwrites the rows ingested on every update, which means it does not allow for streaming data.

To Reproduce:

import pandas as pd
import dataprofiler as dp


data = pd.DataFrame([1,2,3,4])

profiler = dp.Profiler(data[:2])
profiler.update(data[2:])
assert profiler.rows_ingested == 4

Additionally, rows_ingested does not represent the rows sampled when computing the null parameters, which is described by each column's sample_size.

Since each column may get a differing sample set, this could cause issues because one column may have a higher sample count than another, e.g.:

# presuming > 1 col
list(profiler.profile.values())[0].sample_size 
# may not equal, (notice the column index change)
list(profiler.profile.values())[1].sample_size 
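
For the first point (rows_ingested being overwritten on each update), the fix would presumably be to accumulate rather than assign. A minimal sketch follows; `RowCounter` is a hypothetical stand-in for the profiler's bookkeeping, not the library's class.

```python
class RowCounter:
    """Hypothetical stand-in for the profiler's row bookkeeping."""

    def __init__(self):
        self.rows_ingested = 0

    def update(self, batch):
        # Accumulate instead of assigning len(batch), so earlier batches still count.
        self.rows_ingested += len(batch)

counter = RowCounter()
counter.update([1, 2])
counter.update([3, 4])
```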

NULLs are not being estimated when sampling

Offending function / line:

base_stats = {
    # TODO: Is this correct? used to be actual sample size, including
    # NANs, what now?
    "sample_size": total_sample_size,
    "null_count": total_na,
    "null_types": na_columns,
    "sample": random.sample(list(df_series.values),
                            min(len(df_series), 5))
}

The gist is that these should either be estimates or, if they are real counts, that should be made clear to the end user, probably with an explanation of the command they can use to get an estimate, get raw counts, and / or run on the full sample.

Incidentally, this is also one of the slowest functions, taking the most cumulative time (see the row marked ** below):

         4688319 function calls (4610366 primitive calls) in 2.778 seconds

   Ordered by: internal time
   List reduced from 1162 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    23976    0.403    0.000    0.559    0.000 numerical_column_stats.py:386(_get_percentile)
   755201    0.146    0.000    0.146    0.000 {method 'search' of 're.Pattern' objects}
       80    0.140    0.002    0.378    0.005 {pandas._libs.lib.map_infer_mask}
       **10    0.138    0.014    0.881    0.088 profile_builder.py:217(get_base_props_and_clean_null_params)**
   755160    0.092    0.000    0.237    0.000 object_array.py:120(<lambda>)
       20    0.088    0.004    0.216    0.011 utils.py:50(shuffle_in_chunks)
   107900    0.071    0.000    0.170    0.000 base_column_profilers.py:47(_combine_unique_sets)
      350    0.069    0.000    0.137    0.000 {pandas._libs.lib.map_infer}
    32993    0.068    0.000    0.068    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      222    0.056    0.000    0.056    0.000 {pandas._libs.lib.infer_dtype}
       10    0.054    0.005    0.220    0.022 text_column_profile.py:95(_update_vocab)
123750/123726    0.054    0.000    0.054    0.000 numerical_column_stats.py:85(__getattribute__)
   107880    0.052    0.000    0.119    0.000 random.py:174(randrange)
      168    0.051    0.000    0.114    0.001 numerical_column_stats.py:239(_total_histogram_bin_variance)
   107880    0.047    0.000    0.047    0.000 numerical_column_stats.py:545(is_int)
12798/10902    0.047    0.000    0.056    0.000 {built-in method numpy.array}
   107930    0.046    0.000    0.068    0.000 random.py:224(_randbelow)
       92    0.037    0.000    0.037    0.000 {built-in method pandas._libs.missing.isnaobj}
227832/227831    0.036    0.000    0.049    0.000 {built-in method builtins.isinstance}
   107880    0.034    0.000    0.034    0.000 numerical_column_stats.py:526(is_float)

The main time is the regex:

df_series_subset = df_series.iloc[sample_inds]
# Check if known null types exist in column
for na, flags in null_values_and_flags.items():
    # Check for the regex of the na in the string.
    reg_ex_na = f"^{na}$"
    matching_na_elements = df_series_subset.str.contains(
        reg_ex_na, flags=flags)
    for row, elem in matching_na_elements.items():
        if elem:
            # Since df_series_subset[row] is mutable,
            # need to make new var
            row_value = str(df_series_subset[row])
            na_columns.setdefault(row_value, list()).append(row)

Possible solution: Merge all the null searches into one query (as opposed to multiple queries), then split once the indexes are matched. I believe it's possible to get an 8x speedup here.
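
A minimal sketch of the merged search, using plain `re` rather than pandas and a made-up pattern table standing in for `null_values_and_flags`. Per-pattern case-insensitivity is kept with scoped inline flags, which assumes Python 3.6 or later. Note that `fullmatch` is slightly stricter than `^...$`, which also accepts a trailing newline. The speedup is not measured here.

```python
import re

# Hypothetical null patterns and flags, standing in for null_values_and_flags.
NULL_VALUES_AND_FLAGS = {"": 0, "nan": re.IGNORECASE, "none": re.IGNORECASE, "--*": 0}

def build_null_regex(values_and_flags):
    """Merge all null patterns into one compiled alternation."""
    parts = []
    for pattern, flags in values_and_flags.items():
        # A scoped inline flag keeps case-insensitivity local to this alternative.
        scope = "?i:" if flags & re.IGNORECASE else "?:"
        parts.append(f"({scope}{pattern})")
    return re.compile("|".join(parts))

def find_nulls(values, null_regex):
    """Map each matched null string to the row indices where it occurs."""
    na_columns = {}
    for row, value in enumerate(values):
        # One regex pass per value instead of one pass per null pattern.
        if null_regex.fullmatch(value):
            na_columns.setdefault(value, []).append(row)
    return na_columns

null_regex = build_null_regex(NULL_VALUES_AND_FLAGS)
found = find_nulls(["1.5", "NaN", "", "--", "none", "None-ish"], null_regex)
```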
