
capitalone / dataprofiler


What's in your data? Extract schema, statistics and entities from datasets

Home Page: https://capitalone.github.io/DataProfiler

License: Apache License 2.0


dataprofiler's Introduction


Data Profiler | What's in your data?

The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

Loading Data: with a single command, the library automatically formats and loads files into a DataFrame. Profiling the Data: the library identifies the schema, statistics, entities (PII / NPI), and more. Data Profiles can then be used in downstream applications or reports.

Getting started only takes a few lines of code (example csv):

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

Note: The Data Profiler comes with a pre-trained deep learning model, used to efficiently identify sensitive data (PII / NPI). If desired, it's easy to add new entities to the existing pre-trained model or insert an entire new pipeline for entity recognition.

For API documentation, visit the documentation page.

If you have suggestions or find a bug, please open an issue.

If you want to contribute, visit the contributing page.


Install

To install the full package from pypi: pip install DataProfiler[full]

If you want to install the ml dependencies without generating reports use DataProfiler[ml]

If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package with DataProfiler[reports]. The slimmer package disables the default sensitive data detection / entity recognition (labeler).

Install from pypi: pip install DataProfiler


What is a Data Profile?

In the case of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. There are "global statistics" or global_stats, which contain dataset level data and there are "column/row level statistics" or data_stats (each column is a new key-value entry).

The format for a structured profile is below:

"global_stats": {
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
    "correlation_matrix": list[list[int]], (*)
    "chi2_matrix": list[list[float]],
    "profile_schema": {
        string: list[int]
    },
    "times": dict[string, float],
},
"data_stats": [
    {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list[str],
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list[string],
            "null_types_index": {
                string: list[int]
            },
            "data_type_representation": dict[string, float],
            "min": [null, float, str],
            "max": [null, float, str],
            "mode": float,
            "median": float,
            "median_absolute_deviation": float,
            "sum": float,
            "mean": float,
            "variance": float,
            "stddev": float,
            "skewness": float,
            "kurtosis": float,
            "num_zeros": int,
            "num_negatives": int,
            "histogram": {
                "bin_counts": list[int],
                "bin_edges": list[float],
            },
            "quantiles": {
                int: float
            },
            "vocab": list[char],
            "avg_predictions": dict[string, float],
            "data_label_representation": dict[string, float],
            "categories": list[str],
            "unique_count": int,
            "unique_ratio": float,
            "categorical_count": dict[string, int],
            "gini_impurity": float,
            "unalikeability": float,
            "precision": {
                'min': int,
                'max': int,
                'mean': float,
                'var': float,
                'std': float,
                'sample_size': int,
                'margin_of_error': float,
                'confidence_level': float
            },
            "times": dict[string, float],
            "format": string
        },
        "null_replication_metrics": {
            "class_prior": list[int],
            "class_sum": list[list[int]],
            "class_mean": list[list[int]]
        }
    }
]

(*) Currently the correlation matrix update is toggled off. It will be re-enabled in a later update. Users can still turn it on by setting the is_enabled option to True.

The format for an unstructured profile is below:

"global_stats": {
    "samples_used": int,
    "empty_line_count": int,
    "file_type": string,
    "encoding": string,
    "memory_size": float, # in MB
    "times": dict[string, float],
},
"data_stats": {
    "data_label": {
        "entity_counts": {
            "word_level": dict[string, int],
            "true_char_level": dict[string, int],
            "postprocess_char_level": dict[string, int]
        },
        "entity_percentages": {
            "word_level": dict[string, float],
            "true_char_level": dict[string, float],
            "postprocess_char_level": dict[string, float]
        },
        "times": dict[string, float]
    },
    "statistics": {
        "vocab": list[char],
        "vocab_count": dict[string, int],
        "words": list[string],
        "word_count": dict[string, int],
        "times": dict[string, float]
    }
}

The format for a graph profile is below:

"num_nodes": int,
"num_edges": int,
"categorical_attributes": list[string],
"continuous_attributes": list[string],
"avg_node_degree": float,
"global_max_component_size": int,
"continuous_distribution": {
    "<attribute_1>": {
        "name": string,
        "scale": float,
        "properties": list[float, np.array]
    },
    "<attribute_2>": None,
    ...
},
"categorical_distribution": {
    "<attribute_1>": None,
    "<attribute_2>": {
        "bin_counts": list[int],
        "bin_edges": list[float]
    },
    ...
},
"times": dict[string, float]

Profile Statistic Descriptions

Structured Profile

global_stats:

  • samples_used - number of input data samples used to generate this profile
  • column_count - the number of columns contained in the input dataset
  • row_count - the number of rows contained in the input dataset
  • row_has_null_ratio - the proportion of rows that contain at least one null value to the total number of rows
  • row_is_null_ratio - the proportion of rows composed entirely of null values (null rows) to the total number of rows
  • unique_row_ratio - the proportion of distinct rows in the input dataset to the total number of rows
  • duplicate_row_count - the number of rows that occur more than once in the input dataset
  • file_type - the format of the file containing the input dataset (ex: .csv)
  • encoding - the encoding of the file containing the input dataset (ex: UTF-8)
  • correlation_matrix - matrix of shape column_count x column_count containing the correlation coefficients between each column in the dataset
  • chi2_matrix - matrix of shape column_count x column_count containing the chi-square statistics between each column in the dataset
  • profile_schema - a description of the format of the input dataset labeling each column and its index in the dataset
    • string - the label of the column in question and its index in the profile schema
  • times - the duration of time it took to generate the global statistics for this dataset in milliseconds
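
To make the ratio definitions above concrete, here is a minimal sketch that computes three of them by hand on a toy dataset. The `rows` data and the use of `None` as the only null marker are assumptions for illustration; this is not the library's implementation.

```python
# Hypothetical toy dataset: each inner list is a row, None marks a null cell.
rows = [
    [1, "a"],
    [2, None],
    [None, None],
    [1, "a"],
]

row_count = len(rows)

# Rows with at least one null value, relative to all rows.
row_has_null_ratio = sum(any(v is None for v in r) for r in rows) / row_count

# Rows made up entirely of null values, relative to all rows.
row_is_null_ratio = sum(all(v is None for v in r) for r in rows) / row_count

# Distinct rows relative to all rows.
unique_row_ratio = len({tuple(r) for r in rows}) / row_count

print(row_has_null_ratio, row_is_null_ratio, unique_row_ratio)
```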

data_stats:

  • column_name - the label/title of this column in the input dataset
  • data_type - the primitive python data type that is contained within this column
  • data_label - the label/entity of the data in this column as determined by the Labeler component
  • categorical - ‘true’ if this column contains categorical data
  • order - the way in which the data in this column is ordered, if any, otherwise “random”
  • samples - a small subset of data entries from this column
  • statistics - statistical information on the column
    • sample_size - number of input data samples used to generate this profile
    • null_count - the number of null entries in the sample
    • null_types - a list of the different null types present within this sample
    • null_types_index - a dict mapping each null type to the list of indices at which it appears in this sample
    • data_type_representation - the percentage of samples used identifying as each data_type
    • min - minimum value in the sample
    • max - maximum value in the sample
    • mode - mode of the entries in the sample
    • median - median of the entries in the sample
    • median_absolute_deviation - the median absolute deviation of the entries in the sample
    • sum - the total of all sampled values from the column
    • mean - the average of all entries in the sample
    • variance - the variance of all entries in the sample
    • stddev - the standard deviation of all entries in the sample
    • skewness - the statistical skewness of all entries in the sample
    • kurtosis - the statistical kurtosis of all entries in the sample
    • num_zeros - the number of entries in this sample that have the value 0
    • num_negatives - the number of entries in this sample that have a value less than 0
    • histogram - contains histogram relevant information
      • bin_counts - the number of entries within each bin
      • bin_edges - the thresholds of each bin
    • quantiles - the value at each percentile in the order they are listed based on the entries in the sample
    • vocab - a list of the characters used within the entries in this sample
    • avg_predictions - average of the data label prediction confidences across all data points sampled
    • categories - a list of each distinct category within the sample if categorical = 'true'
    • unique_count - the number of distinct entries in the sample
    • unique_ratio - the proportion of the number of distinct entries in the sample to the total number of entries in the sample
    • categorical_count - number of entries sampled for each category if categorical = 'true'
    • gini_impurity - measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset
    • unalikeability - a value denoting how frequently entries differ from one another within the sample
    • precision - a dict of statistics with respect to the number of digits in a number for each sample
    • times - the duration of time it took to generate this sample's statistics in milliseconds
    • format - list of possible datetime formats
  • null_replication_metrics - statistics of data partitioned based on whether column value is null (index 1 of lists referenced by dict keys) or not (index 0)
    • class_prior - a list containing probability of a column value being null and not null
    • class_sum - a list containing the sum of all other rows based on whether the column value is null or not
    • class_mean - a list containing the mean of all other rows based on whether the column value is null or not
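
As a rough illustration of two of the less familiar column statistics above, the sketch below computes Gini impurity and one common definition of unalikeability (the share of ordered pairs of entries that differ). The function names and formulas here are assumptions chosen for illustration and may not match how the library computes these values.

```python
from collections import Counter


def gini_impurity(values):
    """1 minus the sum of squared category proportions."""
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())


def unalikeability(values):
    """Fraction of ordered pairs (i != j) whose entries differ."""
    n = len(values)
    if n < 2:
        return 0.0
    same_pairs = sum(count * count for count in Counter(values).values())
    return (n * n - same_pairs) / (n * n - n)


sample = ["a", "a", "b", "b"]
print(gini_impurity(sample), unalikeability(sample))
```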

Unstructured Profile

global_stats:

  • samples_used - number of input data samples used to generate this profile
  • empty_line_count - the number of empty lines in the input data
  • file_type - the file type of the input data (ex: .txt)
  • encoding - file encoding of the input data file (ex: UTF-8)
  • memory_size - size of the input data in MB
  • times - duration of time it took to generate this profile in milliseconds

data_stats:

  • data_label - labels and statistics on the labels of the input data
    • entity_counts - the number of times a specific label or entity appears inside the input data
      • word_level - the number of words counted within each label or entity
      • true_char_level - the number of characters counted within each label or entity as determined by the model
      • postprocess_char_level - the number of characters counted within each label or entity as determined by the postprocessor
    • entity_percentages - the percentages of each label or entity within the input data
      • word_level - the percentage of words in the input data that are contained within each label or entity
      • true_char_level - the percentage of characters in the input data that are contained within each label or entity as determined by the model
      • postprocess_char_level - the percentage of characters in the input data that are contained within each label or entity as determined by the postprocessor
    • times - the duration of time it took for the data labeler to predict on the data
  • statistics - statistics of the input data
    • vocab - a list of each character in the input data
    • vocab_count - the number of occurrences of each distinct character in the input data
    • words - a list of each word in the input data
    • word_count - the number of occurrences of each distinct word in the input data
    • times - the duration of time it took to generate the vocab and words statistics in milliseconds
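
The vocab and word statistics above can be pictured with a small standard-library sketch. It assumes whitespace tokenization and excludes whitespace from the character counts; the library may tokenize and filter differently.

```python
from collections import Counter

text = "to be or not to be"

# Occurrences of each distinct word (whitespace tokenization is an assumption).
word_count = Counter(text.split())
words = list(word_count)

# Occurrences of each distinct non-whitespace character.
vocab_count = Counter(ch for ch in text if not ch.isspace())
vocab = list(vocab_count)

print(word_count)
print(vocab_count)
```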

Graph Profile

  • num_nodes - number of nodes in the graph
  • num_edges - number of edges in the graph
  • categorical_attributes - list of categorical edge attributes
  • continuous_attributes - list of continuous edge attributes
  • avg_node_degree - average degree of nodes in the graph
  • global_max_component_size - size of the global max component

continuous_distribution:

  • <attribute_N>: name of N-th edge attribute in list of attributes
    • name - name of distribution for attribute
    • scale - negative log likelihood used to scale and compare distributions
    • properties - list of statistical properties describing the distribution
      • [shape (optional), loc, scale, mean, variance, skew, kurtosis]

categorical_distribution:

  • <attribute_N>: name of N-th edge attribute in list of attributes

    • bin_counts: counts in each bin of the distribution histogram
    • bin_edges: edges of each bin of the distribution histogram
  • times - duration of time it took to generate this profile in milliseconds
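
For intuition about the node-level graph statistics above, the following sketch derives them from a hypothetical undirected edge list using only the standard library. Treating the graph as undirected and reading "global max component" as the largest connected component are assumptions.

```python
from collections import defaultdict, deque

# Hypothetical undirected edge list.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

num_nodes = len(adjacency)
num_edges = len(edges)
avg_node_degree = sum(len(nbrs) for nbrs in adjacency.values()) / num_nodes


def largest_component_size(adjacency):
    """Breadth-first search over every unvisited node; keep the largest size."""
    seen, best = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, size)
    return best


global_max_component_size = largest_component_size(adjacency)
```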

Support

Supported Data Formats

  • Any delimited file (CSV, TSV, etc.)
  • JSON object
  • Avro file
  • Parquet file
  • Text file
  • Pandas DataFrame
  • A URL that points to one of the supported file types above

Data Types

Data Types are determined at the column level for structured data

  • Int
  • Float
  • String
  • DateTime

Data Labels

Data Labels are determined per cell for structured data (column/row when the profiler is used) or at the character level for unstructured data.

  • UNKNOWN
  • ADDRESS
  • BAN (bank account number, 10-18 digits)
  • CREDIT_CARD
  • EMAIL_ADDRESS
  • UUID
  • HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
  • IPV4
  • IPV6
  • MAC_ADDRESS
  • PERSON
  • PHONE_NUMBER
  • SSN
  • URL
  • US_STATE
  • DRIVERS_LICENSE
  • DATE
  • TIME
  • DATETIME
  • INTEGER
  • FLOAT
  • QUANTITY
  • ORDINAL

Get Started

Load a File

The Data Profiler can profile the following data/file types:

  • CSV file (or any delimited file)
  • JSON object
  • Avro file
  • Parquet file
  • Text file
  • Pandas DataFrame
  • A URL that points to one of the supported file types above

The profiler should automatically identify the file type and load the data into a Data Class.

Along with other attributes, the Data class enables data to be accessed via a valid Pandas DataFrame.

from dataprofiler import Data

# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv')

# Print the first 10 rows of the csv file
print(csv_data.data.head(10))

# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')

# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)

# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))

# Load a json file from a URL, return a JSONData object
json_data = Data('https://github.com/capitalone/DataProfiler/blob/main/dataprofiler/tests/data/json/iris-utf-8.json')

If the file type is not automatically identified (rare), you can specify it explicitly; see the section Specifying a Filetype or Delimiter.

Profile a File

The example below uses a CSV file, but JSON, Avro, Parquet, or Text files also work.

import json
from dataprofiler import Data, Profiler

# Load file (CSV should be automatically identified)
data = Data("your_file.csv")

# Profile the dataset
profile = Profiler(data)

# Generate a report and use json to prettify.
report  = profile.report(report_options={"output_format": "pretty"})

# Print the report
print(json.dumps(report, indent=4))

Updating Profiles

Currently, the data profiler is equipped to update its profile in batches.

import json
from dataprofiler import Data, Profiler

# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)

# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)

# Print the report using json to prettify.
report  = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Note that if the data you update the profile with contains integer indices that overlap with the indices on data originally profiled, when null rows are calculated the indices will be "shifted" to uninhabited values so that null counts and ratios are still accurate.

Merging Profiles

If you have two files with the same schema (but different data), it is possible to merge the two profiles together via an addition operator.

This also enables profiles to be determined in a distributed manner.

import json
from dataprofiler import Data, Profiler

# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)

# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)

profile3 = profile1 + profile2

# Print the report using json to prettify.
report  = profile3.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Note that if merged profiles had overlapping integer indices, when null rows are calculated the indices will be "shifted" to uninhabited values so that null counts and ratios are still accurate.
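
To show why an addition operator is enough for distributed profiling, here is a minimal sketch of mergeable summary statistics: each partial summary keeps a count, a mean, and a sum of squared deviations, and two summaries combine exactly without revisiting the raw data. The `RunningStats` class is hypothetical and only illustrates the idea; it is not how the library's profiles are implemented.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values):
        stats = cls()
        for x in values:  # Welford's single-pass update
            stats.count += 1
            delta = x - stats.mean
            stats.mean += delta / stats.count
            stats.m2 += delta * (x - stats.mean)
        return stats

    def __add__(self, other):
        total = self.count + other.count
        if total == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / total
        return RunningStats(total, mean, m2)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two values.
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


part_a = RunningStats.from_values([1, 2, 3])
part_b = RunningStats.from_values([10, 20])
merged = part_a + part_b
```

Merging `part_a` and `part_b` gives the same count, mean, and variance as summarizing all five values in one pass, which is the property a distributed workflow relies on.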

Profiler Differences

To find the change between profiles with the same schema, use the profile's diff function. The diff provides overall file and sampling differences as well as detailed differences in the data's statistics. For example, numerical columns have both a t-test to evaluate similarity and PSI (Population Stability Index) to quantify column distribution shift. More information is available in the Profiler section of the GitHub Pages documentation.

Create the difference report like this:

import json
import dataprofiler as dp

# Load a CSV file
data1 = dp.Data("file_a.csv")
profile1 = dp.Profiler(data1)

# Load another CSV file
data2 = dp.Data("file_b.csv")
profile2 = dp.Profiler(data2)

diff_report = profile1.diff(profile2)
print(json.dumps(diff_report, indent=4))
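
For readers unfamiliar with the PSI mentioned above, this sketch computes the textbook Population Stability Index from two histograms that share the same bins. The `psi` function and the `eps` floor for empty bins are assumptions for illustration; the library's own binning and smoothing may differ.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the proportions so empty bins do not produce log(0).
        e_pct = max(expected / expected_total, eps)
        a_pct = max(actual / actual_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


print(psi([10, 20, 30], [10, 20, 30]))  # identical distributions
print(psi([50, 50], [25, 75]))          # shifted distribution
```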

Profile a Pandas DataFrame

import pandas as pd
import dataprofiler as dp
import json

my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]])
profile = dp.Profiler(my_dataframe)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

# read a specified column, in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))

Unstructured Profiler

In addition to the structured profiler, DataProfiler provides unstructured profiling for the TextData object or a string. The unstructured profiler also works with list[string], pd.Series(string), or pd.DataFrame(string) when the profiler_type option is set to unstructured. Below is an example of the unstructured profiler with a text file.

import dataprofiler as dp
import json

my_text = dp.Data('text_file.txt')
profile = dp.Profiler(my_text)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Below is another example of the unstructured profiler, using a pd.Series of strings with the profiler option profiler_type='unstructured':

import dataprofiler as dp
import pandas as pd
import json

text_data = pd.Series(['first string', 'second string'])
profile = dp.Profiler(text_data, profiler_type='unstructured')

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Graph Profiler

DataProfiler also provides the ability to profile graph data from a csv file. Below is an example of the graph profiler with a graph data csv file:

import dataprofiler as dp
import pprint

my_graph = dp.Data('graph_file.csv')
profile = dp.Profiler(my_graph)

# print the report using pretty print (json dump does not work on numpy array values inside dict)
report = profile.report()
printer = pprint.PrettyPrinter(sort_dicts=False, compact=True)
printer.pprint(report)

Visit the documentation page for additional Examples and API details

References

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions
Authors: Anh Truong, Austin Walters, Jeremy Goodsitt
2020 https://arxiv.org/abs/2012.09597
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services

dataprofiler's People

Contributors

anhtruong, az85252, chriswallace2020, dependabot[bot], gautomdas, gliptak, grant-eden, granteden, jakleh, jgsweets, joshuart, junholee6a, kshitijavis, ksneab7, lettergram, mend-for-github-com[bot], micdavis, misterpnp, neilkg, rxm7706, sagars729, sanketh7, scottiegarcia, stefanycoimbra, stevensecreti, ta7ar, taylorfturner, tmbjmu, tonywu315, vindhyanairlj

dataprofiler's Issues

Null rows are being incorrectly calculated when sampling

General Information:

  • OS: Arch Linux
  • Python version: 3.7
  • Library version: 0.4.2

Describe the bug:

Because indexes are randomly generated per column

sample_ind_generator = utils.shuffle_in_chunks(
len_df, chunk_size=sample_size)

Those indexes are random and may not align with one another. They have to align to get correct results. The inaccurate results would be: row_has_null and row_is_null

To Reproduce:

Every time the data profiler is run.

Expected behavior:

The solution is to:

  1. Generate all the random indices for sampling first.
  2. Have each column profile stop once it obtains the number of samples (min_true_samples) it needs, and return / store that location.
  3. Intersect the indices up to the smallest number of samples needed for a given column; this should give the row_is_null calculation.

As part of this PR, tests should be written to check the row_has_null and row_is_null counts.
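
The core of the proposed fix, sampling one shared set of row indices and reusing it for every column, can be sketched as follows. The helper names and the toy `columns` data are hypothetical, and the sketch leaves out the min_true_samples early-stopping step.

```python
import random


def shared_sample_indices(num_rows, sample_size, seed=0):
    """Draw the row indices once so every column is sampled at the same rows."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), min(sample_size, num_rows)))


def null_row_sets(columns, indices):
    """Return (rows null in every column, rows null in at least one column)."""
    per_column = [
        {i for i in indices if values[i] is None} for values in columns.values()
    ]
    return set.intersection(*per_column), set.union(*per_column)


# Hypothetical two-column dataset; None marks a null cell.
columns = {
    "a": [1, None, 3, None, 5, 6],
    "b": ["x", None, None, None, "y", "z"],
}

indices = shared_sample_indices(num_rows=6, sample_size=4)
row_is_null, row_has_null = null_row_sets(columns, indices)
```

Because both columns are inspected at the same indices, the intersection and union of their null positions are meaningful row-level counts.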

Report issue - Error message shown when all options except data labeler are disabled

General Information:

  • OS:
  • Python version:
  • Library version:

Describe the bug:
Get the following error with the report

Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data

To Reproduce:
Use the following code with a csv file including 16 columns

# set option to run only data labeler
profile_options = dp.ProfilerOptions()
profile_options.set({"text.is_enabled": False, 
                     "int.is_enabled": False, 
                     "float.is_enabled": False, 
                     "order.is_enabled": False, 
                     "category.is_enabled": False, 
                     "datetime.is_enabled": False,})

profile = dp.Profiler(data, profiler_options=profile_options)

human_readable_report = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(human_readable_report, indent=4))

Expected behavior:

Screenshots:

Additional context:

Delimiter detection uses digits in a number

General Information:

  • OS: Arch Linux
  • Python version: 3.9
  • Library version: 0.3.4

Describe the bug:

File:

-123
234
345
534
231

Delimiter detected: 3

Delimiter should probably not detect inside numbers
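
One way to express the suggested behavior is a heuristic that never considers letters or digits as delimiter candidates and only accepts a character that appears the same non-zero number of times on every line. The `guess_delimiter` function below is a hypothetical sketch, not the library's detection logic.

```python
def guess_delimiter(text, candidates=",\t;|: "):
    """Return the most frequent consistent, non-alphanumeric candidate, or None."""
    lines = [line for line in text.splitlines() if line]
    if not lines:
        return None
    best, best_count = None, 0
    for ch in candidates:
        if ch.isalnum():
            continue  # digits and letters are never treated as delimiters
        counts = {line.count(ch) for line in lines}
        if len(counts) == 1:  # same number of occurrences on every line
            n = counts.pop()
            if n > best_count:
                best, best_count = ch, n
    return best


numbers_only = "-123\n234\n345\n534\n231\n"
print(guess_delimiter(numbers_only))        # no delimiter found
print(guess_delimiter("a,b,c\n1,2,3\n"))
```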

null ratio, rows ingested may not accurately reflect the sampling providing an incorrect

Describe the bug:
self.rows_ingested = len(data) overwrites the rows ingested each time which means it does not allow for streaming data.

To Reproduce:

import pandas as pd
import dataprofiler as dp


data = pd.DataFrame([1,2,3,4])

profiler = dp.Profiler(data[:2])
profiler.update(data[2:])
assert profiler.rows_ingested == 4

Additionally, these rows ingested don't represent the rows sampled during acquisition of the null parameters which is described by the column's samples.

Since each column may get a different sample set, this could cause issues because one column may have a higher sample count than another, e.g.

# presuming > 1 col
list(profiler.profile.values())[0].sample_size 
# may not equal, (notice the column index change)
list(profiler.profile.values())[1].sample_size 
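
The streaming behavior asked for above amounts to accumulating the row count on each update instead of overwriting it. A minimal sketch, using a hypothetical `IngestCounter` class rather than the library's Profiler:

```python
class IngestCounter:
    """Tracks how many rows have been seen across all batches."""

    def __init__(self):
        self.rows_ingested = 0

    def update(self, batch):
        # Accumulate rather than assign, so earlier batches are not forgotten.
        self.rows_ingested += len(batch)


data = [1, 2, 3, 4]
counter = IngestCounter()
counter.update(data[:2])
counter.update(data[2:])
```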

Identifying & Loading JSON files

General Information:

  • OS: Arch Linux
  • Python version: 3.7
  • Library version: 0.4.1

Describe the bug:

JSON files are not loading as I would expect. Loading works well for a list of objects, but not when there are embedded objects. Further, if there is a JSON file such as { 'data': [ { }, { }, { } ] }, the library will treat [ { }, { }, { } ] as one giant string.

To Reproduce:

JSON files come in a variety of formats, some datasets I tried:

[
  {
    "_id": "605d673b20b4132093890d7f",
    "index": 0,
    "guid": "4582d945-a7c7-4605-a335-255c04fb701d",
    "isActive": false,
    "balance": "$3,822.52",
    "picture": "http://placehold.it/32x32",
    "age": 26,
    "eyeColor": "green",
    "name": "Cobb Bonner",
    "gender": "male",
    "company": "SCENTRIC",
    "email": "[email protected]",
    "phone": "+1 (887) 582-3501",
    "address": "712 Ferris Street, Marysville, New York, 1006",
    "about": "Velit aliquip duis id ut officia culpa cillum labore elit do ad. Esse cillum dolor sunt anim ex elit ullamco qui enim eu. Cupidatat fugiat ea dolore do fugiat et minim occaecat laboris culpa. Cupidatat nostrud dolor deserunt in irure pariatur ut labore anim consequat. Dolor in anim culpa adipisicing cillum occaecat proident cupidatat voluptate occaecat ullamco amet. Laboris fugiat tempor ullamco non non commodo dolore officia deserunt sint cupidatat ea. Culpa qui excepteur duis ea voluptate irure deserunt do quis anim fugiat commodo aute laborum.\r\n",
    "registered": "2014-08-24T12:11:05 +05:00",
    "latitude": -2.632515,
    "longitude": -17.492363,
    "tags": [
      "commodo",
      "aliqua",
      "et",
      "velit",
      "excepteur",
      "deserunt",
      "culpa"
    ],
    "friends": [
      {
        "id": 0,
        "name": "Finch Russell"
      },
      {
        "id": 1,
        "name": "Stephanie Buckner"
      },
      {
        "id": 2,
        "name": "Rachelle Cox"
      }
    ],
    "greeting": "Hello, Cobb Bonner! You have 4 unread messages.",
    "favoriteFruit": "banana"
  },{
     ....
  }
]

Expected behavior:

  1. If there's a top level "data" entry, to load that

  2. Internal objects or lists should be evaluated and not be represented as strings.

Example JSON:

{
    "data": [
         {
              "id": 1,
              "tags": [ "test 1", "test 2", "test 3" ]
         },{
              "id": 2,
              "tags": [ "test 4", "test 5", "test 6" ]
         }
    ]
}

Example column / data object:

  • data.id [1, 2]
  • data.tags [ "test 1", "test 2", "test 3", "test 4", "test 5", "test 6" ]
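
The expected column layout above can be reproduced with a short standard-library sketch. The `load_columns` helper is hypothetical; it unwraps a top-level "data" entry and extends a column with list values instead of stringifying them. It does not recurse into nested objects.

```python
import json


def load_columns(raw, top_key="data"):
    """Flatten a list of JSON records into {column_name: [values]}."""
    doc = json.loads(raw)
    if isinstance(doc, dict) and top_key in doc:
        records, prefix = doc[top_key], top_key + "."
    else:
        records, prefix = doc, ""
    columns = {}
    for record in records:
        for key, value in record.items():
            column = columns.setdefault(prefix + key, [])
            if isinstance(value, list):
                column.extend(value)  # keep list items as separate values
            else:
                column.append(value)
    return columns


raw = (
    '{"data": ['
    '{"id": 1, "tags": ["test 1", "test 2", "test 3"]},'
    '{"id": 2, "tags": ["test 4", "test 5", "test 6"]}'
    "]}"
)
columns = load_columns(raw)
```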

Allow user to specify null values

Is your feature request related to a problem? Please describe.
I cannot specify what values should be considered null in my dataset

Describe the outcome you'd like:
In options, I want to specify what represents a null in my dataset.
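
As a sketch of what such an option could look like from the user's side, the function below counts nulls against a caller-supplied set of null markers. The `null_count` function, its `null_values` parameter, and the default markers are hypothetical illustrations of the request, not an existing library option.

```python
DEFAULT_NULLS = frozenset({"", "nan", "none", "null"})


def null_count(values, null_values=DEFAULT_NULLS):
    """Count entries that are None or match a null marker (case-insensitive)."""
    markers = {marker.lower() for marker in null_values}
    return sum(
        1 for value in values
        if value is None or str(value).strip().lower() in markers
    )


column = ["1", "-999", "N/A", None]
print(null_count(column))                               # defaults only
print(null_count(column, null_values={"-999", "n/a"}))  # user-specified markers
```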

Remove TensorFlow Addons (TFA) and utilize TF nightly to support python 3.9

TensorFlow addons (TFA) are the primary reason the library cannot be upgraded.

TFA is utilized in a single location in the code:

# Use TFA to add f1 score to output

It should be possible to create an F1 score metric function that doesn't require TFA:

https://stackoverflow.com/questions/64474463/custom-f1-score-metric-in-tensorflow

Once removed, tensorflow nightly would likely suffice and the library could then work on python 3.9.

Would we like to do this?
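
For reference, the F1 score itself needs nothing beyond counts of true positives, false positives, and false negatives. The plain-Python sketch below shows the formula for a single positive class; it is not a TensorFlow/Keras metric and is only meant to illustrate what a TFA-free replacement would have to compute.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```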

Save & Load profiles from disk

Is your feature request related to a problem? Please describe.

I'd like to be able to do: profile.save(filename=<optional>) and Profiler.load(filename=<profile_filename>), after which I should be able to do profile.update(data) or profile.report().

This would be an amazing feature, as it would enable distributed profile generation and merging.
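
Until such methods exist, a report dictionary can at least be round-tripped through JSON, as in the sketch below. The `save_profile` and `load_profile` helpers are hypothetical, and they persist only a report dict, not the profiler state needed to keep updating a profile.

```python
import json
import os
import tempfile


def save_profile(report, filename):
    """Write a JSON-serializable report dict to disk."""
    with open(filename, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=4)


def load_profile(filename):
    """Read a report dict back from disk."""
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical report fragment.
report = {"global_stats": {"samples_used": 4, "column_count": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "profile.json")
    save_profile(report, path)
    restored = load_profile(path)
```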

Training on new data

While training on new data in Colab, the following error appears.

default_label_error

When I change any one of the column names to "BACKGROUND", the labeler trains and gives the following output.

gives_error_but_model_saves

To reproduce the errors, please use the following CSV files:
datasets.zip

And while Predicting the Labeler gives prediction for each cell. How to Aggregate them to column level. Please help me in this
Thank you

Profiling is slow

General Information:

  • OS: Arch Linux
  • Python version: 3.7
  • Library version: 0.3.4

Describe the bug:

Profiling diamonds.csv (2.5Mb), or really any dataset, is much slower than expected, even with the data labeler disabled.

To Reproduce:

The following code takes:

  • Base run: 38 seconds and ~785Mb

  • No labeler: 24.48 seconds to execute and ~67Mb

  • No labeler, no histogram: 23.47 seconds to execute and ~67Mb.

  • No labeler, no datetime: 23.35 seconds and ~67Mb

import sys
import json
import time
import dataprofiler as dp

filename = sys.argv[1]

def profile_test(filename):
    data = dp.Data(filename)

    profile_options = dp.ProfilerOptions()
    profile_options.structured_options.data_labeler.is_enabled = False                       
    profile_options.set({"histogram_and_quantiles.is_enabled": False})                       
    profile_options.set({"datetime.is_enabled": False})  

    profile = dp.Profiler(data, profiler_options=profile_options)

    human_readable_report = profile.report(report_options={"output_format":"pretty"})

    print(json.dumps(human_readable_report, indent=4))

start_time = time.time()
profile_test(filename)
end_time = time.time()

print("Profile runtime for "+filename, end_time-start_time, 'seconds')

In the output, nothing in the "times" entries explains why this is occurring:

{
    "global_stats": {
        "samples_used": 10788,
        "column_count": 10,
        "unique_row_ratio": 0.9973,
        "row_has_null_ratio": 0.0,
        "duplicate_row_count": 146,
        "file_type": "csv",
        "encoding": "utf-8",
        "data_classification": null,
        "covariance": null
    },
    "data_stats": {
        "carat": {
            "column_name": "carat",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['2.52', '1', '0.55', '1.75', '2.01']",
            "statistics": {
                "min": 0.2,
                "max": 4.01,
                "mean": 0.7976,
                "median": null,
                "variance": 0.2244,
                "stddev": 0.4738,
                "histogram": {
                    "bin_counts": "[  70,  205,  606, 1224,  443, ... , 0, 0, 0, 0, 1]",
                    "bin_edges": "[0.2       , 0.23663462, ... , 3.97336538, 4.01      ]"
                },
                "quantiles": {
                    "0": 0.3832,
                    "1": 0.6762,
                    "2": 1.006
                },
                "times": {
                    "precision": 0.0027,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0263
                },
                "precision": 2,
                "unique_count": 230,
                "unique_ratio": 0.0213,
                "categories": "['0.29', '0.31', '0.26', ... , '0.21', '0.57', '0.69']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0336,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "cut": {
            "column_name": "cut",
            "data_type": "string",
            "categorical": true,
            "order": "random",
            "samples": "['Ideal', 'Very Good', 'Ideal', 'Premium', 'Good']",
            "statistics": {
                "min": 4.0,
                "max": 9.0,
                "mean": 6.2944,
                "median": null,
                "variance": 3.1061,
                "stddev": 1.7624,
                "histogram": {
                    "bin_counts": "[1340,    0,    0,    0, ... ,    0,    0,    0, 2454]",
                    "bin_edges": "[4.        , 4.04807692, ... , 8.95192308, 9.        ]"
                },
                "quantiles": {
                    "0": 4.9615,
                    "1": 4.9615,
                    "2": 6.9808
                },
                "vocab": "['P', 'r', 'e', 'm', 'i', ... , 'd', 'I', 'a', 'l', 'F']",
                "times": {
                    "vocab": 0.0239,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0304
                },
                "unique_count": 5,
                "unique_ratio": 0.0005,
                "categories": "['Premium', 'Very Good', 'Ideal', 'Good', 'Fair']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0,
                    "float": 0.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "color": {
            "column_name": "color",
            "data_type": "string",
            "categorical": true,
            "order": "random",
            "samples": "['D', 'D', 'E', 'G', 'D']",
            "statistics": {
                "min": 1.0,
                "max": 1.0,
                "mean": 1.0,
                "median": null,
                "variance": 0.0,
                "stddev": 0.0,
                "histogram": {
                    "bin_counts": "[10788]",
                    "bin_edges": "[1., 1.]"
                },
                "quantiles": {
                    "0": 1.0,
                    "1": 1.0,
                    "2": 1.0
                },
                "vocab": "['I', 'E', 'J', 'H', 'D', 'G', 'F']",
                "times": {
                    "vocab": 0.0147,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0113
                },
                "unique_count": 7,
                "unique_ratio": 0.0006,
                "categories": "['I', 'E', 'J', 'H', 'D', 'G', 'F']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0,
                    "float": 0.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "clarity": {
            "column_name": "clarity",
            "data_type": "string",
            "categorical": true,
            "order": "random",
            "samples": "['VS1', 'SI2', 'SI1', 'VS1', 'SI1']",
            "statistics": {
                "min": 2.0,
                "max": 4.0,
                "mean": 3.1212,
                "median": null,
                "variance": 0.1989,
                "stddev": 0.446,
                "histogram": {
                    "bin_counts": "[498,   0,   0,   0,   0, ... ,    0,    0,    0,    0, 1806]",
                    "bin_edges": "[2.        , 2.01923077, ... , 3.98076923, 4.        ]"
                },
                "quantiles": {
                    "0": 3.0,
                    "1": 3.0,
                    "2": 3.0
                },
                "vocab": "['V', 'S', '1', 'I', '2', 'F']",
                "times": {
                    "vocab": 0.0149,
                    "min": 0.0001,
                    "max": 0.0001,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0322
                },
                "unique_count": 8,
                "unique_ratio": 0.0007,
                "categories": "['VS1', 'SI1', 'VVS2', 'VS2', ... , 'SI2', 'VVS1', 'IF', 'I1']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0,
                    "float": 0.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "depth": {
            "column_name": "depth",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['62.5', '62.4', '63.9', '63.4', '61']",
            "statistics": {
                "min": 43.0,
                "max": 79.0,
                "mean": 61.7492,
                "median": null,
                "variance": 2.0652,
                "stddev": 1.4371,
                "histogram": {
                    "bin_counts": "[1, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 1]",
                    "bin_edges": "[43.        , 43.13533835, ... , 78.86466165, 79.        ]"
                },
                "quantiles": {
                    "0": 61.0,
                    "1": 61.6767,
                    "2": 62.4887
                },
                "times": {
                    "precision": 0.0026,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0653
                },
                "precision": 1,
                "unique_count": 136,
                "unique_ratio": 0.0126,
                "categories": "['61.5', '62', '63.4', '61', ... , '54.3', '55.3', '57', '79']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.099,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "table": {
            "column_name": "table",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['54', '59', '56', '57', '56']",
            "statistics": {
                "min": 51.0,
                "max": 95.0,
                "mean": 57.4651,
                "median": null,
                "variance": 5.0778,
                "stddev": 2.2534,
                "histogram": {
                    "bin_counts": "[ 5,  0,  0, 16,  0,  0,  0, ... , 0, 0, 0, 0, 0, 0, 1]",
                    "bin_edges": "[51.        , 51.26993865, ... , 94.73006135, 95.        ]"
                },
                "quantiles": {
                    "0": 55.7239,
                    "1": 56.8037,
                    "2": 58.6933
                },
                "times": {
                    "precision": 0.0019,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0296
                },
                "precision": 1,
                "unique_count": 91,
                "unique_ratio": 0.0084,
                "categories": "['58', '57', '61', '54', ... , '58.9', '62.5', '60.7', '61.6']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.9819,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "price": {
            "column_name": "price",
            "data_type": "int",
            "categorical": false,
            "order": "random",
            "samples": "['2655', '3587', '1341', '991', '16215']",
            "statistics": {
                "min": 334.0,
                "max": 18795.0,
                "mean": 3903.6178,
                "median": null,
                "variance": 15978757.826,
                "stddev": 3997.3438,
                "histogram": {
                    "bin_counts": "[436, 959, 965, 757, 481, ... , 13, 14, 10, 11, 14]",
                    "bin_edges": "[334.        , 511.50961538, ... , 18617.49038462, 18795.        ]"
                },
                "quantiles": {
                    "0": 866.5288,
                    "1": 2286.6058,
                    "2": 5126.7596
                },
                "times": {
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0257
                },
                "unique_count": 5349,
                "unique_ratio": 0.4958,
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 1.0,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "x": {
            "column_name": "x",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['5.7', '4.74', '7.3', '4.27', '6.88']",
            "statistics": {
                "min": 0.0,
                "max": 10.02,
                "mean": 5.7337,
                "median": null,
                "variance": 1.2481,
                "stddev": 1.1172,
                "histogram": {
                    "bin_counts": "[2, 0, 0, 0, 0, 0, 0, 0, ... , 2, 1, 1, 0, 0, 0, 0, 1]",
                    "bin_edges": "[0.        , 0.09634615, ... ,  9.92365385, 10.02      ]"
                },
                "quantiles": {
                    "0": 4.6246,
                    "1": 5.6844,
                    "2": 6.4552
                },
                "times": {
                    "precision": 0.0026,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0385
                },
                "precision": 2,
                "unique_count": 503,
                "unique_ratio": 0.0466,
                "categories": "['3.87', '3.93', '4.21', ... , '5.54', '3.92', '3.9']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.005,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "y": {
            "column_name": "y",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['5.13', '4.18', '6.37', '6.18', '6.41']",
            "statistics": {
                "min": 3.71,
                "max": 31.8,
                "mean": 5.724,
                "median": null,
                "variance": 1.3022,
                "stddev": 1.1412,
                "histogram": {
                    "bin_counts": "[   8,   93,  157,  840, 1031, ... , 0, 0, 0, 0, 1]",
                    "bin_edges": "[3.71      , 3.87426901, ... , 31.63573099, 31.8       ]"
                },
                "quantiles": {
                    "0": 4.6135,
                    "1": 5.5991,
                    "2": 6.4204
                },
                "times": {
                    "precision": 0.0027,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0001,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0262
                },
                "precision": 2,
                "unique_count": 496,
                "unique_ratio": 0.046,
                "categories": "['3.96', '3.78', '3.9', ... , '3.82', '3.92', '31.8']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0056,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        },
        "z": {
            "column_name": "z",
            "data_type": "float",
            "categorical": true,
            "order": "random",
            "samples": "['4.23', '2.69', '3.17', '3.11', '4.25']",
            "statistics": {
                "min": 0.0,
                "max": 31.8,
                "mean": 3.54,
                "median": null,
                "variance": 0.554,
                "stddev": 0.7443,
                "histogram": {
                    "bin_counts": "[4, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 1]",
                    "bin_edges": "[0.        , 0.10127389, ... , 31.69872611, 31.8       ]"
                },
                "quantiles": {
                    "0": 2.8357,
                    "1": 3.4433,
                    "2": 3.9497
                },
                "times": {
                    "precision": 0.0027,
                    "min": 0.0001,
                    "max": 0.0,
                    "sum": 0.0,
                    "variance": 0.0001,
                    "histogram_and_quantiles": 0.0342
                },
                "precision": 2,
                "unique_count": 328,
                "unique_ratio": 0.0304,
                "categories": "['2.43', '2.31', '2.53', ... , '3.12', '3.13', '31.8']",
                "sample_size": 10788,
                "null_count": 0,
                "null_types": "[]",
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 0.0184,
                    "float": 1.0,
                    "string": 1.0
                },
                "data_label_probability": null
            }
        }
    }
}

Not all options parameters are validated for CSVData

General Information:

  • Library version: 0.3.4

Describe the bug:
In the options dict, only header is validated. We need to validate all possible parameters (delimiter, etc.).
Tests also need to be added to cover this.
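
One possible shape for such validation, sketched under assumptions: the option table below (including `quotechar`) and the function `validate_csv_options` are illustrative only and do not describe the keys CSVData actually accepts.

```python
# Hypothetical table of allowed options and their accepted types.
ALLOWED_OPTIONS = {
    "delimiter": (str, type(None)),
    "quotechar": (str, type(None)),
    "header": (int, str, type(None)),
}


def validate_csv_options(options):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for key, value in options.items():
        if key not in ALLOWED_OPTIONS:
            errors.append(f"unknown option: {key!r}")
        # bool is a subclass of int, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, ALLOWED_OPTIONS[key]):
            errors.append(f"invalid type for {key!r}: {type(value).__name__}")

    delimiter = options.get("delimiter")
    if isinstance(delimiter, str) and len(delimiter) != 1:
        errors.append("delimiter must be a single character")
    return errors


good = validate_csv_options({"delimiter": ",", "header": 0})
bad = validate_csv_options({"delimiter": 5, "sep": ","})
print(good, bad)
```

Returning every problem at once, rather than raising on the first, tends to make the accompanying tests simpler to write.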

Consider renaming some of report variables

Consider converting:

  • total_samples -> data_object_count or samples_total or row_count
  • BACKGROUND -> UNKNOWN

Remove from pretty:

  • data_label_representation -> remove
  • avg_predictions -> remove
  • times -> remove

Remove from all:

  • data_label_probability -> remove
  • covariance -> remove

Possibly remove (always null):

  • median -> remove - can be approximated
  • data_classification -> remove - could easily be implemented for PII, NPI, etc

Add data.length

Is your feature request related to a problem? Please describe.

I think the data class should have an easy-to-call property which gets the length of the given dataset, i.e. data.length instead of len(data.data)
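
A small sketch of the idea, assuming a wrapper that stores its rows in a `data` attribute. `DataWrapper` is a hypothetical class, not the library's Data class.

```python
class DataWrapper:
    """Hypothetical data holder exposing its size two ways."""

    def __init__(self, rows):
        self.data = rows

    @property
    def length(self):
        # Delegates to the underlying container.
        return len(self.data)

    def __len__(self):
        # Also lets callers write len(wrapper) directly.
        return self.length


wrapper = DataWrapper([{"id": 1}, {"id": 2}, {"id": 3}])
print(wrapper.length, len(wrapper))
```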

Show progress of profiling

Is your feature request related to a problem? Please describe.

It's unclear how far along the Data Profiler is while it is processing.

Describe the outcome you'd like:

A progress bar (number of columns or percent of data) would be super useful.

Min, Max & Avg precision for float & integer columns

Is your feature request related to a problem? Please describe.

There are cases where users may know the given dataset contains measurements from a device with a given precision. If that's the case, our measured precision is likely highly inaccurate. Take the array [15000, 39023, 94201401]: if we know the measurement 15000 is accurate, it has 5 significant figures, not 2. We should take that into account and provide both to the users.

Describe the outcome you'd like:

I'd like to see the minimum measured precision and the maximum measured precision.

Precision should also be shown if it's an integer column.
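
The two readings of 15000 can be sketched as follows. The helper `significant_figures` and its `count_trailing_zeros` flag are assumptions made for illustration, and the sketch only handles plain decimal strings (no exponents).

```python
def significant_figures(text, count_trailing_zeros=False):
    """Count significant digits in a plain decimal string."""
    digits = text.strip().lstrip("+-")
    has_point = "." in digits
    # Leading zeros are never significant.
    digits = digits.replace(".", "").lstrip("0")
    # Trailing zeros of an integer are ambiguous; drop them unless told
    # that the measurement is exact.
    if not has_point and not count_trailing_zeros:
        digits = digits.rstrip("0")
    return len(digits)


def precision_summary(precisions):
    return {
        "min": min(precisions),
        "max": max(precisions),
        "avg": sum(precisions) / len(precisions),
    }


values = ["15000", "39023", "94201401"]
measured = [significant_figures(v) for v in values]
upper = [significant_figures(v, count_trailing_zeros=True) for v in values]
print(precision_summary(measured))
print(precision_summary(upper))
```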

CSV Header Detection Errors

General Information:

  • OS: Arch Linux
  • Python version: 3.7.6
  • Library version: 0.3.4

Describe the bug:

Header detection fails on the following files.

blogposts.csv

Blog Post,Date,Subject,Field
Abstract Libraries in Go,3/7/2014,Programming,Computer Science
Virtual Memory and You,3/9/2013,Systems,Computer Science
Mutex - Process Synchronization,3/14/2014,Programming,Computer Science
Newtons Method and Fractals,3/16/2014,Programming,Mathematics
Saint Patrick,3/16/2014,World,History
Cache Optimizing,3/24/2014,Systems,Computer Science
The Cache and Multithreading,3/24/2014,Systems,Computer Science
Counting Sort in C,3/25/2014,Programming,Computer Science
Bilingualism and Pattern Recognition,3/28/2014,Learning,Life
Quadrature - Numerical Integration Comparison,4/1/2014,Algorithms,Mathematics
Basic Book Reader,4/9/2014,Programming,Computer Science
C++ Inheritance - Virtual Functions,4/10/2014,Programming,Computer Science
"Monty Hall, meet Game Theory",4/13/2014,Statistics,Mathematics
Gaussian Quadrature,4/13/2014,Algorithms,Mathematics
Introduction to Markov Processes,4/20/2014,Statistics,Mathematics
Theoretically Determing the Man Made in C++,4/26/2014,Programming,Computer Science
Introduction to Monte Carlo Methods,4/30/2014,Statistics,Mathematics
"Are Decisions Governed by ""Free Will"" or Algorithms",5/1/2014,Learning,Life
"CT-Afferents, Emotions, and Autism",5/3/2014,Learning,Life
Multithreading: Semaphores,5/4/2014,Systems,Computer Science
Multithreading: Producer-Consumer Problem,5/4/2014,Systems,Computer Science
Multithreading: Dining Philosophers Problem,5/4/2014,Systems,Computer Science
Multithreading: Common Pitfalls,5/5/2014,Systems,Computer Science
Generating Graph Images in Golang,5/6/2014,Programming,Computer Science
Intro to IPC | Interprocess Communication,5/8/2014,Systems,Computer Science

Inaccurate profile formats listed in README

Please provide the issue you face regarding the documentation

In this section of the README, the types for min and max are float, but they are strings for datetime columns.

Additionally, I'm getting simply an int for precision instead of the dictionary defined in the readme.

Suppress / Remove / Fix warnings from tensorflow

Currently, I am seeing:

WARNING:tensorflow:5 out of the last 5 calls to <function recreate_function.<locals>.restored_function_body at 0x7f27803a6790> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.

This appears every time a column is profiled / labeled. It would be nice if the warnings were suppressed and / or the underlying issue was fixed.

Investigate refactor of histogram_to_array for better accuracy

Currently, the _histogram_to_array function does not use the midpoint of the bins to recreate the original dataset.
Investigate how the accuracy of the current _histogram_to_array function compares to the following version, which uses the midpoint:

def _histogram_to_array(self):
    # Extend histogram to array format
    bin_counts = self._stored_histogram['histogram']['bin_counts']
    bin_edges = self._stored_histogram['histogram']['bin_edges']
    is_bin_non_zero = bin_counts > 0
    bin_midpoints = (bin_edges[1:][is_bin_non_zero]
                     + bin_edges[:-1][is_bin_non_zero]) / 2
    hist_to_array = [
        [midpoint] * count for midpoint, count
        in zip(bin_midpoints, bin_counts[is_bin_non_zero])
    ]
    array_flatten = np.concatenate(hist_to_array)

    # the min/max must be preserved
    array_flatten[0] = bin_edges[0]
    array_flatten[-1] = bin_edges[-1]

    # If we know they are integers, we can limit the data to be as such
    # during conversion
    if self.__class__.__name__ != 'FloatColumn':
        array_flatten = np.round(array_flatten)

    return array_flatten
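
One way to frame the investigation is to rebuild a small dataset from its histogram and compare a summary statistic against the truth. The sketch below is pure Python and compares midpoint placement against left-edge placement; the left-edge variant is an assumption chosen as a baseline, not a description of the current implementation, and it assumes the data is not constant.

```python
def histogram(values, bins):
    """Equal-width histogram; returns (counts, edges)."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    edges = [low + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for value in values:
        # The maximum value falls into the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts, edges


def reconstruct(counts, edges, use_midpoint):
    """Expand a histogram back into an approximate dataset."""
    rebuilt = []
    for i, count in enumerate(counts):
        point = (edges[i] + edges[i + 1]) / 2 if use_midpoint else edges[i]
        rebuilt.extend([point] * count)
    return rebuilt


def mean(xs):
    return sum(xs) / len(xs)


values = [0.1, 0.4, 0.5, 0.9, 1.2, 1.6, 1.7, 2.3, 2.8, 3.0]
counts, edges = histogram(values, bins=3)
midpoint_error = abs(mean(reconstruct(counts, edges, True)) - mean(values))
left_edge_error = abs(mean(reconstruct(counts, edges, False)) - mean(values))
print(counts, midpoint_error, left_edge_error)
```

A fuller comparison would repeat this over several distributions and bin counts, and would look at quantiles as well as the mean.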

Deepcopy doesn't appear necessary, can we remove?

There is a significant number of "deepcopy" calls, and these calls are currently slowing the library. One example is the line below:

results = self.match_sentence_lengths(data, copy.deepcopy(results),

After removing deepcopy, 3 tests fail when testing data processing:

pytest dataprofiler/tests/labelers/test_data_processing.py

I suspect that's due to issues with either the tests or a function that should not be mutating its input. After removing deepcopy, the function(s) all still worked fine in practice (as far as I could tell).

In either case, a shallow copy likely would work fine here. When tested it did pass all tests:

results = self.match_sentence_lengths(data, dict(results), flatten_separator)
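
The difference between the two copies matters only if the callee mutates nested values in place. A standalone illustration, with a made-up `results` dict whose keys are assumptions:

```python
import copy

results = {"pred": [[1, 2], [3]], "conf": [[0.9, 0.8], [0.7]]}

shallow = dict(results)        # new outer dict, same nested lists
deep = copy.deepcopy(results)  # new outer dict and new nested lists

# Rebinding a key on the shallow copy leaves the original untouched.
shallow["pred"] = [[0]]
rebinding_is_safe = results["pred"] == [[1, 2], [3]]

# Mutating a nested list in place is visible through the original.
shallow["conf"][0].append(0.1)
mutation_leaks = results["conf"][0] == [0.9, 0.8, 0.1]

# The deep copy is unaffected by either change.
deep_is_independent = deep["conf"][0] == [0.9, 0.8]

print(rebinding_is_safe, mutation_leaks, deep_is_independent)
```

So `dict(results)` is safe as long as the function only reassigns top-level keys, which would be consistent with the tests passing.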

This resulted in a 10-15% reduction in profiling runtime (tested on test file diamonds.csv)

Used the following code to cProfile: https://gist.github.com/lettergram/d8f7d9f3d19856d4a0187462445382a0

Master repo (sort by tottime):

         33022296 function calls (30312047 primitive calls) in 20.389 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.849    0.037    1.851    0.037 {method 'tolist' of 'numpy.ndarray' objects}
2262585/575    1.294    0.000    2.548    0.004 copy.py:132(deepcopy)
    12733    0.517    0.000    0.517    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
 85312/40    0.496    0.000    2.452    0.061 copy.py:210(_deepcopy_list)
    23976    0.402    0.000    0.557    0.000 numerical_column_stats.py:386(_get_percentile)
  4755217    0.341    0.000    0.342    0.000 {method 'get' of 'dict' objects}
     1100    0.332    0.000    0.332    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.323    0.000    0.323    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.318    0.000    0.324    0.000 _collections_abc.py:742(__iter__)
  990/330    0.285    0.000    0.285    0.001 version_utils.py:98(swap_class)
     1100    0.270    0.000    0.473    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.270    0.000    0.272    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
    31850    0.237    0.000    1.217    0.000 ops.py:1880(__init__)
     1500    0.233    0.000    0.233    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    29420    0.225    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485219/1480389    0.215    0.000    0.260    0.000 {built-in method builtins.isinstance}
       10    0.212    0.021    2.541    0.254 character_level_cnn_model.py:698(predict)
    29200    0.191    0.000    0.191    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.182    0.000    0.472    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.178    0.000    0.178    0.000 {built-in method marshal.loads}

After removing deepcopy on results, there was a 15% speedup (fails 3 tests):

         19387413 function calls (18978038 primitive calls) in 17.416 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.784    0.036    1.786    0.036 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.512    0.000    0.512    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.422    0.000    0.571    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.321    0.000    0.321    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.310    0.000    0.310    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.307    0.000    0.314    0.000 _collections_abc.py:742(__iter__)
  990/330    0.283    0.000    0.283    0.001 version_utils.py:98(swap_class)
     1100    0.263    0.000    0.461    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.254    0.000    0.256    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.227    0.000    0.227    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.225    0.000    1.160    0.000 ops.py:1880(__init__)
    29420    0.213    0.000    0.249    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485055/1480225    0.209    0.000    0.251    0.000 {built-in method builtins.isinstance}
       10    0.202    0.020    2.446    0.245 character_level_cnn_model.py:698(predict)
    29200    0.184    0.000    0.184    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.178    0.000    0.460    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.152    0.000    0.152    0.000 {built-in method marshal.loads}
     3190    0.144    0.000    0.144    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.137    0.000    0.137    0.000 {method 'search' of 're.Pattern' objects}
       80    0.137    0.002    0.359    0.004 {pandas._libs.lib.map_infer_mask}

Shallow copy implementation (passes all tests):

         19389622 function calls (18980246 primitive calls) in 17.991 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.887    0.038    1.889    0.038 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.522    0.000    0.522    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.378    0.000    0.530    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.338    0.000    0.338    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.326    0.000    0.326    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.321    0.000    0.328    0.000 _collections_abc.py:742(__iter__)
  990/330    0.287    0.000    0.288    0.001 version_utils.py:98(swap_class)
     1100    0.275    0.000    0.480    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.269    0.000    0.271    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.242    0.000    0.242    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.236    0.000    1.224    0.000 ops.py:1880(__init__)
    29420    0.224    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485818/1480988    0.214    0.000    0.257    0.000 {built-in method builtins.isinstance}
     2389    0.201    0.000    0.201    0.000 {built-in method marshal.loads}
       10    0.199    0.020    2.559    0.256 character_level_cnn_model.py:698(predict)
    29200    0.190    0.000    0.190    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.181    0.000    0.474    0.000 function_deserialization.py:481(_list_function_deps)
     3190    0.150    0.000    0.150    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.139    0.000    0.139    0.000 {method 'search' of 're.Pattern' objects}
    37515    0.139    0.000    0.140    0.000 {built-in method tensorflow.python._tf_stack.extract_stack}

Column-level detection while using the labeler

Is your feature request related to a problem? Please describe.

When we predict using labeler.predict(data), we get cell-level labels.

Describe the outcome you'd like:
How can we get the output as column-level labels?

Additional context:
Also, when we train a new Labeler with custom data, how do we include the new labeler model in the profiler so that profiling returns Data_Label in the JSON output?

Thank you

Possibly add correlated columns

It would be nice to have meta data showing which columns are likely correlated. I'm not sure on the difficulty here, but it probably isn't super difficult to calculate some estimations. Just a thought.

Mock the DataLabeler during tests, decreasing test runtime

Describe the bug:
Originally, tests mocked the DataLabelerColumnCompiler to avoid instantiating the DataProfiler and TensorFlow. However, with the recent change that now instantiates a profiler inside the Profiler, these mocks no longer protect against this long load time.

Separate "optional" requirements

Is your feature request related to a problem? Please describe.

Many people don't care about the data labels (entity recognition). They should be able to install the library without tensorflow and skip that part of the install. This would also help people attempting to use Python 3.9, for instance.

I'm not 100% sold this is a great idea, but I think it's worth a discussion.

Describe the outcome you'd like:

I'd like to remove the labeling requirements from requirements.txt and separate them into requirements-labeling.txt, then add them as dataprofiler[extras] or dataprofiler[labeler] or something to that effect. This can easily be done in setup.py.

The way you would install the labeler would be:

$ pip install dataprofiler[labeler] --user

Without the labeler would simply be:

$ pip install dataprofiler --user

https://stackoverflow.com/questions/6237946/optional-dependencies-in-distutils-pip

In the output report & when executing we can warn the users to install the labeler, if desired.
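A rough sketch of the warning idea, assuming the labeler only needs tensorflow and that the extra ends up being called `labeler` as proposed above. The function name `labeler_available` is made up for illustration and is not part of the library:

```python
import importlib.util
import warnings


def labeler_available(required_modules=("tensorflow",)):
    """Return True if every module the labeler needs can be found.

    Hypothetical helper: the module list and the extra name in the
    warning text are assumptions taken from this proposal.
    """
    missing = [
        name for name in required_modules
        if importlib.util.find_spec(name) is None
    ]
    if missing:
        warnings.warn(
            "Data labeling is disabled; missing: {}. Install it with "
            "`pip install dataprofiler[labeler]`.".format(", ".join(missing)),
            RuntimeWarning,
        )
        return False
    return True
```

The profiler could call this once at start-up and skip the data label step when it returns False, rather than failing on import.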

Additional context:

This was a recommendation on /r/statistics

Add training and extended training with an example dataset

Please provide the issue you face regarding the documentation
When using extended training and passing a custom label with custom data, the following error appears:
error_while transfer_learning

When I add "BACKGROUND" to the labels, the labeler trains. When I then predict entities, the predictions are as follows:
err1

To reproduce the error, please use the following data:
new _data_label.zip

Please update the documentation with example data so that it is more helpful.

Thank you

Chardet has trouble identifying some file encodings

General Information:

  • Python version: 3.8.7
  • Library version: 0.3.4

Describe the bug:
Currently the library uses chardet to determine file encodings. However, chardet was relatively unmaintained until recently (Dec 2020) and has trouble detecting some file encodings.

For example:
UTF-8 being detected as windows-1254: chardet/chardet#148

Additional context:
Potential fixes include:
https://hackernoon.com/how-i-used-python-to-solve-declareless-encoding-madness-hk1k42d3o

Issue With DataLabeler

When we load the default data labeler and predict on a dataset, the output is as follows:

base_model
base_model_prediction

I want the model to learn a new label, so I used transfer learning. The predictions after transfer learning are not good:
transfer_learning
new_model_predictions

Compared to the original predictions, all the values have changed. How can I make the model learn a new label without losing its original behavior? Please suggest how to do this, or tell me if I am doing anything wrong here.

To reproduce the results, use this dataset:
data.zip

Thank you.

Null count/ Null rows, potentially other info does not get shared if the primitive type of the column is disabled in options

General Information:

  • Library version: 0.4.3

Describe the bug:
If you disable a primitive data type in the options (e.g. text, because it is the catch-all) and a column would have been profiled as that type, information is dropped from the report, i.e. null count and null rows do not show up.

To Reproduce:

import dataprofiler as dp

data = dp.Data(...)
options = dp.ProfilerOptions()  # instantiate; set() needs an instance
options.set({'text.is_enabled': False})

profiler = dp.Profiler(data, profiler_options=options)
report = profiler.report() # will be missing null info

Data labeler doesn't allow TextData as the input for the predict

General Information:

  • OS:
  • Python version:
  • Library version:

Describe the bug:
Currently, data labeler allows all returned objects from data reader except TextData. This can be fixed by modifying the check_and_validate_data_format function.

To Reproduce:
Run this code

import dataprofiler as dp

data = dp.Data('some_text_file')
predictions = data_labeler.predict(data)  # data_labeler: an already loaded DataLabeler

Expected behavior:

Screenshots:

Additional context:

Detecting CSV integer header w/ description above it

General Information:

  • OS: Ubuntu 18.04
  • Python version: 3.8.7
  • Library version: 0.3.4

Describe the bug:
Currently, if there's a description above a header made of numbers, CSVData will not detect the integer header set.

Expected behavior:
The header should be detected at the proper line despite a description being above it.

Reference to discussion: #56 (comment)

Create an unstructured Compiler which combines the TextProfiler and the Unstructured Data Labeling

Is your feature request related to a problem? Please describe.
Need Unstructured profiling

Describe the outcome you'd like:
Compiler which combines: TextProfiler and UnstructuredDataLabelerProfile to return a profile.

Additional context:
Starting code; the final version may differ.

class UnstructuredCompiler(BaseColumnProfileCompiler):
    
    # NOTE: these profilers are ordered. Test functionality if changed.
    _profilers = [
        TextProfiler,
        UnstructuredDataLabelerProfile
    ]
    
    @property
    def profile(self):
        profile = {}
        for profiler in self._profiles.values():
            # create the entry on first use so update() never hits a KeyError
            profile.setdefault(profiler.profile_name, {}).update(
                profiler.profile)
        return profile

duplicate_row_count -> duplicate_row_ratio

Is your feature request related to a problem? Please describe.

I think it makes more sense to use ratios or percent of overlap. I think the count is good too, but since we're sampling it seems a bit strange.
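To make the ratio concrete, here is a small sketch in plain Python. It assumes "duplicate" means every row beyond the first occurrence of an identical row; the library may count duplicates differently, and `duplicate_row_ratio` is only an illustrative name:

```python
from collections import Counter


def duplicate_row_ratio(rows):
    """Fraction of sampled rows that repeat an earlier identical row."""
    rows = [tuple(row) for row in rows]
    if not rows:
        return 0.0
    counts = Counter(rows)
    # every occurrence after the first one counts as a duplicate
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(rows)
```

Unlike a raw count, the ratio stays comparable when the sample size changes.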

NULLs are not being estimated when sampling

Offending function / line:

base_stats = {
    # TODO: Is this correct? used to be actual sample size, including
    # NANs, what now?
    "sample_size": total_sample_size,
    "null_count": total_na,
    "null_types": na_columns,
    "sample": random.sample(list(df_series.values),
                            min(len(df_series), 5))
}

The gist is that these should either be estimates or, if they are real numbers, that should be made clear to the end user, probably with an explanation of the command they can use to estimate them, get raw counts, and/or run the full sample.

Incidentally, this is also one of the slowest functions, taking the most cumulative time (see ** below):

         4688319 function calls (4610366 primitive calls) in 2.778 seconds

   Ordered by: internal time
   List reduced from 1162 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    23976    0.403    0.000    0.559    0.000 numerical_column_stats.py:386(_get_percentile)
   755201    0.146    0.000    0.146    0.000 {method 'search' of 're.Pattern' objects}
       80    0.140    0.002    0.378    0.005 {pandas._libs.lib.map_infer_mask}
       **10    0.138    0.014    0.881    0.088 profile_builder.py:217(get_base_props_and_clean_null_params)**
   755160    0.092    0.000    0.237    0.000 object_array.py:120(<lambda>)
       20    0.088    0.004    0.216    0.011 utils.py:50(shuffle_in_chunks)
   107900    0.071    0.000    0.170    0.000 base_column_profilers.py:47(_combine_unique_sets)
      350    0.069    0.000    0.137    0.000 {pandas._libs.lib.map_infer}
    32993    0.068    0.000    0.068    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      222    0.056    0.000    0.056    0.000 {pandas._libs.lib.infer_dtype}
       10    0.054    0.005    0.220    0.022 text_column_profile.py:95(_update_vocab)
123750/123726    0.054    0.000    0.054    0.000 numerical_column_stats.py:85(__getattribute__)
   107880    0.052    0.000    0.119    0.000 random.py:174(randrange)
      168    0.051    0.000    0.114    0.001 numerical_column_stats.py:239(_total_histogram_bin_variance)
   107880    0.047    0.000    0.047    0.000 numerical_column_stats.py:545(is_int)
12798/10902    0.047    0.000    0.056    0.000 {built-in method numpy.array}
   107930    0.046    0.000    0.068    0.000 random.py:224(_randbelow)
       92    0.037    0.000    0.037    0.000 {built-in method pandas._libs.missing.isnaobj}
227832/227831    0.036    0.000    0.049    0.000 {built-in method builtins.isinstance}
   107880    0.034    0.000    0.034    0.000 numerical_column_stats.py:526(is_float)

The main time is the regex:

df_series_subset = df_series.iloc[sample_inds]
# Check if known null types exist in column
for na, flags in null_values_and_flags.items():
    # Check for the regex of the na in the string.
    reg_ex_na = f"^{na}$"
    matching_na_elements = df_series_subset.str.contains(
        reg_ex_na, flags=flags)
    for row, elem in matching_na_elements.items():
        if elem:
            # Since df_series_subset[row] is mutable,
            # need to make new var
            row_value = str(df_series_subset[row])
            na_columns.setdefault(row_value, list()).append(row)

Possible solution: merge all the null searches into one query (as opposed to multiple queries), then split once the indexes are matched. I believe it's possible to get an 8x speedup here.
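A minimal sketch of the merged search, written against plain `re` and a list of values rather than the pandas series used above. It assumes `null_values_and_flags` maps a regex string to `re` flags and that `re.IGNORECASE` is the only flag that matters; `build_null_regex` and `find_nulls` are illustrative names, and no speedup has been measured for this version:

```python
import re


def build_null_regex(null_values_and_flags):
    """Combine every null pattern into one compiled regex.

    Case-insensitivity is applied per pattern through a scoped inline
    flag, so case-sensitive patterns stay case-sensitive.
    """
    parts = []
    for pattern, flags in null_values_and_flags.items():
        if flags & re.IGNORECASE:
            parts.append("(?i:{})".format(pattern))
        else:
            parts.append("(?:{})".format(pattern))
    return re.compile("|".join(parts))


def find_nulls(values, null_values_and_flags):
    """Map each distinct null string to the indexes where it occurs."""
    null_regex = build_null_regex(null_values_and_flags)
    na_columns = {}
    for index, value in enumerate(values):
        text = str(value)
        # one pass over the data instead of one pass per null type
        if null_regex.fullmatch(text):
            na_columns.setdefault(text, []).append(index)
    return na_columns
```

Because `na_columns` is keyed by the matched string itself, there is no need to know which pattern matched, so the split step reduces to grouping by value.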

Create the TextProfiler for unstructured profiling

Is your feature request related to a problem? Please describe.
Can't profile unstructured text

Describe the outcome you'd like:
Need class for text profiling of unstructured data which mimics the formats of the structured profiling.
Include the following statistics:
word counts
vocab counts
line_lengths (min,max,...)

Additional context:
Starting point, final code may change:

class TextProfiler(object):
    
    type = 'general_info'

    def __init__(self, options):

        self.sample_size = 0
        self.times = defaultdict(float)
        self.vocab = set()
        self.words = defaultdict(int)
        self.line_length = {'max': None, 'min': None, ...} # should use numeric stats mixin?

        # options values

        # these stop words are from nltk
        self._stop_words = {
            'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
            "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
            'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
            'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
            'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
            'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
            'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
            'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
            'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
            'by', 'for', 'with', 'about', 'against', 'between', 'into',
            'through', 'during', 'before', 'after', 'above', 'below', 'to',
            'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
            'again', 'further', 'then', 'once', 'here', 'there', 'when',
            'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
            'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
            'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
            'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
            'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
            "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't",
            'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
            'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
            'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
            'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
        }

        self.__calculations = {
            "vocab": TextProfiler._update_vocab,
            "words": TextProfiler._update_words,
        }
        self._filter_properties_w_options(self.__calculations, options)        
        
    def __add__(self, other):
        """
        Merges the properties of two TextProfiler profiles
        
        :param self: first profile
        :param other: second profile
        :type self: TextProfiler
        :type other: TextProfiler
        :return: New TextProfiler merged profile
        """
        if not isinstance(other, TextProfiler):
            raise TypeError("Unsupported operand type(s) for +: "
                            "'TextProfiler' and '{}'".format(
                            other.__class__.__name__))
        merged_profile = TextProfiler(None)
        
        self._merge_calculations(merged_profile.__calculations,
                                 self.__calculations,
                                 other.__calculations)
                                 
        raise NotImplementedError()
        return merged_profile
        
    @property
    def profile(self):
        """
        Property for profile. Returns the profile of the column.
        
        :return:
        """
        profile = dict(
            vocab=self.vocab,
            words=self.words,
            word_count=self.word_count,
            times=self.times,
        )
        return profile
        
    @BaseProfiler._timeit(name='vocab')
    def _update_vocab(self, data, prev_dependent_properties=None,
                      subset_properties=None):
        raise NotImplementedError()
        
    @BaseProfiler._timeit(name='words')
    def _update_words(self, data, prev_dependent_properties=None,
                      subset_properties=None):
        raise NotImplementedError()
        
    def _update_helper(self, data, profile):
        """
        Method for updating the column profile properties with a cleaned
        dataset and the known null parameters of the dataset.
        
        :param df_series_clean: df series with nulls removed
        :type df_series_clean: pandas.core.series.Series
        :param profile: text profile dictionary
        :type profile: dict
        :return: None
        """
        BaseColumnProfiler._perform_property_calcs(
            self, self.__calculations, data=data,
            prev_dependent_properties={}, subset_properties=profile)
        
        self._update_base_properties(profile)
        raise NotImplementedError()

    def update(self, data):
        """
        Updates the column profile.
        
        :param df_series: df series
        :type df_series: pandas.core.series.Series
        :return: None
        """
        len_data = len(data)
        if len_data == 0:
            return self
        
        profile = dict(sample_size=len_data)
        self._update_helper(data, profile)

        return self

Add inplace options in the process function in the CharPostprocessor class in the data_processing.py file

Is your feature request related to a problem? Please describe.

The process match_sentence_lengths modifies the results string. That's fine, unless you need it in another data_processing step. To make it safe, you need to deepcopy (which is a very time intensive and memory intensive process).

Adding an "inplace" option to the processing function would enable users to choose between a deepcopy and a shallow copy.

Tests would need to be written to evaluate this as well.

PR #85 created this potential issue, but currently the code should function correctly (until a new data processing pipeline is built).
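For discussion, a sketch of what the flag could look like. This is not the real match_sentence_lengths; `trim_predictions` and the `"pred"` key are stand-ins, and the only point is the copy-or-mutate switch:

```python
import copy


def trim_predictions(results, lengths, inplace=False):
    """Trim each prediction list in results["pred"] to the given length.

    With inplace=False the caller's data is left untouched at the cost
    of a deepcopy; with inplace=True the input is mutated and returned.
    """
    if not inplace:
        results = copy.deepcopy(results)
    for prediction, length in zip(results["pred"], lengths):
        del prediction[length:]
    return results
```

Tests for this would check both paths: that the input survives when `inplace=False` and that the same object comes back when `inplace=True`.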

Allow user to specify the default label for dp.train_structured_labeler

Is your feature request related to a problem? Please describe.
Currently, I have to change my dataset to train on it if a column doesn't contain the default label in its names

Describe the outcome you'd like:
The ability to specify my own default label when training on my data.

Additional context:
potential function: def dp.train_structured_labeler(data, default_label=None, save_dirpath=None, epochs=2)
Also might suggest switching save path and epochs.

length on TextData works, but text data loads everything as a single string

Describe the bug:
Fix TextData to allow different text formats, then add tests for the length of TextData.
TextData could have the following formats:
samples per # of lines
samples per # of characters
samples per # of words

Otherwise, currently everything is read as a single string and the length of the data is always 1.
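One way the three formats could be expressed, as a sketch only. The `samples_per` and `size` parameter names are invented here and are not existing TextData options:

```python
def split_text(text, samples_per="line", size=1):
    """Split text into samples of `size` lines, words, or characters."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if samples_per == "line":
        units, joiner = text.splitlines(keepends=True), ""
    elif samples_per == "word":
        units, joiner = text.split(), " "
    elif samples_per == "character":
        units, joiner = list(text), ""
    else:
        raise ValueError("unknown samples_per: {!r}".format(samples_per))
    # group consecutive units into samples of the requested size
    return [
        joiner.join(units[start:start + size])
        for start in range(0, len(units), size)
    ]
```

With something like this, the length of the data would be the number of samples instead of always being 1.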

Constant memory for CSVData Match / Header check

Currently the CSVData object reads in X rows, so the number of bytes read is unbounded and varies with the row length.

Instead, we should have a max bytes / max rows to process for header/csv check. This can be done by reading in bytes at a time until X rows are read or X bytes have been read.
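A sketch of the bounded read described above, assuming a binary file-like object and newline-terminated rows. The function name and default limits are placeholders:

```python
import io


def read_bounded_lines(stream, max_rows=100, max_bytes=65536, chunk_size=4096):
    """Read lines until either the row budget or the byte budget is hit.

    Memory use is capped by max_bytes. If the byte budget runs out
    mid-row, the last returned line may be incomplete.
    """
    buffer = b""
    while len(buffer) < max_bytes and buffer.count(b"\n") < max_rows:
        chunk = stream.read(min(chunk_size, max_bytes - len(buffer)))
        if not chunk:  # end of file
            break
        buffer += chunk
    return buffer.splitlines()[:max_rows]


# Usage: only the first two rows are needed, so the rest is never read.
sample = io.BytesIO(b"a,b\n1,2\n3,4\n5,6\n")
header_rows = read_bounded_lines(sample, max_rows=2, chunk_size=4)
```

The header/CSV check would then run on the returned rows only, whatever the size of the file.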

Add examples of adding new models

Is your feature request related to a problem? Please describe.

Not a problem, but would like to see some examples related to creating / replacing the current models.

Describe the outcome you'd like:

Examples should be easy to follow and swappable in the current library.

Additional context:

multi-character separators

General Information:

  • OS: Arch Linux
  • Python version: 3.9
  • Library version: 0.4.2

Describe the bug:

I have a file (sparse-first-and-last-column.txt) containing multiple character separators: , . The system should be able to detect this appropriately and return reasonable results.

Possible Optimization Bugs Since There Are No Tests

General Information:
The repo has been optimized to reduce the amount of shuffles and add multiprocessing. Both have been merged into the repo without any tests.

There need to be tests to ensure the repo shuffles properly both when it maintains the shuffle indices and when it does not. The new utils function for shuffling needs to be tested.

In multiprocessing, there are no tests to make sure any of the multithreading is working appropriately. Most importantly, single threading needs to be tested to make sure that option still works.

Report Concatenation fails with sparse data

General Information:

  • OS: MacOS Catalina 10.15.7
  • Python version: 3.7.9
  • Library version: 0.4.2

Describe the bug:
Report Concatenation fails when missing datetime values in data.

To Reproduce:

import pandas as pd
from dataprofiler import Profiler
from datetime import datetime

dates = [None] * 20
dates[15] = datetime.strptime("2014-12-18", "%Y-%m-%d").date()  # %m is month; %M would parse minutes
dates[16] = datetime.strptime("2015-07-21", "%Y-%m-%d").date()
dates[19] = datetime.strptime("2018-09-01", "%Y-%m-%d").date()

df = pd.DataFrame({"date": dates})

df_1 = df[:10]
df_2 = df[10:]

profiles_1 = Profiler(data=df_1)

profiles_2 = Profiler(data=df_2)

profiles = profiles_1 + profiles_2 

Expected behavior:
To be able to handle None values in statistics when concatenating reports.

Additional context:
This only seems to occur with datetime.date objects where the null values in the column remain None instead of the numpy NaT value. Might be reasonable to expect user to prep data better beforehand, but better handling of None values is probably in order.

Stack trace below

Traceback (most recent call last):
  File "bug2.py", line 19, in <module>
    profiles = profiles_1 + profiles_2
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/profile_builder.py", line 454, in __add__
    self._profile[profile_name] + other._profile[profile_name]
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/profile_builder.py", line 147, in __add__
    self.profiles[profile_name] + other.profiles[profile_name]
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/column_profile_compilers.py", line 99, in __add__
    self._profiles[profile_name] + other._profiles[profile_name]
  File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/datetime_column_profile.py", line 79, in __add__
    if other._dt_obj_min is None or self._dt_obj_min < other._dt_obj_min:
TypeError: '<' not supported between instances of 'NoneType' and 'Timestamp'  
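One possible direction, sketched outside the library: compare the min/max values through helpers that treat None as "no value yet". The helper names are made up, and this is not necessarily how `_dt_obj_min` should be merged in datetime_column_profile.py:

```python
from datetime import datetime


def min_ignoring_none(first, second):
    """Return the smaller value, treating None as missing."""
    if first is None:
        return second
    if second is None:
        return first
    return min(first, second)


def max_ignoring_none(first, second):
    """Return the larger value, treating None as missing."""
    if first is None:
        return second
    if second is None:
        return first
    return max(first, second)


# Usage: merging a profile that saw no dates with one that did.
merged_min = min_ignoring_none(None, datetime(2014, 12, 18))
```

With this, a profile built from a chunk containing only None values would contribute nothing to the merged min and max instead of raising.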
