capitalone / dataprofiler
What's in your data? Extract schema, statistics and entities from datasets
Home Page: https://capitalone.github.io/DataProfiler
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
I think it makes more sense to utilize ratios or percent of overlap. I think count is good too, but as we're sampling it seems to be a bit strange.
General Information:
Describe the bug:
The options dict is not validated other than header. We need to validate all possible parameters (delimiter, etc.)
Tests also need to be added to address this
Is your feature request related to a problem? Please describe.
Currently, I have to change my dataset to train on it if a column doesn't contain the default label in its names
Describe the outcome you'd like:
The ability to specify my own default label when training on my data.
Additional context:
potential function: def dp.train_structured_labeler(data, default_label=None, save_dirpath=None, epochs=2)
Also might suggest switching save path and epochs.
Please provide the issue you face regarding the documentation
When using extended training and passing custom labels with custom data, the following error appears.
When I add "BACKGROUND" to the labels, the labeler gets trained. When I predict entities, the prediction is as follows.
To reproduce the error, please use the following data:
new _data_label.zip
Please update the documentation with example data so that it will be helpful.
Thank you
Is your feature request related to a problem? Please describe.
Many people don't care about the data labels (entity recognition); they should be able to install the library without TensorFlow and skip that install. This would also help people attempting to use Python 3.9, for instance.
I'm not 100% sold this is a great idea, but I think it's worth a discussion.
Describe the outcome you'd like:
I'd like to remove the labeling requirements from requirements.txt and separate them into requirements-labeling.txt. Then add them as dataprofiler[extras] or dataprofiler[labeler] or something to that effect. This can easily be done in the setup.py.
The way you would install the labeler would be:
$ pip install dataprofiler[labeler] --user
Without the labeler would simply be:
$ pip install dataprofiler --user
https://stackoverflow.com/questions/6237946/optional-dependencies-in-distutils-pip
In the output report & when executing we can warn the users to install the labeler, if desired.
Additional context:
This was a recommendation on /r/statistics
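A minimal sketch of the proposed warning, assuming TensorFlow is the optional dependency in question. The helper names module_available and warn_if_labeler_missing are hypothetical and are not part of the library; the install hint reuses the dataprofiler[labeler] extra suggested above.

```python
import importlib.util
import warnings


def module_available(name):
    """Return True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def warn_if_labeler_missing(required=("tensorflow",)):
    """Warn (rather than fail) when optional labeler dependencies are absent.

    Returns the list of missing module names so the caller can skip labeling.
    """
    missing = [name for name in required if not module_available(name)]
    if missing:
        warnings.warn(
            "Data labeling is disabled; missing optional packages: "
            + ", ".join(missing)
            + ". Install with: pip install dataprofiler[labeler]",
            stacklevel=2,
        )
    return missing
```

The profiler could call such a check once at start-up and disable the labeler when the returned list is non-empty.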
Is your feature request related to a problem? Please describe.
The global stats should return with the number of rows or objects in the dataset.
Is your feature request related to a problem? Please describe.
I think the data class should have an easy-to-call property which gets the length of the given dataset, i.e. data.length instead of len(data.data)
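One way the request could look, shown on a toy stand-in class rather than the library's actual data class. Only the data attribute comes from the request above; everything else is illustrative.

```python
class Data:
    """Toy stand-in for a data reader class; not the library's implementation."""

    def __init__(self, data):
        self.data = data

    @property
    def length(self):
        """Number of rows / items in the loaded dataset."""
        return len(self.data)

    def __len__(self):
        # Lets callers write len(data) as well as data.length.
        return self.length
```

Defining __len__ alongside the property keeps both spellings consistent.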
Is your feature request related to a problem? Please describe.
When we predict using labeler.predict(data) we are getting cell level labels.
Describe the outcome you'd like:
How can we get the output as column-level labels?
Additional context:
Also, when we train a new labeler with custom data, how do we include the new labeler model in the profiler so that profiling produces Data_Label in the JSON output?
Thank you
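A rough sketch of one possible aggregation: a majority vote over the cell-level labels of each column. It assumes the predictions have already been arranged as a plain dict mapping column name to a list of label strings, which is a hypothetical shape and not necessarily what labeler.predict returns. The function names are illustrative.

```python
from collections import Counter


def column_label(cell_labels, background="BACKGROUND"):
    """Pick the most common non-background label among a column's cells.

    Falls back to `background` when no cell carries any other label.
    """
    counts = Counter(label for label in cell_labels if label != background)
    if not counts:
        return background
    return counts.most_common(1)[0][0]


def aggregate_columns(predictions_by_column):
    """Map each column name to its majority cell label."""
    return {
        column: column_label(labels)
        for column, labels in predictions_by_column.items()
    }
```

A majority vote is only one choice; averaging per-label confidences per column would be another.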
Is your feature request related to a problem? Please describe.
I cannot specify what values should be considered null in my dataset
Describe the outcome you'd like:
In options, I want to specify what represents a null in my dataset.
While training on new data in Colab, the following data appears.
When I change any one of the column names to "BACKGROUND", the labeler gets trained and gives the following output.
To reproduce the errors, please use the following CSV files:
datasets.zip
Also, while predicting, the labeler gives a prediction for each cell. How can these be aggregated to the column level? Please help me with this.
Thank you
Consider converting:
total_samples -> data_object_count or samples_total or row_count
BACKGROUND -> UNKNOWN
Remove from pretty:
data_label_representation -> remove
avg_predictions -> remove
times -> remove
Remove from all:
data_label_probability -> remove
covariance -> remove
Possibly remove (always null):
median -> remove - can be approximated
data_classification -> remove - could easily be implemented for PII, NPI, etc
General Information:
If you use the set function of profiler options, you will not get an error if you try to set something that doesn't exist. You should be getting an error or at least a warning.
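A small sketch of the requested behavior, using a toy options class. The class, its values dict, and the choice of AttributeError are assumptions for illustration, not the library's code.

```python
class Options:
    """Toy options object; the option names below are illustrative only."""

    def __init__(self):
        self.values = {"text.is_enabled": True, "int.is_enabled": True}

    def set(self, options):
        """Apply `options`, rejecting any key that is not a known option."""
        unknown = sorted(key for key in options if key not in self.values)
        if unknown:
            # Fail before applying anything so a typo cannot be silently ignored.
            raise AttributeError("Unknown option(s): " + ", ".join(unknown))
        self.values.update(options)
```

Raising before the update means a call containing a mistyped key changes nothing; emitting a warning instead would be the softer alternative mentioned above.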
General Information:
Describe the bug:
function _filter_properties_w_options uses a variable called property which shadows the built-in python property
General Information:
Describe the bug:
Currently, data labeler allows all returned objects from data reader except TextData. This can be fixed by modifying the check_and_validate_data_format function.
To Reproduce:
Run this code
data = dp.Data('some_text_file')
predictions = data_labeler.predict(data)
Expected behavior:
Screenshots:
Additional context:
Is your feature request related to a problem? Please describe.
Can't profile unstructured text
Describe the outcome you'd like:
Need class for text profiling of unstructured data which mimics the formats of the structured profiling.
Include the following statistics:
word counts
vocab counts
line_lengths (min,max,...)
Additional context:
Starting point, final code may change:
from collections import defaultdict


class TextProfiler(object):
    type = 'general_info'

    def __init__(self, options):
        self.sample_size = 0
        self.times = defaultdict(float)
        self.vocab = set()
        self.words = defaultdict(int)
        self.line_length = {'max': None, 'min': None, ...}  # should use numeric stats mixin?

        # options values
        # these stop words are from nltk
        self._stop_words = {
            'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
            "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
            'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
            'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
            'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
            'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
            'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
            'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
            'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
            'by', 'for', 'with', 'about', 'against', 'between', 'into',
            'through', 'during', 'before', 'after', 'above', 'below', 'to',
            'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
            'again', 'further', 'then', 'once', 'here', 'there', 'when',
            'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
            'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
            'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
            'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
            'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
            "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't",
            'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
            'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
            'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
            'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
        }

        self.__calculations = {
            "vocab": TextProfiler._update_vocab,
            "words": TextProfiler._update_words,
        }
        self._filter_properties_w_options(self.__calculations, options)

    def __add__(self, other):
        """
        Merges the properties of two TextProfiler profiles

        :param self: first profile
        :param other: second profile
        :type self: TextProfiler
        :type other: TextProfiler
        :return: New TextProfiler merged profile
        """
        if not isinstance(other, TextProfiler):
            raise TypeError("Unsupported operand type(s) for +: "
                            "'TextProfiler' and '{}'".format(
                                other.__class__.__name__))
        merged_profile = TextProfiler(None)
        self._merge_calculations(merged_profile.__calculations,
                                 self.__calculations,
                                 other.__calculations)
        raise NotImplementedError()
        return merged_profile

    @property
    def profile(self):
        """
        Property for profile. Returns the profile of the column.

        :return:
        """
        profile = dict(
            vocab=self.vocab,
            words=self.words,
            word_count=self.word_count,
            times=self.times,
        )
        return profile

    @BaseProfiler._timeit(name='vocab')
    def _update_vocab(self, data, prev_dependent_properties=None,
                      subset_properties=None):
        raise NotImplementedError()

    @BaseProfiler._timeit(name='words')
    def _update_words(self, data, prev_dependent_properties=None,
                      subset_properties=None):
        raise NotImplementedError()

    def _update_helper(self, data, profile):
        """
        Method for updating the column profile properties with a cleaned
        dataset and the known null parameters of the dataset.

        :param data: df series with nulls removed
        :type data: pandas.core.series.Series
        :param profile: text profile dictionary
        :type profile: dict
        :return: None
        """
        BaseColumnProfiler._perform_property_calcs(
            self, self.__calculations, data=data,
            prev_dependent_properties={}, subset_properties=profile)
        self._update_base_properties(profile)
        raise NotImplementedError()

    def update(self, data):
        """
        Updates the column profile.

        :param data: df series
        :type data: pandas.core.series.Series
        :return: the updated profiler (self)
        """
        len_data = len(data)
        if len_data == 0:
            return self
        profile = dict(sample_size=len_data)
        self._update_helper(data, profile)
        return self
General Information:
Describe the bug:
Report Concatenation fails when missing datetime values in data.
To Reproduce:
import pandas as pd
from dataprofiler import Profiler
from datetime import datetime
dates = [None] * 20
dates[15] = datetime.strptime("2014-12-18", "%Y-%m-%d").date()
dates[16] = datetime.strptime("2015-07-21", "%Y-%m-%d").date()
dates[19] = datetime.strptime("2018-09-01", "%Y-%m-%d").date()
df = pd.DataFrame({"date": dates})
df_1 = df[:10]
df_2 = df[10:]
profiles_1 = Profiler(data=df_1)
profiles_2 = Profiler(data=df_2)
profiles = profiles_1 + profiles_2
Expected behavior:
To be able to handle None values in statistics when concatenating reports.
Additional context:
This only seems to occur with datetime.date objects where the null values in the column remain None instead of the numpy NaT value. Might be reasonable to expect the user to prep data better beforehand, but better handling of None values is probably in order.
Stack trace below
Traceback (most recent call last):
File "bug2.py", line 19, in <module>
profiles = profiles_1 + profiles_2
File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/profile_builder.py", line 454, in __add__
self._profile[profile_name] + other._profile[profile_name]
File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/profile_builder.py", line 147, in __add__
self.profiles[profile_name] + other.profiles[profile_name]
File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/column_profile_compilers.py", line 99, in __add__
self._profiles[profile_name] + other._profiles[profile_name]
File "/Users/AXY161/Projects/dataprofiler-oss/dataprofiler/profilers/datetime_column_profile.py", line 79, in __add__
if other._dt_obj_min is None or self._dt_obj_min < other._dt_obj_min:
TypeError: '<' not supported between instances of 'NoneType' and 'Timestamp'
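The failing comparison in the trace only guards against other._dt_obj_min being None, so it breaks when the left-hand profile saw no dates. A minimal sketch of a None-safe merge follows. The helpers merge_min and merge_max are hypothetical, not the library's actual fix.

```python
from datetime import date


def merge_min(a, b):
    """Return the smaller of two values, treating None as 'no value seen yet'."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a < b else b


def merge_max(a, b):
    """Return the larger of two values, treating None as 'no value seen yet'."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a > b else b


# A profile that saw only nulls (None) merged with one that saw real dates.
earliest = merge_min(None, date(2014, 12, 18))
latest = merge_max(date(2015, 7, 21), date(2018, 9, 1))
```

Checking both operands for None keeps the merge symmetric, so profiles_1 + profiles_2 and profiles_2 + profiles_1 behave the same.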
Please provide the issue you face regarding the documentation
In this section of the README, the types for min and max are float, but they are strings for datetime columns.
Additionally, I'm getting simply an int for precision instead of the dictionary defined in the README.
Is your feature request related to a problem? Please describe.
Not a problem, but would like to see some examples related to creating / replacing the current models.
Describe the outcome you'd like:
Examples should be easy to follow and swappable in the current library.
Additional context:
General Information:
Describe the bug:
File / Data is mostly correct, but the Year column is labeled as address
To Reproduce:
Download file, execute commands.
Is your feature request related to a problem? Please describe.
I'd like to be able to do: profile.save(filename=<optional>) and Profiler.load(filename=<profile_filename>), after which I should be able to do profile.update(data) or profile.report().
This would be an amazing feature as it would enable distributed profile generation and merging.
General Information:
Describe the bug:
I have a file (sparse-first-and-last-column.txt) containing multiple character separators: ,. The system should be able to detect this appropriately and return reasonable results.
Is your feature request related to a problem? Please describe.
Need Unstructured profiling
Describe the outcome you'd like:
Compiler which combines: TextProfiler and UnstructuredDataLabelerProfile to return a profile.
Additional context:
starting code, may not be exact same in the end.
class UnstructuredCompiler(BaseColumnProfileCompiler):
    # NOTE: these profilers are ordered. Test functionality if changed.
    _profilers = [
        TextProfiler,
        UnstructuredDataLabelerProfile
    ]

    @property
    def profile(self):
        profile = {}
        for profiler in self._profiles.values():
            # setdefault avoids a KeyError the first time a profile_name is seen
            profile.setdefault(profiler.profile_name, {}).update(profiler.profile)
        return profile
Currently the CSVData object reads in X rows which is non-deterministic for the amount of bytes being read.
Instead, we should have a max bytes / max rows to process for header/csv check. This can be done by reading in bytes at a time until X rows are read or X bytes have been read.
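A minimal sketch of the bounded read described above, assuming a binary file-like object. The function name and the default limits are hypothetical.

```python
import io


def read_sample_lines(stream, max_rows=100, max_bytes=65536):
    """Read lines from a binary stream until either limit is reached.

    Stops after `max_rows` lines or once `max_bytes` bytes have been consumed,
    whichever comes first. Each readline call is capped at the bytes remaining,
    so the final line may be cut short.
    """
    lines = []
    bytes_read = 0
    while len(lines) < max_rows and bytes_read < max_bytes:
        line = stream.readline(max_bytes - bytes_read)
        if not line:  # end of stream
            break
        bytes_read += len(line)
        lines.append(line)
    return lines


sample = read_sample_lines(io.BytesIO(b"a,b\n1,2\n3,4\n"), max_rows=2)
```

Because both limits are explicit, the amount of data inspected for the header/CSV check no longer depends on how long the rows happen to be.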
It would be nice to have meta data showing which columns are likely correlated. I'm not sure on the difficulty here, but it probably isn't super difficult to calculate some estimations. Just a thought.
Is your feature request related to a problem? Please describe.
There are cases where users may know the given dataset contains measurements from a device with a given precision. If that's the case, our measured precision is likely highly inaccurate. Take the case of the following array [15000, 39023, 94201401]: if we know the measurement 15000 is accurate, the significant figures are not 2, they are really 5. We should take that into account and provide both to the users.
Describe the outcome you'd like:
I'd like to see the minimum measured precision and the maximum measured precision.
Precision should also be shown if it's an integer column.
If you were to ingest the following file:
NAME, VALUE
test1, 1
test2, 2
test3, 3
,,
test5, 5
test6, 6
,,
test7, 7
I would expect there to be two null rows. Currently, the profiler identifies them as zero null rows.
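For reference, a standalone check of the expected count using only the standard library csv module. This is a sketch of the expectation, not of the profiler's internals; count_null_rows is a hypothetical helper.

```python
import csv
import io

SAMPLE = (
    "NAME, VALUE\n"
    "test1, 1\n"
    "test2, 2\n"
    "test3, 3\n"
    ",,\n"
    "test5, 5\n"
    "test6, 6\n"
    ",,\n"
    "test7, 7\n"
)


def count_null_rows(text):
    """Count data rows whose every field is empty or whitespace."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    return sum(
        1 for row in reader
        if row and all(field.strip() == "" for field in row)
    )


null_rows = count_null_rows(SAMPLE)
```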
Is your feature request related to a problem? Please describe.
The process match_sentence_lengths modifies the results string. That's fine, unless you need it in another data_processing step. To make it safe, you need to deepcopy (which is a very time-intensive and memory-intensive process).
Adding an "inplace" option for the processing function would enable users to either deepcopy or shallow copy.
Tests would need to be written to evaluate this as well.
PR #85 created this potential issue, but currently the code should function correctly (until a new data processing pipeline is built).
Currently, the _histogram_to_array function does not use the midpoint of the bins to recreate the original dataset.
Investigate the accuracy of the current _histogram_to_array function in comparison to the following (which uses the midpoint):
def _histogram_to_array(self):
    # Extend histogram to array format
    bin_counts = self._stored_histogram['histogram']['bin_counts']
    bin_edges = self._stored_histogram['histogram']['bin_edges']
    is_bin_non_zero = bin_counts > 0
    bin_midpoints = (bin_edges[1:][is_bin_non_zero]
                     + bin_edges[:-1][is_bin_non_zero]) / 2
    hist_to_array = [
        [midpoint] * count for midpoint, count
        in zip(bin_midpoints, bin_counts[is_bin_non_zero])
    ]
    array_flatten = np.concatenate(hist_to_array)

    # the min/max must be preserved
    array_flatten[0] = bin_edges[0]
    array_flatten[-1] = bin_edges[-1]

    # If we know they are integers, we can limit the data to be as such
    # during conversion
    if not self.__class__.__name__ == 'FloatColumn':
        array_flatten = np.round(array_flatten)
    return array_flatten
General Information:
Describe the bug:
Header detection fails on the following files.
blogposts.csv
Blog Post,Date,Subject,Field
Abstract Libraries in Go,3/7/2014,Programming,Computer Science
Virtual Memory and You,3/9/2013,Systems,Computer Science
Mutex - Process Synchronization,3/14/2014,Programming,Computer Science
Newtons Method and Fractals,3/16/2014,Programming,Mathematics
Saint Patrick,3/16/2014,World,History
Cache Optimizing,3/24/2014,Systems,Computer Science
The Cache and Multithreading,3/24/2014,Systems,Computer Science
Counting Sort in C,3/25/2014,Programming,Computer Science
Bilingualism and Pattern Recognition,3/28/2014,Learning,Life
Quadrature - Numerical Integration Comparison,4/1/2014,Algorithms,Mathematics
Basic Book Reader,4/9/2014,Programming,Computer Science
C++ Inheritance - Virtual Functions,4/10/2014,Programming,Computer Science
"Monty Hall, meet Game Theory",4/13/2014,Statistics,Mathematics
Gaussian Quadrature,4/13/2014,Algorithms,Mathematics
Introduction to Markov Processes,4/20/2014,Statistics,Mathematics
Theoretically Determing the Man Made in C++,4/26/2014,Programming,Computer Science
Introduction to Monte Carlo Methods,4/30/2014,Statistics,Mathematics
"Are Decisions Governed by ""Free Will"" or Algorithms",5/1/2014,Learning,Life
"CT-Afferents, Emotions, and Autism",5/3/2014,Learning,Life
Multithreading: Semaphores,5/4/2014,Systems,Computer Science
Multithreading: Producer-Consumer Problem,5/4/2014,Systems,Computer Science
Multithreading: Dining Philosophers Problem,5/4/2014,Systems,Computer Science
Multithreading: Common Pitfalls,5/5/2014,Systems,Computer Science
Generating Graph Images in Golang,5/6/2014,Programming,Computer Science
Intro to IPC | Interprocess Communication,5/8/2014,Systems,Computer Science
Describe the bug:
Fix text data to allow different test formats and then add tests for length of TextData
TextData could have the following formats:
samples per # of line
samples per # of character
samples per # of words
Otherwise, currently everything is read as a single string and the length of the data is always 1.
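A small sketch of how the length could depend on the sampling unit. The function name and the unit strings are hypothetical and only illustrate the three formats listed above.

```python
def text_length(text, unit="line"):
    """Return the number of samples in `text` for the given sampling unit."""
    if unit == "line":
        return len(text.splitlines())
    if unit == "word":
        return len(text.split())
    if unit == "character":
        return len(text)
    raise ValueError("unit must be 'line', 'word', or 'character'")


example = "hello world\nfoo"
lengths = {unit: text_length(example, unit) for unit in ("line", "word", "character")}
```

Tests for the length of TextData could then assert one expected value per unit on a fixed input.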
General Information:
Describe the bug:
Get the following error with the report
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
Error: categorical has no data
Error: order has no data
To Reproduce:
Use the following code with a csv file including 16 columns
# set option to run only data labeler
profile_options = dp.ProfilerOptions()
profile_options.set({"text.is_enabled": False,
"int.is_enabled": False,
"float.is_enabled": False,
"order.is_enabled": False,
"category.is_enabled": False,
"datetime.is_enabled": False,})
profile = dp.Profiler(data, profiler_options=profile_options)
results = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(results, indent=4))
Expected behavior:
Screenshots:
Additional context:
Describe the bug:
Originally, tests mocked the DataLabelerColumnCompiler to avoid instantiating the DataProfiler and TensorFlow. However, with the recent change that now instantiates a profiler inside the Profiler, these mocks no longer protect against this long load time.
Is your feature request related to a problem? Please describe.
It's unclear where the Data Profiler is currently at in terms of processing
Describe the outcome you'd like:
A progress bar (number of columns or percent of data) would be super useful.
General Information:
Describe the bug:
If you disable a primitive data type in the options (e.g. TEXT, because it is the catch-all) and a column would have been profiled as that type, information is dropped from the report; i.e. null count and null rows do not show up.
To Reproduce:
import dataprofiler as dp
data = dp.Data(...)
options = dp.ProfilerOptions()
options.set({'text.is_enabled': False})
profiler = dp.Profiler(data, profiler_options=options)
report = profiler.report() # will be missing null info
General Information:
Describe the bug:
File:
-123
234
345
534
231
Delimiter detected: 3
Delimiter should probably not detect inside numbers
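One possible heuristic, sketched under the assumption that a delimiter is never a letter, digit, or quote character and that it occurs the same number of times on every line. This is an illustration, not the library's detection logic; guess_delimiter is a hypothetical name.

```python
from collections import Counter


def guess_delimiter(lines):
    """Guess a delimiter, never considering letters, digits, or quotes.

    A candidate must appear the same non-zero number of times on every
    non-empty line; among those, the most frequent one wins.
    Returns None when no candidate qualifies.
    """
    lines = [line for line in lines if line]
    if not lines:
        return None
    per_line = [
        Counter(ch for ch in line if not ch.isalnum() and ch not in "\"'")
        for line in lines
    ]
    candidates = {
        ch: count
        for ch, count in per_line[0].items()
        if all(c[ch] == count for c in per_line[1:])
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


numbers_only = guess_delimiter(["-123", "234", "345", "534", "231"])
```

On the file above, the digit 3 is never a candidate and the leading minus sign appears on only one line, so no delimiter is reported.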
General Information:
Describe the bug:
Because indexes are randomly generated per column
DataProfiler/dataprofiler/profilers/profile_builder.py
Lines 280 to 281 in 5dfa3b0
Those indexes are random and may not align with one another. They have to align to get correct results. The inaccurate results would be: row_has_null and row_is_null
To Reproduce:
Every time the data profiler is run.
Expected behavior:
The solution is to:
min_true_samples) needed (return / store said location)
row_is_null calculation
As part of this PR, tests should be written to check the row_has_null and row_is_null counts.
There is a significant number of "deepcopy" calls. These calls are currently slowing the library, such as the line below:
After removing deepcopy, 3 tests fail when testing data processing:
pytest dataprofiler/tests/labelers/test_data_processing.py
I suspect, that's due to issues with either the tests OR a function that needs to not manipulate the input. After removing deepcopy the function(s) still all worked fine in practice (as far as I could tell).
In either case, a shallow copy likely would work fine here. When tested it did pass all tests:
results = self.match_sentence_lengths(data, dict(results), flatten_separator)
This resulted in a 10-15% reduction in profiling runtime (tested on test file diamonds.csv)
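To make the trade-off concrete, here is a small standalone illustration of what dict(results) does and does not protect. The dictionary contents are made up. A shallow copy is safe only if the processing step rebinds top-level keys rather than mutating nested lists in place, which is an assumption about match_sentence_lengths and not something verified here.

```python
import copy

results = {"pred": [["A", "B"], ["C"]], "conf": [0.9, 0.8]}

shallow = dict(results)        # new top-level dict, nested lists are shared
deep = copy.deepcopy(results)  # fully independent copy (the slow option)

# Rebinding a key on the shallow copy leaves the original untouched.
shallow["pred"] = ["replaced"]

# Mutating a nested object in place through the shallow copy does leak
# into the original, but not into the deep copy.
shallow["conf"].append(0.7)
```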
Used the following code to cProfile: https://gist.github.com/lettergram/d8f7d9f3d19856d4a0187462445382a0
Master repo (sort by tottime):
33022296 function calls (30312047 primitive calls) in 20.389 seconds
Ordered by: internal time
List reduced from 9981 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
50 1.849 0.037 1.851 0.037 {method 'tolist' of 'numpy.ndarray' objects}
2262585/575 1.294 0.000 2.548 0.004 copy.py:132(deepcopy)
12733 0.517 0.000 0.517 0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
85312/40 0.496 0.000 2.452 0.061 copy.py:210(_deepcopy_list)
23976 0.402 0.000 0.557 0.000 numerical_column_stats.py:386(_get_percentile)
4755217 0.341 0.000 0.342 0.000 {method 'get' of 'dict' objects}
1100 0.332 0.000 0.332 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
11392 0.323 0.000 0.323 0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
250330 0.318 0.000 0.324 0.000 _collections_abc.py:742(__iter__)
990/330 0.285 0.000 0.285 0.001 version_utils.py:98(swap_class)
1100 0.270 0.000 0.473 0.000 function_def_to_graph.py:122(function_def_to_graph_def)
10250 0.270 0.000 0.272 0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
31850 0.237 0.000 1.217 0.000 ops.py:1880(__init__)
1500 0.233 0.000 0.233 0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
29420 0.225 0.000 0.262 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485219/1480389 0.215 0.000 0.260 0.000 {built-in method builtins.isinstance}
10 0.212 0.021 2.541 0.254 character_level_cnn_model.py:698(predict)
29200 0.191 0.000 0.191 0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
2200 0.182 0.000 0.472 0.000 function_deserialization.py:481(_list_function_deps)
2389 0.178 0.000 0.178 0.000 {built-in method marshal.loads}
After removing deepcopy on results, there was a 15% speedup (fails 3 tests):
19387413 function calls (18978038 primitive calls) in 17.416 seconds
Ordered by: internal time
List reduced from 9981 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
50 1.784 0.036 1.786 0.036 {method 'tolist' of 'numpy.ndarray' objects}
12733 0.512 0.000 0.512 0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
23976 0.422 0.000 0.571 0.000 numerical_column_stats.py:386(_get_percentile)
1100 0.321 0.000 0.321 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
11392 0.310 0.000 0.310 0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
250330 0.307 0.000 0.314 0.000 _collections_abc.py:742(__iter__)
990/330 0.283 0.000 0.283 0.001 version_utils.py:98(swap_class)
1100 0.263 0.000 0.461 0.000 function_def_to_graph.py:122(function_def_to_graph_def)
10250 0.254 0.000 0.256 0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
1500 0.227 0.000 0.227 0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
31850 0.225 0.000 1.160 0.000 ops.py:1880(__init__)
29420 0.213 0.000 0.249 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485055/1480225 0.209 0.000 0.251 0.000 {built-in method builtins.isinstance}
10 0.202 0.020 2.446 0.245 character_level_cnn_model.py:698(predict)
29200 0.184 0.000 0.184 0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
2200 0.178 0.000 0.460 0.000 function_deserialization.py:481(_list_function_deps)
2389 0.152 0.000 0.152 0.000 {built-in method marshal.loads}
3190 0.144 0.000 0.144 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
756337 0.137 0.000 0.137 0.000 {method 'search' of 're.Pattern' objects}
80 0.137 0.002 0.359 0.004 {pandas._libs.lib.map_infer_mask}
Shallow copy implementation (passes all tests):
19389622 function calls (18980246 primitive calls) in 17.991 seconds
Ordered by: internal time
List reduced from 9981 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
50 1.887 0.038 1.889 0.038 {method 'tolist' of 'numpy.ndarray' objects}
12733 0.522 0.000 0.522 0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
23976 0.378 0.000 0.530 0.000 numerical_column_stats.py:386(_get_percentile)
1100 0.338 0.000 0.338 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
11392 0.326 0.000 0.326 0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
250330 0.321 0.000 0.328 0.000 _collections_abc.py:742(__iter__)
990/330 0.287 0.000 0.288 0.001 version_utils.py:98(swap_class)
1100 0.275 0.000 0.480 0.000 function_def_to_graph.py:122(function_def_to_graph_def)
10250 0.269 0.000 0.271 0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
1500 0.242 0.000 0.242 0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
31850 0.236 0.000 1.224 0.000 ops.py:1880(__init__)
29420 0.224 0.000 0.262 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485818/1480988 0.214 0.000 0.257 0.000 {built-in method builtins.isinstance}
2389 0.201 0.000 0.201 0.000 {built-in method marshal.loads}
10 0.199 0.020 2.559 0.256 character_level_cnn_model.py:698(predict)
29200 0.190 0.000 0.190 0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
2200 0.181 0.000 0.474 0.000 function_deserialization.py:481(_list_function_deps)
3190 0.150 0.000 0.150 0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
756337 0.139 0.000 0.139 0.000 {method 'search' of 're.Pattern' objects}
37515 0.139 0.000 0.140 0.000 {built-in method tensorflow.python._tf_stack.extract_stack}
TensorFlow addons (TFA) are the primary reason the library cannot be upgraded.
TFA is utilized in a single location in the code:
It should be possible to create an F1 score metric function that doesn't require TFA:
https://stackoverflow.com/questions/64474463/custom-f1-score-metric-in-tensorflow
Once removed, tensorflow nightly would likely suffice and the library could then work on python 3.9.
Would we like to do this?
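For discussion, a framework-free sketch of the F1 arithmetic that such a metric would need. The function names are hypothetical; a TensorFlow metric would accumulate the same true-positive, false-positive, and false-negative counts in tensors instead of Python sums.

```python
def f1_per_class(y_true, y_pred, labels):
    """Compute the F1 score for each label from two equal-length sequences."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); taken as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (0.0 for an empty label list)."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(scores) if scores else 0.0


scores = f1_per_class(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
```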
General Information:
The repo has been optimized to reduce the amount of shuffles and add multiprocessing. Both have been merged into the repo without any tests.
There need to be tests to ensure the repo is shuffling properly both when it is maintaining the shuffle indices and when it is not. The new utils function for shuffling needs to be tested.
In multiprocessing, there are no tests to make sure any of the multithreading is working appropriately. Most importantly, single threading needs to be tested to make sure that option still works.
When we load the default data labeler and predict on a dataset, the output is as follows.
Here I want to make the model learn a new label, so I used transfer learning. The predictions after transfer learning are not good.
Compared to the original predictions, all the values are disturbed. How can the model learn a new label without losing its original behavior? Please suggest how to do it, or tell me if I am doing anything wrong here.
To reproduce the results, use these datasets:
data.zip
Thank you.
General Information:
Describe the bug:
Profiling diamonds.csv (2.5Mb) or really any dataset is much slower than expected, even with the data labeler disabled.
To Reproduce:
The following code takes:
Base run: 38 seconds and ~785Mb
No labeler: 24.48 seconds to execute and ~67Mb
No labeler, no histogram: 23.47 seconds to execute and ~67Mb.
No labeler, no datetime: 23.35 seconds and ~67Mb
import sys
import json
import time
import dataprofiler as dp
filename = sys.argv[1]
def profile_test(filename):
    data = dp.Data(filename)
    profile_options = dp.ProfilerOptions()
    profile_options.structured_options.data_labeler.is_enabled = False
    profile_options.set({"histogram_and_quantiles.is_enabled": False})
    profile_options.set({"datetime.is_enabled": False})
    profile = dp.Profiler(data, profiler_options=profile_options)
    human_readable_report = profile.report(report_options={"output_format":"pretty"})
    print(json.dumps(human_readable_report, indent=4))
start_time = time.time()
profile_test(filename)
end_time = time.time()
print("Profile runtime for "+filename, end_time-start_time, 'seconds')
In the output, there's no reason in the "timing" for this to be occurring:
{
"global_stats": {
"samples_used": 10788,
"column_count": 10,
"unique_row_ratio": 0.9973,
"row_has_null_ratio": 0.0,
"duplicate_row_count": 146,
"file_type": "csv",
"encoding": "utf-8",
"data_classification": null,
"covariance": null
},
"data_stats": {
"carat": {
"column_name": "carat",
"data_type": "float",
"categorical": true,
"order": "random",
"samples": "['2.52', '1', '0.55', '1.75', '2.01']",
"statistics": {
"min": 0.2,
"max": 4.01,
"mean": 0.7976,
"median": null,
"variance": 0.2244,
"stddev": 0.4738,
"histogram": {
"bin_counts": "[ 70, 205, 606, 1224, 443, ... , 0, 0, 0, 0, 1]",
"bin_edges": "[0.2 , 0.23663462, ... , 3.97336538, 4.01 ]"
},
"quantiles": {
"0": 0.3832,
"1": 0.6762,
"2": 1.006
},
"times": {
"precision": 0.0027,
"min": 0.0001,
"max": 0.0,
"sum": 0.0001,
"variance": 0.0001,
"histogram_and_quantiles": 0.0263
},
"precision": 2,
"unique_count": 230,
"unique_ratio": 0.0213,
"categories": "['0.29', '0.31', '0.26', ... , '0.21', '0.57', '0.69']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.0336,
"float": 1.0,
"string": 1.0
},
"data_label_probability": null
}
},
"cut": {
"column_name": "cut",
"data_type": "string",
"categorical": true,
"order": "random",
"samples": "['Ideal', 'Very Good', 'Ideal', 'Premium', 'Good']",
"statistics": {
"min": 4.0,
"max": 9.0,
"mean": 6.2944,
"median": null,
"variance": 3.1061,
"stddev": 1.7624,
"histogram": {
"bin_counts": "[1340, 0, 0, 0, ... , 0, 0, 0, 2454]",
"bin_edges": "[4. , 4.04807692, ... , 8.95192308, 9. ]"
},
"quantiles": {
"0": 4.9615,
"1": 4.9615,
"2": 6.9808
},
"vocab": "['P', 'r', 'e', 'm', 'i', ... , 'd', 'I', 'a', 'l', 'F']",
"times": {
"vocab": 0.0239,
"min": 0.0001,
"max": 0.0,
"sum": 0.0001,
"variance": 0.0001,
"histogram_and_quantiles": 0.0304
},
"unique_count": 5,
"unique_ratio": 0.0005,
"categories": "['Premium', 'Very Good', 'Ideal', 'Good', 'Fair']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.0,
"float": 0.0,
"string": 1.0
},
"data_label_probability": null
}
},
"color": {
"column_name": "color",
"data_type": "string",
"categorical": true,
"order": "random",
"samples": "['D', 'D', 'E', 'G', 'D']",
"statistics": {
"min": 1.0,
"max": 1.0,
"mean": 1.0,
"median": null,
"variance": 0.0,
"stddev": 0.0,
"histogram": {
"bin_counts": "[10788]",
"bin_edges": "[1., 1.]"
},
"quantiles": {
"0": 1.0,
"1": 1.0,
"2": 1.0
},
"vocab": "['I', 'E', 'J', 'H', 'D', 'G', 'F']",
"times": {
"vocab": 0.0147,
"min": 0.0001,
"max": 0.0,
"sum": 0.0001,
"variance": 0.0001,
"histogram_and_quantiles": 0.0113
},
"unique_count": 7,
"unique_ratio": 0.0006,
"categories": "['I', 'E', 'J', 'H', 'D', 'G', 'F']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.0,
"float": 0.0,
"string": 1.0
},
"data_label_probability": null
}
},
"clarity": {
"column_name": "clarity",
"data_type": "string",
"categorical": true,
"order": "random",
"samples": "['VS1', 'SI2', 'SI1', 'VS1', 'SI1']",
"statistics": {
"min": 2.0,
"max": 4.0,
"mean": 3.1212,
"median": null,
"variance": 0.1989,
"stddev": 0.446,
"histogram": {
"bin_counts": "[498, 0, 0, 0, 0, ... , 0, 0, 0, 0, 1806]",
"bin_edges": "[2. , 2.01923077, ... , 3.98076923, 4. ]"
},
"quantiles": {
"0": 3.0,
"1": 3.0,
"2": 3.0
},
"vocab": "['V', 'S', '1', 'I', '2', 'F']",
"times": {
"vocab": 0.0149,
"min": 0.0001,
"max": 0.0001,
"sum": 0.0001,
"variance": 0.0001,
"histogram_and_quantiles": 0.0322
},
"unique_count": 8,
"unique_ratio": 0.0007,
"categories": "['VS1', 'SI1', 'VVS2', 'VS2', ... , 'SI2', 'VVS1', 'IF', 'I1']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.0,
"float": 0.0,
"string": 1.0
},
"data_label_probability": null
}
},
"depth": {
"column_name": "depth",
"data_type": "float",
"categorical": true,
"order": "random",
"samples": "['62.5', '62.4', '63.9', '63.4', '61']",
"statistics": {
"min": 43.0,
"max": 79.0,
"mean": 61.7492,
"median": null,
"variance": 2.0652,
"stddev": 1.4371,
"histogram": {
"bin_counts": "[1, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 1]",
"bin_edges": "[43. , 43.13533835, ... , 78.86466165, 79. ]"
},
"quantiles": {
"0": 61.0,
"1": 61.6767,
"2": 62.4887
},
"times": {
"precision": 0.0026,
"min": 0.0001,
"max": 0.0,
"sum": 0.0,
"variance": 0.0001,
"histogram_and_quantiles": 0.0653
},
"precision": 1,
"unique_count": 136,
"unique_ratio": 0.0126,
"categories": "['61.5', '62', '63.4', '61', ... , '54.3', '55.3', '57', '79']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.099,
"float": 1.0,
"string": 1.0
},
"data_label_probability": null
}
},
"table": {
"column_name": "table",
"data_type": "float",
"categorical": true,
"order": "random",
"samples": "['54', '59', '56', '57', '56']",
"statistics": {
"min": 51.0,
"max": 95.0,
"mean": 57.4651,
"median": null,
"variance": 5.0778,
"stddev": 2.2534,
"histogram": {
"bin_counts": "[ 5, 0, 0, 16, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 1]",
"bin_edges": "[51. , 51.26993865, ... , 94.73006135, 95. ]"
},
"quantiles": {
"0": 55.7239,
"1": 56.8037,
"2": 58.6933
},
"times": {
"precision": 0.0019,
"min": 0.0001,
"max": 0.0,
"sum": 0.0,
"variance": 0.0001,
"histogram_and_quantiles": 0.0296
},
"precision": 1,
"unique_count": 91,
"unique_ratio": 0.0084,
"categories": "['58', '57', '61', '54', ... , '58.9', '62.5', '60.7', '61.6']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.9819,
"float": 1.0,
"string": 1.0
},
"data_label_probability": null
}
},
"price": {
"column_name": "price",
"data_type": "int",
"categorical": false,
"order": "random",
"samples": "['2655', '3587', '1341', '991', '16215']",
"statistics": {
"min": 334.0,
"max": 18795.0,
"mean": 3903.6178,
"median": null,
"variance": 15978757.826,
"stddev": 3997.3438,
"histogram": {
"bin_counts": "[436, 959, 965, 757, 481, ... , 13, 14, 10, 11, 14]",
"bin_edges": "[334. , 511.50961538, ... , 18617.49038462, 18795. ]"
},
"quantiles": {
"0": 866.5288,
"1": 2286.6058,
"2": 5126.7596
},
"times": {
"min": 0.0001,
"max": 0.0,
"sum": 0.0,
"variance": 0.0001,
"histogram_and_quantiles": 0.0257
},
"unique_count": 5349,
"unique_ratio": 0.4958,
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 1.0,
"float": 1.0,
"string": 1.0
},
"data_label_probability": null
}
},
"x": {
"column_name": "x",
"data_type": "float",
"categorical": true,
"order": "random",
"samples": "['5.7', '4.74', '7.3', '4.27', '6.88']",
"statistics": {
"min": 0.0,
"max": 10.02,
"mean": 5.7337,
"median": null,
"variance": 1.2481,
"stddev": 1.1172,
"histogram": {
"bin_counts": "[2, 0, 0, 0, 0, 0, 0, 0, ... , 2, 1, 1, 0, 0, 0, 0, 1]",
"bin_edges": "[0. , 0.09634615, ... , 9.92365385, 10.02 ]"
},
"quantiles": {
"0": 4.6246,
"1": 5.6844,
"2": 6.4552
},
"times": {
"precision": 0.0026,
"min": 0.0001,
"max": 0.0,
"sum": 0.0,
"variance": 0.0001,
"histogram_and_quantiles": 0.0385
},
"precision": 2,
"unique_count": 503,
"unique_ratio": 0.0466,
"categories": "['3.87', '3.93', '4.21', ... , '5.54', '3.92', '3.9']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.005,
"float": 1.0,
"string": 1.0
},
"data_label_probability": null
}
},
"y": {
"column_name": "y",
"data_type": "float",
"categorical": true,
"order": "random",
"samples": "['5.13', '4.18', '6.37', '6.18', '6.41']",
"statistics": {
"min": 3.71,
"max": 31.8,
"mean": 5.724,
"median": null,
"variance": 1.3022,
"stddev": 1.1412,
"histogram": {
"bin_counts": "[ 8, 93, 157, 840, 1031, ... , 0, 0, 0, 0, 1]",
"bin_edges": "[3.71 , 3.87426901, ... , 31.63573099, 31.8 ]"
},
"quantiles": {
"0": 4.6135,
"1": 5.5991,
"2": 6.4204
},
"times": {
"precision": 0.0027,
"min": 0.0001,
"max": 0.0,
"sum": 0.0001,
"variance": 0.0001,
"histogram_and_quantiles": 0.0262
},
"precision": 2,
"unique_count": 496,
"unique_ratio": 0.046,
"categories": "['3.96', '3.78', '3.9', ... , '3.82', '3.92', '31.8']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.0056,
"float": 1.0,
"string": 1.0
},
"data_label_probability": null
}
},
"z": {
"column_name": "z",
"data_type": "float",
"categorical": true,
"order": "random",
"samples": "['4.23', '2.69', '3.17', '3.11', '4.25']",
"statistics": {
"min": 0.0,
"max": 31.8,
"mean": 3.54,
"median": null,
"variance": 0.554,
"stddev": 0.7443,
"histogram": {
"bin_counts": "[4, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 1]",
"bin_edges": "[0. , 0.10127389, ... , 31.69872611, 31.8 ]"
},
"quantiles": {
"0": 2.8357,
"1": 3.4433,
"2": 3.9497
},
"times": {
"precision": 0.0027,
"min": 0.0001,
"max": 0.0,
"sum": 0.0,
"variance": 0.0001,
"histogram_and_quantiles": 0.0342
},
"precision": 2,
"unique_count": 328,
"unique_ratio": 0.0304,
"categories": "['2.43', '2.31', '2.53', ... , '3.12', '3.13', '31.8']",
"sample_size": 10788,
"null_count": 0,
"null_types": "[]",
"null_types_index": {},
"data_type_representation": {
"datetime": 0.0,
"int": 0.0184,
"float": 1.0,
"string": 1.0
},
"data_label_probability": null
}
}
}
}
Expected behavior:
Screenshots:
Additional context:
General Information:
Describe the bug:
JSON files are not loading as I would expect. Loading works well when the file is a list of objects, but embedded objects are not handled well. Further, if there is a JSON file such as: { "data": [ { }, { }, { } ] }
the library will treat [ { }, { }, { } ] as one giant string.
To Reproduce:
JSON files come in a variety of formats, some datasets I tried:
JSON
- https://catalog.data.gov/dataset/2006-2011-nys-math-test-results-by-grade-citywide-by-race-ethnicity
JSON
- https://catalog.data.gov/dataset/united-states-drought-monitor-2000-2016-8fe5c
CSV
and is 57.8Mb, 2.8m rows
JSON
https://www.json-generator.com/
[
{
"_id": "605d673b20b4132093890d7f",
"index": 0,
"guid": "4582d945-a7c7-4605-a335-255c04fb701d",
"isActive": false,
"balance": "$3,822.52",
"picture": "http://placehold.it/32x32",
"age": 26,
"eyeColor": "green",
"name": "Cobb Bonner",
"gender": "male",
"company": "SCENTRIC",
"email": "[email protected]",
"phone": "+1 (887) 582-3501",
"address": "712 Ferris Street, Marysville, New York, 1006",
"about": "Velit aliquip duis id ut officia culpa cillum labore elit do ad. Esse cillum dolor sunt anim ex elit ullamco qui enim eu. Cupidatat fugiat ea dolore do fugiat et minim occaecat laboris culpa. Cupidatat nostrud dolor deserunt in irure pariatur ut labore anim consequat. Dolor in anim culpa adipisicing cillum occaecat proident cupidatat voluptate occaecat ullamco amet. Laboris fugiat tempor ullamco non non commodo dolore officia deserunt sint cupidatat ea. Culpa qui excepteur duis ea voluptate irure deserunt do quis anim fugiat commodo aute laborum.\r\n",
"registered": "2014-08-24T12:11:05 +05:00",
"latitude": -2.632515,
"longitude": -17.492363,
"tags": [
"commodo",
"aliqua",
"et",
"velit",
"excepteur",
"deserunt",
"culpa"
],
"friends": [
{
"id": 0,
"name": "Finch Russell"
},
{
"id": 1,
"name": "Stephanie Buckner"
},
{
"id": 2,
"name": "Rachelle Cox"
}
],
"greeting": "Hello, Cobb Bonner! You have 4 unread messages.",
"favoriteFruit": "banana"
},{
....
}
]
Expected behavior:
If there's a top-level "data" entry, load that entry.
Internal objects or lists should be evaluated and not be represented as strings.
Example JSON:
{
    "data": [
        {
            "id": 1,
            "tags": [ "test 1", "test 2", "test 3" ]
        },{
            "id": 2,
            "tags": [ "test 4", "test 5", "test 6" ]
        }
    ]
}
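The expected behavior above can be sketched with the standard library alone. This is only an illustration, not the library's loader: `load_json_records`, `flatten_record`, and the `payload_key` parameter are hypothetical names introduced here. The sketch unwraps a top-level "data" entry and expands nested objects into dotted keys, while lists stay as lists instead of being turned into strings.

```python
import json


def flatten_record(record, parent_key="", sep="."):
    """Expand nested objects into dotted keys; lists are kept as lists."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def load_json_records(text, payload_key="data"):
    """Parse JSON text into a list of flat records.

    `payload_key` (hypothetical) names the top-level entry wrapping the records.
    """
    parsed = json.loads(text)
    # Unwrap {"data": [...]} so the records are used, not one giant string.
    if isinstance(parsed, dict) and isinstance(parsed.get(payload_key), list):
        parsed = parsed[payload_key]
    if isinstance(parsed, dict):
        parsed = [parsed]
    # Non-object entries are wrapped so every record is a dict.
    return [
        flatten_record(item) if isinstance(item, dict) else {"value": item}
        for item in parsed
    ]


sample = (
    '{"data": ['
    '{"id": 1, "tags": ["test 1", "test 2"], "owner": {"name": "a"}},'
    '{"id": 2, "tags": ["test 4"], "owner": {"name": "b"}}'
    "]}"
)
records = load_json_records(sample)
```

Each entry of `records` is a flat dictionary, so it could be handed to a tabular structure one column per key.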
Example column / data object:
The DataLabelerColumnProfile requires a specific output from the data labeler in order to create a profile. If the data labeler is changed and its output is incorrect, this will cause an error.
This may also need to be addressed in code via understandable error messages.
General Information:
Describe the bug:
Currently the library uses chardet to determine file encodings. However, chardet was relatively unmaintained until recently (Dec 2020) and has trouble detecting some file encodings.
For example:
UTF-8 being detected as windows-1254: chardet/chardet#148
Additional context:
Potential fixes include:
https://hackernoon.com/how-i-used-python-to-solve-declareless-encoding-madness-hk1k42d3o
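One simple alternative, shown here only as a sketch and not taken from the linked article, is trial decoding: try strict UTF-8 first and fall back to other encodings only when that fails. The `detect_encoding` helper and its candidate list are assumptions made for illustration; this is not how chardet works, and it cannot tell single-byte encodings apart reliably.

```python
def detect_encoding(raw, candidates=("utf-8", "windows-1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper: the candidate order encodes a preference, with
    strict UTF-8 first so valid UTF-8 is never reported as windows-125x.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


utf8_guess = detect_encoding("café".encode("utf-8"))
cp1252_guess = detect_encoding("café".encode("windows-1252"))
```

Because latin-1 maps every byte to a character, it acts as a last resort that always succeeds, so the function only returns `None` if the candidate list omits it.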
General Information:
Describe the bug:
Currently, if there's a description above a header made of numbers, CSVData will not detect the integer header set.
Expected behavior:
The header should be detected at the proper line, despite a description being above it.
Reference to discussion: #56 (comment)
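As a rough illustration of one way to skip a description line, the sketch below picks the first row whose field count matches the most common field count in the file. This is not CSVData's detection logic; `find_header_index` is a hypothetical helper, and the heuristic assumes the description does not itself contain the delimiter the same number of times as the data rows.

```python
import csv
import io
from collections import Counter


def find_header_index(text, delimiter=","):
    """Return the index of the first row with the typical field count."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    widths = [len(row) for row in rows]
    non_empty = [width for width in widths if width > 0]
    if not non_empty:
        return None
    # The most common width is assumed to be the width of the data rows.
    typical_width = Counter(non_empty).most_common(1)[0][0]
    for index, width in enumerate(widths):
        if width == typical_width:
            return index
    return None


sample = "Yearly figures\n1,2,3\n10,20,30\n40,50,60\n"
header_index = find_header_index(sample)
```

Because the check relies on field counts rather than cell types, a header made only of integers is treated the same as a header made of names.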
Currently, I am seeing:
WARNING:tensorflow:5 out of the last 5 calls to <function recreate_function..restored_function_body at 0x7f27803a6790> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
This warning appears every time a column is profiled / labeled. It would be nice if the warnings were suppressed and / or the issue was fixed.
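Until the retracing itself is fixed, the message can be hidden on the caller's side. Messages prefixed with "WARNING:tensorflow:" go through Python's `logging` module under the logger named "tensorflow", so raising that logger's level is a minimal sketch of a workaround. It only hides the symptom and does not remove the retracing cost.

```python
import logging
import os

# Affects TensorFlow's native (C++) log output; it should be set before
# TensorFlow is imported to take effect.
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")

# Hide Python-level TensorFlow warnings such as the tf.function retracing one.
tensorflow_logger = logging.getLogger("tensorflow")
tensorflow_logger.setLevel(logging.ERROR)
```

If something resets the level later, the `setLevel` call may need to be repeated after the import; `tf.get_logger().setLevel("ERROR")` is the equivalent call once TensorFlow is loaded.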
Describe the bug:
self.rows_ingested = len(data) overwrites the rows ingested each time which means it does not allow for streaming data.
To Reproduce:
import pandas as pd
import dataprofiler as dp
data = pd.DataFrame([1,2,3,4])
profiler = dp.Profiler(data[:2])
profiler.update(data[2:])
assert profiler.rows_ingested == 4
Additionally, rows_ingested does not represent the rows sampled while acquiring the null parameters, which is described by each column's sample size.
Since each column may get a different sample set, this could cause issues because one column may have a higher sample count than another, e.g.
# presuming > 1 col
list(profiler.profile.values())[0].sample_size
# may not equal, (notice the column index change)
list(profiler.profile.values())[1].sample_size
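For the rows_ingested overwrite described above, the fix amounts to accumulating rather than assigning. The class below is a hypothetical stand-in used only to show the difference; it is not the Profiler's actual code.

```python
class RowCounter:
    """Hypothetical stand-in for the profiler's row bookkeeping."""

    def __init__(self, data=None):
        self.rows_ingested = 0
        if data is not None:
            self.update(data)

    def update(self, data):
        # `+=` keeps a running total across batches; `=` would overwrite it
        # with the size of the latest batch only.
        self.rows_ingested += len(data)


rows = [1, 2, 3, 4]
counter = RowCounter(rows[:2])
counter.update(rows[2:])
```

With this change the reproduction's final check would hold, because the second batch adds to the count from the first.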
Test: dataprofiler/tests/data/csv/quote-test-singlequote.txt
Currently, the function predicts it as TextData, which is incorrect.
Offending function / line:
DataProfiler/dataprofiler/profilers/profile_builder.py
Lines 294 to 302 in 1a47b8e
The gist is that these should either be estimates OR if they are real numbers, it should be made clear to the end-user. Probably with an explanation of the command they can use to either estimate it, give raw counts and / or how to do the full sample.
Incidentally, this is also one of the slowest functions, taking the most cumulative time (see the line marked ** below):
4688319 function calls (4610366 primitive calls) in 2.778 seconds
Ordered by: internal time
List reduced from 1162 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
23976 0.403 0.000 0.559 0.000 numerical_column_stats.py:386(_get_percentile)
755201 0.146 0.000 0.146 0.000 {method 'search' of 're.Pattern' objects}
80 0.140 0.002 0.378 0.005 {pandas._libs.lib.map_infer_mask}
**10 0.138 0.014 0.881 0.088 profile_builder.py:217(get_base_props_and_clean_null_params)**
755160 0.092 0.000 0.237 0.000 object_array.py:120(<lambda>)
20 0.088 0.004 0.216 0.011 utils.py:50(shuffle_in_chunks)
107900 0.071 0.000 0.170 0.000 base_column_profilers.py:47(_combine_unique_sets)
350 0.069 0.000 0.137 0.000 {pandas._libs.lib.map_infer}
32993 0.068 0.000 0.068 0.000 {method 'reduce' of 'numpy.ufunc' objects}
222 0.056 0.000 0.056 0.000 {pandas._libs.lib.infer_dtype}
10 0.054 0.005 0.220 0.022 text_column_profile.py:95(_update_vocab)
123750/123726 0.054 0.000 0.054 0.000 numerical_column_stats.py:85(__getattribute__)
107880 0.052 0.000 0.119 0.000 random.py:174(randrange)
168 0.051 0.000 0.114 0.001 numerical_column_stats.py:239(_total_histogram_bin_variance)
107880 0.047 0.000 0.047 0.000 numerical_column_stats.py:545(is_int)
12798/10902 0.047 0.000 0.056 0.000 {built-in method numpy.array}
107930 0.046 0.000 0.068 0.000 random.py:224(_randbelow)
92 0.037 0.000 0.037 0.000 {built-in method pandas._libs.missing.isnaobj}
227832/227831 0.036 0.000 0.049 0.000 {built-in method builtins.isinstance}
107880 0.034 0.000 0.034 0.000 numerical_column_stats.py:526(is_float)
The main time is the regex:
DataProfiler/dataprofiler/profilers/profile_builder.py
Lines 266 to 278 in 1a47b8e
Possible solution: Merge all the null searches into one query (as opposed to multiple queries), then split the results once indexes are matched. I believe it's possible to get an 8x speedup here.
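A minimal sketch of the merged search, assuming the null checks can be expressed as regular expressions: each pattern becomes a named group in a single alternation, every value is searched once, and the matching group's name is used to split the indexes afterwards. The pattern set and the `find_null_indexes` name are hypothetical, and no speedup figure is implied by this example.

```python
import re
from collections import defaultdict

# Hypothetical null patterns; the real set lives in the profiler.
NULL_PATTERNS = {
    "empty": r"^\s*$",
    "nan": r"^\s*nan\s*$",
    "none": r"^\s*none\s*$",
    "null": r"^\s*null\s*$",
}

# One compiled pattern: (?P<empty>...)|(?P<nan>...)|...
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in NULL_PATTERNS.items()),
    flags=re.IGNORECASE,
)


def find_null_indexes(values):
    """Search each value once and group matching indexes by null type."""
    found = defaultdict(list)
    for index, value in enumerate(values):
        match = COMBINED.search(str(value))
        if match:
            # lastgroup is the name of the alternative that matched.
            found[match.lastgroup].append(index)
    return dict(found)


null_indexes = find_null_indexes(["1", "", "NaN", "abc", " null ", "None"])
```

The single pass replaces one search per null type per value, which is where the regex time in the profile above is spent.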