great-expectations / great_expectations
Always know what to expect from your data.
Home Page: https://docs.greatexpectations.io/
License: Apache License 2.0
Decide on expected behavior for (and implement):
expect_column_numerical_distribution_to_be
expect_column_frequency_distribution_to_be
Notes from July 10th call:
Keep column_map_expectation and column_aggregate_expectation, and drop column_elementwise_expectation for now. (If we discover we need it, we can add it back.)
dataset.expect_function_to_be_elementwise_true('column', function)
=> assert (df.column.apply(function) == [True] * len(df.column)).all()
dataset.expect_function_to_be_true('column', function)
=> assert function(df.column) == True
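The two proposed call signatures can be sketched concretely. This is a minimal illustration of the intended semantics, not the project's implementation; the helper functions are hypothetical stand-ins.

```python
import pandas as pd

df = pd.DataFrame({"column": [2, 4, 6]})

# Elementwise form: apply the function to each value; every row must pass.
def is_even(value):
    return value % 2 == 0

elementwise_result = bool(df["column"].apply(is_even).all())

# Whole-column form: the function receives the full Series and returns
# a single boolean about the column as a whole.
def has_positive_mean(series):
    return series.mean() > 0

column_result = bool(has_positive_mean(df["column"]))
```

The elementwise form maps a scalar predicate over each value, while the whole-column form lets the caller assert an arbitrary aggregate property.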
output_format is categorical, not strictly ordered. This makes the output API more flexible and extensible.
Add true_value for aggregate_column_expectations.
Rename include_lineage to include_kwargs. Also make it clear that expectations have only kwargs, no args.
Drop row_index_list as a return value. (This gets complicated in some non-pandas systems.)
...which is very broken.
expect_column_value_lengths_to_be_between uses exclusive boundaries, so you can't specify that all values are of the same length. For example:
drg.expect_column_value_lengths_to_be_between(column=" Average Covered Charges ", min_value=9, max_value=9)
will return:
{'exception_list': [
'$105929.47',
'$101282.03',
'$146892.00',
...
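The fix the report implies is inclusive bounds: with min_value <= len(value) <= max_value, setting min_value == max_value asserts that every value has exactly that length. A minimal sketch (the helper name is hypothetical, not the library's API):

```python
# Inclusive length check: returns the values that fall outside the bounds,
# mirroring the exception_list idea from the report.
def value_length_exceptions(values, min_value, max_value):
    return [v for v in values if not (min_value <= len(v) <= max_value)]

# '$105929.47' has 10 characters, so inclusive bounds of 10 accept it.
exceptions = value_length_exceptions(
    ["$105929.47", "$101282.03", "$146892.00"], min_value=10, max_value=10
)
```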
The nose project recommends that new projects use a newer framework in order to support Python 3, and we want to be as broadly compatible as possible.
Returns False:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, include_config=True)['success']
Returns True:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, max_value=1, include_config=True)['success']
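The behavior the second call relies on (and the first call arguably should too) is that a missing bound is treated as unbounded. A sketch of that semantics, with a hypothetical helper name:

```python
# Proportion-of-unique-values check where an omitted bound means "no bound",
# so min_value=1 alone should still pass for a fully unique column.
def proportion_unique_between(values, min_value=None, max_value=None):
    proportion = len(set(values)) / float(len(values))
    above_min = min_value is None or proportion >= min_value
    below_max = max_value is None or proportion <= max_value
    return above_min and below_max

all_unique = proportion_unique_between(["a", "b", "c"], min_value=1)
has_dupes = proportion_unique_between(["a", "a", "b"], min_value=1)
```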
Distributional expectations are different from all the other @column_aggregate_expectations:
They need to accept a confidence_threshold argument, similar to mostly for column_map_expectations. Unlike mostly, confidence_threshold isn't optional.
In addition to a true_value, they should also return a confidence_value:
{
success : boolean,
true_value : partitioned_weights,
confidence_value : float on [0,1]
}
The difference isn't fundamentally because these are expectations about distributions. The difference is because these are statistical assumptions.
How should we capture this in our expectations?
Option 1: Create a @column_statistical_expectation decorator.
Option 2: Add extra fields and parameters to the distributional expectations.
I lean towards (1). Jerry-rigging extra fields and parameters in (2) seems like it could get sticky pretty fast. And statistical expectations are a pattern that I expect to use more in the future.
@jcampbell, @dgmiller Thoughts?
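Option (1) could look something like the following. This is a hypothetical sketch, not the project's decorator: it makes confidence_threshold a required keyword argument and wraps the statistic's raw (true_value, confidence_value) pair into the result dict proposed above.

```python
# Hypothetical @column_statistical_expectation decorator (option 1).
def column_statistical_expectation(func):
    def wrapper(column, *args, confidence_threshold, **kwargs):
        true_value, confidence_value = func(column, *args, **kwargs)
        return {
            "success": confidence_value >= confidence_threshold,
            "true_value": true_value,
            "confidence_value": confidence_value,
        }
    return wrapper

@column_statistical_expectation
def expect_column_weights(column):
    # Toy statistic standing in for a real distributional test.
    return [0.5, 0.5], 0.95

result = expect_column_weights([1, 2, 3], confidence_threshold=0.9)
```

Because confidence_threshold is keyword-only with no default, omitting it raises a TypeError, which enforces the "isn't optional" requirement at the call site.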
Currently, the API doesn't ensure that parallelism will necessarily exist in the expectations implemented in classes that inherit from DataSet.
For example, if an expectation is written on a column that does not exist, that expectation will currently be added immediately, even if it is never evaluated.
Not this:
@DataSet.column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
But this:
@column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
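A sketch of why the module-level form works: the decorator only needs to wrap the method, so it does not have to live on DataSet itself. The fail-fast check for a missing column is a hypothetical illustration of the validation hook such a decorator could add, not the library's actual behavior.

```python
# Module-level decorator (the "But this" form above).
def column_expectation(func):
    def wrapper(self, column, *args, **kwargs):
        # Hypothetical validation: fail fast on a missing column instead of
        # silently registering an expectation that is never evaluated.
        if column not in self.columns:
            raise KeyError(column)
        return func(self, column, *args, **kwargs)
    return wrapper

class DataSet:
    def __init__(self, data):
        self.data = data
        self.columns = list(data)

    @column_expectation
    def expect_column_values_to_be_unique(self, column, mostly=None):
        values = self.data[column]
        return {"success": len(set(values)) == len(values)}

ds = DataSet({"id": [1, 2, 3]})
result = ds.expect_column_values_to_be_unique("id")
```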
expect_table_row_count_to_equal should be changed to DataSet.table_expectation
expect_table_row_count_to_be_between should be changed to DataSet.table_expectation
expect_column_values_to_be_subset should be changed to DataSet.expectation
update decorators in base.py to use the new python 3 logic
Currently, ensure_json_serializable uses isinstance in a way that does not work for both Python 2 and 3.
Essentially, this pushes the project to Python 2 only in this version (since @abegong added the unicode type back to the check).
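One common cross-version pattern, sketched here as an assumption about how the check could be written: pin the accepted string types once, since `unicode` exists only on Python 2 and cannot be referenced directly on Python 3.

```python
import sys

# `unicode` only exists on Python 2, so choose the tuple of accepted
# string types once instead of naming `unicode` in the isinstance call.
if sys.version_info[0] >= 3:
    string_types = (str, bytes)
else:  # Python 2 branch
    string_types = (str, unicode)  # noqa: F821

def is_string(value):
    return isinstance(value, string_types)
```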
Add more thorough unit tests for
expect_table_row_count_to_be_between
expect_table_row_count_to_equal
expect_column_values_to_be_dateutil_parseable
expect_column_values_to_be_valid_json
expect_column_stdev_to_be_between
Consider the case of something like the following:
for column in df.columns:
df.expect_column_mean_to_be_between(column, min, max)
Currently, this will work, but we would need to wrap the expectation statement in print() to see the output at all, and even then we cannot see which column the expectation was about, unless we also coerce printing of the dictionary returned by the expectation. Is this a useful pattern?
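One way to make the pattern useful is to collect each result keyed by column name, so the loop's output is inspectable without wrapping every call in print(). A sketch with a stand-in for the real expectation method:

```python
import pandas as pd

# Stand-in for the real expectation: returns the usual result dict.
def expect_column_mean_to_be_between(series, min_value, max_value):
    return {"success": bool(min_value <= series.mean() <= max_value)}

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Keyed by column name, so a failing column is identifiable afterwards.
results = {
    column: expect_column_mean_to_be_between(df[column], 0, 25)
    for column in df.columns
}
```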
What are all the generic parameters that Expectations should accept?
All Expectations
For column_map_expectations
For column_aggregate_expectations
What other logic can we include?
Input validation
Output validation
Docstring propagation...
Create and append the Expectation to the dataset
Logic for de-duplication/updating Expectations
A bit of sugar:
dataset.column_name.expect_something(arg1, arg2)
should evaluate to dataset.expect_something(column_name, arg1, arg2)
...and ipython should be able to autocomplete the expectation on tab.
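The sugar above could be implemented with a small proxy object. This is a hypothetical sketch of the mechanism, not the library's design: attribute access on the dataset returns a per-column proxy that forwards expect_* calls back with the column name injected as the first argument.

```python
class ColumnProxy:
    # Forwards any method call back to the dataset, prepending the column.
    def __init__(self, dataset, column):
        self._dataset = dataset
        self._column = column

    def __getattr__(self, name):
        method = getattr(self._dataset, name)
        return lambda *args, **kwargs: method(self._column, *args, **kwargs)

class DataSet:
    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called for names not found normally, so real methods win.
        if name in self._data:
            return ColumnProxy(self, name)
        raise AttributeError(name)

    def expect_something(self, column, arg1, arg2):
        return (column, arg1, arg2)

ds = DataSet({"age": [1, 2, 3]})
result = ds.age.expect_something(10, 20)
```

For the ipython tab-completion requirement, the proxy would also need a `__dir__` that lists the dataset's expect_* method names.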
Right now, there's no way to programmatically reference them from Expectations.
The code is 90% redundant. Isn't there some way to refactor these?
I would have expected it to throw a TypeError; instead, it uses the value of the numeric data.
I think of the "length" of an int or float as meaningless. As such, I would expect this expectation to only work for strings.
Lots of the expectations don't implement suppress_exceptions.
Unit tests have been refactored and converted to work in python 3. See commit comments for specific details.
Docstrings updated for version 0.1
{
"partitions" : [0.0, 0.1, 0.3, 0.6, 1.0],
"weights" : [0.4, 0.05, 0.25, 0.3, 0.0]
}
partitions specifies the lower bound for each partition.
weights specifies the total mass within each partition (lower_bound <= value < upper_bound).
Note: Are there JSON-serializable versions of inf and -inf?
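For reference: standard JSON has no literal for infinity. Python's json module emits the non-standard token `Infinity` by default, and raises a ValueError if allow_nan=False. One portable convention (an assumption here, not a project decision) is to serialize open-ended partition bounds as null:

```python
import json

# Default behavior: non-standard `Infinity` token, which strict parsers reject.
nonstandard = json.dumps(float("inf"))

# Convention sketch: represent an open-ended bound as null on the wire.
def encode_bound(value):
    return None if value in (float("inf"), float("-inf")) else value

encoded = json.dumps([encode_bound(b) for b in [float("-inf"), 0.0, 1.0]])
```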
Putting weights and partitions together into a single object has several advantages: it can be passed around as a true_value and in other parameters. Using a PDF instead of a CDF has some advantages, too.
great_expectations my_dataset.csv my_expectations.json --output_format=BOOLEAN_ONLY --catch_exceptions=False --include_config=True
[Replication example needed]
We need to either fix this, or document and own it.
Also, we should write tests against this.
At this stage in the decorator refactor, this is the single biggest source of uncertainty for me.
Running unit tests currently requires a mix of:
python -m unittest tests (for things converted to unittest)
nosetests (for those not yet converted)
We need to finish the conversion and add instructions to the developer/contributor docs.
Close #39 with improvements for documentation, unit tests, and bug fixes.
Distributional expectations need:
Currently, a user can create an expectation using parameters that are not JSON-serializable and not become aware of the error until attempting to save the config.