great-expectations / great_expectations
Always know what to expect from your data.
Home Page: https://docs.greatexpectations.io/
License: Apache License 2.0
Decide on expected behavior for (and implement):
expect_column_numerical_distribution_to_be
expect_column_frequency_distribution_to_be
Notes from July 10th call:
Keep column_map_expectation and column_aggregate_expectation, and drop column_elementwise_expectation for now. (If we discover we need it, we can add it back.)
dataset.expect_function_to_be_elementwise_true('column', function)
=> assert (df.column.apply(function) == [True] * len(df.column)).all()
dataset.expect_function_to_be_true('column', function)
=> assert function(df.column) == True
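The two proposed call signatures can be sketched concretely. This is a minimal illustration of the intended semantics, not the project's implementation; the helper functions are hypothetical stand-ins.

```python
import pandas as pd

df = pd.DataFrame({"column": [2, 4, 6]})

# Elementwise form: apply the function to each value; every row must pass.
def is_even(value):
    return value % 2 == 0

elementwise_result = bool(df["column"].apply(is_even).all())

# Whole-column form: the function receives the full Series and returns
# a single boolean about the column as a whole.
def has_positive_mean(series):
    return series.mean() > 0

column_result = bool(has_positive_mean(df["column"]))
```

The elementwise form maps a scalar predicate over each value, while the whole-column form lets the caller assert an arbitrary aggregate property.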
output_format is categorical, not strictly ordered. This makes the output API more flexible and extensible.
Add true_value for aggregate_column_expectations.
Rename include_lineage to include_kwargs. Also make it clear that expectations have only kwargs, no args.
Drop row_index_list as a return value. (This gets complicated in some non-pandas systems.)
...which is very broken.
expect_column_value_lengths_to_be_between uses exclusive boundaries, so you can't specify that all values are of the same length. For example:
drg.expect_column_value_lengths_to_be_between(column=" Average Covered Charges ", min_value=9, max_value=9)
will return:
{'exception_list': [
'$105929.47',
'$101282.03',
'$146892.00',
...
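The fix the report implies is inclusive bounds: with min_value <= len(value) <= max_value, setting min_value == max_value asserts that every value has exactly that length. A minimal sketch (the helper name is hypothetical, not the library's API):

```python
# Inclusive length check: returns the values that fall outside the bounds,
# mirroring the exception_list idea from the report.
def value_length_exceptions(values, min_value, max_value):
    return [v for v in values if not (min_value <= len(v) <= max_value)]

# '$105929.47' has 10 characters, so inclusive bounds of 10 accept it.
exceptions = value_length_exceptions(
    ["$105929.47", "$101282.03", "$146892.00"], min_value=10, max_value=10
)
```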
The nose project recommends that new projects use a newer framework in order to support Python 3, and we want to be as broadly compatible as possible.
Returns False:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, include_config=True)['success']
Returns True:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, max_value=1, include_config=True)['success']
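The behavior the second call relies on (and the first call arguably should too) is that a missing bound is treated as unbounded. A sketch of that semantics, with a hypothetical helper name:

```python
# Proportion-of-unique-values check where an omitted bound means "no bound",
# so min_value=1 alone should still pass for a fully unique column.
def proportion_unique_between(values, min_value=None, max_value=None):
    proportion = len(set(values)) / float(len(values))
    above_min = min_value is None or proportion >= min_value
    below_max = max_value is None or proportion <= max_value
    return above_min and below_max

all_unique = proportion_unique_between(["a", "b", "c"], min_value=1)
has_dupes = proportion_unique_between(["a", "a", "b"], min_value=1)
```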
Distributional expectations are different from all the other @column_aggregate_expectations:
They need to accept a confidence_threshold argument, similar to mostly for column_map_expectations. Unlike mostly, confidence_threshold isn't optional.
In addition to a true_value, they should also return a confidence_value:
{
success : boolean,
true_value : partitioned_weights,
confidence_value : float on [0,1]
}
The difference isn't fundamentally because these are expectations about distributions. The difference is because these are statistical assumptions.
How should we capture this in our expectations?
Option 1: Create a @column_statistical_expectation decorator.
Option 2: Add extra fields and parameters to the distributional expectations.
I lean towards (1). Jerry-rigging extra fields and parameters in (2) seems like it could get sticky pretty fast. And statistical expectations are a pattern that I expect to use more in the future.
@jcampbell, @dgmiller Thoughts?
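Option (1) could look something like the following. This is a hypothetical sketch, not the project's decorator: it makes confidence_threshold a required keyword argument and wraps the statistic's raw (true_value, confidence_value) pair into the result dict proposed above.

```python
# Hypothetical @column_statistical_expectation decorator (option 1).
def column_statistical_expectation(func):
    def wrapper(column, *args, confidence_threshold, **kwargs):
        true_value, confidence_value = func(column, *args, **kwargs)
        return {
            "success": confidence_value >= confidence_threshold,
            "true_value": true_value,
            "confidence_value": confidence_value,
        }
    return wrapper

@column_statistical_expectation
def expect_column_weights(column):
    # Toy statistic standing in for a real distributional test.
    return [0.5, 0.5], 0.95

result = expect_column_weights([1, 2, 3], confidence_threshold=0.9)
```

Because confidence_threshold is keyword-only with no default, omitting it raises a TypeError, which enforces the "isn't optional" requirement at the call site.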
Currently, the API doesn't ensure that parallelism will necessarily exist in the expectations implemented in classes that inherit from DataSet.
For example, if an expectation is written on a column that does not exist, that expectation will currently be added immediately, even if it is never evaluated.
Not this:
@DataSet.column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
But this:
@column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
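A sketch of why the module-level form works: the decorator only needs to wrap the method, so it does not have to live on DataSet itself. The fail-fast check for a missing column is a hypothetical illustration of the validation hook such a decorator could add, not the library's actual behavior.

```python
# Module-level decorator (the "But this" form above).
def column_expectation(func):
    def wrapper(self, column, *args, **kwargs):
        # Hypothetical validation: fail fast on a missing column instead of
        # silently registering an expectation that is never evaluated.
        if column not in self.columns:
            raise KeyError(column)
        return func(self, column, *args, **kwargs)
    return wrapper

class DataSet:
    def __init__(self, data):
        self.data = data
        self.columns = list(data)

    @column_expectation
    def expect_column_values_to_be_unique(self, column, mostly=None):
        values = self.data[column]
        return {"success": len(set(values)) == len(values)}

ds = DataSet({"id": [1, 2, 3]})
result = ds.expect_column_values_to_be_unique("id")
```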
expect_table_row_count_to_equal should be changed to DataSet.table_expectation
expect_table_row_count_to_be_between should be changed to DataSet.table_expectation
expect_column_values_to_be_subset should be changed to DataSet.expectation
update decorators in base.py to use the new python 3 logic
Currently, ensure_json_serializable uses isinstance in a way that does not work for both Python 2 and 3.
Essentially, this pushes the project to Python 2 only in this version (since @abegong added the unicode type back to the check).
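One common cross-version pattern, sketched here as an assumption about how the check could be written: pin the accepted string types once, since `unicode` exists only on Python 2 and cannot be referenced directly on Python 3.

```python
import sys

# `unicode` only exists on Python 2, so choose the tuple of accepted
# string types once instead of naming `unicode` in the isinstance call.
if sys.version_info[0] >= 3:
    string_types = (str, bytes)
else:  # Python 2 branch
    string_types = (str, unicode)  # noqa: F821

def is_string(value):
    return isinstance(value, string_types)
```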
Add more thorough unit tests for
expect_table_row_count_to_be_between
expect_table_row_count_to_equal
expect_column_values_to_be_dateutil_parseable
expect_column_values_to_be_valid_json
expect_column_stdev_to_be_between
Consider the case of something like the following:
for column in df.columns:
df.expect_column_mean_to_be_between(column, min, max)
Currently, this will work, but we would need to wrap the expectation statement in print() to see the output at all, and even then we cannot see which column the expectation was about, unless we also coerce printing of the dictionary returned by the expectation. Is this a useful pattern?
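One way to make the pattern useful is to collect each result keyed by column name, so the loop's output is inspectable without wrapping every call in print(). A sketch with a stand-in for the real expectation method:

```python
import pandas as pd

# Stand-in for the real expectation: returns the usual result dict.
def expect_column_mean_to_be_between(series, min_value, max_value):
    return {"success": bool(min_value <= series.mean() <= max_value)}

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Keyed by column name, so a failing column is identifiable afterwards.
results = {
    column: expect_column_mean_to_be_between(df[column], 0, 25)
    for column in df.columns
}
```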
What are all the generic parameters that Expectations should accept?
All Expectations
For column_map_expectations
For column_aggregate_expectations
What other logic can we include?
Input validation
Output validation
Docstring propagation...
Create and append the Expectation to the dataset
Logic for de-duplication/updating Expectations
A bit of sugar:
dataset.column_name.expect_something(arg1, arg2)
should evaluate to dataset.expect_something(column_name, arg1, arg2)
...and ipython should be able to autocomplete the expectation on tab.
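The sugar above could be implemented with a small proxy object. This is a hypothetical sketch of the mechanism, not the library's design: attribute access on the dataset returns a per-column proxy that forwards expect_* calls back with the column name injected as the first argument.

```python
class ColumnProxy:
    # Forwards any method call back to the dataset, prepending the column.
    def __init__(self, dataset, column):
        self._dataset = dataset
        self._column = column

    def __getattr__(self, name):
        method = getattr(self._dataset, name)
        return lambda *args, **kwargs: method(self._column, *args, **kwargs)

class DataSet:
    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called for names not found normally, so real methods win.
        if name in self._data:
            return ColumnProxy(self, name)
        raise AttributeError(name)

    def expect_something(self, column, arg1, arg2):
        return (column, arg1, arg2)

ds = DataSet({"age": [1, 2, 3]})
result = ds.age.expect_something(10, 20)
```

For the ipython tab-completion requirement, the proxy would also need a `__dir__` that lists the dataset's expect_* method names.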
Right now, there's no way to programmatically reference them from Expectations.
The code is 90% redundant. Isn't there some way to refactor these?
I would have expected it to throw a TypeError; instead, it uses the value of the numeric data.
I think of the "length" of an int or float as meaningless. As such, I would expect this expectation to only work for strings.
Lots of the expectations don't implement suppress_exceptions.
Unit tests have been refactored and converted to work in python 3. See commit comments for specific details.
Docstrings updated for version 0.1
{
"partitions" : [0.0, 0.1, 0.3, 0.6, 1.0],
"weights" : [0.4, 0.05, 0.25, 0.3, 0.0]
}
partitions specifies the lower bound for each partition.
weights specifies the total mass within each partition (lower_bound <= value < upper_bound).
Note: Are there JSON-serializable versions of inf and -inf?
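For reference: standard JSON has no literal for infinity. Python's json module emits the non-standard token `Infinity` by default, and raises a ValueError if allow_nan=False. One portable convention (an assumption here, not a project decision) is to serialize open-ended partition bounds as null:

```python
import json

# Default behavior: non-standard `Infinity` token, which strict parsers reject.
nonstandard = json.dumps(float("inf"))

# Convention sketch: represent an open-ended bound as null on the wire.
def encode_bound(value):
    return None if value in (float("inf"), float("-inf")) else value

encoded = json.dumps([encode_bound(b) for b in [float("-inf"), 0.0, 1.0]])
```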
Putting weights and partitions together into a single object has several advantages: it can be passed around as a true_value and in other parameters. Using a PDF instead of a CDF has some advantages, too.
great_expectations my_dataset.csv my_expectations.json --output_format=BOOLEAN_ONLY --catch_exceptions=False --include_config=True
[Replication example needed]
We need to either fix this, or document and own it.
Also, we should write tests against this.
At this stage in the decorator refactor, this is the single biggest source of uncertainty for me.
Running unit tests currently requires a mix of:
python -m unittest tests (for things converted to unittest)
nosetests (for those not yet converted)
We need to finish the conversion and add instructions to the developer/contributor docs.
Close #39 with improvements for documentation, unit tests, and bug fixes.
Distributional expectations need:
Currently, a user can create an expectation using parameters that are not JSON-serializable and not become aware of the error until attempting to save the config.