alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.

Home Page: https://woodwork.alteryx.com

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.16% Python 99.84%
python machine-learning data-science typing woodwork featuretools semantic-tags dataframe dataframes evalml

woodwork's Introduction

Woodwork



Woodwork provides a common typing namespace for using your existing DataFrames in Featuretools, EvalML, and general ML. A Woodwork DataFrame stores the physical, logical, and semantic data types present in the data. In addition, it can store metadata about the data, allowing you to store specific information you might need for your application.

Installation

Install with pip:

python -m pip install woodwork

or from the conda-forge channel on conda:

conda install -c conda-forge woodwork

Add-ons

Update checker - Receive automatic notifications of new Woodwork releases

python -m pip install "woodwork[updater]"

Example

Below is an example of using Woodwork. In this example, a sample dataset of order items is used to create a Woodwork DataFrame, specifying the LogicalType for five of the columns.

import pandas as pd
import woodwork as ww

df = pd.read_csv("https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv")
df.ww.init(name='retail')
df.ww.set_types(logical_types={
    'quantity': 'Integer',
    'customer_name': 'PersonFullName',
    'country': 'Categorical',
    'order_id': 'Categorical',
    'description': 'NaturalLanguage',
})
df.ww
                   Physical Type     Logical Type Semantic Tag(s)
Column
order_id                category      Categorical    ['category']
product_id              category      Categorical    ['category']
description               string  NaturalLanguage              []
quantity                   int64          Integer     ['numeric']
order_date        datetime64[ns]         Datetime              []
unit_price               float64           Double     ['numeric']
customer_name             string   PersonFullName              []
country                 category      Categorical    ['category']
total                    float64           Double     ['numeric']
cancelled                   bool          Boolean              []

We have now initialized Woodwork on the DataFrame with the specified logical types assigned. For columns that did not have a specified logical type, Woodwork has automatically inferred the logical type from the underlying data. Additionally, Woodwork has automatically assigned semantic tags to some of the columns, based on the inferred or assigned logical type.

If we want to do further analysis on only the columns in this table that have a logical type of Boolean or a semantic tag of numeric, we can simply select those columns and access a DataFrame containing just them:

filtered_df = df.ww.select(include=['Boolean', 'numeric'])
filtered_df
    quantity  unit_price   total  cancelled
0          6      4.2075  25.245      False
1          6      5.5935  33.561      False
2          8      4.5375  36.300      False
3          6      5.5935  33.561      False
4          6      5.5935  33.561      False
..       ...         ...     ...        ...
95         6      4.2075  25.245      False
96       120      0.6930  83.160      False
97        24      0.9075  21.780      False
98        24      0.9075  21.780      False
99        24      0.9075  21.780      False

As you can see, Woodwork makes it easy to manage typing information for your data, and provides simple interfaces to access only the data you need based on the logical types or semantic tags. Please refer to the Woodwork documentation for more detail on working with a Woodwork DataFrame.

Support

The Woodwork community is happy to provide support to users of Woodwork. Project support can be found in four places depending on the type of question:

  1. For usage questions, use Stack Overflow with the woodwork tag.
  2. For bugs, issues, or feature requests, start a GitHub issue.
  3. For discussion regarding development on the core library, use Slack.
  4. For everything else, the core developers can be reached by email at [email protected]

Built at Alteryx

Woodwork is an open source project built by Alteryx. To see the other open source projects we’re working on visit Alteryx Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.

Alteryx Open Source


woodwork's Issues

Add support for setting the ranking in Ordinal Logical Type

  • The Ordinal Logical type should allow the user to specify the ranking of the values
  • For example
Ordinal(ranking=['hot', 'hotter', 'hottest'])
  • We should save this attribute to the Logical Type (add a unit test to check)
  • We should also think about how we can properly convert from the strings -> integers (with proper ranking), and back the other way (integers -> strings). Downstream ML algorithms may require the integers, but the strings may be more useful to the user.
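
The string-to-integer round trip described above could look like the following minimal sketch (the `Ordinal` class, its `ranking` parameter, and the `encode`/`decode` method names are illustrative assumptions, not Woodwork's implementation):

```python
# Hypothetical sketch of an Ordinal logical type that stores a ranking
# and converts between string values and their integer ranks.
class Ordinal:
    def __init__(self, ranking):
        # ranking lists the values from lowest to highest
        self.ranking = list(ranking)
        self._to_int = {value: i for i, value in enumerate(self.ranking)}

    def encode(self, values):
        # strings -> integers, preserving the supplied order
        return [self._to_int[v] for v in values]

    def decode(self, codes):
        # integers -> strings
        return [self.ranking[c] for c in codes]


temps = Ordinal(ranking=['hot', 'hotter', 'hottest'])
codes = temps.encode(['hotter', 'hot', 'hottest'])  # [1, 0, 2]
```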

Add time_index tag to time index column passed in init

  • If the user specifies a time index in the constructor, we add the time_index tag to the semantic tags for that Data Column
df = pd.read_csv(...)

dt = DataTable(df,
               name='retail',
               index='id',
               time_index='datetime')
Check that columns specified in semantic_types are present in dataframe

If the user supplies semantic_types during creation of a DataTable, we should check that all the specified columns are present in the dataframe and raise an error if any specified columns are not found. We should also include a check to verify that the proper object type is passed for this parameter. These checks can be added to _validate_params.

Check that columns specified in logical_types are present in dataframe

If the user supplies logical_types during creation of a DataTable, we should check that all the specified columns are present in the dataframe and raise an error if any specified columns are not found. We should also include a check to verify that the proper object type is passed for this parameter. These checks can be added to _validate_params.

Add WholeNumber to infer_logical_type

  • We are differentiating between WholeNumbers and Integers in the Logical Types, and our inference code should do the same

Whole numbers are all natural numbers including 0 e.g. 0, 1, 2, 3, 4…

Integers include all whole numbers and their negative counterparts e.g. … -4, -3, -2, -1, 0, 1, 2, 3, 4, …
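
The distinction above could be applied during inference with a check like this minimal sketch (the function name `infer_integer_type` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical sketch: an integer column whose non-null values are all
# non-negative is a WholeNumber; otherwise it is an Integer.
def infer_integer_type(series):
    if (series.dropna() >= 0).all():
        return 'WholeNumber'
    return 'Integer'


assert infer_integer_type(pd.Series([0, 1, 2, 3])) == 'WholeNumber'
assert infer_integer_type(pd.Series([-4, -3, 2])) == 'Integer'
```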

Add set_types for setting the DataTable logical types after init

  • After a user initializes a DataTable, they may want to change the logical types for some of the columns. This function should allow them to do that
dt = DataTable(df, 
               name='retail', # default to df.name
               index=None, 
               time_index=None)

# set logical type
dt.set_types({
    "datetime": Datetime,
    "comments": NaturalLanguage, # change the underlying dtype to string
    "store_id": Categorical # changes the underlying dtype to categorical
})

Add docs to readthedocs

  • We would like to build out docs on Read the Docs, and have them auto-deploy for releases as well as merges into master

Boolean type conversion does not get called

If a user supplies a specific variable_types entry for a Boolean variable that has an underlying "object" dtype and includes type information as kwargs, the associated logic in Entity.entityset_convert_variable_type does not get called. This is because Boolean is a subclass of Discrete, and the "object" dtype is in "_categorical_types".

Relevant code is in Entity.convert_variable_types().

What this means is that if I have a dataframe with a string column that I call a Boolean:

df = pd.DataFrame({'boolean': ['y', 'n', 'y'], 'id': [0,1,2]})
variable_types = {'boolean': (ft.variable_types.Boolean, {'true_val': 'y', 'false_val': 'n'})}
es.entity_from_dataframe('test', df, index='id', variable_types=variable_types)

The column remains a string, does not get converted to a Boolean, and so features like PercentTrue fail because they expect boolean inputs.
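
The conversion that should run in this case could be sketched as follows (the standalone `convert_boolean` function is an illustrative assumption; the real logic lives in Entity.convert_variable_types()):

```python
import pandas as pd

# Hypothetical sketch of converting a string column declared as Boolean,
# using the true_val / false_val kwargs from the variable_types entry.
def convert_boolean(series, true_val, false_val):
    # map the two sentinel strings onto Python booleans
    mapped = series.map({true_val: True, false_val: False})
    return mapped.astype('bool')


s = pd.Series(['y', 'n', 'y'])
converted = convert_boolean(s, true_val='y', false_val='n')
# converted now has a boolean dtype, so primitives like PercentTrue work
```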

Add index tag to index column passed in init

  • If the user passes an index column to the DataTable constructor, we can set the tag for that Data Column as index
df = pd.read_csv(...)

dt = DataTable(df,
               name='retail',
               index='id',
               time_index='datetime')

Add copy argument to set_logical_types, and default to False

  • We should add an argument to set_logical_types that specifies whether a new Data Column object should be created when setting the logical types after creation (init).
dt = DataTable(sample_df)
dt.set_logical_types({
    'full_name': FullName,
    'email': EmailAddress,
    'phone_number': PhoneNumber,
    'age': Double,
}, copy=False) # default to this, save memory

dt.set_logical_types({
    'full_name': FullName,
    'email': EmailAddress,
    'phone_number': PhoneNumber,
    'age': Double,
}, copy=True) # if user wants to change, and make a new Data Column copy

Move infer_variable_types from Featuretools to DataTables for Logical Types

  • Let's take the Featuretools function here, and move it into this library.
  • We can start with a signature like
def infer_logical_types(dataframe, index, time_index):
    # returns {column_name: LogicalType}

  • In terms of the actual Logical Types, let's take what we can from Featuretools's infer_variable_types.
  • Numeric is not a Logical Type in DataTables, which means we should use Double for now
  • Text should be NaturalLanguage
  • NaturalLanguage should be the logical type inferred if nothing else can be (instead of Unknown)

Add Infer IP Address Logical Type

  • We can infer that a column is an IPAddress type if it fits the following conditions
  • pd.dtype -> object / string
  • values are in
np.nan
IPv4 address (e.g., 172.16.254.1)
IPv6 address (e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334)
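
The conditions above could be checked with the standard library's ipaddress module, as in this minimal sketch (the function name `is_ip_address_column` is an illustrative assumption):

```python
import ipaddress
import math

# Hypothetical sketch of the inference rule: every non-null value must
# parse as an IPv4 or IPv6 address.
def is_ip_address_column(values):
    saw_value = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):  # allow np.nan
            continue
        try:
            ipaddress.ip_address(v)
        except ValueError:
            return False
        saw_value = True
    return saw_value  # an all-null column should not be inferred as IPAddress


assert is_ip_address_column(['172.16.254.1',
                             '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
assert not is_ip_address_column(['not-an-ip'])
```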

Add Infer LatLong Logical Type

  • We can infer that a column is a LatLong type if it fits the following conditions
  • pd.dtype -> object / string
  • values are in
np.nan
tuple(Latitude within -90 to +90, Longitude within -180 to +180)
tuple(np.nan, np.nan)
tuple(np.nan, valid Longitude)
tuple(valid Latitude, np.nan)
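
These conditions could be checked as in the following minimal sketch (the function names `is_latlong_column` and `_coord_ok` are illustrative assumptions):

```python
import math

# A coordinate member may be np.nan or a number within the given range.
def _coord_ok(value, low, high):
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, (int, float)) and low <= value <= high

# Hypothetical sketch of the LatLong inference rule: every non-null value
# must be a 2-tuple of (latitude, longitude) members as described above.
def is_latlong_column(values):
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            continue  # a bare np.nan entry is allowed
        if not (isinstance(v, tuple) and len(v) == 2):
            return False
        lat, lng = v
        if not (_coord_ok(lat, -90, 90) and _coord_ok(lng, -180, 180)):
            return False
    return True


assert is_latlong_column([(40.7, -74.0), (float('nan'), float('nan'))])
assert not is_latlong_column([(200.0, 0.0)])
```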

Add better boolean inference

        default = [1, True, "true", "True", "yes", "t", "T"]
        self.true_values = true_values or default
        default = [0, False, "false", "False", "no", "f", "F"]
        self.false_values = false_values or default
  • We should add this logic to our inference code so that if the unique values fall into [T, F] or [TRUE, FALSE], etc., then the column is inferred to be Boolean
  • Should the strings be changed? T -> True in the DataTable?
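
The rule above could be sketched as follows (the function name `looks_boolean` is an illustrative assumption; the value lists come from the defaults shown above):

```python
import pandas as pd

# Default true/false values, as listed in the snippet above.
TRUE_VALUES = {1, True, 'true', 'True', 'yes', 't', 'T'}
FALSE_VALUES = {0, False, 'false', 'False', 'no', 'f', 'F'}

# Hypothetical sketch: infer Boolean when the non-null unique values are
# all drawn from the recognized true/false values.
def looks_boolean(series):
    unique = set(series.dropna().unique())
    return bool(unique) and unique <= (TRUE_VALUES | FALSE_VALUES)


assert looks_boolean(pd.Series(['T', 'F', 'T']))
assert not looks_boolean(pd.Series(['T', 'F', 'maybe']))
```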

Add dataframe attribute to DataTable

  • The DataTable needs to store the input DataFrame and allow the user to retrieve it.
  • We need to add a unit test to make sure this retrieve works
import pandas as pd 
from data_tables import DataTable

df = pd.read_csv(...)

dt = DataTable(df)
assert isinstance(dt.dataframe, pd.DataFrame)

Add is_logical_type helper functions

  • We should have helper functions that can help identify if a python class matches a certain logical type
is_categorical_logical_type
is_integer_logical_type
....
  • It may not be necessary to have the logical_type part in the function name

Add DataTable types property to show physical, logical, semantic types

  • We want the user to be able to view all the Physical, Logical, and Semantic types on a DataTable
  • Therefore, we need a types property on DataTable that prints out data column names, Physical dtypes, Logical Types & Semantic Tags
class DataTable(object):
    def __repr__(self):
        # print out data column names, pandas dtypes, Logical Types & Semantic Tags
        # similar to df.type
dt = DataTable(df, 
               name='retail', # default to df.name
               index=None, 
               time_index=None)
print(dt)
# Column ----- Physical Type --- Logical Type --- Tags
# id           int64             Numeric          {(index, {})}
# expired      boolean           Boolean          set()
# card_id      object            Categorical      set()
# datetime     datetime64[ns]    Datetime         {(time_index, {})}
# ship_date    datetime64[ns]    Datetime         set()
# store_id     category          Categorical      set()
# comments     string            NaturalLanguage  set()
  • pandas has a nice dtypes attribute, so we should try to mirror it and return a DataFrame.
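
Building that DataFrame could look like this minimal sketch (the helper name `build_types_frame` and its input shape are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sketch of assembling the `types` DataFrame that mirrors
# pandas' dtypes, one row per data column.
def build_types_frame(columns):
    # columns: {name: (physical_type, logical_type, semantic_tags)}
    return pd.DataFrame(
        list(columns.values()),
        index=pd.Index(columns.keys(), name='Column'),
        columns=['Physical Type', 'Logical Type', 'Semantic Tag(s)'],
    )


types = build_types_frame({
    'id': ('int64', 'Integer', {'index'}),
    'datetime': ('datetime64[ns]', 'Datetime', {'time_index'}),
})
```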

Allow user to update index and time index after creation

Users should be able to update the index and time index after a DataTable has been created:

dt.index = 'id'
dt.time_index = 'datetime'

When this is done the proper validation checks should be performed to make sure the supplied values are valid and can be used for the index or time index.
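
A property setter with validation could look like this minimal sketch (the `DataTable` shown here is a stripped-down stand-in, and the specific checks are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sketch of updating the index after creation, with
# validation performed in the setter.
class DataTable:
    def __init__(self, dataframe):
        self._dataframe = dataframe
        self._index = None

    @property
    def index(self):
        return self._index

    @index.setter
    def index(self, column_name):
        # the column must exist and contain unique values
        if column_name not in self._dataframe.columns:
            raise LookupError(f'{column_name} not found in dataframe')
        if not self._dataframe[column_name].is_unique:
            raise ValueError(f'{column_name} contains duplicate values')
        self._index = column_name


dt = DataTable(pd.DataFrame({'id': [0, 1, 2],
                             'datetime': pd.to_datetime(['2020-01-01'] * 3)}))
dt.index = 'id'
```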

Add Mutual Information utility function

  • Since DataTables contain typing information about the data, we can use that information to calculate statistics.
  • One statistic we can calculate is the mutual information between Data Columns
  • We should have a unit test to verify the output, and support for all valid types.
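
For two categorical columns, a plug-in estimate of mutual information can be computed from the joint frequency table, as in this minimal sketch (the function name and estimator choice are illustrative assumptions, not Woodwork's implementation):

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: mutual information between two categorical
# Data Columns via their normalized joint frequency table.
def mutual_information(a, b):
    joint = pd.crosstab(a, b, normalize=True)
    p_a = joint.sum(axis=1).to_numpy()  # marginal of a
    p_b = joint.sum(axis=0).to_numpy()  # marginal of b
    mi = 0.0
    for i, row in enumerate(joint.to_numpy()):
        for j, p_ab in enumerate(row):
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a[i] * p_b[j]))
    return mi


a = pd.Series(['x', 'x', 'y', 'y'])
identical = mutual_information(a, a)  # maximal: equals the entropy of a
independent = mutual_information(a, pd.Series(['u', 'v', 'u', 'v']))  # ~0
```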

Add Infer PostalCode Logical Type

  • We can infer that a column is a PostalCode type if it fits the following conditions
  • pd.dtype -> object / string
  • values are in
np.nan
5 digit ZIPCode "#####"  
9 digit ZIPCode (dash) "#####-####"
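
These conditions could be checked with a regular expression, as in this minimal sketch (the function name `is_postal_code_column` is an illustrative assumption):

```python
import re

# Matches 5-digit "#####" and 9-digit dashed "#####-####" codes.
POSTAL_CODE_RE = re.compile(r'^\d{5}(-\d{4})?$')

# Hypothetical sketch of the PostalCode inference rule: every non-null
# (string) value must match one of the two formats.
def is_postal_code_column(values):
    non_null = [v for v in values if isinstance(v, str)]
    return bool(non_null) and all(POSTAL_CODE_RE.match(v) for v in non_null)


assert is_postal_code_column(['60622', '02101-7209'])
assert not is_postal_code_column(['6062'])
```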

Change dtype of object into string for string-like Logical Types

  • If a Logical Type is inferred/set to any of the following, we should change the underlying dtype to string
FilePath
FullName
IPAddress 
LatLong
NaturalLanguage
PhoneNumber 
URL
  • Note: some of these may not be inferred by infer_logical_types, but we should write it such that if the logical type is in the list above, the dtype changes to string

Add convert to string for Logical Types

  • We want the user to be able to give a string instead of the Python Class when passing a Logical Type.
  • The user should be able to pass either the actual Python class (DateOfBirth) or the type string (date_of_birth).
  • We may also want the user to be able to pass the python string of the class name (DateOfBirth)
dt = DataTable(df, ...,
               logical_types={
                   "birth_day": 'date_of_birth',
                   "card_id": 'categorical',
                   "customer_id": Categorical,
               })
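
Resolving all three input forms could work as in this minimal sketch (the registry and the `resolve_logical_type` helper are illustrative assumptions; the two classes are stand-ins for Woodwork's logical types):

```python
# Stand-in logical type classes with a snake_case type string each.
class Categorical:
    type_string = 'categorical'

class NaturalLanguage:
    type_string = 'natural_language'

_REGISTRY = {Categorical, NaturalLanguage}

# Hypothetical sketch: accept the class itself, the type string, or the
# class name string, and return the logical type class.
def resolve_logical_type(value):
    if isinstance(value, type) and value in _REGISTRY:
        return value
    for logical_type in _REGISTRY:
        if value in (logical_type.type_string, logical_type.__name__):
            return logical_type
    raise ValueError(f'Unknown logical type: {value!r}')


assert resolve_logical_type('categorical') is Categorical
assert resolve_logical_type('NaturalLanguage') is NaturalLanguage
assert resolve_logical_type(Categorical) is Categorical
```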

Update object constructors to allow setting of semantic types

The __init__ methods for DataTable and DataColumn need to be updated to allow the setting of semantic type tags during the creation of the objects.

After this is implemented, a unit test should be created to verify that the semantic types for a column are retained properly after calling DataColumn.set_logical_types to update a column logical type.

Add index column is unique check

  • When a user creates a DataTable and provides an index column, we should check that the index column is unique.
import pandas as pd 

pd.Series([1, 2, 3]).is_unique
  • We should add a test for this with valid and invalid index

Change dtype of Categorical Logical Types into categorical

  • If a Logical Type is inferred/set to any of the following, we should change the underlying dtype to categorical
Categorical
CountryCode
Ordinal
SubRegionCode
ZIPCode
  • Note: some of these may not be inferred by infer_logical_types, but we should write it such that if the logical type is in the list above, the dtype changes to categorical
