alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.

Home Page: https://woodwork.alteryx.com

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.16% Python 99.84%
python machine-learning data-science typing woodwork featuretools semantic-tags dataframe dataframes evalml

woodwork's Introduction

Woodwork



Woodwork provides a common typing namespace for using your existing DataFrames in Featuretools, EvalML, and general ML. A Woodwork DataFrame stores the physical, logical, and semantic data types present in the data. In addition, it can store metadata about the data, allowing you to store specific information you might need for your application.

Installation

Install with pip:

python -m pip install woodwork

or from the conda-forge channel on conda:

conda install -c conda-forge woodwork

Add-ons

Update checker - Receive automatic notifications of new Woodwork releases

python -m pip install "woodwork[updater]"

Example

Below is an example of using Woodwork. In this example, a sample dataset of order items is used to create a Woodwork DataFrame, specifying the LogicalType for five of the columns.

import pandas as pd
import woodwork as ww

df = pd.read_csv("https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv")
df.ww.init(name='retail')
df.ww.set_types(logical_types={
    'quantity': 'Integer',
    'customer_name': 'PersonFullName',
    'country': 'Categorical',
    'order_id': 'Categorical',
    'description': 'NaturalLanguage',
})
df.ww
                   Physical Type     Logical Type Semantic Tag(s)
Column
order_id                category      Categorical    ['category']
product_id              category      Categorical    ['category']
description               string  NaturalLanguage              []
quantity                   int64          Integer     ['numeric']
order_date        datetime64[ns]         Datetime              []
unit_price               float64           Double     ['numeric']
customer_name             string   PersonFullName              []
country                 category      Categorical    ['category']
total                    float64           Double     ['numeric']
cancelled                   bool          Boolean              []

We have now initialized Woodwork on the DataFrame with the specified logical types assigned. For columns that did not have a specified logical type, Woodwork has automatically inferred the logical type from the underlying data. Additionally, Woodwork has automatically assigned semantic tags to some of the columns, based on the inferred or assigned logical type.

If we want to do further analysis on only the columns in this table that have a logical type of Boolean or a semantic tag of numeric, we can simply select those columns and access a DataFrame containing just them:

filtered_df = df.ww.select(include=['Boolean', 'numeric'])
filtered_df
    quantity  unit_price   total  cancelled
0          6      4.2075  25.245      False
1          6      5.5935  33.561      False
2          8      4.5375  36.300      False
3          6      5.5935  33.561      False
4          6      5.5935  33.561      False
..       ...         ...     ...        ...
95         6      4.2075  25.245      False
96       120      0.6930  83.160      False
97        24      0.9075  21.780      False
98        24      0.9075  21.780      False
99        24      0.9075  21.780      False

As you can see, Woodwork makes it easy to manage typing information for your data, and provides simple interfaces to access only the data you need based on the logical types or semantic tags. Please refer to the Woodwork documentation for more detail on working with a Woodwork DataFrame.

Support

The Woodwork community is happy to provide support to users of Woodwork. Project support can be found in four places depending on the type of question:

  1. For usage questions, use Stack Overflow with the woodwork tag.
  2. For bugs, issues, or feature requests, start a GitHub issue.
  3. For discussion regarding development on the core library, use Slack.
  4. For everything else, the core developers can be reached by email at [email protected]

Built at Alteryx

Woodwork is an open source project built by Alteryx. To see the other open source projects we’re working on visit Alteryx Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.

Alteryx Open Source


woodwork's Issues

Add support for setting the ranking in Ordinal Logical Type

  • The Ordinal Logical type should allow the user to specify the ranking of the values
  • For example
Ordinal(ranking=['hot', 'hotter', 'hottest'])
  • We should save this attribute to the Logical Type (add a unit test to check)
  • We should also think about how we can properly convert from the strings -> integers (with proper ranking), and back the other way (integers -> strings). Downstream ML algorithms may require the integers, but the strings may be more useful to the user.
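
The string-to-integer round trip described above could look like the following minimal sketch (the `Ordinal` class, its `ranking` parameter, and the `encode`/`decode` method names are illustrative assumptions, not Woodwork's implementation):

```python
# Hypothetical sketch of an Ordinal logical type that stores a ranking
# and converts between string values and their integer ranks.
class Ordinal:
    def __init__(self, ranking):
        # ranking lists the values from lowest to highest
        self.ranking = list(ranking)
        self._to_int = {value: i for i, value in enumerate(self.ranking)}

    def encode(self, values):
        # strings -> integers, preserving the supplied order
        return [self._to_int[v] for v in values]

    def decode(self, codes):
        # integers -> strings
        return [self.ranking[c] for c in codes]


temps = Ordinal(ranking=['hot', 'hotter', 'hottest'])
codes = temps.encode(['hotter', 'hot', 'hottest'])  # [1, 0, 2]
```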

Add time_index tag to time index column passed in init

  • If the user specifies a time index in the constructor, we add the time_index tag to the semantic tags for that Data Column
df = pd.read_csv(...)

dt = DataTable(df,
               name='retail',
               index='id',
               time_index='datetime')
Check that columns specified in semantic_types are present in dataframe

If the user supplies semantic_types during creation of a DataTable, we should check that all the specified columns are present in the dataframe and raise an error if any specified columns are not found. We should also include a check to verify that the proper object type is passed for this parameter. These checks can be added to _validate_params.

Check that columns specified in logical_types are present in dataframe

If the user supplies logical_types during creation of a DataTable, we should check that all the specified columns are present in the dataframe and raise an error if any specified columns are not found. We should also include a check to verify that the proper object type is passed for this parameter. These checks can be added to _validate_params.

Add WholeNumber to infer_logical_type

  • We are differentiating between WholeNumbers and Integers in the Logical Types, and our inference code should do the same

Whole numbers are all natural numbers including 0 e.g. 0, 1, 2, 3, 4…

Integers include all whole numbers and their negative counterparts e.g. … -4, -3, -2, -1, 0, 1, 2, 3, 4, …
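
The distinction above could be applied during inference with a check like this minimal sketch (the function name `infer_integer_type` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical sketch: an integer column whose non-null values are all
# non-negative is a WholeNumber; otherwise it is an Integer.
def infer_integer_type(series):
    if (series.dropna() >= 0).all():
        return 'WholeNumber'
    return 'Integer'


assert infer_integer_type(pd.Series([0, 1, 2, 3])) == 'WholeNumber'
assert infer_integer_type(pd.Series([-4, -3, 2])) == 'Integer'
```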

Add set_types for setting the DataTable logical types after init

  • After a user initializes a DataTable, they may want to change the logical types for some of the columns. This function should allow them to do that
dt = DataTable(df, 
               name='retail', # default to df.name
               index=None, 
               time_index=None)

# set logical type
dt.set_types({
    "datetime": Datetime,
    "comments": NaturalLanguage, # change the underlying dtype to string
    "store_id": Categorical # changes the underlying dtype to categorical
})

Add docs to readthedocs

  • We would like to build out docs on Read the Docs, and have them auto-deploy for releases as well as merges into master

Boolean type conversion does not get called

If a user supplies a specific variable_types entry for a Boolean variable that has an underlying "object" dtype and includes type information as kwargs, the associated logic in Entity.entityset_convert_variable_type does not get called. This is because Boolean is a subclass of Discrete, and the "object" dtype is in "_categorical_types".

Relevant code is in Entity.convert_variable_types().

What this means is that if I have a dataframe with a string column that I call a Boolean:

df = pd.DataFrame({'boolean': ['y', 'n', 'y'], 'id': [0,1,2]})
variable_types = {'boolean': (ft.variable_types.Boolean, {'true_val': 'y', 'false_val': 'n'})}
es.entity_from_dataframe('test', df, index='id', variable_types=variable_types)

The column remains a string, does not get converted to a Boolean, and so features like PercentTrue fail because they expect boolean inputs.
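
The conversion that should run in this case could be sketched as follows (the standalone `convert_boolean` function is an illustrative assumption; the real logic lives in Entity.convert_variable_types()):

```python
import pandas as pd

# Hypothetical sketch of converting a string column declared as Boolean,
# using the true_val / false_val kwargs from the variable_types entry.
def convert_boolean(series, true_val, false_val):
    # map the two sentinel strings onto Python booleans
    mapped = series.map({true_val: True, false_val: False})
    return mapped.astype('bool')


s = pd.Series(['y', 'n', 'y'])
converted = convert_boolean(s, true_val='y', false_val='n')
# converted now has a boolean dtype, so primitives like PercentTrue work
```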

Add index tag to index column passed in init

  • If the user passes an index column to the DataTable constructor, we can set the tag for that Data Column as index
df = pd.read_csv(...)

dt = DataTable(df,
               name='retail',
               index='id',
               time_index='datetime')

Add copy argument to set_logical_types, and default to False

  • We should add an argument to set_logical_types that specifies whether a new Data Column object should be created when setting the logical types after creation (init).
dt = DataTable(sample_df)
dt.set_logical_types({
    'full_name': FullName,
    'email': EmailAddress,
    'phone_number': PhoneNumber,
    'age': Double,
}, copy=False) # default to this, save memory

dt.set_logical_types({
    'full_name': FullName,
    'email': EmailAddress,
    'phone_number': PhoneNumber,
    'age': Double,
}, copy=True) # if user wants to change, and make a new Data Column copy

Move infer_variable_types from Featuretools to DataTables for Logical Types

  • Let's take the Featuretools function here, and move it into this library.
  • We can start with a signature like
def infer_logical_types(dataframe, index, time_index):
    # returns {column_name: LogicalType}

  • In terms of the actual Logical Types, let's take what we can from Featuretools's infer_variable_types.
  • Numeric is not a Logical Type in DataTables, which means we should use Double for now
  • Text should be NaturalLanguage
  • NaturalLanguage should be the logical type inferred if nothing else can be (instead of Unknown)

Add Infer IP Address Logical Type

  • We can infer that a column is an IPAddress type if it fits the following conditions
  • pd.dtype -> object / string
  • values are in
np.nan
IPv4 address (e.g., 172.16.254.1)
IPv6 address (e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334)
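
The conditions above could be checked with the standard library's ipaddress module, as in this minimal sketch (the function name `is_ip_address_column` is an illustrative assumption):

```python
import ipaddress
import math

# Hypothetical sketch of the inference rule: every non-null value must
# parse as an IPv4 or IPv6 address.
def is_ip_address_column(values):
    saw_value = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):  # allow np.nan
            continue
        try:
            ipaddress.ip_address(v)
        except ValueError:
            return False
        saw_value = True
    return saw_value  # an all-null column should not be inferred as IPAddress


assert is_ip_address_column(['172.16.254.1',
                             '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
assert not is_ip_address_column(['not-an-ip'])
```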

Add Infer LatLong Logical Type

  • We can infer that a column is a LatLong type if it fits the following conditions
  • pd.dtype -> object / string
  • values are in
np.nan
tuple(Latitude within -90 to +90, Longitude within -180 to +180)
tuple(np.nan, np.nan)
tuple(np.nan, valid Longitude)
tuple(valid Latitude, np.nan)
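
These conditions could be checked as in the following minimal sketch (the function names `is_latlong_column` and `_coord_ok` are illustrative assumptions):

```python
import math

# A coordinate member may be np.nan or a number within the given range.
def _coord_ok(value, low, high):
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, (int, float)) and low <= value <= high

# Hypothetical sketch of the LatLong inference rule: every non-null value
# must be a 2-tuple of (latitude, longitude) members as described above.
def is_latlong_column(values):
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            continue  # a bare np.nan entry is allowed
        if not (isinstance(v, tuple) and len(v) == 2):
            return False
        lat, lng = v
        if not (_coord_ok(lat, -90, 90) and _coord_ok(lng, -180, 180)):
            return False
    return True


assert is_latlong_column([(40.7, -74.0), (float('nan'), float('nan'))])
assert not is_latlong_column([(200.0, 0.0)])
```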

Add better boolean inference

        default = [1, True, "true", "True", "yes", "t", "T"]
        self.true_values = true_values or default
        default = [0, False, "false", "False", "no", "f", "F"]
        self.false_values = false_values or default
  • We should add this logic to our inference code so that if the unique values fall into [T, F] or [TRUE, FALSE], etc., then the column is inferred to be Boolean
  • Should the strings be changed? T -> True in the DataTable?
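
The rule above could be sketched as follows (the function name `looks_boolean` is an illustrative assumption; the value lists come from the defaults shown above):

```python
import pandas as pd

# Default true/false values, as listed in the snippet above.
TRUE_VALUES = {1, True, 'true', 'True', 'yes', 't', 'T'}
FALSE_VALUES = {0, False, 'false', 'False', 'no', 'f', 'F'}

# Hypothetical sketch: infer Boolean when the non-null unique values are
# all drawn from the recognized true/false values.
def looks_boolean(series):
    unique = set(series.dropna().unique())
    return bool(unique) and unique <= (TRUE_VALUES | FALSE_VALUES)


assert looks_boolean(pd.Series(['T', 'F', 'T']))
assert not looks_boolean(pd.Series(['T', 'F', 'maybe']))
```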

Add dataframe attribute to DataTable

  • The DataTable needs to store the input DataFrame and allow the user to retrieve it.
  • We need to add a unit test to make sure this retrieve works
import pandas as pd 
from data_tables import DataTable

df = pd.read_csv(...)

dt = DataTable(df)
assert isinstance(dt.dataframe, pd.DataFrame)

Add is_logical_type helper functions

  • We should have helper functions that can help identify if a python class matches a certain logical type
is_categorical_logical_type
is_integer_logical_type
....
  • It may not be necessary to have the logical_type part in the function name

Add DataTable types property to show physical, logical, semantic types

  • We want the user to be able to view all the Physical, Logical, and Semantic types on a DataTable
  • Therefore, we need a types property on DataTable that prints out data column names, Physical dtypes, Logical Types & Semantic Tags
class DataTable(object):
    def __repr__(self):
        # print out data column names, pandas dtypes, Logical Types & Semantic Tags
        # similar to df.type
dt = DataTable(df, 
               name='retail', # default to df.name
               index=None, 
               time_index=None)
print(dt)
# Column ----- Physical Type --- Logical Type --- Tags
# id           int64             Numeric          {(index, {})}
# expired      boolean           Boolean          set()
# card_id      object            Categorical      set()
# datetime     datetime64[ns]    Datetime         {(time_index, {})}
# ship_date    datetime64[ns]    Datetime         set()
# store_id     category          Categorical      set()
# comments     string            NaturalLanguage  set()
  • pandas has a nice dtypes attribute, so we should try to mirror it and return a DataFrame.
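
Building that DataFrame could look like this minimal sketch (the helper name `build_types_frame` and its input shape are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sketch of assembling the `types` DataFrame that mirrors
# pandas' dtypes, one row per data column.
def build_types_frame(columns):
    # columns: {name: (physical_type, logical_type, semantic_tags)}
    return pd.DataFrame(
        list(columns.values()),
        index=pd.Index(columns.keys(), name='Column'),
        columns=['Physical Type', 'Logical Type', 'Semantic Tag(s)'],
    )


types = build_types_frame({
    'id': ('int64', 'Integer', {'index'}),
    'datetime': ('datetime64[ns]', 'Datetime', {'time_index'}),
})
```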

Allow user to update index and time index after creation

Users should be able to update the index and time index after a DataTable has been created:

dt.index = 'id'
dt.time_index = 'datetime'

When this is done the proper validation checks should be performed to make sure the supplied values are valid and can be used for the index or time index.
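
A property setter with validation could look like this minimal sketch (the `DataTable` shown here is a stripped-down stand-in, and the specific checks are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sketch of updating the index after creation, with
# validation performed in the setter.
class DataTable:
    def __init__(self, dataframe):
        self._dataframe = dataframe
        self._index = None

    @property
    def index(self):
        return self._index

    @index.setter
    def index(self, column_name):
        # the column must exist and contain unique values
        if column_name not in self._dataframe.columns:
            raise LookupError(f'{column_name} not found in dataframe')
        if not self._dataframe[column_name].is_unique:
            raise ValueError(f'{column_name} contains duplicate values')
        self._index = column_name


dt = DataTable(pd.DataFrame({'id': [0, 1, 2],
                             'datetime': pd.to_datetime(['2020-01-01'] * 3)}))
dt.index = 'id'
```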

Add Mutual Information utility function

  • Since DataTables contain typing information about the data, we can use that information to calculate statistics.
  • One statistic we can calculate is the mutual information between Data Columns
  • We should have a unit test to verify the output, and support for all valid types.
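
For two categorical columns, a plug-in estimate of mutual information can be computed from the joint frequency table, as in this minimal sketch (the function name and estimator choice are illustrative assumptions, not Woodwork's implementation):

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: mutual information between two categorical
# Data Columns via their normalized joint frequency table.
def mutual_information(a, b):
    joint = pd.crosstab(a, b, normalize=True)
    p_a = joint.sum(axis=1).to_numpy()  # marginal of a
    p_b = joint.sum(axis=0).to_numpy()  # marginal of b
    mi = 0.0
    for i, row in enumerate(joint.to_numpy()):
        for j, p_ab in enumerate(row):
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a[i] * p_b[j]))
    return mi


a = pd.Series(['x', 'x', 'y', 'y'])
identical = mutual_information(a, a)  # maximal: equals the entropy of a
independent = mutual_information(a, pd.Series(['u', 'v', 'u', 'v']))  # ~0
```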

Add Infer PostalCode Logical Type

  • We can infer that a column is a PostalCode type if it fits the following conditions
  • pd.dtype -> object / string
  • values are in
np.nan
5 digit ZIPCode "#####"  
9 digit ZIPCode (dash) "#####-####"
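
These conditions could be checked with a regular expression, as in this minimal sketch (the function name `is_postal_code_column` is an illustrative assumption):

```python
import re

# Matches 5-digit "#####" and 9-digit dashed "#####-####" codes.
POSTAL_CODE_RE = re.compile(r'^\d{5}(-\d{4})?$')

# Hypothetical sketch of the PostalCode inference rule: every non-null
# (string) value must match one of the two formats.
def is_postal_code_column(values):
    non_null = [v for v in values if isinstance(v, str)]
    return bool(non_null) and all(POSTAL_CODE_RE.match(v) for v in non_null)


assert is_postal_code_column(['60622', '02101-7209'])
assert not is_postal_code_column(['6062'])
```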

Change dtype of object into string for string-like Logical Types

  • If a Logical Type is inferred/set to any of the following, we should change the underlying dtype to string
FilePath
FullName
IPAddress 
LatLong
NaturalLanguage
PhoneNumber 
URL
  • Note: some of these may not be inferred by infer_logical_types, but we should write it such that if the logical type is in the list above, the dtype changes to string

Add convert to string for Logical Types

  • We want the user to be able to give a string instead of the Python Class when passing a Logical Type.
  • The user should be able to pass either the actual Python class (DateOfBirth) or the type string (date_of_birth).
  • We may also want the user to be able to pass the python string of the class name (DateOfBirth)
dt = DataTable(df, ...,
               logical_types={
                   "birth_day": 'date_of_birth',
                   "card_id": 'categorical',
                   "customer_id": Categorical,
               })
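
Resolving all three input forms could work as in this minimal sketch (the registry and the `resolve_logical_type` helper are illustrative assumptions; the two classes are stand-ins for Woodwork's logical types):

```python
# Stand-in logical type classes with a snake_case type string each.
class Categorical:
    type_string = 'categorical'

class NaturalLanguage:
    type_string = 'natural_language'

_REGISTRY = {Categorical, NaturalLanguage}

# Hypothetical sketch: accept the class itself, the type string, or the
# class name string, and return the logical type class.
def resolve_logical_type(value):
    if isinstance(value, type) and value in _REGISTRY:
        return value
    for logical_type in _REGISTRY:
        if value in (logical_type.type_string, logical_type.__name__):
            return logical_type
    raise ValueError(f'Unknown logical type: {value!r}')


assert resolve_logical_type('categorical') is Categorical
assert resolve_logical_type('NaturalLanguage') is NaturalLanguage
assert resolve_logical_type(Categorical) is Categorical
```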

Update object constructors to allow setting of semantic types

The __init__ methods for DataTable and DataColumn need to be updated to allow the setting of semantic type tags during the creation of the objects.

After this is implemented, a unit test should be created to verify that the semantic types for a column are retained properly after calling DataColumn.set_logical_types to update a column logical type.

Add index column is unique check

  • When a user creates a DataTable and provides an index column, we should check that the index column is unique.
import pandas as pd 

pd.Series([1, 2, 3]).is_unique
  • We should add a test for this with valid and invalid index

Change dtype of Categorical Logical Types into categorical

  • If a Logical Type is inferred/set to any of the following, we should change the underlying dtype to categorical
Categorical
CountryCode
Ordinal
SubRegionCode
ZIPCode
  • Note: some of these may not be inferred by infer_logical_types, but we should write it such that if the logical type is in the list above, the dtype changes to categorical
