GithubHelp home page GithubHelp logo

pygander's Introduction

pygander

Latest PyPI version Latest Travis CI build status

Kaggle Data Utilities

Usage

Transforms

PyGander contains advanced function decorators which allow you to create or replace columns in dataframes using a fun and efficient function-based syntax.

import pygander as pg
import pandas as pd

df = pd.DataFrame(...)

@pg.rowlogic(df)
def new_column_name(existing_column_name, existing_column_name2:cast_to_type=value_if_na):
    # named elements from rows get passed in
    # (after being cast to the annotated types or replaced with default values if NA)
    # output is the value for the current row
    # computed in-place, a new column called 'new_column_name' will be added to df
    ... # complex logic here
    return ... # some value

Take for example, the following dummy code which performs a complex logical operation row-wise on a dataframe with at least three columns, a, b and c. Assume the columns need to be cast to ints for the operation. This operation could also create exceptions which need to be handled.

def without_rowlogic(row:pd.Series):
    try:
        if int(row['a']) < int(row['b']):
            return 'a < b'
        elif int(row['a']) < 2 * int(row['b']):
            return 'a < 2b'
        elif int(row['a']) > int(row['b']):
            raise ValueError('a > b')
    except Exception as e:
        return pd.NA

    return str(int(row['c'] if row.notna()['c'] else 7))

df['without_rowlogic'] = [without_rowlogic(row) for idx, row in df.iterrows()]

Using pygander, we can cut the number of characters required to write this operation in half, and make the syntax _much_ easier to read.

import pygander as pg

@pg.rowlogic(df, default_if_exception=pd.NA)
def with_rowlogic(a: int, b: int, c: int = 7):
    if a < b:
        return 'a < b'
    elif a < 2 * b:
        return 'a < 2b'
    elif a > b:
        raise ValueError('a > b')

    return str(c)

The ctransf decorator also allows for column-wise transformations. Take for example the following pandas script.

import pandas as pd

df = pd.DataFrame(...)

# make a z_score_of_a column in df
df['z_score_of_a'] = (df['a'] - df['a'].mean()) / (df['a'].std())

The following pygander script is equivalent.

import pygander as pg
import pandas as pd


df = pd.DataFrame(...)

# make a z_score_of_a column in df using pygander
@pg.ctransf(df)
def z_score_of_a(a):
    return (a - a.mean()) / a.std()

Both scripts are equally verbose, but as the number of column references increases, ctransf becomes a more efficient means of creating column transforms. In addition to reducing at least 5-6 characters per column reference (df[' and ']), the decorator-and-function syntax is also easier to read.

If your dataframe has column names that aren't "identifiers" and can't be listed as function parameters (leading digits, spaces, symbols, etc.), pygander offers pg.norm_colnames a convenient and robust in-place column renaming function.

>>> df.columns
Index['  0123 aBC', ' /// e', 'Normal Name']
>>> pg.norm_colnames(df)
>>> df.columns
Index['_0123_abc', '_e', 'normal_name']
>>> pg.rowlogic(df)
... def new_col(_0123_abc, _e, normal_name):
...    ... # now you can use the column names as identifiers

Grouper

The Grouper class provides a convenient means of selecting certain dataframe splits or column sets. Take the following dataframes:

k = 100
df = pd.DataFrame()
df['rowid'] = list(range(k))
df['a'] = choices([1, 2, 3], k=k)
df['b'] = choices(['e','ee','eee'], k=k)
df['c'] = choices([1, 2, 3, 4, 5, 6], k=k)
df['target'] = choices([0, 1], k=k)

train=df.head(k//4)
val = df.head(k//2).tail(k//4)
test=df.tail(k//2).drop('target', axis=1)

To create a grouper, identify sets of columns and which dataframes you're interested in.

dfg = pg.Grouper(
     column_groups = {
        'rowid': ['rowid'],
        'x': ['a', 'b', 'c'],
        'y': ['target'],
    },
    train=train,
    test=test,
    val=val
)

You can then call dfg.sel to select and combine certain dataframes and column sets. For example, common phases of the kaggle process include data preparation, where you'll want all feature columns for all splits, without row IDs or target data.

dfg.sel(cg='x')

During training, you'll want feature data and target data for the training set.

dfg.sel(cg=['x', 'y'], df='train')

During validation, you'll want feature data and target data for the validation set.

dfg.sel(cg=['x', 'y'], df='val')

And during competition inference, you'll want the row IDs and the feature data for the test set.

dfg.sel(cg=['rowid', 'x'], df='test')

If you want to see everything, you can call an empty dfg.sel.

dfg.sel()

Installation

Requirements

Compatibility

Licence

Authors

pygander was written by Elias Jaffe.

pygander's People

Contributors

handcartcactus avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.