Helpful Plotting and Pandas Patterns

Home Page: https://medium.com/@davidmasse8/helpful-plotting-and-pandas-patterns-80fd82b2b88b

The following are mainly notes to myself about a few things to do upon receiving a dataset.

Some basic Python imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Binary-Target Histograms and Cohen's d

If you have a binary target variable, it's often illustrative to plot a histogram of each feature showing two populations: the feature's values where the target is positive, and its values where the target is negative. Other conditions may be placed on membership in either group to exclude outliers if desired.

# made-up data for a dataframe
d = {'target': [0] * 50 + [1] * 50,
     'B': list(np.random.normal(25, 10, 50)) + list(np.random.normal(75, 10, 50)),
     'C': list(np.random.normal(40, 30, 50)) + list(np.random.normal(66, 20, 50))}
df = pd.DataFrame(d)
yes = df[(df['target'] == 1) & (df['B'] < 100)]
no = df[(df['target'] == 0) & (df['B'] < 100)]
cols = list(df.columns)
# we could plot the target column, but it would just show us the proportion of 0's and 1's
cols.remove('target')
for name in cols:
    a = yes[name].values
    b = no[name].values
    # the post originally used sns.distplot, which newer seaborn releases have removed
    sns.histplot(a, label = 'positive', kde = False)
    sns.histplot(b, label = 'negative', kde = False)
    plt.title(name)
    plt.legend()
    plt.show()

(Figure: overlaid positive/negative histograms, one panel per feature)

The visible separation in pairs of distributions can be quantified by Cohen's d, a standardized difference between the means of the positive and negative populations: d = |mean_pos - mean_neg| / s_pooled, where s_pooled = sqrt(((n_pos - 1)*s_pos^2 + (n_neg - 1)*s_neg^2) / (n_pos + n_neg - 2)). This often previews the feature importances of decision-tree-based classifiers, which tend to rank features in a similar order to Cohen's d. That makes sense: a higher Cohen's d indicates a larger effect size (i.e. more separation between the positive and negative distributions), which should make it easier for a decision tree to identify which distribution a particular observation came from.

cd = []
nn = len(no)
ny = len(yes)
for name in cols:
    a = yes[name].values
    b = no[name].values
    # the pooled standard deviation weights sample variances, hence ddof = 1
    pooled_sd = np.sqrt(((ny - 1)*np.std(a, ddof = 1)**2 + (nn - 1)*np.std(b, ddof = 1)**2) / (nn + ny - 2))
    cd.append((name, round(abs(np.mean(a) - np.mean(b)) / pooled_sd, 2)))
print("Cohen's d:")
print(sorted(cd, key = lambda x: x[1], reverse = True))

(Output: list of (feature, Cohen's d) tuples, sorted descending)

Side-by-Side Bar Chart (Seaborn)

Here's a quick side-by-side bar chart for showing more than one vertical value (all on the same scale) for each category on the horizontal axis - useful when you would like to see several versions of a metric for each group or point in time.

# nonsense example data:
d = {'Month': ['March', 'March', 'April', 'April', 'May', 'May'],
     'Measurement': [4, 3, 3, 4, 5, 6],
     'Type of Measurement': ['A', 'B', 'A', 'B', 'A', 'B']}
to_chart = pd.DataFrame(d)
# the line below is convenient to adjust depending on scale and look of plot
sns.set(font_scale = 1.5)
fig, ax = plt.subplots(figsize = (10, 7))
# hue is the argument to know, indicates the two versions of each month's data to show
sns.barplot(x = 'Month', y = 'Measurement', hue = 'Type of Measurement', data = to_chart)
plt.show()

(Figure: side-by-side bar chart of measurement types A and B by month)

Null Values and Duplicates

This seems to be the fastest way to check for any null values present in a dataframe: df.isnull().values.any().
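A quick sketch of both the single-boolean check and a per-column breakdown, run on the made-up df above (the per-column .sum() line is my addition, not from the original note):

# True if any cell anywhere in the dataframe is null
print(df.isnull().values.any())
# per-column null counts, handy for deciding what to impute or drop
print(df.isnull().sum())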

Here are two quick ways to check for duplicates. Each expression below returns True if no duplicate rows are found (same values in the specified subset of columns, or in all columns if no subset is given) and False otherwise.

# otherwise useful `keep` argument, not needed here, allows control over which copies of a particular row are kept/dropped
np.sum(df.duplicated(subset = ['A', 'B', 'C'])) == 0
df.equals(df.drop_duplicates(subset = ['A', 'B', 'C']))

The .duplicated method also returns a useful Pandas Series: True at duplicate locations and False elsewhere. If you do drop some duplicates (use inplace = True) and the index isn't meaningful, follow up with df.reset_index(drop=True) to keep the index numbers sequential, the way you would assume (i.e. so nothing breaks when you happen to use .loc instead of .iloc much later) - see the sketch below.
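A minimal sketch of that clean-up sequence on a made-up dup_df (my own example, not from the original post):

# small example with one duplicated row
dup_df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y']})
print(dup_df.duplicated())  # False, True, False
dup_df.drop_duplicates(inplace = True)
# the index is now 0, 2; renumber it so .loc and .iloc agree later
dup_df.reset_index(drop = True, inplace = True)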

Other Initial Checks

Look at df.info(), df.describe(), df.dtypes, df.shape, df.head(N) (where N is a judgment-call number of rows needed to make sense of the dataframe) and df.iloc[n:m] (where n:m is a slice taken somewhere in the middle of the rows, as the beginning and end may have idiosyncrasies). Look at set(df.column) and len(set(df.column)) for each column to see its unique values.
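As a concrete sketch, run on the made-up df from earlier (N and the slice bounds are arbitrary choices of mine):

df.info()
print(df.describe())
print(df.dtypes)
print(df.shape)
print(df.head(5))
print(df.iloc[40:45])  # an arbitrary slice away from either end
for col in df.columns:
    # each column's number of unique values
    print(col, len(set(df[col])))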

.groupby (along with .merge) for SQL Operations (dplyr/reshape2 in R)

Start with the likes of df[['A', 'B', 'C']].groupby('B').count().sum(). Tack on .sort_values('A', ascending = False) or replace .count().sum() with .agg({'A': lambda_function_1, 'C': lambda_function_2}) (where the lambda functions, defined elsewhere, pick out qualities of a group, e.g. average length of strings). Instead of columns, specify rows with conditions: df[(df.A > 0) & (df.C > 100)].groupby('B').count().sum().
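A runnable sketch of both patterns on a toy dataframe (my own made-up data, with 'B' as the grouping column):

toy = pd.DataFrame({'A': ['ab', 'abc', 'a', 'abcd'],
                    'B': ['x', 'x', 'y', 'y'],
                    'C': [10, 20, 30, 40]})
# non-null counts per group, then summed across groups
print(toy[['A', 'B', 'C']].groupby('B').count().sum())
# per-group aggregation with lambdas picking out qualities of each group
print(toy.groupby('B').agg({'A': lambda s: s.str.len().mean(),  # average string length
                            'C': lambda s: s.max() - s.min()}))  # range within the group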

To take this further, iterate through a grouped dataframe:

# e.g. grouped_df = df.groupby('B'); each iteration yields the group's
# key (name) and the sub-dataframe of matching rows (group)
for name, group in grouped_df:
    more_complex_function_1(group)
    more_complex_function_2(name)

Use .merge suffixes to keep track of columns when joining a table to itself or a similar table (such as when chaining together periods to assess transitions):

merged = table_1[table_1.month == 'January'].merge(table_2[table_2.month == 'February'], on = 'ID', suffixes = ('L', 'R'))
# now you can make a column of ratios or percent changes from January to February, e.g. merged.balanceR / merged.balanceL - 1
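A self-contained sketch of the same pattern (the table names, columns and values here are hypothetical stand-ins):

# two made-up monthly snapshots keyed by account ID
table_1 = pd.DataFrame({'ID': [1, 2], 'month': ['January', 'January'], 'balance': [100, 200]})
table_2 = pd.DataFrame({'ID': [1, 2], 'month': ['February', 'February'], 'balance': [110, 180]})
merged = table_1[table_1.month == 'January'].merge(
    table_2[table_2.month == 'February'], on = 'ID', suffixes = ('L', 'R'))
# overlapping column names get the suffixes: balanceL, balanceR
print(merged.balanceR / merged.balanceL - 1)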

Apply and Assign

Here's the best way I've found to add derived columns to a dataframe. It's nice to see the lambda function separately. Adding if-else logic is often useful:

# Here 'A' and 'B' are columns in df:
fn = lambda row: row['A'] / row['B'] if row['B'] != 0 else 0
df_expanded = df.assign(derived_column = df.apply(fn, axis = 1).values)
