pablo14 / funpymodeling

A package to help data scientists with Exploratory Data Analysis and Data Preparation for ML models

License: MIT License

Python 10.51% Jupyter Notebook 89.27% Makefile 0.22%

funpymodeling's People

Contributors

aoelvp94, cabustillo13, pablo14

funpymodeling's Issues

Create `cross_plot` function

cross_plot receives a data frame (data), a target variable (target) and, optionally, a list of input variable names (input), and generates the plots described below.

An example in R:

cross_plot(data=heart_disease, input=c("age", "oldpeak"), target="has_heart_disease")

It receives a data frame, two variable names given as strings, and the target variable, and it plots:

image

As shown, it creates two plots (one for age and one for oldpeak), each showing the distribution of the variable given the target variable.

Notes:

  • If no input is provided, it runs for all the variables.
  • It handles categorical and numerical variables. If a variable is numerical and has more than 15 unique values, it is discretized into equal-frequency bins. The number of bins is a parameter (not present in the R version); the default is 10.
  • cross_plot only applies to binary classification tasks (the target must contain exactly two different values).
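The notes above can be sketched as a small helper that prepares the data behind each plot. This is only an illustration under assumptions, not the package's implementation: the function name `cross_table` and the parameters `bins` and `max_unique` are hypothetical, and the data frame is synthetic rather than the heart_disease dataset.

```python
import numpy as np
import pandas as pd


def cross_table(data, var, target, bins=10, max_unique=15):
    """Share of each target class per category (or bin) of `var`.

    Hypothetical helper: name and parameters are assumptions.
    """
    if data[target].nunique() != 2:
        raise ValueError("target must contain exactly two distinct values")

    x = data[var]
    # Numeric variables with many distinct values are discretized into
    # equal-frequency bins; everything else is used as-is.
    if pd.api.types.is_numeric_dtype(x) and x.nunique() > max_unique:
        x = pd.qcut(x, q=bins, duplicates="drop")

    # Row-normalized contingency table: each row sums to 1.
    return pd.crosstab(x, data[target], normalize="index")


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(30, 80, size=200),
    "has_heart_disease": rng.choice(["no", "yes"], size=200),
})
table = cross_table(df, "age", "has_heart_disease", bins=5)
```

A stacked bar chart of such a table, for example via `table.plot(kind="bar", stacked=True)`, would give one plot per input variable.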

Cannot install funpymodeling on Google Colab

Suggested by @aoelvp94 👌

When you try to install this library in a Google Colab environment it throws an error:

Collecting funpymodeling
  Downloading https://files.pythonhosted.org/packages/93/09/73d42d2983e18e28e00e4a33e38eeddd1531e3f83ed91a8224f1816be81f/funpymodeling-0.1.4.tar.gz
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpdq9ovsms Check the logs for full command output.

Create `freq_plot` function (frequency plot for categorical variables)

In funModeling, the freq function plots the frequency of all the categorical variables.

The code below does something similar:

import matplotlib.pyplot as plt
import seaborn as sns
from funpymodeling.exploratory import cat_vars

sns.set()
tips = sns.load_dataset("tips")

d_plot = tips
fig, ax = plt.subplots(4, 2, figsize=(20, 20))
# One horizontal count plot per categorical variable, bars sorted by frequency
for variable, subplot in zip(cat_vars(d_plot), ax.flatten()):
    sns.countplot(y=d_plot[variable], ax=subplot, order=d_plot[variable].value_counts().index)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)

It shows:

image

  1. This is not the case here, but if the names are too long, they overlap across the plots.
  2. Don't create empty grids (calculate the number of plots dynamically).
  3. It needs to show the absolute and relative percentage per bar, as shown below:

image

This data is already calculated by the function freq_tbl in this package.

  4. If there are more than 100 different categories, the plot should group the less frequent ones into an "other" (or "more") category, to avoid crashing.

  5. It should use the todf() function (from funpymodeling) to convert different data types to a data frame, so that freq_plot supports 1D/2D numpy arrays, pandas series and 1D/2D lists.
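Two of the requirements above, sizing the grid dynamically and grouping rare categories, can be sketched with plain pandas. The helper names `grid_shape` and `lump_rare`, their parameters, and the "other" label are assumptions made for this illustration; none of them is part of funpymodeling.

```python
import math

import pandas as pd


def grid_shape(n_plots, ncols=2):
    """Rows and columns needed for n_plots, so no row is left fully empty."""
    ncols = max(1, min(ncols, n_plots))
    nrows = math.ceil(n_plots / ncols)
    return nrows, ncols


def lump_rare(series, max_categories=100, other_label="other"):
    """Keep the most frequent categories and group the rest under one label."""
    counts = series.value_counts()  # sorted by descending frequency
    if len(counts) <= max_categories:
        return series
    # Reserve one slot for the grouped label itself.
    keep = counts.index[: max_categories - 1]
    # Cast to object so the new label is accepted even for categorical dtypes.
    as_object = series.astype(object)
    # Missing values are left untouched instead of being relabelled.
    return as_object.where(as_object.isin(keep) | as_object.isna(), other_label)


s = pd.Series(list("aaabbcd"))
lumped = lump_rare(s, max_categories=3)
```

The result of `grid_shape(len(variables))` could then be passed to `plt.subplots(nrows, ncols)` in place of the fixed 4 by 2 grid.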

Encapsulate pairwise correlation based on MIC statistic

Modify the corr_pair function from funpymodeling so that it handles correlation based on the MIC statistic, as shown below:

import itertools

import pandas as pd
import seaborn as sns
from minepy import MINE

from funpymodeling.exploratory import num_vars

tips = sns.load_dataset('tips')

# Encapsulate from here:
data = tips[num_vars(tips)]  # only pre-select numeric variables

# Assumption: col_pairs, undefined in the original snippet, is every unordered pair of columns
col_pairs = itertools.combinations(data.columns, 2)

rows = []
for a, b in col_pairs:
    mine = MINE(alpha=0.6, c=15, est="mic_approx")
    mine.compute_score(data[a], data[b])
    rows.append({"v1": a, "v2": b, "mic": mine.mic()})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_res = pd.DataFrame(rows)

image

corr_pair should support method='mic'.
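One possible way to add a method switch is sketched below. It is not the package's corr_pair: the function name `pairwise_scores` and the output column layout are assumptions, and minepy is imported lazily so it is only needed when method='mic' is requested.

```python
from itertools import combinations

import pandas as pd


def _mic(x, y):
    # Imported lazily so minepy is only required when method="mic".
    from minepy import MINE

    mine = MINE(alpha=0.6, c=15, est="mic_approx")
    mine.compute_score(x, y)
    return mine.mic()


def pairwise_scores(data, method="pearson"):
    """Score every unordered pair of numeric columns with the given method.

    Hypothetical helper: name and output layout are assumptions.
    """
    numeric = data.select_dtypes(include="number")
    rows = []
    for a, b in combinations(numeric.columns, 2):
        if method == "mic":
            score = _mic(numeric[a].to_numpy(), numeric[b].to_numpy())
        else:
            # Delegates to pandas for "pearson", "kendall" or "spearman".
            score = numeric[a].corr(numeric[b], method=method)
        rows.append({"v1": a, "v2": b, method: score})
    return pd.DataFrame(rows)


df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": list("abcd"),  # non-numeric column, ignored
})
res = pairwise_scores(df)
```

Calling `pairwise_scores(df, method="mic")` would produce the same layout with a `mic` column, provided minepy is installed.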

Add CI

Implement CI in this project to check and run our future tests.
