pablo14 / funpymodeling

A package to help data scientists with Exploratory Data Analysis and Data Preparation for ML models

License: MIT License

Python 10.51% Jupyter Notebook 89.27% Makefile 0.22%

funpymodeling's Issues

Create `cross_plot` function

cross_plot receives a data frame (data), a target variable (target) and, optionally, a list of variable names (input), and generates one plot per input variable.

An example in R:

cross_plot(data=heart_disease, input=c("age", "oldpeak"), target="has_heart_disease")

It receives a data frame, two variable names given as strings, and the target variable. It creates two plots (one for age and one for oldpeak), each showing the distribution of the variable given the target variable.

Notes:

If no input is provided, it runs for all the variables.
It handles categorical and numerical variables. If a variable is numerical and has more than 15 unique values, it is discretized into equal-frequency bins. The number of bins is a parameter (not present in the R version); the default is 10.
cross_plot only applies to binary classification tasks (the target must contain exactly two distinct values).
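
The binning rule in the notes above can be sketched as follows. This is only an illustration of the data preparation step, not the package's implementation: the function name cross_table and its parameters bins and max_unique are hypothetical, and equal-frequency discretization is assumed to map to pandas.qcut.

```python
import pandas as pd


def cross_table(data, input_col, target, bins=10, max_unique=15):
    """Share of each target class per category (or bin) of input_col.

    Hypothetical helper: the name and parameters are assumptions.
    """
    if data[target].nunique() != 2:
        raise ValueError("target must contain exactly two distinct values")

    x = data[input_col]
    # Numeric variables with many distinct values are discretized into
    # equal-frequency bins; everything else is used as-is.
    if pd.api.types.is_numeric_dtype(x) and x.nunique() > max_unique:
        x = pd.qcut(x, q=bins, duplicates="drop")

    # Row-normalized contingency table: each row sums to 1.
    return pd.crosstab(x, data[target], normalize="index")


# Toy data, made up for the example.
df = pd.DataFrame({"age": range(100), "has_heart_disease": ["yes", "no"] * 50})
table = cross_table(df, "age", "has_heart_disease")
```

Each row of the resulting table is one bin of age, and each column is one value of the target, so the table could feed a stacked bar plot directly.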

Cannot install funpymodeling on Google Colab

Suggested by @aoelvp94 👌

When you try to install this library in a Google Colab environment it throws an error:

Collecting funpymodeling
  Downloading https://files.pythonhosted.org/packages/93/09/73d42d2983e18e28e00e4a33e38eeddd1531e3f83ed91a8224f1816be81f/funpymodeling-0.1.4.tar.gz
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpdq9ovsms Check the logs for full command output.

Improve structure of project

You can follow this structure: https://github.com/sergiocalde94/pydrift
@sergiocalde94 is a good professional, maybe he can help us :)

Create `freq_plot` function (frequency plot for categorical variables)

In funModeling, the freq function plots the frequency of all the categorical variables.

The code below does something similar:

import matplotlib.pyplot as plt
import seaborn as sns

from funpymodeling.exploratory import cat_vars

sns.set()
tips = sns.load_dataset("tips")

d_plot = tips
fig, ax = plt.subplots(4, 2, figsize=(20, 20))
# One horizontal count plot per categorical variable, bars ordered by frequency
for variable, subplot in zip(cat_vars(d_plot), ax.flatten()):
    sns.countplot(y=d_plot[variable], ax=subplot, order=d_plot[variable].value_counts().index)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)

It shows one count plot per categorical variable.

Notes:

If the category names are too long, they overlap across the plots (this is not the case with this dataset).
Don't create empty grids (calculate the number of plots dynamically).
It needs to show the absolute count and the relative percentage per bar. This data is already calculated by the freq_tbl function in this package.
If there are more than 100 different categories, the plot should group the remaining ones into an "other" category, to avoid crashing.
It should use the todf() function (from funpymodeling) to convert different data types to a data frame, so that freq_plot supports 1D/2D numpy arrays, pandas Series, and 1D/2D lists.
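
Two of the requirements above, the dynamic grid size and the "other" grouping, can be sketched as small helpers. The names lump_categories and grid_shape and their parameters are hypothetical, not part of funpymodeling; the sketch covers only the data preparation and leaves the plotting itself out.

```python
import math

import pandas as pd


def lump_categories(series, max_categories=100, other_label="other"):
    """Keep the most frequent categories and group the rest under one label.

    Hypothetical helper: the name and parameters are assumptions.
    """
    # Cast to object so a new label can be assigned to categorical data too.
    s = series.astype(object)
    counts = s.value_counts()
    if len(counts) <= max_categories:
        return s
    # Reserve one slot for the "other" bucket.
    keep = counts.index[: max_categories - 1]
    # Missing values are left untouched instead of being moved to "other".
    return s.where(s.isin(keep) | s.isna(), other_label)


def grid_shape(n_plots, n_cols=2):
    """Rows and columns needed for n_plots subplots, without empty rows."""
    n_cols = max(1, min(n_cols, n_plots))
    return math.ceil(n_plots / n_cols), n_cols


# Toy data, made up for the example.
lumped = lump_categories(pd.Series(list("aaabbcd")), max_categories=3)
rows, cols = grid_shape(5)
```

With an odd number of plots the last row still has one unused cell, which the plotting code could hide with Axes.set_visible(False).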

Encapsulate pairwise correlation based on MIC statistic

Modify the corr_pair function from funpymodeling so that it can also compute correlation based on the MIC statistic, as shown below:

from itertools import combinations

import pandas as pd
import seaborn as sns
from minepy import MINE

from funpymodeling.exploratory import num_vars

tips = sns.load_dataset('tips')

# Encapsulate from here:
data = tips[num_vars(tips)]  # pre-select only the numeric variables

# Assumption: col_pairs (undefined in the original snippet) is every unordered pair of columns
col_pairs = list(combinations(data.columns, 2))

rows = []
for a, b in col_pairs:
    mine = MINE(alpha=0.6, c=15, est="mic_approx")
    mine.compute_score(data[a], data[b])
    rows.append({"v1": a, "v2": b, "mic": mine.mic()})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_res = pd.DataFrame(rows)

corr_pair should support method='mic'.
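
One possible shape for the method dispatch is sketched below. The name corr_pair_sketch, its signature, and the output column names are assumptions, since the actual corr_pair interface is not shown in this issue. minepy is imported lazily so that it is only required when method='mic' is requested.

```python
from itertools import combinations

import pandas as pd


def corr_pair_sketch(data, method="pearson"):
    """Pairwise association between numeric columns, one row per pair.

    Hypothetical function: name, signature and output columns are assumptions.
    """
    numeric = data.select_dtypes(include="number")
    rows = []
    if method == "mic":
        # Imported here so minepy is only needed when MIC is requested.
        from minepy import MINE

        for a, b in combinations(numeric.columns, 2):
            mine = MINE(alpha=0.6, c=15, est="mic_approx")
            mine.compute_score(numeric[a], numeric[b])
            rows.append({"v1": a, "v2": b, "mic": mine.mic()})
    else:
        # Any method accepted by DataFrame.corr: pearson, spearman, kendall.
        corr = numeric.corr(method=method)
        for a, b in combinations(numeric.columns, 2):
            rows.append({"v1": a, "v2": b, "R": corr.loc[a, b]})
    return pd.DataFrame(rows)


# Toy data, made up for the example.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
res = corr_pair_sketch(df)
```

Calling corr_pair_sketch(df, method="mic") returns the same v1 and v2 columns with a mic column instead of R.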

set versions of numpy, pandas and matplotlib

Implement GitBook for documentation

We could do something like this: https://sergiocalde94.github.io/pydrift/ (@sergiocalde94, sorry, I think your project is a good example to follow :) )

add CI

Implement CI in this project to check/run our future tests

Add files/folders to .gitignore

Add:

AUX/
__pycache__/
funPyModeling.egg-info/
dist/
.ipynb_checkpoints/

Replace init

Remove this file: https://github.com/pablo14/funpymodeling/blob/master/__init__.py

Move the code from that file into the package folder: https://github.com/pablo14/funpymodeling/tree/master/funPyModeling
