
datadez

Pandas dataframe inspection, filtering, balancing (MIT License)

The main goal of this package is to make your life easier if you want to:

  • Inspect a dataset and compute metrics about its columns' content (automatic type inference: numeric, mono-label or multi-label).
  • Filter the dataset on some criteria (minimum label occurrence, empty examples).
  • Balance the dataset (TODO) in order to get better performance when training ML or NN models.
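
Every snippet below starts from a dataframe mixing these column kinds. Here is a minimal, purely illustrative sketch; the column names and values simply mirror the sample output further down:

import pandas as pd

# Toy dataframe with a numeric, a mono-label, a multi-label (list) and a text column
df = pd.DataFrame({
    'A': [-0.585248, 0.569125, -0.076040],                  # numeric
    'B': ['W2SIF2', 'RYKAXC', '7UVFIJ'],                    # mono-label
    'C': [[], ['IRX7HF', 'AXQU0L'], ['60WILH', 'NT28YD']],  # multi-label
    'D': ['house jumps adorable crazily',                   # text
          'car swims odd merrily',
          'monkey barfs clueless dutifully'],
})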

Requirements

  • Python 2.7 or 3.6
  • NumPy and Pandas

Usage

  1. Command line inspection

You can gain insight into your dataset and get, among other things, the label count and the max/min/mean label occurrence for mono-label and multi-label columns.

# Dataframe with numeric, mono-label, or multi-label (list, tuple, set) columns
import pandas as pd
df = pd.DataFrame(...)

# Filter out labels occurring fewer than min_occurrence times in column 'B'
from datadez.filter import filter_small_occurrence
df = filter_small_occurrence(df, column_name='B', min_occurrence=3)

# Filter out rows where column 'B' or 'C' is empty
from datadez.filter import filter_empty
df = filter_empty(df, column_names=['B', 'C'])

# Compute some metrics about your dataset
import pprint
from datadez.summarize import summarize
df_summaries = summarize(df)
pprint.pprint(df_summaries)
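
The summaries come back as a plain nested dict keyed by column name, so individual metrics can be read directly. The metric names used below are taken from the sample output shown further down:

# Read individual metrics from the summary dict
print(df_summaries['B']['labels'])           # number of distinct labels in column 'B'
print(df_summaries['C']['occurrence_mean'])  # mean label occurrence in column 'C'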

  2. Visual inspection
# Dataframe with a multi-label column 'C'
df = pd.DataFrame(...)

# Compute a plotly figure from this dataframe
from datadez.dataviz import multilabel_plot
figure = multilabel_plot.intersection_matrix(df, 'C')

# Save the figure to an HTML file (see below for inline display in a Jupyter notebook)
from plotly.offline import plot
plot(figure, filename='chord-diagram.html')

Output will look like this:

[intersection_matrix.jpg: example of the resulting intersection matrix]
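
To render the figure inline in a Jupyter notebook instead of writing an HTML file, plotly's classic offline helpers can be used; this is a sketch relying on the plotly offline API (plotly 2.x-4.x), not on anything shipped by datadez:

# Inline display in a Jupyter notebook; datadez itself is not involved here
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)  # load plotly.js into the notebook
iplot(figure)                       # render the intersection matrix inline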

  3. Transformation

With this code snippet:

# Dataframe with numeric, text, mono-label, multi-label (list, tuple, set) columns
df = pd.DataFrame(...)

print("Original dataset:")
print(df.head())

# Vectorize text, mono-label and multi-label columns
from datadez.transform import vectorize_dataset
df, vectorizers = vectorize_dataset(df)

print("Vectorized dataset:")
print(df.head())

you will get:

Original dataset:
          A       B                                 C                                D
0 -0.585248  W2SIF2                                []     house jumps adorable crazily
1  0.569125  RYKAXC  [IRX7HF, AXQU0L, PM1E1Q, 1FCWZQ]            car swims odd merrily
2 -0.076040  7UVFIJ  [60WILH, NT28YD, 8IYE5F, 7UVFIJ]  monkey barfs clueless dutifully
3 -0.098878  U9WN5M                  [KS5EXD, YGTPR9]           boy runs odd dutifully
4  0.952773  SK1Z1M                          [AXQU0L]            boy barfs odd crazily

Vectorized dataset:
          A      C                                                          ...        D
      value 1BNK1S 1FCWZQ 246K1M 3A48BH 60WILH 6C3VOQ 6LGS3T 7UVFIJ 8IYE5F  ...  merrily monkey occasionally odd puppy rabbit runs stupid swims weeps
0 -0.585248      0      0      0      0      0      0      0      0      0  ...        0      0            0   0     0      0    0      0     0     0
1  0.569125      0      1      0      0      0      0      0      0      0  ...        1      0            0   1     0      0    0      0     1     0
2 -0.076040      0      0      0      0      1      0      0      1      1  ...        0      1            0   0     0      0    0      0     0     0
3 -0.098878      0      0      0      0      0      0      0      0      0  ...        0      0            0   1     0      0    1      0     0     0
4  0.952773      0      0      0      0      0      0      0      0      0  ...        0      0            0   1     0      0    0      0     0     0

[5 rows x 85 columns]
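
Judging from the two-row column header in the printout, the vectorized dataframe appears to use a pandas column MultiIndex, with the original column name on the first level and the label/token on the second. Assuming that layout (an inference from the output above, not a documented guarantee), the block derived from a single original column can be selected directly:

# Assumes a column MultiIndex: first level = original column, second level = label/token
c_one_hot = df['C']   # one-hot columns derived from the multi-label column 'C'
d_bow = df['D']       # bag-of-words columns derived from the text column 'D'

# Most frequent labels of 'C' in the vectorized dataset
print(c_one_hot.sum().sort_values(ascending=False).head())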

Run the tests

Just clone this repository and execute:

python -m tests.sample

This will run a test sample so you can see what's going on:

Starting from this dataframe (len=100):
          A       B                                 C                               D
0  0.745236  BBM7UP  [TUT7RS, MOW92W, 9O6IX6, T70X4Z]    donkey barfs dirty foolishly
1  0.484822  GPC8CL  [BG05XJ, IORYVC, BX9UK5, ERT4PJ]    girl hits clueless dutifully
2  0.673377  BK3OE7  [GPC8CL, GPC8CL, GPC8CL, BG05XJ]  car eats clueless occasionally
3  0.462564  AEAIH6                                []          car eats dirty crazily
4 -0.115847  T70X4Z                                []      girl hits adorable crazily

With these metrics:
{u'A': {u'column_type': u'numeric',
        u'mean': 0.048246178299744653,
        u'std': 0.95162789611877563},
 u'B': {u'column_type': u'mono-label',
        u'imbalance_ratio': 6,
        u'labels': 30,
        u'occurrence_max': 6,
        u'occurrence_mean': 3.3333333333333335,
        u'occurrence_min': 1,
        u'occurrence_std_dev': 1.4452988925785868},
 u'C': {u'cardinality_mean': 1.6499999999999999,
        u'cardinality_std_dev': 1.3955285736952863,
        u'column_type': u'multi-label',
        u'imbalance_ratio': 5,
        u'labels': 29,
        u'occurrence_max': 10,
        u'occurrence_mean': 5.6896551724137927,
        u'occurrence_min': 2,
        u'occurrence_std_dev': 2.1189367580327199,
        u'partitions': {u'imbalance_ratio': 29,
                        u'labels': 65,
                        u'occurrence_max': 29,
                        u'occurrence_mean': 1.5384615384615385,
                        u'occurrence_min': 1,
                        u'occurrence_std_dev': 3.4555503761871074}},
 u'D': {u'column_type': u'mono-label',
        u'imbalance_ratio': 2,
        u'labels': 98,
        u'occurrence_max': 2,
        u'occurrence_mean': 1.0204081632653061,
        u'occurrence_min': 1,
        u'occurrence_std_dev': 0.14139190265868387}}
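
As a reading aid: the imbalance_ratio values in these summaries appear to be the ratio between the most and the least frequent label, i.e. occurrence_max over occurrence_min truncated to an integer. This is inferred from the numbers above, not from the datadez source:

# Inferred from the sample output, not from the datadez source:
# imbalance_ratio looks like occurrence_max // occurrence_min
for column in ('B', 'C', 'D'):
    metrics = df_summaries[column]
    print(column, metrics['occurrence_max'] // metrics['occurrence_min'],
          metrics['imbalance_ratio'])  # the two printed ratios should match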

Filtering the small occurrences label of columns B and C...
We now get this (len=100):
          A       B                                 C                               D
0  0.745236  BBM7UP                  [MOW92W, 9O6IX6]    donkey barfs dirty foolishly
1  0.484822  GPC8CL          [BG05XJ, IORYVC, ERT4PJ]    girl hits clueless dutifully
2  0.673377  BK3OE7  [GPC8CL, GPC8CL, GPC8CL, BG05XJ]  car eats clueless occasionally
3  0.462564  AEAIH6                                []          car eats dirty crazily
4 -0.115847  T70X4Z                                []      girl hits adorable crazily


Filtering empty entry example for column B or C...

We finally have a clean dataframe (len=47):
          A       B                                 C                               D
0  0.745236  BBM7UP                  [MOW92W, 9O6IX6]    donkey barfs dirty foolishly
1  0.484822  GPC8CL          [BG05XJ, IORYVC, ERT4PJ]    girl hits clueless dutifully
2  0.673377  BK3OE7  [GPC8CL, GPC8CL, GPC8CL, BG05XJ]  car eats clueless occasionally
5 -0.941320  7A37D6                          [76AYX1]    girl eats clueless dutifully
6  0.043402  7A37D6                          [BK3OE7]     rabbit jumps stupid merrily

With these metrics:
{u'A': {u'column_type': u'numeric',
        u'mean': 0.14122463494338683,
        u'std': 0.9745937014876852},
 u'B': {u'column_type': u'mono-label',
        u'imbalance_ratio': 5,
        u'labels': 20,
        u'occurrence_max': 5,
        u'occurrence_mean': 2.3500000000000001,
        u'occurrence_min': 1,
        u'occurrence_std_dev': 1.0618380290797651},
 u'C': {u'cardinality_mean': 1.7872340425531914,
        u'cardinality_std_dev': 0.84893350396851086,
        u'column_type': u'multi-label',
        u'imbalance_ratio': 3,
        u'labels': 15,
        u'occurrence_max': 9,
        u'occurrence_mean': 5.5999999999999996,
        u'occurrence_min': 3,
        u'occurrence_std_dev': 1.5405626677721789,
        u'partitions': {u'imbalance_ratio': 4,
                        u'labels': 36,
                        u'occurrence_max': 4,
                        u'occurrence_mean': 1.3055555555555556,
                        u'occurrence_min': 1,
                        u'occurrence_std_dev': 0.69997795379745287}},
 u'D': {u'column_type': u'mono-label',
        u'imbalance_ratio': 1,
        u'labels': 47,
        u'occurrence_max': 1,
        u'occurrence_mean': 1.0,
        u'occurrence_min': 1,
        u'occurrence_std_dev': 0.0}}
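
For reference, the pipeline producing the output above is roughly the following; this is a reconstruction from the printed steps, so tests/sample.py may differ in details such as the dataframe generation and the exact thresholds:

# Rough reconstruction of what tests/sample.py does, based on the output above;
# the dataframe generation and the filter thresholds are assumptions.
import pprint
import pandas as pd

from datadez.filter import filter_small_occurrence, filter_empty
from datadez.summarize import summarize

df = pd.DataFrame(...)  # random numeric / mono-label / multi-label / text columns

pprint.pprint(summarize(df))  # metrics before cleaning

df = filter_small_occurrence(df, column_name='B', min_occurrence=2)  # assumed threshold
df = filter_small_occurrence(df, column_name='C', min_occurrence=2)  # assumed threshold
df = filter_empty(df, column_names=['B', 'C'])  # drop rows where 'B' or 'C' is empty

pprint.pprint(summarize(df))  # metrics after cleaning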
