sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.

Home Page: http://dataprep.ai

License: MIT License

Python 60.38% HTML 9.30% JavaScript 26.79% Vue 1.00% CSS 2.50% Just 0.03%
dataprep data-science datapreparation dataconnector eda exploratory-data-analysis data-exploration connector cleaning datacleaning

dataprep's Introduction


Documentation | Forum

Low-code data preparation

Currently, you can use DataPrep to:

  • Collect data from common data sources (through dataprep.connector)
  • Do your exploratory data analysis (through dataprep.eda)
  • Clean and standardize data (through dataprep.clean)

Releases

DataPrep is released on both PyPI and conda-forge.

Installation

pip install -U dataprep

EDA

DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.

Create Profile Reports, Fast

You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:

  • 10X Faster: DataPrep.EDA can be 10X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
  • Interactive Visualization: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
  • Big Data Support: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()

Click here to see the generated report of the above code.
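As the Big Data Support point above notes, create_report also accepts a Dask dataframe directly. A minimal sketch, assuming a hypothetical local file data.csv:

import dask.dataframe as dd
from dataprep.eda import create_report

# Read a potentially larger-than-memory CSV as a Dask dataframe;
# "data.csv" is a placeholder path used only for illustration.
ddf = dd.read_csv("data.csv")

# The same call works for Dask input as for pandas input.
create_report(ddf).show_browser()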

Click here to see the benchmark result.

Try DataPrep.EDA Online: DataPrep.EDA Demo in Colab

Innovative System Design

DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.

  • Task-Centric API Design: You can declaratively specify a wide range of EDA tasks in different granularity with a single function call. All needed visualizations will be automatically and intelligently generated for you.
  • Auto-Insights: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
  • How-to Guide: A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.
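For example, the same plot function covers EDA tasks at different granularity, as the Task-Centric API Design point above describes. A minimal sketch, reusing the titanic dataset (the Age and Fare column names are assumptions about that dataset):

from dataprep.datasets import load_dataset
from dataprep.eda import plot

df = load_dataset("titanic")

plot(df)                 # overview: one visualization per column
plot(df, "Age")          # univariate: distributions of a single column
plot(df, "Age", "Fare")  # bivariate: relationship between two columns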

Learn DataPrep.EDA in 2 minutes:

Click here to check all the supported tasks.

Check plot, plot_correlation, plot_missing and create_report to see how each function works.

Clean

DataPrep.Clean contains more than 140 functions designed for cleaning and validating data in a DataFrame. It provides:

  • A Convenient GUI: incorporated into Jupyter Notebook, it lets users clean their DataFrame without writing any code (see the video below).
  • A Unified API: each function follows the syntax clean_{type}(df, 'column name') (see an example below).
  • Speed: the computations are parallelized using Dask. DataPrep.Clean can clean 50K rows per second on a dual-core laptop (i.e., 1 million rows in only 20 seconds).
  • Transparency: a report is generated that summarizes the alterations made to the data during cleaning.

The following video shows how to use the GUI of DataPrep.Clean.

The following example shows how to clean and standardize a column of country names.

from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
           country  country_clean
0              USA  United States
1  country: Canada         Canada
2              233        Estonia
3              tr          Turkey
4               NA            NaN

Type validation is also supported:

from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0     True
1    False
2     True
3     True
4    False
Name: country, dtype: bool
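Since validate_country returns a boolean Series aligned with the input, it can be used directly as a mask. A small sketch continuing the example above:

# Select the rows whose country value failed validation.
invalid_rows = df[~series]
print(invalid_rows)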

Check the documentation of DataPrep.Clean to see how each function works.

Connector

Connector now supports loading data from both web APIs and databases.

Web API

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!

Let's check out the benefits that Connector offers:

  • A unified API: You can fetch data from tens of popular websites using just one or two lines of code.
  • Auto Pagination: Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument _count) without getting into unnecessary detail about a specific pagination scheme.
  • Speed: Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the _concurrency argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.

How to fetch all publications of Andrew Y. Ng?

from dataprep.connector import connect
conn_dblp = connect("dblp", _concurrency = 5)
df = await conn_dblp.query("publication", author = "Andrew Y. Ng", _count = 2000)
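Note that query is a coroutine, so the await form above works as-is in Jupyter. In a plain Python script, a minimal sketch would drive it with asyncio:

import asyncio
from dataprep.connector import connect

async def main():
    # Same DBLP query as above, wrapped in a coroutine for script usage.
    conn_dblp = connect("dblp", _concurrency=5)
    return await conn_dblp.query("publication", author="Andrew Y. Ng", _count=2000)

df = asyncio.run(main())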

Here, you can find detailed Examples.

Connector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple configuration file to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.

Database

Connector has adopted connectorx to enable loading data from databases (PostgreSQL, MySQL, SQL Server, etc.) into Python dataframes (pandas, dask, modin, arrow, polars) in the fastest and most memory-efficient way. [Benchmark]

Just install connectorx (pip install -U connectorx) and run one line of code:

from dataprep.connector import read_sql
read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Check out here for supported databases and dataframes, and for more example usages.

Lineage

A column-level lineage graph for SQL. This tool helps you explore the column-level lineage among your SQL files by creating an interactive graph on a webpage.

A general introduction of the project can be found in this blog post.

The lineage module offers:

  • Automatic dependency creation: when there are dependencies among the SQL files and those tables are not yet in the database, the lineage module automatically tries to find the dependent tables and create them.
  • A clean, simple, and interactive user interface: the interface is easy to use, with minimal clutter on the page while showing all of the necessary information.
  • A variety of SQL statements: besides the typical SELECT statement, the lineage module also supports the CREATE TABLE/VIEW [IF NOT EXISTS] statement as well as the INSERT and DELETE statements.
  • dbt support: the lineage module is also implemented in dbt-LineageX; it is added into a dbt project and, using the dbt library fal, reuses the Python core to create similar output from the dbt project.

Uses and Demo

The interactive graph looks like this: here is a live demo with the mimic-iv concepts_postgres files (navigation instructions), created with one line of code:

from dataprep.lineage import lineagex
lineagex(
    sql="path/to/sql",
    target_schema="schema1",
    conn_string="postgresql://username:password@server:port/database",
    search_path_schema="schema1, public",
)

Check out more detailed usage and examples here.

Documentation

The documentation for each module (EDA, Clean, Connector, Lineage) can give you an impression of what DataPrep can do.

Contribute

There are many ways to contribute to DataPrep.

  • Submit bugs and help us verify fixes as they are checked in.
  • Review the source code changes.
  • Engage with other DataPrep users and developers on StackOverflow.
  • Ask questions & propose new ideas in our Forum.
  • Follow us on Twitter.
  • Contribute bug fixes.
  • Provide use cases and write about your user experience.

Please take a look at our wiki for development documentation!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

  • Pandas Profiling

    Inspired the report functionality and insights provided in dataprep.eda.

  • missingno

    Inspired the missing value analysis in dataprep.eda.

Citing DataPrep

If you use DataPrep, please consider citing the following paper:

Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.

BibTeX entry:

@inproceedings{dataprepeda2021,
  author    = {Jinglin Peng and Weiyuan Wu and Brandon Lockhart and Song Bian and Jing Nathan Yan and Linghao Xu and Zhixuan Chi and Jeffrey M. Rzeszotarski and Jiannan Wang},
  title     = {DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python},
  booktitle = {Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20--25, 2021, Virtual Event, China},
  year      = {2021}
}

dataprep's People

Contributors

andywangsfu, atol, bowen0729, dependabot[bot], devinllu, dovahcrow, dylanzxc, elsie4ever, eutialia, fatbuddy, jinglinpeng, juandavidospina, jwa345, kla55, lakshay-sethi, noirtree, pallavibharadwaj, peiwangdb, peshotan, qidanrui, ryanwdale, sahmad11, samplertechreport, shub970, wangxiaoying, waterpine, yixuy, yuzhenmao, yxie66, zshandy


dataprep's Issues

eda.plot_missing: error when passing column

I'm getting this error when running the following code:

import pandas as pd
from dataprep.eda import plot_missing

df = pd.read_csv("https://s3-us-west-2.amazonaws.com/dataprep.dsl/datasets/suicide-rate.csv")
plot_missing(df, "suicides")

[screenshot]
@Waterpine @jinglinpeng, is anyone else getting this error? I'm using dask version 2.9.1.

plot(df): x-ticks need to be optimized for numerical attributes

I was doing plot(df) on the example data and found that the x-ticks of many numerical attributes are not carefully set. Below I compare the x-ticks of Tableau (left) and DataPrep (right); apparently, Tableau looks better. This is not high-priority, but worth considering in a future release.

[screenshot]

plot_missing(df, x): several bugs

I found a few bugs when running plot_missing(df, 'HDI_for_year').

Bug 1. It would be better to let the user know that not all countries are displayed. Please work with @brandonlockhart to check how to add ngroups here, and also take a look at #42.

[screenshot]

Bug 2. Orange bars are not displayed on the sex and age tabs.

[screenshot]

Bug 3. No bars are displayed on the country_year and gdp_for_year tabs.

[screenshot]

plot: error in x ticks for some datasets

For some datasets, the x ticks of plot may have issues. Please try the following code:

import pandas as pd
from dataprep.eda import *
df = pd.read_csv('https://www.openml.org/data/get_csv/52236/phpAyyBys', na_values = ['?', 'nan'])
plot(df)

The result is as follows:

[screenshot]

data_connector: dblp schema

Please check the dblp schema below.

[screenshot]

  1. I suggest using authors as a column name since a publication can have more than one author.

  2. I suggest using pages as a column name since a publication can span more than one page.

  3. Why is the data type of the venue column a list? Should it be a string?

dataprep.eda: correlation for categorical column

For a classification task, we need to understand how other columns are related to the label column, which is a categorical column. However, the current plot_correlation only supports correlation for numeric columns, so we need to think about how the correlation of a categorical column could be understood (perhaps via, or not via, the plot_correlation function).

dataprep.eda: add case study for 'Housing Price'

The task is to add a case study of using dataprep.eda to simplify the EDA for the 'Housing Price' task.

House Price:
Data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
EDA: https://www.kaggle.com/mgmarques/houses-prices-complete-solution

The case study has 2 purposes:

  1. Help us figure out if the current API design is complete and easy to use.
  2. Educate the users on how to use our tool to finish popular tasks.

TODO:

  • Create Jupyter Notebook.
  • Add the use case into the documentation.

plot_missing: documentation for num_bins and num_cols

Please check below a screenshot of the documentation for plot_missing.

[screenshot]

There are a few issues:

  1. It should be "num_bins" rather than "bins_num". The description ("The number of rows in the figure") is also a bit confusing.
  2. In plot(df), this parameter is called bins. Please work with @brandonlockhart and make it consistent. I would suggest calling it bins, which is consistent with pandas.DataFrame.hist.
  3. Maybe it's better to use ncols instead of num_cols. This is because in plot(df), there is a parameter called ngroups, which is short for num_groups.

EDA plot function

Goal: Plot function includes plot(df), plot(df, x="x") and plot(df, x="x", y="y")

Step 1: create intermediates
Step 2: plot graphs based on intermediates
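To illustrate the two-step split, here is a minimal sketch of the pattern in plain pandas/NumPy/matplotlib (an illustration of the idea, not DataPrep internals):

import numpy as np
import pandas as pd

def compute_intermediates(df: pd.DataFrame, x: str) -> dict:
    # Step 1: all data computation happens here; nothing is drawn.
    counts, edges = np.histogram(df[x].dropna(), bins=10)
    return {"counts": counts, "edges": edges}

def render(itmdt: dict) -> None:
    # Step 2: draw from the precomputed results only; no further data access.
    import matplotlib.pyplot as plt
    plt.bar(itmdt["edges"][:-1], itmdt["counts"],
            width=np.diff(itmdt["edges"]), align="edge")
    plt.show()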

data_connector: API issuing strategy expression

Design a mechanism to support fluent API queries, i.e., getting results efficiently with respect to network conditions, websites' constraints, etc. (A retrospective from previous meetings.)

Runtime warning while using plot_correlation() on Kaggle notebooks

I was using dataprep in my notebooks on Kaggle and found that the plot_correlation() function throws a runtime warning while plotting. I think there should be try/except handling in the function so that it doesn't throw runtime warnings, because it looks odd.
Thank you!

data_connector: pagination design

I'm working on the design of the pagination feature of data connector. Here are some plans and existing problems. Thoughts are very welcome.

Plan:

  • Implementation of limit specification: the user can specify a limit to control the maximum number of returned results

  • Implementation of fetching all results under a query: with the help of an offset parameter for each API (since_id for Twitter; may need further modification)

Problems:

  1. How to find a general way to represent parameters in the query() function
  2. How to deal with the specific way of Twitter API in terms of pagination
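For illustration, here is a generic offset-based pagination loop of the kind sketched in the plan above (fetch_page is a hypothetical callable standing in for one Web API request; this is not Connector code):

def fetch_all(fetch_page, limit, page_size=100):
    # Repeatedly request pages until `limit` results are collected
    # or the API returns an empty page.
    results = []
    offset = 0
    while len(results) < limit:
        batch = fetch_page(offset=offset, count=min(page_size, limit - len(results)))
        if not batch:
            break
        results.extend(batch)
        offset += len(batch)
    return results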

plot_missing(df): bugs in tooltips

I got a dataset collected from Yelp using the data_connector API. The dataset has 20 rows. Below are the first 10 values in the address3 column.

[screenshot]

Bug 1. Please check the tooltip below.

  • missing% is larger than 100%?
  • loc should be 5 rather than 4~5.

[screenshot]

Bug 2. Please check the tooltip below.

  • Should the location of the first row be 1 or 0? In a Pandas DataFrame, iloc starts from 0.

[screenshot]

Bug 3. Should we consider empty strings as missing values? If so, the address3 column should have many more missing values.

eda.plot_correlation: handle categorical column

I'm considering whether we should handle categorical variables in plot_correlation. One use case of plot_correlation is plot_correlation(df, x = label) to rank the features that are correlated with the label. For this scenario, it would be important to have a uniform way to measure the correlation for both categorical and continuous variables.

My idea is to add one measure to handle categorical variables, such as Cramér's V (based on the chi-square test) or the Uncertainty Coefficient (based on mutual information). For continuous variables, we make bins and treat them as categorical variables.

It requires adding one more tab to the current output of plot_correlation(df) and plot_correlation(df, x), which shows the Cramér's V or Uncertainty Coefficient for all columns. Please let me know your opinions. @dovahcrow @jnwang @Waterpine @brandonlockhart
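For reference, a plain pandas/SciPy sketch of the Cramér's V measure mentioned above (a standard formula shown for illustration, not DataPrep code):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    # Build the contingency table and run the chi-square test.
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    # Normalize the chi-square statistic into [0, 1].
    return float(np.sqrt((chi2 / n) / min(r - 1, k - 1)))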

support of time series

This issue is about a rough idea to support time series in dataprep.eda.

Essentially, a datetime can be regarded as a numeric type: it can be transformed to a timestamp (float) via datetime.timestamp() or pd.to_numeric(). Hence, we could do the following work as the initial support for time series.

  1. Identify the columns with datetime64 type in the dataframe.
  2. plot(df) & plot(df, x): handle a time series column like a numeric column, which can be binned. When showing the ticks of a time series column, show the datetime string via a function like datetime.strftime(). An example output is https://pandas.pydata.org/pandas-docs/version/0.13/visualization.html
  3. plot(df, x, y): when x is a datetime column and y is a numeric column, replace the scatter plot with a line chart, which shows how y changes with x. For all other cases, apply the processing of step 2.
  4. plot_correlation: we could ignore the datetime column as pandas does, or transform the datetime to a numeric column via pd.to_numeric() and then apply the original processing.
  5. plot_missing: apply processing similar to step 2.
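A minimal pandas sketch of the datetime-to-numeric conversion underlying steps 2 and 4 (the dates are arbitrary illustration values):

import pandas as pd

# Hypothetical datetime64 column.
s = pd.Series(pd.to_datetime(["2020-01-01", "2020-06-15", "2020-12-31"]))

# View the datetimes as int64 nanoseconds since epoch, so they can be
# binned and correlated like any other numeric column.
numeric = s.astype("int64")
print(numeric)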

plot(df) and plot_correlation(df) fail when data has 'list' columns

When running plot(df) and plot_correlation(df) on the following dataframe, both plot and plot_correlation failed, since the author column is a list.

For plot(), the reported error is TypeError: unhashable type: 'list'

For plot_correlation(df), the reported error is AssertionError: No numerical columns found

[screenshot]

data_connector: add Author table to DBLP?

I suggest adding an Author table to the DBLP connector. This can solve the name ambiguity issue.

Suppose I would like to find all the papers published by Guoliang Li (Tsinghua). I can first query the Author table to find all the people whose name is Guoliang Li. https://dblp.org/search/author/api?q=Guoliang$_Li$

There are six people, and the second one is whom I am looking for:

[screenshot]

Once #50 is supported, I can get all of Guoliang Li (Tsinghua)'s publications through https://dblp.org/search/publ/api?q=author%3AGuoliang_Li_0001%3A

data_connector: automate testing configuration

The task is to make every PR trigger the impacted module tests automatically. This can be achieved with GitHub Actions.

  • design the test workflow for data-connector

  • implement the workflow

  • code review

  • PR & release

eda.plot_missing: error when changing column type

I tried the Titanic training data, which can be downloaded from https://www.kaggle.com/c/titanic/data.

The following code will raise an error:
import pandas as pd
from dataprep.eda import *
df = pd.read_csv('titanic/train.csv')
df['PassengerId'] = df['PassengerId'].astype("object")
plot_missing(df, 'Age')

The error information is as follows:
[screenshot]

However, if we do not change the column type of 'PassengerId', i.e., remove df['PassengerId'] = df['PassengerId'].astype("object"), the code runs successfully.

Conda installation of dataprep is not supported


$ conda install dataprep
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • dataprep

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

plot_missing(df, x, y): several bugs

I found a few bugs after running plot_missing(df, x, y).

Bug 1. DropMissing should be orange and Origin should be blue. Also, the PDF curve looks strange to me. Please double-check whether it is correct.

[screenshot]

Bug 2. The two CDF curves overlap, which looks strange to me.
[screenshot]

Bug 3. Please make the box plot consistent with the one generated by plot(df, numerical_x). Also, the color scheme of the box plot looks strange to me.

[screenshot]

Extend data-connector for more websites

  • look for more frequently used websites besides the currently supported ones (e.g., Yelp) and make a list

  • learn how to write a data-connector config for a new website

  • implement support for one more website (the rest on the list will be supported in the future)

  • PR & code review

Design and implement an all-in-one report

Given a dataset, run all (or many) of the plot functions and output the visualizations into a nicely formatted HTML file. This will be similar to pandas-profiling, the main differences being a larger variety of interaction plots and the tooltips. The current plan is to not include descriptive statistics.

  • Create a low-fidelity mockup
  • Get feedback about the mockup from the DataPrep team
  • Implement the report
  • Test

plot(df, x): histogram shows incorrect values

When running plot(df, 'year', 31) on the example dataset (suicide-rate.csv), I got the following histogram, which shows 2015: 936 and 2016: 904. However, the correct values should be 2015: 744 and 2016: 160.

[screenshot]

data_connector: issue using API parameters without template variables

Support for templates was added in this PR.
When template variables are not specified in the API request, the template value still contains the surrounding string and is not "empty". This always results in a "key conflicting with to_key" warning and returns empty results.

Example:
If first_name and last_name are template variables that are not mentioned in the request, and to_key is q, specified in the following manner:

df = dc.query("publication", q="Journal Articles")

then the request contains the template value <Template memory:7f4f6c36c2d0> author:_: along with the above warning. Instead of returning the publications of type "Journal Articles", it returns an empty data frame.

No module named 'toml'

I tried to upgrade dataprep to the latest version with pip install -U dataprep, but got the following error:

[screenshot]

eda.plot: empty bins in histogram

Currently the histogram keeps bins even if they are empty. For example, run the following code:

import pandas as pd
from dataprep.eda import *
df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
plot(df)

The result is:

[screenshot]

Actually, Pclass and survived have only 3 and 2 distinct values, respectively. Since we show 10 bins by default, lots of bins are empty for Pclass and survived. Could we have a better way to visualize the histogram in this case? Maybe take a look at how other plotting libraries handle this issue.
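One possible direction, sketched in plain pandas: when a column has only a few distinct values, draw one bar per value instead of forcing a fixed bin count (toy data for illustration):

import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])  # toy column with 3 distinct values

# With few distinct values, one bar per value avoids empty bins.
if s.nunique() <= 10:
    counts = s.value_counts().sort_index()
    print(counts)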

fix KDE plot

The KDE plot of DataPrep is bad and needs to be fixed:

[screenshot]

plot_correlation: handle missing values

It looks like plot_correlation(df, x) and plot_correlation(df, x, y) cannot handle missing values. Could you please take a look? @Waterpine

The code is as follows:
import pandas as pd
from dataprep.eda import plot_correlation

df = pd.read_csv('https://www.openml.org/data/get_csv/9/dataset_9_autos.arff', na_values = ['?'])
plot_correlation(df, 'price')
plot_correlation(df, 'price', 'bore')

The running result is as follows:
[screenshots]

Combine two functions together

I find that plot(df, x, y) and plot_correlation(df, x, y) have similar outputs. Why not combine them? Then we could just use plot(df, x, y) to analyze the data.
[screenshot]

plot(df, x, y): ngroups does not work

I would like to increase the number of groups in the box plot of plot(df, "suicides", "country"), but found that setting ngroups = 20 does not work (see below).

[screenshot]

plot(df, x, y): make it possible and easy for users to set ngroups

When running plot(df, "country", "generation"), I got the following plots. It seems that it is impossible to adjust ngroups (i.e., top 5, top 20, top 70) for each plot.

[screenshots]

To make it easy to set ngroups, I have one proposal.

  • First, we make the three plots have the same ngroups by default (e.g., ngroups = 10).
  • Second, if a user wants to change ngroups, she only needs to change one parameter and then it will be applied to all three plots.
  • Third, if ngroups is very large, then we should increase the plot width/height automatically so that a user can view the whole plot by scrolling the vertical and horizontal bars. See the plot below for an example.

[screenshot]
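A tiny sketch of the auto-sizing idea in the third point (the constants are arbitrary illustration values, not a proposed default):

def plot_height(ngroups: int, bar_px: int = 20, base_px: int = 100) -> int:
    # Grow the figure with the number of groups so that every
    # group label stays readable when scrolling.
    return base_px + bar_px * ngroups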
