iamaziz / pydataset

Instant access to many datasets in Python.

License: MIT License

Python 100.00%

python datasets data-science

pydataset's Introduction

PyDataset

Provides instant access to many datasets right from Python (in pandas DataFrame structure).

What?

The idea is simple. There are various datasets available out there, but they are scattered in different places over the web. Is there a quick way (in Python) to access them instantly without going through the hassle of searching, downloading, and reading ... etc? PyDataset tries to address that question :)

Usage:

Start with importing data():

from pydataset import data

To load a dataset:

titanic = data('titanic')

To display the documentation of a dataset:

data('titanic', show_doc=True)

To see the available datasets:

data()

That's it. See more examples.

Why?

In R, there is a very easy and immediate way to access multiple statistical datasets with almost no effort. All it takes is one line: > data(dataset_name). This makes life easier for quick prototyping and testing. Well, I am jealous that Python does not have similar functionality. Thus, the aim of pydataset is to fill that gap.

Currently, pydataset has about 757 (mostly numerical) datasets, which are based on RDatasets. In the future, I plan to scale it to include a larger set of datasets. For example:

include textual data for NLP-related tasks, and
allow adding a new dataset to the in-module repository.

Installation:

$ pip install pydataset

Uninstall:

$ pip uninstall pydataset
$ rm -rf $HOME/.pydataset

Changelog

0.2.0

Add dataset search by name similarity.
Example:

>>> data('heat')
Did you mean:
Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt
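
The suggestion behaviour shown above could be approximated with the standard library's difflib module. This is a minimal sketch under stated assumptions, not pydataset's actual implementation: `suggest_names` is a hypothetical helper, and the short `names` list is a stand-in for the real dataset index.

```python
import difflib


def suggest_names(query, names, limit=10, cutoff=0.6):
    """Return dataset names that look similar to `query`, ignoring case."""
    # Map lowercase names back to their original spelling so that
    # 'heat' can match 'Heating' as well as 'heart'.
    by_lower = {name.lower(): name for name in names}
    close = difflib.get_close_matches(
        query.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    return [by_lower[match] for match in close]


# Hypothetical subset of the dataset index, for illustration only.
names = ["Wheat", "heart", "Heating", "Yeast", "titanic", "iris"]
print("Did you mean:", ", ".join(suggest_names("heat", names)))
```

Lowering `cutoff` returns looser matches; raising it keeps only near-identical names.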

0.1.1

Fix: add support for Windows and fix file paths, issue #1

Dependency:

pandas

Miscellaneous:

Tested on OSX and Linux (debian).
Supports both Python 2 (2.7.11) and Python 3 (3.5.1).

TODO:

add textual datasets (e.g. NLTK stuff).
add samples generators.

Thanks to:

RDatasets: R's datasets collection.

pydataset's Issues

Importing Pydataset

Hello

I am trying to use Pydataset and I am having a strange error.

I am using Windows 10 with Python 3.6. I have already updated pip, and I can list all the datasets, but I cannot load any of them.

Here is a screenshot:

As you can see, it says "Not valid dataset name and no similar found", but I have tried many different names, copying and pasting them. In this case, exceptionally, I used cmd, but mostly I use IDLE or PyCharm.

I mostly use Windows 10, but the problem also occurs on Linux Mint in a virtual machine.

Translating R to Python. Worth the effort?

The starter datasets came from R's samples, so their html documentation includes R examples on how to use the data. Would it be considered worthwhile to translate the usage information to Python3?

Please make datasets non-executable

When initiating the datasets repo, all files have permissions 0755 (-rwxr-xr-x) when in fact they are not executable. Please make the initialization install datasets as 0644 (-rw-r--r--).
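
As a stopgap, the permissions can be corrected after initialization. The sketch below is not part of pydataset: `make_non_executable` is a hypothetical helper, and it assumes a POSIX system and that the datasets live under a single directory (the README's uninstall step suggests `$HOME/.pydataset`).

```python
import os
import stat


def make_non_executable(root, mode=0o644):
    """Set `mode` on every regular file under `root`; return the paths changed."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Only touch files whose permission bits differ from the target.
            if stat.S_IMODE(os.stat(path).st_mode) != mode:
                os.chmod(path, mode)
                changed.append(path)
    return changed
```

For example, `make_non_executable(os.path.expanduser("~/.pydataset"))` would apply it to the assumed data directory. Directories are left alone because they need the execute bit to remain traversable.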

Process for adding datasets?

In the README, there's interest in expanding the number of datasets. I'm wondering what kind of criteria new data would have to meet. Just off the top of my head:

Would it need to be useful prima facie, or would niche data also be acceptable? The kind of thing I'm considering (not seriously for inclusion, just in general) is that I'm working on scraping info about episodes of Detective Conan, such as what characters appeared in them. Would that be too niche?
Would it have to pass some vote for inclusion? If so, who gets a vote?
All the current data is csv. Would other kinds of data formats be able to be included later? Like HDF5?

Fix simple typo: smiliarity -> similarity

Issue Type

[x] Bug (Typo)

Steps to Replicate

Examine pydataset/support.py.
Search for smiliarity.

Expected Behaviour

Should read similarity.

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR.

https://github.com/timgates42/PyDataset/pull/new/bugfix_typo_similarity

Thanks.

Allow usage of pydataset with no external dependencies

PyDataset is a fantastic tool for learning Python. But requiring pandas (and hence numpy) is a big barrier to entry. What's more, you may want to load the data with another tool to process it.

To make your lib more flexible and more newcomer-friendly, I'd advise:

to create a toolbox that lets you define the data index and load it in a generic way. It should not rely on a particular technology for downloading or for the result format, and it should provide hooks to plug in your own;
then build adapters for your downloader and pandas;
then build an adapter for regular Python data structures.
It should default to pandas if it's installed, or to regular Python lists/dicts if it's not.

This will allow:

beginners to use it without needing to learn or install pandas;
external tools to embed it and adapt it easily;
easy adaptation to other data processing tools;
easy adaptation to other ways of downloading data (gevent, asyncio, thread pools, etc.).
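
The fallback described in this proposal could look roughly like the sketch below. It is an illustration only, not pydataset code: `load_rows` and its `backend` parameter are hypothetical, and it assumes the dataset is already available as a local CSV file.

```python
import csv


def load_rows(path, backend="auto"):
    """Load a CSV file as a pandas DataFrame if available, else a list of dicts."""
    if backend not in ("auto", "pandas", "python"):
        raise ValueError("backend must be 'auto', 'pandas' or 'python'")

    pd = None
    if backend in ("auto", "pandas"):
        try:
            import pandas as pd
        except ImportError:
            # Only fail when pandas was requested explicitly.
            if backend == "pandas":
                raise
    if pd is not None:
        return pd.read_csv(path)

    # Pure-Python fallback: every value stays a string.
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

With `backend="python"` the function never imports pandas, which keeps the beginner path free of external dependencies.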

Merge code ? DataPackage / datasets ...

Hello,

A lot of datasets are also available at https://github.com/datasets
They are called DataPackage.

They can be accessed from Python using https://github.com/datapackages/datapackage-py (work in progress) or https://github.com/trickvi/datapackage

Pinging @vitorbaptista @pwalsh @trickvi @rgrp

There is some overlap between these projects so maybe merging might be considered.

At least we should all be aware of the existence of the other projects.

Kind regards

PS : Datasets are also available at https://github.com/vincentarelbundock/Rdatasets

Regression/Classification info

Hi,

It would be nice to have a 3rd column for data() output indicating whether the dataset can be used for regression or classification problems.

Distinct dataset documentation

The documentation shown for the 'housing' dataset doesn't match the rows and columns actually imported.

How to reproduce:

>>> from pydataset import data
>>> df = data('housing')
>>> df

       id    y  time  sec
1       1  1.0     0    1
2       1  2.0     6    1
3       1  2.0    12    1
4       1  2.0    24    1
5       2  1.0     0    1
...   ...  ...   ...  ...
1444  361  NaN    24    0
1445  362  1.0     0    0
1446  362  1.0     6    0
1447  362  1.0    12    0
1448  362  1.0    24    0

[1448 rows x 4 columns]

>>> data('housing', show_doc=True)

housing

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

Frequency Table from a Copenhagen Housing Conditions Survey

Description

The housing data frame has 72 rows and 5 variables.

Usage

housing

Format

Sat

Satisfaction of householders with their present housing circumstances, (High,
Medium or Low, ordered factor).

Infl

Perceived degree of influence householders have on the management of the
property (High, Medium, Low).

Type

Type of rental accommodation, (Tower, Atrium, Apartment, Terrace).

Cont

Contact residents are afforded with other residents, (Low, High).

Freq

Frequencies: the numbers of residents in each class.

Source

Madsen, M. (1976) Statistical analysis of multiple contingency tables. Two
examples. Scand. J. Statist. 3, 97–106.

Cox, D. R. and Snell, E. J. (1984) Applied Statistics, Principles and
Examples. Chapman & Hall.

References

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S.
Fourth edition. Springer.

Examples

options(contrasts = c("contr.treatment", "contr.poly"))
# Surrogate Poisson models
house.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family = poisson,
                  data = housing)
summary(house.glm0, cor = FALSE)
addterm(house.glm0, ~. + Sat:(Infl+Type+Cont), test = "Chisq")
house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))
summary(house.glm1, cor = FALSE)
1 - pchisq(deviance(house.glm1), house.glm1$df.residual)
dropterm(house.glm1, test = "Chisq")
addterm(house.glm1, ~. + Sat:(Infl+Type+Cont)^2, test  =  "Chisq")
hnames <- lapply(housing[, -5], levels) # omit Freq
newData <- expand.grid(hnames)
newData$Sat <- ordered(newData$Sat)
house.pm <- predict(house.glm1, newData,
                    type = "response")  # poisson means
house.pm <- matrix(house.pm, ncol = 3, byrow = TRUE,
                   dimnames = list(NULL, hnames[[1]]))
house.pr <- house.pm/drop(house.pm %*% rep(1, 3))
cbind(expand.grid(hnames[-1]), round(house.pr, 2))
# Iterative proportional scaling
loglm(Freq ~ Infl*Type*Cont + Sat*(Infl+Type+Cont), data = housing)
# multinomial model
library(nnet)
(house.mult<- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                       data = housing))
house.mult2 <- multinom(Sat ~ Infl*Type*Cont, weights = Freq,
                        data = housing)
anova(house.mult, house.mult2)
house.pm <- predict(house.mult, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pm, 2))
# proportional odds model
house.cpr <- apply(house.pr, 1, cumsum)
logit <- function(x) log(x/(1-x))
house.ld <- logit(house.cpr[2, ]) - logit(house.cpr[1, ])
(ratio <- sort(drop(house.ld)))
mean(ratio)
(house.plr <- polr(Sat ~ Infl + Type + Cont,
                   data = housing, weights = Freq))
house.pr1 <- predict(house.plr, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pr1, 2))
Fr <- matrix(housing$Freq, ncol  =  3, byrow = TRUE)
2*sum(Fr*log(house.pr/house.pr1))
house.plr2 <- stepAIC(house.plr, ~.^2)
house.plr2$anova

I can't find out what the dataset that is actually imported represents. I suggest updating the documentation so that it describes the correct dataset.

Break same name with R

I always thought having the name "data" as a function globally is one of the weirdest things in R.
Perhaps consider changing it into load_data (comparable to load_xxxx in sklearn).
Then people can use data (however vague that term is anyway) in their scripts freely.

Display options set

Pydataset sets pandas display options such as display.max_rows = 170 and does not restore them afterwards. In my opinion, these should be set inside an option_context context manager. A module should not modify the user's environment permanently (that is, until the interactive interpreter is restarted).
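
A minimal sketch of the pattern suggested here, assuming the package only needs the larger row limit while it renders a listing. `render_listing` is a hypothetical helper, not part of pydataset; `pd.option_context` is the pandas context manager that restores the previous option values on exit.

```python
import pandas as pd


def render_listing(df, max_rows=170):
    """Return `df` as text using a temporary display.max_rows setting."""
    # The option applies only inside the with-block; the user's own
    # setting is restored as soon as the block exits.
    with pd.option_context("display.max_rows", max_rows):
        return repr(df)


before = pd.get_option("display.max_rows")
print(render_listing(pd.DataFrame({"dataset_id": ["iris", "titanic"]})))
print(pd.get_option("display.max_rows") == before)
```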

get_rdatasets in statsmodels

Just wanted to point you to some similar functionality we have in statsmodels that just pulls from the Rdatasets repo.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/datasets/utils.py#L246

Provide namespaces and an index

This is a fantastic idea, kudos.

With the growing number of datasets your tool will support, you will quickly run out of names. And searching for a particular dataset will be hard.

I'd recommend:

to require the datasets to have namespaces, e.g. by source ("tld.domain.titanic") or by taxonomy ("history.titanic.victims.#timestamp#");
to publish a web page with an index of all datasets with their namespaces and content;
to define a procedure for adding a dataset to the repo, possibly through plugins.
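
To make the namespace idea concrete, here is a small sketch of a dotted-name index with prefix lookup. Everything in it is hypothetical: the `find_datasets` helper, the namespaced names, and the descriptions are invented for illustration and do not exist in pydataset.

```python
def find_datasets(index, prefix):
    """Return the namespaced names in `index` that fall under `prefix`."""
    prefix_parts = prefix.split(".")
    # Compare whole name components so 'history.tit' does not match
    # 'history.titanic'.
    return sorted(
        name
        for name in index
        if name.split(".")[: len(prefix_parts)] == prefix_parts
    )


# Hypothetical index: namespaced name -> short description.
index = {
    "history.titanic.victims": "Passenger records",
    "history.titanic.crew": "Crew records",
    "biology.iris": "Iris flower measurements",
}
print(find_datasets(index, "history.titanic"))
```

The same mapping could back the proposed web page, since each entry already pairs a namespace with a description.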

Getting error in windows 10 when installing with pip3

error: Microsoft Visual C++ 10.0 is required. Get it with "Microsoft Windows SDK 7.1": www.microsoft.com/download/details.aspx?id=8279

Unable to load datasets (Python 3.5.1 under Anaconda, Win 7)

Hi,

I'm unable to load datasets in Python 3.5.1, Win 7. I can install pydataset, import it, and view available datasets just fine. However, when I try to load datasets, I get an error saying that I have the wrong name for the dataset. For example:

In [1]: iris= data('iris')
Traceback (most recent call last):

  File "<ipython-input-3-f894fb655dca>", line 1, in <module>
    cake = data("cake", show_doc=True)

  File "C:\Users\ctaylor\AppData\Local\Continuum\Anaconda3\lib\site-packages\pydataset\__init__.py", line 36, in data
    raise Exception('Wrong dataset name! Try: data() to see available.')

Exception: Wrong dataset name! Try: data() to see available.