pydataset's Introduction

PyDataset

Provides instant access to many datasets right from Python (in pandas DataFrame structure).

What?

The idea is simple. Various datasets are available out there, but they are scattered across the web. Is there a quick way (in Python) to access them instantly, without the hassle of searching, downloading, and reading them in? PyDataset tries to address that question :)

Usage:

Start by importing data():

from pydataset import data
  • To load a dataset:
titanic = data('titanic')
  • To display the documentation of a dataset:
data('titanic', show_doc=True)
  • To see the available datasets:
data()

That's it. See more examples.

Why?

In R, there is a very easy and immediate way to access multiple statistical datasets with almost no effort. All it takes is one line: > data(dataset_name). This makes life easier for quick prototyping and testing. Well, I am jealous that Python does not have similar functionality. Thus, the aim of pydataset is to fill that gap.

Currently, pydataset has about 757 (mostly numerical) datasets, all based on RDatasets. In the future, I plan to scale it to include a larger set of datasets. For example,

  1. include textual data for NLP-related tasks, and
  2. allow adding a new dataset to the in-module repository.

Installation:

$ pip install pydataset

Uninstall:

  • $ pip uninstall pydataset
  • $ rm -rf $HOME/.pydataset

Changelog

0.2.0

  • Add search dataset by name similarity.
  • Example:
>>> data('heat')
Did you mean:
Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt
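
The name-similarity search above can be approximated with the standard library. The following is only a minimal sketch, not pydataset's actual implementation: it assumes fuzzy matching via difflib, and the function name suggest_names, the cutoff value, and the short name list are all hypothetical.

```python
import difflib

# Hypothetical subset of dataset names; the real index is much larger.
DATASET_NAMES = ["Wheat", "heart", "Heating", "Yeast", "titanic", "iris", "deaths"]


def suggest_names(query, names=DATASET_NAMES, limit=5, cutoff=0.5):
    """Return up to `limit` names that resemble `query`, ignoring case."""
    # Map lowercased names back to their original spelling.
    lowered = {}
    for name in names:
        lowered.setdefault(name.lower(), name)
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[match] for match in matches]


# Names close to "heat" are suggested; unrelated names are filtered out.
print(suggest_names("heat"))
```

Lowering the cutoff returns more, but less relevant, suggestions, so the value would need tuning against the full list of names.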

0.1.1

  • Fix: add support for Windows and fix file paths (issue #1).

Dependency:

  • pandas

Miscellaneous:

  • Tested on OSX and Linux (debian).
  • Supports both Python 2 (2.7.11) and Python 3 (3.5.1).

TODO:

  • add textual datasets (e.g. NLTK stuff).
  • add samples generators.

Thanks to:

pydataset's People

Contributors

iamaziz

pydataset's Issues

Importing Pydataset

Hello

I am trying to use Pydataset and I am getting a strange error.

I am using Windows 10 with Python 3.6. I have already updated pip, and I can list all the datasets, but I cannot load any of them.

Here is a screenshot (image not preserved).

As you can see, it says "Not valid dataset name and no similar found", but I have tried many different names, copying and pasting them. In this case, exceptionally, I used cmd, but mostly I use IDLE or PyCharm.

I mostly use Windows 10, but this also occurs on Linux Mint in a virtual machine.

Translating R to Python. Worth the effort?

The starter datasets came from R's samples, so their HTML documentation includes R examples of how to use the data. Would it be considered worthwhile to translate the usage information to Python 3?

Please make datasets non-executable

When initiating the datasets repo, all files have permissions 0755 (-rwxr-xr-x) when in fact they are not executable. Please make the initialization install datasets as 0644 (-rw-r--r--).
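
A minimal sketch of the requested behaviour, assuming the datasets live in a single directory tree (the README's uninstall step points at $HOME/.pydataset). The helper name make_non_executable is hypothetical and is not part of pydataset.

```python
import os
import stat


def make_non_executable(root):
    """Set every file under `root` to mode 0644; return how many were changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Compare only the permission bits, not the file-type bits.
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode != 0o644:
                os.chmod(path, 0o644)
                changed += 1
    return changed
```

It could be called once after initialization, for example as make_non_executable(os.path.expanduser("~/.pydataset")). Directories are left alone on purpose, because they need the execute bit to be traversable.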

Process for adding datasets?

In the README, there's interest in expanding the number of datasets. I'm wondering what criteria new data would have to meet. Just off the top of my head:

  1. Would it need to be useful prima facie, or would niche data also be acceptable? The kind of thing I'm considering (not seriously for inclusion, just in general) is that I'm working on scraping info about episodes of Detective Conan, such as what characters appeared in them. Would that be too niche?
  2. Would it have to pass some vote for inclusion? If so, who gets a vote?
  3. All the current data is CSV. Could other data formats, such as HDF5, be included later?

Fix simple typo: smiliarity -> similarity

Issue Type

[x] Bug (Typo)

Steps to Replicate

  1. Examine pydataset/support.py.
  2. Search for smiliarity.

Expected Behaviour

  1. Should read similarity.

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR.

https://github.com/timgates42/PyDataset/pull/new/bugfix_typo_similarity

Thanks.

Allow usage of pydataset with no external dependencies

PyDataset is a fantastic tool for learning Python. But requiring pandas (and hence numpy) is a big barrier to entry. What's more, you may want to load the data with another tool in order to process it.

To make your lib more flexible and more newcomer-friendly, I'd advise:

  • creating a toolbox that lets you define the data index and load it in a generic way. It should not rely on a particular technology for downloading or for the result format, and it should provide hooks to plug in your own;
  • then building adapters for your downloader and pandas;
  • then building an adapter for regular Python data structures.
  • It should default to pandas if it's installed, or to regular Python lists/dicts if it's not.

This will allow:

  • beginners to use it without needing to learn or install pandas;
  • external tools to embed it and adapt it easily;
  • easy adaptation for use with other data processing tools;
  • easy adaptation for use with other ways of downloading data (gevent, asyncio, threadpool, etc.).
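
The "default to pandas, fall back to plain Python" idea could look roughly like the sketch below. It is an illustration under assumptions, not existing pydataset code: load_table, its prefer_pandas flag, and the sample CSV text are made up for the example.

```python
import csv
import io


def load_table(csv_text, prefer_pandas=True):
    """Parse CSV text into a DataFrame if pandas is available, else a list of dicts."""
    if prefer_pandas:
        try:
            import pandas as pd  # optional dependency
        except ImportError:
            pd = None
        if pd is not None:
            return pd.read_csv(io.StringIO(csv_text))
    # Plain-Python fallback: one dict per row, all values kept as strings.
    return list(csv.DictReader(io.StringIO(csv_text)))


# Made-up sample data, only to exercise the fallback path.
SAMPLE = "name,age\nAlice,29\nBob,41\n"
rows = load_table(SAMPLE, prefer_pandas=False)
print(rows[0]["name"])
```

One trade-off of the fallback is that csv.DictReader does no type inference, so numeric columns arrive as strings and callers have to convert them themselves.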

Merge code ? DataPackage / datasets ...

Hello,

A lot of datasets are also available at https://github.com/datasets
They are called DataPackage.

They are available using Python and

https://github.com/datapackages/datapackage-py (Work In Progress) or https://github.com/trickvi/datapackage

Pinging @vitorbaptista @pwalsh @trickvi @rgrp

There is some overlap between these projects so maybe merging might be considered.

At the very least, we should all be aware of the existence of the other projects.

Kind regards

PS : Datasets are also available at https://github.com/vincentarelbundock/Rdatasets

Regression/Classification info

Hi,

It would be nice to have a 3rd column for data() output indicating whether the dataset can be used for regression or classification problems.

Distinct dataset documentation

The documentation shown for the 'housing' dataset doesn't match the actual rows and columns imported.

How to reproduce:

>>> from pydataset import data
>>> df = data('housing')
>>> df

       id    y  time  sec
1       1  1.0     0    1
2       1  2.0     6    1
3       1  2.0    12    1
4       1  2.0    24    1
5       2  1.0     0    1
...   ...  ...   ...  ...
1444  361  NaN    24    0
1445  362  1.0     0    0
1446  362  1.0     6    0
1447  362  1.0    12    0
1448  362  1.0    24    0

[1448 rows x 4 columns]

>>> data('housing', show_doc='True')

housing

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

Frequency Table from a Copenhagen Housing Conditions Survey

Description

The housing data frame has 72 rows and 5 variables.

Usage

housing

Format

Sat

Satisfaction of householders with their present housing circumstances, (High,
Medium or Low, ordered factor).

Infl

Perceived degree of influence householders have on the management of the
property (High, Medium, Low).

Type

Type of rental accommodation, (Tower, Atrium, Apartment, Terrace).

Cont

Contact residents are afforded with other residents, (Low, High).

Freq

Frequencies: the numbers of residents in each class.

Source

Madsen, M. (1976) Statistical analysis of multiple contingency tables. Two
examples. Scand. J. Statist. 3, 97–106.

Cox, D. R. and Snell, E. J. (1984) Applied Statistics, Principles and
Examples. Chapman & Hall.

References

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S.
Fourth edition. Springer.

Examples

options(contrasts = c("contr.treatment", "contr.poly"))
# Surrogate Poisson models
house.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family = poisson,
                  data = housing)
summary(house.glm0, cor = FALSE)
addterm(house.glm0, ~. + Sat:(Infl+Type+Cont), test = "Chisq")
house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))
summary(house.glm1, cor = FALSE)
1 - pchisq(deviance(house.glm1), house.glm1$df.residual)
dropterm(house.glm1, test = "Chisq")
addterm(house.glm1, ~. + Sat:(Infl+Type+Cont)^2, test  =  "Chisq")
hnames <- lapply(housing[, -5], levels) # omit Freq
newData <- expand.grid(hnames)
newData$Sat <- ordered(newData$Sat)
house.pm <- predict(house.glm1, newData,
                    type = "response")  # poisson means
house.pm <- matrix(house.pm, ncol = 3, byrow = TRUE,
                   dimnames = list(NULL, hnames[[1]]))
house.pr <- house.pm/drop(house.pm %*% rep(1, 3))
cbind(expand.grid(hnames[-1]), round(house.pr, 2))
# Iterative proportional scaling
loglm(Freq ~ Infl*Type*Cont + Sat*(Infl+Type+Cont), data = housing)
# multinomial model
library(nnet)
(house.mult<- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                       data = housing))
house.mult2 <- multinom(Sat ~ Infl*Type*Cont, weights = Freq,
                        data = housing)
anova(house.mult, house.mult2)
house.pm <- predict(house.mult, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pm, 2))
# proportional odds model
house.cpr <- apply(house.pr, 1, cumsum)
logit <- function(x) log(x/(1-x))
house.ld <- logit(house.cpr[2, ]) - logit(house.cpr[1, ])
(ratio <- sort(drop(house.ld)))
mean(ratio)
(house.plr <- polr(Sat ~ Infl + Type + Cont,
                   data = housing, weights = Freq))
house.pr1 <- predict(house.plr, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pr1, 2))
Fr <- matrix(housing$Freq, ncol  =  3, byrow = TRUE)
2*sum(Fr*log(house.pr/house.pr1))
house.plr2 <- stepAIC(house.plr, ~.^2)
house.plr2$anova

I can't work out what the imported dataset actually represents. I suggest updating the documentation so that it describes the correct dataset.

Break same name with R

I always thought having the name "data" as a function globally is one of the weirdest things in R.
Perhaps consider changing it into load_data (comparable to load_xxxx in sklearn).
Then people can use data (however vague that term is anyway) in their scripts freely.

Display options set

Pydataset sets display options such as display.max_rows = 170 and does not restore them afterwards. In my opinion, these should be set inside an option_context context manager. A module should not modify the user's environment permanently (that is, until the interactive interpreter is restarted).

Provide namespaces and an index

This is a fantastic idea, kudos.

With the growing number of datasets your tool will support, you will quickly run out of names, and searching for a particular dataset will be hard.

I'd recommend:

  • requiring datasets to have namespaces, e.g. by source ("tld.domain.titanic") or by taxonomy ("history.titanic.victims.#timestamp#");
  • publishing a web page with an index of all datasets, with their namespaces and content;
  • defining a procedure for adding a dataset to the repo, or for adding one via plugins.
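
As a rough illustration of the namespace idea, here is a toy in-memory index. Everything in it is hypothetical (the DatasetIndex class, its methods, and the example names); it only shows how dotted names would avoid collisions while still allowing lookup by short name.

```python
class DatasetIndex:
    """A toy registry mapping dotted, namespaced names to descriptions."""

    def __init__(self):
        self._entries = {}

    def register(self, qualified_name, description):
        """Add a dataset; a fully qualified name may only be used once."""
        if qualified_name in self._entries:
            raise ValueError("duplicate dataset name: " + qualified_name)
        self._entries[qualified_name] = description

    def find(self, short_name):
        """Return all qualified names whose last component is `short_name`."""
        return sorted(
            name for name in self._entries if name.split(".")[-1] == short_name
        )

    def by_namespace(self, prefix):
        """Return all qualified names that live under the namespace `prefix`."""
        return sorted(
            name
            for name in self._entries
            if name == prefix or name.startswith(prefix + ".")
        )


# Two sources can both publish a "titanic" dataset without clashing.
index = DatasetIndex()
index.register("org.example.titanic", "passenger list")
index.register("com.example.titanic", "ticket prices")
index.register("org.example.iris", "flower measurements")
print(index.find("titanic"))
```

The same structure could back the suggested web index page, since by_namespace already groups datasets by source.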

Unable to load datasets (Python 3.5.1 under Anaconda, Win 7)

Hi,

I'm unable to load datasets in Python 3.5.1, Win 7. I can install pydataset, import it, and view available datasets just fine. However, when I try to load datasets, I get an error saying that I have the wrong name for the dataset. For example:

In [1]: iris= data('iris')
Traceback (most recent call last):

  File "<ipython-input-3-f894fb655dca>", line 1, in <module>
    cake = data("cake", show_doc=True)

  File "C:\Users\ctaylor\AppData\Local\Continuum\Anaconda3\lib\site-packages\pydataset\__init__.py", line 36, in data
    raise Exception('Wrong dataset name! Try: data() to see available.')

Exception: Wrong dataset name! Try: data() to see available.
