
skdata (scikit-data)
====================

Skdata is a library of data sets for machine learning and statistics. This
module provides standardized Python access to toy problems as well
as popular computer vision and natural language processing data sets.

The project is hosted on GitHub:
http://jaberg.github.com/skdata


Installation
------------

There are several options for installation:

  * From scratch:

      * pip install --user skdata

  * From a fresh git checkout:

      * python setup.py develop

      * python setup.py install


Documentation
-------------

See http://jaberg.github.com/skdata


News
----

Join the mailing list:
https://groups.google.com/forum/#!forum/skdata


Thanks
------

GitHub maintains an up-to-date list of direct contributors:
https://github.com/jaberg/skdata/graphs

A special thanks goes to David Cox, who provided inspiration and design
guidance, and generally got this project started.

This work was supported in part by the National Science Foundation (IIS-0963668).

Contributors
------------

abergeron, davidcox, dwf, ivanov, jaberg, jseabold, npinto, yamins81


Issues
------

documentation to explain tasks better

After discussing with @npinto on Jan 31, it became clear that he had misunderstood my intention with tasks. He suggested renaming them to reflect the fact that "Tasks" are not new data structures so much as tuples of lists of things with specific formats. Contrast "meta", which has almost no formatting guideline.

It might be useful to have a documentation example that says how to deal with a learning problem that doesn't fit neatly into one of the provided tasks.
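
Such a documentation example might make the distinction concrete. The sketch below is purely illustrative (the names `ClassificationTask`, `x`, `y`, and `name` are hypothetical, not part of skdata's actual API): a task is just a tuple of parallel lists with an agreed-upon format, while meta is a free-form list of dicts.

```python
from collections import namedtuple

# Hypothetical illustration: a "task" is not a new data structure,
# just a tuple of parallel lists with an agreed-upon format.
ClassificationTask = namedtuple('ClassificationTask', ['x', 'y', 'name'])

task = ClassificationTask(
    x=[[0.1, 0.2], [0.3, 0.4]],   # list of feature vectors
    y=[0, 1],                     # list of integer labels, parallel to x
    name='train',
)

# "meta", by contrast, has almost no formatting guideline:
# a list of arbitrary dicts, one per example.
meta = [{'filename': 'a.png', 'label': 0},
        {'filename': 'b.png', 'label': 1}]
```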

SKDATA_ROOT has no effect

From private email:

  1. This might be a mistake of ours, but we aren't able to change the data folder for skdata. It doesn't matter whether we set the $SKDATA_ROOT variable in the shell and export it, call set_data_home(), or hardcode the target directory; the downloaded data files are always stored in ~/.skdata/
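
If the cause is a data-home path computed once at import time, one possible fix (a sketch only; the real skdata internals may differ) is to consult $SKDATA_ROOT lazily, every time the path is resolved:

```python
import os

# Sketch of a data-home resolver that honors $SKDATA_ROOT. A likely
# cause of the reported bug is a module-level constant computed once at
# import time, before the caller exports the variable.
_data_home = None  # set explicitly via set_data_home(), else resolved lazily

def set_data_home(path):
    global _data_home
    _data_home = path

def get_data_home():
    if _data_home is not None:
        return _data_home
    # Fall back to the environment, then to the documented default.
    return os.environ.get('SKDATA_ROOT',
                          os.path.expanduser(os.path.join('~', '.skdata')))
```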

pubfig83: send split arguments to task methods

Make the task methods accept some kind of split id or split listing so that they only return imgs or pairs (or whatever) associated with a given split. When this argument is not specified, the method should return everything.
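
The requested interface might look like the following sketch (the class, attribute, and split names are made up for illustration; this is not the actual pubfig83 code):

```python
# Hypothetical sketch: task methods take an optional split identifier
# and filter their return values; omitting it returns everything.
class FakePubfig83(object):
    def __init__(self):
        self.imgs = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
        self.labels = [0, 0, 1, 1]
        self.splits = {'train0': [0, 2], 'test0': [1, 3]}

    def classification_task(self, split=None):
        if split is None:
            return self.imgs, self.labels
        idx = self.splits[split]
        return ([self.imgs[i] for i in idx],
                [self.labels[i] for i in idx])
```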

Support for Python 3.x

Currently skdata does not work with 2to3: installing via pip3 install skdata fails with an error:

RefactoringTool: Can't open /tmp/pip_build_root/skdata/build/py3k: [Errno 2]
No such file or directory: '/tmp/pip_build_root/skdata/build/py3k'

Fetching "third party" datasets?

There's an interesting decentralized model taking shape over at the homebrew project, that might be interesting to explore with this project.

There's a brew-tap fork of homebrew (https://github.com/Sharpie/homebrew) that lets users install recipes from arbitrary other alternate repos (the default brew behavior is to pull recipes from the central mxcl/homebrew repo). These alternates are automatically discovered by searching the fork network of a particular repo (https://github.com/adamv/homebrew-alt). Thus, creating your own source of homebrew recipes (analogous to a PPA in Ubuntu) is as easy as forking that repo and getting to work.

One could imagine doing the same with this project, such that calling datasets fetch davidcox:my_dataset would look for a dataset called "my_dataset" in my own fork of the project. The advantage of the approach is that anyone can use the package for whatever they like, without having to "register" with us, or having to push their stuff to the main repo.
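
The "user:dataset" spec could be resolved roughly as below (a sketch only; the URL pattern, default owner, and function name are all assumptions made for illustration):

```python
# Sketch: resolve "user:dataset" to a fork of the main repo.
MAIN_OWNER = 'jaberg'  # assumption: the default, central repo owner

def resolve_dataset_spec(spec):
    """Map 'davidcox:my_dataset' -> (repo_url, 'my_dataset');
    a bare 'my_dataset' falls back to the main repo."""
    if ':' in spec:
        owner, name = spec.split(':', 1)
    else:
        owner, name = MAIN_OWNER, spec
    repo_url = 'https://github.com/%s/skdata.git' % owner
    return repo_url, name
```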

Project name

The project name should really be scikit-data. Other scikits, including scikit-learn, scikit-image, and scikit-bio, all use this naming convention. I think a bit of consistency would help strengthen the scikit ecosystem and would also improve the visibility of your project.

setup.py does not copy data/ into site-packages/skdata when installing

Just a quick bug report on a (minor) issue when one installs scikit-data (other than a develop install).

More particularly, it seems that during install the directory data/ containing the small datasets does not
get copied over to the default site-packages/skdata directory.

steps:

virtualenv test_scikit_data_install
cd test_scikit_data_install
source bin/activate
git clone git://github.com/jaberg/scikit-data.git
cd scikit-data
pip install .
python -c "from skdata import toy; toy.Digits()"

and the error looks like:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/poilvert/venv/test_scikit_data_install/lib/python2.7/site-packages/skdata/toy.py", line 20, in __init__
    meta, descr, meta_const = self.build_all()
  File "/home/poilvert/venv/test_scikit_data_install/lib/python2.7/site-packages/skdata/toy.py", line 95, in build_all
    delimiter=',')
  File "/usr/lib/python2.7/site-packages/numpy/lib/npyio.py", line 685, in loadtxt
    fh = iter(seek_gzip_factory(fname))
  File "/usr/lib/python2.7/site-packages/numpy/lib/npyio.py", line 62, in seek_gzip_factory
    f = GzipFile(f)
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: '/home/poilvert/venv/test_scikit_data_install/lib/python2.7/site-packages/skdata/data/digits.csv.gz'

If, on the other hand, one installs scikit-data in develop mode (where only a symbolic link is created), then the above commands work.
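
One common fix for this class of packaging bug, offered here only as an untested sketch against this repository's exact layout, is to declare the bundled files as package data in setup.py so that a regular install copies them into site-packages alongside the code:

```python
# Sketch of a setup.py fix (untested against this repository's layout):
# declare the bundled CSVs as package data so "pip install ." copies
# data/ next to the installed modules, not just in develop mode.
from setuptools import setup, find_packages

setup(
    name='skdata',
    packages=find_packages(),
    package_data={'skdata': ['data/*.csv', 'data/*.csv.gz']},
    include_package_data=True,
)
```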

wanted: STL-10

From Ian on pylearn-dev on Sept 20,

I have added the STL-10 dataset (featured in Adam Coates' recent work) to /data/lisa/data.

The stl10_matlab directory contains Adam's original, full resolution matlab files.

The stl10_32x32 directory contains .pkl files with pylearn2 Dataset objects for a preprocessed version of the dataset where the examples have been shrunk to 32x32.

The stl10_patches directory contains .pkl files with a pylearn2 Pipeline object that replicates the preprocessing from Adam's paper, and 2 million preprocessed patches from the unlabeled and labeled training sets.

The latter two directories were produced by running scripts that I have checked into pylearn2.

Relative import error.

Is there a way to get rid of the relative import?

I'm trying a few quick things in the datasets folder, and I don't want to have to run nosetests script.py just to avoid relative-import errors. For example, python -c 'import lfw; lfw.Funneled().fetch()' (I know I could use bin/* or python -c 'from datasets import lfw; lfw.Funneled().fetch(...)', but this is just an example) gives me:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "lfw.py", line 37, in <module>
    from .base import get_data_home, Bunch
ValueError: Attempted relative import in non-package

How about dropping the relative import .base and using import _base instead?

where to put built docs?

It's not much fun writing .rst files if you can't browse the results and click the hyperlinks.

module object has no attribute colormap

When I call main_show(), I get AttributeError: 'module' object has no attribute 'colormap'. The source line in question is:

cmap = glumpy.colormap.Grey

How can I resolve it?

Iris example not working (data is not downloaded?)

When I execute

# Create a suitable view of the Iris data set.
# (For larger data sets, this can trigger a download the first time)
from skdata.iris.view import KfoldClassification
iris_view = KfoldClassification(5)

# Create a learning algorithm based on scikit-learn's LinearSVC
# that will be driven by commands the `iris_view` object.
from sklearn.svm import LinearSVC
from skdata.base import SklearnClassifier
learning_algo = SklearnClassifier(LinearSVC)

# Drive the learning algorithm from the data set view object.
# (An iterator interface is sometimes also available,
#  so you don't have to give up control flow completely.)
iris_view.protocol(learning_algo)

# The learning algorithm keeps track of what it did when under
# control of the iris_view object. This base example is useful for
# internal testing and demonstration. Use a custom learning algorithm
# to track and save the statistics you need.
for loss_report in learning_algo.results['loss']:
    print loss_report['task_name'] + \
        (": err = %0.3f" % (loss_report['err_rate']))

I get

/home/moose/.local/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Traceback (most recent call last):
  File "testskdata.py", line 4, in <module>
    iris_view = KfoldClassification(5)
  File "build/bdist.linux-x86_64/egg/skdata/iris/view.py", line 27, in __init__
  File "build/bdist.linux-x86_64/egg/skdata/toy.py", line 20, in __init__
  File "build/bdist.linux-x86_64/egg/skdata/toy.py", line 33, in build_all
  File "build/bdist.linux-x86_64/egg/skdata/iris/dataset.py", line 91, in build_meta
IOError: [Errno 20] Not a directory: '/usr/local/lib/python2.7/dist-packages/skdata-0.0.4-py2.7.egg/skdata/iris/iris.csv'

write a no_download decorator, use Travis-ci

The code is not ready for Travis-CI because some tests download large data sets.

Travis-CI should run tests of the parsing/loading code on fake data (see e.g. lfw/test/test_fake.py).

TODO: write an appropriate testing driver that

  • disables download_and_extract
  • respects test attributes that signal the use of fake data

Wanted: “Yahoo! Front Page Today Module User Click Log Dataset”

For the wish list:

Dear colleagues,

I’m happy to announce the release of a new benchmark dataset (“Yahoo!
Front Page Today Module User Click Log Dataset”) for unbiased
evaluation of multi-armed bandit algorithms, through the Yahoo!
Webscope Program:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Due to the inherent interactive nature of bandit problems, creating a
benchmark dataset for reliable algorithm evaluation is not as
straightforward as in most other fields of machine learning, whose
objectives are often prediction. This challenge is also known as off-
policy reinforcement learning.

Our dataset contains the click log of over 45M user visits to the
Featured Tab of the Today Module on Yahoo! Front Page. The articles
were chosen uniformly at random for display to the users, which enables the use of
an unbiased offline evaluation method recently shown to be highly
effective [http://portal.acm.org/citation.cfm?doid=1935826.1935878].
To the best of our knowledge, this is the first benchmark for
evaluating bandit algorithms reliably in real-world applications.

Lihong Li

Yahoo! Research
4401 Great America Parkway
Santa Clara, CA, 95054
http://www.cs.rutgers.edu/~lihong

LICENSES

Data sets have licenses, and code has licenses... these should be made clear throughout.

datasets-call

It's slightly off-topic, but it would be super-convenient to have another bin/ script that just takes argv[1] and does the same import hoop-jumping as datasets-fetch etc. currently do, but without looking for a hard-coded symbol.

datasets-fetch mnist == datasets-call datasets.mnist.main_fetch

With datasets-call, it is easy for individual datasets, or other classes, or whatever, to provide "main" methods under any name. I have written this sort of thing before, with support for using the rest of argv to provide arguments to the function identified by argv[1]. This basically amounts to an alternate calling convention for Python functions that is suitable for command-line use.

I find myself using this functionality at the moment to call stuff from the commandline in other modules. It's weird to put it in datasets, but it only takes a handful of lines of code to support it in the datasets project because we already have main.py.
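
The import hoop-jumping can be sketched in a few lines (the function names here are illustrative, not the actual bin/ script):

```python
import importlib

def resolve(dotted):
    """Resolve a dotted name like 'datasets.mnist.main_fetch' to a callable."""
    module_name, _, func_name = dotted.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, func_name)

def datasets_call(argv):
    """datasets-call entry point: argv[1] names the function, and the
    remaining argv entries become its positional arguments."""
    func = resolve(argv[1])
    return func(*argv[2:])
```

Under this convention, `datasets-fetch mnist` is equivalent to `datasets-call datasets.mnist.main_fetch`.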

Thoughts?

pubfig83: add splits

Put the split-generation mechanism into pubfig83. It should take seed, ntest, ntrain, and num_splits arguments.
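
A possible shape for that mechanism, sketched with the arguments named above (this is not existing pubfig83 code):

```python
import random

def generate_splits(n_examples, seed, ntrain, ntest, num_splits):
    """Return num_splits disjoint train/test index splits (a sketch;
    the signature follows the issue, not existing pubfig83 code)."""
    rng = random.Random(seed)  # seeded so splits are reproducible
    splits = []
    for _ in range(num_splits):
        idx = list(range(n_examples))
        rng.shuffle(idx)
        splits.append({'train': sorted(idx[:ntrain]),
                       'test': sorted(idx[ntrain:ntrain + ntest])})
    return splits
```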

Proposing a PR to fix a few small typos

Issue Type

[x] Bug (Typo)

Steps to Replicate and Expected Behaviour

  • Examine setup.py and observe supress, however expect to see suppress.
  • Examine skdata/vanhateren/dataset.py and observe stricly, however expect to see strictly.
  • Examine skdata/lfw/tests/test_fake.py and observe runnning, however expect to see running.
  • Examine skdata/vanhateren/dataset.py and observe offest, however expect to see offset.
  • Examine skdata/larray.py and observe memmory, however expect to see memory.
  • Examine skdata/synthetic.py and observe dupplicated, however expect to see duplicated.
  • Examine skdata/utils/archive.py and observe contributers, however expect to see contributors.
  • Examine skdata/descr/linnerud.rst and observe constains, however expect to see contains.

Notes

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources, a branch with the fix has been
prepared, but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below; feel free to create it, or
request that @timgates42 create the PR. Alternatively, if the fix is undesired,
please close the issue with a small comment about the reasoning.

https://github.com/timgates42/skdata/pull/new/bugfix_typos

Thanks.
