
skdata (scikit-data)
====================

Skdata is a library of data sets for machine learning and statistics. This
module provides standardized Python access to toy problems as well
as popular computer vision and natural language processing data sets.

The project is hosted on GitHub:
http://jaberg.github.com/skdata


Installation
------------

There are several options for installation:

  * From scratch:

      * pip install --user skdata

  * From a fresh git checkout:

      * python setup.py develop

      * python setup.py install


Documentation
-------------

See http://jaberg.github.com/skdata


News
----

Join the mailing list:
https://groups.google.com/forum/#!forum/skdata


Thanks
------

GitHub maintains an up-to-date list of direct contributors:
https://github.com/jaberg/skdata/graphs

A special thanks goes to David Cox, who provided inspiration and design
guidance, and generally got this project started.

This work was supported in part by the National Science Foundation (IIS-0963668).

Contributors
------------

abergeron, davidcox, dwf, ivanov, jaberg, jseabold, npinto, yamins81


Issues
------

documentation to explain tasks better

After discussing with @npinto on Jan 31, it became clear that he had misunderstood my intention with tasks. He suggested renaming them to reflect the fact that "Tasks" are not new data structures so much as tuples of lists of things with specific formats. Contrast "meta", which has almost no formatting guideline.

It might be useful to have a documentation example that says how to deal with a learning problem that doesn't fit neatly into one of the provided tasks.
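
Such a documentation example might make the distinction concrete. The sketch below is purely illustrative (the names `ClassificationTask`, `x`, `y`, and `name` are hypothetical, not part of skdata's actual API): a task is just a tuple of parallel lists with an agreed-upon format, while meta is a free-form list of dicts.

```python
from collections import namedtuple

# Hypothetical illustration: a "task" is not a new data structure,
# just a tuple of parallel lists with an agreed-upon format.
ClassificationTask = namedtuple('ClassificationTask', ['x', 'y', 'name'])

task = ClassificationTask(
    x=[[0.1, 0.2], [0.3, 0.4]],   # list of feature vectors
    y=[0, 1],                     # list of integer labels, parallel to x
    name='train',
)

# "meta", by contrast, has almost no formatting guideline:
# a list of arbitrary dicts, one per example.
meta = [{'filename': 'a.png', 'label': 0},
        {'filename': 'b.png', 'label': 1}]
```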

SKDATA_ROOT has no effect

From private email:

  1. This might be a mistake of ours, but we aren't able to change the data folder for skdata. It doesn't matter whether we set the $SKDATA_ROOT variable in the shell and export it, call set_data_home(), or hardcode the target directory; the downloaded data files are always stored in ~/.skdata/
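
If the cause is a data-home path computed once at import time, one possible fix (a sketch only; the real skdata internals may differ) is to consult $SKDATA_ROOT lazily, every time the path is resolved:

```python
import os

# Sketch of a data-home resolver that honors $SKDATA_ROOT. A likely
# cause of the reported bug is a module-level constant computed once at
# import time, before the caller exports the variable.
_data_home = None  # set explicitly via set_data_home(), else resolved lazily

def set_data_home(path):
    global _data_home
    _data_home = path

def get_data_home():
    if _data_home is not None:
        return _data_home
    # Fall back to the environment, then to the documented default.
    return os.environ.get('SKDATA_ROOT',
                          os.path.expanduser(os.path.join('~', '.skdata')))
```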

pubfig83: send split arguments to task methods

Make the task methods accept some kind of split id or split listing so that they only return imgs or pairs (or whatever) associated with a given split. When this argument is not specified, the method should return everything.
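
The requested interface might look like the following sketch (the class, attribute, and split names are made up for illustration; this is not the actual pubfig83 code):

```python
# Hypothetical sketch: task methods take an optional split identifier
# and filter their return values; omitting it returns everything.
class FakePubfig83(object):
    def __init__(self):
        self.imgs = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
        self.labels = [0, 0, 1, 1]
        self.splits = {'train0': [0, 2], 'test0': [1, 3]}

    def classification_task(self, split=None):
        if split is None:
            return self.imgs, self.labels
        idx = self.splits[split]
        return ([self.imgs[i] for i in idx],
                [self.labels[i] for i in idx])
```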

Support for Python 3.x

Currently skdata does not work with 2to3: installing via pip3 install skdata fails with an error:

RefactoringTool: Can't open /tmp/pip_build_root/skdata/build/py3k: [Errno 2]
No such file or directory: '/tmp/pip_build_root/skdata/build/py3k'

Fetching "third party" datasets?

There's an interesting decentralized model taking shape over at the homebrew project, that might be interesting to explore with this project.

There's a brew-tap fork of homebrew (https://github.com/Sharpie/homebrew) that lets users install recipes from arbitrary other alternate repos (the default brew behavior is to pull recipes from the central mxcl/homebrew repo). These alternates are automatically discovered by searching the fork network of a particular repo (https://github.com/adamv/homebrew-alt). Thus, creating your own source of homebrew recipes (analogous to a PPA in Ubuntu) is as easy as forking that repo and getting to work.

One could imagine doing the same with this project, such that calling datasets fetch davidcox:my_dataset would look for a dataset called "my_dataset" in my own fork of the project. The advantage of the approach is that anyone can use the package for whatever they like, without having to "register" with us, or having to push their stuff to the main repo.
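
The "user:dataset" spec could be resolved roughly as below (a sketch only; the URL pattern, default owner, and function name are all assumptions made for illustration):

```python
# Sketch: resolve "user:dataset" to a fork of the main repo.
MAIN_OWNER = 'jaberg'  # assumption: the default, central repo owner

def resolve_dataset_spec(spec):
    """Map 'davidcox:my_dataset' -> (repo_url, 'my_dataset');
    a bare 'my_dataset' falls back to the main repo."""
    if ':' in spec:
        owner, name = spec.split(':', 1)
    else:
        owner, name = MAIN_OWNER, spec
    repo_url = 'https://github.com/%s/skdata.git' % owner
    return repo_url, name
```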

Project name

The project name should really be scikit-data. Other scikits, including scikit-learn, scikit-image, and scikit-bio, all use this naming convention. I think a bit of consistency would help strengthen the scikit ecosystem and would also improve the visibility of your project.

setup.py does not copy data/ into site-packages/skdata when installing

Just a quick bug report on a (minor) issue when one installs scikit-data (other than a develop install).

More particularly, it seems that during install the directory data/ containing the small datasets does not
get copied over to the default site-packages/skdata directory.

steps:

virtualenv test_scikit_data_install
cd test_scikit_data_install
source bin/activate
git clone git://github.com/jaberg/scikit-data.git
cd scikit-data
pip install .
python -c "from skdata import toy; toy.Digits()"

and the error looks like:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/poilvert/venv/test_scikit_data_install/lib/python2.7/site-packages/skdata/toy.py", line 20, in __init__
    meta, descr, meta_const = self.build_all()
  File "/home/poilvert/venv/test_scikit_data_install/lib/python2.7/site-packages/skdata/toy.py", line 95, in build_all
    delimiter=',')
  File "/usr/lib/python2.7/site-packages/numpy/lib/npyio.py", line 685, in loadtxt
    fh = iter(seek_gzip_factory(fname))
  File "/usr/lib/python2.7/site-packages/numpy/lib/npyio.py", line 62, in seek_gzip_factory
    f = GzipFile(f)
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: '/home/poilvert/venv/test_scikit_data_install/lib/python2.7/site-packages/skdata/data/digits.csv.gz'

If, on the other hand, one installs scikit-data in develop mode (where only a symbolic link is created), then the above commands work.
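
One common fix for this class of packaging bug, offered here only as an untested sketch against this repository's exact layout, is to declare the bundled files as package data in setup.py so that a regular install copies them into site-packages alongside the code:

```python
# Sketch of a setup.py fix (untested against this repository's layout):
# declare the bundled CSVs as package data so "pip install ." copies
# data/ next to the installed modules, not just in develop mode.
from setuptools import setup, find_packages

setup(
    name='skdata',
    packages=find_packages(),
    package_data={'skdata': ['data/*.csv', 'data/*.csv.gz']},
    include_package_data=True,
)
```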

wanted: STL-10

From Ian on pylearn-dev on Sept 20,

I have added the STL-10 dataset (featured in Adam Coates' recent work) to /data/lisa/data.

The stl10_matlab directory contains Adam's original, full resolution matlab files.

The stl10_32x32 directory contains .pkl files with pylearn2 Dataset objects for a preprocessed version of the dataset where the examples have been shrunk to 32x32.

The stl10_patches directory contains .pkl files with a pylearn2 Pipeline object that replicates the preprocessing from Adam's paper, and 2 million preprocessed patches from the unlabeled and labeled training sets.

The latter two directories were produced by running scripts that I have checked into pylearn2.

Relative import error.

Is there a way to get rid of the relative import?

I'm trying a few quick things in the datasets folder, and I don't want to have to run nosetests script.py just to avoid relative-import errors. For example, python -c 'import lfw; lfw.Funneled().fetch()' (I know I could use bin/* or python -c 'from datasets import lfw; lfw.Funneled().fetch(...)', but this is just an example) gives me:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "lfw.py", line 37, in <module>
    from .base import get_data_home, Bunch
ValueError: Attempted relative import in non-package

How about dropping the relative import .base and using import _base instead?

where to put built docs?

It's not much fun writing .rst files if you can't browse the results and click the hyperlinks.

module object has no attribute colormap

When I call main_show(), I get AttributeError: 'module' object has no attribute 'colormap'. The source line in question is:

cmap = glumpy.colormap.Grey

How can I resolve it?

Iris example not working (data is not downloaded?)

When I execute

# Create a suitable view of the Iris data set.
# (For larger data sets, this can trigger a download the first time)
from skdata.iris.view import KfoldClassification
iris_view = KfoldClassification(5)

# Create a learning algorithm based on scikit-learn's LinearSVC
# that will be driven by commands the `iris_view` object.
from sklearn.svm import LinearSVC
from skdata.base import SklearnClassifier
learning_algo = SklearnClassifier(LinearSVC)

# Drive the learning algorithm from the data set view object.
# (An iterator interface is sometimes also available,
#  so you don't have to give up control flow completely.)
iris_view.protocol(learning_algo)

# The learning algorithm keeps track of what it did when under
# control of the iris_view object. This base example is useful for
# internal testing and demonstration. Use a custom learning algorithm
# to track and save the statistics you need.
for loss_report in learning_algo.results['loss']:
    print loss_report['task_name'] + \
        (": err = %0.3f" % (loss_report['err_rate']))

I get

/home/moose/.local/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Traceback (most recent call last):
  File "testskdata.py", line 4, in <module>
    iris_view = KfoldClassification(5)
  File "build/bdist.linux-x86_64/egg/skdata/iris/view.py", line 27, in __init__
  File "build/bdist.linux-x86_64/egg/skdata/toy.py", line 20, in __init__
  File "build/bdist.linux-x86_64/egg/skdata/toy.py", line 33, in build_all
  File "build/bdist.linux-x86_64/egg/skdata/iris/dataset.py", line 91, in build_meta
IOError: [Errno 20] Not a directory: '/usr/local/lib/python2.7/dist-packages/skdata-0.0.4-py2.7.egg/skdata/iris/iris.csv'

write a no_download decorator, use Travis-ci

The code is not ready for Travis-CI because some tests download large data sets.

Travis-CI should run tests of the parsing/loading code on fake data (see e.g. lfw/test/test_fake.py).

TODO: write an appropriate testing driver that

  • disables download_and_extract
  • respects test attributes that signal the use of fake data

Wanted: “Yahoo! Front Page Today Module User Click Log Dataset”

For the wish list:

Dear colleagues,

I’m happy to announce the release of a new benchmark dataset (“Yahoo!
Front Page Today Module User Click Log Dataset”) for unbiased
evaluation of multi-armed bandit algorithms, through the Yahoo!
Webscope Program:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Due to the inherent interactive nature of bandit problems, creating a
benchmark dataset for reliable algorithm evaluation is not as
straightforward as in most other fields of machine learning, whose
objectives are often prediction. This challenge is also known as off-
policy reinforcement learning.

Our dataset contains the click log of over 45M user visits to the
Featured Tab of the Today Module on Yahoo! Front Page. The articles
were chosen uniformly at random for display to the users, which enables the use of
an unbiased offline evaluation method recently shown to be highly
effective [http://portal.acm.org/citation.cfm?doid=1935826.1935878].
To the best of our knowledge, this is the first benchmark for
evaluating bandit algorithms reliably in real-world applications.

Lihong Li

Yahoo! Research
4401 Great America Parkway
Santa Clara, CA, 95054
http://www.cs.rutgers.edu/~lihong

LICENSES

Data sets have licenses, and code has licenses... these should be made clear throughout.

datasets-call

It's slightly off-topic, but it would be super-convenient to have another bin/ script that just takes argv[1] and does the same import hoop-jumping as datasets-fetch etc. currently do, but without looking for a hard-coded symbol.

datasets-fetch mnist == datasets-call datasets.mnist.main_fetch

With datasets-call, it is easy for individual datasets, or other classes, or whatever, to provide "main" methods under any name. I have written this sort of thing before, with support for using the rest of argv to provide arguments to the function identified by argv[1]. This basically amounts to an alternate calling convention for Python functions that is suitable for command-line use.

I find myself using this functionality at the moment to call stuff from the commandline in other modules. It's weird to put it in datasets, but it only takes a handful of lines of code to support it in the datasets project because we already have main.py.
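
The import hoop-jumping can be sketched in a few lines (the function names here are illustrative, not the actual bin/ script):

```python
import importlib

def resolve(dotted):
    """Resolve a dotted name like 'datasets.mnist.main_fetch' to a callable."""
    module_name, _, func_name = dotted.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, func_name)

def datasets_call(argv):
    """datasets-call entry point: argv[1] names the function, and the
    remaining argv entries become its positional arguments."""
    func = resolve(argv[1])
    return func(*argv[2:])
```

Under this convention, `datasets-fetch mnist` is equivalent to `datasets-call datasets.mnist.main_fetch`.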

Thoughts?

pubfig83: add splits

Put the split-generation mechanism into pubfig83. It should take seed, ntest, ntrain, and num_splits arguments.
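
A possible shape for that mechanism, sketched with the arguments named above (this is not existing pubfig83 code):

```python
import random

def generate_splits(n_examples, seed, ntrain, ntest, num_splits):
    """Return num_splits disjoint train/test index splits (a sketch;
    the signature follows the issue, not existing pubfig83 code)."""
    rng = random.Random(seed)  # seeded so splits are reproducible
    splits = []
    for _ in range(num_splits):
        idx = list(range(n_examples))
        rng.shuffle(idx)
        splits.append({'train': sorted(idx[:ntrain]),
                       'test': sorted(idx[ntrain:ntrain + ntest])})
    return splits
```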

Proposing a PR to fix a few small typos

Issue Type

[x] Bug (Typo)

Steps to Replicate and Expected Behaviour

  • Examine setup.py and observe supress, however expect to see suppress.
  • Examine skdata/vanhateren/dataset.py and observe stricly, however expect to see strictly.
  • Examine skdata/lfw/tests/test_fake.py and observe runnning, however expect to see running.
  • Examine skdata/vanhateren/dataset.py and observe offest, however expect to see offset.
  • Examine skdata/larray.py and observe memmory, however expect to see memory.
  • Examine skdata/synthetic.py and observe dupplicated, however expect to see duplicated.
  • Examine skdata/utils/archive.py and observe contributers, however expect to see contributors.
  • Examine skdata/descr/linnerud.rst and observe constains, however expect to see contains.

Notes

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources, a branch with the fix has been
prepared, but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below; feel free to create it, or
request that @timgates42 create the PR. Alternatively, if the fix is undesired,
please close the issue with a small comment about the reasoning.

https://github.com/timgates42/skdata/pull/new/bugfix_typos

Thanks.
