contextlab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data

Home Page: http://hypertools.readthedocs.io/en/latest/

License: MIT License

Python 100.00%
data-visualization high-dimensional-data python topic-modeling text-vectorization data-wrangling visualization time-series

hypertools's Introduction

"To deal with hyper-planes in a 14 dimensional space, visualize a 3D space and say 'fourteen' very loudly. Everyone does it." - Geoff Hinton

Overview

HyperTools is designed to facilitate dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. Our package was recently featured on Kaggle's No Free Hunch blog. For a general overview, you may find this talk useful (given as part of the MIND Summer School at Dartmouth).

Try it!

Click the badge to launch a binder instance with example uses:

Binder

or

Check the repo of Jupyter notebooks from the HyperTools paper.

Installation

To install the latest stable version run:

pip install hypertools

To install the latest unstable version directly from GitHub, run:

pip install -U git+https://github.com/ContextLab/hypertools.git

Or alternatively, clone the repository to your local machine:

git clone https://github.com/ContextLab/hypertools.git

Then, navigate to the folder and type:

pip install -e .

(These instructions assume that you have pip installed on your system)

NOTE: If you have been using the development version of 0.5.0, please clear your data cache (/Users/yourusername/hypertools_data).

Requirements

  • python>=3.6
  • PPCA>=0.0.2
  • scikit-learn>=0.24.0
  • pandas>=0.18.0
  • seaborn>=0.8.1
  • matplotlib>=1.5.1
  • scipy>=1.0.0
  • numpy>=1.10.4
  • umap-learn>=0.4.6
  • requests
  • pytest (for development)
  • ffmpeg (for saving animations)

Documentation

Check out our readthedocs page for further documentation, complete API details, and additional examples.

Citing

We wrote a short JMLR paper about HyperTools, which you can read here, or you can check out a (longer) preprint here. We also have a repository with example notebooks from the paper here.

Please cite as:

Heusser AC, Ziman K, Owen LLW, Manning JR (2018) HyperTools: A Python toolbox for gaining geometric insights into high-dimensional data. Journal of Machine Learning Research, 18(152): 1--6.

Here is a bibtex formatted reference:

@ARTICLE {,
    author  = {Andrew C. Heusser and Kirsten Ziman and Lucy L. W. Owen and Jeremy R. Manning},    
    title   = {HyperTools: a Python Toolbox for Gaining Geometric Insights into High-Dimensional Data},    
    journal = {Journal of Machine Learning Research},
    year    = {2018},
    volume  = {18},	
    number  = {152},	
    pages   = {1-6},	
    url     = {http://jmlr.org/papers/v18/17-434.html}	
}

Contributing

Join the chat at https://gitter.im/hypertools/Lobby

If you'd like to contribute, please first read our Code of Conduct.

For specific information on how to contribute to the project, please see our Contributing page.

Testing

To test HyperTools, install pytest (pip install pytest) and run pytest in the HyperTools folder.

Examples

See here for more examples.

Plot

import hypertools as hyp
hyp.plot(list_of_arrays, '.', group=list_of_labels)

Plot example

Align

import hypertools as hyp
hyp.plot(list_of_arrays, align='hyper')

BEFORE

Align before example

AFTER

Align after example
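Hyperalignment builds on Procrustes-style transformations that rotate one dataset into the space of another. The following is a minimal sketch of that single building block only, written in plain NumPy; it is not the HyperTools implementation, and the helper name `procrustes_rotation` is hypothetical.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Return the orthogonal matrix R that minimizes ||source @ R - target||."""
    # Orthogonal Procrustes solution: SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)
target = rng.normal(size=(100, 3))

# Build a random orthogonal matrix via QR and use it to rotate the data.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
source = target @ q

rotation = procrustes_rotation(source, target)
aligned = source @ rotation
print(np.allclose(aligned, target))
```

In this toy case the two datasets differ only by a rotation, so one Procrustes step recovers the target. Real datasets would only be brought approximately into register.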

Cluster

import hypertools as hyp
hyp.plot(array, '.', n_clusters=10)

Cluster Example

Describe

import hypertools as hyp
hyp.tools.describe(list_of_arrays, reduce='PCA', max_dims=14)

Describe Example

hypertools's People

Contributors

alysivji, andrewheusser, chasewilliams, dwillmer, feilong, jeremymanning, joefink2896, kirstensgithub, ljchang, lmcinnes, lucywowen, matkosoric, paxtonfitzpatrick, rarredon, stephwright, swaroopgj, timgates42

hypertools's Issues

Hypercube example

Add an example to the results section showing how a simple high-dimensional shape looks when visualized using hyperalignment.

feature request: select points and return indices

A neat feature would be the ability to highlight points and return their indices. This could be done by clicking, but more efficiently with a selection box or circle. This could be used to identify clusters in the data and interrogate them further, or style them differently.

maybe a v2 feature?

util -> tools?

Not super important, but since we are calling the package hypertools, maybe instead of a util module, we could rename it tools, i.e. hyp.tools.align(x)

@jeremymanning @KirstensGitHub what do you think?

fix describe_pca axes

currently, x axis (number of PCA components) in the plot starts at 0, when it should start at 2

demo scripts: naming and paths

all of the demo scripts have names that start with hypertools_demo-, which is redundant. i propose removing hypertools_demo- from each script name.

also, some of the paths are relative rather than absolute. for example, the sample_data folder is only visible within the examples folder, but it is referenced using relative paths. this causes some of the demo functions (e.g. hypertools_demo-align.py) to fail. references to sample_data could be changed as follows (from within any scripts that reference data in sample_data):

import os
datadir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'sample_data')

Normalize=False by default?

Currently, when calling:

hyp.plot(data)

the columns of the data matrix are z-scored by default, across lists. The user may or may not want to do this, so I propose we leave normalize=False by default. Then the user can easily normalize like so:

hyp.plot(data,normalize='across') # or within or row
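To make the three options concrete, here is a rough sketch in plain NumPy. It assumes that 'across' z-scores each column over all arrays stacked together, 'within' z-scores each column separately within each array, and 'row' z-scores each row; the function below is a hypothetical stand-in, not the HyperTools code.

```python
import numpy as np


def normalize(arrays, mode="across"):
    """Z-score a list of 2D arrays (assumed semantics, see lead-in)."""
    if mode == "across":
        # One mean/std per column, computed over all arrays stacked together.
        stacked = np.vstack(arrays)
        mean, std = stacked.mean(axis=0), stacked.std(axis=0)
        return [(a - mean) / std for a in arrays]
    if mode == "within":
        # One mean/std per column, computed separately for each array.
        return [(a - a.mean(axis=0)) / a.std(axis=0) for a in arrays]
    if mode == "row":
        # One mean/std per row of each array.
        return [
            (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
            for a in arrays
        ]
    raise ValueError("mode must be 'across', 'within' or 'row'")


rng = np.random.default_rng(0)
data = [rng.normal(5, 2, size=(50, 4)), rng.normal(-3, 7, size=(60, 4))]
across = normalize(data, "across")
within = normalize(data, "within")
rows = normalize(data, "row")
```

Under 'across', differences in mean between the arrays survive normalization; under 'within' they are removed, which is one reason a user might want to choose explicitly.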

extend colors api

Currently, colors can only be specified by list. e.g. plot_coords.plot_coords([x[0],x[1]],color=['r','g']) but there are cases where we may want to specify the color per dot. Any ideas about how to add this functionality?

support pandas dataframes

added to the issues list so that we have an accurate estimate on the v1 progress. please close when the pandas pull request has been merged.

point_colors takes a really long time on large datasets

I was trying to plot a dataset with ~8,000 rows and it seemed to get hung up on val2colors during the ranking of the data: ranks = list(map(lambda x: sum([val <= x for val in vals]),vals))

I think this is because val2colors is automatically called if type(point_colors) is numeric. If I pass string labels, the operation is super quick. Thoughts on how to deal with this?
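The ranking line quoted above compares every value with every other value, so it scales quadratically with the number of rows. One possible alternative, sketched here with the standard library only, sorts once and then uses binary search. The function names are hypothetical, and the sketch assumes the values contain no NaNs.

```python
from bisect import bisect_right


def ranks_quadratic(vals):
    """The original approach: count values <= x for every x (O(n^2))."""
    return [sum(val <= x for val in vals) for x in vals]


def ranks_sorted(vals):
    """Same result via one sort plus a binary search per value (O(n log n))."""
    ordered = sorted(vals)
    # bisect_right returns how many entries of `ordered` are <= x.
    return [bisect_right(ordered, x) for x in vals]


vals = [3.2, 1.0, 3.2, 7.5, -2.0]
print(ranks_quadratic(vals))
print(ranks_sorted(vals))
```

Both functions treat ties the same way, so the sorted version could be swapped in without changing the resulting colors.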

integrate travis-ci

wrote to the company about getting a free account on a private repo until we release publicly soon

specify plot styles

Feature request: in plot_coords.py, add ability to assign multiple plot styles to a list of trajectories (like MATLAB version)

replace animation with simulation?

the current animation is based off of data that is aligned with a buggy version of hyperalign. however, it nicely highlights the animation feature. Do we want to replace this animation with something else? Perhaps a simulation that looks similar?

MATLAB code: what should we do about it?

Should we maintain separate MATLAB and Python codebases? The original MATLAB code is already released here: https://www.mathworks.com/matlabcentral/fileexchange/56623-hyperplot-tools

The current Python toolbox goes way beyond the original MATLAB code, and our lab is no longer using MATLAB anyway. So I'm inclined to have us remove the MATLAB code from this repository and just have it be a Python repository.

In a future release we could provide wrappers (for MATLAB, Javascript, R, etc.) for the Python code if we wanted to support those languages; that would allow us to maintain a single "main" codebase without re-writing everything multiple times.

My proposal is that we replace the entire repository with the current python directory. We could also add a link to the original MATLAB code in the readme or in our writeup.

Thoughts?

parse pandas dataframe

We could add this in the next release, but I think it would add a ton of value to add right off the bat. Many people use pandas to sift through multidimensional data, so I am thinking it could catch on with the data science community if we support dataframes as input. thoughts?

semantic versioning

after looking at our dependencies' versioning approaches, and rereading one of the 'rulebooks' on semantic versioning (http://semver.org/spec/v1.0.0.html), I propose that we release the first version of the software as version 0.1.0.

A few reasons for this, but the strongest is that if this is used widely, 1.0.0 is supposed to indicate that the public API will not change, and we may want to change it depending on feedback we get from the community. We've already discussed potentially changing the 'reduce' and 'align' API. Also, most of our dependencies are <1.0.0...

Moving forward, 0.0.X (patch) increments should be reserved for bug fixes (internal changes that fix an incorrect behavior).

0.X.0 (minor) increments should indicate backwards-compatible enhancements to the software (e.g. adding in new components to the plot API).

X.0.0 (major) increments should be reserved for backwards-incompatible changes that are introduced to the public API.
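The three rules above can be illustrated with a small helper. This is only a sketch; the function name `bump` and its arguments are hypothetical and not part of any release tooling.

```python
def bump(version, part):
    """Return the next version string for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Backwards-incompatible API change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards-compatible enhancement: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Bug fix only.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("part must be 'major', 'minor' or 'patch'")


print(bump("0.1.0", "patch"))
print(bump("0.1.1", "minor"))
print(bump("0.2.0", "major"))
```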

As we release new versions, we may want to document what has changed, or been fixed in a log file so that we can keep track of bugs/development.

datapoint labels

idea: to facilitate interpretation, allow the user to pass in lists of datapoint labels. plot the datapoint labels near each datapoint (e.g. next to) in a small font...or in a user-specified font

animation settings (speed & tail length)

might be interesting to:

- let user input speed option and tail length
- default tail length to be a proportion of the dataset's overall length (only within a reasonable range to avoid weird looking issues with very small or very large datasets)
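One way the proportional default could work is sketched below. The function name, the 10% proportion and the lower and upper bounds are all made-up values for illustration, not settings taken from the package.

```python
def default_tail_length(n_timepoints, proportion=0.1, min_len=5, max_len=200):
    """Tail length as a fraction of the data length, clamped to a sane range."""
    length = int(round(n_timepoints * proportion))
    # Keep the tail within [min_len, max_len] so tiny or huge datasets look reasonable.
    length = max(min_len, min(max_len, length))
    # The tail can never be longer than the dataset itself.
    return min(length, n_timepoints)


for n in (3, 10, 1000, 100000):
    print(n, default_tail_length(n))
```

A user-supplied tail length would simply bypass this default.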

categories flag

After playing around with coloring, I thought it would make sense to add a 'categories' keyword argument to plot_coords that takes a list of category labels, and then automatically colors according to those labels. e.g. plot_coords.plot_coords(x,categories=[['a','b','c'],['a','b','c']])

@jeremymanning what do you think?

rollback version for beta release

It might be good to set the version to 0.9.x to indicate we are in beta mode, and then when we are comfortable that the package is stable, bump up to 1.0.0, thoughts?

point_colors misleading?

I'm worried that the point_colors kwarg may not be the best name. Given a list of category labels (or numbers), the array is restructured into a list of arrays. something like category or group would be more true to its functionality. thoughts?
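For reference, the restructuring described here (one array in, a list of arrays out, one per label) could look roughly like the following NumPy sketch. `split_by_group` is a hypothetical name used only for illustration.

```python
import numpy as np


def split_by_group(data, labels):
    """Split the rows of `data` into one array per unique label."""
    labels = np.asarray(labels)
    # dict.fromkeys keeps the order in which labels first appear.
    groups = list(dict.fromkeys(labels.tolist()))
    return groups, [data[labels == group] for group in groups]


data = np.arange(12).reshape(6, 2)
labels = ["a", "b", "a", "c", "b", "a"]
groups, arrays = split_by_group(data, labels)
print(groups)
print([a.shape for a in arrays])
```

Each resulting array can then be given its own color, which is why a name like group or category may describe the argument better than point_colors.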

change commit history from joe fink -> kirsten

#!/bin/sh
git filter-branch --env-filter '
OLD_EMAIL="[email protected]"
CORRECT_NAME="Your Correct Name"
CORRECT_EMAIL="[email protected]"
if [ "$GIT_COMMITTER_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_COMMITTER_NAME="$CORRECT_NAME"
    export GIT_COMMITTER_EMAIL="$CORRECT_EMAIL"
fi
if [ "$GIT_AUTHOR_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_AUTHOR_NAME="$CORRECT_NAME"
    export GIT_AUTHOR_EMAIL="$CORRECT_EMAIL"
fi
' --tag-name-filter cat -- --branches --tags

utilities module

A module that houses utility functions useful for hypertools. Instead of attaching directly to the hypertools class, we could keep things organized by nesting these functions in a 'utils' module.

Proposed API:

import hypertools as hyp
x = some_data
missing_inds = hyp.utils.get_missing_indices(x)
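As a sketch of what such a utility might do, the version below assumes that "missing" means a row containing at least one NaN and that the function returns those row indices. That behavior is a guess made for illustration; only the function name comes from the proposal above.

```python
import numpy as np


def get_missing_indices(x):
    """Return the indices of rows that contain at least one NaN (assumed behavior)."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero(np.isnan(x).any(axis=1)).tolist()


x = [[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]]
missing_inds = get_missing_indices(x)
print(missing_inds)
```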

Thoughts?

feature request: streaming data

support streaming data-- @KirstensGitHub had found a package for this (maybe it's already integrated). i think we want to specify:

  • a data file to get data from
  • an initialization interval to compute PCA transform from
  • an update interval; new data since last read should be read in, transformed using the PCA transformation, and added to the plot
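The initialization and update steps in this list can be sketched with NumPy alone. Everything here is an assumption for illustration: the data are simulated rather than read from a file, the interval sizes are arbitrary, and `fit_pca` and `transform` are hypothetical helpers, not HyperTools functions.

```python
import numpy as np


def fit_pca(initial, n_components=3):
    """Estimate a PCA projection from the initialization interval."""
    mean = initial.mean(axis=0)
    _, _, vt = np.linalg.svd(initial - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform(chunk, mean, components):
    """Project new samples using the already-fitted transformation."""
    return (chunk - mean) @ components.T


rng = np.random.default_rng(0)
stream = np.cumsum(rng.normal(size=(500, 10)), axis=0)  # simulated data source

init_size, update_size = 100, 50
mean, components = fit_pca(stream[:init_size])

# Start the plotted trajectory with the initialization interval...
trajectory = [transform(stream[:init_size], mean, components)]
# ...then append each new batch of samples as it "arrives".
for start in range(init_size, len(stream), update_size):
    new_data = stream[start:start + update_size]
    trajectory.append(transform(new_data, mean, components))

trajectory = np.vstack(trajectory)
print(trajectory.shape)
```

Because the projection is fixed after initialization, points already plotted never move when new data arrive.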

we may also want to allow the user to specify a camera rotation speed (e.g. have the camera orbit the plot, facing in, where we specify the number of revolutions per minute)

PCA implementation doesn't appear to normalize features before reducing

this is important if there are mean/variance differences between the columns of features. One possibility is that we could z-score the columns before PCA automatically (with a flag to turn it off), or conversely, leave it off by default with the option to turn it on. Another option would be to print a warning when there are large differences in mean/var between cols.
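The warning option could be as simple as the following sketch, which only looks at differences in spread, not in means. The function name and the threshold of 10 are placeholders chosen for illustration.

```python
import warnings

import numpy as np


def check_feature_scales(data, max_ratio=10.0):
    """Warn if the largest column std exceeds the smallest by more than max_ratio."""
    std = np.asarray(data, dtype=float).std(axis=0)
    smallest = std.min()
    ratio = np.inf if smallest == 0 else std.max() / smallest
    if ratio > max_ratio:
        warnings.warn(
            "Feature scales differ by a factor of %.1f; "
            "consider z-scoring the columns before PCA." % ratio
        )
    return ratio


rng = np.random.default_rng(0)
# Four columns with standard deviations of roughly 1, 1, 1 and 100.
unbalanced = rng.normal(size=(200, 4)) * np.array([1.0, 1.0, 1.0, 100.0])
ratio = check_feature_scales(unbalanced)
```

Without normalization, the first principal component of data like `unbalanced` would mostly track the high-variance column.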

remove *.pyc files

minor thing-- some .pyc are making their way into the repository. we should delete them and .gitignore them.

dealing with nans

For the PPCA demo, I recommend generating two datasets:

1.) First generate a well-structured covariance matrix:

from scipy.linalg import toeplitz
import numpy as np
K = 10 - toeplitz(np.arange(10))

2.) Now generate a first dataset (a random walk with the given covariance matrix)

data1 = np.cumsum(np.random.multivariate_normal(np.zeros(10), K, 250), axis=0)

3.) Now copy the first dataset

from copy import copy
data2 = copy(data1)

4.) Set random entries of data2 to nan (choose some level of sparsity for this, e.g. 10% of the entries)

5.) Now plot data1 (solid line) and data2 (dashed line) and make sure they line up with each other
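Steps 1 through 4 can be put together as in the sketch below. It builds the same covariance matrix with NumPy broadcasting instead of scipy.linalg.toeplitz, uses a seeded generator for reproducibility, and picks 10% sparsity; the seed and variable names such as `missing` are arbitrary choices. The plotting in step 5 is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1.) Covariance matrix with entries 10 - |i - j| (same as 10 - toeplitz(arange(10))).
idx = np.arange(10)
K = 10 - np.abs(idx[:, None] - idx[None, :])

# 2.) Random walk whose steps have covariance K.
data1 = np.cumsum(rng.multivariate_normal(np.zeros(10), K, 250), axis=0)

# 3.) Copy the first dataset.
data2 = data1.copy()

# 4.) Set exactly 10% of the entries of data2 to nan, chosen at random.
n_missing = data2.size // 10
missing = rng.choice(data2.size, size=n_missing, replace=False)
data2.flat[missing] = np.nan

# Every entry that was not removed should still match data1.
observed = ~np.isnan(data2)
print(np.array_equal(data1[observed], data2[observed]))
```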

rename repo?

What do you think about renaming the repo to hypertools (instead of hyper-tools)? Not a big deal either way, but without the hyphen is simpler.

writeup

After our OpenBCI Hackathon and whatever other polishing we want to do, we should write this up as a brief report in an appropriate forum (e.g. Nature Methods, Journal of Neuroscience Methods, arXiv, PLoS One), and then we should release the code. We should show how we can visualize a few interesting public datasets and use those visualizations to gain insights into the structure of the data. (They could be neuroscience datasets or not; the precise application will also help us narrow down a forum for reporting.)

Proposed title: The Geometry of Big Data

code cleanup

Before releasing this toolbox, we need to do a substantial code cleanup. This includes:

  • ensuring that all function names are clearly named and/or documented

  • ensuring that all variable and function names use consistent conventions

  • ensuring that the code is written as clearly and succinctly as possible

  • creating some test datasets and a battery of unit tests (one for each intended use case) that we can use to verify that nothing is broken in future updates, and that users can use to verify that their install is correct.

  • testing the installation on several "clean" systems (e.g. fresh OS X install, NeuroDebian, etc.). As a sub-issue of this, we need to decide whether to support Windows. (I'm leaning towards no, unless it's easy to get the toolbox working on Windows and we can track down a system to test it on.)

hyperalignment isn't working correctly

As we explored today, hyperalignment in both the Python and Matlab toolboxes seems to be broken. I propose replacing our implementations with existing implementations that are already known to work. One option is to use the SRM model in the BrainIAK toolbox: https://github.com/IntelPNI/brainiak/blob/master/brainiak/funcalign/srm.py

Another option is to use the Hyperalignment implementation from the PyMVPA2 toolbox: https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/algorithms/hyperalignment.py

It seems like the SRM/BrainIAK option would be easier, since the toolbox is smaller and less complicated (and the necessary function seems to rely on just a single file, with dependencies that we already have for other parts of our toolbox).

Thoughts?

jupyter 'inline' flag

a flag that indicates if hypertools is running in an ipython/jupyter notebook, and changes the display settings accordingly. This would allow for inline animations etc
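A common heuristic for this kind of check is sketched below. It relies on IPython's `get_ipython` and on the class name of the shell object that Jupyter kernels use, so it should be read as a best-effort guess rather than a guaranteed test; the function name `running_in_notebook` is hypothetical.

```python
def running_in_notebook():
    """Best-effort check for a Jupyter notebook (or other ZMQ-based) kernel."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a notebook.
        return False
    shell = get_ipython()
    # Plain Python sessions have no shell; terminal IPython uses a different class.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


inline = running_in_notebook()
print(inline)
```

The resulting flag could then be used to pick display settings, with a keyword argument allowing the user to override the guess.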

add ipython notebooks to this project (somehow)

I'd like us to link our ipython notebook examples with this project somehow. Some ideas:

  • Add a folder with the ipython notebooks
  • Host the notebooks somewhere else and add a link to that somewhere else in the readme
  • Add links to the ipython notebooks in our hypertools

Another issue we'll need to solve as part of this is to make sure that any links embedded in the notebooks are relative and/or universal (rather than being hard-coded local links that will only work on one computer).

Ideally we would have some or all of the notebooks ready as part of our main release on Friday-- I think they'd help people understand how to use different features and showcase some additional plot styles and ideas. However, we could also add these examples in after the main release as they are ready for sharing.

@lucywowen and I also discussed some new notebook ideas this morning, and in various other meetings we've had more ideas too. It might be helpful to maintain a repository of ipython notebooks of different "data explorations and visualizations," e.g. as part of this repository, our sampleData repository, or a new repository. We could then branch off those demos into full-fledged projects (e.g. @andrewheusser's Indiana Jones analyses), etc.

Thoughts on the best way to do this?
