contextlab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data

Home Page: http://hypertools.readthedocs.io/en/latest/

License: MIT License

Python 100.00%

data-visualization high-dimensional-data python topic-modeling text-vectorization data-wrangling visualization time-series

hypertools's Introduction

"To deal with hyper-planes in a 14 dimensional space, visualize a 3D space and say 'fourteen' very loudly. Everyone does it." - Geoff Hinton

Overview

HyperTools is designed to facilitate dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. Our package was recently featured on Kaggle's No Free Hunch blog. For a general overview, you may find this talk useful (given as part of the MIND Summer School at Dartmouth).

Try it!

Click the badge to launch a binder instance with example uses:

Check the repo of Jupyter notebooks from the HyperTools paper.

Installation

To install the latest stable version run:

pip install hypertools

To install the latest unstable version directly from GitHub, run:

pip install -U git+https://github.com/ContextLab/hypertools.git

Or alternatively, clone the repository to your local machine:

git clone https://github.com/ContextLab/hypertools.git

Then, navigate to the folder and type:

pip install -e .

(These instructions assume that you have pip installed on your system.)

NOTE: If you have been using the development version of 0.5.0, please clear your data cache (/Users/yourusername/hypertools_data).

Requirements

python>=3.6
PPCA>=0.0.2
scikit-learn>=0.24.0
pandas>=0.18.0
seaborn>=0.8.1
matplotlib>=1.5.1
scipy>=1.0.0
numpy>=1.10.4
umap-learn>=0.4.6
requests
pytest (for development)
ffmpeg (for saving animations)

Documentation

Check out our readthedocs page for further documentation, complete API details, and additional examples.

Citing

We wrote a short JMLR paper about HyperTools, which you can read here, or you can check out a (longer) preprint here. We also have a repository with example notebooks from the paper here.

Please cite as:

Heusser AC, Ziman K, Owen LLW, Manning JR (2018) HyperTools: A Python toolbox for gaining geometric insights into high-dimensional data. Journal of Machine Learning Research, 18(152): 1--6.

Here is a bibtex formatted reference:

@ARTICLE {,
    author  = {Andrew C. Heusser and Kirsten Ziman and Lucy L. W. Owen and Jeremy R. Manning},    
    title   = {HyperTools: a Python Toolbox for Gaining Geometric Insights into High-Dimensional Data},    
    journal = {Journal of Machine Learning Research},
    year    = {2018},
    volume  = {18},	
    number  = {152},	
    pages   = {1-6},	
    url     = {http://jmlr.org/papers/v18/17-434.html}	
}

Contributing

If you'd like to contribute, please first read our Code of Conduct.

For specific information on how to contribute to the project, please see our Contributing page.

Testing

To test HyperTools, install pytest (pip install pytest) and run pytest in the HyperTools folder.

Examples

See here for more examples.

Plot

import hypertools as hyp
hyp.plot(list_of_arrays, '.', group=list_of_labels)

Align

import hypertools as hyp
hyp.plot(list_of_arrays, align='hyper')

BEFORE

AFTER

Cluster

import hypertools as hyp
hyp.plot(array, '.', n_clusters=10)

Describe

import hypertools as hyp
hyp.tools.describe(list_of_arrays, reduce='PCA', max_dims=14)


hypertools's Issues

explore mode doesn't work in 2D

AssertionError: distance: point.shape is wrong: 2, must be (3,)

Hypercube example

Add an example to the results section showing how a simple high-dimensional shape looks when visualized after hyperalignment.

feature request: select points and return indices

A neat feature would be the ability to highlight points and return their indices. This could be done by clicking, but more efficiently with a selection box or circle. This could be used to identify clusters in the data and interrogate them further, or to style them differently.

maybe a v2 feature?
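
As a rough illustration of the selection-box idea, the sketch below returns the indices of points that fall inside an axis-aligned box. It assumes the points are already in plot coordinates; the function name indices_in_box and its arguments are hypothetical and not part of hypertools.

```python
import numpy as np


def indices_in_box(points, lower, upper):
    """Return indices of rows of `points` inside the box [lower, upper].

    Hypothetical helper: `lower` and `upper` give one bound per column.
    """
    points = np.asarray(points, dtype=float)
    # A point is selected only if every coordinate lies within its bounds.
    inside = np.all((points >= lower) & (points <= upper), axis=1)
    return np.flatnonzero(inside)


points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [0.5, 1.5]])
selected = indices_in_box(points, lower=[0.4, 0.9], upper=[1.1, 1.6])
print(selected)
```

The returned indices could then be used to subset the original data or to restyle the selected points.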

deal with nans

deal with it.

write tests for hyp.util.normalize

as well as hyp.plot(x, normalize=..) and hyp.util.reduce(x, normalize=..)

util -> tools?

Not super important, but since we are calling the package hypertools, maybe instead of a util module, we could rename it tools, i.e. hyp.tools.align(x)

@jeremymanning @KirstensGitHub what do you think?

verify that nothing broke after pull request

re-test all of the use cases we can think of in plot_coords and hyperalign to make sure nothing broke during the merge

fix describe_pca axes

currently, x axis (number of PCA components) in the plot starts at 0, when it should start at 2

demo scripts: naming and paths

all of the demo scripts have names that start with hypertools_demo-, which is redundant. i propose removing hypertools_demo- from each script name.

also, some of the paths are relative rather than absolute. for example, the sample_data folder is only visible within the examples folder, but it is referenced using relative paths. this causes some of the demo functions (e.g. hypertools_demo-align.py) to fail. references to sample_data could be changed as follows (from within any scripts that reference data in sample_data):

import os
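# os.path.realpath(__file__) is the script's own path, so take its directory first
datadir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'sample_data')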

Normalize=False by default?

Currently, when calling:

hyp.plot(data)

the columns of the data matrix are z-scored by default, across lists. The user may or may not want to do this, so I propose we leave normalize=False by default. Then the user can easily normalize like so:

hyp.plot(data,normalize='across') # or within or row
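
To make the three options concrete, here is a minimal NumPy sketch. It assumes 'across' z-scores each column over all arrays stacked together, 'within' z-scores each column separately per array, and 'row' z-scores each row; these semantics and the function name normalize_sketch are assumptions for illustration, not the package's implementation.

```python
import numpy as np


def _zscore_columns(x):
    """Z-score each column of x, leaving constant columns at zero."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (x - x.mean(axis=0)) / std


def normalize_sketch(arrays, mode=None):
    """Hypothetical normalization of a list of (samples x features) arrays."""
    arrays = [np.asarray(a, dtype=float) for a in arrays]
    if not mode:
        return arrays  # mode=None/False: leave the data untouched
    if mode == "across":
        stacked = np.vstack(arrays)
        mean = stacked.mean(axis=0)
        std = stacked.std(axis=0)
        std[std == 0] = 1.0
        return [(a - mean) / std for a in arrays]
    if mode == "within":
        return [_zscore_columns(a) for a in arrays]
    if mode == "row":
        return [_zscore_columns(a.T).T for a in arrays]
    raise ValueError(f"unknown mode: {mode!r}")


rng = np.random.default_rng(0)
data = [rng.normal(5, 2, (20, 3)), rng.normal(-1, 3, (30, 3))]
across = normalize_sketch(data, "across")
within = normalize_sketch(data, "within")
```

With 'across', differences in mean between the arrays survive normalization; with 'within', each array is centered on its own mean.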

extend colors api

Currently, colors can only be specified by list, e.g. plot_coords.plot_coords([x[0],x[1]],color=['r','g']), but there are cases where we may want to specify the color per dot. Any ideas about how to add this functionality?

support pandas dataframes

added to the issues list so that we have an accurate estimate on the v1 progress. please close when the pandas pull request has been merged.

colors with animate

not working...
hyp.plot([HRC_data,Trump_data],color=['blue','red'])

point_colors takes a really long time on large datasets

I was trying to plot a dataset with ~8,000 rows and it seemed to get hung up on val2colors during the ranking of the data: ranks = list(map(lambda x: sum([val <= x for val in vals]),vals))

I think this is because val2colors is automatically called if type(point_colors) is numeric. If I pass string labels, the operation is super quick. Thoughts on how to deal with this?
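
The ranking line quoted above compares every value with every other value, so it scales quadratically with the number of rows. One possible replacement, sketched here with NumPy, sorts once and uses a binary search to count how many values are less than or equal to each entry. The function names are made up for this example.

```python
import numpy as np


def ranks_quadratic(vals):
    """The original approach: O(n^2) pairwise comparisons."""
    return [sum(val <= x for val in vals) for x in vals]


def ranks_sorted(vals):
    """Same ranks in O(n log n): sort once, then binary-search each value."""
    vals = np.asarray(vals)
    sorted_vals = np.sort(vals)
    # side="right" returns the number of sorted entries <= each value
    return np.searchsorted(sorted_vals, vals, side="right")


vals = [3, 1, 2, 2, 5]
print(ranks_quadratic(vals))
print(ranks_sorted(vals).tolist())
```

Tied values receive the same rank in both versions, so the color mapping should be unchanged.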

add unit tests

integrate travis-ci

wrote to the company about getting a free account on a private repo until we release publicly soon

specify plot styles

Feature request: in plot_coords.py, add ability to assign multiple plot styles to a list of trajectories (like MATLAB version)

replace animation with simulation?

the current animation is based off of data that is aligned with a buggy version of hyperalign. however, it nicely highlights the animation feature. Do we want to replace this animation with something else? Perhaps a simulation that looks similar?

MATLAB code: what should we do about it?

Should we maintain separate MATLAB and Python codebases? The original MATLAB code is already released here: https://www.mathworks.com/matlabcentral/fileexchange/56623-hyperplot-tools

The current Python toolbox goes way beyond the original MATLAB code, and our lab is no longer using MATLAB anyway. So I'm inclined to have us remove the MATLAB code from this repository and just have it be a Python repository.

In a future release we could provide wrappers (for MATLAB, Javascript, R, etc.) for the Python code if we wanted to support those languages; that would allow us to maintain a single "main" codebase without re-writing everything multiple times.

My proposal is that we replace the entire repository with the current python directory. We could also add a link to the original MATLAB code in the readme or in our writeup.

Thoughts?

andy broke our logo

https://www.youtube.com/watch?v=NlM3CKXJs5k

parse pandas dataframe

We could add this in the next release, but I think it would add a ton of value to add right off the bat. Many people use pandas to sift through multidimensional data, so I am thinking it could catch on with the data science community if we support dataframes as input. thoughts?

semantic versioning

after looking at our dependencies' versioning approaches, and rereading one of the 'rulebooks' on semantic versioning (http://semver.org/spec/v1.0.0.html), I propose that we release the first version of the software as version 0.1.0.

A few reasons for this, but the strongest is that if this is used widely, 1.0.0 is supposed to indicate that the public API will not change, and we may want to change it depending on feedback we get from the community. We've already discussed potentially changing the 'reduce' and 'align' API. Also, most of our dependencies are <1.0.0...

Moving forward, 0.0.X (patch) increments should be reserved for bug fixes (internal changes that fix an incorrect behavior).

0.X.0 (minor) increments should indicate backwards-compatible enhancements to the software (e.g. adding in new components to the plot API).

X.0.0 (major) increments should be reserved for backwards-incompatible changes that are introduced to the public API
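
The three rules above can be sketched as a small helper. This is only an illustration of the increment scheme; bump_version is a hypothetical name and nothing like it is claimed to exist in the package.

```python
def bump_version(version, part):
    """Return the next version string for a 'major', 'minor' or 'patch' bump."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":   # backwards-incompatible API change
        return f"{major + 1}.0.0"
    if part == "minor":   # backwards-compatible enhancement
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # bug fix only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump_version("0.1.0", "patch"))
print(bump_version("0.1.3", "minor"))
print(bump_version("0.9.2", "major"))
```

Note that a minor or major bump resets the lower-order fields to zero.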

As we release new versions, we may want to document what has changed, or been fixed in a log file so that we can keep track of bugs/development.

Rewrite plot_1to2_list function?

in plot_coords.py, lines 151-159, what is the purpose of plot_1to2_list?

@KirstensGitHub can you clarify, and if we aren't using it remove it from the code? Thanks!

datapoint labels

idea: to facilitate interpretation, allow the user to pass in lists of datapoint labels. plot the datapoint labels near each datapoint (e.g. next to it) in a small font...or in a user-specified font

animation settings (speed & tail length)

might be interesting to:

- let the user set the speed and the tail length
- default the tail length to a proportion of the dataset's overall length (only within a reasonable range, to avoid odd-looking results with very small or very large datasets)
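
One way the proportional default could look is sketched below. The proportion and the clamping bounds are placeholder values chosen for the example, and default_tail_length is a hypothetical name.

```python
def default_tail_length(n_frames, proportion=0.1, min_len=5, max_len=150):
    """Tail length as a fraction of the data length, clamped to a sane range."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    tail = int(round(n_frames * proportion))
    tail = max(min_len, min(tail, max_len))  # clamp to [min_len, max_len]
    return min(tail, n_frames)               # never longer than the data


for n in (3, 20, 1000, 10000):
    print(n, default_tail_length(n))
```

A user-supplied value would simply override this default.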

categories flag

After playing around with coloring, I thought it would make sense to add a 'categories' keyword argument to plot_coords that takes a list of category labels, and then automatically colors according to those labels. e.g. plot_coords.plot_coords(x,categories=[['a','b','c'],['a','b','c']])

@jeremymanning what do you think?

rollback version for beta release

It might be good to set the version to 0.9.x to indicate we are in beta mode, and then when we are comfortable that the package is stable, bump up to 1.0.0, thoughts?

camera zoom computation is off

can't seem to figure out what the deal is. @jeremymanning can you take a look? it's in:

hypertools/plot/animate.py line 44-47, and line 77

point_colors misleading?

I'm worried that the point_colors kwarg may not be the best name. Given a list of category labels (or numbers), the array is restructured into a list of arrays. something like category or group would be more true to its functionality. thoughts?
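
For reference, the restructuring described here (one array plus per-row labels becoming a list of arrays, one per label) might look roughly like the following. split_by_group is a hypothetical name used only for this sketch.

```python
import numpy as np


def split_by_group(data, labels):
    """Split rows of `data` into one array per distinct label.

    Groups are returned in order of first appearance.
    """
    data = np.asarray(data)
    labels = np.asarray(labels)
    if len(labels) != len(data):
        raise ValueError("need exactly one label per row")
    groups = list(dict.fromkeys(labels.tolist()))  # unique, order-preserving
    return groups, [data[labels == g] for g in groups]


data = np.arange(10).reshape(5, 2)
groups, arrays = split_by_group(data, ["a", "b", "a", "c", "b"])
print(groups)
print([a.shape for a in arrays])
```

Each resulting array can then be drawn in its own color, which is why a name like group or category may describe the argument better.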

change commit history from joe fink -> kirsten

#!/bin/sh
git filter-branch --env-filter '
OLD_EMAIL="[email protected]"
CORRECT_NAME="Your Correct Name"
CORRECT_EMAIL="[email protected]"
if [ "$GIT_COMMITTER_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_COMMITTER_NAME="$CORRECT_NAME"
    export GIT_COMMITTER_EMAIL="$CORRECT_EMAIL"
fi
if [ "$GIT_AUTHOR_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_AUTHOR_NAME="$CORRECT_NAME"
    export GIT_AUTHOR_EMAIL="$CORRECT_EMAIL"
fi
' --tag-name-filter cat -- --branches --tags

utilities module

A module that houses utility functions useful for hypertools. Instead of attaching directly to the hypertools class, we could keep things organized by nesting these functions in a 'utils' module.

Proposed API:

import hypertools as hyp
x = some_data
missing_inds = hyp.utils.get_missing_indices(x)

Thoughts?
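
The proposal does not say what get_missing_indices should return. As one possible reading, the sketch below returns the indices of rows that contain at least one NaN; that behavior is an assumption, and the function is written standalone rather than as part of any existing module.

```python
import numpy as np


def get_missing_indices(x):
    """Return indices of rows of x with at least one NaN (assumed behavior)."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero(np.isnan(x).any(axis=1))


x = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
missing_inds = get_missing_indices(x)
print(missing_inds)
```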

feature request: streaming data

support streaming data-- @KirstensGitHub had found a package for this (maybe it's already integrated). i think we want to specify:

a data file to get data from
an initialization interval to compute PCA transform from
an update interval; new data since last read should be read in, transformed using the PCA transformation, and added to the plot

we may also want to allow the user to specify a camera rotation speed (e.g. have the camera orbit the plot, facing in, where we specify the number of revolutions per minute)
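
A minimal sketch of the initialization/update scheme described above, using plain NumPy: the PCA transform is computed once from the initialization interval and then applied unchanged to each new chunk. The data are simulated in memory rather than read from a file, and the class name FixedPCAProjector and the interval sizes are assumptions for this example.

```python
import numpy as np


class FixedPCAProjector:
    """Fit PCA on an initial window, then project later chunks with it."""

    def __init__(self, n_components=3):
        self.n_components = n_components

    def fit(self, init_data):
        init_data = np.asarray(init_data, dtype=float)
        self.mean_ = init_data.mean(axis=0)
        # Rows of vt are the principal axes of the centered initial window.
        _, _, vt = np.linalg.svd(init_data - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, new_data):
        return (np.asarray(new_data, dtype=float) - self.mean_) @ self.components_.T


rng = np.random.default_rng(0)
stream = rng.normal(size=(500, 10)).cumsum(axis=0)  # stand-in for a data file
init_interval, update_interval = 100, 50

model = FixedPCAProjector(n_components=3).fit(stream[:init_interval])
plotted = [model.transform(stream[:init_interval])]
for start in range(init_interval, len(stream), update_interval):
    chunk = stream[start:start + update_interval]  # "new data since last read"
    plotted.append(model.transform(chunk))          # would be added to the plot
trajectory = np.vstack(plotted)
print(trajectory.shape)
```

Because the transform is frozen after initialization, points already on the plot never move when new data arrive.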

PCA implementation doesn't appear to normalize features before reducing

this is important if there are mean/variance differences between the columns of features. One possibility is that we could z-score the columns before PCA automatically (with a flag to turn it off) or, conversely, leave it off by default with an option to turn it on. Another option would be to print a warning when there are large differences in mean/var between cols.
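
To show why the scaling matters, the sketch below runs an SVD-based PCA with and without z-scoring on data where one column has a much larger variance. The function pca_reduce and its normalize flag are hypothetical stand-ins, not the package's reduce function.

```python
import numpy as np


def pca_reduce(x, n_components=3, normalize=True):
    """Project x onto its top principal components, optionally z-scoring first."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    if normalize:
        std = x.std(axis=0)
        std[std == 0] = 1.0  # leave constant columns at zero
        centered = centered / std
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
x[:, 0] *= 1000.0  # one feature on a much larger scale than the rest

raw = pca_reduce(x, n_components=1, normalize=False)
scaled = pca_reduce(x, n_components=1, normalize=True)
# Without normalization the first component is essentially column 0.
print(abs(np.corrcoef(raw[:, 0], x[:, 0])[0, 1]))
```

Without normalization the large-variance column dominates the first component; after z-scoring, all four columns contribute on an equal footing.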

remove *.pyc files

minor thing-- some .pyc are making their way into the repository. we should delete them and .gitignore them.

dealing with nans

For the PPCA demo, I recommend generating two datasets:

1.) First generate a well-structured covariance matrix:

from scipy.linalg import toeplitz
import numpy as np
K = 10 - toeplitz(np.arange(10))

2.) Now generate a first dataset (a random walk with the given covariance matrix)

data1 = np.cumsum(np.random.multivariate_normal(np.zeros(10), K, 250), axis=0)

3.) Now copy the first dataset

from copy import copy
data2 = copy(data1)

4.) Set random entries of data2 to nan (choose some level of sparsity for this, e.g. 10% of the entries)

5.) Now plot data1 (solid line) and data2 (dashed line) and make sure they line up with each other
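
Steps 1 through 4 can be put together as follows. The sketch builds the covariance matrix with NumPy only (it equals 10 - toeplitz(np.arange(10))), uses a seeded generator for reproducibility, and takes 10% as the sparsity level; the seed and variable names other than K, data1 and data2 are choices made for this example. Plotting (step 5) is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1.) Structured covariance: entry (i, j) is 10 - |i - j|
lags = np.abs(np.subtract.outer(np.arange(10), np.arange(10)))
K = 10 - lags

# 2.) Random walk whose steps have covariance K
data1 = np.cumsum(rng.multivariate_normal(np.zeros(10), K, 250), axis=0)

# 3.) Independent copy of the first dataset
data2 = data1.copy()

# 4.) Set roughly 10% of the entries of data2 to nan
sparsity = 0.1
mask = rng.random(data2.shape) < sparsity
data2[mask] = np.nan

# Wherever data2 is still observed, it should match data1 exactly.
observed = ~np.isnan(data2)
print(np.array_equal(data1[observed], data2[observed]))
```

After filling in the missing entries (e.g. with PPCA), the two trajectories should overlap closely when plotted together.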

add legend option

to movie plot as well?

add interactive javascript plots

feature request: have plots outputted as web apps or javascript figures

rename repo?

What do you think about renaming the repo to hypertools (instead of hyper-tools)? Not a big deal either way, but the name without the hyphen is simpler.

check pymvpa license for procrustes

need to make sure it's ok to use in the way that we are

choose a license

we should do this before releasing into the wild

writeup

After our OpenBCI Hackathon and whatever other polishing we want to do, we should write this up as a brief report in an appropriate forum (e.g. Nature Methods, Journal of Neuroscience Methods, arXiv, PLoS One), and then we should release the code. We should show how we can visualize a few interesting public datasets and use those visualizations to gain insights into the structure of the data. (They could be neuroscience datasets or not; the precise application will also help us narrow down a forum for reporting.)

Proposed title: The Geometry of Big Data

fonts are saved as paths rather than text

When saving figures as PDFs, the text gets converted to paths rather than being saved as editable text. This is related to the stackoverflow issue here: http://stackoverflow.com/questions/14600948/matplotlib-plot-outputs-text-as-paths-and-cannot-be-converted-to-latex-by-inks

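It looks like it can be solved by adding the following lines (svg.fonttype only affects SVG output; for PDF output the corresponding setting is pdf.fonttype, where 42 embeds TrueType fonts so the text stays editable):
matplotlib.rcParams['svg.fonttype'] = 'none'
matplotlib.rcParams['pdf.fonttype'] = 42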

code cleanup

Before releasing this toolbox, we need to do a substantial code cleanup. This includes:

ensuring that all function names are clearly named and/or documented
ensuring that all variable and function names use consistent conventions
ensuring that the code is written as clearly and succinctly as possible
creating some test datasets and a battery of unit tests (one for each intended use case) that we can use to verify that nothing is broken in future updates, and that users can use to verify that their install is correct.
testing the installation on several "clean" systems (e.g. fresh OS X install, NeuroDebian, etc.). As a sub-issue of this, we need to decide whether to support Windows. (I'm leaning towards no, unless it's easy to get the toolbox working on Windows and we can track down a system to test it on.)

saving animations requires FFMPEG

matplotlib depends on ffmpeg to save animations, but ffmpeg is annoying to install. other options?

hyperalignment isn't working correctly

As we explored today, hyperalignment in both the Python and Matlab toolboxes seems to be broken. I propose replacing our implementations with existing implementations that are already known to work. One option is to use the SRM model in the BrainIAK toolbox: https://github.com/IntelPNI/brainiak/blob/master/brainiak/funcalign/srm.py

Another option is to use the Hyperalignment implementation from the PyMVPA2 toolbox: https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/algorithms/hyperalignment.py

It seems like the SRM/BrainIAK option would be easier, since the toolbox is smaller and less complicated (and the necessary function seems to rely on just a single file, with dependencies that we already have for other parts of our toolbox).

Thoughts?

jupyter 'inline' flag

a flag that indicates if hypertools is running in an ipython/jupyter notebook, and changes the display settings accordingly. This would allow for inline animations etc
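
One common heuristic for such a flag is to look at the class name of the active IPython shell. The sketch below relies on that convention, which is a widely used trick rather than a documented guarantee; running_in_notebook is a hypothetical name.

```python
def running_in_notebook():
    """Best-effort check for a Jupyter notebook (or other ZMQ-based) kernel."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False  # IPython not installed, so not a notebook
    shell = get_ipython()
    if shell is None:
        return False  # plain Python interpreter
    # Notebook and qtconsole kernels use ZMQInteractiveShell;
    # the terminal IPython shell uses TerminalInteractiveShell.
    return shell.__class__.__name__ == "ZMQInteractiveShell"


inline = running_in_notebook()
print(inline)
```

The display settings could then be switched based on the value of inline, with an explicit keyword argument to override the detection.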

Path (or Trace) change in opacity

Leave a faint path or trace in animation.

change API for procrustes and reduce

we want to mirror the scikit learn api i.e.

fit
transform
fit_transform
inverse_transform
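
As a sketch of what that four-method interface could look like for alignment, the class below implements an orthogonal Procrustes rotation with NumPy. It is an illustration of the API shape only: the class name is hypothetical, and it uses a rotation-only Procrustes solution, which may differ from the procrustes code in the toolbox.

```python
import numpy as np


class ProcrustesAligner:
    """Rotation-only Procrustes alignment with a scikit-learn style API."""

    def fit(self, source, target):
        source = np.asarray(source, dtype=float)
        target = np.asarray(target, dtype=float)
        # Orthogonal Procrustes: R = U V^T, where source^T target = U S V^T
        u, _, vt = np.linalg.svd(source.T @ target)
        self.rotation_ = u @ vt
        return self

    def transform(self, source):
        return np.asarray(source, dtype=float) @ self.rotation_

    def fit_transform(self, source, target):
        return self.fit(source, target).transform(source)

    def inverse_transform(self, aligned):
        # The rotation is orthogonal, so its inverse is its transpose.
        return np.asarray(aligned, dtype=float) @ self.rotation_.T


rng = np.random.default_rng(0)
source = rng.normal(size=(50, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
target = source @ q

aligner = ProcrustesAligner()
aligned = aligner.fit_transform(source, target)
print(np.allclose(aligned, target))
```

Unlike a scikit-learn transformer, fit here takes two arrays (source and target), since alignment needs a reference; how to express that in the final API is an open design question.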

add ipython notebooks to this project (somehow)

I'd like us to link our ipython notebook examples with this project somehow. Some ideas:

Add a folder with the ipython notebooks
Host the notebooks somewhere else and add a link to that somewhere else in the readme
Add links to the ipython notebooks in our hypertools

Another issue we'll need to solve as part of this is to make sure that any links embedded in the notebooks are relative and/or universal (rather than being hard-coded local links that will only work on one computer).

Ideally we would have some or all of the notebooks ready as part of our main release on Friday-- I think they'd help people understand how to use different features and showcase some additional plot styles and ideas. However, we could also add these examples in after the main release as they are ready for sharing.

@lucywowen and I also discussed some new notebook ideas this morning, and in various other meetings we've had more ideas too. It might be helpful to maintain a repository of ipython notebooks of different "data explorations and visualizations," e.g. as part of this repository, our sampleData repository, or a new repository. We could then branch off those demos into full-fledged projects (e.g. @andrewheusser's Indiana Jones analyses), etc.

Thoughts on the best way to do this?

overleaf formatting is messed up

when the writeup is complete, i'll copy everything to a clean document to try to get the formatting issues to go away. in the meantime i've messily implemented an adjustment of the paper margins so that the text doesn't look crazy, but the figures and tables are still all shifted too far to the right.

https://www.overleaf.com/7651544crgxsftzghby#/26772478/
