biocore / gneiss

compositional data analysis toolbox

Home Page: https://biocore.github.io/gneiss/

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.11% Python 48.90% Jupyter Notebook 50.99%

balance compositional-statistics omics tree

gneiss's Issues

Improving the RegressionResults object

It would be nice if the RegressionResults object just had the tree and balances built inside of it.

There are a few filtering steps done within ols and mixedlm. It would be easier to get the resulting tree and balances directly from this object rather than guessing it.

Balance trees

Need to port over the balance trees from this repo here

One thing that I'm thinking would be appropriate for documentation is IPython notebooks / markdown.

More flexibility with file io

Need to be able to pass in file handles for read_pickle and write_pickle.
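
A minimal sketch of what this could look like, assuming read_pickle and write_pickle are free functions that take either a path or an open binary handle. The signatures below are hypothetical and are not taken from the gneiss codebase.

```python
import io
import pickle


def write_pickle(obj, path_or_handle):
    """Pickle obj to a filesystem path or to an open binary file handle."""
    if hasattr(path_or_handle, "write"):
        # Already file-like: write to it and leave closing to the caller.
        pickle.dump(obj, path_or_handle)
    else:
        with open(path_or_handle, "wb") as fh:
            pickle.dump(obj, fh)


def read_pickle(path_or_handle):
    """Load a pickled object from a path or from an open binary file handle."""
    if hasattr(path_or_handle, "read"):
        return pickle.load(path_or_handle)
    with open(path_or_handle, "rb") as fh:
        return pickle.load(fh)


# Round trip through an in-memory handle instead of a file on disk.
buffer = io.BytesIO()
write_pickle({"intercept": 1.5}, buffer)
buffer.seek(0)
restored = read_pickle(buffer)
```

Checking for a write or read attribute rather than a specific type keeps this working for io.BytesIO, gzip handles, and regular files alike.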

RegressionModel.fit() issues

The RegressionModel object will add more variables if fit is called multiple times
fit should return the updated RegressionModel object as output
a _stale parameter should be enabled so that coefficients, predict and residuals won't return anything until fit has been called.
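
A rough sketch of the three points above, using a toy one-variable least-squares fit in place of the real models. The class body, the coefficients() return shape, and the choice to raise an error while the model is stale are all assumptions made for illustration, not the gneiss implementation.

```python
class RegressionModel:
    """Hypothetical stand-in for the model class discussed in this issue."""

    def __init__(self, xs, ys):
        self.xs = list(xs)
        self.ys = list(ys)
        self._stale = True  # no valid fit yet
        self._slope = None
        self._intercept = None

    def fit(self):
        # Recompute everything from the stored inputs, so repeated calls
        # overwrite the previous fit instead of accumulating extra state.
        n = len(self.xs)
        mean_x = sum(self.xs) / n
        mean_y = sum(self.ys) / n
        sxx = sum((x - mean_x) ** 2 for x in self.xs)
        sxy = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(self.xs, self.ys))
        self._slope = sxy / sxx
        self._intercept = mean_y - self._slope * mean_x
        self._stale = False
        return self  # allows model.fit().coefficients()

    def coefficients(self):
        if self._stale:
            raise ValueError("Model has not been fit; call fit() first.")
        return {"Intercept": self._intercept, "slope": self._slope}


model = RegressionModel([0, 1, 2, 3], [1, 3, 5, 7])
coefs = model.fit().fit().coefficients()  # fitting twice changes nothing
```

Raising is one reading of "won't return anything"; returning None would also satisfy the issue, but it tends to hide the mistake until later.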

Add cross decomposition techniques

Partial least squares
Canonical correlation analysis
Canonical correspondence analysis?

Note that this will extend the Model superclass

Parallelize

The ols and mixedlm calculations can take upwards of 30 minutes on 1000 OTUs.

And these calculations are embarrassingly parallel. So I'm thinking about enabling an optional dependency with joblib to enable this.
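
One way the optional dependency could be wired up, as a sketch only. fit_one_balance and fit_all_balances are hypothetical names, and the per-balance "fit" here is just a mean so the example stays self-contained. Parallel and delayed are the standard joblib entry points.

```python
import math

try:
    from joblib import Parallel, delayed
    HAVE_JOBLIB = True
except ImportError:
    HAVE_JOBLIB = False


def fit_one_balance(values):
    """Placeholder for a per-balance model fit; here it returns the mean."""
    return math.fsum(values) / len(values)


def fit_all_balances(balances, n_jobs=1):
    """Fit every balance independently, in parallel if joblib is installed."""
    if HAVE_JOBLIB and n_jobs != 1:
        return Parallel(n_jobs=n_jobs)(
            delayed(fit_one_balance)(values) for values in balances
        )
    # Serial fallback keeps joblib optional.
    return [fit_one_balance(values) for values in balances]


results = fit_all_balances([[1.0, 2.0, 3.0], [4.0, 6.0]])
```

Because each balance is fit independently, the results come back in input order either way, so callers do not need to know which path was taken.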

Finalize Model file format

There are some issues with using pickle to represent Model objects.

https://bugs.python.org/issue24658

We'll want to think about some alternative file formats to store all of the necessary information.

Diamond tree improvements

Enable the option to nuke labels on tree. These extra labels can get in the way of publishable figures.

Testing ETE layouts

This was brought up in #4. We need a better way to be able to test tree layouts.

Sorting algorithms

Supervised niche sort (#16)
Unsupervised niche sort

RegressionResults -> OrdinationResults

Need to encode a transformation that allows for biplots to be plotted with balances.

CC @ElDeveloper

Make summary more informative

Include formula
Include effect sizes of covariates
Include cross validation error
Include percent variance explained by balance axis

Update examples

The examples will need to be updated with the new API.

Also, convert_biom_to_pandas will need to be updated to handle some edge cases.

qiime2 plugin

This plugin can be found here. It would be great to knock this out by the beta release.

cc @gregcaporaso to give an idea about the latest possible deadline to have a q2 plugin for the beta version release.

Huge PDF heat map

I tested a dataset of 374 samples and 3682 OTUs. The analysis itself was swift, but it took a long time (~5-10 min) to generate the PDF heat map. The resulting file is 26 MB in size.

Refactor sorting algorithms

Rename niche_sort and mean_niche_estimator to band_sort and mean_band_estimator

Also need to think about how to split up the table sorting algorithms, and the tree sorting algorithms.

Refactor tree sorts

It would be nice to have an overarching function that sorts tree tips.

Upgrade to newest release of scikit-bio

There was some error handling that has been corrected in the dev version of scikit-bio here. Will need to upgrade to depend on that version.

Refactoring balanceplot

Need to clean up balanceplot to allow for multiple attributes to be plotted simultaneously

Type checking

I personally find it really frustrating when I drop a couple of hours to build a regression model, only to realize that it was initialized with the wrong types.

For example, consider a variable like Age that is actually numerical but is represented as an object (i.e. string).

It would be nice to have some sort of type checking on the fly, maybe even taking advantage of PY3 types and/or the categorical / numerical types in pandas.

Not sure about what the best approach here is ...
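
One possible approach, sketched in plain Python so it runs without pandas: scan the metadata before fitting and flag columns whose values are all strings that would parse as numbers. The function name and the dict-of-lists input are hypothetical stand-ins for a DataFrame and its object-dtype columns.

```python
def numeric_looking_columns(columns):
    """Return names of columns whose values are all numeric-looking strings."""
    suspicious = []
    for name, values in columns.items():
        # Only non-empty, all-string columns are candidates.
        if not values or not all(isinstance(v, str) for v in values):
            continue
        try:
            for v in values:
                float(v)
        except ValueError:
            continue  # at least one value is genuinely non-numeric
        suspicious.append(name)
    return suspicious


metadata = {
    "Age": ["23", "35", "41"],  # numbers stored as strings
    "Sex": ["F", "M", "F"],     # genuinely categorical
    "pH": [6.5, 7.0, 7.2],      # already numeric
}
flagged = numeric_looking_columns(metadata)
```

A check like this could warn before the model is built, which is cheaper than discovering the problem after a long run.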

niche_sort gives nans

This happens when applying this algorithm to the EMP dataset

Add support for Principal Balance Analysis

Hierarchical clustering via proportionality
Sparse Principal Balances via sparse approximation on ilr PCA

http://congress.cimne.com/codawork11/Admin/Files/FilePaper/p55.pdf
http://www.elib.bsu.by/bitstream/123456789/51958/1/173-176.pdf

Balance-balance transformations

It would be nice if we could perform coordinate conversions between 2 trees.

Refactor RegressionResults object

There are a few things that I think would improve this object

Come to think of it, naming it RegressionResults is a bit of a misnomer: it's really a model rather than a results object.
The methods that it encapsulates have different functionalities. For instance, MixedLM has no prediction functionality, so it may not be appropriate to have everything within the same class. However, there are shared functionalities between the classes.
On top of that, there are some functionalities that we are not exposing from the statsmodels RegressionResults objects, such as fit(), that would allow the models to be trained with different parameters. Exposing this sort of API would also bring this more in line with the scikit-learn API.

Here's what I'm thinking

Rename RegressionResults to something like Model or RegressionModel and make it an abstract base class. This naming will require a little careful thought, considering that we will be expanding this functionality to include classification as well in the near future.
Have separate Model classes for each of the methods, such as OLSModel or LMEModel

On top of that, it would be great if this object could support some querying functionality. Specifically

Allow for the querying of subtrees. If I query say internal node y7, I could retrieve all of the tips within that subtree.
Allow for intuitive interpretation of left/right balances. This ties into #69. It would be great to have functionality that could state right off the bat which subtree is more abundant than the other.
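
The subtree query could look something like the sketch below. Node is a hypothetical minimal tree class written for this example, not the tree type gneiss uses, and the node names are made up.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        """Return the first node with this name in a preorder walk, or None."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def tips(self):
        """Return the names of all tips (leaves) below this node."""
        if not self.children:
            return [self.name]
        return [tip for child in self.children for tip in child.tips()]


tree = Node("y0", [
    Node("y1", [Node("a"), Node("b")]),
    Node("y2", [Node("c"), Node("y3", [Node("d"), Node("e")])]),
])

subtree = tree.find("y2")
left, right = subtree.children  # the two sides of balance y2
```

Here subtree.tips() gives every tip under y2, while left.tips() and right.tips() give the taxa on each side of the balance, which is the information needed to say which side is more abundant.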

Port pycogent drawing

Looks like pycogent has support for coloring trees
https://github.com/pycogent/pycogent/blob/master/cogent/draw/dendrogram.py

Combine this with bokeh, and we can have interactive trees
http://chuckpr.github.io/blog/trees2.html

Sparsity filters

Filter out entire clades based on abundance
Collapse entire clades together

Additional ETE layouts

Edge weight coloration, similar to what is shown in Edge PCA
Heatmaps linked to ETE
Subtree collapsing / coloring. Similar to the recent tree of life paper.

Confidence intervals

Bootstrapping pseudocounts
Subsampling tree

Log files

Having log files available for some of the pipelines, especially the regression functions, would be very nifty for debugging.

Tree sorting algorithms

Ladderize - sort the subtrees according to how many tips are contained. Similar to what figtree and Dendroscope do.
Gradient sort - similar to what has been implemented in #16 , but on tree leaves rather than tables.
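
A minimal sketch of ladderizing, assuming a tree encoded as nested (name, children) tuples and ascending tip count as the sort direction. Both are choices made for this example; other tools may sort the other way.

```python
def count_tips(node):
    """A node is (name, [children]); tips have an empty child list."""
    _, children = node
    return 1 if not children else sum(count_tips(c) for c in children)


def ladderize(node):
    """Return a copy with each node's children ordered by ascending tip count."""
    name, children = node
    sorted_children = sorted((ladderize(c) for c in children), key=count_tips)
    return (name, sorted_children)


tree = ("root", [
    ("A", [("a", []), ("b", []), ("c", [])]),
    ("d", []),
    ("B", [("e", []), ("f", [])]),
])
ladderized = ladderize(tree)
```

This version recounts tips at every level, which is fine for a sketch but repeats work on large trees; caching the counts in a single postorder pass would avoid that.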

Optimize tree operations

Optimize balance_basis (#8)
Optimize ladderize
Optimize order_tips

Singleton children

There needs to be an easy way to remove nodes with single children.
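
As an illustration of the operation being asked for, here is a sketch on a tree encoded as nested (name, children) tuples. The encoding and the function name are hypothetical, and branch lengths are ignored; a real implementation would presumably add the removed node's length to its child.

```python
def prune_singletons(node):
    """Collapse every node that has exactly one child into that child.

    A node is (name, [children]); tips have an empty child list.
    """
    name, children = node
    children = [prune_singletons(c) for c in children]
    if len(children) == 1:
        return children[0]  # drop this node, keep its only child
    return (name, children)


tree = ("root", [
    ("x", [("y", [("a", []), ("b", [])])]),  # x has a single child, y
    ("c", []),
])
pruned = prune_singletons(tree)
```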

Docker container

It would be nice if there could be a Docker container for this.

Adding sphinx documentation

Add mathjax documentation for balance_basis.
Make sure that mean_niche_estimator renders properly.

Switch to BSD license

We can switch to BSD as soon as ETE becomes dual licensed.

Transform module

It may be advantageous to break out the ilr transform out of the regression module.

Specifically, this would change all of the regression modules so that they take in a balance table rather than an OTU table. That way, if multiple analyses were to be run on the same balance table, the balance table doesn't need to be recomputed every iteration.

This is also relevant to #79
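
To make the idea concrete, here is a sketch of computing a balance table once and reusing it. It uses the standard ilr balance, sqrt(rs / (r + s)) * ln(g(left) / g(right)), where g is the geometric mean and r, s are the sizes of the two groups. The function names, the dict-based inputs, and the convention that the left group is the numerator are assumptions made for this example.

```python
import math


def balance(left, right):
    """ilr balance between two groups of strictly positive abundances."""
    r, s = len(left), len(right)
    log_gmean_left = math.fsum(math.log(x) for x in left) / r
    log_gmean_right = math.fsum(math.log(x) for x in right) / s
    return math.sqrt(r * s / (r + s)) * (log_gmean_left - log_gmean_right)


def balance_table(samples, partitions):
    """Compute every balance once per sample so later analyses can reuse it.

    samples: {sample_id: {taxon: abundance}}
    partitions: {balance_name: (left_taxa, right_taxa)}
    """
    return {
        sample_id: {
            name: balance([counts[t] for t in left],
                          [counts[t] for t in right])
            for name, (left, right) in partitions.items()
        }
        for sample_id, counts in samples.items()
    }


samples = {"s1": {"a": 4.0, "b": 1.0, "c": 1.0}}
partitions = {"y0": (["a"], ["b", "c"])}
table = balance_table(samples, partitions)
```

Any regression function that accepts table directly can then be run repeatedly without repeating the transform.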

Heatmap improvements

enable total width and total height of images in px or inches rather than width and height of cells
enable coloring of individual labels
enable label resizing for individual labels
enable layout embedding
enable circular plots

Warnings in regression functions

If none of the samples between the metadata and the table match, an error needs to be thrown.
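
A sketch of the check, assuming sample IDs are available as plain sequences. match_samples is a hypothetical helper name, and ValueError is one reasonable choice of exception.

```python
def match_samples(table_ids, metadata_ids):
    """Return shared sample IDs in table order; raise if there is no overlap."""
    metadata_set = set(metadata_ids)
    shared = [sid for sid in table_ids if sid in metadata_set]
    if not shared:
        raise ValueError(
            "No sample IDs are shared between the table and the metadata."
        )
    return shared


shared = match_samples(["s1", "s2", "s3"], ["s3", "s1", "s9"])
```

Failing here, before any model is fit, gives a clearer message than whatever error an empty table would trigger further down.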

Interface to statsmodels

It would be amazing if we could take advantage of the formula interface in statsmodels to run statistical tests on the individual balances.

Add classification models

Logistic regression
Generalized Linear Model
Generalized Estimating Equations
Partial Least Squares discriminant analysis

Note that this will extend the Model superclass

failed to install

I wanted to install gneiss on barnacle as described in the Readme. This is what happened. Is the Readme still up to date?

barnacle x86_64 ~/>conda create -n gneiss_env python=3
barnacle x86_64 ~/>source activate gneiss_env
(gneiss_env) barnacle x86_64 ~/>conda install pyqt=4.11.4
Fetching package metadata .........
Solving package specifications: ....

UnsatisfiableError: The following specifications were found to be in conflict:

pyqt 4.11.4*
python 3.6*
Use "conda info " to see the dependencies for each package.

Additional regression modules

Generalized estimating equations #62
General linear model
Robust linear models

Taxonomy summarization

Summarize subtree taxonomies of a given balance to aid interpretation

Make RegressionResults serializable

It would be nice to be able to save the RegressionResults to a file, to avoid rerunning analyses.

I'm thinking about having the following formats

read_pickle, write_pickle
read_json, write_json
summary (i.e. a tab delimited file summarizing the statistics).

Build tree from orthonormal basis

Relevant to #70

Better support for interpreting balances

Right now, it is difficult to interpret balances - it requires quite a few auxiliary data structures in order to figure out which taxa are in the right and left balances.

Relevant to #23

cc @tanaes @amnona

Explicit support for pd.Series in mean_niche_estimator

Would be nice if the labels were kept if this function was run on a pd.Series
