biocore / gneiss


compositional data analysis toolbox

Home Page: https://biocore.github.io/gneiss/

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.11% Python 48.90% Jupyter Notebook 50.99%
balance compositional-statistics omics tree

gneiss's People

Contributors

amnona, antgonza, ebolyen, eldeveloper, jkanbar, josenavas, lisa55asil, mortonjt, qiyunzhu, rnaer, serenejiang, sjanssen2, tanaes, wasade


gneiss's Issues

Improving the RegressionResults object

It would be nice if the RegressionResults object just had the tree and balances built inside of it.

Because ols and mixedlm perform a few filtering steps internally, it would be easier to get the resulting tree and balances directly from this object rather than guessing them.

Balance trees

Need to port the balance trees over from this repo to here

One thing that I'm thinking would be appropriate for documentation is IPython notebooks / markdown.

RegressionModel.fit() issues

  • The RegressionModel object will add more variables if fit is called multiple times
  • fit should return the updated RegressionModel object as output
  • a _stale parameter should be enabled so that coefficients, predict and residuals won't return anything until fit has been called.
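
The three points above could look roughly like the sketch below. This is not the gneiss implementation: the class is a hypothetical single-covariate stand-in, and raising an error while the model is stale is just one way to read "won't return anything".

```python
class RegressionModel:
    """Hypothetical stand-in for illustration, not the gneiss class."""

    def __init__(self, x, y):
        self.x = list(x)
        self.y = list(y)
        self._stale = True  # True until fit() has produced valid results
        self._slope = None
        self._intercept = None

    def fit(self):
        # Everything is recomputed from the stored inputs, so calling fit()
        # several times never accumulates extra variables.
        n = len(self.x)
        mean_x = sum(self.x) / n
        mean_y = sum(self.y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in self.x)
        sxy = sum((xi - mean_x) * (yi - mean_y)
                  for xi, yi in zip(self.x, self.y))
        self._slope = sxy / sxx
        self._intercept = mean_y - self._slope * mean_x
        self._stale = False
        return self  # returning the model allows model.fit().coefficients()

    def _check_fitted(self):
        if self._stale:
            raise RuntimeError("call fit() before requesting results")

    def coefficients(self):
        self._check_fitted()
        return {"Intercept": self._intercept, "x": self._slope}

    def predict(self, x_new):
        self._check_fitted()
        return [self._intercept + self._slope * xi for xi in x_new]

    def residuals(self):
        self._check_fitted()
        return [yi - fi for yi, fi in zip(self.y, self.predict(self.x))]
```

Returning self from fit keeps the call chainable, which is also what scikit-learn estimators do.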

Parallelize

The ols and mixedlm calculations can take upwards of 30 minutes on 1000 OTUs.

These calculations are embarrassingly parallel, so I'm thinking about adding joblib as an optional dependency to enable this.
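
To show why this is embarrassingly parallel, here is a rough sketch in which each balance is fit independently of the others. It uses the standard library's concurrent.futures rather than joblib so that it runs without extra dependencies. fit_line and fit_all are hypothetical names, and the simple one-covariate fit only stands in for the real ols / mixedlm calls.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(x, y):
    """Least-squares (slope, intercept) for a single balance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_all(balances, covariate, max_workers=4):
    """Fit every balance separately; no fit depends on any other.

    balances maps a balance name to its list of values across samples.
    """
    names = list(balances)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fits = list(pool.map(
            lambda name: fit_line(covariate, balances[name]), names))
    return dict(zip(names, fits))
```

Threads will not speed up pure-Python arithmetic like this. For CPU-bound fits, a process pool or joblib would take the place of the thread pool, and the structure of fit_all would stay the same.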

Diamond tree improvements

  • Enable the option to nuke labels on tree. These extra labels can get in the way of publishable figures.

Make summary more informative

  • Include formula
  • Include effect sizes of covariates
  • Include cross validation error
  • Include percent variance explained by balance axis

Update examples

The examples will need to be updated with the new api

Also, convert_biom_to_pandas will need to be updated to handle some edge cases

qiime2 plugin

This plugin can be found here. It would be great to knock this out by the beta release.

cc @gregcaporaso to give an idea about the latest possible deadline to have a q2 plugin for the beta version release.

Huge PDF heat map

I tested a dataset of 374 samples and 3682 OTUs. The analysis itself was swift, but it took a long time (~5-10 min) to generate the heat map in PDF format. The resulting file is 26 MB in size.

Refactor sorting algorithms

Rename niche_sort and mean_niche_estimator to band_sort and mean_band_estimator

Also need to think about how to split up the table sorting algorithms, and the tree sorting algorithms.

Refactoring balanceplot

Need to clean up balanceplot to allow for multiple attributes to be plotted simultaneously

Type checking

I personally find it really frustrating when I drop a couple of hours to build a regression model, only to realize that it was initialized with the wrong types.

For example, consider a variable like Age that is actually numerical but is represented as an object (i.e. a string).

It would be nice to have some sort of type checking on the fly, maybe even taking advantage of PY3 types and/or the categorical / numerical types in pandas.

Not sure about what the best approach here is ...
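
One possible approach, sketched with plain Python containers so that it stays self-contained: before fitting, flag any metadata column whose values are all strings that parse as numbers. find_mistyped_columns is a hypothetical helper, not part of gneiss. In pandas, such a column would show up with dtype object.

```python
def _parses_as_float(value):
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def find_mistyped_columns(metadata):
    """Return names of columns stored as strings that look numeric.

    metadata maps a column name to its list of values.
    """
    suspicious = []
    for name, values in metadata.items():
        all_strings = bool(values) and all(isinstance(v, str) for v in values)
        # Flag only columns that are entirely strings and entirely parseable.
        if all_strings and all(_parses_as_float(v) for v in values):
            suspicious.append(name)
    return suspicious
```

A regression function could call this up front and warn or raise, so that the mistake is caught before hours are spent on the model.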

Refactor RegressionResults object

There are a few things that I think would improve this object

  • Come to think of it, naming it RegressionResults is a bit of a misnomer: it's really a model rather than a results object.

  • The methods that it encapsulates have different functionalities. For instance, MixedLM has no prediction functionality, so it may not be appropriate to have everything within the same class. However, there are shared functionalities between the classes.

  • On top of that, there are some functionalities of the statsmodels RegressionResults objects that we are not exposing, such as fit(), which would allow the models to be trained with different parameters. Exposing this sort of API would also bring this more in line with the scikit-learn API.

Here's what I'm thinking

  • Rename RegressionResults to something like Model or RegressionModel and make it an abstract base class. This naming will require a little careful thought, considering that we will be expanding this functionality to include classification as well in the near future.

  • Have separate Model classes for each of the methods, such as OLSModel or LMEModel
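
A rough sketch of that layout, using the class names proposed above. The method bodies are placeholders (an intercept-only "fit" that just takes the mean of each balance), not the real statsmodels calls. The point is only that shared behaviour lives in the base class while predict exists only where it makes sense.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Shared behaviour; cannot be instantiated directly."""

    def __init__(self, balances):
        self.balances = balances  # balance name -> list of values
        self.params = None        # filled in by fit()

    @abstractmethod
    def fit(self):
        """Fit the model and return self."""

    def summary(self):
        if self.params is None:
            raise RuntimeError("call fit() first")
        return "\n".join(f"{name}\t{value:.3f}"
                         for name, value in self.params.items())


class OLSModel(Model):
    def fit(self):
        # Placeholder: intercept-only fit, i.e. the mean of each balance.
        self.params = {n: sum(v) / len(v) for n, v in self.balances.items()}
        return self

    def predict(self):
        return {n: [self.params[n]] * len(v)
                for n, v in self.balances.items()}


class LMEModel(Model):
    # Deliberately has no predict(), mirroring the point about MixedLM.
    def fit(self):
        self.params = {n: sum(v) / len(v) for n, v in self.balances.items()}
        return self
```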

On top of that, it would be great if this object could support some querying functionality. Specifically

  • Allow for the querying of subtrees. If I query say internal node y7, I could retrieve all of the tips within that subtree.

  • Allow for intuitive interpretation of left/right balances. This ties into #69. It would be great to have functionality that could state right off the bat which subtree is more abundant than the other.
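
The subtree queries could look something like the sketch below. Node is a hypothetical minimal binary tree written for this example, not the tree class gneiss uses, and the node names are made up.

```python
class Node:
    """Minimal binary tree node, for illustration only."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right

    def is_tip(self):
        return self.left is None and self.right is None

    def find(self, name):
        """Depth-first search for the node with this name."""
        if self.name == name:
            return self
        for child in (self.left, self.right):
            if child is not None:
                found = child.find(name)
                if found is not None:
                    return found
        return None

    def tips(self):
        """Names of all tips below (or at) this node, left to right."""
        if self.is_tip():
            return [self.name]
        names = []
        for child in (self.left, self.right):
            if child is not None:
                names.extend(child.tips())
        return names


def describe_balance(tree, name):
    """Tips on each side of an internal node (assumes both children exist)."""
    node = tree.find(name)
    return {"left": node.left.tips(), "right": node.right.tips()}


# y0 splits {a, b} from {c}; y1 splits a from b.
tree = Node("y0", Node("y1", Node("a"), Node("b")), Node("c"))
```

Here tree.find("y1").tips() gives the tips within that subtree, and describe_balance(tree, "y0") lists which tips sit on the left and right of the balance.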

Sparsity filters

  • Filter out entire clades based on abundance
  • Collapse entire clades together

Log files

Having log files available for some of the pipelines, especially the regression functions, would be very nifty for debugging.

Tree sorting algorithms

  • Ladderize - sort the subtrees according to how many tips are contained. Similar to what figtree and Dendroscope do.
  • Gradient sort - similar to what has been implemented in #16 , but on tree leaves rather than tables.
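
For the ladderize case, a minimal sketch of the idea: recursively reorder the children of every node by the number of tips they contain. Node here is a hypothetical stand-in tree class, and putting the smaller subtree first is an arbitrary choice. Reversing the sort gives the opposite orientation.

```python
class Node:
    """Minimal tree node, for illustration only."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def tip_count(self):
        if not self.children:
            return 1
        return sum(child.tip_count() for child in self.children)

    def tip_names(self):
        if not self.children:
            return [self.name]
        names = []
        for child in self.children:
            names.extend(child.tip_names())
        return names


def ladderize(node):
    """Sort children in place by tip count, smallest subtree first."""
    for child in node.children:
        ladderize(child)
    # list.sort is stable, so subtrees of equal size keep their order.
    node.children.sort(key=lambda child: child.tip_count())
    return node
```

Recounting tips at every node is quadratic in the worst case. A real implementation would probably cache the counts, but that is left out to keep the sketch short.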

Docker container

It would be nice if there could be a Docker container for this.

Transform module

It may be advantageous to break the ilr transform out of the regression module.

Specifically, this would change all of the regression modules so that they take in a balance table rather than an OTU table. That way, if multiple analyses were to be run on the same balance table, the balance table doesn't need to be recomputed every iteration.

This is also relevant to #79
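
As a rough illustration of what a standalone transform step would produce, here is the standard ilr balance for one internal node, b = sqrt(r*s / (r + s)) * ln(g(left) / g(right)), where r and s are the numbers of tips on each side and g is the geometric mean. The function names and the dict-based table are made up for this example, and treating the left subtree as the numerator is a convention assumed here.

```python
import math


def geometric_mean(values):
    # Requires strictly positive values; zeros must be handled beforehand.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance(sample, left, right):
    """ilr balance of one sample for one split (left over right)."""
    r, s = len(left), len(right)
    scale = math.sqrt(r * s / (r + s))
    numerator = geometric_mean([sample[tip] for tip in left])
    denominator = geometric_mean([sample[tip] for tip in right])
    return scale * math.log(numerator / denominator)


def ilr_table(table, basis):
    """Compute every balance once so later analyses can reuse the result.

    table: sample id -> {tip name: abundance}
    basis: balance name -> (left tip names, right tip names)
    """
    return {
        sample_id: {name: balance(sample, left, right)
                    for name, (left, right) in basis.items()}
        for sample_id, sample in table.items()
    }
```

With this split, ols and mixedlm would both take the output of ilr_table directly. A positive balance means the left subtree has the larger geometric mean, which also covers the left/right interpretation question.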

Heatmap improvements

  • enable total width and total height of images in px or inches rather than width and height of cells
  • enable coloring of individual labels
  • enable label resizing for individual labels
  • enable layout embedding
  • enable circular plots

failed to install

I tried to install gneiss on barnacle as described in the README. This is what happened. Is the README still up to date?

barnacle x86_64 ~/>conda create -n gneiss_env python=3
barnacle x86_64 ~/>source activate gneiss_env
(gneiss_env) barnacle x86_64 ~/>conda install pyqt=4.11.4
Fetching package metadata .........
Solving package specifications: ....

UnsatisfiableError: The following specifications were found to be in conflict:

  - pyqt 4.11.4*
  - python 3.6*
Use "conda info <package>" to see the dependencies for each package.

Make RegressionResults serializable

It would be nice to be able to save the RegressionResults to a file, to avoid rerunning analyses.

I'm thinking about having the following formats

  • read_pickle, write_pickle
  • read_json, write_json
  • summary (i.e. a tab delimited file summarizing the statistics).
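
A minimal sketch of how those three formats could fit together. Results is a hypothetical container holding only a formula and a coefficient table, far less than the real object would carry. The pickle methods store a plain dict rather than the instance, so saved files do not depend on the class's import path.

```python
import json
import pickle


class Results:
    """Hypothetical container, for illustration only."""

    def __init__(self, formula, coefficients):
        self.formula = formula
        self.coefficients = coefficients  # balance -> {covariate: value}

    def to_dict(self):
        return {"formula": self.formula, "coefficients": self.coefficients}

    def write_json(self, path):
        with open(path, "w") as handle:
            json.dump(self.to_dict(), handle, indent=2)

    @classmethod
    def read_json(cls, path):
        with open(path) as handle:
            return cls(**json.load(handle))

    def write_pickle(self, path):
        with open(path, "wb") as handle:
            pickle.dump(self.to_dict(), handle)

    @classmethod
    def read_pickle(cls, path):
        # Only unpickle files from a trusted source.
        with open(path, "rb") as handle:
            return cls(**pickle.load(handle))

    def summary(self):
        """Tab-delimited table: one row per balance, one column per covariate."""
        covariates = sorted(
            {c for row in self.coefficients.values() for c in row})
        lines = ["\t".join(["balance"] + covariates)]
        for name, row in self.coefficients.items():
            lines.append(
                "\t".join([name] + [str(row[c]) for c in covariates]))
        return "\n".join(lines)
```

JSON has the advantage of being readable outside Python, while pickle can hold arbitrary Python objects if the results grow more complex.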
