GithubHelp home page GithubHelp logo

hrayrhar / linearcorex Goto Github PK

View Code? Open in Web Editor NEW

This project forked from gregversteeg/linearcorex

0.0 3.0 0.0 25.75 MB

Linear Total Correlation Explanation

License: GNU Affero General Public License v3.0

Python 100.00%

linearcorex's Introduction

Latent Factor Models Based on Linear Total Correlation Explanation (CorEx)

Linear CorEx finds latent factors that are as informative as possible about relationships in the data. The approach is described in this paper: Low Complexity Gaussian Latent Factor Models and a Blessing of Dimensionality. This is useful for covariance estimation, clustering related variables, and dimensionality reduction, especially in the high-dimensional, under-sampled regime.

To install:

pip install linearcorex

Mathematically, the objective is to find factors, y, where y = W x and
x in R^n is the data and W is an m by n weight matrix. We are minimizing TC(X|Y) + TC(Y) where TC is the "total correlation" or multivariate mutual information. This objective is optimized when X's are independent after conditioning on Y's, and the Y's themselves are independent. Instead of heuristically upper bounding this objective as we do for discrete CorEx, we are able to optimize it exactly in the linear case. While this extension required assumptions of linearity, the advantage is that the code is pretty fast since it only relies on matrix algebra. In principle it could be further accelerated using GPUs.

Without further constraints, the optima of this objective may have an undesirable property: information about the X_i's can be stored "synergistically" in the latent factors. In other words, to predict a single variable you need to combine info from all the latent factors. Therefore, we add a constraint that the solutions should be non-synergistic (latent factors are individually informative about each variable X_i). This also recovers the property of the original lower bound formulation from AISTATS that each latent factor has a non-negative added contribution towards TC. Note that by default, we constrain solutions to eliminate synergy. But, you can turn it off by setting eliminate_synergy=False in the python API or -a from the command line. For making nice trees, it should be left on (e.g. for personality data or ADNI data).

To test the command line interface, try:

cd $INSTALL_DIRECTORY/linearcorex/
python vis_corex.py ../tests/data/test_big5.csv --layers=5,1 --verbose=1 --no_row_names -o big5
python vis_corex.py ../tests/data/adni_blood.csv --layers=30,5,1 --missing=-1e6 --verbose=1 -o adni
python vis_corex.py ../tests/data/matrix.tcga_ov.geneset1.log2.varnorm.RPKM.txt --layers=30,5,1 --delimiter=' ' --verbose=1 --gaussianize="outliers" -o gene

Each of these examples generates pairwise plots of relationships and a graph.

The python API uses the sklearn conventions of fit/transform.

import linearcorex as lc
import numpy as np

out = lc.Corex(n_hidden=5, verbose=True)  # A Corex model with 5 factors
X = np.random.random((100, 50))  # Random data with 100 samples and 50 variables
out.fit(X)  # Fit the model on data
y = out.transform(X)  # Transform data into latent factors
print(out.clusters)  # See the clusters
cov = out.get_covariance()  # The covariance matrix

Missing values can be specified, but are just imputed in a naive way.

Papers

See Sifting Common Info... and Maximally informative representations... for work building up to this method. The main paper describing the method is Low Complexity Gaussian Latent Factor Models and a Blessing of Dimensionality. The connections with the idea of "synergy" will be described in future work.

Troubleshooting visualization

To get the visualization of the hierarchy looking nice sometimes takes a little effort. To get graphs to compile correctly do the following. Using "brew" to install, you need to do "brew install gts" followed by "brew install --with-gts graphviz". The (hacky) way that the visualizations are produced is the following. The code, vis_corex.py, produces a text file called "graphs/graph.dot". This just encodes the edges between nodes in dot format. Then, the code calls a command line utility called sfdp that is part of graphviz, "sfdp graph.dot -Tpdf -Earrowhead=none -Nfontsize=12 -GK=2 -Gmaxiter=1000 -Goverlap=False -Gpack=True -Gpackmode=clust -Gsep=0.01 -Gsplines=False -o graph_sfdp.pdf". These dot files can also be opened with OmniGraffle if you would like to be able to manipulate them by hand. If you want, you can try to recompile graphs yourself with different options to make them look nicer. Or you can edit the dot files to get effects like colored nodes, etc.

linearcorex's People

Contributors

gregversteeg avatar

Watchers

James Cloos avatar Hrayr Harutyunyan avatar paper2code - bot avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.