
sparsecca

Python implementations of sparse CCA algorithms. Includes:

  • Sparse (multiple) CCA based on Penalized Matrix Decomposition (PMD) from Witten et al., 2009.
  • Sparse CCA based on Iterative Penalized Least Squares from Mai et al., 2019.

One main difference between the two is that while the first is very simple, it assumes the datasets to be white (i.e., to have approximately identity within-set covariance).
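For a quick feel of the PMD-based variant, here is a minimal sketch; the import path and the cca signature follow the usage shown in the issues below, and the data are just random placeholders:

import numpy as np
from sparsecca._cca_pmd import cca

# two datasets with matching samples (rows) and different features (columns)
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))
Z = rng.standard_normal((20, 8))

# one sparse component; penalties in (0, 1] scale the L1 constraints
u, v, d = cca(X, Z, penaltyx=0.5, penaltyz=0.5, K=1)
print(u.shape, v.shape)  # one weight vector per dataset per component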

Installation

Dependencies

In addition to basic scientific packages such as numpy and scipy, the iterative penalized least squares method requires either glmnet_python or pyglmnet to be installed.
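The either/or requirement could be checked at runtime like this (an illustrative sketch only; the package's own import logic may differ):

# try the two supported elastic-net backends in turn (illustrative only)
try:
    import glmnet_python  # noqa: F401
    backend = 'glmnet_python'
except ImportError:
    import pyglmnet  # noqa: F401
    backend = 'pyglmnet'
print('Using', backend, 'as the penalized regression backend')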

The package itself can be installed with:

git clone https://github.com/Teekuningas/sparsecca.git
cd sparsecca
python setup.py install

Usage

See the examples at https://teekuningas.github.io/sparsecca.

Acknowledgements

Many thanks to the original authors; see Witten et al., 2009 and Mai et al., 2019.


Issues

What is the license for this code?

Thanks for sharing this useful code. In case you are planning to release this code as open source, what would be the license for this repo? Thanks!

Cannot reproduce CCA results as computed from PMA package

First of all, thank you very much for implementing the sparse CCA from Witten et al.! As a Python user, I would be glad not to have to switch programming languages and to be able to write one analysis pipeline purely in Python. Besides, your code allows me to integrate your CCA function into a scikit-learn pipeline. I have one question, though: I was trying to reproduce the example from Witten et al. as described on page 7 of their documentation, but I couldn't get the same results. I am not a math expert, so I cannot really explain what might cause these deviations. Maybe I did not use the exact same parameters? The only obvious difference between your package and the PMA package is that your function only demeans x and z, while Witten et al. use z-standardization. But I don't think that this drives the deviation?
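To rule out the standardization difference, one could z-standardize the inputs in Python before calling cca (a sketch; it reuses the x.txt and z.txt files written by the R script below, and ddof=1 matches R's sample standard deviation):

import numpy as np
from sparsecca._cca_pmd import cca

x = np.loadtxt('./x.txt')
z = np.loadtxt('./z.txt')

# standardize columns the way PMA does (sparsecca reportedly only demeans)
x_std = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
z_std = (z - z.mean(axis=0)) / z.std(axis=0, ddof=1)

u, v, d = cca(x_std, z_std, penaltyx=0.3, penaltyz=0.3, K=3, niter=15)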

Here's the code from Witten et al.:

## Run example from PMA package ################################################

# first, do CCA with type="standard"
# A simple simulated example
set.seed(3189)
u <- matrix(c(rep(1,25),rep(0,75)),ncol=1)
v1 <- matrix(c(rep(1,50),rep(0,450)),ncol=1)
v2 <- matrix(c(rep(0,50),rep(1,50),rep(0,900)),ncol=1)
x <- u%*%t(v1) + matrix(rnorm(100*500),ncol=500)
z <- u%*%t(v2) + matrix(rnorm(100*1000),ncol=1000)

# Can run CCA with default settings, and can get e.g. 3 components
out <- CCA(x,z,typex="standard",typez="standard",K=3)

## Save x,z and canonical weights as .txt files 
write.table(x,file="x.txt",row.names=F,col.names=F)
write.table(z,file="z.txt",row.names=F,col.names=F)
write.table(out$u,file="out_u.txt",row.names=F,col.names=F)
write.table(out$v,file="out_v.txt",row.names=F,col.names=F)

And here's me trying to reproduce the canonical weights:

import numpy as np
from sparsecca._cca_pmd import cca
import matplotlib.pyplot as plt

# load x and z from PMA example
x = np.loadtxt('./x.txt')
z = np.loadtxt('./z.txt')

# load canonical weights produced by PMA package
out_u = np.loadtxt('./out_u.txt')
out_v = np.loadtxt('./out_v.txt')

# run CCA with the sparsecca package and the same settings as used in
# the PMA package. NOTE: although the documentation of the PMA package
# says that the penalties are NULL by default, that does not seem to be
# the case; instead, 0.3 seems to be the default value for both x and z.
u, v, d = cca(x, z, penaltyx=0.3, penaltyz=0.3, K=3, niter=15)

# check canonical weights from PMA package against sparsecca package
print(np.array_equal(out_u,u))
print(np.array_equal(out_v,v))

# plot distribution of deviations (ignore zeros for plotting purposes)
deviations = np.abs(out_u-u).flatten()
deviations = deviations[deviations != 0]
plt.hist(deviations)

[Figure: histogram of the nonzero absolute deviations between the PMA and sparsecca canonical weights]

As you can see, most of the canonical weights show only small deviations, but a few differ quite a lot between the Python implementation and the PMA package.
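Since iterative solvers typically agree only up to a tolerance and up to the sign of each component (flipping the sign of a (u_k, v_k) pair leaves the objective unchanged), a tolerance-based comparison may be more informative than np.array_equal. A sketch, reusing u, v, out_u and out_v from the script above:

# compare components up to sign flips and a numeric tolerance
for k in range(u.shape[1]):
    sign = np.sign(np.dot(out_u[:, k], u[:, k])) or 1.0
    close_u = np.allclose(out_u[:, k], sign * u[:, k], atol=1e-2)
    close_v = np.allclose(out_v[:, k], sign * v[:, k], atol=1e-2)
    print(k, close_u, close_v)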

Shouldn't the coefficients of sparse CCA be small?

First of all, thanks a lot for sharing this code! I was running your example (plot_cca.ipynb) and I find some of the outputs a bit puzzling. One of my concerns is that the weights do not change when one introduces the penalties (i.e. when penaltyu/penaltyv is lowered from 1 to 0.9). They stay exactly the same, even when one sets penaltyu/penaltyv to 0.8. Furthermore, when the penalty for v is set to 0.7 or smaller, the weights become NaNs or infs. Why is that? Finally, it seems to me that the weights are not sparse, as there are many large negative weights, particularly for Z: -0.319, -12.652, -14.360, -10.534, 1.000.
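To make the reported behavior reproducible, here is a small sweep over the v penalty on random data (a sketch; the parameter names follow the cca signature used in the previous issue, whereas the notebook apparently calls them penaltyu/penaltyv):

import numpy as np
from sparsecca._cca_pmd import cca

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Z = rng.standard_normal((50, 5))

# lower the penalty on v step by step and check whether weights stay finite
for penalty in (1.0, 0.9, 0.8, 0.7, 0.6):
    u, v, d = cca(X, Z, penaltyx=1.0, penaltyz=penalty, K=1)
    print(penalty, 'finite:', bool(np.isfinite(v).all()))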

If my interpretation is correct, you are implementing Algorithm 3 from the original publication (doi: 10.1093/biostatistics/kxp008) for the determination of the weights. Then, in your example, the following should hold:

sum_i |v_i| <= penalty_v * sqrt(5)

This is not the case. Do you agree? If not, what is wrong in my interpretation?
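A direct numeric check of that inequality might look like this (a sketch; v is the Z-side weight matrix from the notebook example, which has 5 features, hence sqrt(5)):

import numpy as np

penalty_v = 0.9  # whichever penalty was passed for v
c2 = penalty_v * np.sqrt(5)
for k in range(v.shape[1]):
    l1 = np.abs(v[:, k]).sum()
    print(k, l1, '<=', c2, l1 <= c2)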

Thanks in advance for your attention!
