
sparsecca

Python implementations of sparse CCA algorithms. Includes:

  • Sparse (multiple) CCA based on Penalized Matrix Decomposition (PMD) from Witten et al., 2009.
  • Sparse CCA based on Iterative Penalized Least Squares from Mai et al., 2019.

One main difference between the two is that while the first is very simple, it assumes the datasets to be white (i.e., to have approximately identity within-set covariance).
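For a quick feel of the PMD-based variant, here is a minimal sketch; the import path and the cca signature follow the usage shown in the issues below, and the data are just random placeholders:

import numpy as np
from sparsecca._cca_pmd import cca

# two datasets with matching samples (rows) and different features (columns)
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))
Z = rng.standard_normal((20, 8))

# one sparse component; penalties in (0, 1] scale the L1 constraints
u, v, d = cca(X, Z, penaltyx=0.5, penaltyz=0.5, K=1)
print(u.shape, v.shape)  # one weight vector per dataset per component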

Installation

Dependencies

In addition to basic scientific packages such as numpy and scipy, the iterative penalized least squares method requires either glmnet_python or pyglmnet to be installed.
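The either/or requirement could be checked at runtime like this (an illustrative sketch only; the package's own import logic may differ):

# try the two supported elastic-net backends in turn (illustrative only)
try:
    import glmnet_python  # noqa: F401
    backend = 'glmnet_python'
except ImportError:
    import pyglmnet  # noqa: F401
    backend = 'pyglmnet'
print('Using', backend, 'as the penalized regression backend')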

The package itself can be installed with:

git clone https://github.com/Teekuningas/sparsecca.git
cd sparsecca
python setup.py install

Usage

See the examples at https://teekuningas.github.io/sparsecca.

Acknowledgements

Many thanks to the original authors; see Witten et al., 2009 and Mai et al., 2019.


Issues

What is the license for this code?

Thanks for sharing this useful code. In case you are planning to release this code as open source, what would be the license for this repo? Thanks!

Cannot reproduce CCA results as computed from PMA package

First of all, thank you very much for implementing the sparse CCA from Witten et al.! As a Python user, I would be glad not to have to switch programming languages and to be able to write one analysis pipeline purely in Python. Besides, your code allows me to integrate your CCA function into a scikit-learn pipeline. I have one question, though: I was trying to reproduce the example from Witten et al. as described on page 7 of their documentation, but I couldn't get the same results. I am not a math expert, so I cannot really explain what might cause these deviations. Maybe I did not use the exact same parameters? The only obvious difference between your package and the PMA package is that your function only demeans x and z, while Witten et al. use z-standardization. But I don't think that this drives the deviation?
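To rule out the standardization difference, one could z-standardize the inputs in Python before calling cca (a sketch; it reuses the x.txt and z.txt files written by the R script below, and ddof=1 matches R's sample standard deviation):

import numpy as np
from sparsecca._cca_pmd import cca

x = np.loadtxt('./x.txt')
z = np.loadtxt('./z.txt')

# standardize columns the way PMA does (sparsecca reportedly only demeans)
x_std = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
z_std = (z - z.mean(axis=0)) / z.std(axis=0, ddof=1)

u, v, d = cca(x_std, z_std, penaltyx=0.3, penaltyz=0.3, K=3, niter=15)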

Here's the code from Witten et al.:

## Run example from PMA package ################################################

# first, do CCA with type="standard"
# A simple simulated example
set.seed(3189)
u <- matrix(c(rep(1,25),rep(0,75)),ncol=1)
v1 <- matrix(c(rep(1,50),rep(0,450)),ncol=1)
v2 <- matrix(c(rep(0,50),rep(1,50),rep(0,900)),ncol=1)
x <- u%*%t(v1) + matrix(rnorm(100*500),ncol=500)
z <- u%*%t(v2) + matrix(rnorm(100*1000),ncol=1000)

# Can run CCA with default settings, and can get e.g. 3 components
out <- CCA(x,z,typex="standard",typez="standard",K=3)

## Save x,z and canonical weights as .txt files 
write.table(x,file="x.txt",row.names=F,col.names=F)
write.table(z,file="z.txt",row.names=F,col.names=F)
write.table(out$u,file="out_u.txt",row.names=F,col.names=F)
write.table(out$v,file="out_v.txt",row.names=F,col.names=F)

And here's me trying to reproduce the canonical weights:

import numpy as np
from sparsecca._cca_pmd import cca
import matplotlib.pyplot as plt

# load x and z from PMA example
x = np.loadtxt('./x.txt')
z = np.loadtxt('./z.txt')

# load canonical weights produced by PMA package
out_u = np.loadtxt('./out_u.txt')
out_v = np.loadtxt('./out_v.txt')

# run CCA with the sparsecca package and the same settings as used in
# the PMA package. NOTE: although the documentation of the PMA package
# says that the penalties are NULL by default, that does not seem to be
# the case; instead, 0.3 seems to be the default value for both x and z.
u, v, d = cca(x, z, penaltyx=0.3, penaltyz=0.3, K=3, niter=15)

# check canonical weights from PMA package against sparsecca package
print(np.array_equal(out_u,u))
print(np.array_equal(out_v,v))

# plot distribution of deviations (ignore zeros for plotting purposes)
deviations = np.abs(out_u-u).flatten()
deviations = deviations[deviations != 0]
plt.hist(deviations)

[Figure: histogram of the nonzero absolute deviations between the PMA and sparsecca canonical weights]

As you can see, most of the canonical weights show only small deviations, but a few differ quite a lot between the Python implementation and the PMA package.
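Since iterative solvers typically agree only up to a tolerance and up to the sign of each component (flipping the sign of a (u_k, v_k) pair leaves the objective unchanged), a tolerance-based comparison may be more informative than np.array_equal. A sketch, reusing u, v, out_u and out_v from the script above:

# compare components up to sign flips and a numeric tolerance
for k in range(u.shape[1]):
    sign = np.sign(np.dot(out_u[:, k], u[:, k])) or 1.0
    close_u = np.allclose(out_u[:, k], sign * u[:, k], atol=1e-2)
    close_v = np.allclose(out_v[:, k], sign * v[:, k], atol=1e-2)
    print(k, close_u, close_v)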

Shouldn't the coefficients of sparse CCA be small?

First of all, thanks a lot for sharing this code! I was running your example (plot_cca.ipynb) and I find some of the outputs a bit puzzling. One of my concerns is that the weights do not change when one introduces the penalties (i.e. when penaltyu/penaltyv is lowered from 1 to 0.9). They stay exactly the same, even when one sets penaltyu/penaltyv to 0.8. Furthermore, when the penalty for v is set to 0.7 or smaller, the weights become NaNs or infs. Why is that? Finally, it seems to me that the weights are not sparse, as there are many large negative weights, particularly for Z: -0.319, -12.652, -14.360, -10.534, 1.000.
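To make the reported behavior reproducible, here is a small sweep over the v penalty on random data (a sketch; the parameter names follow the cca signature used in the previous issue, whereas the notebook apparently calls them penaltyu/penaltyv):

import numpy as np
from sparsecca._cca_pmd import cca

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Z = rng.standard_normal((50, 5))

# lower the penalty on v step by step and check whether weights stay finite
for penalty in (1.0, 0.9, 0.8, 0.7, 0.6):
    u, v, d = cca(X, Z, penaltyx=1.0, penaltyz=penalty, K=1)
    print(penalty, 'finite:', bool(np.isfinite(v).all()))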

If my interpretation is correct, you are implementing Algorithm 3 from the original publication (doi: 10.1093/biostatistics/kxp008) for the determination of the weights. Then, in your example, the following should hold:

sum_i |v_i| <= penalty_v * sqrt(5)

This is not the case. Do you agree? If not, what is wrong in my interpretation?
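A direct numeric check of that inequality might look like this (a sketch; v is the Z-side weight matrix from the notebook example, which has 5 features, hence sqrt(5)):

import numpy as np

penalty_v = 0.9  # whichever penalty was passed for v
c2 = penalty_v * np.sqrt(5)
for k in range(v.shape[1]):
    l1 = np.abs(v[:, k]).sum()
    print(k, l1, '<=', c2, l1 <= c2)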

Thanks in advance for your attention!
