microsoft / bindvae
Variational Auto Encoders for learning binding signatures of transcription factors

License: MIT


Introduction

Source code for the BindVAE paper, which uses Variational Auto Encoders to learn the binding signatures of transcription factors:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02723-w

Installation

  1. Installation process for the machine learning model

Please create a conda environment, either using the supplied yaml file (tfp.yaml) or manually, as shown below.

Using the supplied yaml file:

conda env create --name tfp --file=tfp.yaml

If you have most of the dependencies already installed, the following simpler setup will suffice:

conda create -n tfp python=3.7
conda activate tfp
conda install tensorflow-gpu
conda install tensorflow-probability
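After installation, you can verify the environment with a quick sanity check (a minimal sketch, not part of the repo; tf.config.list_physical_devices assumes TensorFlow 2.x):

import tensorflow as tf
import tensorflow_probability as tfp

# Confirm both packages import and report their versions.
print("TensorFlow:", tf.__version__)
print("TensorFlow Probability:", tfp.__version__)
# List visible GPUs (an empty list means CPU-only).
print("GPUs:", tf.config.list_physical_devices("GPU"))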

Note: With some versions of tensorflow / tensorflow-probability, you might get a "KL divergence is negative" error during training. We have not yet figured out why this occurs.

  2. Dependencies for the feature generation

The feature generation code uses R. Install the required packages as follows:

install.packages("BiocManager")
BiocManager::install("GenomicRanges")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")

install.packages("remotes")
remotes::install_github("ManuSetty/ChIPKernels")

TRAINING

python run.py --model_dir model_dir --train_path data/gm12878_all8merfeats_1k_examples.txt --eval_path data/gm12878_all8merfeats_1k_examples.txt --test_path data/SELEX_probes_all8merfeats_1k_samples.txt --num_topics 25 --prior_initial_value 10 --mode train --vocab_path data/vocabulary_all8mers_with_wildcards.npy

The parameters that are most sensitive, and the best ones to tweak, are:

batch_size (currently set at 32)
num_topics (size of the hidden bottleneck layer)
prior_initial_value
prior_burn_in_steps

You can also modify the number of layers and their width in the encoder.

TEST (or getting TOPIC POSTERIORS)

If you want to use a previously saved model to do inference on new data, run the code in "test" mode as follows:

python run.py --model_dir model_dir --test_path data/SELEX_probes_features.txt --num_topics 25 --prior_initial_value 10 --mode test --vocab_path data/vocabulary_all8mers_with_wildcards.npy

Output: a matrix of size N x K, where N is the number of examples in the input file and K is the number of topics / latent dimensions.
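For downstream analysis you can load this matrix and, for example, find the dominant topic for each input example. A minimal sketch (not part of the repo), assuming the posteriors are saved as a whitespace-delimited text matrix at a hypothetical path; check run.py for the actual output location and format:

import numpy as np

# Load the N x K posterior matrix (the path and format are assumptions).
posteriors = np.loadtxt("model_dir/topic_posteriors.txt")

# The highest-probability topic for each of the N examples.
dominant_topic = posteriors.argmax(axis=1)
print(dominant_topic[:10])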

K-MER DISTRIBUTIONS (DECODER PARAMETERS that encode the TOPIC distributions over words)

python run.py --model_dir model_dir --num_topics 25 --prior_initial_value 10 --mode beta --vocab_path data/vocabulary_all8mers_with_wildcards.npy
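To map the decoder parameters back to k-mers, you can invert the vocabulary dictionary. A minimal sketch (not part of the repo), assuming beta has been exported as a NumPy array of shape K x V; the file name beta.npy is hypothetical, so check run.py for how this mode actually writes its output:

import numpy as np

# Load the vocabulary dict (k-mer -> index) and invert it.
vocab = np.load("data/vocabulary_all8mers_with_wildcards.npy",
                allow_pickle=True).item()
id_to_kmer = {i: k for k, i in vocab.items()}

# Hypothetical export of the decoder's topic-word matrix, shape (K, V).
beta = np.load("beta.npy")

# Print the ten highest-weight k-mers for each topic.
for t in range(beta.shape[0]):
    top = np.argsort(beta[t])[::-1][:10]
    print(f"topic {t}: " + " ".join(id_to_kmer[i] for i in top))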

FILE FORMATS

Data file format:

A list of feature ids separated by spaces, one example (document) per line. If a feature has count k, it appears k times in the list, and the feature ids should be in increasing order. If you want to change this input format, please look at sparse_matrix_dataset (or let me know and I can help with it). Below is a file with two input examples; a sketch for generating this format follows it. Also see the attached sample file (data/gm12878_all8merfeats_1k_examples.txt).

112 113 113 113 122 134 144 144 144 144 159 178
115 115 189 194 194 202 202 202
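A minimal sketch (not part of the repo) that writes feature counts in this format, with each feature id repeated count times and ids sorted in increasing order; the two example dicts reproduce the lines shown above:

# Two hypothetical examples as {feature_id: count} dicts.
examples = [
    {112: 1, 113: 3, 122: 1, 134: 1, 144: 4, 159: 1, 178: 1},
    {115: 2, 189: 1, 194: 2, 202: 3},
]

with open("train_features.txt", "w") as f:  # hypothetical output path
    for counts in examples:
        tokens = []
        for feat_id in sorted(counts):           # increasing order
            tokens.extend([str(feat_id)] * counts[feat_id])
        f.write(" ".join(tokens) + "\n")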

Vocabulary format:

Please see the sample vocabulary file (a .npy file) for how to format the mapping; it stores a dictionary. For example, below are the top few lines of the vocabulary for the k-mer model, which were converted into the vocabulary_all8mers_with_wildcards.npy file; a sketch for building such a file follows the listing. If you load the dictionary, d['EMPTY']=0, d['AAAAAAAA']=1, and so on. Please keep the first dictionary entry a dummy feature like 'EMPTY' and assign it the index 0. Obviously, none of the examples will contain this feature :-) This is due to how the indexing is done after loading the vocabulary (i.e. the useful features should have indices >= 1).

EMPTY
AAAAAAAA
AAAAAAAC
AAAAAAAG
AAAAAAAT
AAAAAACA
AAAAAACC
AAAAAACG
AAAAAACT
AAAAAAGA
AAAAAAGC
AAAAAAGG

OUTPUTS

To monitor what the model is learning, you can look at the periodic outputs. The frequency of these outputs is controlled by the parameter viz_steps in the code. It is currently set to 20000, but feel free to set it to 1000 or so for initial runs, until you understand what's going on.

Here's what the output looks like for k-mers and ATAC-seq peaks. Only the top few k-mers per topic are printed; again, this can be controlled by looking at the method get_topics_strings.

elbo -2646.7239

kl 32.239582

loss 2646.5957

perplexity 79969.914

reconstruction -2614.485

topics b'index=92 alpha=4.94 CCGCCNNC NNGGGCGG NNCCGCCC NNGGCGGG CCGCNNCC NNCCCGCC CNNCGCCC CCCGCNNC GCNNCGCC CNNCCGCC'
b'index=14 alpha=1.80 NNCAGAGA NNTCTCTG NNTCTGTG NNCACAGA NNCTCTGT NNACAGAG CACAGNNA CAGAGNNA ANNCACAG NNTCACAG'
b'index=17 alpha=1.74 CCCNNCCC CCNNCCCC CCCCNNCC AGGGGNNG NNGGGGAG NNCTCCCC CNNCCCCC CNNCCCCA CCCCANNC CCCCTNNC'

....

global_step 160000
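If you want to track topics across runs programmatically, the printed topic strings can be parsed. A minimal sketch (not part of the repo), assuming the lines follow the format shown above:

import re

# One topic line as printed above (byte strings decoded first).
line = "index=92 alpha=4.94 CCGCCNNC NNGGGCGG NNCCGCCC"

m = re.match(r"index=(\d+) alpha=([\d.]+) (.+)", line)
if m:
    index = int(m.group(1))        # topic index
    alpha = float(m.group(2))      # topic prior weight
    kmers = m.group(3).split()     # top k-mers for this topic
    print(index, alpha, kmers[:3])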

Contribute

TODO: Explain how other users and developers can contribute to make your code better.

bindvae's People

Contributors

meghana-kshirsagar · microsoft-github-operations[bot] · microsoftopensource


bindvae's Issues

adding motifs to jolma_selex_pwms_converted_all.meme

There are some motifs that I could not find in jolma_selex_pwms_converted_all.meme and that I would like to add manually. Where can I find the PWMs for these motifs? I checked the existing motifs' PWMs in jolma_selex_pwms_converted_all.meme to see whether they match the HOMER PWMs, but they don't. For example,

MOTIF RUNX2_JOLMA
letter-probability matrix: alength= 4 w= 16 nsites= 1 E= 0
0.3274 0.1526 0.0999 0.4202
0.5230 0.0805 0.2794 0.1171
0.7760 0.0633 0.1316 0.0291
0.0121 0.9321 0.0388 0.0170
0.0104 0.9424 0.0313 0.0159
0.1641 0.0147 0.8016 0.0197
0.0123 0.9537 0.0099 0.0241
0.8497 0.0067 0.0347 0.1089
0.7364 0.0395 0.2000 0.0241
0.8122 0.0528 0.0974 0.0376
0.0565 0.8585 0.0669 0.0181
0.0430 0.8650 0.0407 0.0513
0.2151 0.0502 0.7016 0.0331
0.0554 0.8002 0.0622 0.0822
0.6446 0.0835 0.1859 0.0860
0.3991 0.1165 0.3222 0.1622

is different from runx2.motif in the HOMER motifs:

NWAACCACADNN RUNX2(Runt)/PCa-RUNX2-ChIP-Seq(GSE33889)/Homer 7.672200 -948.877602 0 T:737.0(60.36%),B:143.0(9.82%),P:1e-412 Tpos:100.7,Tstd:46.4,Bpos:97.8,Bstd:80.1,StrandBias:0.0,Multiplicity:1.56
0.293 0.299 0.206 0.203
0.393 0.225 0.098 0.284
0.468 0.175 0.244 0.113
0.796 0.104 0.099 0.001
0.001 0.997 0.001 0.001
0.001 0.997 0.001 0.001
0.997 0.001 0.001 0.001
0.001 0.997 0.001 0.001
0.997 0.001 0.001 0.001
0.332 0.146 0.324 0.198
0.352 0.213 0.211 0.224
0.249 0.303 0.207 0.241

I am looking to manually add RUNX1 and RBPJ.
