
vConv paper

This is the repository for reproducing figures and tables in the paper "Identifying complex sequence patterns in massive omics data with a variable-convolutional layer in deep neural network".

A Keras-based implementation of vConv is available at https://github.com/gao-lab/vConv.

Prerequisites

Prerequisites for all but training Basset-related models

Basics

  • ImageMagick
  • Python (version 2)
  • R
  • bedtools
  • DREME (version 5.0.1)
  • MEME-ChIP (version 5.0.1)
  • CisFinder

Python packages

  • numpy
  • h5py
  • pandas
  • seaborn
  • scipy
  • keras (version 2.2.4)
  • tensorflow (version 1.3.0)
  • sklearn
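If pip is preferred over conda, an invocation along these lines should work (a sketch only; note that the sklearn import is provided by the scikit-learn package, and releases compatible with Python 2 may have to be pinned explicitly):

pip install numpy h5py pandas seaborn scipy keras==2.2.4 tensorflow==1.3.0 scikit-learn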

Alternatively, to guarantee a working version of every dependency, you can install from the fully pre-specified conda environment:

conda env create -f corecode/environment_vConv.yml
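Once created, activate the environment before running any commands in this README. The environment name below is an assumption; use the name declared in corecode/environment_vConv.yml:

conda activate vConv   # name assumed; check corecode/environment_vConv.yml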

R packages

  • ggpubr
  • data.table
  • readxl
  • foreach
  • ggseqlogo
  • magick
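All six packages are available from CRAN, so they can be installed in one step from the shell (the magick R package additionally needs ImageMagick's development library, which is already listed under Basics):

Rscript -e 'install.packages(c("ggpubr", "data.table", "readxl", "foreach", "ggseqlogo", "magick"))'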

The vConv layer

Fetch the vConv core code into the working directory:

rm -fr ./vConv
git clone https://github.com/gao-lab/vConv
cp -r ./vConv/corecode ./

Prerequisites for training Basset-related models

  • Python 3
  • Follow the 'Installation' instructions at https://github.com/calico/basenji to install Basenji, with the following modifications:
    • Must use CUDA 10.0
    • Must use tensorflow version 2.3.4
      • We wrote a tensorflow-2.3.4-compatible vConv for Basset in this repository (see the code referenced in the section 'Train Basset-related models' below). We do not currently provide public support for vConv on this version of tensorflow.
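Before training, it may help to confirm that the environment really provides these versions. A quick check, assuming nvcc is on the PATH:

nvcc --version | grep "release 10.0"                        # CUDA 10.0
python -c "import tensorflow as tf; print(tf.__version__)"  # expect 2.3.4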

Step 1: Reproduce Figures 2-3, Supplementary Figures 5-8 and 10-11, and Supplementary Tables 2-3 (benchmarking models on motif identification)

1.1 Prepare datasets

Run the following commands to prepare the datasets.

wget -P ./ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/data.tar.gz
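# Optionally list the archive first; a truncated download will make
# this command fail, catching a bad transfer before extraction:
tar -tzf ./data.tar.gz | head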
tar -C ./ -xzvf ./data.tar.gz

This tarball contains both the simulated and the published datasets. The simulated dataset was generated with the following commands (the tarball already includes their output, so rerunning them is optional):

mkdir -p ./data/JasperMotif
cd ./data/JasperMotif/
python generateSequencesForSimulation.py
cd -

1.2 Train and evaluate models needed for reproducing these figures

The training takes about 18 days on a server with 1 CPU core, 32 GB of memory, and one NVIDIA 1080 Ti GPU card. The user can either train the models themselves (section 1.2.1) or use the pre-trained results (section 1.2.2).

1.2.1 Train the models directly

1.2.1.1 Train all but Basset-related models

  • Run the following commands to train the models (a driver that batches all five scripts is sketched after this block).
cd ./train/JasperMotifSimulation
python trainAllVConvBasedAndConvolutionBasedNetworksForSimulation.py
python trainAllVConvBasedNetworksWithoutMSLForSimulation.py
cd -
cd ./train/ZengChIPSeqCode
python trainAllVConvBasedAndConvolutionBasedNetworksForZeng.py
cd -
cd ./train/DeepBindChIPSeqCode2015
python trainAllVConvBasedAndConvolutionBasedNetworksForDeepBind.py
cd -
cd ./train/convergenceSpeed
python estimateConvergenceSpeed.py
cd -
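The five scripts can also be launched unattended. The sketch below is one way to do so, stopping at the first failure; the directories and script names are exactly those listed above:

set -e   # abort on the first failed training run
while read -r dir script; do
    ( cd "./train/${dir}" && python "${script}" )
done <<'EOF'
JasperMotifSimulation trainAllVConvBasedAndConvolutionBasedNetworksForSimulation.py
JasperMotifSimulation trainAllVConvBasedNetworksWithoutMSLForSimulation.py
ZengChIPSeqCode trainAllVConvBasedAndConvolutionBasedNetworksForZeng.py
DeepBindChIPSeqCode2015 trainAllVConvBasedAndConvolutionBasedNetworksForDeepBind.py
convergenceSpeed estimateConvergenceSpeed.py
EOF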

1.2.1.2 Train Basset-related models

  • Here the user needs to switch to the basenji environment before running the commands below (see the sketch after this list).
  • After training finishes, the user needs to deactivate the basenji environment before running other code.
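Assuming Basenji was installed into a conda environment named basenji (the name is an assumption; use whatever name was chosen during installation), the switch looks like:

conda activate basenji   # before the training commands below
# ... run the three training blocks ...
conda deactivate         # afterwards, return to the main environment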
cd ./basset/vConv/9layersvConv/
python TrainBasenjiBasset.py
cd -

cd ./basset/vConv/basenjibasset/
python basenji_train.py params_basset.json ../../../data/data_basset/
cd -

cd ./basset/vConv/singlelayervConv/
python TrainBasenjiBasset.py
cd -

1.2.2 Use the pre-trained results

  • Run the following commands to obtain the pre-trained models.
mkdir -p ./output
wget -P ./output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/result.tar.gz
tar -C ./output/ -xzvf ./output/result.tar.gz

1.3 Prepare summary files for reproducing figures and tables from datasets and results above

Note that

  • Both the original datasets ("./data") and the trained results ("./output/result/") are needed.
  • The scripts below must be run in the order displayed.
cd ./output/code
python checkResultsForSimulation.py
python checkResultsForSimulationWithoutMSL.py
python checkResultsForZengCase.py
python checkResultsForDeepBindCase.py
python extractMaskedKernelsFromSimulation.py
python prepareInputForTomtom.py
python useTomtomToCompareWithRealMotifs.py
cd -
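Because the order matters, a fail-fast wrapper that stops at the first error keeps a partial run from silently corrupting the summaries (script names as listed above):

cd ./output/code
set -e
for script in checkResultsForSimulation.py \
              checkResultsForSimulationWithoutMSL.py \
              checkResultsForZengCase.py \
              checkResultsForDeepBindCase.py \
              extractMaskedKernelsFromSimulation.py \
              prepareInputForTomtom.py \
              useTomtomToCompareWithRealMotifs.py; do
    python "${script}"
done
cd -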

1.4 Reproduce figures and tables

1.4.1 Reproduce Figure 2

Rscript ./vConvFigmain/code/generate_fig_2.R

The figure generated is located at ./vConvFigmain/result/Fig.2/Fig.2.png.

1.4.2 Reproduce Figure 3

Rscript ./vConvFigmain/code/generate_fig_3.R

The figure generated is located at ./vConvFigmain/result/Fig.3/Fig.3.png.

1.4.3 Reproduce Supplementary Figure 5

Rscript ./vConvFigmain/code/generate_supplementary_figure_5.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.5/Supplementary.Fig.5.png.

1.4.4 Reproduce Supplementary Figure 6

Rscript ./vConvFigmain/code/generate_supplementary_figure_6.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.6/Supplementary.Fig.6.png.

1.4.5 Reproduce Supplementary Figure 7

Rscript ./vConvFigmain/code/generate_supplementary_figure_7.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.7/Supplementary.Fig.7.png.
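The three Rscript calls in sections 1.4.3-1.4.5 follow the same naming pattern, so they can also be batched:

for n in 5 6 7; do
    Rscript ./vConvFigmain/code/generate_supplementary_figure_${n}.R
done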

1.4.6 Reproduce Supplementary Figure 8

cd ./output/code
python checkResultComparedZengSearch.py
cd -

The generated panels are located at:

  • Supp. Fig. 8A: output/ModelAUC/ChIPSeq/Pic/worseData/DataSize.png
  • Supp. Fig. 8B: output/ModelAUC/ChIPSeq/Pic/worseData/DataSizeWorseCase.png
  • Supp. Fig. 8C: "output/ModelAUC/ChIPSeq/Pic/convolution-based network from Zeng et al., 2016Boxplot.png"

1.4.7 Reproduce Supplementary Figure 10

cd ./output/SpeedTest/code
python DrawLoss.py
cd -

The generated figures are located at:

  • 2 motifs: output/SpeedTest/Png/2.jpg
  • 4 motifs: output/SpeedTest/Png/4.jpg
  • 6 motifs: output/SpeedTest/Png/6.jpg
  • 8 motifs: output/SpeedTest/Png/8.jpg
  • TwoDiffMotif1: output/SpeedTest/Png/TwoDiff1.jpg
  • TwoDiffMotif2: output/SpeedTest/Png/TwoDiff2.jpg
  • TwoDiffMotif3: output/SpeedTest/Png/TwoDiff3.jpg
  • Basset: output/SpeedTest/Png/basset.jpg

1.4.8 Reproduce Supplementary Figure 11

Rscript ./vConvFigmain/code/generate_supplementary_figure_11.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.11/Supplementary.Fig.11.png.

1.4.9 Reproduce Supplementary Table 2

By now this table should already exist at ./vConvFigmain/supptable23/SuppTable2.csv (it is produced by checkResultsForSimulation.py in step 1.3). To regenerate it, run:

cd ./output/code
python checkResultsForSimulation.py
cd -

1.4.10 Reproduce Supplementary Table 3

By now this table should already exist at ./vConvFigmain/supptable23/SuppTable3.csv (it is produced by checkResultsForSimulationWithoutMSL.py in step 1.3). To regenerate it, run:

cd ./output/code
python checkResultsForSimulationWithoutMSL.py
cd -

Step 2: Reproduce Figure 4 (benchmarking models on motif discovery)

2.1 Prepare datasets needed by Figure 4

mkdir -p ./vConvMotifDiscovery/ChIPSeqPeak/
wget -P ./vConvMotifDiscovery/ChIPSeqPeak/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/files.txt
for file in $(cut -f 1 ./vConvMotifDiscovery/ChIPSeqPeak/files.txt | grep narrowPeak.gz)
do
    wget -P ./vConvMotifDiscovery/ChIPSeqPeak/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/${file}
    gunzip ./vConvMotifDiscovery/ChIPSeqPeak/${file}
done

mkdir -p ./data
wget -P ./data/ http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
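# hg19.fa.gz is large; optionally verify it against UCSC's published
# checksums (md5sum.txt sits in the same bigZips directory) before unpacking:
wget -P ./data/ http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/md5sum.txt
grep hg19.fa.gz ./data/md5sum.txt    # expected checksum
md5sum ./data/hg19.fa.gz             # should match the line above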
gunzip ./data/hg19.fa.gz

2.2 Generate results needed by Figure 4

The user can either generate the results themselves (section 2.2.1) or use the pre-computed version (section 2.2.2). Note that both the data files and the results are needed to reproduce Figure 4.

2.2.1 Generate the results

See Supplementary Fig. 4 (reproduced below) for a description of each step.

[Supplementary Fig. 4]

2.2.1.1 step (1)

## extract sequences
cd ./vConvMotifDiscovery/code/MLtools
python extractSequences.py
cd -

## generate motifs by CisFinder, DREME, and MEME-ChIP
cd ./vConvMotifDiscovery/code/MLtools
python generateMotifsByCisFinder.py
python generateMotifsByDREME.py
python generateMotifsByMEMEChIP.py
cd -

## generate motifs by vConv-based networks
cd ./vConvMotifDiscovery/code/vConvBased
python generateMotifsByVConvBasedNetworks.py
cd -

2.2.1.2 steps (2-3)

cd ./vConvMotifDiscovery/code/CisfinderFile
python convertIntoCisfinderFormat.py
python splitIntoIndividualMotifFiles.py
python scanSequencesWithCisFinder.py
python combineCisFinderResults.py
cd -

2.2.1.3 steps (4-5)

cd ./vConvMotifDiscovery/code/Analysis
python computeAccuracy.py
python summarizeResults.py
cd -

2.2.2 Use the pre-computed version

wget -P ./vConvMotifDiscovery/output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/AUChdf5.tar.gz
tar -C ./vConvMotifDiscovery/output/ -xzvf ./vConvMotifDiscovery/output/AUChdf5.tar.gz

2.3 Reproduce Figure 4 (and Supplementary Figure 9)

Supplementary Figure 9 is generated together with Figure 4.

Rscript ./vConvFigmain/code/generate_fig_4.R

The generated Figure 4 is located at ./vConvFigmain/result/Fig.4/Fig.4.png.

The generated Supplementary Figure 9 is located at ./vConvFigmain/result/Supplementary.Fig.9/Supplementary.Fig.9.png.

Step 3: Reproduce Supplementary Fig. 12 B-I (theoretical analysis)

3.1 Prepare datasets and results needed by Supp. Fig. 12 B-I

cd theoretical/code/
python runSimulation.py
python trainCNN.py
cd -

3.2 Reproduce Supplementary Fig. 12 B-I

cd theoretical/code/
python plotFigures.py
cd -

The generated panels are located at:

  • Supp. Fig. 12B: theoretical/Motif/ICSimu/simuMtf_Len-8_totIC-10.png
  • Supp. Fig. 12C: theoretical/Motif/ICSimu/simuMtf_Len-23_totIC-12.png
  • Supp. Fig. 12D: theoretical/figure/simuMtf_Len-8_totIC-10.png
  • Supp. Fig. 12E: theoretical/figure/simuMtf_Len-23_totIC-12.png
  • Supp. Fig. 12F: theoretical/figure/simuMtf_Len-8_totIC-10rank.png
  • Supp. Fig. 12G: theoretical/figure/simuMtf_Len-23_totIC-12rank.png
  • Supp. Fig. 12H: theoretical/figure/simu01.png
  • Supp. Fig. 12I: theoretical/figure/simu02.png

