
📌 HiADN



🔧 Features

Workflow of HiADN

Unique features of HiADN

👥 User Guide

1. Installation

Clone or download our repo.

2. Requirements

See requirements.txt.

We recommend using conda to create a virtual environment:
  1. Install conda first.
  2. Enter the repo.
  3. conda create --name <your_name> --file requirements.txt
  4. conda activate <your_name>
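
To make sure the environment is ready, you can run a quick sanity check. This is a minimal sketch; it assumes PyTorch and numpy are among the packages pinned in requirements.txt:

# quick sanity check for the new environment
# (assumes torch and numpy are listed in requirements.txt)
import numpy as np
import torch

print('numpy:', np.__version__)
print('torch:', torch.__version__)
# GPU acceleration is strongly recommended for training (see Section 4)
print('CUDA available:', torch.cuda.is_available())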

3. Data Preprocessing

👉 In our experiments, we use the Hi-C data from (Rao et al. 2014).

You can view the data on NCBI via accession GSE62525. Three data sets are used: GM12878, K562, and CH12-LX.

$$ 😄 {\color{blue}!!!\ FOLLOW\ THE\ STEPS\ CAREFULLY\ !!!} $$

3.1 Set work directory

i. Create your root directory and set it in utils/config.py.

For example, we set root_dir = './Datasets_NPZ'

# the root directory for all raw and processed data
root_dir = './Datasets_NPZ'  # example root directory name

ii. Make a new directory named raw to store raw data.

mkdir $root_dir/raw

iii. Download and unzip data into the $root_dir/raw directory.


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX

Follow the steps below to generate datasets in .npz format:

3.2 Read the raw data

This will create a new directory $root_dir/mat/<cell_line_name> where all chrN_[HR].npz files will be stored.

usage: read_prepare.py -c CELL_LINE [-hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}] [-q {MAPQGE30,MAPQG0}] [-n {KRnorm,SQRTVCnorm,VCnorm}] [--help]

A tool to read raw data from Rao's Hi-C experiment.
------------------------------------------------------
Use example : python ./data/read_prepare.py -c GM12878
------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]

Miscellaneous Arguments:
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        High resolution specified[default:10kb]
  -q {MAPQGE30,MAPQG0}  Mapping quality of raw data[default:MAPQGE30]
  -n {KRnorm,SQRTVCnorm,VCnorm}
                        The normalization file for raw data[default:KRnorm]


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── mat
│   ├── K562
│   │   ├── chr1_10kb.npz
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
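
Optionally, you can sanity-check one of the generated matrices before moving on. This sketch assumes the arrays are stored under the 'hic' key, as in the custom-data example later in this section; use .files to list the actual keys if yours differ:

import numpy as np

# inspect one generated matrix (path assumes root_dir = './Datasets_NPZ')
data = np.load('./Datasets_NPZ/mat/GM12878/chr1_10kb.npz', allow_pickle=True)
print(data.files)            # list the stored array keys
mat = data['hic']            # 'hic' key assumed, see the custom-data example below
print(mat.shape, mat.dtype)  # expect a square contact matrix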

3.3 Downsample the data

This adds downsampled HR data to $root_dir/mat/<cell_line_name> as chrN_[LR].npz.
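
Conceptually, downsampling by a ratio of 16 simulates a sequencing run with 1/16 of the reads. Below is a minimal sketch of one common approach (binomial read subsampling, as used e.g. in DeepHiC); the repo's down_sample.py may differ in details:

import numpy as np

def downsample(matrix, ratio):
    """Keep each read independently with probability 1/ratio.
    Assumes raw (non-negative integer) contact counts."""
    counts = matrix.astype(np.int64)
    return np.random.binomial(counts, 1.0 / ratio).astype(matrix.dtype)

# example: a 1/16-coverage LR counterpart of an HR matrix
hr = np.load('./Datasets_NPZ/mat/GM12878/chr1_10kb.npz', allow_pickle=True)['hic']
lr = downsample(hr, ratio=16)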

usage: down_sample.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [--help]

A tool to downsample high-resolution data.
----------------------------------------------------------------------
Use example : python ./datasets/down_sample.py -hr 10kb -r 16 -c GM12878
----------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: The ratio of down sampling[example:16]


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── mat
│   ├── K562
│   │   ├── chr1_10kb.npz
│   │   ├── chr1_10kb_16ds.npz
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX

3.4 Generate train, validation and test datasets

  • You can set the desired chromosomes for each set in utils/config.py within the set_dict dictionary.
  • This specific example will create a file in $root_dir/data named xxx_train.npz.
# 'train' and 'valid' can be changed for different train/valid set splitting
set_dict = {'K562_test': [3, 11, 19, 21],
            'mESC_test': (4, 9, 15, 18),
            'train': [1, 3, 5, 7, 8, 9, 11, 13, 15, 17, 18, 19, 21, 22],
            'valid': [2, 6, 10, 12],
            'GM12878_test': (4, 14, 16, 20)}
usage: split.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [-s DATASET] -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to divide data for train, predict and test.
----------------------------------------------------------------------------------------------------------
Use example : python ./datasets/split.py -hr 10kb -r 16 -s train -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: down_sampled ratio[example:16]
  -s DATASET            Required: Dataset for train/valid/predict

Method Arguments:
  -chunk CHUNK          Required: chunk size for dividing[example:64]
  -stride STRIDE        Required: stride for dividing[example:64]
  -bound BOUND          Required: distance boundary interested[example:201]

Note
🗿 For training, you must have both training and validation files present in $root_dir/data.
Change the -s option to generate the validation and other needed datasets.
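
For intuition, chunk, stride and bound slide a window over each intra-chromosomal matrix and keep only the near-diagonal sub-matrices. A minimal sketch of that dividing logic (the repo's split.py may handle padding and indexing differently):

import numpy as np

def divide(matrix, chunk=64, stride=64, bound=201):
    """Cut chunk x chunk sub-matrices whose distance from the
    diagonal (in bins) stays within `bound`."""
    n = matrix.shape[0]
    pieces = []
    for i in range(0, n - chunk + 1, stride):
        for j in range(0, n - chunk + 1, stride):
            if abs(i - j) <= bound:  # keep only chunks near the diagonal
                pieces.append(matrix[i:i + chunk, j:j + chunk])
    return np.stack(pieces)  # shape: (num_chunks, chunk, chunk)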


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── mat
│   ├── K562
│   │   ├── chr1_10kb.npz
│   │   ├── chr1_10kb_16ds.npz
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── data
│   ├── xxxx_train.npz
│   ├── xxxx_valid.npz
│   └── ...

======================================

💗 If you want to use your own data for training

  • mkdir $root_dir/mat/<cell_name>

Prepare .npz data.

Note
Most common Hi-C file formats, such as .cool (.mcool) and .hic, can easily be converted to numpy matrices. Other formats can first be converted to an intermediate format using HiCExplorer and then to numpy matrices.

# example: convert a .cool/.mcool file to the .npz format HiADN expects
# cooler version: 0.9.2
import cooler
import numpy as np

# the cooler file name is 4DNFI18UHVRO.mcool
# '::resolutions/10000' selects the 10kb resolution, as required by the cooler package
c = cooler.Cooler('./4DNFI18UHVRO.mcool::resolutions/10000')

# chromosomes 1..20
for i in range(1, 21):
    print(i)
    matrix = c.matrix(balance=False).fetch("chr" + str(i), "chr" + str(i))
    # 'hic' is the key required by HiADN
    np.savez_compressed('K562_chr' + str(i) + '_10kb.npz', hic=matrix)
  • Move your .npz data into $root_dir/mat/<cell_name>/.
  • Then follow the earlier step <Downsample the data>.

4. Training

We provide pre-trained weight files for all models:

Note: we do not compare against HiCARN_2, as its performance was not as good as HiCARN_1's in its own paper 🎓.

  1. HiCSR
  2. HiCNN
  3. DeepHiC
  4. HiCARN
  5. HiADN (ours)

To train:

โค๏ธ GPU acceleration is strongly recommended.

4.1 All models

$$ {\color{red}!!!\ NOTE\ !!!} $$

  1. Do not use absolute paths.
  2. Put your train/valid/test data in $root_dir/data/{your path/your filename}.
  3. [if predicting] Put your ckpt file in $root_dir/checkpoints/{your path/your filename}.
  4. Use relative paths {your path/your filename}.
usage: train.py -m MODEL -t TRAIN_FILE -v VALID_FILE [-e EPOCHS] [-b BATCH_SIZE] [-verbose VERBOSE] [--help]

Training the models
--------------------------------------------------------------------------------------------
Use example : python train.py -m HiADN -t c64_s64_train.npz -v c64_s64_valid.npz -e 50 -b 32
--------------------------------------------------------------------------------------------

optional arguments:
  --help, -h        Print this help message and exit

Miscellaneous Arguments:
  -m MODEL          Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t TRAIN_FILE     Required: training file[example: c64_s64_train.npz]
  -v VALID_FILE     Required: valid file[example: c64_s64_valid.npz]
  -e EPOCHS         Optional: max epochs[example:50]
  -b BATCH_SIZE     Optional: batch_size[example:32]
  -verbose VERBOSE  Optional: record in tensorboard[example: 1 (meaning True)]

This will output .pytorch checkpoint files containing the trained weights to $root_dir/checkpoints/{model_name}_{best or final}.pytorch.
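
To inspect a resulting checkpoint afterwards, here is a hedged sketch (whether the file holds a full model object or just a state_dict depends on how train.py saves it):

import torch

# load on CPU so no GPU is needed just to inspect the file
ckpt = torch.load('./Datasets_NPZ/checkpoints/HiADN_best.pytorch',
                  map_location='cpu')
if isinstance(ckpt, dict):       # a state_dict maps layer names to tensors
    print(list(ckpt.keys())[:5])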

If you used the -verbose flag, run:

tensorboard --logdir ./Datasets_NPZ/logs/ --port=<your port>

You can then open TensorBoard in a browser [e.g. Google Chrome] to watch the training metrics evolve.

5. Predict

We provide pretrained weights for our models and all other compared models. You can also use the weights generated from your own training data.

5.1 Predict on downsampled data

These datasets are obtained by downsampling, so they have corresponding HR targets.

However, this data has never been put into the model before [just for test and comparison].
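
Because the targets exist, you can score a prediction against its HR target directly. A minimal sketch with a standard PSNR metric (the file names here are hypothetical; the 'hic' key follows the Appendix):

import numpy as np

def psnr(pred, target):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return float(10 * np.log10(target.max() ** 2 / mse))

# hypothetical paths: a predicted matrix and its HR target
pred = np.load('predict_chr4_10kb.npz', allow_pickle=True)['hic']
target = np.load('./Datasets_NPZ/mat/GM12878/chr4_10kb.npz', allow_pickle=True)['hic']
print(f'PSNR: {psnr(pred, target):.2f} dB')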

usage: predict.py -m MODEL -t PREDICT_FILE [-b BATCH_SIZE] -ckpt CKPT [--help]

Predict
--------------------------------------------------------------------------------------------------
Use example : python predict.py -m HiADN -t c64_s64_GM12878_test.npz -b 64 -ckpt best_ckpt.pytorch
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h       Print this help message and exit

Miscellaneous Arguments:
  -m MODEL         Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t PREDICT_FILE  Required: predicting file[example: c64_s64_GM12878_test.npz]
  -b BATCH_SIZE    Optional: batch_size[example:64]
  -ckpt CKPT       Required: Checkpoint file[example:best.pytorch]

5.2 Predict on matrix

  1. mkdir $root_dir/mat/{your cell_line}
  2. Put your chr{num}_{resolution}.npz file in the above directory.
  3. Run python ./data/split_matrix.py -h to generate data for prediction.
usage: split_matrix.py -c CELL_LINE -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to generate data for prediction.
----------------------------------------------------------------------------------------------------------
Use example : python ./data/split_matrix.py -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h      Print this help message and exit

Required Arguments:
  -c CELL_LINE    Required: Cell line for analysis[example:GM12878]

Method Arguments:
  -chunk CHUNK    Required: chunk size for dividing[example:64]
  -stride STRIDE  Required: stride for dividing[example:64]
  -bound BOUND    Required: distance boundary interested[example:201]

  4. Run python predict.py -h to predict [same as 5.1 Predict on downsampled data].

6. Visualization

usage: visualization.py -f FILE -s START -e END [-p PERCENTILE] [-c CMAP] [-n NAME] [--help]

Visualization
--------------------------------------------------------------------------------------------------
Use example : python ./visual.py -f hic_matrix.npz -s 14400 -e 14800 -p 95 -c Reds
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h     Print this help message and exit

Miscellaneous Arguments:
  -f FILE        Required: a npz file out from predict
  -s START       Required: start bin[example: 14400]
  -e END         Required: end bin[example: 14800]
  -p PERCENTILE  Optional: percentile of max, the default is 95.
  -c CMAP        Optional: color map[example: Reds]
  -n NAME        Optional: the name of pic[example: chr4:14400-14800]

Figures will be saved to $root_dir/img.

cmap: 👉 see the matplotlib docs

Recommended:

  1. Reds
  2. YlGn
  3. Greys
  4. YlOrRd
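
For reference, here is a minimal matplotlib sketch of what the -s/-e/-p/-c options correspond to; the repo's own visualization script may differ:

import os
import numpy as np
import matplotlib.pyplot as plt

# a predicted matrix from step 5 (hypothetical file name)
hic = np.load('hic_matrix.npz', allow_pickle=True)['hic']
start, end = 14400, 14800
region = hic[start:end, start:end]

# clip the color scale at the 95th percentile (-p 95) so a few
# extreme counts do not wash out the rest of the map
vmax = np.percentile(region, 95)
plt.imshow(region, cmap='Reds', vmax=vmax)  # -c Reds
plt.colorbar()

os.makedirs('./Datasets_NPZ/img', exist_ok=True)
plt.savefig('./Datasets_NPZ/img/chr4_14400-14800.png', dpi=300)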

📚 Appendix

The output predictions are stored in .npz files that contain numpy arrays under keys.

To access the predicted HR matrix, use the following code in a Python file:

"""
# .npz file is like a dict
a = np.load("path/to/file.npz", allow_pickle=True)
# to show all keys
a.files
# return a numpy array
a['key_name'] 
"""
import numpy as np
hic_matrix = np.load("path/to/file.npz", allow_pickle=True)['hic']

👷 Acknowledgement

We thank the authors of some wonderful repos, including

  1. DeepHiC: some code for data processing.
    • utils/io_helper.py
  2. RFDN: some code for the backbone of HiADN.
    • models/common.py
