
📌 HiADN



🔧 Features

Workflow of HiADN

Unique features of HiADN

👥 User Guide

1. Installation

Clone or download our repo.

2. Requirements

See requirements.txt.

We recommend using conda to create a virtual environment:
  1. Install conda first.
  2. Enter the repo.
  3. conda create --name <your_name> --file requirements.txt
  4. conda activate <your_name>
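
To make sure the environment is ready, you can run a quick sanity check. This is a minimal sketch; it assumes PyTorch and numpy are among the packages pinned in requirements.txt:

# quick sanity check for the new environment
# (assumes torch and numpy are listed in requirements.txt)
import numpy as np
import torch

print('numpy:', np.__version__)
print('torch:', torch.__version__)
# GPU acceleration is strongly recommended for training (see Section 4)
print('CUDA available:', torch.cuda.is_available())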

3. Data Preprocessing

👉 In our experiments, we use the Hi-C data from (Rao et al. 2014).

You can view the data on NCBI via accession GSE62525. Three data sets are used: GM12878, K562, and CH12-LX.

$$ 😄 {\color{blue}!!!\ FOLLOW\ THE\ STEPS\ CAREFULLY\ !!!} $$

3.1 Set work directory

i. Create your root directory and set it in utils/config.py.

For example, we set root_dir = './Datasets_NPZ'

# the root directory for all raw and processed data
root_dir = './Datasets_NPZ'  # example root directory name

ii. Make a new directory named raw to store raw data.

mkdir $root_dir/raw

iii. Download and unzip data into the $root_dir/raw directory.


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX

Follow the steps below to generate datasets in .npz format:

3.2 Read the raw data

This will create a new directory $root_dir/mat/<cell_line_name> where all chrN_[HR].npz files will be stored.

usage: read_prepare.py -c CELL_LINE [-hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}] [-q {MAPQGE30,MAPQG0}] [-n {KRnorm,SQRTVCnorm,VCnorm}] [--help]

A tool to read raw data from Rao's Hi-C experiment.
------------------------------------------------------
Use example : python ./data/read_prepare.py -c GM12878
------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]

Miscellaneous Arguments:
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        High resolution specified[default:10kb]
  -q {MAPQGE30,MAPQG0}  Mapping quality of raw data[default:MAPQGE30]
  -n {KRnorm,SQRTVCnorm,VCnorm}
                        The normalization file for raw data[default:KRnorm]


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── mat
│   ├── K562
│   │   ├── chr1_10kb.npz
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
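
Optionally, you can sanity-check one of the generated matrices before moving on. This sketch assumes the arrays are stored under the 'hic' key, as in the custom-data example later in this section; use .files to list the actual keys if yours differ:

import numpy as np

# inspect one generated matrix (path assumes root_dir = './Datasets_NPZ')
data = np.load('./Datasets_NPZ/mat/GM12878/chr1_10kb.npz', allow_pickle=True)
print(data.files)            # list the stored array keys
mat = data['hic']            # 'hic' key assumed, see the custom-data example below
print(mat.shape, mat.dtype)  # expect a square contact matrix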

3.3 Downsample the data

This adds downsampled HR data to $root_dir/mat/<cell_line_name> as chrN_[LR].npz.
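
Conceptually, downsampling by a ratio of 16 simulates a sequencing run with 1/16 of the reads. Below is a minimal sketch of one common approach (binomial read subsampling, as used e.g. in DeepHiC); the repo's down_sample.py may differ in details:

import numpy as np

def downsample(matrix, ratio):
    """Keep each read independently with probability 1/ratio.
    Assumes raw (non-negative integer) contact counts."""
    counts = matrix.astype(np.int64)
    return np.random.binomial(counts, 1.0 / ratio).astype(matrix.dtype)

# example: a 1/16-coverage LR counterpart of an HR matrix
hr = np.load('./Datasets_NPZ/mat/GM12878/chr1_10kb.npz', allow_pickle=True)['hic']
lr = downsample(hr, ratio=16)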

usage: down_sample.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [--help]

A tool to downsample high-resolution data.
----------------------------------------------------------------------
Use example : python ./datasets/down_sample.py -hr 10kb -r 16 -c GM12878
----------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: The ratio of down sampling[example:16]


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── mat
│   ├── K562
│   │   ├── chr1_10kb.npz
│   │   ├── chr1_10kb_16ds.npz
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX

3.4 Generate train, validation and test datasets

  • You can set the desired chromosomes for each set in utils/config.py within the set_dict dictionary.
  • This specific example will create a file in $root_dir/data named xxx_train.npz.
# 'train' and 'valid' can be changed for different train/valid set splitting
set_dict = {'K562_test': [3, 11, 19, 21],
            'mESC_test': (4, 9, 15, 18),
            'train': [1, 3, 5, 7, 8, 9, 11, 13, 15, 17, 18, 19, 21, 22],
            'valid': [2, 6, 10, 12],
            'GM12878_test': (4, 14, 16, 20)}
usage: split.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [-s DATASET] -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to divide data for train, predict and test.
----------------------------------------------------------------------------------------------------------
Use example : python ./datasets/split.py -hr 10kb -r 16 -s train -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: down_sampled ratio[example:16]
  -s DATASET            Required: Dataset for train/valid/predict

Method Arguments:
  -chunk CHUNK          Required: chunk size for dividing[example:64]
  -stride STRIDE        Required: stride for dividing[example:64]
  -bound BOUND          Required: distance boundary interested[example:201]

Note
🗿 For training, you must have both training and validation files present in $root_dir/data.
Change the -s option to generate the validation and other needed datasets.
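
For intuition, chunk, stride and bound slide a window over each intra-chromosomal matrix and keep only the near-diagonal sub-matrices. A minimal sketch of that dividing logic (the repo's split.py may handle padding and indexing differently):

import numpy as np

def divide(matrix, chunk=64, stride=64, bound=201):
    """Cut chunk x chunk sub-matrices whose distance from the
    diagonal (in bins) stays within `bound`."""
    n = matrix.shape[0]
    pieces = []
    for i in range(0, n - chunk + 1, stride):
        for j in range(0, n - chunk + 1, stride):
            if abs(i - j) <= bound:  # keep only chunks near the diagonal
                pieces.append(matrix[i:i + chunk, j:j + chunk])
    return np.stack(pieces)  # shape: (num_chunks, chunk, chunk)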


After doing that, your directory should look like this:

🔨 FILE STRUCTURE
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── mat
│   ├── K562
│   │   ├── chr1_10kb.npz
│   │   ├── chr1_10kb_16ds.npz
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── data
│   ├── xxxx_train.npz
│   ├── xxxx_valid.npz
│   └── ...

======================================

💗 If you want to use your own data for training

  • mkdir $root_dir/mat/<cell_name>

Prepare .npz data.

Note
Most common Hi-C file formats, such as .cool (.mcool) and .hic, can easily be converted to numpy matrices. Other formats can first be converted to an intermediate format using HiCExplorer and then to numpy matrices.

# example: convert a .cool/.mcool file to the .npz format HiADN expects
# cooler version: 0.9.2
import cooler
import numpy as np

# the cooler file name is 4DNFI18UHVRO.mcool
# '::resolutions/10000' selects the 10kb resolution, as required by the cooler package
c = cooler.Cooler('./4DNFI18UHVRO.mcool::resolutions/10000')

# chromosomes 1..20
for i in range(1, 21):
    print(i)
    matrix = c.matrix(balance=False).fetch("chr" + str(i), "chr" + str(i))
    # 'hic' is the key required by HiADN
    np.savez_compressed('K562_chr' + str(i) + '_10kb.npz', hic=matrix)
  • Move your .npz data into $root_dir/mat/<cell_name>/.
  • Then follow the earlier step <Downsample the data>.

4. Training

We provide pre-trained weight files for all models:

Note: we do not compare against HiCARN_2, as its performance was not as good as HiCARN_1's in its own paper 🎓.

  1. HiCSR
  2. HiCNN
  3. DeepHiC
  4. HiCARN
  5. HiADN (ours)

To train:

โค๏ธ GPU acceleration is strongly recommended.

4.1 All models

$$ {\color{red}!!!\ NOTE\ !!!} $$

  1. Do not use absolute paths.
  2. Put your train/valid/test data in $root_dir/data/{your path/your filename}.
  3. [if predicting] Put your ckpt file in $root_dir/checkpoints/{your path/your filename}.
  4. Use relative paths {your path/your filename}.
usage: train.py -m MODEL -t TRAIN_FILE -v VALID_FILE [-e EPOCHS] [-b BATCH_SIZE] [-verbose VERBOSE] [--help]

Training the models
--------------------------------------------------------------------------------------------
Use example : python train.py -m HiADN -t c64_s64_train.npz -v c64_s64_valid.npz -e 50 -b 32
--------------------------------------------------------------------------------------------

optional arguments:
  --help, -h        Print this help message and exit

Miscellaneous Arguments:
  -m MODEL          Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t TRAIN_FILE     Required: training file[example: c64_s64_train.npz]
  -v VALID_FILE     Required: valid file[example: c64_s64_valid.npz]
  -e EPOCHS         Optional: max epochs[example:50]
  -b BATCH_SIZE     Optional: batch_size[example:32]
  -verbose VERBOSE  Optional: record in tensorboard[example: 1 (meaning True)]

This will output .pytorch checkpoint files containing the trained weights to $root_dir/checkpoints/{model_name}_{best or final}.pytorch.
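
To inspect a resulting checkpoint afterwards, here is a hedged sketch (whether the file holds a full model object or just a state_dict depends on how train.py saves it):

import torch

# load on CPU so no GPU is needed just to inspect the file
ckpt = torch.load('./Datasets_NPZ/checkpoints/HiADN_best.pytorch',
                  map_location='cpu')
if isinstance(ckpt, dict):       # a state_dict maps layer names to tensors
    print(list(ckpt.keys())[:5])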

If you used the -verbose flag, run:

tensorboard --logdir ./Datasets_NPZ/logs/ --port=<your port>

You can then open TensorBoard in a browser [e.g. Google Chrome] to watch the training metrics evolve.

5. Predict

We provide pretrained weights for our models and all other compared models. You can also use the weights generated from your own training data.

5.1 Predict on downsampled data

These datasets are obtained by downsampling, so they have corresponding HR targets.

However, this data has never been put into the model before [just for test and comparison].
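
Because the targets exist, you can score a prediction against its HR target directly. A minimal sketch with a standard PSNR metric (the file names here are hypothetical; the 'hic' key follows the Appendix):

import numpy as np

def psnr(pred, target):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return float(10 * np.log10(target.max() ** 2 / mse))

# hypothetical paths: a predicted matrix and its HR target
pred = np.load('predict_chr4_10kb.npz', allow_pickle=True)['hic']
target = np.load('./Datasets_NPZ/mat/GM12878/chr4_10kb.npz', allow_pickle=True)['hic']
print(f'PSNR: {psnr(pred, target):.2f} dB')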

usage: predict.py -m MODEL -t PREDICT_FILE [-b BATCH_SIZE] -ckpt CKPT [--help]

Predict
--------------------------------------------------------------------------------------------------
Use example : python predict.py -m HiADN -t c64_s64_GM12878_test.npz -b 64 -ckpt best_ckpt.pytorch
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h       Print this help message and exit

Miscellaneous Arguments:
  -m MODEL         Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t PREDICT_FILE  Required: predicting file[example: c64_s64_GM12878_test.npz]
  -b BATCH_SIZE    Optional: batch_size[example:64]
  -ckpt CKPT       Required: Checkpoint file[example:best.pytorch]

5.2 Predict on matrix

  1. mkdir $root_dir/mat/{your cell_line}
  2. Put your chr{num}_{resolution}.npz file in the above directory.
  3. Run python ./data/split_matrix.py -h to generate data for prediction.
usage: split_matrix.py -c CELL_LINE -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to generate data for prediction.
----------------------------------------------------------------------------------------------------------
Use example : python ./data/split_matrix.py -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h      Print this help message and exit

Required Arguments:
  -c CELL_LINE    Required: Cell line for analysis[example:GM12878]

Method Arguments:
  -chunk CHUNK    Required: chunk size for dividing[example:64]
  -stride STRIDE  Required: stride for dividing[example:64]
  -bound BOUND    Required: distance boundary interested[example:201]

  4. Run python predict.py -h to predict [same as 5.1 Predict on downsampled data].

6. Visualization

usage: visualization.py -f FILE -s START -e END [-p PERCENTILE] [-c CMAP] [-n NAME] [--help]

Visualization
--------------------------------------------------------------------------------------------------
Use example : python ./visual.py -f hic_matrix.npz -s 14400 -e 14800 -p 95 -c Reds
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h     Print this help message and exit

Miscellaneous Arguments:
  -f FILE        Required: a npz file out from predict
  -s START       Required: start bin[example: 14400]
  -e END         Required: end bin[example: 14800]
  -p PERCENTILE  Optional: percentile of max, the default is 95.
  -c CMAP        Optional: color map[example: Reds]
  -n NAME        Optional: the name of pic[example: chr4:14400-14800]

Figures will be saved to $root_dir/img.

cmap: 👉 see the matplotlib docs

Recommended:

  1. Reds
  2. YlGn
  3. Greys
  4. YlOrRd
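
For reference, here is a minimal matplotlib sketch of what the -s/-e/-p/-c options correspond to; the repo's own visualization script may differ:

import os
import numpy as np
import matplotlib.pyplot as plt

# a predicted matrix from step 5 (hypothetical file name)
hic = np.load('hic_matrix.npz', allow_pickle=True)['hic']
start, end = 14400, 14800
region = hic[start:end, start:end]

# clip the color scale at the 95th percentile (-p 95) so a few
# extreme counts do not wash out the rest of the map
vmax = np.percentile(region, 95)
plt.imshow(region, cmap='Reds', vmax=vmax)  # -c Reds
plt.colorbar()

os.makedirs('./Datasets_NPZ/img', exist_ok=True)
plt.savefig('./Datasets_NPZ/img/chr4_14400-14800.png', dpi=300)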

📚 Appendix

The output predictions are stored in .npz files that contain numpy arrays under keys.

To access the predicted HR matrix, use the following code in a Python file:

"""
# .npz file is like a dict
a = np.load("path/to/file.npz", allow_pickle=True)
# to show all keys
a.files
# return a numpy array
a['key_name'] 
"""
import numpy as np
hic_matrix = np.load("path/to/file.npz", allow_pickle=True)['hic']

👷 Acknowledgement

We thank the authors of some wonderful repos, including

  1. DeepHiC: some code for data processing.
    • utils/io_helper.py
  2. RFDN: some code for the backbone of HiADN.
    • models/common.py
