The modcnet from yulab2021

modCnet: Detection of ac4C and m5C from nanopore direct RNA sequencing using deep learning

modCnet is a deep learning framework designed to harness the power of Oxford Nanopore direct RNA sequencing for precise identification of N4-acetylcytidine (ac4C) and 5-methylcytosine (m5C) sites, a crucial aspect in RNA modification studies. By effectively distinguishing ac4C and m5C from unmodified cytidine, modCnet enables accurate estimation of modification rates at each ac4C site. Through rigorous validation on independent in vitro datasets and a human cell line, modCnet showcases its robustness, versatility, and immense potential in advancing the understanding and exploration of ac4C modifications in mRNA.

1. Installation

modCnet is implemented in Python. The following modules are needed to run modCnet:

module	version
minimap2	2.17-r941
python	3.7.12
h5py	3.7.0
statsmodels	0.10.0
joblib	0.16.0
scikit-learn	0.22
torch	1.9.1
guppy	6.1.5
tombo	1.5.1
ont_vbz_hdf_plugin	1.0.1
ont-fast5-api	4.1.1
numpy	1.19.5
scipy	1.7.0

You can install the dependent modules manually. Conda is recommended for running modCnet. Create a new conda environment and activate it:

conda create -n modCnet python=3.7.12
conda activate modCnet

Install the required modules:

conda config --add channels conda-forge
conda config --add channels bioconda

conda install -c conda-forge scipy=1.7.0
conda install -c bioconda minimap2=2.17
conda install -c conda-forge numpy=1.19.5
conda install -c anaconda h5py=3.7.0
conda install -c conda-forge joblib=0.16.0
conda install -c anaconda scikit-learn=0.22
conda install -c bioconda ont-tombo=1.5.1
conda install -c bioconda ont_vbz_hdf_plugin=1.0.1
conda install -c bioconda ont-fast5-api=4.1.1
conda install -c conda-forge statsmodels=0.10.0
pip install torch==1.9.1

Or, some of the modules can be installed by pip:

pip install numpy==1.19.5
pip install h5py==3.7.0
pip install statsmodels==0.10.0
pip install joblib==0.16.0
pip install scikit-learn==0.22
pip install ont-tombo==1.5.1
pip install ont-fast5-api==4.1.1
pip install scipy==1.7.0

Guppy for basecalling can be obtained from Oxford Nanopore Technologies or from this mirror. Install Guppy using dpkg:

alien ont-guppy-cpu-6.1.5-1.el7.x86_64.rpm
dpkg -i ont-guppy-cpu-6.1.5-1.el7.x86_64.deb

libhdf5 and libcrypto are required for running guppy.

The entire installation will take about 10 minutes. After installing all the essential packages, reset the environment’s state by deactivating and reactivating the environment:

conda deactivate
conda activate modCnet

We also provide a yaml file in the repository so you can install the dependencies through the configuration file:

conda env create -f modCnet.yaml

2. Fast5 data processing

2.1 Guppy basecalling

Guppy is used for basecalling in modCnet. Guppy takes fast5 files as input and generates fastq files from the current signals. This step can be time-consuming and may require several hours or even days to complete, depending on the computational capacity available:

guppy_basecaller -i demo_data/IVT_fast5 -s demo_data/IVT_fast5_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
-i=DIR                              Yes         NA                    Input directory, containing FAST5 files generated by the nanopore sequencing platform
-s=DIR                              Yes         NA                    Output directory, containing FAST5 files as well as basecalled sequences.
--num_callers=NUM                   No          1                     Number of processes to run.
--fast5_out                         Yes         None                  Output FAST5 files to the directory.
--config=STR                        Yes         None                  The config file is "rna_r9.4.1_70bps_hac.cfg" in modCnet and should be adjusted according to the DRS platform.
--recursive                         Yes         None                  This argument allows recursive processing or batch processing of files.
=================================   ==========  ===================  ============================================================================================================

2.2 Multi-fast5 to single-fast5

If fast5 reads are stored in multi-read format, ont_fast5_api is recommended to convert multi-read fast5 files to single-read fast5 files. Usually, the size of a multi-read fast5 file is about 200-300 MB. Convert multi-read files to single-read files:

multi_to_single_fast5 -i demo_data/IVT_fast5_guppy -s demo_data/IVT_fast5_guppy_single --recursive

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
-i=DIR                              Yes         NA                    Input directory, containing multi-reads FAST5 files.
-s=DIR                              Yes         NA                    Output directory, containing single-read FAST5 files.
-t=NUM                              No          1                     Number of processes to run.
--recursive                         Yes         NA                    This argument allows recursive processing or batch processing of files.
=================================   ==========  ===================  ============================================================================================================

2.3 Tombo resquiggle

The resquiggling algorithm is the basis for the Tombo framework. It takes as input a read file (in FAST5 format) containing raw signal and associated base calls. The base calls are mapped to a genome or transcriptome reference, and the raw signal is then assigned to the reference sequence based on an expected current level model. Tombo is used for resquiggling in modCnet:

tombo resquiggle --overwrite --basecall-group Basecall_1D_000 demo_data/IVT_fast5_guppy_single  demo_data/IVT_DRS_ac4C.reference.fasta --processes 40 --fit-global-scale --include-event-stdev

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--overwrite                         Yes         NA                    Overwrite previous corrected group in FAST5 files.
--basecall-group                    No          Basecall_1D_000       FAST5 group from which to obtain the original basecalls.
--processes                         No          1                     Number of processes to run.
--fit-global-scale                  No          NA                    Apply a scaling factor.
--include-event-stdev               No          NA                    Include the standard deviation.
args[0]                             Yes         NA                    Fast5 basedir. 
args[1]                             Yes         NA                    Reference transcripts, in fasta format.
=================================   ==========  ===================  ============================================================================================================

2.4 Map to reference

minimap2 is used to map basecalled sequences to reference transcripts:

cat demo_data/IVT_fast5_guppy/pass/*.fastq >demo_data/IVT.fastq
minimap2 -ax map-ont demo_data/IVT_DRS_ac4C.reference.fasta demo_data/IVT.fastq >demo_data/IVT.sam
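
Before moving on to feature extraction, it can be useful to check how many reads actually mapped. The following is a minimal sketch, not part of modCnet, that counts mapped and unmapped primary records in a SAM file using the standard SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `summarize_sam` and the example records are made up for illustration.

```python
def summarize_sam(lines):
    """Count mapped and unmapped primary records in SAM-formatted lines."""
    mapped = unmapped = 0
    for line in lines:
        # Skip blank lines and header lines (header lines start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        # Ignore secondary (0x100) and supplementary (0x800) alignments
        # so that each read is counted once.
        if flag & 0x900:
            continue
        if flag & 0x4:  # 0x4 means the read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Hypothetical records, for illustration only.
example = [
    "@SQ\tSN:transcript_1\tLN:1000",
    "read_a\t0\ttranscript_1\t5\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "read_b\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
    "read_c\t256\ttranscript_1\t50\t0\t10M\t*\t0\t0\t*\t*",
]
mapped, unmapped = summarize_sam(example)
print(mapped, unmapped)  # 1 1
```

To apply it to the demo output, pass an open file handle, for example `summarize_sam(open("demo_data/IVT.sam"))`.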

3. Feature extraction

Extract features from the fast5 files. modCnet takes the features corresponding to 5-mers as input.

python script/feature_extraction.py --input demo_data/IVT_fast5_guppy_single \
        --reference demo_data/IVT_DRS_ac4C.reference.fasta  \
        --sam demo_data/IVT.sam \
        --output demo_data/IVT.feature \
        --clip 10 \
        --motif NNCNN

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--input                             Yes         NA                    Fast5 basedir.
--reference                         Yes         NA                    Reference transcripts, in fasta format.
--sam                               Yes         NA                    Alignment results, output from minimap2.
--output                            Yes         NA                    Output file containing current signals.
--clip                              Yes         NA                    Base clip at both ends.
--motif                             Yes         NA                    Motif pattern to extract.
=================================   ==========  ===================  ============================================================================================================

The base symbols of the motif follow the IUB code standard. Here is the full definition of IUB base symbols:

symbol	bases
A	A
C	C
G	G
T	T
M	AC
V	ACG
R	AG
H	ACT
W	AT
D	AGT
S	CG
B	CGT
Y	CT
N	ACGT
K	GT

4. Model training

You can train modCnet with your own data. To evaluate the model's generalization ability and avoid overfitting, test data is needed in the training process. The train-test split can be performed with the script provided in the repository. The default split ratios are 80% for training and 20% for testing. The ratio can be customized with the argument --train_ratio to accommodate the specific requirements of the problem and the size of the dataset.

usage: train_test_split.py [-h] [--input_file INPUT_FILE]
                            [--train_file TRAIN_FILE] [--test_file TEST_FILE]
                            [--train_ratio TRAIN_RATIO]
    
Split a feature file into training and testing sets.
    
optional arguments:
      -h, --help                  show this help message and exit
      --input_file INPUT_FILE     Path to the input feature file
      --train_file TRAIN_FILE     Path to the train feature file
      --test_file TEST_FILE       Path to the test feature file
      --train_ratio TRAIN_RATIO   Ratio of instances to use for training (default: 0.8)
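
For readers who want to see what such a split amounts to, the following is a minimal sketch of a shuffled line-level split. It is not the code of train_test_split.py; whether that script shuffles, and how it rounds, are assumptions here, and the function name `split_lines` and the `seed` parameter are hypothetical.

```python
import random


def split_lines(lines, train_ratio=0.8, seed=0):
    """Shuffle records and split them into (train, test) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(lines)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for lines of a feature file.
records = ["record_%d" % i for i in range(10)]
train, test = split_lines(records, train_ratio=0.8)
print(len(train), len(test))  # 8 2
```

Every record ends up in exactly one of the two lists, so the test set stays unseen during training.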

4.1 Train C/ac4C model

To train the modCnet model using your own dataset from scratch, you can set the --run_mode argument to "train". modCnet accepts both modified and unmodified feature files as input. Additionally, test feature files are necessary to evaluate the model's performance. You can specify the model save path by using the argument --new_model. The model's training epochs can be defined using the argument --epoch, and the model states will be saved at the end of each epoch. modCnet will preferentially use the GPU for training if CUDA is available on your device; otherwise, it will utilize the CPU mode. The training process duration can vary, depending on the size of your dataset and the computational capacity, and may last for several hours.

python script/modCnet.py --run_mode train \
      --model_type C/ac4C \
      --new_model demo_data/model/C_ac4C.pkl \
      --train_data_C demo_data/C.feature.train.tsv \
      --train_data_ac4C demo_data/ac4C.feature.train.tsv \
      --test_data_C demo_data/C.feature.test.tsv \
      --test_data_ac4C demo_data/ac4C.feature.test.tsv \
      --epoch 100

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--run_mode                          Yes         NA                    Run mode [train or predict].
--model_type                        Yes         NA                    Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model                         Yes         NA                    The path to save model.
--train_data_C                      Yes         NA                    Train data, unmodified.
--train_data_ac4C                   Yes         NA                    Train data, ac4C-modified.
--test_data_C                       Yes         NA                    Test data, unmodified.
--test_data_ac4C                    Yes         NA                    Test data, ac4C-modified.
--epoch                             Yes         NA                    Training epoch.
=================================   ==========  ===================  ============================================================================================================

4.2 Train C/m5C model

This is a demo to train a C/m5C model that distinguishes C from m5C.

python script/modCnet.py --run_mode train \
      --model_type C/m5C \
      --new_model demo_data/model/C_m5C.pkl \
      --train_data_C demo_data/C.feature.train.tsv \
      --train_data_m5C demo_data/m5C.feature.train.tsv \
      --test_data_C demo_data/C.feature.test.tsv \
      --test_data_m5C demo_data/m5C.feature.test.tsv \
      --epoch 100

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--run_mode                          Yes         NA                    Run mode [train or predict].
--model_type                        Yes         NA                    Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model                         Yes         NA                    The path to save model.
--train_data_C                      Yes         NA                    Train data, unmodified.
--train_data_m5C                    Yes         NA                    Train data, m5C-modified.
--test_data_C                       Yes         NA                    Test data, unmodified.
--test_data_m5C                     Yes         NA                    Test data, m5C-modified.
--epoch                             Yes         NA                    Training epoch.
=================================   ==========  ===================  ============================================================================================================

4.3 Train m5C/ac4C model

This is a demo to train an m5C/ac4C model that distinguishes ac4C from m5C.

python script/modCnet.py --run_mode train \
      --model_type m5C/ac4C \
      --new_model demo_data/model/m5C_ac4C.pkl \
      --train_data_m5C demo_data/m5C.feature.train.tsv \
      --train_data_ac4C demo_data/ac4C.feature.train.tsv \
      --test_data_m5C demo_data/m5C.feature.test.tsv \
      --test_data_ac4C demo_data/ac4C.feature.test.tsv \
      --epoch 100

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--run_mode                          Yes         NA                    Run mode [train or predict].
--model_type                        Yes         NA                    Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model                         Yes         NA                    The path to save model.
--train_data_m5C                    Yes         NA                    Train data, m5C-modified.
--train_data_ac4C                   Yes         NA                    Train data, ac4C-modified.
--test_data_m5C                     Yes         NA                    Test data, m5C-modified.
--test_data_ac4C                    Yes         NA                    Test data, ac4C-modified.
--epoch                             Yes         NA                    Training epoch.
=================================   ==========  ===================  ============================================================================================================

4.4 Train C/m5C/ac4C 3-class model

This is a demo to train a C/m5C/ac4C 3-class model that distinguishes among C, m5C and ac4C.

python script/modCnet.py --run_mode train \
      --model_type C/m5C/ac4C \
      --new_model demo_data/model/C_m5C_ac4C.pkl \
      --train_data_C demo_data/C.feature.train.tsv \
      --train_data_m5C demo_data/m5C.feature.train.tsv \
      --train_data_ac4C demo_data/ac4C.feature.train.tsv \
      --test_data_C demo_data/C.feature.test.tsv \
      --test_data_m5C demo_data/m5C.feature.test.tsv \
      --test_data_ac4C demo_data/ac4C.feature.test.tsv \
      --epoch 100

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--run_mode                          Yes         NA                    Run mode [train or predict].
--model_type                        Yes         NA                    Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model                         Yes         NA                    The path to save model.
--train_data_C                      Yes         NA                    Train data, unmodified.
--train_data_m5C                    Yes         NA                    Train data, m5C-modified.
--train_data_ac4C                   Yes         NA                    Train data, ac4C-modified.
--test_data_C                       Yes         NA                    Test data, unmodified.
--test_data_m5C                     Yes         NA                    Test data, m5C-modified.
--test_data_ac4C                    Yes         NA                    Test data, ac4C-modified.
--epoch                             Yes         NA                    Training epoch.
=================================   ==========  ===================  ============================================================================================================

5. Predict new data

We provide 4 pretrained models in the model directory of the repository: a C/ac4C model, a C/m5C model, an m5C/ac4C model and a 3-class C/m5C/ac4C model. You can load the pretrained models or your own model to predict new data.

python script/modCnet.py --run_mode predict \
    --model_type C/m5C/ac4C \
    --pretrained_model model/C_m5C_ac4C.pkl \
    --feature_file demo_data/test.feature.tsv \
    --predict_result demo_data/test.prediction.tsv

Arguments:
=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--run_mode                          Yes         NA                    Run mode [train or predict].
--model_type                        Yes         NA                    Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--pretrained_model                  Yes         NA                    The path to pretrained model.
--feature_file                      Yes         NA                    Feature file. 
--predict_result                    Yes         NA                    Output file for the prediction results.
=================================   ==========  ===================  ============================================================================================================

6. Results interpretation

The following is an example of the prediction output. The results include the columns transcript_id, site, motif, read_id, prediction and probability. The probability indicates the read-level confidence score for the site to be modified. Users can apply a cutoff threshold to the probability to filter out low-confidence predictions.

    transcript_id           site    motif   read_id                                 prediction   probability
    LOC_Os07g03730.1        560     GGCAA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.0077731693
    LOC_Os07g03730.1        563     AACGT   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.00074390427
    LOC_Os07g03730.1        566     GTCGA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.0019359465
    LOC_Os07g03730.1        569     GACGG   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.014535325
    LOC_Os07g03730.1        572     GGCGA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    ac4C         0.86820465
    LOC_Os07g03730.1        577     ATCTC   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.0008406919
    LOC_Os07g03730.1        579     CTCCC   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.033249497
    LOC_Os07g03730.1        580     TCCCT   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.024631226
    LOC_Os07g03730.1        581     CCCTA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.0748754
    LOC_Os07g03730.1        584     TACTA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.4033113
    LOC_Os07g03730.1        588     AGCTA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.37496027
    LOC_Os07g03730.1        591     TACTA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.20457618
    LOC_Os07g03730.1        603     TACGT   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.000665458
    LOC_Os07g03730.1        607     TACGG   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.0102844415
    LOC_Os07g03730.1        610     GGCTA   5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6    C            0.0144964745
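
As an illustration of applying such a cutoff, here is a short sketch that keeps only the reads called as a given modification above a chosen probability. It assumes the prediction file is tab-separated with the header row shown above; the function name `filter_predictions` and the 0.8 threshold are arbitrary choices for the example, not recommendations from the authors.

```python
import csv
import io


def filter_predictions(handle, label="ac4C", min_probability=0.8):
    """Return rows predicted as `label` with probability >= min_probability."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader
            if row["prediction"] == label
            and float(row["probability"]) >= min_probability]


# Two rows taken from the example output above, joined with tabs.
example = io.StringIO(
    "transcript_id\tsite\tmotif\tread_id\tprediction\tprobability\n"
    "LOC_Os07g03730.1\t569\tGACGG\t5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6\tC\t0.014535325\n"
    "LOC_Os07g03730.1\t572\tGGCGA\t5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6\tac4C\t0.86820465\n"
)
kept = filter_predictions(example)
print([(row["site"], row["probability"]) for row in kept])
# [('572', '0.86820465')]
```

For a real file, replace the `io.StringIO` object with `open("demo_data/test.prediction.tsv", newline="")`.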

We also provide a script to convert results from transcriptome location to genome location.

python script/transcriptome_location_to_genome_location.py \
    --input predictions.tsv \
    --output predictions_genome_loc.tsv \
    --gff /path/to/your/genome_annotation.gtf


=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--input                             Yes         NA                    Input file,transcriptome location.
--output                            Yes         NA                    Output file, genome location.
--gff                               Yes         NA                    Annotation file.
=================================   ==========  ===================  ============================================================================================================

    
    Results:
    transcript_id           site    chr     site            motif   prediction   probability
    LOC_Os07g03730.1        560     Chr7    1524212         GGCAA   C            0.0077731693
    LOC_Os07g03730.1        563     Chr7    1524215         AACGT   C            0.00074390427
    LOC_Os07g03730.1        566     Chr7    1524218         GTCGA   C            0.0019359465
    LOC_Os07g03730.1        569     Chr7    1524221         GACGG   C            0.014535325
    LOC_Os07g03730.1        572     Chr7    1524224         GGCGA   ac4C         0.86820465
    LOC_Os07g03730.1        577     Chr7    1524229         ATCTC   C            0.0008406919
    LOC_Os07g03730.1        579     Chr7    1524231         CTCCC   C            0.033249497
    LOC_Os07g03730.1        580     Chr7    1524232         TCCCT   C            0.024631226
    LOC_Os07g03730.1        581     Chr7    1524233         CCCTA   C            0.0748754
    LOC_Os07g03730.1        584     Chr7    1524236         TACTA   C            0.4033113
    LOC_Os07g03730.1        588     Chr7    1524240         AGCTA   C            0.37496027
    LOC_Os07g03730.1        591     Chr7    1524243         TACTA   C            0.20457618
    LOC_Os07g03730.1        603     Chr7    1524255         TACGT   C            0.000665458
    LOC_Os07g03730.1        607     Chr7    1524259         TACGG   C            0.0102844415
    LOC_Os07g03730.1        610     Chr7    1524262         GGCTA   C            0.0144964745
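
The idea behind this conversion can be sketched as walking along the exons of a transcript in transcription order. The code below is an illustrative sketch under stated assumptions, not the repository's script: positions are taken to be 1-based and inclusive, and the exon coordinates are hypothetical rather than read from an annotation file.

```python
def transcript_to_genome(site, exons, strand):
    """Map a 1-based transcript position to a 1-based genome position.

    exons  -- list of (start, end) genome intervals, 1-based inclusive
    strand -- "+" or "-"
    """
    if site < 1:
        raise ValueError("site must be >= 1")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Transcription order: ascending exons on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = site
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("site lies beyond the end of the transcript")


# Hypothetical two-exon transcript, for illustration only.
exons = [(1000, 1099), (2000, 2199)]
print(transcript_to_genome(150, exons, "+"))  # 2049
print(transcript_to_genome(150, exons, "-"))  # 2050
```

On the plus strand, genome coordinates increase with the transcript position, as in the Chr7 example above; on the minus strand they decrease.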

By aggregating all the predictions for each site, we can derive a consensus or summary prediction for that specific genomic location using the script read_level_prediction_to_site_level_prediction.py.

In the command below, replace read_level_prediction.tsv with the converted results obtained from the transcriptome_location_to_genome_location.py script, and specify the desired output file name using the --output option. The script then aggregates the read-level predictions to derive site-level predictions. The resulting predictions include the count of modified bases at each probability cutoff from 0.5 to 0.95 (columns p_0.5 to p_0.95). The total base count is located in the last column.

python script/read_level_prediction_to_site_level_prediction.py \
    --input read_level_prediction.tsv \
    --output site_level_prediction.tsv

=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--input                             Yes         NA                    Input file, read level prediction.
--output                            Yes         NA                    Output file, site level prediction.
=================================   ==========  ===================  ============================================================================================================

    Results:
    transcriptome_id        site    chr     site            motif   p_0.5     p_0.6     p_0.7     p_0.8     p_0.9     p_0.95    total
    LOC_Os05g41060.1        445     Chr5    24059817        TGCGC   2         2         2         2         2         1         63
    LOC_Os05g41060.1        447     Chr5    24059815        CGCCA   1         0         0         0         0         0         63
    LOC_Os05g41060.1        448     Chr5    24059814        GCCAG   4         4         4         4         2         1         63
    LOC_Os05g41060.1        451     Chr5    24059811        AGCGG   10        8         7         6         3         2         63
    LOC_Os05g41060.1        454     Chr5    24059808        GGCAC   15        14        14        11        10        8         63
    LOC_Os05g41060.1        456     Chr5    24059806        CACTG   2         2         2         2         2         1         63
    LOC_Os05g41060.1        462     Chr5    24059800        TACAT   0         0         0         0         0         0         63
    LOC_Os05g41060.1        465     Chr5    24059797        ATCCA   6         6         6         5         5         3         63
    LOC_Os05g41060.1        466     Chr5    24059796        TCCAG   2         2         2         1         1         0         63
    LOC_Os05g41060.1        471     Chr5    24059791        AGCAC   11        8         7         5         3         2         63
    LOC_Os05g41060.1        473     Chr5    24059789        CACAT   1         1         1         0         0         0         63
    LOC_Os05g41060.1        479     Chr5    24059783        TGCTA   1         1         1         1         0         0         63
    LOC_Os05g41060.1        482     Chr5    24059780        TACCT   5         5         5         4         2         2         63
    LOC_Os05g41060.1        483     Chr5    24059779        ACCTC   5         5         5         5         4         4         63
    LOC_Os05g41060.1        485     Chr5    24059777        CTCTG   4         4         4         3         2         1         63
    LOC_Os05g41060.1        508     Chr5    24059754        GGCTT   2         2         2         1         1         1         63
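
The aggregation can be pictured with the following sketch, which counts, for each site, how many reads reach each probability cutoff. It is not the repository's script: treating a read as modified when its probability is greater than or equal to the cutoff is an assumption, and the four read probabilities are invented for the example.

```python
from collections import defaultdict

CUTOFFS = (0.5, 0.6, 0.7, 0.8, 0.9, 0.95)


def aggregate_reads(read_predictions, cutoffs=CUTOFFS):
    """Aggregate (site_key, probability) pairs, one per read, by site.

    Returns {site_key: {"counts": [n per cutoff], "total": n_reads}}.
    """
    summary = defaultdict(
        lambda: {"counts": [0] * len(cutoffs), "total": 0})
    for site_key, probability in read_predictions:
        entry = summary[site_key]
        entry["total"] += 1
        for i, cutoff in enumerate(cutoffs):
            if probability >= cutoff:  # assumed comparison rule
                entry["counts"][i] += 1
    return dict(summary)


# Hypothetical read-level probabilities for a single site.
reads = [(("Chr5", 24059808), p) for p in (0.97, 0.85, 0.55, 0.10)]
site = aggregate_reads(reads)[("Chr5", 24059808)]
print(site["counts"], site["total"])  # [3, 2, 2, 2, 1, 1] 4

# Modification rate at the 0.5 cutoff: modified reads / total reads.
print(site["counts"][0] / site["total"])  # 0.75
```

Applied to the table above, the same ratio for site 454 at the 0.5 cutoff would be 15/63, or about 0.24.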
