modCnet is a deep learning framework designed to harness the power of Oxford Nanopore direct RNA sequencing for precise identification of N4-acetylcytidine (ac4C) and 5-methylcytosine (m5C) sites, a crucial aspect in RNA modification studies. By effectively distinguishing ac4C and m5C from unmodified cytidine, modCnet enables accurate estimation of modification rates at each ac4C site. Through rigorous validation on independent in vitro datasets and a human cell line, modCnet showcases its robustness, versatility, and immense potential in advancing the understanding and exploration of ac4C modifications in mRNA.
modCnet is implemented in Python. The following modules are needed to run modCnet:
module | version |
---|---|
minimap2 | 2.17-r941 |
python | 3.7.12 |
h5py | 3.7.0 |
statsmodels | 0.10.0 |
joblib | 0.16.0 |
scikit-learn | 0.22 |
torch | 1.9.1 |
guppy | 6.1.5 |
tombo | 1.5.1 |
ont_vbz_hdf_plugin | 1.0.1 |
ont-fast5-api | 4.1.1 |
numpy | 1.19.5 |
scipy | 1.7.0 |
You can install the dependencies manually. Conda is recommended for running modCnet. Create a new conda environment and activate it:
conda create -n modCnet python=3.7.12
conda activate modCnet
Install the required modules:
conda config --add channels conda-forge
conda config --add channels bioconda
conda install -c conda-forge scipy=1.7.0
conda install -c bioconda minimap2=2.17
conda install -c conda-forge numpy=1.19.5
conda install -c anaconda h5py=3.7.0
conda install -c conda-forge joblib=0.16.0
conda install -c anaconda scikit-learn=0.22
conda install -c bioconda ont-tombo=1.5.1
conda install -c bioconda ont_vbz_hdf_plugin=1.0.1
conda install -c bioconda ont-fast5-api=4.1.1
conda install -c conda-forge statsmodels=0.10.0
pip install torch==1.9.1
Or, some of the modules can be installed by pip:
pip install numpy==1.19.5
pip install h5py==3.7.0
pip install statsmodels==0.10.0
pip install joblib==0.16.0
pip install scikit-learn==0.22
pip install ont-tombo==1.5.1
pip install ont-fast5-api==4.1.1
pip install scipy==1.7.0
Guppy for basecalling can be obtained from Oxford Nanopore Technologies or from this mirror. The package used here is an RPM, so convert it to a .deb package with alien and then install it using dpkg:
alien ont-guppy-cpu-6.1.5-1.el7.x86_64.rpm
dpkg -i ont-guppy-cpu-6.1.5-1.el7.x86_64.deb
libhdf5 and libcrypto are required for running guppy.
The entire installation will take about 10 minutes. After installing all the essential packages, reset the environment’s state by deactivating and reactivating the environment:
conda deactivate
conda activate modCnet
We also provide a yaml file in the repository so you can install the dependencies through the configuration file:
conda env create -f modCnet.yaml
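After installation, it can be useful to confirm that the environment matches the version table above. The following is a minimal sketch and not part of modCnet: the helper names `installed_versions` and `find_mismatches` are hypothetical, and the comparison is a strict string match, so a build suffix such as a CUDA tag on torch would be reported as a mismatch.

```python
try:
    from importlib import metadata  # Python 3.8 and later
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    import importlib_metadata as metadata

# Pinned versions copied from the table above (pip distribution names).
REQUIRED = {
    "numpy": "1.19.5",
    "scipy": "1.7.0",
    "h5py": "3.7.0",
    "statsmodels": "0.10.0",
    "joblib": "0.16.0",
    "scikit-learn": "0.22",
    "torch": "1.9.1",
    "ont-tombo": "1.5.1",
    "ont-fast5-api": "4.1.1",
}


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(required, installed):
    """Return {name: (wanted, found)} for every package whose version differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in required.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    report = find_mismatches(REQUIRED, installed_versions(REQUIRED))
    for name, (wanted, found) in report.items():
        print(f"{name}: expected {wanted}, found {found}")
```

Non-Python tools such as minimap2 and guppy are not covered by this check and need to be verified from the command line.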
Guppy is used for basecalling in modCnet. Guppy takes fast5 files as input and generates fastq files from the current signals. This step can be time-consuming and may require several hours or even days to complete, depending on the computational capacity available:
guppy_basecaller -i demo_data/IVT_fast5 -s demo_data/IVT_fast5_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
-i=DIR Yes NA Input directory, containing FAST5 files generated by the nanopore sequencing platform
-s=DIR Yes NA Output directory, containing FAST5 files as well as basecalled sequences.
--num_callers=NUM No 1 Number of processes to run.
--fast5_out Yes None Output FAST5 files to the directory.
--config=STR Yes None The configuration file used in modCnet is "rna_r9.4.1_70bps_hac.cfg"; it should be adjusted according to the DRS platform.
--recursive Yes None This argument allows recursive processing or batch processing of files.
================================= ========== =================== ============================================================================================================
If the fast5 reads are stored in multi-read format, ont_fast5_api is recommended for converting multi-read fast5 files to single-read fast5 files. A multi-read fast5 file is usually about 200-300 MB in size. Convert multi-read files to single-read files::
multi_to_single_fast5 -i demo_data/IVT_fast5_guppy -s demo_data/IVT_fast5_guppy_single --recursive
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
-i=DIR Yes NA Input directory, containing multi-reads FAST5 files.
-s=DIR Yes NA Output directory, containing single-read FAST5 files.
-t=NUM No 1 Number of processes to run.
--recursive Yes NA This argument allows recursive processing or batch processing of files.
================================= ========== =================== ============================================================================================================
The resquiggling algorithm is the basis for the Tombo framework. It takes as input a read file (in FAST5 format) containing raw signal and associated base calls. The base calls are mapped to a genome or transcriptome reference, and the raw signal is then assigned to the reference sequence based on an expected current level model. Tombo is used for resquiggling in modCnet::
tombo resquiggle --overwrite --basecall-group Basecall_1D_000 demo_data/IVT_fast5_guppy_single demo_data/IVT_DRS_ac4C.reference.fasta --processes 40 --fit-global-scale --include-event-stdev
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--overwrite Yes NA Overwrite previous corrected group in FAST5 files.
--basecall-group No Basecall_1D_000 FAST5 group from which to obtain the original basecalls.
--processes No 1 Number of processes to run.
--fit-global-scale No NA Apply a scaling factor.
--include-event-stdev No NA Include the standard deviation.
args[0] Yes NA Fast5 basedir.
args[1] Yes NA Reference transcripts, in fasta format.
================================= ========== =================== ============================================================================================================
minimap2 is used to map basecalled sequences to reference transcripts::
cat demo_data/IVT_fast5_guppy/pass/*.fastq >demo_data/IVT.fastq
minimap2 -ax map-ont demo_data/IVT_DRS_ac4C.reference.fasta demo_data/IVT.fastq >demo_data/IVT.sam
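Before extracting features, it can help to check how many reads actually aligned to each reference transcript. The sketch below is an illustration and not part of modCnet: the function name `count_primary_alignments` is hypothetical, and it uses only the standard SAM columns (read name, flag, reference name) and the standard flag bits for unmapped, secondary and supplementary records.

```python
from collections import Counter

# Standard SAM flag bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_primary_alignments(sam_lines):
    """Count primary, mapped alignments per reference name from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        counts[fields[2]] += 1
    return counts


# Usage with the file produced above:
# with open("demo_data/IVT.sam") as handle:
#     for reference, n_reads in count_primary_alignments(handle).most_common():
#         print(reference, n_reads)
```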
Extract features from the fast5 files. modCnet takes features corresponding to 5-mers as input.
python script/feature_extraction.py --input demo_data/IVT_fast5_guppy_single \
--reference demo_data/IVT_DRS_ac4C.reference.fasta \
--sam demo_data/IVT.sam \
--output demo_data/IVT.feature \
--clip 10 \
--motif NNCNN
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--input Yes NA Fast5 basedir.
--reference Yes NA Reference transcripts, in fasta format.
--sam Yes NA Alignment results, output from minimap2.
--output Yes NA Output file containing current signals.
--clip Yes NA Number of bases to clip at both ends.
--motif Yes NA Motif pattern to extract.
================================= ========== =================== ============================================================================================================
The base symbols of the motif follow the IUB code standard. Here is the full definition of IUB base symbols:
symbol | bases |
---|---|
A | A |
C | C |
G | G |
T | T |
M | AC |
V | ACG |
R | AG |
H | ACT |
W | AT |
D | AGT |
S | CG |
B | CGT |
Y | CT |
N | ACGT |
K | GT |
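To make the motif syntax concrete, the table above can be turned into a regular expression and used to scan a reference sequence. This is a minimal sketch under stated assumptions, not the code used by feature_extraction.py: the names `IUB_CODES`, `motif_to_regex` and `find_motif_sites` are hypothetical, and the reported position is the 0-based index of the center base of each match.

```python
import re

# IUB symbols and the bases they stand for, copied from the table above.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "V": "ACG", "R": "AG", "H": "ACT",
    "W": "AT", "D": "AGT", "S": "CG", "B": "CGT",
    "Y": "CT", "N": "ACGT", "K": "GT",
}


def motif_to_regex(motif):
    """Compile an IUB motif such as 'NNCNN' into a regular expression."""
    parts = []
    for symbol in motif.upper():
        bases = IUB_CODES[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def find_motif_sites(sequence, motif):
    """Yield (center_index, kmer) for every match, including overlapping ones."""
    pattern = motif_to_regex(motif)
    width = len(motif)
    sequence = sequence.upper()
    # Slide a window one base at a time so overlapping matches are not missed.
    for start in range(len(sequence) - width + 1):
        kmer = sequence[start:start + width]
        if pattern.fullmatch(kmer):
            yield start + width // 2, kmer


for center, kmer in find_motif_sites("GGCAACGT", "NNCNN"):
    print(center, kmer)
```

With the motif NNCNN, every 5-mer centered on a C is reported. A narrower motif such as NNCGN would restrict the sites to cytidines followed by G.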
You can train modCnet with your own data. To evaluate the model's generalization ability and avoid overfitting, test data is needed during training. The train-test split can be performed with the script train_test_split.py provided in the repository. By default, 80% of the instances are used for training and 20% for testing. The ratio can be customized with the argument --train_ratio to suit the requirements of the problem and the size of the dataset.
usage: train_test_split.py [-h] [--input_file INPUT_FILE]
[--train_file TRAIN_FILE] [--test_file TEST_FILE]
[--train_ratio TRAIN_RATIO]
Split a feature file into training and testing sets.
optional arguments:
-h, --help show this help message and exit
--input_file INPUT_FILE Path to the input feature file
--train_file TRAIN_FILE Path to the train feature file
--test_file TEST_FILE Path to the test feature file
--train_ratio TRAIN_RATIO Ratio of instances to use for training (default: 0.8)
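The idea behind the split can be sketched in a few lines. This is an illustration only, not the repository's train_test_split.py: the function name `split_lines` and the `seed` parameter are hypothetical, and whether the original script shuffles the instances before splitting is an assumption.

```python
import random


def split_lines(lines, train_ratio=0.8, seed=0):
    """Shuffle feature lines and split them into (train, test) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Usage on a feature file (the path is only an example):
# with open("demo_data/C.feature.tsv") as handle:
#     train, test = split_lines(handle.readlines(), train_ratio=0.8)
```

Splitting by line assumes that each line of the feature file is one independent instance.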
To train the modCnet model from scratch on your own dataset, set the --run_mode argument to "train". modCnet accepts both modified and unmodified feature files as input. Test feature files are also required to evaluate the model's performance. Specify the path where the model is saved with the argument --new_model. The number of training epochs is set with the argument --epoch, and the model state is saved at the end of each epoch. modCnet preferentially uses the GPU for training if CUDA is available on your device; otherwise, it runs in CPU mode. Training time depends on the size of your dataset and the available computational capacity, and may last several hours.
python script/modCnet.py --run_mode train \
--model_type C/ac4C \
--new_model demo_data/model/C_ac4C.pkl \
--train_data_C demo_data/C.feature.train.tsv \
--train_data_ac4C demo_data/ac4C.feature.train.tsv \
--test_data_C demo_data/C.feature.test.tsv \
--test_data_ac4C demo_data/ac4C.feature.test.tsv \
--epoch 100
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--run_mode Yes NA Run mode [train or predict].
--model_type Yes NA Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model Yes NA The path to save model.
--train_data_C Yes NA Train data, unmodified.
--train_data_ac4C Yes NA Train data, ac4C-modified.
--test_data_C Yes NA Test data, unmodified.
--test_data_ac4C Yes NA Test data, ac4C-modified.
--epoch Yes NA Training epoch.
================================= ========== =================== ============================================================================================================
This is a demo that trains a C/m5C model, which distinguishes C from m5C.
python script/modCnet.py --run_mode train \
--model_type C/m5C \
--new_model demo_data/model/C_m5C.pkl \
--train_data_C demo_data/C.feature.train.tsv \
--train_data_m5C demo_data/m5C.feature.train.tsv \
--test_data_C demo_data/C.feature.test.tsv \
--test_data_m5C demo_data/m5C.feature.test.tsv \
--epoch 100
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--run_mode Yes NA Run mode [train or predict].
--model_type Yes NA Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model Yes NA The path to save model.
--train_data_C Yes NA Train data, unmodified.
--train_data_m5C Yes NA Train data, m5C-modified.
--test_data_C Yes NA Test data, unmodified.
--test_data_m5C Yes NA Test data, m5C-modified.
--epoch Yes NA Training epoch.
================================= ========== =================== ============================================================================================================
This is a demo that trains an m5C/ac4C model, which distinguishes ac4C from m5C.
python script/modCnet.py --run_mode train \
--model_type m5C/ac4C \
--new_model demo_data/model/m5C_ac4C.pkl \
--train_data_m5C demo_data/m5C.feature.train.tsv \
--train_data_ac4C demo_data/ac4C.feature.train.tsv \
--test_data_m5C demo_data/m5C.feature.test.tsv \
--test_data_ac4C demo_data/ac4C.feature.test.tsv \
--epoch 100
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--run_mode Yes NA Run mode [train or predict].
--model_type Yes NA Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model Yes NA The path to save model.
--train_data_m5C Yes NA Train data, m5C-modified.
--train_data_ac4C Yes NA Train data, ac4C-modified.
--test_data_m5C Yes NA Test data, m5C-modified.
--test_data_ac4C Yes NA Test data, ac4C-modified.
--epoch Yes NA Training epoch.
================================= ========== =================== ============================================================================================================
This is a demo that trains a 3-class C/m5C/ac4C model, which distinguishes C, m5C and ac4C from one another.
python script/modCnet.py --run_mode train \
--model_type C/m5C/ac4C \
--new_model demo_data/model/C_m5C_ac4C.pkl \
--train_data_C demo_data/C.feature.train.tsv \
--train_data_m5C demo_data/m5C.feature.train.tsv \
--train_data_ac4C demo_data/ac4C.feature.train.tsv \
--test_data_C demo_data/C.feature.test.tsv \
--test_data_m5C demo_data/m5C.feature.test.tsv \
--test_data_ac4C demo_data/ac4C.feature.test.tsv \
--epoch 100
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--run_mode Yes NA Run mode [train or predict].
--model_type Yes NA Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--new_model Yes NA The path to save model.
--train_data_C Yes NA Train data, unmodified.
--train_data_m5C Yes NA Train data, m5C-modified.
--train_data_ac4C Yes NA Train data, ac4C-modified.
--test_data_C Yes NA Test data, unmodified.
--test_data_m5C Yes NA Test data, m5C-modified.
--test_data_ac4C Yes NA Test data, ac4C-modified.
--epoch Yes NA Training epoch.
================================= ========== =================== ============================================================================================================
We provide 4 pretrained models in the model directory of the repository: a C/ac4C model, a C/m5C model, an m5C/ac4C model and a 3-class C/m5C/ac4C model. You can load a pretrained model or your own model to predict new data.
python script/modCnet.py --run_mode predict \
--model_type C/m5C/ac4C \
--pretrained_model model/C_m5C_ac4C.pkl \
--feature_file demo_data/test.feature.tsv \
--predict_result demo_data/test.prediction.tsv
Arguments:
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--run_mode Yes NA Run mode [train or predict].
--model_type Yes NA Model type [C/ac4C, C/m5C, m5C/ac4C or C/m5C/ac4C]
--pretrained_model Yes NA The path to pretrained model.
--feature_file Yes NA Feature file.
--predict_result Yes NA The path to the output prediction file.
================================= ========== =================== ============================================================================================================
The following is an example of the prediction output. The results include the columns transcript_id, site, motif, read_id, prediction and probability. The probability is the read-level confidence score that the site is modified. Users can apply a cutoff threshold to the probability to filter out low-confidence predictions.
transcript_id site motif read_id prediction probability
LOC_Os07g03730.1 560 GGCAA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.0077731693
LOC_Os07g03730.1 563 AACGT 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.00074390427
LOC_Os07g03730.1 566 GTCGA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.0019359465
LOC_Os07g03730.1 569 GACGG 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.014535325
LOC_Os07g03730.1 572 GGCGA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 ac4C 0.86820465
LOC_Os07g03730.1 577 ATCTC 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.0008406919
LOC_Os07g03730.1 579 CTCCC 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.033249497
LOC_Os07g03730.1 580 TCCCT 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.024631226
LOC_Os07g03730.1 581 CCCTA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.0748754
LOC_Os07g03730.1 584 TACTA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.4033113
LOC_Os07g03730.1 588 AGCTA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.37496027
LOC_Os07g03730.1 591 TACTA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.20457618
LOC_Os07g03730.1 603 TACGT 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.000665458
LOC_Os07g03730.1 607 TACGG 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.0102844415
LOC_Os07g03730.1 610 GGCTA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.0144964745
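Applying a cutoff to the read-level output can be sketched as follows. This is a minimal illustration and not part of modCnet: the function name `filter_predictions` is hypothetical, the default cutoff of 0.8 is an arbitrary example value, and the code assumes a whitespace-delimited file with the header row shown above.

```python
def filter_predictions(lines, cutoff=0.8, label="ac4C"):
    """Return rows (as dicts) predicted as `label` with probability >= cutoff."""
    rows = iter(lines)
    header = next(rows).split()
    kept = []
    for line in rows:
        values = line.split()
        # Skip blank or malformed lines that do not match the header.
        if len(values) != len(header):
            continue
        row = dict(zip(header, values))
        if row["prediction"] == label and float(row["probability"]) >= cutoff:
            kept.append(row)
    return kept


sample = [
    "transcript_id site motif read_id prediction probability",
    "LOC_Os07g03730.1 569 GACGG 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 C 0.014535325",
    "LOC_Os07g03730.1 572 GGCGA 5bb201ce-50e6-4261-8f8b-2b2b51bb9ea6 ac4C 0.86820465",
]
for row in filter_predictions(sample, cutoff=0.8):
    print(row["transcript_id"], row["site"], row["probability"])
```

A higher cutoff keeps fewer but more confident calls. In the example above, raising it to 0.9 would drop the site at position 572.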
We also provide a script to convert results from transcriptome location to genome location.
python script/transcriptome_location_to_genome_location.py \
--input predictions.tsv \
--output predictions_genome_loc.tsv \
--gff /path/to/your/genome_annotation.gtf
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--input Yes NA Input file, transcriptome location.
--output Yes NA Output file, genome location.
--gff Yes NA Annotation file.
================================= ========== =================== ============================================================================================================
Results:
transcript_id site chr site motif prediction probability
LOC_Os07g03730.1 560 Chr7 1524212 GGCAA C 0.0077731693
LOC_Os07g03730.1 563 Chr7 1524215 AACGT C 0.00074390427
LOC_Os07g03730.1 566 Chr7 1524218 GTCGA C 0.0019359465
LOC_Os07g03730.1 569 Chr7 1524221 GACGG C 0.014535325
LOC_Os07g03730.1 572 Chr7 1524224 GGCGA ac4C 0.86820465
LOC_Os07g03730.1 577 Chr7 1524229 ATCTC C 0.0008406919
LOC_Os07g03730.1 579 Chr7 1524231 CTCCC C 0.033249497
LOC_Os07g03730.1 580 Chr7 1524232 TCCCT C 0.024631226
LOC_Os07g03730.1 581 Chr7 1524233 CCCTA C 0.0748754
LOC_Os07g03730.1 584 Chr7 1524236 TACTA C 0.4033113
LOC_Os07g03730.1 588 Chr7 1524240 AGCTA C 0.37496027
LOC_Os07g03730.1 591 Chr7 1524243 TACTA C 0.20457618
LOC_Os07g03730.1 603 Chr7 1524255 TACGT C 0.000665458
LOC_Os07g03730.1 607 Chr7 1524259 TACGG C 0.0102844415
LOC_Os07g03730.1 610 Chr7 1524262 GGCTA C 0.0144964745
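The coordinate conversion itself amounts to walking along the exons of a transcript. The sketch below shows the general technique and is not the code of transcriptome_location_to_genome_location.py: the function name `transcript_to_genome` is hypothetical, the exon coordinates in the usage lines are invented for illustration, and both transcript positions and genome coordinates are assumed to be 1-based.

```python
def transcript_to_genome(position, exons, strand):
    """Map a 1-based transcript position to a 1-based genome coordinate.

    exons  -- list of (start, end) genomic intervals, 1-based and inclusive
    strand -- "+" or "-"
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if position < 1:
        raise ValueError("position must be 1 or greater")
    # Transcripts on the minus strand start at the exon with the highest coordinate.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = position
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical two-exon transcript.
exons = [(101, 110), (201, 210)]
print(transcript_to_genome(11, exons, "+"))
print(transcript_to_genome(11, exons, "-"))
```

On the plus strand, genome coordinates increase with transcript position; on the minus strand they decrease, so the direction of the site column depends on the strand of the transcript.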
By aggregating all the read-level predictions for each site, a site-level summary prediction can be derived for each genomic location using the script read_level_prediction_to_site_level_prediction.py. In the command below, replace read_level_prediction.tsv with the converted results obtained from the transcriptome_location_to_genome_location.py script, and specify the output file name with the --output option. For each site, the output reports the number of bases predicted as modified at probability cutoffs from 0.5 to 0.95 (columns p_0.5 to p_0.95). The total base count is in the last column.
python script/read_level_prediction_to_site_level_prediction.py \
--input read_level_prediction.tsv \
--output site_level_prediction.tsv
================================= ========== =================== ============================================================================================================
Argument name Required Default Description
================================= ========== =================== ============================================================================================================
--input Yes NA Input file, read level prediction.
--output Yes NA Output file, site level prediction.
================================= ========== =================== ============================================================================================================
Results:
transcriptome_id site chr site motif p_0.5 p_0.6 p_0.7 p_0.8 p_0.9 p_0.95 total
LOC_Os05g41060.1 445 Chr5 24059817 TGCGC 2 2 2 2 2 1 63
LOC_Os05g41060.1 447 Chr5 24059815 CGCCA 1 0 0 0 0 0 63
LOC_Os05g41060.1 448 Chr5 24059814 GCCAG 4 4 4 4 2 1 63
LOC_Os05g41060.1 451 Chr5 24059811 AGCGG 10 8 7 6 3 2 63
LOC_Os05g41060.1 454 Chr5 24059808 GGCAC 15 14 14 11 10 8 63
LOC_Os05g41060.1 456 Chr5 24059806 CACTG 2 2 2 2 2 1 63
LOC_Os05g41060.1 462 Chr5 24059800 TACAT 0 0 0 0 0 0 63
LOC_Os05g41060.1 465 Chr5 24059797 ATCCA 6 6 6 5 5 3 63
LOC_Os05g41060.1 466 Chr5 24059796 TCCAG 2 2 2 1 1 0 63
LOC_Os05g41060.1 471 Chr5 24059791 AGCAC 11 8 7 5 3 2 63
LOC_Os05g41060.1 473 Chr5 24059789 CACAT 1 1 1 0 0 0 63
LOC_Os05g41060.1 479 Chr5 24059783 TGCTA 1 1 1 1 0 0 63
LOC_Os05g41060.1 482 Chr5 24059780 TACCT 5 5 5 4 2 2 63
LOC_Os05g41060.1 483 Chr5 24059779 ACCTC 5 5 5 5 4 4 63
LOC_Os05g41060.1 485 Chr5 24059777 CTCTG 4 4 4 3 2 1 63
LOC_Os05g41060.1 508 Chr5 24059754 GGCTT 2 2 2 1 1 1 63
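The aggregation and the resulting modification rate can be sketched as follows. This is an illustration under stated assumptions, not the repository script: the names `aggregate_sites` and `modification_rate` are hypothetical, a read is counted as modified when its probability is greater than or equal to the cutoff, and the rate is taken as the modified count divided by the total.

```python
from collections import defaultdict

# Cutoffs matching the p_0.5 ... p_0.95 columns in the table above.
CUTOFFS = (0.5, 0.6, 0.7, 0.8, 0.9, 0.95)


def aggregate_sites(read_rows, cutoffs=CUTOFFS):
    """Aggregate (site_key, probability) pairs into per-site counts.

    Returns {site_key: (counts_per_cutoff, total_reads)}.
    """
    counts = defaultdict(lambda: [0] * len(cutoffs))
    totals = defaultdict(int)
    for site_key, probability in read_rows:
        totals[site_key] += 1
        for index, cutoff in enumerate(cutoffs):
            if probability >= cutoff:
                counts[site_key][index] += 1
    return {key: (counts[key], totals[key]) for key in totals}


def modification_rate(modified, total):
    """Fraction of reads called modified at a site; 0.0 when there are no reads."""
    return modified / total if total else 0.0


# Site 454 in the table above: 15 of 63 reads pass the 0.5 cutoff.
print(round(modification_rate(15, 63), 3))
```

The choice of cutoff trades sensitivity for precision: for site 454, the estimated rate falls from 15/63 at a cutoff of 0.5 to 8/63 at a cutoff of 0.95.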