
cherry's Introduction

CHERRY is a Python library for predicting interactions between viral and prokaryotic genomes. It is based on a deep learning model that consists of a graph convolutional encoder and a link prediction decoder.

News !!!

  1. 🚀  This folder will no longer be maintained. The program has been updated and moved to PhaBOX, which is more user-friendly. Hope you will enjoy it.

  2. 🚀  Our web server based on PhaBOX for all phage-related tasks is available! You can visit WebServer to use the GUI.

  3. 📘  If you want to use CHERRY on your own bacterial assemblies, please visit https://github.com/KennthShang/CHERRY_MAGs. This MAGs version lets you use your own bacterial assemblies and predict the interactions between your phages and bacteria.

Overview

CHERRY supports two kinds of tasks:

  1. Host prediction for virus
  2. Identifying viruses that infect pathogenic bacteria

Required Dependencies

Easy way to install

Note: we suggest installing all packages with conda (both Miniconda and Anaconda work).

After cloning this repository, you can use conda to create the environment from CHERRY.yaml. The command is: conda env create -f CHERRY.yaml -n cherry

Prepare the database

Because of GitHub's file size limits, the database is shipped compressed. Before using CHERRY, you need to unpack it using the following commands.

cd CHERRY
conda env create -f CHERRY.yaml -n cherry
conda activate cherry
cd dataset
bzip2 -d protein.fasta.bz2
bzip2 -d nucl.fasta.bz2
cd ../prokaryote
gunzip *
cd ..

From then on, you only need to activate your 'cherry' environment before using CHERRY.

conda activate cherry
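If you prefer to script the unpacking step, for example on a machine without bzip2 or gunzip, the shell commands above can be approximated in Python. This is only a sketch using the standard library; the function names `decompress` and `unpack_database` are illustrative and not part of CHERRY, and it assumes the files in prokaryote/ end in .gz.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def decompress(src, opener):
    """Decompress src next to itself, dropping the last suffix, then delete src."""
    src = Path(src)
    dst = src.with_suffix("")  # e.g. protein.fasta.bz2 -> protein.fasta
    with opener(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    src.unlink()  # bzip2 -d and gunzip also remove the compressed file
    return dst


def unpack_database(repo_root):
    """Unpack dataset/*.fasta.bz2 and every .gz file in prokaryote/."""
    repo_root = Path(repo_root)
    unpacked = []
    for name in ("protein.fasta.bz2", "nucl.fasta.bz2"):
        src = repo_root / "dataset" / name
        if src.exists():  # skip files that were already unpacked
            unpacked.append(decompress(src, bz2.open))
    # Assumption: the compressed prokaryotic genomes carry a .gz suffix.
    for src in sorted((repo_root / "prokaryote").glob("*.gz")):
        unpacked.append(decompress(src, gzip.open))
    return unpacked
```

Calling `unpack_database(".")` from the root of the cloned repository would then play the role of the bzip2 and gunzip lines; creating and activating the conda environment still has to be done in the shell.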

Usage

1 Predicting host for your viruses

The input should be a fasta file containing the viral sequences. We provide an example file named "test_contigs.fa". Then, the only command that you need to run is

python run_Speed_up.py [--contigs INPUT_FA] [--len MINIMUM_LEN] [--model MODEL] [--topk TOPK_PRED]

Options

  --contigs INPUT_FA
                        input fasta file
  --len MINIMUM_LEN
                        predict only for sequences >= len bp (default 8000)
  --model MODEL (pretrain or retrain)
                        predict hosts with pretrained or retrained parameters (default pretrain)
  --topk TOPK_PRED
                        report the host predictions with the top k scores (default 1)

Example

Prediction at the species level with pretrained parameters:

python run_Speed_up.py --contigs test_contigs.fa --len 8000 --model pretrain --topk 1

Note: You usually do not need to retrain the model, especially if you do not have a GPU.
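Because --len silently restricts prediction to sequences of at least MINIMUM_LEN bp, it can help to check in advance which contigs fall below the cutoff. The following is a minimal sketch with a hand-written FASTA parser; `read_fasta` and `split_by_length` are illustrative helpers and not part of CHERRY, and the sketch does not reproduce how CHERRY itself reads the input.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_length(handle, min_len=8000):
    """Return (kept, skipped) contig names for a given minimum length."""
    kept, skipped = [], []
    for name, seq in read_fasta(handle):
        (kept if len(seq) >= min_len else skipped).append(name)
    return kept, skipped


# Tiny in-memory demo with a cutoff of 5 bp instead of the default 8000.
demo = io.StringIO(">a some description\nACGT\nAC\n>b\nAC\n")
kept, skipped = split_by_length(demo, min_len=5)
print(kept, skipped)  # prints: ['a'] ['b']
```

On real data you would pass an open file instead, for example `with open("test_contigs.fa") as fh: kept, skipped = split_by_length(fh)`.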

OUTPUT

The output is a CSV file ("final_prediction.csv") that contains the prediction for each virus. The column contig_name is the accession from the input.

We supply a script to convert the predictions into a complete taxonomy tree. Use the following command to generate the taxonomy tree:

python run_Taxonomy_tree.py [--k TOPK_PRED]

Because the "final_prediction.csv" file contains k predictions, you need to specify k to generate the tree. The output of the program is 'Top_k_prediction_taxonomy.csv'.
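For downstream analysis, the prediction file can be loaded with Python's csv module and indexed by contig. The sketch below relies only on the contig_name column described above; the other column names in the demo (`Top_1_label`, `Score_1`) are hypothetical placeholders, so check the header of your own file before using them.

```python
import csv
import io


def load_predictions(handle):
    """Return {contig_name: row_dict} from an open prediction CSV handle."""
    reader = csv.DictReader(handle)
    if not reader.fieldnames or "contig_name" not in reader.fieldnames:
        raise ValueError("expected a 'contig_name' column in the prediction file")
    return {row["contig_name"]: row for row in reader}


# Demo on an in-memory file. 'Top_1_label' and 'Score_1' are HYPOTHETICAL
# column names used only for illustration.
demo = io.StringIO(
    "contig_name,Top_1_label,Score_1\n"
    "contig_0,Escherichia coli,0.97\n"
    "contig_1,Klebsiella pneumoniae,0.91\n"
)
predictions = load_predictions(demo)
print(predictions["contig_0"]["Top_1_label"])  # prints: Escherichia coli
```

With a real run, open the file with `open("final_prediction.csv", newline="")` and pass the handle to `load_predictions`.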

Extension of the virus-prokaryote interactions database

If you know more virus-prokaryote interactions than those used for our pre-trained model (given in Interactiondata), you can add them to train a custom model. To train your model, follow these steps:

  1. Add your viral genomes to the nucl.fasta file and run python refresh.py to generate new protein.fasta and database_gene_to_genome.csv files. They will automatically replace the old ones in the dataset/ folder.
  2. Add the host taxonomy entries to dataset/virus.csv. The header fields are: Accession (of the virus), Superkingdom, Phylum, Class, Order, Family, Genus, Species. Of the taxonomy fields, only Species is required; you can leave the others blank if you do not know them. The accession of the virus must be the same as in your fasta entry.
  3. Place your prokaryotic genomes in the prokaryote/ folder and add an entry for each in dataset/prokaryote.csv. The guidelines are the same as in the previous step.
  4. Use retrain as the parameter for the --model option to run the program.
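The taxonomy entries for dataset/virus.csv can be built programmatically so that the column order and the required fields are checked before anything is written. This is a sketch under the assumption that the file is a plain comma-separated file with the eight fields listed above, in that order; `make_virus_row` and `append_rows` are illustrative helpers and not part of CHERRY, and the accession in the demo is a made-up placeholder.

```python
import csv
import io

# Field order as listed in the steps above.
FIELDS = ["Accession", "Superkingdom", "Phylum", "Class",
          "Order", "Family", "Genus", "Species"]


def make_virus_row(accession, species, **ranks):
    """Build one row; Accession and Species are required, other ranks optional."""
    if not accession or not species:
        raise ValueError("Accession and Species must be provided")
    unknown = set(ranks) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown taxonomy fields: {sorted(unknown)}")
    row = dict.fromkeys(FIELDS, "")  # unknown ranks stay blank
    row.update(ranks)
    row["Accession"], row["Species"] = accession, species
    return row


def append_rows(handle, rows):
    """Write rows (without a header line) to an open CSV handle."""
    csv.DictWriter(handle, fieldnames=FIELDS).writerows(rows)


# Demo with a placeholder accession; it must match your fasta header.
buf = io.StringIO()
append_rows(buf, [make_virus_row("ACC1", "Escherichia coli", Genus="Escherichia")])
print(buf.getvalue().strip())  # prints: ACC1,,,,,,Escherichia,Escherichia coli
```

To extend the real file, open it in append mode, for example `open("dataset/virus.csv", "a", newline="")`, and keep a backup copy first.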

2 Predicting viruses infecting prokaryotes

If you want to predict candidate viruses that infect a set of given bacteria, you need to supply three kinds of inputs:

  1. Place your prokaryotic genomes in new_prokaryote/ folder.
  2. A fasta file containing the virus sequences.
  3. Add the taxonomic information in 'dataset/prokaryote.csv'. (An example can be found in the section Extension of the virus-prokaryote interactions database.)

Then, the program will output which viruses in your fasta file will infect the prokaryotes in the new_prokaryote/ folder.

The command is similar to the previous one, but two more parameters are needed:

python run_Speed_up.py [--mode MODE] [--t THRESHOLD]

Example

python run_Speed_up.py --contigs test_contigs.fa --mode prokaryote --t 0.98

Options

  --mode MODE (prokaryote or virus)
                        Switch mode for predicting virus or predicting host
  --t THRESHOLD
                        The confidence threshold for predicting viruses; the higher the threshold, the higher the precision (default 0.98)
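To illustrate the effect of --t, the toy example below filters a list of candidate interactions by score. It is not CHERRY's code: the scores and names are hypothetical, and whether CHERRY compares with >= or > is an assumption made here for the sketch.

```python
def filter_by_threshold(candidates, threshold=0.98):
    """Keep (virus, prokaryote, score) triples whose score is >= threshold."""
    return [c for c in candidates if c[2] >= threshold]


# HYPOTHETICAL candidate interactions and scores, for illustration only.
candidates = [
    ("phage_A", "genome_1", 0.995),
    ("phage_B", "genome_1", 0.981),
    ("phage_C", "genome_2", 0.90),
]
default = filter_by_threshold(candidates)       # default threshold 0.98
strict = filter_by_threshold(candidates, 0.99)  # stricter threshold
print(len(default), len(strict))  # prints: 2 1
```

Raising the threshold keeps fewer, higher-scoring predictions, which is why a higher --t favours precision at the cost of reporting fewer viruses.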

OUTPUT

The output is a CSV file that contains the predictions. The column prokaryote is the accession of each prokaryotic genome you provided. The column virus is the list of viruses that might infect that genome.

References

The paper is published in Briefings in Bioinformatics:

Jiayu Shang, Yanni Sun, CHERRY: a Computational metHod for accuratE pRediction of virus–pRokarYotic interactions using a graph encoder–decoder model, Briefings in Bioinformatics, 2022, bbac182, https://doi.org/10.1093/bib/bbac182

The arXiv version can be found via: CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Contact

If you have any questions, please email us: [email protected]

Notes

  1. If the program outputs the following error (which is caused by your machine): Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library. You can run the command export MKL_SERVICE_FORCE_INTEL=1 before running run_Speed_up.py.
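If CHERRY is launched from another Python script rather than from an interactive shell, the same environment variable can be set for the child process only. The sketch below uses the standard subprocess module; the helper names are illustrative and not part of CHERRY, and it assumes the script is started from the repository root where run_Speed_up.py lives.

```python
import os
import subprocess
import sys


def env_with_mkl_fix(base_env=None):
    """Return a copy of the environment with MKL_SERVICE_FORCE_INTEL=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_SERVICE_FORCE_INTEL"] = "1"
    return env


def build_cherry_command(contigs, extra_args=()):
    """Build the argument list for run_Speed_up.py using the current interpreter."""
    return [sys.executable, "run_Speed_up.py", "--contigs", contigs, *extra_args]


def run_cherry(contigs, extra_args=()):
    """Run CHERRY with the MKL workaround; raises CalledProcessError on failure."""
    return subprocess.run(
        build_cherry_command(contigs, extra_args),
        env=env_with_mkl_fix(),
        check=True,
    )
```

For example, `run_cherry("test_contigs.fa", ["--len", "8000", "--model", "pretrain", "--topk", "1"])` corresponds to the example command from the Usage section with the variable exported.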


cherry's Issues

multimodal Graph Error for file contig_0

I want to use CHERRY to predict hosts with my own MAGs.
I replaced the prokaryote folder and prokaryote.csv, and ran the command "python run_Speed_up.py --contigs final_quality_summary.fasta --len 5000 --model pretrain --topk 1".
Unfortunately, I got the following error:

image

no output

Why does host prediction sometimes finish without reporting an error, but also without producing any output?

Question about obtaining multiple host prediction results for a single viral isolate using CHERRY

I tried to test CHERRY using my phage genomes, but encountered some issues.
Firstly, the CHERRY paper mentions that the top [n] results with a score>0.9 can be used as predictions for "multiple hosts". However, when I specified --topk 5, I found that the cherry_prediction.tsv file in the output directory only contained the first result, while the specified number of results appeared in cherry_prediction.csv under the midfolder directory. I am not sure if this is the correct file, so I would like to ask the author for the recommended method to obtain multiple host prediction results.
cherry_prediction.csv

Additionally, attached is midfolder/cherry_prediction.csv for my test sample, which is a phage isolated from Klebsiella pneumoniae.
I'm not sure how to interpret these results because when I tried setting --topk 200, I found that the first 195 results had a score of 1 and the last 5 results had a score of 0.99. The last column showed TYPE=CRISPR. Does this mean that this phage has a perfect CRISPR match with the 195 listed hosts across different taxa, resulting in a score of 1? Does the last column, TYPE=CRISPR, indicate that all the predicted results are based on CRISPR?

Also, when the scores are the same, does it mean that the confidence level of these host prediction results is the same? Because I found that I could only find the host Klebsiella pneumoniae, which I determined through experiments, by setting a very large --topk value. Klebsiella pneumoniae also had a score = 1 but was ranked very low in this single result.

Therefore, my questions can be summarized as follows:

  1. Is top [n] results with score>0.9 in midfolder/cherry_prediction.csv the recommended way to view multiple host prediction results?
  2. How should I interpret the large number of results with score = 1 and the last column showing TYPE=CRISPR? Does it mean that all the predicted results are of the CRISPR type?
  3. If a score>0.9 is considered usable according to the paper, then in my sample, the lowest value among the top 200 results is 0.99. This suggests that there could be several hundred or even close to a thousand potential hosts with a score >0.9. Is this normal? If it is correct, how should I interpret it? Or is it a bug?

An error in the prediction of viral hosts with my own bacterial genomes

Hi, Kennth
I tried to use my own genomes to predict the viral host, but an error occurred as follows:
Command: python run_Speed_up.py --contigs all_viral_combined_MMseq_out2_rep_seq.fasta --mode prokaryote --t 0.98 --len 1500

Building a new DB, current time: 06/29/2022 18:44:58
New DB name: /home/jyzhang/softwares/CHERRY/new_blast_db/bin214
New DB title: new_prokaryote/bin214.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 60 sequences in 0.14666 seconds.
Running blastn...
Traceback (most recent call last):
  File "edge_virus_prokaryote.py", line 151, in <module>
    with open(blast_tab_out+file) as file_in:
FileNotFoundError: [Errno 2] No such file or directory: 'blast_tab/bin117.tab'
phage_host Error for file contig_0

I have already put my genomes in the "new_prokaryote/" folder, and added the corresponding taxonomies to the dataset/prokaryote.csv file. When I used my own viral contigs to predict hosts, the above-mentioned error occurred.
I have tried to solve this problem. I modified line 151 of edge_virus_prokaryote.py, changing "with open(blast_tab_out+file)" to "with open(new_blast_tab_out+file)". Then it worked.
I wonder whether this modification is right or not. Besides, I did not find any information about CRISPR spacers from my own genomes in the new_prokaryote/ folder in the result folder. Thus, I also wonder whether CHERRY identifies CRISPR spacers in my own bacterial genomes in the new_prokaryote/ folder, and whether CHERRY will predict viral hosts according to the CRISPR spacers of my own genomes.
Looking forward to your reply.

Jiayu Zhang
