Dear KennthShang, I hope this email finds you well. I am reaching ou

Difference in the number of CHERRY_crispr_multihost and CHERRY prediction results about cherry_crispr_multihost HOT 5 CLOSED

trx296554555 commented on August 23, 2024

Difference in the number of CHERRY_crispr_multihost and CHERRY prediction results

from cherry_crispr_multihost.

Comments (5)

KennthShang commented on August 23, 2024

Hi there,

In the CHERRY main program, the thresholds are 95% coverage and identity. In the multihost version. Because user would like to predict more potential hosts for their phages. the threshold are smaller and more outputs will be remained.

Best,
Jiayu

from cherry_crispr_multihost.

trx296554555 commented on August 23, 2024

Hi Jiayu,

Thank you for your response. If I understand correctly, the main CHERRY program uses thresholds of 95% coverage and identity. However, in the multihost version, the thresholds are smaller (such as 75%) to predict more potential hosts for phages, resulting in a higher number of outputs.

If my understanding is correct, does it mean that by using a stricter threshold of 95%, I obtained more results?

Best regards,
Robin

from cherry_crispr_multihost.

KennthShang commented on August 23, 2024

First of all, there are two main differences between the two programs:

The multi-host version will return multiple hosts (host range) for an input phage contig. However, the PhaBOX version will only return the one with the highest confidence.
The multi-host version will only depend on a pre-built CRIPSRs database for prediction. However, the PhaBOX depends on a graph and will return 'evidence' based on the graph.

Then, back to your questions: If you are using the same program and adjusting thresholds and If a higher number of outputs refers to a higher prediction rate, I suppose a stricter threshold should return a lower predicted rate. However, comparing the two programs is hard to tell since they are of different designs.

BTW, the multi-host version currently is a beta test version and is not a formal part of the original paper. It is only designed for users wanting to predict the host range. If you only care about the expected host with the highest confidence, please use the PhaBOX version as the golden standard.

Best,
Jiayu

from cherry_crispr_multihost.

trx296554555 commented on August 23, 2024

Hi,
Thank you for your response. I appreciate the information provided. However, I still feel that my original question has not been fully addressed.

While I admit that I am not well-versed in the Graph Convolutional Encoder part, when it comes to the aspect of using CRISPR to establish the relationship between phages and hosts, I believe both CHERRY_crispr_multihost.py and Cherry_single.py (PhaBOX) utilize the same strategy. Specifically, they employ the 'blastn-short' to compare the pre-built CRISPR database against the potential virus sequence. This approach is widely accepted and commonly used for host prediction. Additionally, I have cross-checked the MD5 of the CRISPR database file and confirmed that they are indeed the same database.

Based on these factors, I would expect CHERRY_crispr_multihost.py and Cherry_single.py (PhaBOX) to yield similar results in this specific aspect, rather than the significant discrepancies I have observed. In this case, I have examined the alignment results of both programs, namely the 'alignment_result.tab' and 'tmp/cherry/crispr_out.tab' files, and they are completely consistent.

> alignment_result.tab 
C_Abbottina_obtusirostris=1	34.3097640|GCF_002866885.2|	4.43e-07	94.737	38	43
C_Abbottina_obtusirostris=1	30.3097386|GCF_002866885.2|	4.43e-07	100.000	34	43
C_Abbottina_obtusirostris=1	6.424270|GCF_008362895.1|	1.33e-06	100.000	33	33
C_Abbottina_obtusirostris=1	64.3944489|GCF_006243415.1|	3.98e-06	100.000	32	32
C_Abbottina_obtusirostris=1	43.1302212|GCF_002803925.1|	3.98e-06	100.000	32	32
C_Abbottina_obtusirostris=1	29.1301357|GCF_002803925.1|	3.98e-06	100.000	32	32
C_Abbottina_obtusirostris=1	91.320081|GCF_000819725.1|	3.98e-06	100.000	32	32
C_Abbottina_obtusirostris=1	24.1645998|GCF_000801015.1|	3.98e-06	97.059	34	34
C_Abbottina_obtusirostris=1	70.3944849|GCF_006243415.1|	3.98e-06	100.000	32	32

> tmp/cherry/crispr_out.tab
PhaGCN_0	34.3097640|GCF_002866885.2|	4.43e-07	94.737	38	43
PhaGCN_0	30.3097386|GCF_002866885.2|	4.43e-07	100.000	34	43
PhaGCN_0	6.424270|GCF_008362895.1|	1.33e-06	100.000	33	33
PhaGCN_0	64.3944489|GCF_006243415.1|	3.98e-06	100.000	32	32
PhaGCN_0	43.1302212|GCF_002803925.1|	3.98e-06	100.000	32	32
PhaGCN_0	29.1301357|GCF_002803925.1|	3.98e-06	100.000	32	32
PhaGCN_0	91.320081|GCF_000819725.1|	3.98e-06	100.000	32	32
PhaGCN_0	24.1645998|GCF_000801015.1|	3.98e-06	97.059	34	34
PhaGCN_0	70.3944849|GCF_006243415.1|	3.98e-06	100.000	32	32

Upon further investigation of the code, I have identified the true cause of the discrepancy. It appears that during the filtering of the alignment results:

CHERRY_crispr_multihost.py considers both coverage and identity (using an 'AND' condition)https://github.com/KennthShang/CHERRY_crispr_multihost/blob/f6316140ad4fbf09cd5650bbaad5931f8ff3df87/Cherry_multihost.py#L76C9-L76C59, whereas Cherry_single.py only requires either one to be satisfied (using an 'OR' condition)https://github.com/KennthShang/PhaBOX/blob/852454bbc5e9abebe6f451310087bb4ed4c311b2/Cherry_single.py#L314C12-L314C49.

This discrepancy explains why Cherry_single.py predicts more hosts of the CRISPR type. After modifying the condition in Cherry_single.py (OR to AND) and re-running the program, I obtained consistent results with CHERRY_crispr_multihost.py in the CRISPR section, though Cherry_single.py chose the one with the highest confidence. (Both use 0.95 coverage and 95 identity)

Cherry_single.py RES
$cut -f 5 -d , testout2/cherry_prediction.csv|sort |uniq -c
    833 -
    603 CRISPR
   1835 Predict
      1 Type

CHERRY_crispr_multihost.py RES
$wc -l prediction.csv 
604 prediction.csv

I am unsure whether changing the 'OR' condition to an 'AND' condition is necessary, but it is worth noting that other filtering criteria in Cherry_single.py also employ the 'AND' condition. https://github.com/KennthShang/PhaBOX/blob/852454bbc5e9abebe6f451310087bb4ed4c311b2/Cherry_single.py#L270

Thank you for your attention to this matter.

Best regards,
Robin

from cherry_crispr_multihost.

KennthShang commented on August 23, 2024

As I mentioned, all the experiments in the paper are based on the same criteria in the main program. Thus, it is not recommended to modify the main program of CHERRY. But you can do it at your own risk.

if you want to use the multi-host version, you can also modify the program independently. The logic of this extension version is provided by a colleague.

Line 270 in the main program is used to find the seen phages from the database, which is patched after the paper is released. Thus, we choose a strict threshold.

Best,
Jiayu

from cherry_crispr_multihost.

Difference in the number of CHERRY_crispr_multihost and CHERRY prediction results about cherry_crispr_multihost HOT 5 CLOSED

Comments (5)

Related Issues (2)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs