Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)

License: MIT License

Shell 89.73% Perl 2.24% Python 7.90% Scala 0.12%
cenote-taker2 metagenomes annotation genbank virus virome hhsearch hhblits discovery hallmark

cenote-taker2's Introduction

DEPRECATED

This repo is deprecated.

If you need help finishing a project using Cenote-Taker 2, I will still field questions/troubleshoot (open an issue).

Otherwise:

Please use Cenote-Taker 3. It's great!!

######### ######### ######### ######### ######### ######### ######### ######### #########

Cenote-Taker 2

Cenote-Taker 2 is a dual-function bioinformatics tool. On the one hand, Cenote-Taker 2 can discover/predict virus sequences from any kind of genome or metagenomic assembly. On the other hand, virus sequences/genomes (perhaps predicted by another tool?) can be annotated with a variety of sequence features, genes, and taxonomy. Either the discovery or the annotation module can be used independently.

+ The code is currently functional. Feel free to use Cenote-Taker 2 at will.
+ Major update on May 6th 2022: Version 2.1.5
+ Cenote-Taker 2.1.5 has an easier, more reliable installation and database downloads. 
+ Some packages that have given many users issues have been replaced. Taxonomy is more flexible. See release notes.
+ "virion" is now default database

If you just want to discover/predict virus sequences and get a report on those sequences, use Cenote Unlimited Breadsticks, also provided in the Cenote-Taker 2 repo.

If you just want to annotate your virus sequences and make genome maps, run Cenote-Taker 2 using -am True.

An ulterior motive for creating and distributing Cenote-Taker 2 is to facilitate annotation and deposition of viral genomes into GenBank where they can be used by the scientific public. Therefore, I hope you consider depositing the submittable outputs (.sqn) after reviewing them. I am not affiliated with GenBank.

See "Use Cases" below, and read the Cenote-Taker 2 wiki for useful information on using the pipeline (e.g. expected outputs) and screeds on myriad topics. Using an HPC with at least 16 CPUs and 16 GB of dedicated memory is recommended for most runs. (Annotation of a few selected genomes or virus discovery on smaller databases can be done with less memory/CPU in a reasonable amount of time.)
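
As a rough illustration of the resource recommendation above, the following Python sketch compares a machine against the 16 CPU / 16 GB baseline. The function names are hypothetical and not part of Cenote-Taker 2, and the memory lookup assumes a Linux system. On a shared HPC node it reports the whole node rather than your job's allocation, so treat the result as a hint only.

```python
import os

RECOMMENDED_CPUS = 16     # baseline taken from the recommendation above
RECOMMENDED_MEM_GB = 16   # baseline taken from the recommendation above


def meets_recommendation(cpus, mem_gb):
    """Return True if both CPU count and memory reach the suggested baseline."""
    return cpus >= RECOMMENDED_CPUS and mem_gb >= RECOMMENDED_MEM_GB


def detect_resources():
    """Best-effort detection of CPU count and physical memory in GB (Linux)."""
    cpus = os.cpu_count() or 1
    # SC_PAGE_SIZE and SC_PHYS_PAGES are assumed to be available (true on Linux).
    mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cpus, mem_bytes / 1024 ** 3


if __name__ == "__main__":
    cpus, mem_gb = detect_resources()
    status = "meets" if meets_recommendation(cpus, mem_gb) else "is below"
    print(f"{cpus} CPUs, {mem_gb:.1f} GB RAM: {status} the recommended baseline")
```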

To update from v2.1.3 (note that biopython and bedtools are now required):

conda activate cenote-taker2_env
pip install phanotate
conda install -c conda-forge -c bioconda hhsuite last=1282 seqkit
cd Cenote-Taker2
git pull
#Then update the BLAST database (see instructions below).

Update to HMM databases (hallmark genes) occurred on June 16th, 2021. Update to the BLAST (taxonomy) database occurred on May 6th, 2022. See instructions below to update your database.

Read the manuscript in Virus Evolution

If you cannot or do not want to install and run this on the command line, Cenote-Taker 2 v2.1.3 is freely available to run with a point-and-click interface on the CyVerse Discovery Environment.

Install Using Conda


** Databases will require between 8GB (most basic) and 75GB (all the optional databases) of storage.
** Don't install without checking conda version first.
** Install on machine running on Linux (with a reasonably new OS). Support for MacOS is forthcoming.
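
The Conda version check can be scripted. This is a minimal sketch, assuming `conda -V` prints something like `conda 4.10.3` and using the 4.10 minimum stated in the install steps. The helper names are hypothetical. Versions are compared as integer tuples, so 4.9 correctly sorts below 4.10.

```python
import re
import shutil
import subprocess


def parse_conda_version(version_output):
    """Extract (major, minor) from text such as 'conda 4.10.3'."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1)), int(match.group(2))


def conda_is_new_enough(version_output, minimum=(4, 10)):
    """Return True if the reported version is at least `minimum`."""
    return parse_conda_version(version_output) >= minimum


if __name__ == "__main__":
    if shutil.which("conda") is None:
        print("conda not found on PATH")
    else:
        # Merge stderr into stdout in case the version is printed there.
        result = subprocess.run(
            ["conda", "-V"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        verdict = "OK" if conda_is_new_enough(result.stdout) else "too old"
        print(result.stdout.strip(), "->", verdict)
```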

If you just want a lightweight (7GB), faster, NON-ANNOTATING virus discovery tool, use Cenote Unlimited Breadsticks. The Unlimited Breadsticks module is included in the Cenote-Taker 2 repo, so there is no need to install it separately if you already have Cenote-Taker 2 (you may need to update from older versions of Cenote-Taker 2).

- ALERT *** If you choose to install all optional databases for HHsuite, 
- installation will take about 2 hours due to slow download speeds for pdb70
- AND require about 75GB of storage space. 
  1. Change to the directory you'd like to be the parent to the install directory.

  2. Ensure Conda is installed and working (required for installation and execution of Cenote-Taker 2). Use version 4.10 or later. Note: instructions for installing Conda are probably specific to your university's/organization's requirements, so it is always best to ask your IT professional or HPC administrator. Generally, you will want to install Miniconda in your data directory.

conda -V
  3. Clone the Cenote-Taker 2 GitHub repo.
git clone https://github.com/mtisza1/Cenote-Taker2.git
  4. Install the conda environment (phanotate, last, and hhsuite don't play nice with the .yml file, so they need special commands).
conda env create --file cenote-taker2_env.yml
# follow conda prompts to allow install

conda activate cenote-taker2_env

pip install phanotate

conda install -c conda-forge -c bioconda hhsuite last=1282
  5. Change to the Cenote-Taker2 repo directory OR a different location where you want the databases to be stored, then download the databases. (NOTE: if you install the databases in a custom location, you will need to specify this directory each time you run the tool.)
conda activate cenote-taker2_env
cd Cenote-Taker2

**choose one of the following**

# with all the options (75GB). The PDB database (--hhPDB) takes about 2 hours to download.
python update_ct2_databases.py --hmm True --protein True --rps True --taxdump True --hhCDD True --hhPFAM True --hhPDB True

# substantially smaller but with some hhsuite DBs (20GB). I recommend this if you are unsure which you want.
python update_ct2_databases.py --hmm True --protein True --rps True --taxdump True --hhCDD True --hhPFAM True

# only the required DBs, No hhsuite (8GB)
python update_ct2_databases.py --hmm True --protein True --rps True --taxdump True
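
To make the choice between the three database sets concrete, here is a hedged Python sketch that picks the flags from the free disk space. The flags and the 8 / 20 / 75 GB figures come from the commands above. The helper names are hypothetical, and the thresholds leave no headroom, so treat the result as a starting point rather than a guarantee that the download will fit.

```python
import shutil

BASE_FLAGS = ["--hmm", "--protein", "--rps", "--taxdump"]  # required DBs (~8 GB)
HHSUITE_SMALL = ["--hhCDD", "--hhPFAM"]                    # total of ~20 GB
HHSUITE_PDB = ["--hhPDB"]                                  # total of ~75 GB


def choose_database_flags(free_gb):
    """Pick the largest documented database set that fits in `free_gb`."""
    if free_gb < 8:
        raise ValueError("fewer than 8 GB free: not enough for the required databases")
    flags = list(BASE_FLAGS)
    if free_gb >= 20:
        flags += HHSUITE_SMALL
    if free_gb >= 75:
        flags += HHSUITE_PDB
    return flags


def build_update_command(free_gb):
    """Return the update_ct2_databases.py command as an argument list."""
    command = ["python", "update_ct2_databases.py"]
    for flag in choose_database_flags(free_gb):
        command += [flag, "True"]
    return command


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 1024 ** 3
    try:
        print(" ".join(build_update_command(free_gb)))
    except ValueError as error:
        print(error)
```

Run it from the directory where the databases will be stored, since `shutil.disk_usage(".")` measures the filesystem of the current directory.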

Bioconda installation

THIS HAS NOT BEEN UPDATED RECENTLY. THE BIOCONDA VERSION IS NOT RECOMMENDED AT THE MOMENT.

A user has packaged Cenote-Taker 2 in Bioconda for use by their institute. However, installation can be done by anyone using their package with a few commands. All the above alerts, requirements, and warnings still apply. This will also require a user to have 32GB of storage in their default conda environment directory.

Commands:

conda create -n cenote-taker2 -c hcc -c conda-forge -c bioconda -c defaults cenote-taker2=2020.04.01

conda activate cenote-taker2

download-db.sh

The Krona database directory will then need to be manually downloaded and set up. This should work:

CT2_DIR=$PWD
KRONA_DIR=$( which python | sed 's/bin\/python/opt\/krona/g' )
cd ${KRONA_DIR}
sh updateTaxonomy.sh
cd ${KRONA_DIR}
sh updateAccessions.sh
cd ${CT2_DIR}

Discussion: LINK

Updating databases

As of now, the HMM database has been updated from the original (update on June 16th, 2021), and the BLAST database has been updated (May 6th, 2022). This update should only take a minute or two. Here's how you update (modify if your conda environment differs from the example below):

# update Cenote-Taker 2 (change to main repo directory):
git pull

# load your conda environment:
conda activate cenote-taker2_env

#change to Cenote-Taker2 directory
cd Cenote-Taker2

# run the update script:
python update_ct2_databases.py --hmm True --protein True

Schematic

[Image: schematic of the Cenote-Taker 2 pipeline]

Running Cenote-Taker 2

Cenote-Taker 2 currently runs in a python wrapper.

  1. Activate the Conda environment.

Check environments:

conda info --envs

#Default:

conda activate cenote-taker2_env

#Or if you've put your conda environment in a custom location:

conda activate /path/to/better/directory/cenote-taker2_env
  2. Run the python script to get the help menu (see options below).
# quick help menu
python /path/to/Cenote-Taker2/run_cenote-taker2.py

# full help menu
python /path/to/Cenote-Taker2/run_cenote-taker2.py -h
  3. Run some contigs. For example:
python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_CONTIGS.fasta -r my_contigs1_ct -m 32 -t 32 -p true -db virion

#Or, if you want to save a log of the run, add  "2>&1 | tee output.log" to the end of the command:

python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_CONTIGS.fasta -r my_contigs1_ct -m 32 -t 32 -p true -db virion 2>&1 | tee output.log
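
The help text under "All arguments" says the run title must be less than 18 characters and use only letters, numbers and underscores. The sketch below checks that rule before assembling the command shown above. The helper names are hypothetical and not part of Cenote-Taker 2, and the pattern is my reading of that sentence rather than the tool's own validation code.

```python
import re

# 1 to 17 characters, ASCII letters, digits and underscores only.
RUN_TITLE_PATTERN = re.compile(r"[A-Za-z0-9_]{1,17}")


def is_valid_run_title(title):
    """Return True if `title` satisfies the documented run-title rule."""
    return RUN_TITLE_PATTERN.fullmatch(title) is not None


def build_run_command(script, contigs, run_title, mem_gb, cpus,
                      prune_prophage, virus_db="virion"):
    """Assemble the run_cenote-taker2.py call as an argument list."""
    if not is_valid_run_title(run_title):
        raise ValueError(
            f"run title {run_title!r} must be under 18 characters and use only "
            "letters, numbers and underscores"
        )
    return [
        "python", script,
        "-c", contigs,
        "-r", run_title,
        "-m", str(mem_gb),
        "-t", str(cpus),
        "-p", str(prune_prophage),
        "-db", virus_db,
    ]


if __name__ == "__main__":
    command = build_run_command(
        "/path/to/Cenote-Taker2/run_cenote-taker2.py",
        "MY_CONTIGS.fasta", "my_contigs1_ct", 32, 32, True,
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` once the conda environment is active.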

Use Case Suggestions/Settings

Annotation

If you just want to annotate your pre-selected virus sequences and make genome maps, run Cenote-Taker 2 using -am True.

Example:

# clip and wrap circular sequences
python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_VIRUSES.fasta -r viruses_am_ct -m 32 -t 32 -p False -am True

# do not wrap circular sequences, but label DTR regions
python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_VIRUSES.fasta -r viruses_am_ct -m 32 -t 32 -p False -am True --wrap False

For very divergent genomes, setting -hh hhsearch will marginally improve the number of genes that are annotated. This setting increases the run time quite a bit. On the other hand, setting -hh none will skip the time-consuming hhblits step. With this, you'll still get pretty good genome maps, and it might be most appropriate for very large virus genome databases, or for runs where you just want to do a quick check.

Discovery + Annotation

Virus-like particle (VLP) prep assembly:

-p False -db standard

You might apply a size cutoff for linear contigs as well, e.g. --minimum_length_linear 3000 OR --minimum_length_linear 5000. Changing length minima does not affect false positive rates, but short linear contigs may not be useful, depending on your goals.

Example:

python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_VLP_ASSEMBLY.fasta -r my_VLP1_ct -m 32 -t 32 -p False -db standard --minimum_length_linear 3000

Whole genome shotgun (WGS) metagenomic assembly:

-p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2

While you should definitely prune virus sequences from WGS datasets, CheckV also does a very good job (I'm still formally comparing these approaches), and you could use --prune_prophage False on a metagenome assembly and feed the unpruned contigs from Unlimited Breadsticks into checkv end_to_end if you prefer. My suggestion is to prune with Cenote-Taker 2, then run CheckV.

Example with prune:

python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_WGS_ASSEMBLY.fasta -r my_WGS1_ct -m 32 -t 32 -p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2

Bacterial isolate genome or MAG

-p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2

Using --lin_minimum_hallmark_genes 1 -db virion with WGS or bacterial genome data will (in my experience) yield very few sequences that appear to be false positives. However, there are lots of "degraded" prophage sequences in these sequencing sets, i.e. some/most genes of the phage have been lost. That said, a sequence with just 1 hallmark gene is not guaranteed to be a degraded phage (especially in the case of ssDNA viruses), nor are 2+ hallmark genes a guarantee of a complete phage.

Example:

python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_BACTERIAL_GENOME.fasta -r my_genome1_ct -m 32 -t 32 -p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2

RNAseq assembly of any kind (if you only want RNA viruses)

-p False -db rna_virus

If you also want DNA virus transcripts, or if your data is mixed RNA/DNA sequencing, you might do a run with -db rna_virus, then, from this run, take the file "other_contigs/non_viral_domains_contigs.fna" and use it as input for another run with -db virion.

Example:

python /path/to/Cenote-Taker2/run_cenote-taker2.py -c MY_METATRANSCRIPTOME.fasta -r my_metatrans1_ct -m 32 -t 32 -p False -db rna_virus
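
The suggested settings for the data types in this section can be collected into a small lookup table. This is only a sketch: the keys (`vlp`, `wgs`, `bacterial_genome`, `rnaseq`) are hypothetical labels chosen here, while the flag values are copied from the suggestions above.

```python
# Suggested flags per data type, copied from the use-case suggestions above.
# The dictionary keys are illustrative labels, not Cenote-Taker 2 options.
PRESETS = {
    "vlp": ["-p", "False", "-db", "standard"],
    "wgs": ["-p", "True", "-db", "virion",
            "--minimum_length_linear", "3000",
            "--lin_minimum_hallmark_genes", "2"],
    "bacterial_genome": ["-p", "True", "-db", "virion",
                         "--minimum_length_linear", "3000",
                         "--lin_minimum_hallmark_genes", "2"],
    "rnaseq": ["-p", "False", "-db", "rna_virus"],
}


def preset_arguments(sample_type):
    """Return a copy of the suggested flags for `sample_type`."""
    try:
        return list(PRESETS[sample_type])
    except KeyError:
        raise ValueError(
            f"unknown sample type {sample_type!r}; choose from {sorted(PRESETS)}"
        ) from None


if __name__ == "__main__":
    base = ["python", "/path/to/Cenote-Taker2/run_cenote-taker2.py",
            "-c", "MY_WGS_ASSEMBLY.fasta", "-r", "my_WGS1_ct",
            "-m", "32", "-t", "32"]
    print(" ".join(base + preset_arguments("wgs")))
```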

Prepare files for Vcontact2

Vcontact2 is a popular downstream tool for clustering phage genomes into genus-level bins. Here's an example of how to prepare files from Cenote-Taker 2.

# change directory to a Cenote-Taker 2 output directory
# specify summary file (name based on run title):
ls *_CONTIG_SUMMARY.tsv
SUMMARY="cenote_out_CONTIG_SUMMARY.tsv"
# make files for VContact2
if [ -s vcontact2_gene_to_genome1.csv ] || [ -s vcontact2_all_proteins.faa ] ; then
    echo "vcontact2 files already exist. NOT overwriting."
else
    echo "protein_id,contig_id,keywords" > vcontact2_gene_to_genome1.csv
    tail -n+2 $SUMMARY | cut -f2,4 | while read VIRUS END ; do
        # DTR contigs are rotated, so their protein file has a different name
        if [[ "$END" == "DTR" ]] ; then
            AA=$( find . -type f -name "${VIRUS}.rotate.AA.sorted.fasta" )
        else
            AA=$( find . -type f -name "${VIRUS}.AA.sorted.fasta" )
        fi
        # one "protein_id,contig_id" row per protein header
        grep -F ">" $AA | cut -d " " -f1 | sed 's/>//g' | while read LINE ; do
            echo "${LINE},${VIRUS}"
        done >> vcontact2_gene_to_genome1.csv
        cat $AA >> vcontact2_all_proteins.faa
    done
fi
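
For readers who prefer Python, the core of the loop above (turning protein FASTA headers into gene-to-genome rows) can be sketched as follows. This is an illustrative equivalent, not part of Cenote-Taker 2, and the function names are hypothetical. Like the shell version, it writes two fields per row under the three-column header.

```python
import csv
import io


def gene_to_genome_rows(fasta_lines, contig_id):
    """Return (protein_id, contig_id) pairs, one per FASTA header line."""
    rows = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Keep only the text between '>' and the first space.
            protein_id = line[1:].split(" ", 1)[0].strip()
            rows.append((protein_id, contig_id))
    return rows


def write_gene_to_genome(rows, handle):
    """Write the header and rows in the same layout as the shell loop."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["protein_id", "contig_id", "keywords"])
    writer.writerows(rows)


if __name__ == "__main__":
    example = [">ctg1_1 hypothetical protein\n", "MKV\n", ">ctg1_2\n", "MST\n"]
    buffer = io.StringIO()
    write_gene_to_genome(gene_to_genome_rows(example, "ctg1"), buffer)
    print(buffer.getvalue(), end="")
```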

All arguments:

usage: run_cenote-taker2.py [-h] 
                            -c ORIGINAL_CONTIGS 
                            -r RUN_TITLE 
                            -p PROPHAGE
                            -m MEM 
                            -t CPU 

                            [-am ANNOTATION_MODE]
                            [--template_file TEMPLATE_FILE] 
                            [--reads1 F_READS]
                            [--reads2 R_READS]
                            [--minimum_length_circular CIRC_LENGTH_CUTOFF]
                            [--minimum_length_linear LINEAR_LENGTH_CUTOFF]
                            [-db VIRUS_DOMAIN_DB]
                            [--lin_minimum_hallmark_genes LIN_MINIMUM_DOMAINS]
                            [--circ_minimum_hallmark_genes CIRC_MINIMUM_DOMAINS]
                            [--known_strains HANDLE_KNOWNS]
                            [--blastn_db BLASTN_DB]
                            [--enforce_start_codon ENFORCE_START_CODON]
                            [-hh HHSUITE_TOOL] 
                            [--crispr_file CRISPR_FILE]
                            [--isolation_source ISOLATION_SOURCE]
                            [--Environmental_sample ENVIRONMENTAL_SAMPLE]
                            [--collection_date COLLECTION_DATE]
                            [--metagenome_type METAGENOME_TYPE]
                            [--srr_number SRR_NUMBER]
                            [--srx_number SRX_NUMBER] 
                            [--biosample BIOSAMPLE]
                            [--bioproject BIOPROJECT] 
                            [--assembler ASSEMBLER]
                            [--molecule_type MOLECULE_TYPE]
                            [--data_source DATA_SOURCE]
                            [--filter_out_plasmids FILTER_PLASMIDS]
                            [--scratch_directory SCRATCH_DIR]
                            [--blastp BLASTP] 
                            [--orf-within-orf ORF_WITHIN]
                            [--cenote-dbs CENOTE_DBS] [--wrap WRAP]
                            [--hallmark_taxonomy HALLMARK_TAX]


Cenote-Taker 2 is a pipeline for virus discovery and thorough annotation of viral contigs and genomes. 
Visit https://github.com/mtisza1/Cenote-Taker2#use-case-suggestionssettings for suggestions about how to 
run different data types and https://github.com/mtisza1/Cenote-Taker2/wiki to read more. Version 2.1.5

optional arguments:
  -h, --help            show this help message and exit

 REQUIRED ARGUMENTS for Cenote-Taker2 :
  -c ORIGINAL_CONTIGS, --contigs ORIGINAL_CONTIGS
                        Contig file with .fasta extension in fasta format - OR
                        - assembly graph with .fastg extension. Each header
                        must be unique before the first space character
  -r RUN_TITLE, --run_title RUN_TITLE
                        Name of this run. A directory of this name will be
                        created. Must be unique from older runs or older run
                        will be renamed. Must be less than 18 characters,
                        using ONLY letters, numbers and underscores (_)
  -p PROPHAGE, --prune_prophage PROPHAGE
                        True or False. Attempt to identify and remove flanking
                        chromosomal regions from non-circular contigs with
                        viral hallmarks (True is highly recommended for
                        sequenced material not enriched for viruses. Virus
                        enriched samples probably should be False (you might
                        check with ViromeQC). Also, please use False if
                        --lin_minimum_hallmark_genes is set to 0)
  -m MEM, --mem MEM     example: 56 -- Gigabytes of memory available for
                        Cenote-Taker2. Typically, 16 to 32 should be used.
                        Lower memory will work in theory, but could extend the
                        length of the run
  -t CPU, --cpu CPU     Example: 32 -- Number of CPUs available for
                        Cenote-Taker2. Approximately 32 CPUs should be used
                        for moderately sized metagenomic assemblies. For large
                        datasets, increased performance can be seen up to 120
                        CPUs. Fewer than 16 CPUs will work in theory, but
                        could extend the length of the run. See GitHub repo
                        for suggestions.



 OPTIONAL ARGUMENTS for Cenote-Taker2. Most of which are important to consider!!!
 GenBank typically only accepts genome submission with ample metadata.
 See https://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#ModifiersPage for more information on GenBank metadata fields:
  -am ANNOTATION_MODE, --annotation_mode ANNOTATION_MODE
                        Default: False -- Annotate sequences only (skip
                        discovery). Only use if you believe each provided
                        sequence is viral
  --template_file TEMPLATE_FILE
                        Template file with some metadata. Real one required
                        for GenBank submission. Takes a couple minutes to
                        generate:
                        https://submit.ncbi.nlm.nih.gov/genbank/template/submission/
  --reads1 F_READS      Default: no_reads -- ILLUMINA READS ONLY: First Read
                        file in paired read set - OR - read file in unpaired
                        read set - OR - read file of interleaved reads. Used
                        for coverage depth determination.
  --reads2 R_READS      Default: no_reads -- ILLUMINA READS ONLY: Second Read
                        file in paired read set. Disregard if not using paired
                        reads. Used for coverage depth determination.
  --minimum_length_circular CIRC_LENGTH_CUTOFF
                        Default: 1000 -- Minimum length of contigs to be
                        checked for circularity. Bare minimum is 1000 nts
  --minimum_length_linear LINEAR_LENGTH_CUTOFF
                        Default: 1000 -- Minimum length of non-circular
                        contigs to be checked for viral hallmark genes.
  -db VIRUS_DOMAIN_DB, --virus_domain_db VIRUS_DOMAIN_DB
                        default: virion -- 'standard' database: all virus (DNA
                        and RNA) hallmark genes (i.e. genes with known
                        function as virion structural, packaging, replication,
                        or maturation proteins specifically encoded by virus
                        genomes) with low false discovery rate. 'virion'
                        database: subset of 'standard', hallmark genes
                        encoding virion structural proteins, packaging
                        proteins, or capsid maturation proteins (DNA and RNA
                        genomes) with LOWEST false discovery rate. 'rna_virus'
                        database: For RNA virus hallmarks only. Includes RdRp
                        and capsid genes of RNA viruses. Low false discovery
                        rate.
  --lin_minimum_hallmark_genes LIN_MINIMUM_DOMAINS
                        Default: 1 -- Number of detected viral hallmark genes
                        on a non-circular contig to be considered viral and
                        receive full annotation. WARNING: Only choose '0' if
                        you have prefiltered the contig file to only contain
                        putative viral contigs (using another method such as
                        VirSorter or DeepVirFinder), or you are very confident
                        you have physically enriched for virus particles very
                        well (you might check with ViromeQC). Otherwise, the
                        duration of the run will be extended many many times
                        over, largely annotating non-viral contigs, which is
                        not what Cenote-Taker2 is meant for. For unenriched
                        samples, '2' might be more suitable, yielding a false
                        positive rate near 0.
  --circ_minimum_hallmark_genes CIRC_MINIMUM_DOMAINS
                        Default:1 -- Number of detected viral hallmark genes
                        on a circular contig to be considered viral and
                        receive full annotation. For samples physically
                        enriched for virus particles, '0' can be used, but
                        please treat circular contigs without known viral
                        domains cautiously. For unenriched samples, '1' might
                        be more suitable.
  --known_strains HANDLE_KNOWNS
                        Default: do_not_check_knowns -- do not check if
                        putatively viral contigs are highly related to known
                        sequences (via MEGABLAST). 'blast_knowns': REQUIRES
                        '--blastn_db' option to function correctly.
  --blastn_db BLASTN_DB
                        Default: none -- Set a database if using
                        '--known_strains' option. Specify BLAST-formatted
                        nucleotide database. Probably, use only GenBank 'nt'
                        database downloaded from ftp://ftp.ncbi.nlm.nih.gov/
                        or another GenBank formatted .fasta file to make
                        database

  --enforce_start_codon ENFORCE_START_CODON
                        Default: False -- For final genome maps, require ORFs
                        to be initiated by a typical start codon? GenBank
                        submissions containing ORFs without start codons can
                        be rejected. However, if True, important but
                        incomplete genes could be culled from the final
                        output. This is relevant mainly to contigs of
                        incomplete genomes
  -hh HHSUITE_TOOL, --hhsuite_tool HHSUITE_TOOL
                        default: hhblits -- hhblits will query PDB, pfam, and
                        CDD to annotate ORFs escaping identification via
                        upstream methods. 'hhsearch': hhsearch, a more
                        sensitive tool, will query PDB, pfam, and CDD to
                        annotate ORFs escaping identification via upstream
                        methods. (WARNING: hhsearch takes much, much longer
                        than hhblits and can extend the duration of the run
                        many times over. Do not use on large input contig
                        files). 'no_hhsuite_tool': forgoes annotation of ORFs
                        with hhsuite. Fastest way to complete a run.
  --crispr_file CRISPR_FILE
                        Tab-separated file with CRISPR hits in the following
                        format: CONTIG_NAME HOST_NAME NUMBER_OF_MATCHES. You
                        could use this tool:
                        https://github.com/edzuf/CrisprOpenDB. Then reformat
                        for Cenote-Taker 2
  --isolation_source ISOLATION_SOURCE
                        Default: unknown -- Describes the local geographical
                        source of the organism from which the sequence was
                        derived
  --Environmental_sample ENVIRONMENTAL_SAMPLE
                        Default: False -- True or False, Identifies sequence
                        derived by direct molecular isolation from an
                        unidentified organism
  --collection_date COLLECTION_DATE
                        Default: unknown -- Date of collection. this format:
                        01-Jan-2019, i.e. DD-Mmm-YYYY
  --metagenome_type METAGENOME_TYPE
                        Default: unknown -- a.k.a. metagenome_source
  --srr_number SRR_NUMBER
                        Default: unknown -- For read data on SRA, run number,
                        usually beginning with 'SRR' or 'ERR'
  --srx_number SRX_NUMBER
                        Default: unknown -- For read data on SRA, experiment
                        number, usually beginning with 'SRX' or 'ERX'
  --biosample BIOSAMPLE
                        Default: unknown -- For read data on SRA, sample
                        number, usually beginning with 'SAMN' or 'SAMEA' or
                        'SRS'
  --bioproject BIOPROJECT
                        Default: unknown -- For read data on SRA, project
                        number, usually beginning with 'PRJNA' or 'PRJEB'
  --assembler ASSEMBLER
                        Default: unknown_assembler -- Assembler used to
                        generate contigs, if applicable. Specify version of
                        assembler software, if possible.
  --molecule_type MOLECULE_TYPE
                        Default: DNA -- viable options are DNA - OR - RNA
  --data_source DATA_SOURCE
                        default: original -- original data is not taken from
                        other researchers' public or private database.
                        'tpa_assembly': data is taken from other researchers'
                        public or private database. Please be sure to specify
                        SRA metadata.
  --filter_out_plasmids FILTER_PLASMIDS
                        Default: True -- True - OR - False. If True, hallmark
                        genes of plasmids will not count toward the minimum
                        hallmark gene parameters. If False, hallmark genes of
                        plasmids will count. Plasmid hallmark gene set is not
                        necessarily comprehensive at this time.
  --scratch_directory SCRATCH_DIR
                        Default: none -- When running many instances of
                        Cenote-Taker2, it seems to run more quickly if you
                        copy the hhsuite databases to a scratch space
                        temporarily. Use this argument to set a scratch
                        directory that the databases will be copied to (at
                        least 100GB of scratch space are required for copying
                        the databases)
  --blastp BLASTP       Do not use this argument as of now.
  --orf-within-orf ORF_WITHIN
                        Default: False -- Remove called ORFs without HMMSCAN
                        or RPS-BLAST hits that begin and end within other
                        ORFs? True or False
  --cenote-dbs CENOTE_DBS
                        Default: cenote_script_path -- If you downloaded and
                        setup the databases in a non-standard location,
                        specify path
  --wrap WRAP           Default: True -- Wrap/rotate DTR/circular contigs so
                        the start codon of an ORF is the first nucleotide in
                        the contig/genome
  --hallmark_taxonomy HALLMARK_TAX
                        Default: False -- Get hierarchical taxonomy
                        information for all hallmark genes? This report
                        (*.hallmarks.taxonomy.out) is not considered in the
                        final taxonomy call.
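
The `-c` description above requires every FASTA header to be unique before the first space. A minimal pre-flight check for that rule could look like the sketch below. It is a hypothetical helper, not something shipped with Cenote-Taker 2, and it only inspects header lines.

```python
from collections import Counter


def duplicate_header_ids(fasta_lines):
    """Return sorted header IDs (text before the first space) seen more than once."""
    ids = Counter(
        line[1:].split(" ", 1)[0].strip()
        for line in fasta_lines
        if line.startswith(">")
    )
    return sorted(name for name, count in ids.items() if count > 1)


if __name__ == "__main__":
    example = [
        ">contig_1 len=100\n", "ACGT\n",
        ">contig_2\n", "GG\n",
        ">contig_1 len=250\n", "TT\n",
    ]
    print(duplicate_header_ids(example))
```

To check a real file, pass an open file handle, e.g. `duplicate_header_ids(open("MY_CONTIGS.fasta"))`. An empty list means the headers satisfy the rule.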

Directory Tree

[Image: directory tree]

Citation

Michael J Tisza, Anna K Belford, Guillermo Domínguez-Huerta, Benjamin Bolduc, Christopher B Buck, Cenote-Taker 2 democratizes virus discovery and sequence annotation, Virus Evolution, Volume 7, Issue 1, January 2021, veaa100, https://doi.org/10.1093/ve/veaa100

cenote-taker2's Issues

Contigs in the CONTIG_SUMMARY.tsv have neither ".rotate.fasta" nor ".tax_guide.blastx.out"

Hi,

In the CONTIG_SUMMARY.tsv, ERR1301564_a1_ct40759 exists as below.

ERR1301564_k79_72066 ERR1301564_a1_ct40759 Unknown DTR 1611 Prodigal 0 none BLASTN not conducted

But in DTR_contigs_with_viral_domain, ERR1301564_a1_ct40759 has only an .fna file.

ls ERR1301564_a1_ct4075*
ERR1301564_a1_ct40759.fna

A successful case has .rotate.fasta and .tax_guide.blastx.out as well as .fna.

ls ERR1301564_a1_ct38931*
ERR1301564_a1_ct38931.AA.fasta ERR1301564_a1_ct38931.rotate.fasta ERR1301564_a1_ct38931.trans.fasta
ERR1301564_a1_ct38931.fna ERR1301564_a1_ct38931.tax_guide.blastx.out

Why does this contig in CONTIG_SUMMARY.tsv have no .rotate.fasta or .tax_guide.blastx.out?

ValueError: unsupported format character 'I' (0x49) at index 66

i just ran cenote-taker2, and this came up:
Traceback (most recent call last):
File "/home/Public/software/cenote-taker2/Cenote-Taker2/run_cenote-taker2.py", line 72, in
args = parser.parse_args()
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 1734, in parse_args
args, argv = self.parse_known_args(args, namespace)
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 1766, in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 1972, in _parse_known_args start_index = consume_optional(start_index)
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 1912, in consume_optional
take_action(action, args, option_string)
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 1840, in take_action
action(self, namespace, argument_values, option_string)
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 1024, in call
parser.print_help()
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 2366, in print_help
self._print_message(self.format_help(), file)
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 2350, in format_help
return formatter.format_help()
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 282, in format_help
help = self._root_section.format_help()
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 213, in format_help
item_help = join([func(*args) for func, args in self.items])
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 213, in <listcomp>
item_help = join([func(*args) for func, args in self.items])
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 213, in format_help
item_help = join([func(*args) for func, args in self.items])
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 213, in <listcomp>
item_help = join([func(*args) for func, args in self.items])
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 523, in _format_action
help_text = self._expand_help(action)
File "/home/Public/Anaconda3/ENTER/envs/cenote-taker2_env/lib/python3.6/argparse.py", line 610, in _expand_help
return self._get_help_string(action) % params
ValueError: unsupported format character 'I' (0x49) at index 66
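
For anyone hitting the same traceback: the last frame, `self._get_help_string(action) % params`, shows that argparse %-formats every help string, so a literal percent sign inside a `help=` text has to be written as `%%`. The sketch below reproduces the failure and the escaped form. The option `--min-identity` and its help text are made up for illustration and are not taken from run_cenote-taker2.py.

```python
import argparse


def render_help(help_text):
    """Build a one-option parser and return its formatted help text."""
    parser = argparse.ArgumentParser(prog="demo")
    # Hypothetical option, used only to exercise argparse's help formatting.
    parser.add_argument("--min-identity", default="90", help=help_text)
    return parser.format_help()


# A bare '%' followed by " I" is read as a format spec with an unknown
# conversion character, so a ValueError is raised (when the help is formatted
# on older Pythons, possibly already in add_argument on newer ones).
try:
    render_help("minimum % ID to call a hit")
    bare_percent_failed = False
except ValueError:
    bare_percent_failed = True

# Doubling the percent sign makes it a literal '%'.
escaped_help = " ".join(render_help("minimum %% ID to call a hit").split())
print(bare_percent_failed, "minimum % ID" in escaped_help)
```

So the thing to look for in the script is a `help=` string that contains a single `%` which is not part of a specifier such as `%(default)s`.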

Hangs with no feedback

Hi,

I'm playing with Cenote-Taker2 for the first time, and (as far as I can tell) it keeps hanging, i.e. execution simply stops with no feedback or further output. A couple of errors are thrown, but with no indication of what might cause them or what the solution might be.

The command looks like this

python ~/apps/CenoteTaker2/run_cenote-taker2.py -c LongWebster.fasta --known_strains blast_knowns --blastn_db /data/BLAST_databases/nt -r WebsterMelRebuild -m 150 -t 40 -p False
/data/home/dobbard/apps/CenoteTaker2

#and things start well
######################################################################


prodigal found
BWA found
samtools found
mummer found
circlator found
blastp found
blastn found
blastx found
rpsblast found
bioawk found
efetch found
ktClassifyBLAST found
hmmscan found
bowtie2 found
tRNAscan-SE found
pileup.sh found
tbl2asn found
getorf found
transeq found
@@@@@@@@@@@@@@@@@@@@@@@@@
Your specified arguments:
original contigs:                  LongWebster.fasta
forward reads:                     /data/home/dobbard/scratch/test_cenote/no_reads
reverse reads:                     /data/home/dobbard/scratch/test_cenote/no_reads
title of this run:                 WebsterMelRebuild
Isolate source:                    unknown
collection date:                   unknown
metagenome_type:                   unknown
SRA run number:                    unknown
SRA experiment number:             unknown
SRA sample number:                 unknown
Bioproject number:                 unknown
template file:                     /data/home/dobbard/apps/CenoteTaker2/dummy_template.sbt
minimum circular contig length:    1000
minimum linear contig length:      1000
virus domain database:             standard
min. viral hallmarks for linear:   1
min. viral hallmarks for circular: 1
handle known seqs:                 blast_knowns
contig assembler:                  unknown_assembler
DNA or RNA:                        DNA
HHsuite tool:                      hhblits
original or TPA:                   original
Do BLASTP?:                        no_blastp
Do Prophage Pruning?:              False
Filter out plasmids?:              True
Run BLASTN against nt?             /data/BLAST_databases/nt
Location of Cenote scripts:        /data/home/dobbard/apps/CenoteTaker2
Location of scratch directory:     none
GB of memory:                      150
number of CPUs available for run:  40
Annotation mode?                   False
@@@@@@@@@@@@@@@@@@@@@@@@@
scratch space will not be used in this run
HHsuite database locations:
/data/home/dobbard/apps/CenoteTaker2/NCBI_CD/NCBI_CD
/data/home/dobbard/apps/CenoteTaker2/pfam_32_db/pfam
/data/home/dobbard/apps/CenoteTaker2/pdb70/pdb70
/data/home/dobbard/scratch/test_cenote/LongWebster.fasta
time update: locating inputs:  03-11-21---09:01:43
/data/home/dobbard/scratch/test_cenote/LongWebster.fasta
File with .fasta extension detected, attempting to keep contigs over 1000 nt and find circular sequences with apc.pl
WebsterMelRebuild121.fasta has DTRs/circularity
WebsterMelRebuild189.fasta has DTRs/circularity
WebsterMelRebuild249.fasta has DTRs/circularity
WebsterMelRebuild643.fasta has DTRs/circularity
WebsterMelRebuild652.fasta has DTRs/circularity
WebsterMelRebuild88.fasta has DTRs/circularity
no reads provided or reads not found
Circular fasta file(s) detected

Putting non-circular contigs in a separate directory
time update: running IRF for ITRs in non-circular contigs 03-11-21---09:02:22
time update: running prodigal on linear contigs  03-11-21---09:02:24
time update: running linear contigs with hmmscan against virus hallmark gene database: standard  03-11-21---09:02:39
time update: Calling ORFs for circular/DTR sequences with prodigal  03-11-21---09:02:49
time update: running hmmscan on circular/DTR contigs  03-11-21---09:02:50
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  03-11-21---09:02:50
 Combining tbl files from all search results AND fix overlapping ORF module
No ITR contigs with minimum hallmark genes found.
Annotating linear contigs
time update: running BLASTX, annotate linear contigs  03-11-21---09:02:50
time update: running Prodigal, annotate linear contigs  03-11-21---09:04:32
time update: running hmmscan1, annotating linear contigs  03-11-21---09:04:34
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: expression for `>>' redirection has null string value
time update: running hmmscan2, annotating linear contigs  03-11-21---09:04:35
cat: SPLIT_DTR_HMM2_GENOME_AA_*AA.hmmscan2.out: No such file or directory

###################################################################################

But the failed awk and the failed cat suggest something is going wrong. At this point it appears nothing is running, so I am suspicious that cat is attempting to read from stdin because there was no file?

Also, the missing file requested at line 547

sed 's/ /#/g' $REMAINDER | bioawk -c fastx '{print ">"$name"#DTRs" ; print $seq}' | 's/#/ /g' >> other_contigs/non_viral_domains_contigs.fna

doesn't bode well.
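
Reading the quoted line 547, the last pipeline stage is the bare string `'s/#/ /g'` with no command in front of it, so the shell tries to run the sed expression itself as a program, which matches the "No such file or directory" message. Presumably a `sed` is missing there. The Python sketch below only illustrates what the line seems intended to produce (each FASTA header kept whole and tagged with `DTRs`). The function name is hypothetical and this is not code from Cenote-Taker 2.

```python
def tag_fasta_headers(fasta_text, tag="DTRs"):
    """Append a tag to every FASTA header line, leaving sequence lines untouched.

    Mirrors the apparent intent of: spaces -> '#', append '#DTRs', '#' -> spaces.
    """
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Like the final 's/#/ /g', any '#' in the header becomes a space.
            header = line.replace("#", " ")
            out_lines.append(f"{header} {tag}")
        else:
            out_lines.append(line)
    return "\n".join(out_lines) + "\n"


example = ">contig_1 length=5000\nACGTACGT\n"
print(tag_fasta_headers(example), end="")
```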

Clarification about how CT2 version 2.0.1 identifies a contig as a "Conjugative Transposon" in the *.tsv file?

Hi, I'm looking for some clarification on how contigs are annotated as a "Conjugative Transposon" both from a technical and biological perspective.

We have been analyzing data run through version 2.0.1, so that is the version of the code I have been studying.

I'll start with my understanding of the technical part to make sure I've got the flow of the code right. I tried to follow the path of a contig that ends up as a "Conjugative Transposon" in the final output report, and here's my understanding:

  • during the "guess taxonomy" step, blastx writes out a results file *.tax_guide.blastx.out
  • during the "combine tbl files" step, TAX_ORF variable is assigned as a "Conjugative Transposon" if CONJ_COUNT > 0 and STRUCTURAL_COUNT == 0 when grepping the *.tax_guide.blastx.out file. TAX_ORF is then written over the *.tax_guide.blastx.out file
  • In the section "Getting info for virus nomenclature and divergence" the variable tax_guess is assigned from the *.tax_guide.blastx.out file
  • tax_guess is written to /sequin_directory/*.fsa
  • information from *.fsa is pulled into the *.tsv report file

Does this seem correct?
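
To make the second bullet concrete, here is the decision rule written out as a small sketch. It encodes only what is stated above (`CONJ_COUNT > 0` and `STRUCTURAL_COUNT == 0`). The keyword lists and example hit lines are placeholders, since the actual grep patterns in the 2.0.1 script are not quoted in this post.

```python
def count_matching_lines(text, keywords):
    """Count lines containing any keyword, similar in spirit to `grep -c`."""
    lowered = [k.lower() for k in keywords]
    return sum(
        1 for line in text.splitlines()
        if any(k in line.lower() for k in lowered)
    )


def is_conjugative_transposon(blastx_text, conj_keywords, structural_keywords):
    """Apply the rule from the post: CONJ_COUNT > 0 and STRUCTURAL_COUNT == 0."""
    conj_count = count_matching_lines(blastx_text, conj_keywords)
    structural_count = count_matching_lines(blastx_text, structural_keywords)
    return conj_count > 0 and structural_count == 0


# Placeholder keyword lists and hit lines, for illustration only.
CONJ = ["conjugative transposon"]
STRUCTURAL = ["capsid", "terminase", "portal"]

hit_a = "contig_7\thypothetical conjugative transposon protein\t45.0\t1e-30\t200"
hit_b = hit_a + "\ncontig_7\tmajor capsid protein\t60.0\t1e-50\t300"
print(is_conjugative_transposon(hit_a, CONJ, STRUCTURAL))
print(is_conjugative_transposon(hit_b, CONJ, STRUCTURAL))
```

If this reading is right, a single conjugation-related hit is enough for the label as long as no structural hit is present, which bears on question (2) below.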

My biological colleagues have broader conceptual questions:
(1) Why is "taxonomy" reported for phage predictions, but not for CTns (why do we not annotate CTn predictions like we do for phages)?
(2) What database of proteins was used for the CTn assignments (assuming it is a protein similarity approach?), and what are the minimum criteria for making that assignment (how many genes/proteins?)

I'm not sure how to answer either of these questions outside of stating what the code is doing!

Any help is greatly appreciated, thanks so much!

find : Argument list too long

Hi!

I get an "argument list too long" problem with line 431 of cenote-taker2.1.2.sh

cat $( find * -maxdepth 0 -type f -name "*.AA.sorted.fasta" ) > all_large_genome_proteins.AA.fasta

probably because that * is expanding to something very large.

I don't understand the reason for the construction "find * -maxdepth 0". Why not just "find " or "find -maxdepth 1" ? both seem to work.

Regards!

D
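
A note on the mechanism: with `find *`, the shell expands `*` into one argument per directory entry before `find` is even started, so a directory with very many files can exceed the operating system's limit on argument-list size. As a workaround outside the script, the same concatenation can be done without any shell expansion. The sketch below is a stand-alone helper (the function name is hypothetical), not a patch to cenote-taker2.1.2.sh.

```python
import shutil
from pathlib import Path


def concatenate_matching(directory, pattern, output_name):
    """Concatenate files in `directory` matching `pattern` into `output_name`.

    Files are discovered inside the process, so no shell glob or argument
    list is involved. Only the top level of `directory` is searched.
    """
    directory = Path(directory)
    output_path = directory / output_name
    matches = sorted(
        p for p in directory.glob(pattern)
        if p.is_file() and p != output_path
    )
    with open(output_path, "wb") as out_handle:
        for path in matches:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return len(matches)
```

For the line in question this would be called as `concatenate_matching(".", "*.AA.sorted.fasta", "all_large_genome_proteins.AA.fasta")`.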

Taxonomic rank information of blastx.out

Dear CT2 team,
I get the contig taxonomic classification from *blastx.out, for example:

contig1 Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Myoviridae; Brunovirus; Salmonella virus SEN34 (9 levels)
contig2 Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Lilyvirus; Paenibacillus virus Lily (8 levels)
contig3 Viruses; Riboviria; Orthornavirae; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus; H5N1 subtype (11 levels)

I want to split each lineage into taxonomic ranks (e.g. realm, kingdom, phylum, class, ...) on ";", but the lineages do not all contain the same ranks, so a given position does not always correspond to the same rank. How can I fix this?
Looking forward to your reply. Thanks a lot.
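
One possible approach, offered as a sketch rather than as how CT2 itself handles ranks: ICTV taxon names carry rank-specific suffixes (-viria, -virae, -viricota, -viricetes, -virales, -viridae, -virus, ...), so the rank can be read from each name instead of from its position. The function name is hypothetical. Multi-word entries such as species names or "unclassified ..." groups have no fixed suffix and are collected under "unranked".

```python
# Rank suffixes used in ICTV virus nomenclature (principal ranks plus subphylum
# and subfamily). Species names carry no fixed suffix and are not inferred here.
RANK_SUFFIXES = [
    ("viricotina", "subphylum"),
    ("viricota", "phylum"),
    ("viricetes", "class"),
    ("virales", "order"),
    ("viridae", "family"),
    ("virinae", "subfamily"),
    ("viria", "realm"),
    ("virae", "kingdom"),
    ("virus", "genus"),
]


def assign_ranks(lineage):
    """Map a ';'-separated lineage string to {rank: name} using name suffixes.

    Only single-word names are matched, so entries such as
    'unclassified Microviridae' or 'Influenza A virus' go to 'unranked'.
    """
    ranks = {"unranked": []}
    for name in (part.strip() for part in lineage.split(";")):
        if not name:
            continue
        rank = None
        if " " not in name:
            for suffix, candidate in RANK_SUFFIXES:
                if name.endswith(suffix):
                    rank = candidate
                    break
        if rank is None or rank in ranks:
            ranks["unranked"].append(name)
        else:
            ranks[rank] = name
    return ranks


contig2 = ("Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; "
           "Caudovirales; Lilyvirus; Paenibacillus virus Lily")
print(assign_ranks(contig2))
```

With this, the contig2 lineage simply has no "family" key, while its genus is still found.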

Unable to reproduce results of your test contigs example

Hi Mike,

I'm having difficulties reproducing your test contig example. For instance, annotation for end features is missing for the contigs. Could you please show what the output should look like with the default or recommended parameters for this test case? Any tips on troubleshooting/debugging would also be helpful.

Thank you very much.

Databases

Dear developers,

During unpacking, the following error occurs.

2022-01-26 05:04:08 (1.72 MB/s) - ‘pdb70_from_mmcif_latest.tar.gz’ saved [23794671727/23794671727]
md5sum
pdb70_a3m.ffdata
tar: pdb70_a3m.ffdata: Cannot write: Invalid argument

I solved this problem by unzipping with another program.

$ python /media/sf_S_DRIVE/Cenote_Taker2/Cenote-Taker2/run_cenote-taker2.py -c testcontigs_RNA_ct2.fasta -r Test_RNA_ct -p True -m 20 -t 4 -db rna_virus
/media/sf_S_DRIVE/Cenote_Taker2/Cenote-Taker2
circlator is not found. Exiting. #pip3 install helped me
efetch is not found. Exiting. # sudo apt-get install -y efetch helped me
bowtie2 is not found. Exiting. # conda helped me
tRNAscan-SE is not found. Exiting. # sudo apt install trnascan-se
tbl2asn is not found. Exiting.

Many packages were not installed, had to be installed manually.

I installed a new operating system (Debian 11) and found a number of installation errors. How critical are they?
1)

Cloning into 'PHANOTATE'...
make: *** No targets specified and no makefile found.  Stop.
Downloading and Extracting Packages
bedtools-2.30.0      | 17.9 MB   | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
cenote-taker2_env not loaded correctly
-- Installing: /home/sergey/Cenote-Taker2/Cenote-Taker2/hh-suite/build/scripts/splitfasta.pl
updateTaxonomy.sh: 9: [: Linux: unexpected operator
updateTaxonomy.sh: 70: function: not found
Update failed.
updateAccessions.sh: 9: Bad substitution

Beginners may also run into trouble if pip3 and git are not already installed, since a fresh system does not include them.

If I use 100 contigs and an RNA database,
python run_cenote-taker2.py -c 100_RNA.fasta -r results_annotationRNA -m 20 -t 4 -p False -db rna_virus --minimum_length_linear 1000 -am True -hh none

I get the following messages.

/media/sf_S_DRIVE/Cenote_Taker2/Cenote-Taker2/cenote-taker2.1.3.sh: line 2062: bc: command not found
/media/sf_S_DRIVE/Cenote_Taker2/Cenote-Taker2/cenote-taker2.1.3.sh: line 2063: bc: command not found
/media/sf_S_DRIVE/Cenote_Taker2/Cenote-Taker2/cenote-taker2.1.3.sh: line 2064: bc: command not found
/media/sf_S_DRIVE/Cenote_Taker2/Cenote-Taker2/cenote-taker2.1.3.sh: line 2065: bc: command not found
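
Since several of the problems above come down to executables that are not on PATH (`bc`, `circlator`, `efetch`, ...), a quick pre-flight check can list everything that is missing in one go instead of one "Exiting" at a time. This is a minimal sketch. The tool list is assembled from the start-up checks and error messages quoted in these threads and may not match what a given Cenote-Taker 2 version actually needs.

```python
import shutil

# Executables that appear in the "... found" start-up checks and error
# messages quoted in this thread; adjust the list for your own installation.
REQUIRED_TOOLS = [
    "prodigal", "bwa", "samtools", "mummer", "circlator", "blastp", "blastn",
    "blastx", "rpsblast", "bioawk", "efetch", "ktClassifyBLAST", "hmmscan",
    "bowtie2", "tRNAscan-SE", "pileup.sh", "tbl2asn", "getorf", "transeq",
    "bedtools", "bc",
]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it inside the activated conda environment so that PATH is the same one the pipeline will see.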

ModuleNotFoundError: No module named 'fastpathz'

cp: ‘/home/users/s/saini7/scratch/MS3/Cenote-Taker2/dummy_template.sbt’ and ‘/home/users/s/saini7/scratch/MS3/Cenote-Taker2/dummy_template.sbt’ are the same file
tail: cannot open ‘viruses_am_ct_9m25.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
tail: cannot open ‘viruses_am_ct_9m250.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
tail: cannot open ‘viruses_am_ct_9m510.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
tail: cannot open ‘viruses_am_ct_9m314.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
tail: cannot open ‘viruses_am_ct_9m1649.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
tail: cannot open ‘viruses_am_ct_9m153.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
tail: cannot open ‘viruses_am_ct_9m781.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
tail: cannot open ‘viruses_am_ct_9m167.tax_guide.blastx.tab’ for reading: No such file or directory
Entrez Direct does not support positional arguments.
Please remember to quote parameter values containing
whitespace or shell metacharacters.
Traceback (most recent call last):
File "/home/users/s/saini7/scratch/MS3/Cenote-Taker2/PHANOTATE/phanotate.py", line 7, in <module>
import fastpathz as fz
ModuleNotFoundError: No module named 'fastpathz'
Traceback (most recent call last):
File "/home/users/s/saini7/scratch/MS3/Cenote-Taker2/PHANOTATE/phanotate.py", line 7, in <module>
import fastpathz as fz
ModuleNotFoundError: No module named 'fastpathz'
Traceback (most recent call last):
File "/home/users/s/saini7/scratch/MS3/Cenote-Taker2/PHANOTATE/phanotate.py", line 7, in <module>
import fastpathz as fz
ModuleNotFoundError: No module named 'fastpathz'
Traceback (most recent call last):
File "/home/users/s/saini7/scratch/MS3/Cenote-Taker2/PHANOTATE/phanotate.py", line 7, in <module>
import fastpathz as fz
ModuleNotFoundError: No module named 'fastpathz'
Traceback (most recent call last):
File "/home/users/s/saini7/scratch/MS3/Cenote-Taker2/PHANOTATE/phanotate.py", line 7, in <module>
import fastpathz as fz
ModuleNotFoundError: No module named 'fastpathz'
sed: can't read *.phan.fasta: No such file or directory
sed: can't read *.trans.fasta: No such file or directory
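
The traceback says that the interpreter running phanotate.py cannot locate `fastpathz`. A short check like the one below, run with the same Python (i.e. inside the same conda environment), confirms whether the module is visible at all. Whether `fastpathz` is expected as an installed package or as an extension built inside the PHANOTATE directory is not clear from this log, so treat the check as a first diagnostic only.

```python
import importlib.util


def module_available(module_name):
    """Return True if `module_name` can be located by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # 'fastpathz' is the module named in the traceback above.
    for name in ("fastpathz",):
        status = "available" if module_available(name) else "MISSING"
        print(f"{name}: {status}")
```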

BlastN "no high coverage hits"

Hello Mike,

I'm using Cenote-Taker2 to identify viral contigs and detect certain virus species. I've been working with your test data to see if I could get your tool working on the server, but I have one problem with the results. With the default parameters, as you advised on the wiki, the organism name is something I can't really work with. You provide the option to perform blastn to get a more specific result, but when I look at these results, the blast result is always "no high coverage hits", whether I use my own data or your provided test data. I've looked into this problem and came across issue #15, but still couldn't get BLASTN_INFO to display anything other than the aforementioned result. I've read in your paper that the pipeline:

marks contigs with at least 90 per cent average nucleotide identity to existing database entries.

Looking at the blastn results in the intermediate files only shows % identities over 90%, so I am wondering whether I am doing something wrong. Could you elaborate on how Cenote-Taker2 uses blastn?

My command
python run_cenote-taker2.py -c testcontigs_DNA_ct2.fasta -r test_DNA_ct_3 -p True -m 16 -t 16 --known_strains blast_knowns --blastn_db /lustre/BIF/nobackup/kon001/thesis/Databases/NCBI_NT/nt | tee test_DNA_ct_3_output.log

Log file
test_DNA_ct_3_output.log

Thx in advance,

Matthijs

download-db.sh

Thanks for the very nice tool! Would it be possible to add options to the database download, e.g. to specify the storage location (somewhere other than the home directory) and to use wget -c? Can the databases also be downloaded manually? They are large, so the download takes a long time and may fail partway through; a resume option would help.
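
In the meantime, resuming by hand is possible for any HTTP(S) server that supports range requests. The sketch below shows the idea behind `wget -c`: ask only for the bytes that are not yet on disk and append them. The function names are hypothetical, and the URL and destination path are whatever you pass in; the real database URLs used by the download script are not reproduced here.

```python
import os
import urllib.error
import urllib.request


def resume_headers(dest_path):
    """Return a Range header asking for the bytes not yet in `dest_path`."""
    have = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    return {"Range": f"bytes={have}-"} if have > 0 else {}


def download_with_resume(url, dest_path, chunk_size=1 << 20):
    """Download an HTTP(S) `url` to `dest_path`, continuing a partial file if possible."""
    request = urllib.request.Request(url, headers=resume_headers(dest_path))
    try:
        response = urllib.request.urlopen(request)
    except urllib.error.HTTPError as err:
        if err.code == 416:
            # Range not satisfiable: usually the local file already has
            # every byte the server can offer.
            return
        raise
    with response:
        # 206 means the server honoured the Range header, so append;
        # any other success status means it is sending the whole file again.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as handle:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                handle.write(chunk)
```

Because the destination path is an argument, this also covers storing the archive somewhere other than the home directory.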

No taxonomic assignment rank information

Dear CT2 team,
Thanks for such amazing work.
My issue is that the taxonomic lineage (e.g. "Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Podoviridae; unclassified Podoviridae; crAss-like viruses; environmental samples") is missing from my output when I run CT2 on "testcontig_DNA.fasta". Compare the normal output (see "blastx.out (correct)" below) with my output ("blastx.out (something error)"). As a result, contigs 4 and 5 do not get the correct taxonomic name in "DNA_CONTIG_SUMMARY.tsv", which should be "crAss-like phage".

I checked "err.log" but found nothing. Looking forward to your reply. Thanks a lot.

  • DNA_CONTIG_SUMMARY.tsv
(cenote-taker2) [yutao@amms-sugon8401@DNA]$ cat DNA_CONTIG_SUMMARY.tsv
ORIGINAL_NAME   CENOTE_NAME     ORGANISM_NAME   END_FEATURE     LENGTH  ORF_CALLER      NUM_HALLMARKS   HALLMARK_NAMES     BLASTN_INFO
testcontig4     DNA4_vs01       Phage  sp. ct2Oo4_01    None    18176   PHANOTATE       4       crass_cluster.2/PFAM:PF16510.4-portal-protein|crass_cluster.3/PDB:3EZK_B-Terminase-large-subunit|crass_cluster.106/PFAM:PF05732.10-replication-protein|crass_cluster.1/PFAM:PF17236.1-Major-Capsid-Protein    BLASTN not conducted
testcontig1     DNA1_vs01       Microviridae  sp. ctDwk1_01     None    6217    Prodigal        2       Varsani_microviridae_cluster.14/PFAM:PF02305.16-Major-capsid-protein|Varsani_microviridae_cluster.30/PFAM:PF05840.12-Replication-associated-protein   BLASTN not conducted
testcontig2     DNA2_vs01       Virus  sp. ct0Ko2_01    None    1880    Prodigal        2       circoviridae_cluster.0/PFAM:PF01057.16-Rep-protein|circoviridae_cluster.1/PFAM:PF02443.14-capsid-protein   BLASTN not conducted
testcontig3     DNA3_vs01       Virus  sp. ctE8e3_01    None    9783    Prodigal        1       phycodna_cluster.4/PDB:5TIP_B-Major-capsid-protein BLASTN not conducted
testcontig5     DNA5_vs01       Phage  sp. ctqyQ5_01    None    8551    PHANOTATE       0       none    BLASTN not conducted
  • blastx.out (something error)
(cenote-taker2) [yutao@amms-sugon8401@DNA]$ ls no_end_contigs_with_viral_domain/*blastx.out
no_end_contigs_with_viral_domain/DNA1_vs01.tax_guide.blastx.out
no_end_contigs_with_viral_domain/DNA2_vs01.tax_guide.blastx.out
no_end_contigs_with_viral_domain/DNA3_vs01.tax_guide.blastx.out
no_end_contigs_with_viral_domain/DNA4_vs01.tax_guide.blastx.out
no_end_contigs_with_viral_domain/DNA5_vs01.tax_guide.blastx.out
(cenote-taker2) [yutao@amms-sugon8401@DNA]$ head no_end_contigs_with_viral_domain/*blastx.out
==> no_end_contigs_with_viral_domain/DNA1_vs01.tax_guide.blastx.out <==
DNA1_vs01_4     gi|906476413|ref|YP_009160408.1| replication protein VP4 [Microviridae Fen7918_21]      33.000     8.37e-09        100

==> no_end_contigs_with_viral_domain/DNA2_vs01.tax_guide.blastx.out <==
DNA2_vs01       gi|289163351|ref|YP_003422530.1| replicase [Porcine circovirus type 1/2a]       99.359  0.0312

==> no_end_contigs_with_viral_domain/DNA3_vs01.tax_guide.blastx.out <==
DNA3_vs01_8     gi|1464309244|ref|YP_009507580.1| major capsid protein [Heterosigma akashiwo virus 01]  46.154     9.20e-122       442

==> no_end_contigs_with_viral_domain/DNA4_vs01.tax_guide.blastx.out <==
DNA4_vs01_12    gi|674660398|ref|YP_009052554.1| putative Terminase large subunit [uncultured crAssphage] 100.000  0.0     751

==> no_end_contigs_with_viral_domain/DNA5_vs01.tax_guide.blastx.out <==
DNA5_vs01       gi|674660359|ref|YP_009052511.1| putative Protein of unknown function (DUF932) [uncultured crAssphage]     100.000 0.0     343
  • blastx.out (correct)
(drep) [u@h@no_end_contigs_with_viral_domain]$ head *blastx.out
==> DNA_ct2_out1_vs01.tax_guide.blastx.out <==
DNA_ct2_out1_vs01_4     gi|906476413|ref|YP_009160408.1| replication protein VP4 [Microviridae Fen7918_21]      33.000  8.37e-09  100
Viruses; Monodnaviria; Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; unclassified Microviridae

==> DNA_ct2_out2_vs01.tax_guide.blastx.out <==
DNA_ct2_out2_vs01       gi|289163351|ref|YP_003422530.1| replicase [Porcine circovirus type 1/2a]       99.359  0.0     312
Viruses; Monodnaviria; Shotokuvirae; Cressdnaviricota; Arfiviricetes; Cirlivirales; Circoviridae; Circovirus; unclassified Circovirus

==> DNA_ct2_out3_vs01.tax_guide.blastx.out <==
DNA_ct2_out3_vs01_8     gi|1464309244|ref|YP_009507580.1| major capsid protein [Heterosigma akashiwo virus 01]  46.154  9.20e-122 442
Viruses; Varidnaviria; Bamfordvirae; Nucleocytoviricota; Megaviricetes; Algavirales; Phycodnaviridae; Raphidovirus

==> DNA_ct2_out4_vs01.tax_guide.blastx.out <==
DNA_ct2_out4_vs01_12    gi|674660398|ref|YP_009052554.1| putative Terminase large subunit [uncultured crAssphage]       100.000   0.0     751
Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Podoviridae; unclassified Podoviridae; crAss-like viruses; environmental samples

==> DNA_ct2_out5_vs01.tax_guide.blastx.out <==
DNA_ct2_out5_vs01       gi|674660359|ref|YP_009052511.1| putative Protein of unknown function (DUF932) [uncultured crAssphage]    100.000 0.0     343
Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Podoviridae; unclassified Podoviridae; crAss-like viruses; environmental samples
  • err.log
(base) [yutao@headnode@DNA]$ cat err.log
00000000000000000000000000
00000000000000000000000000
000000000^^^^^^^^000000000
000000^^^^^^^^^^^^^^000000
00000^^^^^CENOTE^^^^^00000
00000^^^^^TAKER!^^^^^00000
00000^^^^^^^^^^^^^^^^00000
000000^^^^^^^^^^^^^^000000
000000000^^^^^^^^000000000
00000000000000000000000000
00000000000000000000000000

Version 2.1.3

@@@@@@@@@@@@@@@@@@@@@@@@@
Your specified arguments:
original contigs:                  ../testcontigs_DNA_ct2.fasta
forward reads:                     /mnt/data/share/software/Cenote-Taker2/test/no_reads
reverse reads:                     /mnt/data/share/software/Cenote-Taker2/test/no_reads
title of this run:                 DNA
Isolate source:                    unknown
collection date:                   unknown
metagenome_type:                   unknown
SRA run number:                    unknown
SRA experiment number:             unknown
SRA sample number:                 unknown
Bioproject number:                 unknown
template file:                     /mnt/data/share/software/Cenote-Taker2/dummy_template.sbt
minimum circular contig length:    1000
minimum linear contig length:      1
virus domain database:             standard
min. viral hallmarks for linear:   0
min. viral hallmarks for circular: 0
handle known seqs:                 do_not_check_knowns
contig assembler:                  unknown_assembler
DNA or RNA:                        DNA
HHsuite tool:                      hhblits
original or TPA:                   original
Do BLASTP?:                        no_blastp
Do Prophage Pruning?:              True
Filter out plasmids?:              True
Run BLASTN against nt?             none
Location of Cenote scripts:        /mnt/data/share/software/Cenote-Taker2
Location of scratch directory:     none
GB of memory:                      500
number of CPUs available for run:  100
Annotation mode?                   True
@@@@@@@@@@@@@@@@@@@@@@@@@
scratch space will not be used in this run
HHsuite database locations:
/mnt/data/share/software/Cenote-Taker2/NCBI_CD/NCBI_CD
/mnt/data/share/software/Cenote-Taker2/pfam_32_db/pfam
/mnt/data/share/software/Cenote-Taker2/pdb70/pdb70
/mnt/data/share/software/Cenote-Taker2/test/../testcontigs_DNA_ct2.fasta
no CRISPR file given
Prophage pruning requires --lin_minimum_hallmark_genes >= 1. changing to:
--lin_minimum_hallmark_genes 1
time update: locating inputs:  11-16-21---10:25:35
/mnt/data/share/software/Cenote-Taker2/test/../testcontigs_DNA_ct2.fasta
File with .fasta extension detected, attempting to keep contigs over 1 nt and find circular sequences with apc.pl
No circular contigs detected.
no reads provided or reads not found
No circular fasta files detected.
time update: running IRF for ITRs in non-circular contigs 11-16-21---10:25:35
time update: running prodigal on linear contigs  11-16-21---10:25:35
time update: running linear contigs with hmmscan against virus hallmark gene database: standard  11-16-21---10:25:37
 Starting pruning of non-DTR/circular contigs with viral domains
pruning script opened
fna files found
./DNA1.fna is too short to prune chromosomal regions
./DNA2.fna is too short to prune chromosomal regions
./DNA3.fna is too short to prune chromosomal regions
mv: cannot stat './DNA5.AA.sorted.fasta': No such file or directory
./DNA5.fna is too short to prune chromosomal regions
time update: HMMSCAN of common viral domains beginning 11-16-21---10:25:38
time update: making tables for hmmscan and rpsblast outputs  11-16-21---10:25:39
time update: running RPSBLAST on each sequence  11-16-21---10:25:39
/mnt/data/share/software/Cenote-Taker2/test/DNA/no_end_contigs_with_viral_domain/COMBINED_RESULTS_PRUNE.AA.rpsblast.out
time update: parsing tables into virus_signal.seq files for hmmscan and rpsblast outputs  11-16-21---10:25:40
time update: Identifying virus chunks, chromosomal junctions, and pruning contigs as necessary  11-16-21---10:25:41
Running file: DNA4.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
time update: Making prophage table  11-16-21---10:25:43
cat: DNA5.AA.hmmscan.sort.out: No such file or directory
cut: DNA5.AA.hmmscan.sort.out: No such file or directory
cut: DNA5.AA.hmmscan.sort.out: No such file or directory
 FINISHED PRUNING CONTIGS WITH AT LEAST 1 VIRAL DOMAIN(S)
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  11-16-21---10:25:43
 Combining tbl files from all search results AND fix overlapping ORF module
No ITR contigs with minimum hallmark genes found.
Annotating linear contigs
time update: running BLASTX, annotate linear contigs  11-16-21---10:25:43
time update: running PHANOTATE, annotate linear contigs  11-16-21---10:26:18
time update: running Prodigal, annotate linear contigs  11-16-21---10:26:23
time update: running hmmscan1, annotating linear contigs  11-16-21---10:26:24
time update: running hmmscan2, annotating linear contigs  11-16-21---10:26:25
time update: running RPSBLAST, annotating linear contigs  11-16-21---10:26:27
/mnt/data/share/software/Cenote-Taker2/test/DNA/no_end_contigs_with_viral_domain/COMBINED_RESULTS.rotate.AA.rpsblast.out
time update: running tRNAscan-SE  11-16-21---10:26:28
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  11-16-21---10:26:31
 Combining tbl files from all search results AND fix overlapping ORF module, linear contigs
finalizing taxonomy for linear contigs
DNA2_vs01 is a CRESS virus of some kind
No suitable ORF for taxonomy found for DNA5_vs01, using BLASTX result.
time update: finished annotating linear contigs  11-16-21---10:26:51
time update: running tbl2asn  11-16-21---10:26:53
[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Flatfile DNA1_vs01

[real-tbl2asn] Validating DNA1_vs01

[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Flatfile DNA2_vs01

[real-tbl2asn] Validating DNA2_vs01

[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Flatfile DNA3_vs01

[real-tbl2asn] Validating DNA3_vs01

[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Flatfile DNA4_vs01

[real-tbl2asn] Validating DNA4_vs01

[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Replaced '       ' with '#'
[real-tbl2asn] Flatfile DNA5_vs01

[real-tbl2asn] Validating DNA5_vs01

[tbl2asn-forever] WARNING: .gbf|.sqn files have incorrect date (01-JAN-2019) and will need to be corrected.
Making gtf tables from final feature tables

time update: Finishing  11-16-21---10:26:55
Virus prediction summary:
5 virus contigs were detected/predicted. 0 contigs had DTRs/circularity. 0 contigs had ITRs. 5 were linear/had no end features
Prophage pruning summary:
1 linear contigs > 10 kb were run through pruning module, and 0 virus sub-contigs (putative prophages/proviruses) were extracted from these. 1 virus contigs were kept intact.
removing ancillary files
output directory: DNA
 >>>>>>CENOTE-TAKER 2 HAS FINISHED TAKING CENOTES<<<<<<
/mnt/data/share/software/Cenote-Taker2
prodigal found
BWA found
samtools found
mummer found
circlator found
blastp found
blastn found
blastx found
rpsblast found
bioawk found
efetch found
ktClassifyBLAST found
hmmscan found
bowtie2 found
tRNAscan-SE found
pileup.sh found
tbl2asn found
getorf found
transeq found
bedtools found
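
Comparing the two listings above, each file from the working run has a second line with the lineage (starting with "Viruses;"), while the files from the failing run contain only the BLASTX hit line. The sketch below lists the files that lack such a line, which helps to tell whether every contig is affected or only some. It assumes the lineage line always begins with "Viruses;", as in the examples shown; the function name is hypothetical.

```python
from pathlib import Path


def files_missing_lineage(directory, pattern="*.tax_guide.blastx.out"):
    """Return names of matching files with no line starting with 'Viruses;'."""
    missing = []
    for path in sorted(Path(directory).glob(pattern)):
        lines = path.read_text().splitlines()
        if not any(line.startswith("Viruses;") for line in lines):
            missing.append(path.name)
    return missing
```

For the run above this would be called as `files_missing_lineage("no_end_contigs_with_viral_domain")`.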

API key of NCBI

Dear CT2 team,
I ran into the message "PLEASE REQUEST AN API_KEY FROM NCBI" when running CT2; it seems to be caused by eutils.
Does CT2 obtain the taxonomic rank information this way? How can I fix the problem?
Looking forward to your reply.
Thanks a lot.
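
A possible workaround, as far as I know: Entrez Direct tools such as `efetch` read an API key from the `NCBI_API_KEY` environment variable, and child processes inherit the environment of whatever launches them. The sketch below builds such an environment for a launcher script. I have not verified that this removes the message in CT2 2.1.3, and the key shown is a placeholder.

```python
import os


def env_with_ncbi_key(api_key, base_env=None):
    """Return a copy of the environment with NCBI_API_KEY set for child processes."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCBI_API_KEY"] = api_key
    return env


# Usage sketch (placeholder key; `your_command` stands for the argument list
# you normally use to start the run):
#   import subprocess
#   subprocess.run(your_command, env=env_with_ncbi_key("YOUR_KEY_HERE"))
```

Exporting the variable in the shell before starting the run should have the same effect.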

(cenote-taker2) [yut@amms-sugon8401@006]$ cat 350.log
00000000000000000000000000
00000000000000000000000000
000000000^^^^^^^^000000000
000000^^^^^^^^^^^^^^000000
00000^^^^^CENOTE^^^^^00000
00000^^^^^TAKER!^^^^^00000
00000^^^^^^^^^^^^^^^^00000
000000^^^^^^^^^^^^^^000000
000000000^^^^^^^^000000000
00000000000000000000000000
00000000000000000000000000

Version 2.1.3

@@@@@@@@@@@@@@@@@@@@@@@@@
Your specified arguments:
original contigs:                  /mnt/data/share/database/Human_gut_virome_database/Public_Merged_Human_Virome_Database/SPLITS_8_PART/006/PMHVD_DREP_rep_seq.part_006.part_350.fasta
forward reads:                     /mnt/data/share/database/Human_gut_virome_database/Public_Merged_Human_Virome_Database/PMHVD_DREP_REP_SEQ_CT2_OUT/006/no_reads
reverse reads:                     /mnt/data/share/database/Human_gut_virome_database/Public_Merged_Human_Virome_Database/PMHVD_DREP_REP_SEQ_CT2_OUT/006/no_reads
title of this run:                 350_out
Isolate source:                    unknown
collection date:                   unknown
metagenome_type:                   unknown
SRA run number:                    unknown
SRA experiment number:             unknown
SRA sample number:                 unknown
Bioproject number:                 unknown
template file:                     /mnt/data/share/software/Cenote-Taker2/dummy_template.sbt
minimum circular contig length:    1000
minimum linear contig length:      1
virus domain database:             standard
min. viral hallmarks for linear:   0
min. viral hallmarks for circular: 0
handle known seqs:                 do_not_check_knowns
contig assembler:                  unknown_assembler
DNA or RNA:                        DNA
HHsuite tool:                      hhblits
original or TPA:                   original
Do BLASTP?:                        no_blastp
Do Prophage Pruning?:              True
Filter out plasmids?:              True
Run BLASTN against nt?             none
Location of Cenote scripts:        /mnt/data/share/software/Cenote-Taker2
Location of scratch directory:     none
GB of memory:                      50
number of CPUs available for run:  10
Annotation mode?                   True
@@@@@@@@@@@@@@@@@@@@@@@@@
scratch space will not be used in this run
HHsuite database locations:
/mnt/data/share/software/Cenote-Taker2/NCBI_CD/NCBI_CD
/mnt/data/share/software/Cenote-Taker2/pfam_32_db/pfam
/mnt/data/share/software/Cenote-Taker2/pdb70/pdb70
no CRISPR file given
Prophage pruning requires --lin_minimum_hallmark_genes >= 1. changing to:
--lin_minimum_hallmark_genes 1
time update: locating inputs:  11-18-21---21:44:16
/mnt/data/share/database/Human_gut_virome_database/Public_Merged_Human_Virome_Database/PMHVD_DREP_REP_SEQ_CT2_OUT/006/PMHVD_DREP_rep_seq.part_006.part_350.fasta
File with .fasta extension detected, attempting to keep contigs over 1 nt and find circular sequences with apc.pl
No circular contigs detected.
no reads provided or reads not found
No circular fasta files detected.
time update: running IRF for ITRs in non-circular contigs 11-18-21---21:44:18
time update: running prodigal on linear contigs  11-18-21---21:44:18
time update: running linear contigs with hmmscan against virus hallmark gene database: standard  11-18-21---21:46:01
 Starting pruning of non-DTR/circular contigs with viral domains
pruning script opened
fna files found
mv: cannot stat './350_out7.AA.sorted.fasta': No such file or directory
cut: ./350_out7.AA.hmmscan.sort.out: No such file or directory
mv: cannot stat './350_out8.AA.sorted.fasta': No such file or directory
cut: ./350_out8.AA.hmmscan.sort.out: No such file or directory
mv: cannot stat './350_out18.AA.sorted.fasta': No such file or directory
cut: ./350_out18.AA.hmmscan.sort.out: No such file or directory
mv: cannot stat './350_out20.AA.sorted.fasta': No such file or directory
cut: ./350_out20.AA.hmmscan.sort.out: No such file or directory
mv: cannot stat './350_out23.AA.sorted.fasta': No such file or directory
cut: ./350_out23.AA.hmmscan.sort.out: No such file or directory
time update: HMMSCAN of common viral domains beginning 11-18-21---21:46:56
time update: making tables for hmmscan and rpsblast outputs  11-18-21---21:49:57
time update: running RPSBLAST on each sequence  11-18-21---21:50:05
/mnt/data/share/database/Human_gut_virome_database/Public_Merged_Human_Virome_Database/PMHVD_DREP_REP_SEQ_CT2_OUT/006/350_out/no_end_contigs_with_viral_domain/COMBINED_RESULTS_PRUNE.AA.rpsblast.out
time update: parsing tables into virus_signal.seq files for hmmscan and rpsblast outputs  11-18-21---21:50:27
time update: Identifying virus chunks, chromosomal junctions, and pruning contigs as necessary  11-18-21---21:50:50
Running file: 350_out10.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out11.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out12.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out13.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out14.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out15.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out16.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out17.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out19.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out1.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out21.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out22.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out24.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out25.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out26.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out2.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out3.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out4.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out5.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out6.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
Running file: 350_out9.virus_signal.seq
   Window +/- to the right  ...  Chunk_end  Window midpoint
0       1                +  ...       none             2500
0       1                +  ...       none             2500

[2 rows x 6 columns]
time update: Making prophage table  11-18-21---21:53:10
 FINISHED PRUNING CONTIGS WITH AT LEAST 1 VIRAL DOMAIN(S)
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  11-18-21---21:53:11
 Combining tbl files from all search results AND fix overlapping ORF module
No ITR contigs with minimum hallmark genes found.
Annotating linear contigs
time update: running BLASTX, annotate linear contigs  11-18-21---21:53:11
429 Too Many Requests
PLEASE REQUEST AN API_KEY FROM NCBI
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=35344&rettype=full&retmode=xml&edirect_os=linux&edirect=13.3&tool=edirect&email=yut@amms-sugon8401'
Result of do_post http request is
$VAR1 = bless( {
                 '_headers' => bless( {
                                        'referrer-policy' => 'origin-when-cross-origin',
                                        'client-response-num' => 1,
                                        'content-length' => '87',
                                        'client-peer' => '130.14.29.110:443',
                                        'x-xss-protection' => '1; mode=block',
                                        'access-control-expose-headers' => 'X-RateLimit-Limit,X-RateLimit-Remaining,Retry-After',
                                        'x-ua-compatible' => 'IE=Edge',
                                        'x-ratelimit-remaining' => '0',
                                        'content-type' => 'application/json',
                                        'x-test-test' => 'test42',
                                        'client-ssl-cert-subject' => '/C=US/ST=Maryland/L=Bethesda/O=National Library of Medicine/CN=*.ncbi.nlm.nih.gov',
                                        'client-ssl-cipher' => 'ECDHE-RSA-AES256-GCM-SHA384',
                                        'vary' => 'Accept-Encoding',
                                        'client-ssl-cert-issuer' => '/C=US/O=DigiCert Inc/CN=DigiCert TLS RSA SHA256 2020 CA1',
                                        'date' => 'Thu, 18 Nov 2021 13:54:47 GMT',
                                        'client-date' => 'Thu, 18 Nov 2021 13:54:48 GMT',
                                        'server' => 'Finatra',
                                        'strict-transport-security' => 'max-age=31536000; includeSubDomains; preload',
                                        'x-ratelimit-limit' => '3',
                                        '::std_case' => {
                                                          'x-test-test' => 'X-Test-Test',
                                                          'x-ratelimit-remaining' => 'X-RateLimit-Remaining',
                                                          'client-ssl-cert-subject' => 'Client-SSL-Cert-Subject',
                                                          'x-ua-compatible' => 'X-UA-Compatible',
                                                          'access-control-expose-headers' => 'Access-Control-Expose-Headers',
                                                          'client-peer' => 'Client-Peer',
                                                          'referrer-policy' => 'Referrer-Policy',
                                                          'client-response-num' => 'Client-Response-Num',
                                                          'x-xss-protection' => 'X-XSS-Protection',
                                                          'content-security-policy' => 'Content-Security-Policy',
                                                          'x-ratelimit-limit' => 'X-RateLimit-Limit',
                                                          'strict-transport-security' => 'Strict-Transport-Security',
                                                          'client-ssl-socket-class' => 'Client-SSL-Socket-Class',
                                                          'client-date' => 'Client-Date',
                                                          'client-ssl-cert-issuer' => 'Client-SSL-Cert-Issuer',
                                                          'client-ssl-cipher' => 'Client-SSL-Cipher'
                                                        },
                                        'client-ssl-socket-class' => 'IO::Socket::SSL',
                                        'retry-after' => '2',
                                        'content-security-policy' => 'upgrade-insecure-requests',
                                        'connection' => 'close'
                                      }, 'HTTP::Headers' ),
                 '_request' => bless( {
                                        '_method' => 'POST',
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.39',
                                                               '::std_case' => {
                                                                                 'if-ssl-cert-subject' => 'If-SSL-Cert-Subject'
                                                                               }
                                                             }, 'HTTP::Headers' ),
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' ),
                                        '_content' => 'db=taxonomy&id=35344&rettype=full&retmode=xml&edirect_os=linux&edirect=13.3&tool=edirect&email=yut@amms-sugon8401',
                                        '_uri_canonical' => $VAR1->{'_request'}{'_uri'}
                                      }, 'HTTP::Request' ),
                 '_msg' => 'Too Many Requests',
                 '_protocol' => 'HTTP/1.1',
                 '_content' => '{"error":"API rate limit exceeded","api-key":"218.241.250.70","count":"4","limit":"3"}
',
                 '_rc' => 429
               }, 'HTTP::Response' );

time update: running PHANOTATE, annotate linear contigs  11-18-21---21:56:39
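
The response dump above shows the cause fairly directly: the server reports 'x-ratelimit-limit' => '3', 'retry-after' => '2', and a body saying count 4 exceeded limit 3. As far as I understand NCBI's E-utilities documentation, registering an API key raises that limit, and EDirect picks the key up from the NCBI_API_KEY environment variable. Independently of the key, a client can wait for the advertised Retry-After interval before trying again. The following is only a minimal Python sketch of that idea, not what CT2 or EDirect does internally; `fetch_with_backoff` and the `do_request` callable are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical request type: returns (HTTP status, headers, body).
Response = Tuple[int, Dict[str, str], str]


def fetch_with_backoff(
    do_request: Callable[[], Response],
    max_attempts: int = 5,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Repeat do_request until it stops answering with HTTP 429."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = do_request()
        if status != 429:
            return body
        # Use the server's Retry-After value (seconds) when it is present.
        wait = float(headers.get("retry-after", default_wait))
        if attempt < max_attempts:
            sleep(wait)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

The `sleep` parameter is injectable only so the behaviour can be checked without real waiting; in normal use the default `time.sleep` applies.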

Find taxonomy via best ORF

Hi,

I am struggling to find where the taxonomic classification is output. The CONTIG_SUMMARY has an ORGANISM_NAME, but no taxonomy. Is one meant to trace the ORGANISM_NAME to the database to find the taxonomy? Sorry if I have missed something obvious. I am essentially looking for the output from the step where BLASTp is used to query the best ORF against RefSeq.

Command:
run_cenote-taker2.py -c contig.file.fasta -r contig.names -p False -t 70 -m 200 -am True -db rna_virus -hh hhsearch

Thanks in advance!

ERROR: Invalid inference value tRNAscan-SE

I am trying Cenote-Taker. Everything was fine until the tbl2asn step. Kindly check the following log file:

[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:33.8
[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:44.2
[tbl2asn] Flatfile x10

[tbl2asn] Validating x10

[tbl2asn] Flatfile x1

[tbl2asn] Validating x1

[tbl2asn] Flatfile x2

[tbl2asn] Validating x2

[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:44.5
[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:37.9
[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:37.9
[tbl2asn] Flatfile x3

[tbl2asn] Validating x3

[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:47.6
[tbl2asn] Flatfile x4

[tbl2asn] Validating x4

[tbl2asn] Flatfile x5

[tbl2asn] Validating x5

[tbl2asn] Flatfile x6

[tbl2asn] Validating x6

[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:44.5
[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:37.9
[tbl2asn] ERROR: Invalid inference value tRNAscan-SE score:37.9
[tbl2asn] Flatfile x7

[tbl2asn] Validating x7

[tbl2asn] Flatfile x8

[tbl2asn] Validating x8

[tbl2asn] Flatfile x9

[tbl2asn] Validating x9

Making gtf tables from final feature tables

Kindly let me know how to resolve this.

"-am True" edits the sequence

I want to annotate a sequence that I am confident of. It is a densovirus, with ITRs that have internal repeat structure. I do not want Cenote-Taker to reorient or edit the sequence before annotating it. However, even when I run "-am True", it (mis)identifies the sequence as circular and reverse complements it before annotating it.

How do I make it JUST annotate the sequence?

Minimum protein length

This is less of an issue, and more of a feature.

I am using cenote-taker2 to annotate some manually curated contigs. It seems keen to annotate proteins as short as 30 AA that lack any homology to relatives.

How do I control the minimum length of annotated proteins?

Thanks!

D
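
Whether the pipeline exposes a minimum-length setting is not stated in this thread. One workaround that does not depend on such a setting is to filter the protein FASTA after the run. The sketch below assumes a plain protein FASTA as input; the function names and the length cutoff are hypothetical, and it does not touch the feature tables or GenBank files the tool writes.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(
    records: Iterable[Tuple[str, str]], min_aa: int
) -> Iterator[Tuple[str, str]]:
    """Keep proteins with at least min_aa residues (a trailing '*' is not counted)."""
    for header, seq in records:
        if len(seq.rstrip("*")) >= min_aa:
            yield header, seq
```

With a file on disk this could be used as `filter_min_length(read_fasta(open(path)), 50)`, where both the path and the value 50 are placeholders.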

Error: Argument list too long when using latest Cenote-Taker 2.1.5

Hi Mike,
Thank you for Cenote-Taker2! However, I am facing a problem when I run the command:
python ~/miniconda3/bin/Cenote-Taker2/run_cenote-taker2.py -c ~/hxj/depth_analysis/raw_data_2/contig_megahit/SRR9161502_9000000/final.contigs.fa -r SRR9161502_9000000 -p true -m 32 -t 32 2>&1 | tee output.log

the tail of the output.log file is like:
SRR9161502_900000074949.fasta has DTRs/circularity
SRR9161502_9000000541967.fasta has DTRs/circularity
SRR9161502_9000000137044.fasta has DTRs/circularity
no reads provided or reads not found
Circular fasta file(s) detected

Putting non-circular contigs in a separate directory
time update: running IRF for ITRs in non-circular contigs 10-12-22---00:21:13
/media/home/user11/miniconda3/bin/Cenote-Taker2/cenote-taker2.1.5.sh: line 464: /usr/bin/find: Argument list too long
time update: running prodigal on linear contigs 10-12-22---01:20:08
Could you help me solve this problem? Is that because my input file has too many sequences?

Regards!
Hazel
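
"Argument list too long" is the error a shell command returns when the arguments passed to it (often an expanded wildcard) exceed the system limit. If that is what happens at line 464 because the run produces one file per contig, then the number of contigs in the input is a plausible cause. One hypothetical workaround, not confirmed by the author in this thread, is to split the assembly into smaller batches and run each batch separately. A minimal sketch, with made-up function names and batch size:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def batches(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group records into lists of at most `size` records each."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each batch can then be written to its own FASTA file and passed to a separate run; the batch size that avoids the error would have to be found by trial.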

Only detecting 3 test viruses, and no genbank output

Hey all,

I've been trying to use cenote-taker2 and ran into an issue: When analysing the testcontigs_DNA_ct2.fasta file, I only get hits for 3 elements, and no genbank file output. As far as I can tell everything has been installed correctly, although there are somehow no template files in my folder. I can run the program without the template, which might be the reason I'm not getting any genbank files, but I can't imagine why the detection would fail.

Any idea what I might be doing wrong? I have attached a log file and my output folder
cheers
Paul

test.zip
cenote_log.txt

Linking rna_virus search to original contig

Hi,

Is it possible to have the original contig name in the output include the information after the space so that the name of the contig from the original contig dictionary would be captured?

KM_ct2761 contig_1689 (from non_viral_domains_contigs.fna) would be reported as KM_ct2761 contig_1689, not just KM_ct2761. This would save a lookup step.

Thank you,
Kathie Mihindukulasuriya
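
Until the output carries the full name, the lookup step mentioned above can be scripted. This is a small sketch that assumes the headers in non_viral_domains_contigs.fna look like the example in this post (">KM_ct2761 contig_1689", new name first, original name after the first space); `header_lookup` is a hypothetical helper, not part of Cenote-Taker 2.

```python
from typing import Dict, Iterable


def header_lookup(lines: Iterable[str]) -> Dict[str, str]:
    """Map the first token of each FASTA header to the rest of that header."""
    lookup: Dict[str, str] = {}
    for line in lines:
        if line.startswith(">"):
            new_id, _, original = line[1:].strip().partition(" ")
            lookup[new_id] = original
    return lookup
```

Calling `header_lookup(open("non_viral_domains_contigs.fna"))` would return a dictionary that can be joined against the names reported in the output tables.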

Minor issue on a source modifier

Thank you for providing a tool for GenBank submissions, the whole process can be quite confusing.
I realized I cannot assign the 'country' as a source modifier and it is shown as 'USA' by default in the .sqn file. Should this be manually circumvented?

My command looks like this just in case:
python run_cenote-taker2.py -am True --contigs vp1_ORF_only.fas --run_title vp1_genbank --template_file cs_template.sbt --prune_prophage False --mem 32 --cpu 28 --enforce_start_codon False --isolation_source feces --collection_date 2008 --bioproject PRJNAXXXX --host Homo sapiens --assembler metaSPADES --known_strains blast_knowns --blastn_db db/nt.fasta

Getting phage contigs even after "phage pruning" is set to true

Hi Mike,
I just wanted to clarify whether I should expect to see some phage contigs even if I set phage pruning to True. I read your notes (https://github.com/mtisza1/Cenote-Taker2/wiki#notes-on-virus-taxonomy ) on Caudovirales not being pruned. But I'm also getting Varsani Microviridae and some other phages. Would you say those could be contigs where phages integrated into viral genomes? Or something else?

Thank you very much for your help.

About output files

Hello, Mike!
First of all, thank you very much for providing us with such a good tool, but I think this tool is not very friendly to noobs (such as me). I am a first-year postgraduate student and have just started studying bioinformatics. I used CT2 to analyze some virus sequences, but the output files stumped me. Many of them are file types I haven't seen before and I don't know how to open them. I can open the .gbf files with any genome/plasmid viewer, but what tools should I use to open and view the other files? I hope to get your answers and help. Thank you very much!

Ancillary files not found when trying to remove

Hi,

I am trying to install "cenote-taker2" on our HPC system for a local research group.
I downloaded all databases and dependencies using the commit from 03/26/2020.

I did a test run you have provided in the Wiki page using the command:

run_cenote-taker2.0.1.py --contigs testcontigs_DNA_ct2.fasta --run_title test_DNA_ct --template_file template.sbt --prune_prophage True --mem 58 --cpu 8 --filter_out_plasmids False --enforce_start_codon False --handle_contigs_without_hallmark sketch_all --known_strains blast_knowns --blastn_db /work/HCC/BCRF/BLAST/nt

Most of the programs finished ok I think, but I am getting the following error at the end:

9783
 Summary file made: test_DNA_ct.tsv 
removing ancillary files
rm: cannot remove '*.comb.tbl': No such file or directory
rm: cannot remove '*.remove_hypo.txt': No such file or directory
rm: cannot remove '*.out.hhr': No such file or directory
rm: cannot remove '*.out.hhr': No such file or directory
rm: cannot remove 'bt2_indices/': No such file or directory
rm: cannot remove 'other_contigs/*.dat': No such file or directory
rm: cannot remove 'no_end_contigs_with_viral_domain/*.remove_hypo.txt': No such file or directory
rm: cannot remove 'no_end_contigs_with_viral_domain/*.trans.fasta': No such file or directory
rm: cannot remove 'no_end_contigs_with_viral_domain/test_DNA_ct3_vs1.AA.called_hmmscan2.txt': No such file or directory
rm: cannot remove 'no_end_contigs_with_viral_domain/test_DNA_ct4.AA.called_hmmscan2.txt': No such file or directory
rm: cannot remove 'no_end_contigs_with_viral_domain/test_DNA_ct4_vs1.AA.called_hmmscan2.txt': No such file or directory

These files indeed do not exist in the output directory, thus the message.
I was wondering if this type of error message is familiar to you, and whether you have some suggestions on how to fix it.
I am using the "testcontigs_DNA_ct2.fasta" file you have provided, and a dummy "template.sbt" file.
Please find the complete log here, cenote-taker2.log

I am looking forward to hearing from you, and if you need any additional information, please let me know.

Thank you,
Natasha

Error in script for vConTACT?

In the script provided in the README.md file to generate the files required for vConTACT2, the definition of the variable that holds the input file, SUMMARY="cenote_out_CONTIG_SUMMARY.tsv", should actually be SUMMARY="cenote_CONTIG_SUMMARY.tsv".

cenote-taker2 vs (blastn nt & diamond nr)

Hi Mike,
Recently, I ran cenote-taker2, and also blastn against the nt database and DIAMOND against the nr database, on contigs assembled by MEGAHIT. About 10000 sequences were classified as viruses by cenote-taker2, while only about 1000 were identified by BLAST. I am confused about why BLAST reports ten times fewer viral sequences than cenote-taker2.

As you pointed out, "Many virus genomes are integrated into host chromosomes" and "viral genes and genomes are often misidentified as host sequences" (Tisza M J, Belford A K, Dominguez-Huerta G, et al. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evolution, 2021, 7(1): veaa100). Thus, BLAST may give some false-negative results. So, is there a threshold for classifying sequences as viral or non-viral using both tools (e.g. BLAST p-value, percent identity, or mapping length)?

Wish you a merry Christmas in advance!

Nailou Zhang
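
For the BLAST side of such a comparison, thresholds are usually applied to the tabular output. The sketch below assumes the default 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), which both blastn and DIAMOND can write with output format 6. The cutoff values are placeholders for illustration only, not thresholds recommended by the Cenote-Taker 2 authors, and `filter_hits` is a hypothetical name.

```python
import csv
from typing import Iterable, Iterator, List


def filter_hits(
    rows: Iterable[str],
    min_pident: float = 90.0,
    min_length: int = 500,
    max_evalue: float = 1e-10,
) -> Iterator[List[str]]:
    """Yield tabular BLAST hits that pass identity, length and E-value cutoffs."""
    for fields in csv.reader(rows, delimiter="\t"):
        if len(fields) < 12:
            continue  # skip comment lines or malformed rows
        pident = float(fields[2])   # column 3: percent identity
        length = int(fields[3])     # column 4: alignment length
        evalue = float(fields[10])  # column 11: E-value
        if pident >= min_pident and length >= min_length and evalue <= max_evalue:
            yield fields
```

Comparing the contig names that survive such a filter with the contigs called viral by cenote-taker2 shows which sequences only one of the two approaches reports.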

Catalog content no_end_contigs_with_viral_domain

Good afternoon! Could you please tell me which of these files contains all of the annotated proteins? Which one can be used, for example, for a vConTACT analysis?

all_LIN_HMM2_proteins.AA.fasta
all_LIN_rps_proteins.AA.fasta
all_LIN_sort_genome_proteins.AA.fasta
all_prunable_rps_proteins.AA.fasta
all_prunable_seq_proteins.AA.fasta

Argument list too long

When I run python ~/software/Cenote-Taker2/Cenote-Taker2/run_cenote-taker2.py -c AsianEle01.fa -r AsianEle01_ct.out -m 32 -t 32 -p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2
It reported:
~/software/Cenote-Taker2/Cenote-Taker2/cenote-taker2.1.3.sh: line 228: /usr/bin/rm: Argument list too long

Is that ok?
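
The message means that the rm command at that line received more arguments than the system allows, typically because a wildcard expanded to a very large number of files, so the files it was meant to delete were probably left in place. Whether that affects later steps is not answered in this thread. For cleaning up by hand, deleting files one by one avoids the limit. A minimal sketch, assuming a flat directory and a filename pattern chosen by the user (both hypothetical here):

```python
import fnmatch
import os


def remove_matching(directory: str, pattern: str) -> int:
    """Delete regular files in `directory` whose names match `pattern`.

    Files are removed one at a time, so no long argument list is ever built.
    Returns the number of files removed.
    """
    with os.scandir(directory) as entries:
        # Collect paths first so the directory is not modified while scanning.
        targets = [
            entry.path
            for entry in entries
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern)
        ]
    for path in targets:
        os.remove(path)
    return len(targets)
```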

Error: grep: sequin_directory/*.fsa: No such file or directory when using latest Cenote-Taker 2.0.1

Hi Mike,
Thank you for Cenote-Taker2! However, I am facing a problem when I run the following command:
python /media/home/user05/anaconda3/envs/cenote-taker2/bin/run_cenote-taker2.0.1.py --contigs /media/atm1/user05/virome1018/spades_contig/xunyu/XSQ_xunyu.contig.fa --run_title xunyu --template_file /media/home/user05/Cenote-Taker2/dummy_template.sbt --mem 250 --cpu 24 --prune_prophage true

grep: sequin_directory/*.fsa: No such file or directory
head: cannot open 'sequin_directory/*.fsa' for reading: No such file or directory

bioawk: can't open file sequin_directory/*.fsa
source line number 1
length
cat: '*.rotate.AA.called_hmmscan.txt': No such file or directory
head: cannot open 'no_end_contigs_with_viral_domain/sequin_directory/*.fsa' for reading: No such file or directory

grep: no_end_contigs_with_viral_domain/sequin_directory/*.fsa: No such file or directory
head: cannot open 'no_end_contigs_with_viral_domain/sequin_directory/*.fsa' for reading: No such file or directory

head: cannot open 'no_end_contigs_with_viral_domain/sequin_directory/*.fsa' for reading: No such file or directory

head: cannot open 'no_end_contigs_with_viral_domain/sequin_directory/*.fsa' for reading: No such file or directory

grep: no_end_contigs_with_viral_domain/sequin_directory/*.fsa: No such file or directory
head: cannot open 'no_end_contigs_with_viral_domain/sequin_directory/*.fsa' for reading: No such file or directory

bioawk: can't open file no_end_contigs_with_viral_domain/sequin_directory/*.fsa
source line number 1

cat: 'no_end_contigs_with_viral_domain/*.AA.hmmscan2.sort.out': No such file or directory
head: cannot open 'no_end_contigs_with_viral_domain/*.fna' for reading: No such file or directory
Summary file made: xunyu.tsv
removing ancillary files
Looking forward to your reply.

DTR removing

Hi all,

I am using version 2.1.2 and I do not know if this was corrected in newer versions. When I search for circular genomes assembled with a De Bruijn graph based assembler, I expect to find a duplicated sequence at both contig ends, the same size as the k-mer used for the assembly. These duplicated sequences are produced by the assembler because it cannot "decide" where to continue (I think this is called a "bubble" in graph theory). In any case, I know the "real" genome has only one copy of this sequence, so one of them should be removed. In Cenote-Taker this is called a DTR (Direct Terminal Repeat), and after the rotation step I would expect one copy to be removed from the final genome, but this is not happening, at least in version 2.1.2. I have checked that several circular genomes in the sequin_and_genome_maps folder contain both copies of the repeat produced by the assembler.

Was this corrected in a newer version? I could not find any information on that.

Thanks,

Alberto
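
The trimming described above (an exact repeat at both contig ends, with one copy to be dropped) can be written down in a few lines. This is a minimal sketch of that idea only, assuming an exact match between the start and the end of the sequence; it is not the code Cenote-Taker 2 uses, and the function names and default lengths are hypothetical.

```python
def terminal_repeat_length(seq: str, min_len: int = 10, max_len: int = 1000) -> int:
    """Return the length of the longest prefix that exactly equals the suffix.

    Only repeat lengths between min_len and max_len (and at most half the
    sequence) are considered; 0 means no such repeat was found.
    """
    upper = seq.upper()
    limit = min(max_len, len(upper) // 2)
    for k in range(limit, min_len - 1, -1):
        if upper[:k] == upper[-k:]:
            return k
    return 0


def trim_terminal_repeat(seq: str, min_len: int = 10) -> str:
    """Drop the second copy of an exact terminal repeat, if one is present."""
    k = terminal_repeat_length(seq, min_len)
    return seq[:-k] if k else seq
```

For a contig built as repeat + body + repeat, `trim_terminal_repeat` returns repeat + body, which is the single-copy circular genome written in linear form.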

DTR

Dear Mike,
I found that not all contigs with DTRs are added to the final file (final_combined_virus_sequences). Why might this be? For example, 4 contigs with DTRs were found, but only 1 is in the final file. I can send a file with the contigs.

Rotation to avoid ORFs overlapping with breakpoint does not seem to work?

Hi
Thank you for developing this amazing tool!

I have some circular phage genomes for which an ORF still overlaps with the breakpoint of the sequence.
If I understand correctly, the genome should be rotated if this is the case (cenotetaker2.1.3.sh: starting from line 746).
The genomes are rotated, but there are still overlapping ORFs...

I received no errors and ran CenoteTaker2 with the --enforce_start_codon False option (I want partial ORFs on linear contigs to be included - not sure if this has anything to do with it), --min_circular_hallmark_genes 0 & -am True.
As far as I can tell, none of the files report the "missing" part of the ORF - not even as a second "partial" ORF at the end of the sequence (which would already be helpful).

I'm not sure what went wrong or if this just means that there is no better position to break open the circular sequence?

Thanks in advance for your help!
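
The last question, whether a circular sequence can have no position that avoids every ORF, can be checked directly from ORF coordinates. The sketch below is an illustration of that check and not the rotation logic in the Cenote-Taker 2 script. It assumes each ORF is given as a hypothetical (start, length) pair with 0-based start positions on the circular genome; a cut at position p means the linear sequence becomes seq[p:] + seq[:p].

```python
from typing import Iterable, Optional, Tuple


def safe_breakpoint(genome_len: int, orfs: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Return the first cut position that splits no ORF, or None if none exists."""
    blocked = set()
    for start, length in orfs:
        # Cutting at any position strictly inside the ORF would split it.
        for offset in range(1, length):
            blocked.add((start + offset) % genome_len)
    for pos in range(genome_len):
        if pos not in blocked:
            return pos
    return None


def rotate(seq: str, pos: int) -> str:
    """Re-linearise a circular sequence so that it starts at `pos`."""
    return seq[pos:] + seq[:pos]
```

When overlapping ORFs cover the whole circle, `safe_breakpoint` returns None, which corresponds to the case where every possible breakpoint interrupts at least one ORF.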

Unexpected sensitivity to . characters in output name

On running:

python ~/apps/CenoteTaker2/run_cenote-taker2.py -c $S.scaffolds.fasta --srr_number $S --known_strains blast_knowns --blastn_db /data/BLAST_databases/nt --reads1 $S/$S.trim.1.fq.gz --reads2 $S/$S.trim.2.fq.gz -r $S.virus -m 150 -t 40 -p False

I get a large number of mv and grep errors, when they can't find files, and the final table is empty. If, instead, I choose to run:

python ~/apps/CenoteTaker2/run_cenote-taker2.py -c $S.scaffolds.fasta --srr_number $S --known_strains blast_knowns --blastn_db /data/BLAST_databases/nt --reads1 $S/$S.trim.1.fq.gz --reads2 $S/$S.trim.2.fq.gz -r ${S}_virus -m 150 -t 40 -p False

It completes as expected, albeit with the error:

/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.3.sh: line 596: s/#/ /g: No such file or directory

I think the code must have a problem with a '.' in the output file string, which is somewhat unexpected ...

Thanks!

D
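
Since the run with ${S}_virus completed while $S.virus did not, one way to avoid the problem in a batch script is to normalise the run title before passing it to -r. The sketch assumes, based only on the two runs above, that letters, digits and underscores are safe; `safe_run_title` is a hypothetical helper.

```python
import re


def safe_run_title(title: str) -> str:
    """Replace every character other than letters, digits and '_' with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", title)
```

For example, a title built as sample name plus ".virus" would be passed on with the dot replaced by an underscore.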

Circularity detection and BlastN

Hello Mike,

I'm using Cenote-Taker2 to discover viral elements in my samples. I'm really glad that lots of viral genes were identified, but I have two questions and hope to get your suggestions.

  1. The end_feature of some metagenomic plasmids are labeled as NONE.
    Screen Shot 2021-10-01 at 9 58 28 AM

I got different end_feature results when analyzing the same assembly file separately (s00.1003single.txt) versus together with other samples (s00.1003list.txt). In the latter case, I have a list of assemblies (scaffold_list.txt) from SPAdes and I run the program over the list. Below are the scripts for both runs. The first sample in the list is 1003-01.fasta. When it was analyzed separately, the end features of some viral sequences were annotated as DTR (as shown in the picture above). However, when I ran it as part of the list, all of the metagenomic plasmids were labeled as NONE, although other information, such as LENGTH, was the same.

The outputs also differed. In the separate run, I got AA.fasta files (e.g., d00_ct1003single1033.AA.fasta.txt), but when run from the list, I got permuted.fa files (e.g., permuted.1033.fa.txt) instead. I know this is confusing, and I'm not sure whether these problems are connected. I have attached the log files for both runs; please let me know if you need more information to figure out the problem.

s00.1003single.txt
s00.1003list.txt

Log files for the separate run:
x_dna_test_33051552.txt
y_dna_test_33051552.txt

Log files for the list run:
x_dna_test_1.txt
y_dna_test_1.txt

  2. I turned on the BLAST function with the following script lines and expected to identify some highly conserved sequences, but all samples showed "no high coverage hits." I would like to know whether I made a mistake. Thank you!
--known_strains blast_knowns \
--blastn_db ftp://ftp.ncbi.nlm.nih.gov/ \

Best,
Kailun
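
A note on the `--blastn_db` value in this report: the other commands on this page pass a local database prefix (for example `/data/BLAST_databases/nt` or `~/nt/nt`), whereas this one passes an FTP URL. Assuming the option is handed on to `blastn -db`, which expects a locally available database, a quick check like the sketch below may help rule this out. The function name is hypothetical, and the check covers only the `.nal` alias and `.nin` index files of a nucleotide database.

```python
from pathlib import Path


def looks_like_local_nucl_db(prefix):
    """Return True if nucleotide BLAST database files exist for this prefix.

    Assumption: a usable database shows up as <prefix>.nal (alias file) or as
    one or more <prefix>*.nin index files (e.g. nt.nin or nt.00.nin).
    """
    path = Path(prefix).expanduser()
    folder, name = path.parent, path.name
    if not name or not folder.is_dir():
        return False
    for entry in folder.iterdir():
        if entry.name == f"{name}.nal":
            return True
        if entry.name.startswith(f"{name}.") and entry.name.endswith(".nin"):
            return True
    return False


if __name__ == "__main__":
    # The value from the report is a URL, not a local database prefix.
    print(looks_like_local_nucl_db("ftp://ftp.ncbi.nlm.nih.gov/"))
```

If this returns False for the intended path, the database would likely need to be downloaded locally first and the prefix passed instead.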

conda install of bowtie fails?

Hi!

I just tried a new clean set up of cenote-taker2, and I get this error:

Aligning provided reads to contigs over cutoff to determine coverage.
time update: making bowtie2 indices  01-25-22---17:26:07
Aligning reads to BowTie2 index.
time update: aligning reads to bowtie2 indices  01-25-22---17:26:07
/data/home/dobbard/miniconda3/envs/cenote-taker2_env/bin/bowtie2-align-s: symbol lookup error: /data/home/dobbard/miniconda3/envs/cenote-taker2_env/bin/bowtie2-align-s: undefined symbol: _ZN3tbb10interface78internal15task_arena_base19internal_initializeEv
(ERR): Description of arguments failed!
Exiting now ...

It may be related to this conda / bowtie2 problem: BenLangmead/bowtie2#336

Blindly following the instruction I found there:

conda install tbb=2020.2

seems to work.

any option to use continue mode

Thanks for this nice tool. It looks like the whole process can be very slow. Is it possible to set a flag for a continue mode, so that we can restart the process where it stopped rather than starting from the beginning?
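
To make the request concrete, here is a minimal sketch of one common way to get resumable behaviour: each step writes a marker file when it finishes, and a rerun skips any step whose marker already exists. This is not how Cenote-Taker 2 works; the function name, the step names and the `.done` marker convention are all hypothetical.

```python
import tempfile
from pathlib import Path


def run_resumable(steps, checkpoint_dir):
    """Run (name, callable) steps in order, skipping those already marked done.

    Returns the names of the steps that were actually executed in this call.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, action in steps:
        marker = checkpoint_dir / f"{name}.done"
        if marker.exists():
            continue  # finished in an earlier run
        action()  # if this raises, no marker is written and the step reruns
        marker.touch()
        executed.append(name)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical step names with placeholder actions.
        steps = [("orf_calling", lambda: None), ("annotation", lambda: None)]
        print(run_resumable(steps, tmp))  # first run executes both steps
        print(run_resumable(steps, tmp))  # second run finds both markers
```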

Pipeline hangs at PHANOTATE step

I've been running Cenote Taker2 on a single sample, and it's been stuck on the PHANOTATE step for a bit over 24 hours now, without any visible update or running process.

When I run ps -a I see that I have several phanotate.py processes running, but via top they all appear to be sleeping.

I've had this happen on several other samples - at one point I ran about 10k Cenote-Taker2 runs, I'd say it happened in about 1-5% of samples.

I saw another issue about this from months ago, but the fix was just to update to the latest version. I did a complete re-install of Cenote-Taker2 and still have the same issue. I've attached the log file here.

Any help getting it to finish the PHANOTATE step would be massively appreciated!!
Carter

cenote_log.txt
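
When launching many runs, one workaround for an occasional stalled step is to put a wall-clock limit on each run so that a hang is reported instead of blocking the batch. The sketch below uses only the Python standard library; `run_with_timeout` is a hypothetical helper, and it does not address why PHANOTATE stalls.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in child process that sleeps longer than the limit.
    stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(stalled, timeout_s=1))
```

Note that only the directly launched process is killed on timeout; processes it spawned itself (such as several `phanotate.py` workers) may need separate cleanup.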

None of my viral contigs are getting functional annotations

Hello, I've tried both importing known viral contigs for annotation only, and running the original metagenome from which they came for virus prediction and annotation, and neither time are there any viral annotations in the results. Not one viral hallmark gene is being predicted on any contig, but I know that other tools that I've used have found several. Do you know what might be causing this? Thanks in advance.

BlastN "no high coverage hits"

hi Mike,

I ran into the same problem as Issue #22. I installed Krona with mamba and then successfully updated the Krona database using the following code.

KRONA_DIR=$( which python | sed 's|bin/python|opt/krona|g' )
cd ${KRONA_DIR}
./updateTaxonomy.sh
cd ${KRONA_DIR}
./updateAccessions.sh

The BLAST result is always "no high coverage hits" when I use known viruses downloaded from NCBI (e.g. ZIKV, SARS-CoV-2, TBEV).

My command
python run_cenote-taker2.py --run_title ZNLtest -c testVirus.fasta -m 260 -t 45 -p False -db standard -am True --molecule_type RNA --known_strains blast_knowns --blastn_db ~/nt/nt --minimum_length_linear 1000 --lin_minimum_hallmark_genes 1 -hh hhsearch --hallmark_taxonomy True

ktClassifyBLAST -o test1.tab ZNLtest3.blastn_intraspecific.out
Loading taxonomy...
Classifying ZNLtest3.blastn_intraspecific.out...
[ WARNING ] "ZNLtest3.blastn_intraspecific.out" had e-values of 0. Approximated log[10] of 0 as -450.
[ WARNING ] The following accessions look strange and may yield erroneous results. Please check if they are acual valid NCBI
accessions: tname
[ WARNING ] The following accessions were not found in the local database (if they were recently added to NCBI, use
updateAccessions.sh to update the local database): tname
Writing test1.tab...

To import, run:
ktImportTaxonomy test1.tab # This can also work well
How can I resolve this issue?

How to include locus_tag or gene_id in gbf?

Hi Mike,
Thank you for Cenote-Taker2! It is doing a great job with our datasets.
Question: There is no locus_tag or gene_id in our output gbf files - is there an option to include this? Or an easy way to add it?
Best wishes,
Kathryn
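
As a possible post-processing route, a `/locus_tag` qualifier can be added to each CDS feature by editing the flat file as text. The sketch below assumes the standard GenBank layout (feature key indented by 5 spaces, qualifiers indented by 21 spaces) and that no CDS already carries a locus_tag. The function name, the `CT2` prefix and the numbering scheme are hypothetical, not a Cenote-Taker 2 option.

```python
import re

FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")  # e.g. "     CDS             1..300"
QUALIFIER_INDENT = " " * 21


def add_locus_tags(lines, prefix, feature_key="CDS"):
    """Return lines with a /locus_tag inserted after each feature's location.

    `lines` are newline-stripped lines of a GenBank flat file.
    """
    out = []
    counter = 0
    pending = None  # locus_tag line waiting until the location is complete
    for line in lines:
        # A location that wraps continues on indented lines not starting with "/".
        is_continuation = (
            line.startswith(QUALIFIER_INDENT)
            and not line.lstrip().startswith("/")
        )
        if pending is not None and not is_continuation:
            out.append(pending)
            pending = None
        out.append(line)
        match = FEATURE_KEY.match(line)
        if match and match.group(1) == feature_key:
            counter += 1
            pending = f'{QUALIFIER_INDENT}/locus_tag="{prefix}_{counter:04d}"'
    if pending is not None:
        out.append(pending)
    return out


if __name__ == "__main__":
    sample = [
        "     CDS             join(1..10,",
        "                     20..30)",
        '                     /product="hypothetical protein"',
    ]
    print("\n".join(add_locus_tags(sample, "CT2")))
```

For a real file, read it with `text.splitlines()`, pass the lines through the function, and write `"\n".join(result)` to a new file so the original stays untouched.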

Updates

Dear Mike,
I really like your program! Could you please clarify whether you are planning to update the HMM profile database soon?

ResolvePackageNotFound: - bbtools=37.62

Hi Mike,

Whenever I try to run the conda env create --file cenote-taker2_env.yml command I get the ResolvePackageNotFound error. To deal with this I tried to download bbtools from the website itself, however, when I got to the last step of the installation I kept getting an error saying the wget file could not be found.

ITR length and running irf

Hi Mike,

  1. I was wondering if there's any limitation on contig length to detect ITRs? For instance, in your test contig example you're showing a relatively short ITR. What if my ITR is > 100bp? Any parameters to irf input I should adjust?

  2. If I want to debug irf module by itself - how would you recommend to run it? I tried just invoking the executable, but it doesn't work. It seems to work from within cenote, though, as no error message in the output is found.

Thank you very much for your help.

input reads (=megahit output) of cenote

These are the last lines of the MEGAHIT log, which indicate that MEGAHIT finished successfully.

2376 contigs, total 4900430 bp, min 201 bp, max 129463 bp, avg 2062 bp, N50 8928 bp
ALL DONE. Time elapsed: 78.262247 seconds

However, there is an error in the Cenote-Taker 2 log:
Error, fewer reads in file specified with -2 than in file specified with -1

Does this mean that Cenote-Taker 2 received too few input reads?
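
The quoted message is a Bowtie2 error about the two mate files: the file given with `-2` holds fewer reads than the file given with `-1`. It does not say that there are too few reads overall. A quick way to confirm this is to count the records in both files, as in this sketch, which assumes plain 4-line FASTQ records (optionally gzip-compressed); the function names are hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open a FASTQ file for text reading, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_records(lines):
    """Count records in an iterable of FASTQ lines (4 lines per record)."""
    return sum(1 for _ in lines) // 4


def count_reads(path):
    """Count the records in a FASTQ file on disk."""
    with open_maybe_gzip(path) as handle:
        return count_fastq_records(handle)


if __name__ == "__main__":
    mate1 = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "IIII"]
    mate2 = ["@r1", "ACGT", "+", "IIII"]
    # A mismatch like this one is what the Bowtie2 message reports.
    print(count_fastq_records(mate1), count_fastq_records(mate2))
```

If `count_reads` differs between the two files, the pair was probably trimmed or filtered independently, or one file is truncated.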

Run time with 7,000 viral genomes

Hello, I have been attempting to analyse a block of metagenomically derived putative viral genomes. These have been curated to a certain extent, so the confidence in the set is reasonably good.

That said, I gave this job to a physical machine with 32 cores and 250 GB of memory. The job has currently accumulated 90 hours of wall time and 125 hrs of CPU time. Although I have specified 32 cpus, I was hoping there would be more effective use of the concurrency available -- ie maybe a ratio of CPUtime/walltime ~ 20.
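
For reference, the figures quoted above work out as follows; the only inputs are the 125 CPU hours, 90 wall-clock hours and 32 cores stated in the post.

```python
def mean_busy_cores(cpu_hours, wall_hours):
    """Average number of cores kept busy over the run (CPU time / wall time)."""
    return cpu_hours / wall_hours


ratio = mean_busy_cores(125, 90)
print(f"{ratio:.2f} cores busy on average")   # 1.39
print(f"{ratio / 32:.1%} of the 32 cores")    # 4.3%
```

So the run has used roughly 1.4 cores on average, well short of the hoped-for ratio of about 20.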

The pipeline has been processing the largest directory named no_end_contigs_with_viral_domain and most of that time seems to have been spent on steps involving Phanotate ORF prediction. I suppose it will follow this with annotation? It has already completed the same with Prodigal I think -- you're thorough!

Anyway, I had to have a look at the main script cenote-taker2.1.3.sh and I would like to ask if you have ever considered converting these steps to Nextflow? You would achieve reliable parallelisation, that could be further tuned per-step depending on IO, memory and CPU bounds of the related programs. You would also be able to keep a lot of the complex bash-fu as it is currently. The pipeline would also be easily moved onto HPCs through different queuing systems.

I would offer you my time to do so, but I am pretty limited at present.

Core dump

Hello!

Thank you for sharing this powerful tool.

We just got Cenote-Taker2 installed on our system. I tried to run it on the test dataset "testcontigs_DNA_ct2.fasta", but every time no circular viral sequences or DTRs were detected. Instead, a "core" file was generated. I was running it in a terminal. Here is the run script:

time ${CENOTE_BASE}/run_cenote-taker2.py \
-c ${indir}/testcontigs_DNA_ct2.fasta \
-r ${outdir} \
-p True \
-m 30 \
-t ${SLURM_CPUS_PER_TASK}

In the slurm file it says "File with .fasta extension detected, attempting to keep contigs over 1000 nt and find circular sequences with apc.pl
No circular contigs detected."

I'm not sure if anyone else also had this problem. Could you help me figure it out? Any suggestions would be greatly appreciated!

Best,
Kailun

Quiet failure (strict mode option?)

This is less of a specific bug, and more something that would be helpful to change a bit.

I have noticed that when some step in the pipeline fails, it isn't entirely obvious at the end when looking at the outputs. For example, because of an issue on my end, phanotate.py wasn't running, but I didn't realize this until I saw the xargs error in the output "xargs: phanotate.py: No such file or directory." From that run, I still had a full summary .tsv file at the end, and it looked mostly normal.

In the past when I've had other errors, I've also only noticed because I went looking to make sure everything was right. This is somewhat of an issue when running Cenote-Taker on many samples, where most might work but some might fail -- It isn't practical to have to manually check the log files of 100s of runs.

It would be helpful if there were a "strict" mode, where cenote-taker2 quits if it encounters any error, rather than just continuing as though nothing happened. This would make it more obvious that a run has failed, and easier to distinguish between output files where no viruses were found vs output files where the run failed.

Thanks!
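
Until something like a strict mode exists, one stopgap for batches is to scan each run's log for known failure messages instead of reading the logs by hand. The sketch below is hypothetical; its patterns are taken from error messages quoted in the issues on this page and are certainly not exhaustive.

```python
import re

# Patterns drawn from errors quoted on this page (assumed, not exhaustive).
ERROR_PATTERNS = [
    re.compile(r"No such file or directory"),
    re.compile(r"symbol lookup error"),
    re.compile(r"\(ERR\)"),
]


def find_error_lines(log_lines):
    """Return (line_number, text) for every log line matching a known pattern."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log = [
        "time update: running PHANOTATE\n",
        "xargs: phanotate.py: No such file or directory\n",
    ]
    for number, text in find_error_lines(log):
        print(f"line {number}: {text}")
```

A wrapper script could flag any run with at least one hit, so that failed runs can be told apart from runs where no viruses were found.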

Feature request: blastn could use multiple threads

Hi!

I'm assuming it's better to create new issues, rather than mush multiple issues into one? Tell me if you'd prefer I don't open new ones!

blastn (--known_strains blast_knowns) takes an age - is there a reason to only use one thread? and/or do one job at a time? From memory, blastn is most efficient at around 4-6 threads.

I guess it may depend on how many sequences there are to search, but if there are just a few then -num_threads 4 might be faster?

Thanks!

Darren
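
For illustration, this is roughly what a threaded call would look like when run by hand. `-query`, `-db`, `-out` and `-num_threads` are standard blastn options, but the helper and the file names are hypothetical, and this is not the pipeline's actual invocation.

```python
def build_blastn_command(query, db, out, threads=4):
    """Build a blastn argument list that uses several threads."""
    return [
        "blastn",
        "-query", str(query),
        "-db", str(db),
        "-out", str(out),
        "-num_threads", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical file names; the database path is the one quoted on this page.
    cmd = build_blastn_command("contigs.fasta", "/data/BLAST_databases/nt", "hits.out")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where blastn is installed.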
