vinuesa / get_phylomarkers

A pipeline to select optimal markers for microbial phylogenomics and species tree estimation using the multispecies coalescent and concatenation approaches

License: Other

get_phylomarkers's Introduction

GET_PHYLOMARKERS

GET_PHYLOMARKERS (Vinuesa et al., 2018) is an open-source software package for selecting optimal markers for microbial phylogenomics and species tree estimation. It implements a bioinformatics pipeline to filter core-genome gene clusters computed by the companion package GET_HOMOLOGUES, and selects only those with optimal attributes for phylogenetic inference using maximum likelihood (ML). The multiple sequence alignments of the filtered loci are concatenated into a supermatrix to estimate a species tree using the state-of-the-art fast ML tree searching algorithms FastTree or IQ-TREE. It also estimates ML and parsimony trees from the pan-genome matrix, including unsupervised learning methods. We have also tested it successfully with plant coding sequences (details here).

GET_PHYLOMARKERS 2

Starting with release 2.0.0 (2022-11-20), GET_PHYLOMARKERS also computes a concatenation-independent species tree from the ML gene trees estimated from top-scoring alignments using ASTRAL-III.

Release 2.1.0 (2024-03-31) introduced maximal matched-pairs tests to assess whether the data violate the stationarity, reversibility, and homogeneity (SRH) assumptions made by maximum-likelihood phylogenetic models, as implemented in IQ-TREE. Additionally, ASTRAL-IV is used to estimate the species tree directly from the filtered ML source gene trees; it computes terminal and internal branch lengths in substitutions-per-site units.

Release 2.2.0 (v2.2.0, 2024-04-14) introduced new features, most notably:

- A significant extension of run mode 2 (run_get_phylomarkers_pipeline.sh -R 2) for population genetics (multiple sequences from the same species), which computes a population tree from a SNP matrix of top-scoring, neutral alignments.
- Reporting of a detailed numeric overview of the different filtering steps.
- run_get_phylomarkers_pipeline.sh now also calls the C binary WEIGHTED-ASTRAL to estimate a species tree from the filtered gene trees estimated by iqtree2 or FastTree.
- Complex protein mixture models are used for concatenated protein alignments.

Release 2.2.1 (2024-04-16) includes a static binary of snp-sites, which is called by run_get_phylomarkers_pipeline.sh >= v2.8.1.0_2024-04-15 under run mode 2 (-R 2, population genetics) to generate SNP matrices in FASTA and VCF formats from the concatenated alignments of filtered, highly informative, and neutral loci. The FASTA SNP matrix is used for estimating an ML population tree. Thanks to Alfredo Hernández @ccg_unam for compiling snp-sites-static.
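To make the idea of a FASTA SNP matrix concrete, here is a minimal sketch that keeps only the variable columns of an aligned FASTA file. It is not how snp-sites is implemented and it is not part of the pipeline: the function name `variable_columns` is hypothetical, and unlike snp-sites this sketch treats any differing character (including gaps or N) as variation.

```shell
# Illustrative sketch only: print the variable columns of an aligned FASTA file.
# Assumptions: all sequences have the same aligned length; any character that
# differs from the first sequence marks a column as variable.
variable_columns() {
  awk '
    /^>/ { if (n) seq[n] = s; n++; hdr[n] = $0; s = ""; next }
    { gsub(/[ \t\r]/, ""); s = s $0 }          # join wrapped sequence lines
    END {
      if (n) seq[n] = s
      len = length(seq[1])
      for (c = 1; c <= len; c++) {             # find columns with variation
        ref = substr(seq[1], c, 1)
        for (i = 2; i <= n; i++)
          if (substr(seq[i], c, 1) != ref) { keep[++k] = c; break }
      }
      for (i = 1; i <= n; i++) {               # write the reduced matrix
        out = ""
        for (j = 1; j <= k; j++) out = out substr(seq[i], keep[j], 1)
        print hdr[i]
        print out
      }
    }' "$1"
}
```

Calling `variable_columns concatenated_alignment.fasta` (a placeholder file name) prints each header followed by the SNP-only sequence.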

This release was used to build the latest GET_PHYLOMARKERS Docker image (20240418), ready to pull from Docker Hub (docker pull vinuesa/get_phylomarkers:latest). This is a significantly lighter (2.0GB) image than the previous one (v20240414, 2.09GB) because several unnecessary R packages were removed. On Docker Hub, you will find detailed instructions on installing and configuring the Docker client on your machine, pulling the latest image, and running the containerized instance of the GET_PHYLOMARKERS pipeline.

A detailed manual and step-by-step tutorials document the software and help users get up and running quickly. For convenience, HTML and markdown versions of the documentation are available.

Installation, dependencies and Docker image

For detailed instructions and dependencies please check INSTALL.md.

A GET_PHYLOMARKERS Docker image is available, as well as an image bundling GET_PHYLOMARKERS + GET_HOMOLOGUES, ready to use. Detailed instructions for setting up the Docker environment are provided in INSTALL.md. How to run container instances with the test sequences distributed with GET_PHYLOMARKERS is described in the tutorial.

Aim

GET_PHYLOMARKERS (Vinuesa et al. 2018) implements a series of sequential filters (detailed below) to select, from the homologous gene clusters produced by GET_HOMOLOGUES, markers with optimal attributes for phylogenomic inference. It estimates gene trees and species trees under the maximum likelihood (ML) optimality criterion using state-of-the-art fast ML tree searching algorithms. The species tree is estimated from the supermatrix of concatenated, top-scoring alignments that passed the quality filters outlined in the figures below and explained in detail in the manual and publication.

Figure 1A. Simplified flow-chart of the GET_PHYLOMARKERS pipeline showing only those parts used and described in this work. The left branch, starting at the top of the diagram, is fully under control of the master script run_get_phylomarkers_pipeline.sh. The names of the worker scripts called by the master program are indicated at the relevant points along the flow, as detailed in the manual. The image corresponds to Fig. 1 of Vinuesa et al. 2018.

Figure 1B. Combined filtering actions performed by GET_HOMOLOGUES and GET_PHYLOMARKERS to select top-ranking phylogenetic markers to be concatenated for phylogenomic analyses, and benchmark results of the performance of the FastTree (FT) and IQ-TREE (IQT) maximum-likelihood (ML) phylogeny inference programs. The image corresponds to Fig. 3 of Vinuesa et al. 2018.

GET_HOMOLOGUES is a genome-analysis software package for microbial pan-genomics and comparative genomics originally described in the following publications:

More recently we developed GET_HOMOLOGUES-EST, which can be used to cluster eukaryotic genes and transcripts, as described in Contreras-Moreira et al., Front. Plant Sci. 2017.

If GET_HOMOLOGUES-EST is fed both .fna and .faa files of CDS sequences, it produces output identical to that of GET_HOMOLOGUES, which can therefore be analyzed with GET_PHYLOMARKERS in the same way.

GET_PHYLOMARKERS is primarily tailored towards selecting CDSs (gene markers) to infer DNA-level phylogenies of different species of the same genus or family. It can also select optimal markers for population genetics, when the source genomes belong to the same species (Vinuesa et al. 2018). For more divergent genome sequences, classified in different genera, families, orders or higher taxa, the pipeline should be run using protein instead of DNA sequences.

Figure 2A. Best maximum-likelihood core-genome phylogeny for the genus Stenotrophomonas found in the IQ-TREE search, based on the supermatrix obtained by concatenation of 55 top-ranking alignments. The image corresponds to Fig. 5 of Vinuesa et al. 2018.

Figure 2B. Maximum-likelihood pan-genome phylogeny estimated with IQ-TREE from the consensus pan-genome clusters displayed in the Venn diagram. Clades of lineages belonging to the S. maltophilia complex are collapsed and are labeled as in Figure 2A. Numbers on the internal nodes represent the approximate Bayesian posterior probability/UFBoot2 bipartition support values (see methods). The tabular inset shows the results of fitting either the binary (GTR2) or morphological (MK) models implemented in IQ-TREE, indicating that the former has an overwhelmingly better fit. The scale bar represents the number of expected substitutions per site under the binary GTR2+F0+R4 substitution model. The image corresponds to Fig. 6 of Vinuesa et al. 2018.

Manual and tutorials

Please, follow the links for a detailed manual and tutorials, including a graphical flowchart of the pipeline and explanations of the implementation details. See also our plant tutorial.

Citation.

Pablo Vinuesa, Luz-Edith Ochoa-Sanchez and Bruno Contreras-Moreira (2018). GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the genus Stenotrophomonas. Front. Microbiol. | doi: 10.3389/fmicb.2018.00771

Published in the Research Topic on "Microbial Taxonomy, Phylogeny and Biodiversity" http://journal.frontiersin.org/researchtopic/5493/microbial-taxonomy-phylogeny-and-biodiversity

Code

Source code is freely available from GitHub and released under a GPLv3-like license.
Docker images ready to pull:
- GET_PHYLOMARKERS Docker image
- GET_HOMOLOGUES+GET_PHYLOMARKERS Docker image

Developers

The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain.

Acknowledgements

We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support with server administration.

Funding

We gratefully acknowledge the funding provided over the years by DGAPA-PAPIIT/UNAM (grants IN201806-2, IN211814, IN206318, and IN216424) and CONAHCyT-Mexico (grants P1-60071, 179133 and A1-S-11242) to Pablo Vinuesa, as well as the Fundación ARAID, Consejo Superior de Investigaciones Científicas (grant 200720I038) and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.

get_phylomarkers's Issues

>>> ERROR: Input f?aed files do not contain the same number of strains and a single instance for each strain...

Hello
I couldn't find guidance for this error in the manual or tutorials. Following the tutorial at https://vinuesa.github.io/get_phylomarkers/GET_PHYLOMARKERS_manual.html#get_phylomarkers-tutorial, I believe I did everything correctly up to the command run_get_phylomarkers_pipeline.sh -R 2 -t DNA.
I don't understand what the error description is asking me to do. Could someone help me, please? Thank you.

My commands:
initial folder 'PlanktoPampPhylo_gbk': 5 references (NCBI) + 10 MAGs, all in gbff format (annotated using bakta)
then:

get_homologues.pl -d PlanktoPampPhylo_gbk -t 15 -e -n 16
get_homologues.pl -d PlanktoPampPhylo_gbk -G -t 0 -n 16
get_homologues.pl -d PlanktoPampPhylo_gbk -M -t 0 -n 16
compare_clusters.pl -d ./MetaPamp092015AG5bin6_f0_15taxa_algBDBH_e1_,\
./MetaPamp092015AG5bin6_f0_0taxa_algCOG_e0_,\
./MetaPamp092015AG5bin6_f0_0taxa_algOMCL_e0_ -o core_BCM -t 15 -n
compare_clusters.pl -d ./MetaPamp092015AG5bin6_f0_15taxa_algBDBH_e1_,\
./MetaPamp092015AG5bin6_f0_0taxa_algCOG_e0_,\
./MetaPamp092015AG5bin6_f0_0taxa_algOMCL_e0_ -o core_BCM -t 15
compare_clusters.pl -d ./MetaPamp092015AG5bin6_f0_0taxa_algCOG_e0_,\
./MetaPamp092015AG5bin6_f0_0taxa_algOMCL_e0_ -o pan_CM -n -m
cd PlanktoPampPhylo_gbk_homologues/core_BCM
find . -name '*.faa' | wc -l
4269
find . -name '*.fna' | wc -l
4269
run_get_phylomarkers_pipeline.sh -R 2 -t DNA
...
1 >Bakta_PJECAB_20240_
1 >Bakta_PJECAB_20245_
1 >Bakta_PJECAB_20250_
1 >Bakta_PJECAB_20255_
1 >Bakta_PJECAB_20260_
1 >Bakta_PJECAB_20265_
1 >Bakta_PJECAB_20270_
1423 >[Planktothrix_mougeotii_LEGE_06226]_LEGE_06226
1423 >[Planktothrix_pseudagardhii]_No.713
1423 >[Planktothrix_sp._FACHB-1365]_FACHB-1365
1423 >[Planktothrix_tepida]PCC9214
1423 >[Planktothrix_tepida_PCC_9214]

ERROR: Input f?aed files do not contain the same number of strains and a single instance for each strain...
Please check input FASTA files as follows:
1. Revise the output above to make sure that all genomes have a strain assignation and the same number of associated sequences.
If not, add strain name manually to the corresponding fasta files or exclude them.
2. Make sure that only one genome/gbk file is provided for each strain.
3. You may need to run get_homologues.pl with -e or compare_clusters.pl with -t NUM_OF_INPUT_GENOMES to get clusters of equal sizes
Please check the GET_HOMOLOGUES manual
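
Hint 1 of the error message asks that every genome has a strain assignation. As a rough way to spot offending sequences, the sketch below lists FASTA headers that carry no label in square brackets. It assumes that strain labels appear as `[label]` in the header, as in the Planktothrix lines of the output above; the function name `untagged_headers` is hypothetical and this is not the pipeline's own check.

```shell
# Sketch: list unique FASTA headers that contain no [strain] label.
# Assumption: strain labels are written in square brackets in the header line.
untagged_headers() {
  # keep header lines, drop those containing a bracketed label, deduplicate
  grep -h '^>' "$@" | grep -v '[[][^]]*]' | sort -u
}
```

Run inside the cluster directory, for example as `untagged_headers *.fna`, it prints nothing when every header is tagged.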

feature-request-HipMCL

Can you guys add HipMCL as one of the clustering options? HipMCL

best, Sean

ERROR: Input f?aed files do not contain the same number of strains and a single instance for each strain...

Dear get_phylomarkers team,

I attempted to use the get_phylomarkers pipeline in a Docker environment (DockerToolbox-18.09.3). I tried to test it with the provided dataset test_sequences/core_genome, but I get the error message in the title (log file attached). I wonder if you could help me with this issue.

I look forward to hearing from you.

Sincerely,

Nemanja

ERROR.log

feature development?

Hi,
Just found your pipeline and looks great! Just had a very fast read but will try it soon

But I was wondering if you would consider adding a track to your pipeline: it would be very useful to be able to use sampling dates when reconstructing phylogenies (which requires a Bayesian approach). Even if IQ-TREE is really great, ML models still do not take into account that our sampled bacterial strains could actually be the ancestors of current bacteria.
Being able to do pan-genome phylogeny while accounting for ancestral strains would be of great help in epidemiology (e.g. for locating the putative source of an outbreak).

I am testing some methods now, and very preliminary data indicate 2 clusters: old strains and new strains of what we believe might be a single slow-moving epidemic.

So why not add a track where you could use Bayesian inference (e.g. BEAST with dates as priors, although other tools have been popping up recently) for your phylogeny reconstruction? That would be really helpful for population-level studies and classification of bacterial pathogens.

All the best
Eve

get_phylomarkers_error

Dear get_phylomarkers team,

I am using the pipeline through the docker image. Scripts get_homologues and compare_clusters worked fine, but I am getting the following error with get_phylomarkers:

 >>>>>>>>>>>>>>> parallel IQ-TREE runs to estimate gene trees <<<<<<<<<<<<<<<

 >>> wrote file IQT_DNA_gene_tree_Tmedium_stats.tsv ...
 >>> wrote file iqtree_output_files.tgz ...
[02:15:46] # running compute_MJRC_tree treefile I ...
 >>> Wrote file IQT_MJRC_tree.nwk ...
[02:15:47] # counting branches on 619 gene trees ...
 >>> wrote file no_tree_branches.list ...

 >>>>>>>>>>>>>>> filter gene trees for outliers with kdetrees test <<<<<<<<<<<<<<<

 >>> wrote file all_gene_trees.tre ...
[02:15:48] # running kde test ...

 ERROR: the expected outfile kde_dfr_file_all_gene_trees.tre.tab was not produced
            This may be because R package(s) kdetrees,ape are not properly installed.
            Please run the script './install_R_deps.R' from within /get_phylomarkers to install them.
            Further installation tips can be found in file Rscript install_R_deps.R

>
>        print("Your R installation currently searches for packages in :")
[1] "Your R installation currently searches for packages in :"
>        print(.libPaths())
[1] "/get_phylomarkers/lib/R"       "/usr/local/lib/R/site-library"
[3] "/usr/lib/R/site-library"       "/usr/lib/R/library"
>

I wonder if you could help me with this?

I look forward to hearing from you.

Sincerely,

Nemanja

error with get_phylomarkers

singularity exec /usr/local/singularity-images/get_phylomarkers-2.2.8.1.simg run_get_phylomarkers_pipeline.sh -R 1 -t DNA

Executed the above command. Got the following error:

awk: cannot open sorted_aggregated_support_values4loci.tab (No such file or directory)

ERROR! The expected output file sorted_aggregated_support_values4loci_ge70perc.tab was not produced, will exit now

Remove different number of sequences from output

Hello, I am getting the same issue as #2. My question is, is there a way to identify which fasta files generate the error given their number, so I can manually remove them instead of having to run the pipe again? Or can I rerun the pipe using -e without having to perform the blasts again? I have limited computing power, so I do not want to wait another week.

Just to clarify, I'm getting this error
`>>> ERROR: Input FASTA files do not contain the same number of sequences...

107
126
128
132
143
145
146
153
154
155
156
157
158
165
234
78
79
80
81
83
85
87
88
91
92
95
`
But my fastas have names from 97420 to 100927

Thank you so much
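
One way to find the offending files without rerunning anything is to count the sequences in each cluster file and print only those whose count differs from the number of input genomes. The following is a minimal sketch under that assumption; the function name `clusters_with_wrong_size` is hypothetical and the check is independent of the pipeline's own code.

```shell
# Sketch: report FASTA files whose sequence count differs from the expected one.
# Usage: clusters_with_wrong_size EXPECTED_COUNT file1 [file2 ...]
clusters_with_wrong_size() {
  local expected=$1 f n
  shift
  for f in "$@"; do
    n=$(grep -c '^>' "$f")                    # number of FASTA records in the file
    [ "$n" -eq "$expected" ] || printf '%s\t%s\n' "$n" "$f"
  done
}
```

For a run with 15 genomes, `clusters_with_wrong_size 15 *.fna` would print one tab-separated line (count, file name) per cluster that should be removed or re-examined.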

ERROR! The expected output file sorted_aggregated_support_values4loci_ge70perc.tab was not produced, will exit now

Dear get_phylomarkers team,

These are my running codes:
get_homologues.pl -d gbk -t 12 -e -n 10
get_homologues.pl -d gbk -G -t 0 -o COG
get_homologues.pl -d gbk -M -t 0 -o OMCL

compare_clusters.pl -d ./COG/gbk_homologues/PararhizobiummangroviBGMRC6574T_f0_0taxa_algCOG_e0_,./OMCL2/gbk_homologues/PararhizobiummangroviBGMRC6574T_f0_0taxa_algOMCL_e0_,./gbk_homologues/PararhizobiummangroviBGMRC6574T_f0_12taxa_algBDBH_e1_ -o core -t 30
compare_clusters.pl -d ./COG/gbk_homologues/PararhizobiummangroviBGMRC6574T_f0_0taxa_algCOG_e0_,./OMCL2/gbk_homologues/PararhizobiummangroviBGMRC6574T_f0_0taxa_algOMCL_e0_,./gbk_homologues/PararhizobiummangroviBGMRC6574T_f0_12taxa_algBDBH_e1_ -o core -t 30 -n

and I get core, then
cd core
run_get_phylomarkers_pipeline.sh -R 1 -t DNA

finally I get error:
awk: fatal: cannot open file `sorted_aggregated_support_values4loci.tab' for reading (No such file or directory)

ERROR! The expected output file sorted_aggregated_support_values4loci_ge70perc.tab was not produced, will exit now!

I don't know why. Is there a problem with my input files?

the attachment is my log.

I look forward to hearing from you.
get_phylomarkers_run_AIR1tDNA_k1.5_m0.7_Thigh_27Sep23.log
run_get_phylomarkers_pipeline.log

Input f?aed files do not contain the same number of strains and a single instance for each strain

Dear get_phylomarkers team,

Hello, I have a question that I would like to request your help with.

I attempted to run "run_get_phylomarkers_pipeline.sh -R 1 -t DNA", but the operation reported an error.
I have the following data:
ll *faa
Rhizobium_rhizophilum_7209-2.faa
ll *fna
Rhizobium_rhizophilum_7209-2.fna

but I get an error message (log files attached). I saw a previous issue that looks very similar to what I am reporting now, but it actually seems to be a different problem.

I look forward to hearing from you.

Sincerely,

run.log
get_phylomarkers_run_AIR1tDNA_k1.5_m0.7_Thigh_21Sep23.log

Number of Sequences Error

Hi there,

I am trying to run GET_PHYLOMARKERS on my dataset. I successfully added cds and peptide files 1.faa, 1.fna, 2.faa, 2.fna, ... etc (number denotes strain of same species) to my directory inside of the docker image, and when running the pipeline, I came across this error:

ERROR: Input FASTA files do not contain the same number of sequences...
4584
4671
4707
4924

I took a look at the source code here:

When I grepped for number of sequence for my coding transcript and peptide file pairs:
4584 1.fna
4584 1.faa
4707 2.fna
4707 2.faa
4924 3.fna
4924 3.faa
4671 4.faa
4671 4.fna

(# sequences in nucleotide coding sequences = # peptide sequences)

And so it seems that all fasta files need to have the same number of sequences. But if the number of CDS elements within each line is different (my goal is to perform orthologous clustering), why must this number be the same? Is there any way to run the pipeline without this limitation, or how should I run it under this condition?

Thank you for the help.

ERROR: Input f?aed files do not contain the same number of strains and a single instance for each strain

Good day GET_PHYLOMARKERS team

First, thanks a lot for this tool. It's really helpful for what I want to do in my thesis project! I'm working with the docker image you provided, which is really handy and easy to use :)

I'm interested in doing a phylogenomic analysis with chloroplast genome sequences. Because these genomes have ~20 genes with introns, I have been using get_homologues-est. Some genomes (approx. 4) are from the same species, but all of them are from the same genus.

Until yesterday, I have been testing get_homologues-est and get_phylomarkers with the data that I downloaded from NCBI. I downloaded the CDS and proteins sequences for each genome and changed their name to:

speciesA.cds.fna for the CDS sequences and
speciesA.cds.faa for protein sequences

, to have the twin files that get_homologues-est requires (although in the manual, it says that the protein file is optional?).

I did not change the headers of the sequences. I wanted to see if the program could work with them as they are (I read in the manual that the headers must have a particular format, so I was worried that the programs would not accept them). They look like this:

==> b73.cds.faa <==
>lcl|NC_001666.2_prot_NP_043003.1_1 [gene=rps12] [locus_tag=ZemaCp001] [db_xref=UniProtKB/Swiss-Prot:P12340] [protein=ribosomal protein S12] [exception=trans-splicing] [protein_id=NP_043003.1] [location=complement(join(92301..92329,92870..93101,69307..69420))] [gbkey=CDS]
MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKKPNSALRKVARVRLTSGFEITAYIPGI
GHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTLDAVAVKNRQQGRSKYGAKKPKK
>lcl|NC_001666.2_prot_NP_043004.1_2 [gene=psbA] [locus_tag=ZemaCp002] [db_xref=UniProtKB/Swiss-Prot:P48183] [protein=photosystem II protein D1] [protein_id=NP_043004.1] [location=complement(89..1150)] [gbkey=CDS]
MTAILERRESTSLWGRFCNWITSTENRLYIGWFGVLMIPTLLTATSVFIIAFIAAPPVDIDGIREPVSGS
LLYGNNIISGAIIPTSAAIGLHFYPIWEAASVDEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMR
PWIAVAYSAPVAAATAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSL
FSAMHGSLVTSSLIRETTENESANEGYKFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPV
VGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHERNAHNFPLDLAALEVPY
LNG

==> b73.cds.fna <==
>lcl|NC_001666.2_cds_NP_043003.1_1 [gene=rps12] [locus_tag=ZemaCp001] [db_xref=UniProtKB/Swiss-Prot:P12340] [protein=ribosomal protein S12] [exception=trans-splicing] [protein_id=NP_043003.1] [location=complement(join(92301..92329,92870..93101,69307..69420))] [gbkey=CDS]
ATGCCAACGGTTAAACAACTTATTAGAAATGCAAGACAGCCAATACGAAATGCTAGAAAATCGGCCGCGC
TTAAGGGATGTCCTCAGCGTCGAGGAACATGTGCTAGGGTGTATACTATCAACCCCAAAAAACCCAACTC
TGCCTTACGTAAAGTTGCCAGAGTACGATTAACCTCTGGATTTGAAATCACTGCTTATATACCTGGTATT
GGCCATAATTTACAAGAACATTCTGTAGTATTAGTAAGAGGAGGAAGGGTTAAGGATTTACCCGGTGTGA
GATATCGCATTATTCGAGGAACCCTAGATGCTGTCGCAGTAAAGAATCGTCAACAAGGGCGTTCTAAATA
TGGGGCCAAAAAGCCAAAAAAATAA
>lcl|NC_001666.2_cds_NP_043004.1_2 [gene=psbA] [locus_tag=ZemaCp002] [db_xref=UniProtKB/Swiss-Prot:P48183] [protein=photosystem II protein D1] [protein_id=NP_043004.1] [location=complement(89..1150)] [gbkey=CDS]
ATGACTGCAATTTTAGAGAGACGCGAAAGTACAAGCCTGTGGGGTCGCTTCTGCAACTGGATAACTAGCA
CCGAAAACCGTCTTTACATTGGATGGTTCGGTGTTTTGATGATCCCTACCTTATTGACCGCAACTTCCGT

But both pipelines (get_homologues-est and get_phylomarkers) worked great! I was getting my core genes and my top phylomarkers. The problem started yesterday, when I added the twin files (*.fna and *.faa) of 2 genomes I assembled and annotated using cpGavas2. I was now working with 15 genomes.

The files look like this inside:

==> zmex.cds.fna <==
>rps12_join{[69326:69439](-),[92925:93156](-),[92356:92384](-)}
ATGCCAACGGTTAAACAACTTATTAGAAATGCAAGACAGCCAATACGAAATGCTAGAAAATCGGCCGCGCTTAAGGGATGTCCTCAGCGTCGAGGAACATGTGCTAGGGTGTATACTATCAACCCCAAAAAACCCAACTCTGCCTTACGTAAAGTTGCCAGAGTACGATTAACCTCTGGATTTGAAATCACTGCTTATATACCTGGTATTGGCCATAATTTACAAGAACATTCTGTAGTATTAGTAAGAGGAGGAAGGGTTAAGGATTTACCCGGTGTGAGATATCGCATTATTCGAGGAACCCTAGATGCTGTCGCAGTAAAGAATCGTCAACAAGGGCGTTCTAAATATGGGGCCAAAAAGCCAAAAAAATAA
>psbA_[89:1150](-)
ATGACTGCAATTTTAGAGAGACGCGAAAGTACAAGCCTGTGGGGTCGCTTCTGCAACTGGATAACTAGCACCGAAAACCGTCTTTACATTGGATGGTTCGGTGTTTTGATGATCCCTACCTTATTGACCGCAACTTCCGTATTTATTATCGCCTTCATCGCTGCTCCTCCAGTAGATATTGATGGTATTCGTGAGCCTGTTTCTGGTTCTTTACTTTATGGAAACAATATTATCTCTGGTGCTATTATTCCTACTTCTGCGGCGATCGGATTGCATTTTTACCCAATTTGGGAAGCTGCATCTGTTGATGAATGGTTATACAATGGCGGTCCTTATGAGCTAATTGTTCTACACTTCTTACTTGGTGTAGCTTGTTATATGGGTCGTGAGTGGGAACTTAGTTTCCGTCTGGGTATGCGCCCTTGGATTGCTGTTGCATATTCAGCTCCTGTTGCAGCTGCTACTGCTGTTTTCTTGATTTACCCTATTGGTCAAGGAAGTTTCTCTGATGGTATGCCTTTAGGAATATCTGGTACTTTCAACTTTATGATTGTATTCCAGGCAGAGCACAACATCCTTATGCATCCATTTCACATGTTAGGTGTAGCTGGTGTATTCGGCGGTTCCCTATTCAGTGCTATGCATGGTTCCTTGGTAACCTCTAGTTTGATCAGGGAAACCACTGAAAATGAGTCTGCTAATGAGGGTTACAAATTTGGTCAGGAAGAAGAGACTTATAATATTGTGGCTGCTCACGGTTATTTTGGTCGATTAATCTTCCAATATGCTAGTTTCAACAATTCTCGTTCTTTACACTTCTTCTTGGCTGCTTGGCCTGTAGTAGGGATCTGGTTCACTGCTTTAGGTATTAGTACTATGGCATTCAACCTAAATGGTTTCAATTTCAACCAATCTGTAGTTGATAGCCAAGGTCGCGTTATTAATACTTGGGCTGATATCATCAACCGTGCTAATCTTGGTATGGAGGTAATGCACGAACGTAATGCTCACAACTTCCCTCTAGACCTAGCTGCTCTTGAAGTTCCTTACCTTAATGGATAA
>matK_[1670:3211](-)
ATGGAAAAATTCGAAGGGTATTCAGAAAAACAGAAATCTCGTCAACACTACTTCGTCTACCCACTTCTCTTTCAGGAATATATTTATGCATTTGCTCATGATTATGGATTAAATGGTTCCGAACCCGTGGAAATTTTTGGTTGTAATAACAAGAAATTTAGTTCACTACTTGTGAAACGTTTAATTATTCGAATGTATCAGCAGAATTTTTTAATTAATTCGGTTAATTATCCTAACCAAGATCGATTGTTGGATCACCGTAATTATTTTTATTCTGAGTTTTATTCTCAGATTCTATCTGAGGGGTTTGCGATCGTTGTAGAAATCCCACTCTCGCTAGGGCAACTATCTTGTCCGGAAGAAAAAGAAATACCAAATTTTCAAAATTTACAATCTATTCATTCAATATTTCCCTTTTTAGAAGACAAATTTTTGCATTTACATTATCTATCACATATAAAAATACCCTATCCTATCCATTTAGAAATCCTGGTTCAACTCCTTGAATACCGGATTCAAGATGTTCCCTCTTTGCATTTATTGCGATTCTTTCTCCACTATTATTCGAATTGGAATAGTTTTATTACTTCAATGAAATCCATTTTTCTTTTGAAAAAAGAAAATAAAAGACTATTTCGATTCCTATATAACTCTTATGTATCAGAATATGAATTTTTCTTGTTGTTTCTTCATAAACAATCTTCTTGCTTACGATTAACATCTTCTGGAACCTTTCTGGAACGAATCATCTTTTCTGGGAAGATGGAACATTTTGGTGTAATGTACCCGGGGTTTTTTCGGAAAACCATATGGTTCTTTATGGATCCTCTTATGCATTATGTTCGATATCAAGGAAAGGCAATTCTTGCATCAAAAGGCACTCTTCTTTTGAAGAAGAAATGGAAATCTTACCTTGTCAATTTCTCGCAATATTTTTTCTCTTTTTGGACTCAACCACAAAGGATTCGTCTAAACCAATTAACAAACTCTTGCTTCGATTTTCTGGGGTACCTTTCAAGTGTACCAATAAATACTTTGTTAGTAAGGAATCAAATGCTGGAGAATTCTTTTCTAATAGATACTCGAATGAAAAAATTCGATACCACAGTCCTTGCAACTCCCCTTGTCGGATCCTTATCAAAAGCTCAATTTTGTACTGGATCGGGGCATCCTATTAGTAAACCCGTTTGGACTGATTTATCAGATTGGGATATTCTTGATCGTTTTGGTCGGATATGTAGAAATATTTTTCATTATCATAGTGGATCTTCGAAAAAACAGACTTTGTATCGACTAAAATATATACTTCGACTTTCATGTGCTAGAACTTTAGCTCGTAAACATAAAAGTACGGTACGAACTTTTATGCAACGATTGGGTTCGGTATTTTTAGAAGAATTTTTTACGGAAGAAGAGCAAGTTTTTTCTTTGATGTTCACCAAAACAATTCACTTTTCTTTCCATGGATCACACAGTGAGCGTATTTGGTATTTGGATATTATCCGTATCAATGACCTGGTGAATCCTCTTACTCTTAATTAA
>rps16_join{[5558:5597](-),[4484:4701](-)}
ATGGTAAAACTTCGTTTAAAACGATGTGGTAGAAAGCAACAAGCTATCTATCGAATCGTTGCAATTGATGTTCGATCTCGAAGAGAAGGAAGAGATCTTCGAAAAGTAGGTTTTTATGATCCGATAAAGAATCAAACTTGTTTAAATGTTCCAGCTATTCTCTATTTCCTTGAAAAGGGTGCTCAACCTACAAGAACAGTTTATGATATTTTAAGGAAGGCAGAATTCTTTAAAGATAAAGAAAGAACTTTGAGTTAA
>psbK_[7195:7380](+)
ATGCCTAATATACTTAGTTTAACCTGTATCTGTTTTAATTCTGTTCTTTGTCCGACTAGCTTTTTCTTCGCCAAATTACCCGAAGCTTATGCCATTTTCAACCCAATCGTGGATGTTATGCCTGTCATACCTGTACTCTTTTTTCTATTAGCGTTTGTTTGGCAAGCTGCTGTAAGTTTTCGATGA

==> zmex.cds.faa <==
>rps12_join{[69326:69439](-),[92925:93156](-),[92356:92384](-)}
MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTLDAVAVKNRQQGRSKYGAKKPKK
>psbA_[89:1150](-)
MTAILERRESTSLWGRFCNWITSTENRLYIGWFGVLMIPTLLTATSVFIIAFIAAPPVDIDGIREPVSGSLLYGNNIISGAIIPTSAAIGLHFYPIWEAASVDEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMRPWIAVAYSAPVAAATAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSLFSAMHGSLVTSSLIRETTENESANEGYKFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPVVGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHERNAHNFPLDLAALEVPYLNG
>matK_[1670:3211](-)
MEKFEGYSEKQKSRQHYFVYPLLFQEYIYAFAHDYGLNGSEPVEIFGCNNKKFSSLLVKRLIIRMYQQNFLINSVNYPNQDRLLDHRNYFYSEFYSQILSEGFAIVVEIPLSLGQLSCPEEKEIPNFQNLQSIHSIFPFLEDKFLHLHYLSHIKIPYPIHLEILVQLLEYRIQDVPSLHLLRFFLHYYSNWNSFITSMKSIFLLKKENKRLFRFLYNSYVSEYEFFLLFLHKQSSCLRLTSSGTFLERIIFSGKMEHFGVMYPGFFRKTIWFFMDPLMHYVRYQGKAILASKGTLLLKKKWKSYLVNFSQYFFSFWTQPQRIRLNQLTNSCFDFLGYLSSVPINTLLVRNQMLENSFLIDTRMKKFDTTVLATPLVGSLSKAQFCTGSGHPISKPVWTDLSDWDILDRFGRICRNIFHYHSGSSKKQTLYRLKYILRLSCARTLARKHKSTVRTFMQRLGSVFLEEFFTEEEQVFSLMFTKTIHFSFHGSHSERIWYLDIIRINDLVNPLTLN
>rps16_join{[5558:5597](-),[4484:4701](-)}
MVKLRLKRCGRKQQAIYRIVAIDVRSRREGRDLRKVGFYDPIKNQTCLNVPAILYFLEKGAQPTRTVYDILRKAEFFKDKERTLS
>psbK_[7195:7380](+)
MPNILSLTCICFNSVLCPTSFFFAKLPEAYAIFNPIVDVMPVIPVLFFLLAFVWQAAVSFR

Both have the same number of sequences: 82 in fna and faa in one genome, and 81 in fna and faa in the other.

Everything worked fine in get_homologues-est, which I ran with the parameters:

$ get_homologues-est.pl -d test5zea/ -r b73.cds.fna -e -z
$ get_homologues-est.pl -d test5zea/ -r b73.cds.fna -M -t 0

And compare_clusters.pl:

$ compare_clusters.pl -d ./b73_alltaxa_algBDBH_e1_,\
> ./b73_0taxa_algOMCL_e0_ -o core_BCM -t 15 -n

$ compare_clusters.pl -d ./b73_alltaxa_algBDBH_e1_,\
> ./b73_0taxa_algOMCL_e0_ -o core_BCM -t 15

I got 63 clusters (core_BCM directory).

$ find . -name '*.fna' | wc -l 
63
$ find . -name '*.faa' | wc -l 
63
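
Equal totals from `find | wc -l` do not show that every cluster has both files. The sketch below lists cluster files that lack their twin, assuming that each cluster is expected to come as a .fna/.faa pair with the same base name; the function name `unpaired_clusters` is hypothetical.

```shell
# Sketch: list .fna/.faa files in a directory that have no twin file.
# Assumption: twins share the same base name and differ only in extension.
unpaired_clusters() {
  local dir=${1:-.} f twin
  for f in "$dir"/*.fna "$dir"/*.faa; do
    [ -e "$f" ] || continue                   # skip glob patterns that matched nothing
    case $f in
      *.fna) twin=${f%.fna}.faa ;;
      *.faa) twin=${f%.faa}.fna ;;
    esac
    [ -e "$twin" ] || echo "$f has no twin"
  done
}
```

Called as `unpaired_clusters .` from the core_BCM directory, it prints nothing when all clusters are paired.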

Then I started the get_phylomarkers pipeline with these parameters:

$ run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'TIM,TVM,GTR'

And here it showed me this message (I think the program has a problem with the data of the 2 genomes I just added ...):

>>>>>>>>>>>>>>> running input data sanity checks <<<<<<<<<<<<<<< 

[07:52:29] # processing source fastas in directory get_phylomarkers_run_AIR1tDNA_k1.5_m0.7_Tmedium_27Apr21 ...
[07:52:30] # Performing strain composition check on f?aed files ...
2
      1 >KF241980.1_cds_AGV02637.1_2_[A188.cds.fna]_
      1 >KF241980.1_cds_AGV02638.1_3_[A188.cds.fna]_
      1 >KF241980.1_cds_AGV02640.1_5_[A188.cds.fna]_
.
.
.
     63 >[inia.corrected.cds.fna]_
     63 >[zmex.corrected.cds.fna]_
2
      1 >KF241980.1_cds_AGV02637.1_2_[A188.cds.fna]_
      1 >KF241980.1_cds_AGV02638.1_3_[A188.cds.fna]_
      1 >KF241980.1_cds_AGV02640.1_5_[A188.cds.fna]_
      1 >KF241980.1_cds_AGV02641.1_6_[A188.cds.fna]_
.
.
.
.
.
      1 >NC_001666.2_cds_NP_043085.1_83_[b73.cds.fna]_
      1 >NC_001666.2_cds_NP_043086.1_84_[b73.cds.fna]_
      1 >NC_001666.2_cds_NP_043088.1_86_[b73.cds.fna]_
      1 >NC_001666.2_cds_NP_043089.1_87_[b73.cds.fna]_
      1 >NC_001666.2_cds_NP_043090.1_88_[b73.cds.fna]_
      1 >NC_001666.2_cds_NP_043091.1_89_[b73.cds.fna]_
      1 >NC_001666.2_cds_NP_043092.1_90_[b73.cds.fna]_
      1 >NC_001666.2_cds_NP_043093.1_91_[b73.cds.fna]_
     63 >[inia.corrected.cds.fna]_
     63 >[zmex.corrected.cds.fna]_
2
.
.
.
.
.
     63 >[inia.corrected.cds.fna]_
     63 >[zmex.corrected.cds.fna]_

 >>> ERROR: Input f?aed files do not contain the same number of strains and a single instance for each strain...
         Please check input FASTA files as follows: 
	 1. Revise the output above to make sure that all genomes have a strain assignation and the same number of associated sequences. 
	      If not, add strain name manually to the corresponding fasta files or exclude them.
	 2. Make sure that only one genome/gbk file is provided for each strain.     
	 3. You may need to run get_homologues.pl with -e or compare_clusters.pl with -t NUM_OF_INPUT_GENOMES to get clusters of equal sizes
	     Please check the GET_HOMOLOGUES manual


http://eead-csic-compbio.github.io/get_homologues/manual
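
The error message asks for "a single instance for each strain" in every cluster. A quick, pipeline-independent way to inspect one cluster file is to tally its sequences by genome label. The sketch assumes the label is the last `[...]` field of each header, as in the `[A188.cds.fna]` tags printed above; the function name `strain_tally` is hypothetical.

```shell
# Sketch: count sequences per bracketed genome label in one FASTA file.
# Assumption: the genome label is the last [label] field in each header.
# Headers without any bracketed label are not counted.
strain_tally() {
  sed -n 's/^>.*[[]\([^]]*\)].*$/\1/p' "$1" | sort | uniq -c | sed 's/^ *//'
}
```

Each output line has the form `count label`; any count other than 1, or a missing genome, points to the cluster (and genome) that breaks the check.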

The two genomes I recently added each have the same number of sequences in their twin files: 82 in fna and faa for one genome, and 81 in fna and faa for the other.

I'm sure the error message is very helpful, but I'm not sure how to check what is wrong ...

Please, any help will be very much appreciated.

Kind regards

error while running estimate_pangenome_phylogenies.sh: declare: -A: invalid option

Hello,
I am running into an error using the script: estimate_pangenome_phylogenies.sh on a mac. The complete error looks like:

Path/to/estimate_pangenome_phylogenies.sh: line 67: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
input_fasta: pangenome_matrix_t0.fasta
discrete_model: BIN
branch support type: UFBoot
 
 >>>>>>>>>>>>>>> ModelFinder + IQ-TREE run on pan-genome matrix <<<<<<<<<<<<<<< 
 
[01:31:05]  >>> running iqtree -s pangenome_matrix_t0.fastaed -st BIN to find best model
grep: pangenome_matrix_t0.fastaed.log: No such file or directory
 >>> Best-fit model:  ...
 ... created and moved into dir iqtree_10_runs
[01:31:05] # Will sequentially launch 10 IQ-TREE searches on the supermatrix with best model  -UFBoot!.
		   This will take a while ...
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run1 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run2 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run3 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run4 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run5 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run6 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run7 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run8 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run9 &> /dev/null
[01:31:05] > iqtree -s pangenome_matrix_t0.fastaed -st BIN -m  -bb 1000 -nt AUTO -pre UFBoot_run10 &> /dev/null
grep: ./*log: No such file or directory


 >>> ERROR! The expected output file sorted_lnL_scores_IQ-TREE_searches.out was not produced, will exit now!

The error shows up even when just asking for the help:

Path/to/estimate_pangenome_phylogenies.sh

Thanks!
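The usage line in the log above, `declare [-afFirtx]`, lists no `-A` option, which suggests the script is being run by a bash older than 4.0: associative arrays (`declare -A`) were added in bash 4.0, while the /bin/bash shipped with macOS is version 3.2. A minimal sketch of a version check follows; the function name `supports_assoc_arrays` is hypothetical and not part of GET_PHYLOMARKERS.

```shell
# Hypothetical helper (not part of GET_PHYLOMARKERS):
# succeed if the given bash version string has a major version >= 4,
# i.e. if that bash supports associative arrays (declare -A).
supports_assoc_arrays() {
    major=${1%%.*}                 # keep only the part before the first dot
    [ "${major:-0}" -ge 4 ]
}

# Report on the shell running this snippet (BASH_VERSION is unset outside bash).
if supports_assoc_arrays "${BASH_VERSION:-0}"; then
    echo "this shell supports declare -A"
else
    echo "this shell does not support declare -A (bash >= 4.0 is needed)"
fi
```

If the check fails, running the script with a newer bash installed alongside the system one is a common workaround. Whether the script then picks up that bash depends on how it is invoked, so this is a suggestion to test rather than a confirmed fix.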

pipeline problems with alignments

Hi, I never had this problem before, but the script doesn't generate the aln files, as shown below:

I have no idea why; I don't know whether it is a BioPerl module or an R module issue.

Can you help with that?
Thanks in advance.

pan-genome phylogeny under the parsimony criterion with GET_PHYLOMARKERS

Dear GET_PHYLOMARKERS team,

I am using Docker image "10032020-10Jan20" on Ubuntu 18.04 LTS.
I successfully generated an ML tree using the PGM as input, but I get the following error with parsimony inference. Please see the error below.

you@13680c26d4a9:~/get_homPhy/v2_homologues/pan_CM_C90_COG_OMCL$ estimate_pangenome_phylogenies.sh -c PARS -R 3 -i pangenome_matrix_t0.phylip -n 28 -b 25 -j 10 -t 1

### estimate_pangenome_phylogenies.sh v.1.1_10Jan20 run on 2020_05_03-10.12.28 with the following parameters:
# topdir = /home/you/get_homPhy/v2_homologues/pan_CM_C90_COG_OMCL | criterion:PARS
# runmode = 3 | input_phylip = pangenome_matrix_t0.phylip | input_phylo =
# n_cores = 28 | bootstrap no. = 25 | n_jumbles_pars = 10 | t_jumbles_boot_pars = 1
# DEBUG=0

found dir boot_pars
should dir boot_pars be removed y|n? y
 ... working in dir full_pars
# running pars < pars.params &> /dev/null in run1
# running pars < pars.params &> /dev/null in run2
# running pars < pars.params &> /dev/null in run3
# running pars < pars.params &> /dev/null in run4
# running pars < pars.params &> /dev/null in run5
# running pars < pars.params &> /dev/null in run6
# running pars < pars.params &> /dev/null in run7
# running pars < pars.params &> /dev/null in run8
# running pars < pars.params &> /dev/null in run9
# running pars < pars.params &> /dev/null in run10
# running pars < pars.params &> /dev/null in run11
# running pars < pars.params &> /dev/null in run12
# running pars < pars.params &> /dev/null in run13
# running pars < pars.params &> /dev/null in run14
# running pars < pars.params &> /dev/null in run15
# running pars < pars.params &> /dev/null in run16
# running pars < pars.params &> /dev/null in run17
# running pars < pars.params &> /dev/null in run18
# running pars < pars.params &> /dev/null in run19
# running pars < pars.params &> /dev/null in run20
# running pars < pars.params &> /dev/null in run21
# running pars < pars.params &> /dev/null in run22
# running pars < pars.params &> /dev/null in run23
# running pars < pars.params &> /dev/null in run24
# running pars < pars.params &> /dev/null in run25
# running pars < pars.params &> /dev/null in run26
# running pars < pars.params &> /dev/null in run27
# running pars < pars.params &> /dev/null in run28
# waiting for pars jobs to finish ...
# working in /home/you/get_homPhy/v2_homologues/pan_CM_C90_COG_OMCL/boot_pars ...
/get_phylomarkers/bin/linux/.libs/lt-nw_reroot: error while loading shared libraries: libnw.so.0: cannot open shared object file: No such file or directory
/get_phylomarkers/bin/linux/.libs/lt-nw_reroot: error while loading shared libraries: libnw.so.0: cannot open shared object file: No such file or directory
/get_phylomarkers/bin/linux/.libs/lt-nw_support: error while loading shared libraries: libnw.so.0: cannot open shared object file: No such file or directory
awk: program limit exceeded: maximum number of fields size=32767
        FILENAME="pangenome_matrix_t0.tab" FNR=1 NR=1
the seq_key => seq_ID correspondences for aln pang_ID-Strain_corresp.tsv are:
0000000000 =>
0000000001 =>
0000000002 =>
0000000003 =>
0000000004 =>
0000000005 =>
0000000006 =>
0000000007 =>
0000000008 =>
0000000009 =>
0000000010 =>
0000000011 =>
0000000012 =>
0000000013 =>
0000000014 =>
0000000015 =>
0000000016 =>
0000000017 =>
0000000018 =>
0000000019 =>
0000000020 =>
0000000021 =>
0000000022 =>
0000000023 =>
0000000024 =>
0000000025 =>
0000000026 =>
0000000027 =>
0000000028 =>
0000000029 =>
0000000030 =>
0000000031 =>
0000000032 =>
0000000033 =>
0000000034 =>
0000000035 =>
0000000036 =>
0000000037 =>
0000000038 =>
0000000039 =>
0000000040 =>
0000000041 =>
0000000042 =>
0000000043 =>
0000000044 =>
0000000045 =>
0000000046 =>
0000000047 =>
0000000048 =>
0000000049 =>
0000000050 =>
0000000051 =>
0000000052 =>
0000000053 =>
0000000054 =>
0000000055 =>
0000000056 =>
0000000057 =>
0000000058 =>
0000000059 =>
0000000060 =>
0000000061 =>
0000000062 =>
0000000063 =>
0000000064 =>
0000000065 =>
0000000066 =>
0000000067 =>
0000000068 =>



 >>> ERROR! The expected output file full_pars_tree_rooted_withBoot.ph was not produced, will exit now!

I wonder if you could help me with this.

Cheers,

Nemanja
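The awk message in the log above reports that the first line of pangenome_matrix_t0.tab has more fields than that awk build accepts (32767), which would explain the empty seq_key => seq_ID correspondences that follow. As a rough sketch for confirming this, the column count of the header line can be obtained without awk. The function name `count_tab_columns` is hypothetical, and the sketch assumes the file is tab-separated.

```shell
# Hypothetical helper (not part of GET_PHYLOMARKERS):
# print the number of tab-separated columns in the first line of a file,
# without using awk (so an awk field limit cannot interfere).
count_tab_columns() {
    head -n 1 "$1" | tr '\t' '\n' | wc -l | tr -d ' '
}

# Example call; 32767 is the limit quoted in the awk error message:
#   n=$(count_tab_columns pangenome_matrix_t0.tab)
#   [ "$n" -gt 32767 ] && echo "header has $n fields, above the awk limit"
```

If the count is above the limit, an awk implementation without this field cap (GNU awk, for example) is worth trying. The libnw.so.0 loading errors in the same log are a separate problem that this check does not address.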
