eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

License: Other

Perl 59.23% Shell 0.53% Python 0.28% Makefile 0.13% CSS 0.69% JavaScript 0.99% HTML 38.08% R 0.07%

pangenome bacteria annotation plants transcriptome clustering fasta genbank pangene

get_homologues's Introduction

This software is maintained by Bruno Contreras-Moreira (bcontreras at eead.csic.es) and Pablo Vinuesa (vinuesa at ccg.unam.mx). The original version, suitable for bacterial genomes, was described in:

Contreras-Moreira B, Vinuesa P (2013) Appl. Environ. Microbiol. 79:7696-7701

Vinuesa P, Contreras-Moreira B (2015) Methods in Molecular Biology Volume 1231, 203-232

The software was subsequently adapted to the study of intra-specific eukaryotic pan-genomes resulting in script GET_HOMOLOGUES-EST, described in:

Contreras-Moreira B, Cantalapiedra CP et al (2017) Front. Plant Sci. 10.3389/fpls.2017.00184

Contreras-Moreira B, Rodriguez del Rio A et al (2022) Methods in Molecular Biology https://doi.org/10.1007/978-1-0716-2429-6_9

GET_HOMOLOGUES-EST was benchmarked with genomes and transcriptomes of Arabidopsis thaliana and Hordeum vulgare, available at http://floresta.eead.csic.es/plant-pan-genomes, and used to analyze the pan-genomes of Brachypodium distachyon and Brachypodium hybridum (press release).

Two tutorials are available:

Pangenome analysis of plant transcripts and coding sequences, published in 2022.
From genomes to pangenomes: understanding variation among individuals and species, which includes step by step instructions for both bacterial and plant data, first released in 2017.

Installation instructions, including the bioconda package, are available in the manual and the README.txt file.

version	HTML
original, for the analysis of bacterial pan-genomes	manual
EST, for the analysis of intra-species eukaryotic pan-genomes, tested on plants	manual-est

A Docker image is also available with GET_HOMOLOGUES bundled with GET_PHYLOMARKERS, ready to use. The GET_PHYLOMARKERS manual explains how to use nucleotide & peptide clusters produced by GET_HOMOLOGUES to compute robust multi-gene and pangenome phylogenies.

The code is regularly patched (see CHANGES.txt in each release) and has been used in a variety of studies (see citing papers here and here, respectively).

We kindly ask you to report errors or bugs to the authors and to acknowledge the use of the software in scientific publications.

GET_HOMOLOGUES is part of the INB/ELIXIR-ES resources portfolio:

Funding: Fundacion ARAID, Consejo Superior de Investigaciones Cientificas, DGAPA-PAPIIT UNAM, CONACyT, FEDER, MINECO, DGA-Obra Social La Caixa.

get_homologues's Issues

hcluster_matrix.sh error

Hi there,

I got an error while running hcluster_matrix.sh with Avg_identity.tab as input.
The error is given below:

Error in hclust(my_dist, method = "ward.D2") :
NA/NaN/Inf in foreign function call (arg 11)
Calls: as.phylo -> identical -> hclust
Execution halted

ERROR: File hclust_gower-ward.D2_FperidonATCC33693PP4_f0_0taxa_algOMCL_e0_Avg_identity_tree.svg was were NOT generated!
File gower_dist_matrix.tab was generated
ERROR: File hclust_gower-ward.D2_FperidonATCC33693PP4_f0_0taxa_algOMCL_e0_Avg_identity_heatmap.svg was were NOT generated!
ERROR: File hclust_gower-ward.D2_FperidonATCC33693PP4_f0_0taxa_algOMCL_e0_Avg_identity_tree.ph was were NOT generated!

Please suggest any modifications
Thanks
Aditya
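
R's hclust typically raises "NA/NaN/Inf in foreign function call" when the distance matrix it receives contains missing or non-finite values, so a reasonable first check is whether the input matrix has such cells. The following is a minimal Python sketch, not part of GET_HOMOLOGUES. It assumes the matrix is tab-separated, with a header row of genome names and a genome name in the first column of each row; that layout and the toy values are assumptions.

```python
import math

def find_missing_cells(matrix_text):
    """Return (row_label, column_label, raw_value) for every cell that is not a finite number."""
    lines = [ln for ln in matrix_text.splitlines() if ln.strip()]
    header = lines[0].split("\t")[1:]  # skip the corner cell
    bad = []
    for line in lines[1:]:
        fields = line.split("\t")
        row_label, values = fields[0], fields[1:]
        for col_label, raw in zip(header, values):
            try:
                ok = math.isfinite(float(raw))
            except ValueError:  # e.g. "NA" or an empty cell
                ok = False
            if not ok:
                bad.append((row_label, col_label, raw))
    return bad

# Toy matrix (hypothetical values) with two missing cells.
example = "genomes\tA\tB\tC\nA\t100\t95.1\tNA\nB\t95.1\t100\t88.0\nC\tNA\t88.0\t100\n"
print(find_missing_cells(example))
```

If cells are reported, the genome pairs involved are the ones to inspect before clustering.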

# parsing blast result! Illegal division by zero at /lib/marfil_homology.pm line 1041, <BLASTOUT> line 2.

I am trying to make a pangenome matrix. The command "./get_homologues.pl -d genomas_rev19072017 -t 0 -M" gives the following log:

# ./get_homologues.pl -i 0 -d genomas_rev19072017 -o 0 -e 0 -f 0 -r 0 -t 0 -c 0 -z 0 -I 0 -m local -n 2 -M 1 -G 0 -P 0 -C 75 -S 1 -E 1e-05 -F 1.5 -N 0 -B 50 -b 0 -s 0 -D 0 -g 0 -a '0' -x 0 -R 0 -A 0

# version 07112016
# results_directory=/opt/get_homologues-x86_64-20161107/genomas_rev19072017_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1

# checking input files...
# P_HW567.gbk 5363
# P_graminis_DSM_15220.gbk 5584
# P_jilunlii.gbk 5766
# P_polymyxa_ATCC_842.gbk 5068
# P_riograndensis_CAR114.gbk 6098
# P_riograndensis_CAS34.gbk 5969
# P_riograndensis_LN831776.gbk 6705
# P_sonchi_X19-5.gbk 7117
# Paenibacillus_borealis.gbk 6698
# Paenibacillus_durus.gbk 5140
# Paenibacillus_durus_ATCC_35681.gbk 4992
# Paenibacillus_forsythiae_T98.gbk 4193
# Paenibacillus_odorifer.gbk 5752
# Paenibacillus_sabinae_T27.gbk 4634
# Paenibacillus_stellifer.gbk 4899
# Paenibacillus_wynnii.gbk 5282
# Paenibacillus_zanthoxyli_JH29.gbk 4372

# 17 genomes, 93632 sequences

# taxa considered = 17 sequences = 93632 residues = 29545143 MIN_BITSCORE_SIM = 19.9

# mask=PaenibacillusforsythiaeT98_f0_0taxa_algOMCL_e0_ (_algOMCL)

# skipped genome parsing (genomas_rev19072017_homologues/tmp/selected.genomes)

# running BLAST searches ...
# done

# parsing blast result! (/opt/get_homologues-x86_64-20161107/genomas_rev19072017_homologues/tmp/all.blast , 1.1e+03MB)
Illegal division by zero at /opt/get_homologues-x86_64-20161107/lib/marfil_homology.pm line 1041, <BLASTOUT> line 2.

Any idea what is going on?
The sample files (Buchnera) ran fine.
Version: 20161107

typo-like in get_homologues-est manual

Hello Pablo and Bruno,

I'm just starting to use your great program get_homologues and I've found a tiny typo.

In this manual,

in section 4.2 the code used to make the plot is:
./plot_matrix_heatmap.sh -i sample_[...]/Esterel_alltaxa_algOMCL_e0_Avg_identity.tab \
 -t "clusters=66" -k "Average Nucleotide Identity" -o pdf -m 28 -v 35 -H 9 -W 10

while I think it must read:

./plot_matrix_heatmap.sh -i [...]/[...]_alltaxa_algOMCL_e0_Avg_identity.tab \
 -t "clusters=66" -k "Average Nucleotide Identity" -o pdf -m 28 -v 35 -H 9 -W 10

That's all. Sorry if this is trivial, but it caused a small error and I think it can be easily fixed.

your git and docker file are really cool
Cheers

Clustering orthologous sequences using multiple threads?

Dear Get_Homologues developers,

I've been using the program a lot. Recently I noticed that the step of " # clustering orthologous sequences" proceeds serially... if this step could use multiple threads simultaneously, the speed bottleneck could be reduced substantially, I guess. Could it be implemented in a forthcoming version?

Best,

Jose

Effect of BLAST parameter -max_target_seqs on GET_HOMOLOGUES

Hi,
a recent paper reported that the way the -max_target_seqs parameter is commonly used in BLAST jobs can compromise the correctness of bioinformatics workflows. You can read more about that at https://www.biostars.org/p/340129. The story is that users of this option expected to get the best N hits out of the total list of hits produced by BLAST, sorted by bitscore. However, current standalone BLAST binaries in fact return the first N hits found that meet the E-value cutoff.
What is the effect of this on GET_HOMOLOGUES jobs? Given that coverage is the main parameter controlling our BLAST jobs and that -max_target_seqs is set to the number of sequences in the proteome/genome being searched, this parameter cannot have any significant effect on the alignments reported, as in fact we set a large number of hits to ensure that all hits, even poor ones, are recovered.

Auxiliary script transcripts2CDS.pl does use a small value (5) for this parameter, but again I expect it to have little effect on results, as these BLASTX searches don't necessarily require the best hit as long as the correct Open Reading Frame is found.

Anyway, please report any problems you might observe, Bruno
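
To make the argument above concrete, here is a toy Python model of the two behaviours: "first N hits that pass the E-value cutoff" versus "best N hits by bitscore". It is only an illustration and does not reproduce BLAST's actual search algorithm; the hit records and the function names first_n and best_n are hypothetical.

```python
def first_n(hits, n, max_evalue):
    """Keep the first n hits, in search order, that pass the E-value cutoff."""
    passing = [h for h in hits if h["evalue"] <= max_evalue]
    return passing[:n]

def best_n(hits, n, max_evalue):
    """Keep the n passing hits with the highest bitscore."""
    passing = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(passing, key=lambda h: h["bitscore"], reverse=True)[:n]

# Hypothetical hits, listed in the order a search happened to find them.
hits = [
    {"id": "s1", "bitscore": 50, "evalue": 1e-6},
    {"id": "s2", "bitscore": 200, "evalue": 1e-50},
    {"id": "s3", "bitscore": 120, "evalue": 1e-30},
]

for n in (1, 3):
    print(n, [h["id"] for h in first_n(hits, n, 1e-5)], [h["id"] for h in best_n(hits, n, 1e-5)])
```

With a small n the two selections can differ, but once n is at least the number of passing hits both return the same set, which is the situation described for the main GET_HOMOLOGUES BLAST jobs.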

re-run pipeline for compositional analyses using subset of genomes does not use previous data

Dear Bruno,
I ran the pipeline once, including an outgroup among the genomes. I did that for phylogenomic reasons.
Now I would like to perform a compositional analysis based only on the genomes of interest, that is, excluding the outgroup.
I am re-running get_homologues with the -I flag (the text file lists the names of the files (e.g. AB.gbk) I want to include, one per line). However, get_homologues starts a new computation (e.g. making the diamond DB and running diamond).
Is there a possibility of using the previous results for such a task?
Thanks a lot for your time and help!
Best
Stefano
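
For reference, the include list described above (one input file name per line) can be generated rather than typed by hand. This is a small Python sketch under that assumption only; the file names, including outgroup.gbk, are hypothetical, and it makes no claim about whether previous results are reused.

```python
def build_include_list(filenames, excluded):
    """Return the text of an include list: one kept file name per line, sorted."""
    excluded = set(excluded)
    kept = sorted(name for name in filenames if name not in excluded)
    return "\n".join(kept) + "\n"

# Hypothetical input directory listing with one outgroup to leave out.
all_files = ["B.gbk", "A.gbk", "outgroup.gbk"]
print(build_include_list(all_files, ["outgroup.gbk"]), end="")
```

The returned text can be written to a file and passed with -I.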

Please add some explanation about the outputs to the manual

Hello,
I decided to give GET_HOMOLOGUES-EST a try, so I ran the sample data as explained in the manual.
The output directory created contains the following files (among others):

pan_genome_algOMCL.tab
core_genome_algOMCL.tab
soft-core_genome_algOMCL.tab
Esterel_alltaxa_algOMCL_e0_.cluster_list
Esterel_alltaxa_algOMCL_e0_Avg_identity.tab

as well as the directory Esterel_alltaxa_algOMCL_e0_, containing more files.
However, I am failing to understand the contents/format of these output files, and I couldn't find an explanation anywhere in the documentation. It would be very useful to me (and probably other users) if you could add this to the manual, unless it is already documented somewhere (in which case please direct me to the relevant document).
For example:

$ cat sample_transcripts_fasta_est_homologues/pan_genome_algOMCL.tab
g1      g2      g3      g4
4665    8938    13004   27014
4665    9002    24277   26201
4665    21283   23347   25641
4665    8938    24058   26352
20013   22562   24637   26478
20013   22169   24583   26507
4665    21283   23697   25621
4838    9098    12961   26971
20013   22562   24565   26489
4838    9098    24345   26186
4755    9100    24347   26188
4755    9100    12963   26973
4755    21348   23761   25602
4755    8956    24076   26370
4838    9114    24389   26313
4665    9002    12931   26941
4755    8956    13022   27032
20013   22169   24233   26527
20013   22245   24658   26499
4838    21639   23714   25555

What are g1,g2,g3,g4? And what do the numbers represent?
Thanks!
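
The table above is not explained in this thread. One reading that fits the numbers, which grow from left to right in every row, is that each row is one random sampling order of the genomes and column gN holds the cumulative number of clusters after N genomes have been added. That interpretation is an assumption to be checked against the manual. Under it, a short Python sketch can summarise the table by averaging each column:

```python
from statistics import mean

def column_means(tab_text):
    """Average each whitespace-separated column of a table with a header row."""
    lines = [ln.split() for ln in tab_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    columns = zip(*[[int(v) for v in row] for row in rows])
    return {name: mean(col) for name, col in zip(header, columns)}

# First rows copied from the output shown above.
sample = """g1 g2 g3 g4
4665 8938 13004 27014
4665 9002 24277 26201
20013 22562 24637 26478
"""
for name, value in column_means(sample).items():
    print(name, round(value, 1))
```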

incorrect file path for $ENV{'BLAST_PATH'} when installing

When running perl install.pl I kept getting the following error:

### 1) checking required parts: 


## checking mcl-14-137 (lib/phyTools: $ENV{'EXE_MCL'})
>> OK
## checking COGsoft/COGlse (lib/phyTools: $ENV{'EXE_COGLSE'})
>> OK
## checking COGsoft/COGtriangles (lib/phyTools: $ENV{'EXE_COGTRI'})
>> OK
## checking COGsoft/COGmakehash (lib/phyTools: $ENV{'EXE_MAKEHASH'})
>> OK
## checking COGsoft/COGreadblast (lib/phyTools: $ENV{'EXE_READBLAST'})
>> OK
## Checking blast (lib/phyTools: $ENV{'EXE_BLASTP'})
<< Cannot run shipped blastp, please download it from ftp.ncbi.nih.gov/blast/executables/blast+/LATEST/ ,
<< install it and edit variable BLAST_PATH as explained in the manual
<< (inside set_phyTools_env in file lib/phyTools.pm) .
<< Then re-run

While I do have a newer version of BLAST+ (2.31) installed, the included version is sufficient but the $ENV{'EXE_BLASTP'} variable in lib/phyTools.pm is incorrect. As declared in lib/phyTools.pm line 70 the path is

{ $ENV{'BLAST_PATH'} = $ENV{'MARFIL'}.'bin/ncbi-blast-2.2.27+/bin/'; }

but the correct path is

{ $ENV{'BLAST_PATH'} = $ENV{'MARFIL'}.'bin/ncbi-blast-2.2.27+/bin/bin/'; }

The issue can be resolved either by creating symbolic links to the executables in bin/ncbi-blast-2.2.27+/bin/ or by modifying lib/phyTools.pm to the full path of the bundled BLAST+ executables.
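
Before editing lib/phyTools.pm it can help to confirm where the bundled blastp binary actually sits. The Python sketch below walks a directory tree and reports every directory that contains a file named blastp. The demo builds a throwaway tree that mimics the nested bin/bin layout reported above, so the layout is taken from this report and not verified against any release.

```python
import os
import tempfile

def find_executable_dirs(root, name="blastp"):
    """Return every directory under root that directly contains a file called name."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            hits.append(dirpath)
    return sorted(hits)

# Demo on a temporary tree that mimics the nested layout from the report.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "bin", "ncbi-blast-2.2.27+", "bin", "bin")
    os.makedirs(nested)
    open(os.path.join(nested, "blastp"), "w").close()
    relative = [os.path.relpath(p, root) for p in find_executable_dirs(root)]

print(relative)
```

Running find_executable_dirs on the real installation directory shows which path BLAST_PATH should point to.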

Eukaryotic version not finding orthologs

Hi,

I have small fungal genomes (<40 Mb) that I have analyzed with both the get_homologues and get_homologues-est versions.
However, the -est version is not finding orthologs, but the regular version is.
Not sure why this is. Is there anything wrong with using the original version on small fungal genomes, i.e. can I trust the results?

Thanks,
Morgan

[Question] Is it possible to exclude a taxon from get_homologues.pl when running compare_clusters.pl?

Hi there,

For the initial script get_homologues.pl I included a taxon that I would like to leave out of the downstream core, shell & pan transcriptome analysis. Re-running get_homologues.pl would take close to a week. Is it possible to exclude a taxon in compare_clusters.pl?

Here are the flags I am currently using for the scripts:
get_homologues.pl -M -t 0 -d .
then
compare_clusters.pl -o Genus-OMCL_intersection -m -d ._homologues/path-to-dir

Thanks,

Liza

Error in reading input genbank files

When trying to submit the GenBank files for analysis I am getting this error:
ERROR: could not extract sequences from file APCEc01_gbk/.gbk
I have checked the GenBank files and they are complete files with nucleotide sequences. Is there anything I am missing?
Could you please help me with this?
Thanks
Aditya

flag -P in parse_pangenome_matrix.pl

Hi,

I still have some concerns regarding the script parse_pangenome_matrix.pl.
The question that I am trying to answer is "which genes are present in any of the group A genomes (or in a percentage, such as 50%, of group A) but absent from all group B genomes (100%)?"

By default, flag -P is invoked and applies a single cutoff value to both groups at once, so in the end we get the genes present in all of A and absent from all of B. Relaxing the -P value will result in some genes shared between the groups.

My question is:
Is it possible to completely disable the -P option, so that I can rely on the -S parameter instead? Alternatively, it would be great if I could set a cutoff value for each group separately, like -P_gpA 50 & -P_gpB 100.
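
As an illustration of the per-group cutoffs requested here, the following Python sketch post-processes a presence/absence matrix with separate thresholds for each group. It is not a feature of parse_pangenome_matrix.pl. The function name select_clusters, its parameters and the in-memory matrix layout (cluster name mapped to per-genome counts) are all hypothetical.

```python
def select_clusters(matrix, group_a, group_b, min_frac_a=0.5, max_frac_b=0.0):
    """Return clusters present in at least min_frac_a of group A and at most max_frac_b of group B."""
    selected = []
    for cluster, counts in matrix.items():
        frac_a = sum(counts.get(g, 0) > 0 for g in group_a) / len(group_a)
        frac_b = sum(counts.get(g, 0) > 0 for g in group_b) / len(group_b)
        if frac_a >= min_frac_a and frac_b <= max_frac_b:
            selected.append(cluster)
    return selected

# Hypothetical matrix: cluster -> {genome: number of member genes}.
matrix = {
    "c1": {"A1": 1, "A2": 1, "B1": 0, "B2": 0},
    "c2": {"A1": 1, "A2": 0, "B1": 0, "B2": 0},
    "c3": {"A1": 1, "A2": 1, "B1": 1, "B2": 0},
    "c4": {"A1": 0, "A2": 0, "B1": 0, "B2": 0},
}
print(select_clusters(matrix, ["A1", "A2"], ["B1", "B2"]))
```

Setting min_frac_a=1.0 reproduces the stricter "present in all of A, absent from all of B" selection on the same data.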

stuck on WARNING: please remove/rename results directory

So I ran
./get_homologues.pl
with these parameters: -t 0 -M -n 16
and I always get this message:

192 genomes, 1030166 sequences

taxa considered = 192 sequences = 1030166 residues = 331534637 MIN_BITSCORE_SIM = 21.6

mask=GCF000598065_f0_0taxa_algOMCL_e0_ (_algOMCL)

skipped genome parsing (Genomas_Teste_2_homologues/tmp/selected.genomes)

skip BLAST searches and parsing

WARNING: please remove/rename results directory:

'/mnt/e/Downloads/get_homologues-x86_64-20170609/Genomas_Teste_2_homologues/'

if you change the sequences in your .gbk/.faa files or want to re-run

I haven't changed any sequence or any file in my source directory. Any help?

run time

Hi,

I would like to run get_homologues on a dataset of over 2000 microbial genomes. Is there any way I can achieve this in a reasonable amount of time?

I have a large computer cluster at my disposal.

Best,
/SB

error running COGs

I finished the get_homologues run with MCL. But when I run it with COGs, it returns the following error:

ERROR: find_COGs (/home/user/Tools/get_homologues-master//bin/COGsoft/COGtriangles/COGtriangles ) failed to terminate job.

I don't know why.

ERROR: find_COGs

Dear csic-compbio staff, I don't know why I'm getting this error. If I modify the coverage the script runs successfully, but -C 100 (with -S 100) gives the COGs error. The lines are the following:

perl get_homologues.pl -d cdifficile_all -n 16 -C 100 -S 100 -G

/home/ecastron/programs/get_homologues-x86_64-20160201/get_homologues.pl -i 0 -d cdifficile_all -o 0 -e 0 -f 0 -r 0 -t all -c 0 -z 0 -I 0 -m local -n 16 -M 0 -G 1 -P 0 -C 100 -S 100 -E 1e-05 -F 1.5 -N 0 -B 50 -b 0 -s 0 -D 0 -g 0 -a '0' -x 0 -R 0 -A 0

WARNING : cannot lock files in /lustre/groups/cbi/ecastron/Sandro/mayor/assembled/white_list_no_redecont/cdifficile_all_homologues ,

please ensure that no other instance of the program is running at this location

results_directory=/lustre/groups/cbi/ecastron/Sandro/mayor/assembled/white_list_no_redecont/cdifficile_all_homologues

parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1

checking input files...

97-16_S15.faa 3908

CGB1038-2_S2.faa 3688

CGB1038-3_S3.faa 3685

CGB1038-5_S5.faa 3849

CGB1038-6_S6.faa 3852

difficile630.faa 3823

difficileCIP107932.faa 3688

difficileQCD-32g58.faa 4162

difficileQCD-66c26.faa 3677

difficilecd196.faa 3630

difficiler20291.faa 3702

11 genomes, 41664 sequences

taxa considered = 11 sequences = 41664 residues = 12654511 MIN_BITSCORE_SIM = 19.2

mask=difficilecd196_f0_alltaxa_algCOG_e0_C100_S100_ (_algCOG_C100_S100)

skipped genome parsing (cdifficile_all_homologues/tmp/selected.genomes)

skip BLAST searches and parsing

WARNING: please remove/rename results directory:

'/lustre/groups/cbi/ecastron/Sandro/mayor/assembled/white_list_no_redecont/cdifficile_all_homologues/'

if you change the sequences in your .gbk/.faa files or want to re-run

creating indexes, this might take some time (lines=3.27e+06) ...

construct_taxa_indexes: number of taxa found = 11

number of file addresses/BLAST queries = 4.2e+04

clustering orthologous sequences

checking lineage-specific expansions

making COGs

ERROR: find_COGs (/home/ecastron/programs/get_homologues-x86_64-20160201//bin/COGsoft/COGtriangles/COGtriangles ) failed to terminate job

can you help me? regards

(-S) parameter affect the output of BDBH but not COG or OMCL

Hi,
Why does changing the value of the -S parameter make a big difference only with the BDBH algorithm, but not with COG or OMCL, when estimating the core genome?
With BDBH, -S 1 (the default) results in 2422 clusters, -S 90 results in 2319 clusters, and -S 95 results in 1916 clusters.
OMCL and COG result in 2412 and 2404 clusters respectively, regardless of the -S value.
Regards,

compare_clusters.pl argument parsing problem

Hey, I used both the BDBH and OMCL algorithms and got two directories, GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_ and GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_. Then, to get the intersection between them, I used compare_clusters.pl, but I found that only the first directory is used if there is a blank after the comma.

For instance, if I pass the -d parameter as evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_, evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_, the STDOUT looks like:

# ../get_homologues-x86_64-20190102/compare_clusters.pl -d evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_, -o evo_intersection -n 0 -m 0 -t 0 -I  -r 0 -s 0 -x 0 -T 0

# number of input cluster directories = 1

# parsing clusters in evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_ ...
# cluster_list in place, will parse it (evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_.cluster_list)
# number of clusters = 1245

# intersection output directory: evo_intersection
# intersection size = 1245 clusters

# intersection list = evo_intersection/intersection_t0.cluster_list


# WARNING: Venn diagrams are only available for 2 or 3 input cluster directories

See, only the first one is taken in for analysis.

And if I send them only separated by comma as -d evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_,evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_, it works well:

# ../get_homologues-x86_64-20190102/compare_clusters.pl -d evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_,evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_ -o evo_intersection -n 0 -m 0 -t 11 -I  -r 0 -s 0 -x 0 -T 0

# number of input cluster directories = 2

# parsing clusters in evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_ ...
# cluster_list in place, will parse it (evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_.cluster_list)
# number of clusters = 1214
# parsing clusters in evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_ ...
# cluster_list in place, will parse it (evo_homologues/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_.cluster_list)
# number of clusters = 1173

# intersection output directory: evo_intersection
# WARNING: output directory evo_intersection already exists, note that you might be mixing clusters from previous runs

# intersection size = 1127 clusters

# intersection list = evo_intersection/intersection_t11.cluster_list

# input set: evo_intersection/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_.venn_t11.txt
# input set: evo_intersection/GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_.venn_t11.txt

# Venn diagram = evo_intersection/venn_t11.pdf
# Venn region file: evo_intersection/unique_GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algBDBH_e0_.venn_t11.txt (87)
# Venn region file: evo_intersection/unique_GeobacterpickeringiistrainG13_dmd_f0_alltaxa_algOMCL_e0_.venn_t11.txt (46)

As we all know, programmers/users often like to add a single blank after a comma, and I believe that I'm not the only one who is blocked by this problem, so could you allow this syntax? Thank you.
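
The echoed command line in the first log shows -d ending in a comma, which suggests the shell passed the second path as a separate argument before the script could see it. The Python sketch below illustrates that word splitting with shlex, and a tolerant way to split a quoted list. The directory names dirA and dirB and the helper split_dir_list are hypothetical, and this is not how compare_clusters.pl itself parses -d.

```python
import shlex

def split_dir_list(value):
    """Split a comma-separated list, ignoring blanks around the commas."""
    return [part.strip() for part in value.split(",") if part.strip()]

# How a POSIX shell tokenizes the two spellings of the same option.
unquoted = shlex.split("-d dirA, dirB -o out")
quoted = shlex.split("-d 'dirA, dirB' -o out")

print(unquoted)  # the value after -d is only the first directory plus a comma
print(quoted)    # the whole list survives as one argument
print(split_dir_list(quoted[1]))
```

Until such a syntax is supported, quoting the list or omitting the blank keeps both directories in a single argument.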

ERROR: failed generating all_ortho.mcl file (OMCL)

hi
Get_homologues gives this error:
ERROR: failed generating /home/user/Tools/get_homologues-master/faa_homologues/tmp/all_ortho.mcl file (OMCL).

I do not know the cause of this error.

cloning git doesn't include external binaries

After running:

% ./install.pl

Comes a message:

# mcl binaries are missing
# If you just git cloned you should also get the latest release,
# unpack it and copy the contents of the bin/ folder to your local repo bin/

That 'bin/' folder from this git is empty. And I tried compiling MCL on my own, and adding it to the $PATH, but this install script still gives the same error message.

COG pan-transcriptome from single file

Hi there,

Trying to run get_homologues pan-transcriptome analysis from a single file. It works just fine for OMCL, but the COG algorithm comes up with the error # EXIT: need at least three taxa to run COGtriangles algorithm

There are five taxa, but the input is a single file. Taxa are identified by [genus species] in the amino acid FASTA header.
The command was called with ./get_homologues.pl -G -t 0 -i allAA_transdec_cd-hit_cluster-out.fasta

Am I doing something wrong or is there a trick to calling the COG pan-tran from a single file over the OMCL one?

compare_clusters.pl not producing .fasta file

Hi,

I am trying to produce the binary alignment file so that I can feed it to IQ-TREE.
I ran the compare_clusters.pl script after running get_homologues, but the .fasta file was not produced. Only the .phylip, .tab, and cluster list were produced.
Can you please help me figure out why this file was not produced?

My get_homologues script:
get_homologues.pl -d /work/Geomicrobiology/msobol/IODP_329_SPG/get_homologues/genomes/ -n 20 -M -t 18 -e

My compare_clusters script:
compare_clusters.pl -o compare_clusters -d /work/Geomicrobiology/msobol/IODP_329_SPG/get_homologues/genomes_homologues/Pdigitatumproteins_f0_18taxa_algOMCL_e1_/ -m

Thanks,
Morgan
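
While the missing file is investigated, a binary alignment can be derived from a pangenome matrix by hand. The Python sketch below assumes a tab-separated matrix with one header row (cluster names) and one row per genome holding gene counts; that layout, the function name and the toy data are assumptions, not the documented format of the .tab file.

```python
def matrix_to_binary_fasta(tab_text):
    """Turn a genome-by-cluster count matrix into FASTA records of 0/1 characters."""
    lines = [ln for ln in tab_text.splitlines() if ln.strip()]
    records = []
    for line in lines[1:]:  # skip the header row of cluster names
        fields = line.rstrip("\t").split("\t")
        name, counts = fields[0], fields[1:]
        bits = "".join("1" if int(c) > 0 else "0" for c in counts)
        records.append(f">{name}\n{bits}")
    return "\n".join(records) + "\n"

# Hypothetical matrix with two genomes and three clusters.
toy = "source\tc1\tc2\tc3\ngenomeA\t1\t0\t2\ngenomeB\t0\t1\t1\n"
print(matrix_to_binary_fasta(toy), end="")
```

Counts above 1 are collapsed to 1, so only presence or absence of each cluster is kept.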

error locating Pfam-A.hmm after successful install

I successfully downloaded and installed the latest get_homologues release:

$ perl install.pl 

### 1) checking required parts: 


## checking mcl-14-137 (lib/phyTools: $ENV{'EXE_MCL'})
>> OK
## checking COGsoft/COGreadblast (lib/phyTools: $ENV{'EXE_READBLAST'})
>> OK
## checking COGsoft/COGtriangles (lib/phyTools: $ENV{'EXE_COGTRI'})
>> OK
## checking COGsoft/COGmakehash (lib/phyTools: $ENV{'EXE_MAKEHASH'})
>> OK
## checking COGsoft/COGlse (lib/phyTools: $ENV{'EXE_COGLSE'})
>> OK
## Checking blast (lib/phyTools: $ENV{'EXE_BLASTP'})
>> OK

### 2) checking optional parts: 


## checking optional HMMER binaries (lib/phyTools: $ENV{'EXE_HMMPFAM'})
# required by get_homologues.pl -D
>> OK
## checking optional PFAM library (lib/phyTools: $ENV{'PFAMDB'})
# required by get_homologues.pl -D and get_homologues-est.pl -D
# cannot locate Pfam-A, would you like to download it now? [Y/n]
Y
# connecting to ftp.ebi.ac.uk ...
# downloading ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release//Pfam-A.hmm.gz (242.8Mb) ...
# [        50%       ]
# ####################

# gunzip Pfam-A.hmm.gz ...
# pressing Pfam-A.hmm ...
Working...    done.
Pressed and indexed 16295 HMMs (16295 names and 16295 accessions).
Models pressed into binary file:   Pfam-A.hmm.h3m
SSI index for binary model file:   Pfam-A.hmm.h3i
Profiles (MSV part) pressed into:  Pfam-A.hmm.h3f
Profiles (remainder) pressed into: Pfam-A.hmm.h3p
>> OK
## checking optional SWISSPROT library (lib/phyTools: $ENV{'BLASTXDB'})
# required by transcripts2cds.pl and transcripts2cdsCPP.pl
# cannot locate SWISSPROT, would you like to download it now? [Y/n]
Y
# connecting to ftp.ncbi.nih.gov ...
# downloading ftp://ftp.ncbi.nih.gov//blast/db//swissprot.tar.gz (117.0Mb) ...
# [        50%       ]
# ####################

# untar swissprot.tar.gz ...
>> OK
## checking optional software R (lib/phyTools: $ENV{'EXE_R'})
# required by compare_clusters.pl, parse_pangenome_matrix.pl -s, plot_pancore_matrix.pl
>> OK
## checking optional Perl module GD
# required by parse_pangenome_matrix.pl -p
>> OK

### 3) Your get_homologues kit is now fully functional

but when I run get_homologues.pl on a small dataset I get an error:

$ ./get_homologues.pl -n 4 -d /data/test_geo_cyto/ -t 0 -D -A -z -M
# ./get_homologues.pl -i 0 -d /data/test_geo_cyto -o 0 -e 0 -f 0 -r 0 -t 0 -c 0 -z 0 -I 0 -m local -n 4 -M 1 -G 0 -P 0 -C 75 -S 1 -E 1e-05 -F 1.5 -N 0 -B 50 -b 0 -s 0 -D 1 -g 0 -a '0' -x 0 -R 0 -A 1

# results_directory=/sw/get_homologues-macosx-20160113/test_geo_cyto_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=5000 BATCHSIZE=100 KEEPSCNDHSPS=1

# checking input files...
# Geobacter_metallireducens_GS-15.cytochromes.concatenated.faa 58
# Geobacter_pickeringii_G13.cytochromes.faa 58
# Geobacter_soli.cytochromes.concatenated.faa 58
# Geobacter_sulfurreducens_PCA.cytochromes.faa 72

#4 genomes, 246 sequences

# taxa considered = 4 sequences = 246 residues = 119322 MIN_BITSCORE_SIM = 16.2

# mask=GeobactermetallireducensGS-15_f0_0taxa_algOMCL_Pfam_e0_ (_algOMCL_Pfam)

# submitting Pfam HMMER jobs ... 
# ERROR: cannot find database file /sw/get_homologues-macosx-20160113/db/Pfam-A.hmm
# EXIT: failed while running localPfam search (/sw/get_homologues-macosx-20160113/_split_hmmscan.pl 4 100 /sw/get_homologues-macosx-20160113//bin/hmmer-3.1b2/binaries/hmmscan --noali --acc --cut_ga  --cpu 1 /sw/get_homologues-macosx-20160113/db/Pfam-A.hmm  /sw/get_homologues-macosx-20160113/test_geo_cyto_homologues/_Geobacter_metallireducens_GS-15.cytochromes.concatenated.faa.fasta0 > /sw/get_homologues-macosx-20160113/test_geo_cyto_homologues/_Geobacter_metallireducens_GS-15.cytochromes.concatenated.faa.fasta0.pfam )

Examining the get_homologues-macosx-20160113/db directory, Pfam-A.hmm.gz and Pfam-A.hmm are not present.

I was able to resolve the issue by re-downloading Pfam-A.hmm.gz and manually unzipping the file in get_homologues-macosx-20160113/db

cd get_homologues-macosx-20160113/db
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip -c Pfam-A.hmm.gz > Pfam-A.hmm

After doing this get_homologues completed successfully re-running the same command. Is the install.pl script mistakenly removing Pfam-A.hmm from get_homologues-macosx-20160113/db?

Here is the output of get_homologues.pl (OS X 10.11, hmmer 3.1 and blast+/legacy blast installed separately)

$ ./get_homologues.pl -v

./get_homologues.pl version 2.0 (2015)

Program written by Bruno Contreras-Moreira (1) and Pablo Vinuesa (2).

 1: http://www.eead.csic.es/compbio (Estacion Experimental Aula Dei/CSIC/Fundacion ARAID, Spain)
 2: http://www.ccg.unam.mx/~vinuesa (Center for Genomic Sciences, UNAM, Mexico)

Primary citation (PubMed:24096415):

 Contreras-Moreira B, Vinuesa P. (2013) GET_HOMOLOGUES, a versatile software package for scalable and
 robust microbial pangenome analysis. Appl Environ Microbiol 79(24):7696-701. doi: 10.1128/AEM.02411-13

This software employs code, binaries and data from different authors, please cite them accordingly:
 OrthoMCL v1.4 (www.orthomcl.org , PubMed:12952885)
 COGtriangles v2.1 (sourceforge.net/projects/cogtriangles , PubMed=20439257)
 NCBI Blast-2.2 (blast.ncbi.nlm.nih.gov , PubMed=9254694,20003500)
 Bioperl v 1.5.2 (www.bioperl.org , PubMed=12368254)
 HMMER 3.1b2 (hmmer.org)
 Pfam (pfam.sanger.ac.uk , PubMed=24288371)

Checking required binaries and data sources, all set in phyTools.pm :
        EXE_BLASTP : OK (path:/sw/get_homologues-macosx-20160113/bin/ncbi-blast-2.2.27+/bin/blastp)
        EXE_BLASTN : OK (path:/sw/get_homologues-macosx-20160113/bin/ncbi-blast-2.2.27+/bin/blastn)
      EXE_FORMATDB : OK (path:/sw/get_homologues-macosx-20160113/bin/ncbi-blast-2.2.27+/bin/makeblastdb)
           EXE_MCL : OK (path:/sw/get_homologues-macosx-20160113//bin/mcl-14-137/src/shmcl/mcl)
      EXE_MAKEHASH : OK (path:/sw/get_homologues-macosx-20160113//bin/COGsoft/COGmakehash/COGmakehash )
     EXE_READBLAST : OK (path:/sw/get_homologues-macosx-20160113//bin/COGsoft/COGreadblast/COGreadblast )
        EXE_COGLSE : OK (path:/sw/get_homologues-macosx-20160113//bin/COGsoft/COGlse/COGlse )
        EXE_COGTRI : OK (path:/sw/get_homologues-macosx-20160113//bin/COGsoft/COGtriangles/COGtriangles )
       EXE_HMMPFAM : OK (/sw/get_homologues-macosx-20160113//bin/hmmer-3.1b2/binaries/hmmscan --noali --acc --cut_ga  /sw/get_homologues-macosx-20160113/db/Pfam-A.hmm)
        EXE_INPARA : OK (path:/sw/get_homologues-macosx-20160113/_cluster_makeInparalog.pl)
         EXE_ORTHO : OK (path:/sw/get_homologues-macosx-20160113/_cluster_makeOrtholog.pl)
         EXE_HOMOL : OK (path:/sw/get_homologues-macosx-20160113/_cluster_makeHomolog.pl)
    EXE_SPLITBLAST : OK (path:/sw/get_homologues-macosx-20160113/_split_blast.pl)
  EXE_SPLITHMMPFAM : OK (path:/sw/get_homologues-macosx-20160113/_split_hmmscan.pl)

Method used to define 'Pangenome size' with -c option?

Hello,

I am working with a list of 20 closely related bacterial genomes. I analyze them with get_homologues.pl with both orthoMCL and COG triangles, and then run compare_clusters.pl to obtain a pangenome matrix from both result sets, resulting in ~10,000 orthologue clusters. However, if i run compare_clusters.pl with -c with either orthoMCL or COG, the identified pangenome size using all 20 genomes is only ~7,500 genes. I am confused by this discrepancy, as I would expect these two methods to give comparable numbers of orthologue clusters for the 'pangenome'.

Can you explain how the -c option of compare_clusters.pl calculates pangenome size? Do you know the source of the discrepancy in the above analysis I highlighted?

Best,
Joe

compare_cluster option with -I

How do I pick only part of my genomes with compare_clusters.pl?
I tried -I and nothing happened.

can't find the fna files after clustering intergenic segments from GenBank files

Hello,

I tried to use get_homologues to cluster the intergenic segments of different chloroplasts. However, the output dir didn't contain any dnafile. I could find the names of dnafile in the cluster_list though. The log file can be found in the attachment.

By the way, my primary goal was to find the organism-unique genes or intergenic segments. Reasonably, the gene content of my query chloroplasts was uniform. So my bet was on intergenic segments. From the cluster_list, all intergenic clusters were shared by all organisms. How could I adjust the parameters to find organism-specific intergenic segments?

Many thanks!

nohup.out.zip

Problem with file naming (?)

I made a successful GET_HOMOLOGUES analysis of ca. 60 proteomes (each in its own file). However, when I want to do the same analysis on a subset containing ca. 50 of these proteomes (with -I), I get 0 clusters. If I use it on a smaller subset (10 proteomes) the results seem to be ok.

If I do pangenome analyses on the same subset, I get a result, but some proteomes are missing in the final output.

What I notice is that there are strange pairings of the genomes in the screen output - see below marked in bold. When blast results are processed, proteomes 1, 2, 3, and 4 are added several times. The proteomes in pairs with them (e.g. 11, 22, 31, 33) are then missing from the pangenome results. I don't know if this could be the problem. If I remove proteomes 1, 2, 3 and 4 from the list, the strange pairs are still there.
Am I breaking some file naming rules and if this is the case, is it possible to avoid the problem without re-running the whole analysis from start (which takes a very long time)?

perl ~/Path/get_homologues-x86_64-20170807/get_homologues.pl -d input/ -r 10.faa -c -n 8 -I ListName.list

included input files (51):

: 10.faa 10.faa 10704
: 1.faa 11.faa 10637
: 12.faa 12.faa 10771
: 13.faa 13.faa 10513
: 14.faa 14.faa 10890
: 15.faa 15.faa 10577
: 16.faa 16.faa 10573
: 17.faa 17.faa 10668
: 18.faa 18.faa 10550
: 19.faa 19.faa 10775
: 20.faa 20.faa 10561
: 2.faa 22.faa 10589
: 23.faa 23.faa 10858
: 24.faa 24.faa 10660
: 25.faa 25.faa 10597
: 26.faa 26.faa 10445
: 27.faa 27.faa 10643
: 28.faa 28.faa 10770
: 29.faa 29.faa 10887
: 30.faa 30.faa 10511
: 1.faa 31.faa 10489
: 3.faa 33.faa 10752
: 34.faa 34.faa 10889
: 35.faa 35.faa 10506
: 36.faa 36.faa 10788
: 37.faa 37.faa 10798
: 38.faa 38.faa 10650
: 40.faa 40.faa 10625
: 2.faa 42.faa 10719
: 3.faa 43.faa 10577
: 4.faa 44.faa 10723
: 45.faa 45.faa 10537
: 46.faa 46.faa 10777
: 47.faa 47.faa 10827
: 48.faa 48.faa 10613
: 49.faa 49.faa 10817
: 50.faa 50.faa 9779
: 1.faa 51.faa 10513
: 2.faa 52.faa 10615
: 3.faa 53.faa 10689
: 4.faa 54.faa 10774
: 1.faa 1.faa 10778
: 2.faa 2.faa 10822
: 3.faa 3.faa 11081
: 4.faa 4.faa 10726
: 5.faa 5.faa 10558
: 6.faa 6.faa 10722
: 7.faa 7.faa 10589
: 8.faa 8.faa 9527
: 9.faa 9.faa 10484
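
The listing above suggests, though it does not prove, that short names such as 1.faa are being matched against longer names ending in the same characters (11.faa, 31.faa, 51.faa). Below is a minimal Python sketch, independent of GET_HOMOLOGUES, that flags such pairs in a list of input file names before a run. The helper name `find_suffix_collisions` is hypothetical, and the assumption that suffix matching is the cause is not confirmed by the source.

```python
def find_suffix_collisions(names):
    """Return (short, long) pairs where one file name is a proper suffix of another."""
    collisions = []
    for short in names:
        for long_name in names:
            if short != long_name and long_name.endswith(short):
                collisions.append((short, long_name))
    return sorted(collisions)


if __name__ == "__main__":
    # A few names taken from the listing above.
    example = ["1.faa", "2.faa", "10.faa", "11.faa", "22.faa", "31.faa"]
    for short, long_name in find_suffix_collisions(example):
        print(f"{short} is a suffix of {long_name}")
```

This check is deliberately conservative: it would also flag 2.faa against 12.faa, which the listing shows paired correctly, so a reported pair is only a hint that renaming the short files (for example with a prefix) may be worth trying.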

sample input data not in distro

The manual has a section devoted to examples, but the sample input data are not provided.

Could you add them to the repository?

Thanks

JML

compare_cluster.pl fail to extract fasta headers?

Something weird happens after running get_homologues.pl on gbk files (both sample_plasmids_gbk and my own gbk data produce these errors). I then ran compare_clusters.pl to get the gene clusters, but errors like these were shown:

errors in v 3.0.8

...
Argument "" isn't numeric in numeric comparison (<=>) at ./compare_clusters.pl line 619.
...

errors in 3.1.2.

...
Use of uninitialized value in hash element at ./compare_clusters.pl line 300.
...
Argument "" isn't numeric in numeric comparison (<=>) at ./compare_clusters.pl line 627.
...

The resulting *_t0.tab file contains very little gene distribution data, which makes plot_matrix_heatmap.sh fail with an error message.

I tried two different versions, but both showed similar errors. Any help would be appreciated, thanks.

strains missing in final output

I ran get_homologues version "24052018" with diamond searches and MCL clustering, as well as compare_clusters.pl and parse_pangenome_matrix.pl. The analysis ran to completion, however two of my strains are missing from the pangenome matrix as well as the individual cluster files.

These strain files were named "2.gbk" and "4.gbk", which sounds similar to the problem reported by another user. However, I also have input files such as 24.gbk and 17.gbk that were correctly included. These annotated assemblies were assembled at the same time using the same pipeline. It looks like something is failing during the "identifying orthologs" step.

Here is a portion of the screen output:

/home/Software/get_homologues-x86_64-20180524/get_homologues.pl

-i 0 -d ./genomesfull -o 0 -X 1 -e 0 -f 0 -r USDA110.gbk -t 0 -c 0 -z 0 -I 0 -m local -n 1 -M 1 -G 0 -p >0 -C 75 -S 1 -E 1e-05 -F 1.5 -N 0 -B 50 -b 0 -s 0 -D 0 -g 0 -a '0' -x 0 -R 0 -A 0 -P 0

version 24052018
results_directory=/home/Data/genomesfull_homologues
parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
diamond job:1

checking input files...
...
17.gbk 7929
2.gbk 9953
21.gbk 6625
23.gbk 8320
24.gbk 9336
38B.gbk 8065
4.gbk 7658
48-1.gbk 8860
62-1.gbk 6921
...
identifying orthologs between otherstrain.gbk and 2.gbk (0)
0 sequences

identifying orthologs between otherstrain.gbk and 21.gbk (0)
5145 sequences

identifying orthologs between otherstrain.gbk and 23.gbk (0)
5279 sequences

identifying orthologs between otherstrain.gbk and 24.gbk (0)
4467 sequences

identifying orthologs between otherstrain.gbk and 38B.gbk (0)
5240 sequences

identifying orthologs between otherstrain.gbk and 4.gbk (0)
0 sequences
...
occupancy stats:
cloud shell soft_core core
17.gbk 48 4204 2683 284
21.gbk 26 3292 2678 284
23.gbk 82 4385 2548 284
24.gbk 88 3434 1971 284
38B.gbk 77 3977 2681 284
48-1.gbk 83 4723 2683 284
62-1.gbk 36 3515 2648 284
...etc

Error running COGS

I got this while running COGs. What could it be?

creating taxa indexes...

construct_taxa_indexes: number of taxa found = 192

clustering orthologous sequences

checking lineage-specific expansions

making COGs

prunning COGs

done

add_unmatched_singletons : 2010 sequences, 185 taxa

looking for valid ORF clusters (n_of_taxa=0)...

EXIT : cannot create /mnt/e/Downloads/get_homologues-x86_64-20170609/Genomas_Teste_2_homologues/GCF000598065_f0_0taxa_algCOG_e0_/800310_endo-alpha--1->5--L-...faa
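
One thing that stands out in the error above is that the cluster file name contains a `>` character and the target directory is under /mnt/e. If /mnt/e is a Windows drive mounted under WSL (an assumption, the report does not say so), this could matter, because `>` is among the characters Windows file systems do not accept in file names. A minimal Python sketch that lists such characters in a name follows; the helper `reserved_chars` and the example file name are hypothetical.

```python
# Characters that Windows file systems (NTFS, FAT) do not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def reserved_chars(filename):
    """Return the sorted list of characters in `filename` that Windows rejects."""
    return sorted(set(filename) & WINDOWS_RESERVED)


if __name__ == "__main__":
    # Hypothetical cluster name modelled on the one in the error message;
    # the real name is truncated in the report and is not reconstructed here.
    print(reserved_chars("800310_endo-alpha--1->5--L-example.faa"))
```

If that is the cause, running the analysis in a directory on a native Linux file system would be one way to test the idea.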

redirect output to mounted volume

Hi,

I'm running get_homologues.pl on an 8-CPU, 64 GB RAM Ubuntu instance and trying to redirect the output to a 2 TB mounted volume (because the local 120 GB disk is filling up). I tried to use -o to redirect the output, but it is still using the local disk rather than the mounted volume. The command I used is below. I realise this is almost certainly a mistake on my side (is what I am attempting even possible?), but any suggestions would be amazing.

sudo nohup /home/ubuntu/ghomol/get_homologues-x86_64-20170609/get_homologues.pl -d /dev/cog_data/genbank/ -c -R 1234 -m local -n 16 -M -X -z -t 0 -A -g -o /dev/cog_data/genbank_homologues/ &

Thanks so much in advance.
D

Error with plot_matrix_heatmap.sh

plot_matrix_heatmap.sh -i pangenome_matrix_t0.tab -o pdf -N -H 8 -W 14 -m 28 -v 28 -t "pangenome" -k "genes per cluster"

Plotting file pangenome_matrix_t0_heatmap.pdf

Error: There are less than four complete data rows. Pleae revise your input table!
Execution halted

ERROR: file pangenome_matrix_t0_heatmap.pdf was NOT produced.

You can try option -C or alternatively remove columns in the matrix.

ERROR: file pangenome_matrix_t0_BioNJ.ph was NOT produced!

ERROR: file ANDg_meand_silhouette_width_statistic_plot.pdf was NOT produced!

ERROR: file was NOT produced!

Please suggest what I should do to get an output.

parse_pangenome_matrix.pl

Hi,

I came across your nice tool parse_pangenome_matrix.pl, which identifies genes present in group A and absent in group B. Singletons can be skipped with -S to avoid sequences found in <2 taxa.
My question is:
What if we want to report only the most common genes (present in >50% of group A) that are not in B? Could the -S default value be adjusted?
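
As an illustration of the filter being asked for, here is a minimal Python sketch that works on a presence/absence matrix already loaded in memory. It is not an option of parse_pangenome_matrix.pl; the function name `clusters_mostly_in_a`, the dictionary layout and the 0.5 default are assumptions made for this example.

```python
def clusters_mostly_in_a(matrix, group_a, group_b, min_fraction=0.5):
    """Clusters present in more than `min_fraction` of group A and in no genome of group B.

    `matrix` maps genome name -> {cluster name: number of sequences}.
    """
    clusters = set()
    for counts in matrix.values():
        clusters.update(counts)

    selected = []
    for cluster in sorted(clusters):
        in_a = sum(1 for g in group_a if matrix[g].get(cluster, 0) > 0)
        in_b = sum(1 for g in group_b if matrix[g].get(cluster, 0) > 0)
        if in_b == 0 and in_a / len(group_a) > min_fraction:
            selected.append(cluster)
    return selected


if __name__ == "__main__":
    # Toy data: three genomes in group A, one in group B.
    toy = {
        "A1": {"c1": 1, "c2": 1},
        "A2": {"c1": 1},
        "A3": {"c1": 1, "c3": 1},
        "B1": {"c3": 1},
    }
    print(clusters_mostly_in_a(toy, ["A1", "A2", "A3"], ["B1"]))
```

In the toy data only c1 qualifies: c2 is found in a third of group A, and c3 is also present in group B.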

Thanks a lot,

get_homologues-est on one input file

I am interested in using get_homologues-est, but I have a question about how the program would deal with a single input FASTA file. I have RNA and DNA data from multiple species in the same genus, all in the same file. Would get_homologues-est be able to cluster a single input into orthologs? I apologize if this is a dumb question.

install failing at COGtriangles step

I'm trying to install get_homologues on a computing cluster. The cluster does not have a COGtriangles module pre-loaded.

I followed the steps in the manual to compile COGtriangles with g++/8.1.0.

It seems like COGtriangles, COGmakeblast, COGcognitor, and COGmakehash were all made properly once I added

#include <unistd.h>

to the top of the os.h files in COGcognitor and COGtriangles, and the bc.h file in COGreadblast, as per this site: [http://seqanswers.com/forums/archive/index.php/t-51326.html]

However, I'm still getting this error when I try to install get_homologues:

[ka41a@ghpcc06 homo]$ perl install.pl

### 1) checking required parts: 

## checking whether source and binaries of dependencies are in place
>> OK

## checking mcl-14-137 (lib/phyTools: $ENV{'EXE_MCL'})
>> OK
## checking COGsoft/COGtriangles (lib/phyTools: $ENV{'EXE_COGTRI'})

# compiling COGsoft/COGtriangles ...
g++ -O3 -g -Wall  -c cogtriangles.cpp
In file included from cogtriangles.cpp:5:
os.h: In function ‘void myOpenDir(const char*)’:
os.h:13:7: error: ‘chdir’ was not declared in this scope
   if (chdir(dirpath))
       ^~~~~
os.h:13:7: note: suggested alternative: ‘char’
   if (chdir(dirpath))
       ^~~~~
       char
make: *** [cogtriangles.o] Error 1
<< Cannot run COGtriangles, please check the manual for compilation instructions and re-run.

So even though I get the message below when I make COGtriangles, get_homologues doesn't recognize it and keeps failing during the install.

[ka41a@ghpcc06 COGtriangles]$ cd ../COGtriangles;make
make: `COGtriangles' is up to date.

Could someone give me any help with this? I can install and run get_homologues on my personal computer just fine, but I just can't get it installed on a computing cluster.
Thank you!!
-Korin

pangenome_matrix_t0.tab display hash reference instead of 1s

Hello!

First of all, thank you for your amazing program (using the latest release, 20170807).
I have an issue though, with the pangenome_matrix_t0.tab file generated by compare_clusters.pl .
Instead of 1s, it displays hash references (eg : HASH(0x338d6b8)). I've been trying to debug it for a while, even using your commented line in compare_clusters.pl to "artificially" print 1s instead (to later use that matrix with the plot_matrix_heatmap.sh, which returns me "Error in read.table(file = "0824_intersect_pangenome/pangenome_matrix_t0.tab", : more columns than columns names").
I'm out of ideas for now. Could you help me resolve the issue? I don't know if the bug lies in my data, in my R dependencies, or somewhere else...

Thanks a lot,
Cheers,

Audrey

Cloud_genes

I used 81 strains for a pan-genome study, and all 81 strains contribute cloud genes, ranging from 1 gene to 498 genes per strain. As far as I know, cloud genes are rare genes found in ~2% of the population. How should I interpret the fact that all 81 bacterial strains contribute cloud genes?
Book1.xlsx.

Also, can we identify unique genes?
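
One way to look at both questions is that rarity is a property of a cluster (in how many genomes it occurs), not of a strain, so every strain can carry some rare clusters. The Python sketch below counts, per genome, the clusters found in at most `max_genomes` genomes. The cutoff of 2 genomes, the function names and the dictionary layout are assumptions for illustration and are not necessarily what GET_HOMOLOGUES uses to define the cloud.

```python
def occupancy(matrix):
    """Number of genomes in which each cluster is present."""
    counts = {}
    for genome_counts in matrix.values():
        for cluster, n in genome_counts.items():
            if n > 0:
                counts[cluster] = counts.get(cluster, 0) + 1
    return counts


def rare_clusters_per_genome(matrix, max_genomes=2):
    """For each genome, count its clusters found in at most `max_genomes` genomes."""
    occ = occupancy(matrix)
    return {
        genome: sum(1 for c, n in genome_counts.items() if n > 0 and occ[c] <= max_genomes)
        for genome, genome_counts in matrix.items()
    }


if __name__ == "__main__":
    # Toy matrix: genome name -> {cluster name: number of sequences}.
    toy = {
        "s1": {"core": 1, "x": 1},
        "s2": {"core": 1, "x": 1, "y": 1},
        "s3": {"core": 1, "z": 1},
    }
    print(rare_clusters_per_genome(toy))                 # rare clusters per strain
    print(rare_clusters_per_genome(toy, max_genomes=1))  # strain-unique clusters
```

With `max_genomes=1` the same function counts clusters found in a single strain only, which is one reasonable reading of "unique genes".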

pfam_enrich missing output?

I'm running pfam_enrich.pl and it seems to be running fine but nothing shows up in the output.

Below is what I get as the output. I'm running this on a computing cluster and the following result is what was emailed to me. Does this script produce a .txt file that would include the list of enriched Pfams, or do I really have no enriched Pfams?

# /share/pkg/get_homologues/3.1.2/pfam_enrich.pl -d infantis_homologues -c infantis_homologues/UCD304_f0_0taxa_algOMCL_e0_ -x infantis_intersection/infpan/pangenome_matrix_t0__shell_list.txt -n 1 -s 0 -e 0 -t greater -p 0.05 -r  -f 0

# parsing clusters...
# 66910 sequences extracted from 5988 clusters

# total experiment sequence ids = 26283
# total control    sequence ids = 66908

# parse_Pfam_freqs: set1 = 623 Pfams set2 = 1560 Pfams

# fisher exact test type: 'greater'
# multi-testing p-value adjustment: fdr
# adjusted p-value threshold: 0.05

# total annotated domains: experiment=1375 control=4086

#PfamID counts(exp)     counts(ctr)     freq(exp)       freq(ctr)       p-value p-value(adj)    description
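
The log above names the statistics involved: a one-sided ('greater') Fisher exact test followed by FDR adjustment with a 0.05 threshold. An empty table under the header would be consistent with no domain passing that threshold, although the report alone cannot confirm this. The Python sketch below shows the two calculations in their textbook form; it is not the code pfam_enrich.pl runs, and the function names are made up for this example.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Probability of seeing `a` or more in the top-left cell with all margins fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(total, row1)


def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values (Benjamini-Hochberg), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Rows: experiment / control; columns: sequences with / without a given domain.
    raw = [fisher_greater(3, 1, 1, 3), fisher_greater(1, 3, 3, 1)]
    print(raw, benjamini_hochberg(raw))
```

A domain would be reported only if its adjusted p-value falls below the threshold, so a long list of raw p-values can still yield an empty result.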

Thanks!

pfam_enrich.pl can't match ids of 'experiment'

I first run command like this:

parse_pangenome_matrix.pl -m matrix_t2/pangenome_matrix_t0.tab -B ref.list -a

I got file #matrix_t2/pangenome_matrix_t0__pangenes_list.txt#

Then when I run pfam_enrich.pl:

pfam_enrich.pl -d sequence_est_homologues/ -c sequence_est_homologues/A-grandislongest_0taxa_algOMCL_e0_S75_ -x matrix_t2/pangenome_matrix_t0__pangenes_list.txt

I got this error:
cannot match ids of 'experiment' sequences.

I checked many times but could not find anything I missed. Can you help?

issue drawing PNG images

Posting this here in case anyone runs into a similar issue. I'm running get_homologues on CentOS 6.9. R installed via linuxbrew with brew install -s r --without-x11. However, calling plot_pancore_matrix.pl kept returning this error:

# using default colors, defined in %COLORS

# globals controlling R plots: $YLIMRATIO=1.2
Error in .External2(C_X11, paste0("png::", filename), g$width, g$height,  : 
  unable to start device PNG
Calls: png
In addition: Warning message:
In png(file = "sample_intersection/pangenome_matrix_t0__shell.png") :
  unable to open connection to X11 display ''
Execution halted

Apparently R is trying to use X11 even though R was deliberately built without X11 support.

Fortunately, this post fixed things for me. I added the following to my ~/.Rprofile:

## Set default 'type' for png() calls - useful when X11 device is not available!
## NOTE: Needs 'cairo' capability
options(bitmapType='cairo')

Feature request - add input file name to cluster fasta header

When parsing the FASTA output of parse_pangenome_matrix.pl, I often run into issues determining which sequences came from which genomes, or reformatting headers to list only the strain name (e.g. for making a partitioned alignment). This is especially the case when RefSeq-annotated genomes are included, as strains now have non-redundant protein ids that are no longer unique.

It would be nice to have an option to include the input filename (strainA.gbk) in the header line of each sequence, so that it is easy to separate them out and not have to deal with inconsistent or non-unique labeling of strain names.

plot core-genome and pan-genome plots

Could you please help me with this one:

I need to plot core-genome and pan-genome plots as shown in Fig. 16 of the manual.
To get these plots I first run the following command, which should give me a .tab file:
get_homologues.pl -d faa -c -M -n 25 as suggested in 4.8.4
The faa directory contains all the amino acid sequences of my bacterial strains.
The above command will generate the faa_homologue folder, where I should find the .tab file to be used further.

I have two questions here:

The above command has been running for the last four days and hasn't finished. Does it take this much time? I have assigned 40 GB of memory, 200 CPU hours and 25 threads on a single node for 81 bacterial genomes. There are a total of 14,452 genes, including 3153 core genes, 3799 soft core, 3845 shell and 6806 cloud genes.

Do I have to use the .tab file produced by the above script only as input in the following way:
plot_pancore_matrix.pl -i xyz.tab -f core_Tettelin
xyz.tab file is the file that is expected to be generated by get_homologues.pl as shown above (first).
or
Can I use any other already generated pangenome.tab files by earlier used scripts?

Thank you
Gaurav

Question: restarting get_homologues-est

Hi there,
get_homologues-est ran out of memory. Is there a way to re-start the run continuing from the output so far or is it better to start from scratch?

Thanks,
Liza

Error: File Pfam-A.hmm does not appear to be in a recognized HMM format.

I am trying to format the Pfam database and I keep getting the error message --> Error: File Pfam-A.hmm does not appear to be in a recognized HMM format. I successfully used hmmpress on the Pfam-A.hmm file and added the hmmpress output to the db directory. But when I run ./install.pl, I'm still getting the message below. Am I doing something wrong or missing a step? Thanks for your help!

Korins-MacBook-Air:gethomologues korinalbert$ ./install.pl

1) checking required parts:

checking mcl-02-063 (lib/phyTools: $ENV{'EXE_MCL'})

OK

checking COGsoft/COGlse (lib/phyTools: $ENV{'EXE_COGLSE'})

OK

checking COGsoft/COGreadblast (lib/phyTools: $ENV{'EXE_READBLAST'})

OK

checking COGsoft/COGtriangles (lib/phyTools: $ENV{'EXE_COGTRI'})

OK

checking COGsoft/COGmakehash (lib/phyTools: $ENV{'EXE_MAKEHASH'})

OK

Checking blast (lib/phyTools: $ENV{'EXE_BLASTP'})

OK

2) checking optional parts:

checking optional HMMER binaries (lib/phyTools: $ENV{'EXE_HMMPFAM'})

required by get_homologues.pl -D

OK

checking optional PFAM library (lib/phyTools: $ENV{'PFAMDB'})

required by get_homologues.pl -D

pressing Pfam-A.hmm ...

Error: File Pfam-A.hmm does not appear to be in a recognized HMM format.

OK

checking optional software R (lib/phyTools: $ENV{'EXE_R'})

required by compare_clusters.pl, parse_pangenome_matrix.pl -s, plot_pancore_matrix.pl

OK

checking optional Perl module GD

required by parse_pangenome_matrix.pl -p

<< missing

operating system guess: macOSX

Please install fink ( http://www.finkproject.org ) if required.

Please open a terminal with root permissions (sudo, otherwise su)

and paste the following commands to install the missing optional parts:

fink install gd ;

HELP while installing optional dependencies under macOSX:

http://cran.r-project.org/bin/macosx/
http://www.bugzilla.org/docs/2.16/html/osx.html

In case of errors please check the manual for instructions and re-run.

Korins-MacBook-Air:gethomologues korinalbert$

feature-add HipMCL as one of the clustering options

Bruno, can you add HipMCL as one of the clustering options?

tettelin core/pan estimate fails with cryptic "error"

I am trying to run the Tettelin estimate for pan- and core- genome size on a set of genomes. Using the following command:
get_homologues.pl -d ./genomestettelin -c -M
The analysis appears to be running and gets this far:
~/Software/get_homologues-x86_64-20160413/get_homologues.pl -i 0 -d ./genomestettelin -o 0 -e 0 -f 0 -r 0 -t all -c 1 -z 0 -I 0 -m local -n 2 -M 1 -G 0 -P 0 -C 75 -S 1 -E 1e-05 -F 1.5 -N 0 -B 50 -b 0 -s 0 -D 0 -g 0 -a '0' -x 0 -R 0 -A 0

results_directory=~/gethomologues/genomestettelin_homologues
parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1

checking input files...
02-815.gbk 5656
02-816c.gbk 5648

... etc ...

parsing blast result! (~/gethomologues/genomestettelin_homologues/tmp/all.blast , 8.4e+03MB)
parsing file finished

creating indexes, this might take some time (lines=1.41e+08) ...

construct_taxa_indexes: number of taxa found = 62
number of file addresses/BLAST queries = 3.2e+05

genome composition report (samples=10,permutations=3.147e+85,seed=0)
genomic composition parameters: MIN_PERSEQID_HOM=0 MIN_COVERAGE_HOM=20 (set in lib/marfil_homology.pm)
genome order:

... etc ...

identifying orthologs between RS1D5.gbk and UNC23MFCrub1.1.gbk (0)
3182 sequences

identifying inparalogs in RS1D5.gbk
72 sequences

identifying inparalogs in UNC23MFCrub1.1.gbk
49 sequences

running MCL (inflation=1.5) ...
running MCL finished

find_OMCL_clusters: parsing clusters (~/gethomologues/genomestettelin_homologues/tmp/all_ortho.mcl)

sample 0 (02-815.gbk | 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,...)
adding 02-815.gbk: core=5490 pan=5490
finding homologs between 02-816c.gbk and 02-815.gbk
ARRAY(0x582c7c20)
ARRAY(0x57397460)
ARRAY(0x7b96fe98)
ARRAY(0x83357218)
ARRAY(0x93e73ca8)
ARRAY(0xbbf44b68)
ARRAY(0x4f6f3628)
ARRAY(0xacd7a200)
...
and so on for 4GB of log file. The analysis has been running for about a week now but I'm fairly sure it will not finish at this point.

Not able to get proper Tettelin fit to OrthoMCL clusters to estimate the size of the

plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f core_Tettelin
Also tried, plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f core_both

core_Tettelin fit failed to converge

core_Willenbrock fit failed to converge

Warning message while downloading uniprot_sprot.fasta.gz

A warning message in install.pl
says "cannot connect to $BLASTSERVERPATH: ..." while it is actually trying to download a file (uniprot_sprot.fasta.gz) from server $SWISSPROTSERVER. Just a typo I think.

POCP values greater than 100

Hello,

I just tested the "-P" option of the latest version of GET_HOMOLOGUES to calculate the values of POCP. I sometimes have values over 100. How is this possible?

Thank you
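
For reference, the percentage of conserved proteins as published by Qin et al. (2014) is POCP = (C1 + C2) / (T1 + T2) × 100, where C1 and C2 are the numbers of conserved proteins in each genome and T1 and T2 the total numbers of proteins. The Python sketch below only evaluates that formula; how GET_HOMOLOGUES counts conserved proteins with -P is not described here, so the numbers are invented for illustration.

```python
def pocp(conserved_1, conserved_2, total_1, total_2):
    """Percentage of conserved proteins between two genomes (Qin et al. 2014)."""
    return 100.0 * (conserved_1 + conserved_2) / (total_1 + total_2)


if __name__ == "__main__":
    # Invented counts: conserved proteins never exceed the totals.
    print(pocp(3000, 3000, 4000, 4000))
    # Invented counts: the first conserved count exceeds its total.
    print(pocp(4100, 4000, 4000, 4000))
```

From the formula, a value above 100 requires the conserved count to exceed the total for at least one of the two genomes, so comparing those counts for an affected pair may help narrow down where the excess comes from.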


192 genomes, 1030166 sequences

taxa considered = 192 sequences = 1030166 residues = 331534637 MIN_BITSCORE_SIM = 21.6

mask=GCF000598065_f0_0taxa_algOMCL_e0_ (_algOMCL)

skipped genome parsing (Genomas_Teste_2_homologues/tmp/selected.genomes)

skip BLAST searches and parsing

WARNING: please remove/rename results directory:

'/mnt/e/Downloads/get_homologues-x86_64-20170609/Genomas_Teste_2_homologues/'

if you change the sequences in your .gbk/.faa files or want to re-run

/home/ecastron/programs/get_homologues-x86_64-20160201/get_homologues.pl -i 0 -d cdifficile_all -o 0 -e 0 -f 0 -r 0 -t all -c 0 -z 0 -I 0 -m local -n 16 -M 0 -G 1 -P 0 -C 100 -S 100 -E 1e-05 -F 1.5 -N 0 -B 50 -b 0 -s 0 -D 0 -g 0 -a '0' -x 0 -R 0 -A 0

WARNING : cannot lock files in /lustre/groups/cbi/ecastron/Sandro/mayor/assembled/white_list_no_redecont/cdifficile_all_homologues ,

please ensure that no other instance of the program is running at this location

results_directory=/lustre/groups/cbi/ecastron/Sandro/mayor/assembled/white_list_no_redecont/cdifficile_all_homologues

parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1

checking input files...

97-16_S15.faa 3908

CGB1038-2_S2.faa 3688

CGB1038-3_S3.faa 3685

CGB1038-5_S5.faa 3849

CGB1038-6_S6.faa 3852

difficile630.faa 3823

difficileCIP107932.faa 3688

difficileQCD-32g58.faa 4162

difficileQCD-66c26.faa 3677

difficilecd196.faa 3630

difficiler20291.faa 3702

11 genomes, 41664 sequences

taxa considered = 11 sequences = 41664 residues = 12654511 MIN_BITSCORE_SIM = 19.2

mask=difficilecd196_f0_alltaxa_algCOG_e0_C100_S100_ (_algCOG_C100_S100)

skipped genome parsing (cdifficile_all_homologues/tmp/selected.genomes)

skip BLAST searches and parsing

WARNING: please remove/rename results directory:

'/lustre/groups/cbi/ecastron/Sandro/mayor/assembled/white_list_no_redecont/cdifficile_all_homologues/'

if you change the sequences in your .gbk/.faa files or want to re-run

creating indexes, this might take some time (lines=3.27e+06) ...

construct_taxa_indexes: number of taxa found = 11

number of file addresses/BLAST queries = 4.2e+04

clustering orthologous sequences

checking lineage-specific expansions

making COGs

ERROR: find_COGs (/home/ecastron/programs/get_homologues-x86_64-20160201//bin/COGsoft/COGtriangles/COGtriangles ) failed to terminate job
