fephyfofum / sortadate

License: GNU General Public License v2.0

Python 100.00%

sortadate's Introduction

SortaDate

Introduction

This repository contains scripts that you can use at different stages to attempt to find more clock-like genes. Generally, you would use these for dating analyses with another package.

The results you get from these analyses are just suggestions and tools to help explore your dataset. You should be sure to examine the results carefully and not use them blindly.
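As a rough illustration of what "clock-like" means in practice: on a rooted tree, a gene evolving at a constant rate places every tip at about the same distance from the root, so the variance of the root-to-tip distances is close to zero. The sketch below computes that variance for two toy trees. The nested-tuple tree format and the function names are hypothetical, chosen only for this example. The SortaDate scripts delegate the calculation to phyx, and whether phyx reports the population or the sample variance is not stated here; the sketch assumes population variance.

```python
from statistics import pvariance

# Hypothetical toy format: a node is (branch_length, children), and a tip
# is a node whose list of children is empty.
def root_to_tip_distances(node, dist_above=0.0):
    """Return the distance from the root to every tip below `node`."""
    branch_length, children = node
    dist = dist_above + branch_length
    if not children:  # this node is a tip
        return [dist]
    distances = []
    for child in children:
        distances.extend(root_to_tip_distances(child, dist))
    return distances

def root_to_tip_variance(root):
    """Population variance of root-to-tip distances (an assumption, see text)."""
    return pvariance(root_to_tip_distances(root))

# Clock-like tree: all three tips sit 2.0 from the root.
clocklike = (0.0, [(1.0, [(1.0, []), (1.0, [])]), (2.0, [])])
# Less clock-like tree: one tip sits twice as far from the root as the others.
uneven = (0.0, [(0.5, [(0.5, []), (0.5, [])]), (2.0, [])])

print(root_to_tip_variance(clocklike))
print(root_to_tip_variance(uneven))
```

Because this is a variance of path lengths, its units are the square of whatever units the branch lengths are in.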

Requirements

You will minimally need Python to run the scripts. Beyond Python, you will probably want three programs from phyx (see Installation below) in order to run all parts of SortaDate.

The expected input is a set of genes that have been aligned and for which there are gene trees. Generally, we would expect the trees to be rooted as well. The alignments should be in FASTA format. If you need help converting these files, you can use phyx or open a feature request and we can add scripts to convert files.

Usage

With the input of a directory of gene alignments and corresponding gene trees, the steps of these analyses are:

Get the root-to-tip variance and tree length with get_var_length.py
Get the bipartition support with get_bp_genetrees.py
Combine the results from these two runs with combine_results.py
Sort and get the list of the good genes with get_good_genes.py. You can give the order that you prefer the sorting so --order 3,1,2 would mean that you want the bipartition sorted first (3=bipartition), then root-to-tip variance (1=root-to-tip variance), and then tree length (2=treelength).
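The effect of --order can be sketched as a multi-key sort. This is a minimal illustration, not the code of get_good_genes.py: the function name rank_genes and the toy values are hypothetical. It negates the bipartition proportion so that better-supported genes sort first, which matches the tuple construction shown in the get_good_genes.py traceback quoted in the issues further down this page. Sorting tree length in ascending order is an assumption of the sketch.

```python
from operator import itemgetter

def rank_genes(rows, order=(3, 1, 2), max_genes=3):
    """Rank genes by the criteria listed in `order`.

    rows: iterable of (name, root_to_tip_variance, tree_length, bipartition).
    order: 1 = root-to-tip variance, 2 = tree length, 3 = bipartition.
    """
    # Negate the bipartition proportion so that an ascending sort puts the
    # best-supported genes first; variance and tree length sort ascending.
    # (Ascending tree length is an assumption of this sketch.)
    keyed = [(name, var, length, -bp) for name, var, length, bp in rows]
    ranked = sorted(keyed, key=itemgetter(*order))
    return [name for name, _, _, _ in ranked[:max_genes]]

# Toy values, made up for the example.
genes = [
    ("geneA", 0.0100, 2.0, 0.9),
    ("geneB", 0.0010, 1.0, 0.9),
    ("geneC", 0.0001, 0.5, 0.5),
]
print(rank_genes(genes, order=(3, 1, 2), max_genes=2))
```

Here geneA and geneB tie on bipartition support, so the second criterion (root-to-tip variance) decides between them and geneB comes first.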

These steps are separated out with the intention that you can examine the results of each step. In each case you can type python NAMEOFSCRIPT.py -h to get what each argument is. The example below should also help.

Installation

To install and use SortaDate you will need Python (it is probably already on your computer if you have a Mac or Linux machine) and three programs from the phyx set of programs (found at https://github.com/FePhyFoFum/phyx). While phyx has many different programs, we only need three, and to install these you will not need any software other than a compiler. Because you can install only part of the phyx package, instructions for this are given here. If you would like to install all of the programs, consult the phyx website.

Installing pxlstr, pxrmt and pxbp

The three programs that you will need from phyx are pxlstr, pxrmt and pxbp. To install them on Mac or Linux:

Go to https://github.com/FePhyFoFum/phyx then click the “Clone or download” and download the zip. Unzip the resulting file.
Open the Terminal and change directory to the src directory (if you downloaded and unzipped the phyx-master.zip in your Downloads directory, on Mac you will probably run the command cd Downloads/phyx-master/src)
Run ./configure. You will probably get a couple of errors, but you can ignore them. They would matter only if you needed all of the programs in phyx.
Run make pxlstr, then make pxrmt and then make pxbp. Then copy the programs into your PATH (so probably sudo cp pxlstr pxrmt pxbp /usr/local/bin/). You will have to type your password.
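To confirm that the copied programs are visible on your PATH, a short check like the following may help. It is a minimal sketch that uses only the Python standard library; the function name missing_programs is hypothetical.

```python
import shutil

# The three phyx programs named in the installation steps above.
REQUIRED_PROGRAMS = ("pxlstr", "pxrmt", "pxbp")

def missing_programs(names=REQUIRED_PROGRAMS):
    """Return the programs from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

missing = missing_programs()
if missing:
    print("not found on PATH:", ", ".join(missing))
else:
    print("all phyx programs found")
```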

Installing SortaDate

This is trivial. Because the package is a set of Python scripts, all you need to do is download the scripts here and run them directly from wherever you downloaded them. You can see how to use the commands in the example below. You can grab the repository by clicking “Clone or download” and downloading the zip. Unzip the resulting file and within it you will find all the files.

Example

There is an example dataset included in the repository. This is a partial dataset from the Jarvis et al. (2014) genomic bird paper. Once you have downloaded or cloned the repository, you can run these analyses:

Get the root-to-tip variance with python src/get_var_length.py examples/genes_trees/ --flend .tre.rr --outf examples/var --outg Struthio_camelus,Tinamou_guttatus
Get the bipartition support with python src/get_bp_genetrees.py examples/genes_trees/ examples/Chrono_Tent_Bird_study.new --flend .tre.rr --outf examples/bp
Combine the results from these two runs with python src/combine_results.py examples/var examples/bp --outf examples/comb
Sort and get the list of the good genes with python src/get_good_genes.py examples/comb --max 3 --order 3,1,2 --outf examples/gg
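If you prefer to drive the four example steps from one place, they can be wrapped in a small Python script. This is a sketch under stated assumptions: the helper names build_commands and run_pipeline are hypothetical, and the script and file paths are the ones from the example commands above, so it is meant to be run from the root of the repository.

```python
import subprocess
import sys

def build_commands(tree_dir, species_tree, outgroups, out_dir,
                   flend=".tre.rr", max_genes=3, order="3,1,2", src="src"):
    """Return the four example commands as argument lists."""
    py = sys.executable
    var, bp = f"{out_dir}/var", f"{out_dir}/bp"
    comb, good = f"{out_dir}/comb", f"{out_dir}/gg"
    return [
        [py, f"{src}/get_var_length.py", tree_dir, "--flend", flend,
         "--outf", var, "--outg", ",".join(outgroups)],
        [py, f"{src}/get_bp_genetrees.py", tree_dir, species_tree,
         "--flend", flend, "--outf", bp],
        [py, f"{src}/combine_results.py", var, bp, "--outf", comb],
        [py, f"{src}/get_good_genes.py", comb, "--max", str(max_genes),
         "--order", order, "--outf", good],
    ]

def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

# Build (but do not yet run) the commands for the bundled example dataset.
commands = build_commands(
    tree_dir="examples/genes_trees/",
    species_tree="examples/Chrono_Tent_Bird_study.new",
    outgroups=["Struthio_camelus", "Tinamou_guttatus"],
    out_dir="examples",
)
```

Calling run_pipeline(commands) then executes the steps in order. Keeping the command construction separate from the execution makes it easy to print and inspect each step first, in keeping with the advice to examine the results of every stage.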

Additional analyses

Rooting

If you need trees to be rooted, you can do this in many different ways. One way is with the phyx program pxrr (found at https://github.com/FePhyFoFum/phyx). Since SortaDate already requires three other phyx programs, installing a fourth should be straightforward: run make pxrr in the phyx src directory where the other programs were built. To then copy it into your PATH you can run sudo cp pxrr /usr/local/bin/.
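A rooting call could be assembled from Python along these lines. Treat this as a sketch only: the -t (tree file), -g (comma-separated outgroups) and -o (output file) flags are assumptions that should be checked against pxrr -h for your phyx version, and the helper name pxrr_command and the file names are hypothetical.

```python
import subprocess

def pxrr_command(tree_file, outgroups, out_file, program="pxrr"):
    """Build a pxrr call that roots `tree_file` on `outgroups`.

    The -t, -g and -o flags are assumptions; check them with `pxrr -h`.
    """
    return [program, "-t", tree_file, "-g", ",".join(outgroups), "-o", out_file]

# Hypothetical file names, used only for illustration.
cmd = pxrr_command("gene1.tre", ["Struthio_camelus", "Tinamou_guttatus"], "gene1.tre.rr")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```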

Conducting clock analyses

If you would like to conduct likelihood ratio tests for the clock for each of the gene trees, you can do this with PAUP*, which you can find at http://phylosolutions.com/paup-test/. There are scripts included in SortaDate for these analyses that add the necessary PAUP block to each file.
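For reference, the arithmetic behind a clock likelihood ratio test is short. The sketch below is the standard textbook form of the test, not code from SortaDate or PAUP*: the statistic is twice the difference in log-likelihood between the unconstrained and the clock-constrained tree, with n - 2 degrees of freedom for n taxa. The function name clock_lrt and the log-likelihood values are made up for the example.

```python
def clock_lrt(lnl_clock, lnl_nonclock, n_taxa):
    """Return (test statistic, degrees of freedom) for a clock LRT."""
    # Twice the log-likelihood gained by relaxing the clock constraint.
    statistic = 2.0 * (lnl_nonclock - lnl_clock)
    # An unrooted tree has 2n - 3 branch lengths; a clock tree has n - 1 node heights.
    df = n_taxa - 2
    return statistic, df

# Made-up log-likelihoods for a 10-taxon gene tree.
stat, df = clock_lrt(lnl_clock=-1010.0, lnl_nonclock=-1000.0, n_taxa=10)
print(stat, df)
```

The statistic is then compared against a chi-square distribution with df degrees of freedom, for example with scipy.stats.chi2.sf(stat, df) if SciPy is installed.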

Citations

The scripts and procedures discussed here are presented in Smith et al. (2018). So many genes, so little time: A practical approach to divergence-time estimation in the genomic era. PLoS ONE.

sortadate's Issues

Dealing with outgroups

Hello,

I am trying to use SortaDate. My data contains 11 outgroups, but not all outgroups appear in each gene tree. So, when I ran get_var_length.py, I got error: unrecognized arguments for --outg. Do you have any suggestion to fix it?

Thanks in advance :D
Pirada

Unit for variance in SortaDate

Hi Stephen,

Would you be able to share the units in which variance is calculated please? I cannot find this in the paper or online documentation.

Many thanks in advance,
Ben

variance output is NA

Thanks for the useful scripts! The var output for my newick trees has 'NA' for the variance column; tree length prints properly. I have played with different rooted and unrooted versions of the newick trees and always have the same problem.

I can run get_var_length.py with the example/genestrees/*.tre.rr and the script works normally.
I can run pxlstr -v and the variance prints normally. So I will proceed via this route.
Python 3.7.12

issue with pxrr rooted trees

I'm having trouble getting the get_var_length.py script to run with pxrr rooted trees. For example, I typically get the following error:

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Error: this really only works with nexus or newick. Exiting.
Aborted (core dumped)

when running

python ~/Phylo-software/SortaDate/src/get_var_length.py gene_trees --flend .part.treefile.rr.txt
--outf var --outg Cladomyrma_petalae_SRR2184127,Euprenolepis_procera_SRR2184132,Lasius_americanus_nr_N232,Lasius_atopus_nr_N220,Lasius_brevicornis_N239,Lasius_brunneus_N224,Lasius_californicus_SRR2184142,Lasius_carniolicus_N231,Lasius_emarginatus_N226,Lasius_flavus_N222,Lasius_fuliginosus_N223,Lasius_latipes_N227,Lasius_myops_N241,Lasius_nearcticus_N234,Lasius_niger_SRR2184143,Lasius_pallitarsis_N221,Lasius_platythorax_N235,Lasius_psammophilus_N236,Lasius_sabularum_N225,Lasius_sonobei_N238,Lasius_spathepus_N237,Lasius_subumbratus_nr_N230,Lasius_turcicus_N233,Metalasius_myrmidon_N240,Myrmecocystus_flaviceps_SRR2184150,Myrmecocystus_mendax_N121,Myrmecocystus_mexicanus_N228,Myrmecocystus_tenuinodis_N229,Paraparatrechina_brunnella_N122,Paraparatrechina_glabra_SRR2184166,Paraparatrechina_oceanica_SRR2184167,Paratrechina_antsingy_SRR2184168,Paratrechina_longicornis_N120,Paratrechina_longicornis_N125,Paratrechina_longicornis_PPP036,Paratrechina_longicornis_SRR2184169,Paratrechina_zanjensis_SRR2184170,Prenolepis_imparis_SRR2184180,Prenolepis_shanialena_PPP226,Pseudolasius_australis_SRR2184183,Zatania_albimaculata_SRR2184196,Zatania_cisipa_PPP203,Zatania_gloriosa_PPP201

uce_9.part.treefile.rr.txt

SortaDate Python script to add PAUP block

Hello,
It's mentioned that SortaDate includes scripts that add the necessary PAUP block to conduct likelihood ratio tests for the clock on each of the gene trees. I can't locate them; have I overlooked them on your website? Would it be possible to have access to this script? Thanks!

Index error when running "get_good_genes.py"

Hello,

I'm getting the error below when I run get_good_genes.py.

$ python /path/to/SortaDate/src/get_good_genes.py sortadate-comb.tsv --max 30 --order 3,1,2 --outf sortadate-gg.tsv
Traceback (most recent call last):
  File "/SortaDate/src/get_good_genes.py", line 41, in <module>
    tuples.append((spls[0],float(spls[1]),float(spls[2]),-float(spls[-1])))
IndexError: list index out of range

Here is a sample of the input file:

OG0004687.rr.tre	0.00138266	0.366563	0.714285714286
OG0005659.rr.tre	0.137497	4.78917	0.785714285714
OG0006077.rr.tre	0.00031333	0.806113	0.785714285714
OG0002883.rr.tre	0.000139026	0.170492	0.714285714286
OG0004136.rr.tre	0.0538138	5.35008	0.857142857143
OG0004637.rr.tre	0.000655984	0.69533	0.857142857143
OG0003807.rr.tre	0.000490135	0.604678	0.928571428571
OG0003609.rr.tre	0.0781797	9.26864	0.928571428571
OG0005328.rr.tre	0.00155003	0.621942	0.714285714286
OG0003624.rr.tre	0.0246032	4.08067	1.0
OG0006631.rr.tre	0.00231353	1.50411	0.857142857143
OG0005406.rr.tre	0.00943378	2.35777	1.0
OG0004489.rr.tre	0.00101057	1.57531	1.0
OG0002972.rr.tre	3.19924e-05	0.0494835	0.214285714286
OG0002527.rr.tre	2.7528e-05	0.0637047	0.428571428571
OG0003340.rr.tre	0.00322205	1.08101	0.785714285714
OG0006105.rr.tre	0.0105704	3.41476	0.785714285714
OG0004584.rr.tre	0.00242958	0.545403	0.785714285714
OG0004306.rr.tre	0.00104878	0.328157	0.571428571429
OG0006537.rr.tre	0.113686	2.93719	0.785714285714

I would appreciate any help.

Script for LRT clock tests with PAUP

Hello,
I am interested in testing your Python scripts to filter my genes based on their "clock-likeness". You mention in the tutorial that there are scripts in SortaDate that can conduct LRT analyses with PAUP (is this the "clockChecker" option in the new version of PAUP?).
Which script should I use for this analysis? Or is running get_var_length.py enough?
Thanks
Regards
Nicolas

issue/bug getting root to tip variation

I have the following slurm script:

#!/bin/sh
#SBATCH --job-name=sorta_date           # job name
#SBATCH --mem=10000                     # 10 gb of ram
#SBATCH --ntasks=1                      # number of tasks
#SBATCH --cpus-per-task=1               # cores each task needs
#SBATCH --partition=public-cpu          # partition
#SBATCH --time=02:00:00                 # max two hours 

module load GCC/11.3.0 OpenMPI/4.1.4 phyx/1.3 Anaconda3

SD="SortaDate/src"
REROOT="rerooted/"
SPECIESTREE="noflank_parti-gene_30p.rr.treefile"

printf "get root to tip variation\n"
python ${SD}/get_var_length.py ${REROOT} --flend .rr.treefile --outf analysis/var --outg GCA905333025_gen,SRR2083676_tra,SRR921609_tra,SRR2083695_tra,SRR10334043_uce,SRR10334056_uce

# get bipartition support
printf "get bipartition support\n"
python ${SD}/get_bp_genetrees.py --flend .rr.treefile ${REROOT} ${SPECIESTREE} --outf analysis/bp

# combine results
printf "combine_results\n"
python ${SD}/combine_results.py analysis/var analysis/bp --outf analysis/comb

# get good genes
printf "get good genes\n"
python ${SD}/get_good_genes.py analysis/comb --max 200 --order 3,1,2 --outf analysis/good_genes

Running it gives me a similar error to this closed topic.

$ cat slurm-6074476.out
get root to tip variation
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
Error: this really only works with nexus or newick. Exiting.
Error: this really only works with nexus or newick. Exiting.
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
Error: this really only works with nexus or newick. Exiting.
directory: rerooted
file ending for trees: .rr.treefile
outgroups: GCA905333025_gen,SRR2083676_tra,SRR921609_tra,SRR2083695_tra,SRR10334043_uce,SRR10334056_uce
phyx location: 
outfile: analysis/var
get bipartition support
directory: rerooted
file ending for trees: .rr.treefile
species tree: noflank_parti-gene_30p.rr.treefile
phyx location: 
outfile: analysis/bp
combine_results
outfile: analysis/comb
get good genes
order: bipartition root-to-tip variance treelength
outfile: analysis/good_genes
Traceback (most recent call last):
  File "/home/users/c/cardenac/phylo/2023_adephaga/SORTADATE/SortaDate/src/get_good_genes.py", line 41, in <module>
    tuples.append((spls[0],float(spls[1]),float(spls[2]),-float(spls[-1])))
IndexError: list index out of range

I looked at the var file, and indeed there are lines with empty values:

...
trimalauto_30p_ENSMPTG00005025653.rr.treefile	0.0205059	10.0266
trimalauto_30p_ENSMPTG00005027711.rr.treefile	0.0325183	17.8281
trimalauto_30p_uce194347.rr.treefile	0.0577492	12.7498
trimalauto_30p_ENSMPTG00005020600.rr.treefile		
trimalauto_30p_ENSMPTG00005022173.rr.treefile	0.0850962	40.1361
trimalauto_30p_ENSMPTG00005025359.rr.treefile	0.0842451	22.3873
...

Making a copy of the files in a new directory didn't help and returned a similar error. I checked the tree files (Newick format) and they look fine. I was able to run it successfully without indicating outgroups, but I still need accurate variation, so I removed those trees from the run. ( for i in $(cat analysis/var | tr " " "\t" | awk '{if ($2=="") print;}'); do echo mv rerooted/${i} rerooted_removed/${i}; done)
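The same check can be sketched in Python. This is a minimal illustration that assumes the var file is tab-separated with three fields per row (tree file name, variance, tree length), as in the excerpt above; the function name incomplete_rows is hypothetical.

```python
def incomplete_rows(lines, expected_fields=3):
    """Return names of rows that lack any of the expected tab-separated fields."""
    bad = []
    for line in lines:
        if not line.strip():
            continue  # skip completely blank lines
        fields = line.rstrip("\n").split("\t")
        # Count the non-empty value fields that follow the file name.
        values = [f for f in fields[1:] if f.strip()]
        if len(values) < expected_fields - 1:
            bad.append(fields[0])
    return bad

# Rows copied from the var file excerpt above.
sample = [
    "trimalauto_30p_uce194347.rr.treefile\t0.0577492\t12.7498\n",
    "trimalauto_30p_ENSMPTG00005020600.rr.treefile\t\t\n",
    "trimalauto_30p_ENSMPTG00005022173.rr.treefile\t0.0850962\t40.1361\n",
]
print(incomplete_rows(sample))
```

On a real file this would be called as incomplete_rows(open("analysis/var")), and the returned names are the trees to inspect or set aside before running get_good_genes.py.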

However, there were still some warnings/errors in the slurm out file

get root to tip variation
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
directory: rerooted/
file ending for trees: .rr.treefile
outgroups: GCA905333025_gen,SRR2083676_tra,SRR921609_tra,SRR2083695_tra,SRR10334043_uce,SRR10334056_uce
phyx location: 
outfile: analysis/var
get bipartition support
directory: rerooted/
file ending for trees: .rr.treefile
species tree: noflank_parti-gene_30p.rr.treefile
phyx location: 
outfile: analysis/bp
combine_results
outfile: analysis/comb
get good genes
order: bipartition root-to-tip variance treelength
outfile: analysis/good_genes

So I am not sure if the fact I'm running phyx as a module on a slurm cluster is the problem.

I think it could be because I'm not sure how to indicate where phyx lives (the --loc parameter), since I loaded it as a module. It's in my PATH (...:/opt/ebsofts/phyx/1.3-foss-2022a/bin:...), but just running pxrr --help throws an error.

python /opt/ebsofts/phyx/1.3-foss-2022a/bin/pxrr --help
SyntaxError: Non-UTF-8 code starting with '\xd2' in file /opt/ebsofts/phyx/1.3-foss-2022a/bin/pxrr on line 2, but no encoding declared; see https://python.org/dev/peps/pep-0263/ for details

While, of course, without indicating the bin in my paths python pxrr --help works fine.

I would try running it on my computer, but I had an issue installing phyx....