cpockrandt / phylocsfpp

PhyloCSF++ computes PhyloCSF tracks for whole-genome multiple sequence alignments, scores single MSAs, and annotates CDS features in GFF/GTF files with PhyloCSF and confidence scores.

License: Other

C++ 98.78% CMake 0.78% Shell 0.44%

phylocsfpp's People

Contributors

cpockrandt, martin-steinegger, milot-mirdita, pskvins

phylocsfpp's Issues

Scores are not calculated for multi-fasta input by `score-msa` subcommand

Hi, dear developer,

I want to score a single alignment with phylocsf++ as shown below. However, the output file contains only the header lines; no scores were calculated. Is there something wrong with my input file?

cat test.fa
# >hg38
# ATGTGCAAATTTCCCGGGACGTGACGAATGCAGCTGGTAAGGATCATACAA---AAG
# >mm39
# ATGTCCAAATTTCCCGGGACGTGACGAATGCAGCTGGTAAGGATCATACAAGGGAAG
# >rn6
# ATGTCCAAATTTCCCGGGACGTGACGAATGCAGCTGGTAAGGATCATACTAGCGAAG

phylocsf++ score-msa --threads 1 100vertebrates test.fa
# Done!

cat test.fa.scores 
# # PhyloCSF scores computed with PhyloCSF++ v1.2.0 (9643238d, 2022-01-04)
# seq	start	end	strand	phylocsf-score	bls-score

Thanks for your help!
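
A quick sanity check on the posted alignment may help narrow this down. The following is a minimal Python sketch, not part of PhyloCSF++; read_fasta and check_alignment are hypothetical helper names. It checks that all rows have the same length, that the length is a multiple of 3, and whether frame 0 contains a stop codon. On the alignment above (57 columns) it reports an in-frame TGA at columns 22-24 in all three rows. Whether that explains the empty score file is an assumption that this thread does not confirm.

```python
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def check_alignment(records: Dict[str, str]) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        findings.append(f"rows have different lengths: {sorted(lengths)}")
    for name, seq in records.items():
        if len(seq) % 3 != 0:
            findings.append(f"{name}: length {len(seq)} is not a multiple of 3")
        # Walk the row codon by codon in frame 0; gap characters are left in place.
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                findings.append(f"{name}: in-frame stop {codon} at column {i + 1}")
    return findings


# The alignment from the report above, without the leading "# " markers.
ALIGNMENT = """>hg38
ATGTGCAAATTTCCCGGGACGTGACGAATGCAGCTGGTAAGGATCATACAA---AAG
>mm39
ATGTCCAAATTTCCCGGGACGTGACGAATGCAGCTGGTAAGGATCATACAAGGGAAG
>rn6
ATGTCCAAATTTCCCGGGACGTGACGAATGCAGCTGGTAAGGATCATACTAGCGAAG
"""

if __name__ == "__main__":
    for finding in check_alignment(read_fasta(ALIGNMENT)):
        print(finding)
```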

Error when running mmseqs createsubdb: sh: 1: Syntax error: ")" unexpected

Hi,

I'm interested in running PhyloCSF++ with annotate-with-mmseqs on Chinese hamster, but I am getting an error when it reaches the mmseqs createsubdb step:

./phylocsf++ annotate-with-mmseqs --threads 35 --output conservation species.txt 58mammals criGri1.refGene.gtf

Checking whether MMseqs2 is installed ...
Processing GFF /mnt/HDD2/conservation/criGri1.refGene.gtf
Created the genomesDB directory.
Created the cds directory.
Reading reference genome of GFF file /mnt/HDD2/conservation/fastas/criGri1.fa ...
Reading GFF file and extracting CDS coordinates ...
MMseqs2: Indexing genomes ...
MMseqs Version: 42bf6438fec1e1b987f46d8f6d4b09926ecfc019
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 1
Offset of numeric ids 0
Compressed 0
Verbosity 3

Converting sequences
[410465] 1m 2s 307ms
Time for merging to genbankseqs_h: 0h 0m 0s 74ms
Time for merging to genbankseqs: 0h 0m 43s 532ms
Database type: Nucleotide
Time for processing: 0h 1m 46s 799ms
bash -c $'mmseqs createsubdb <(awk '$3 == 0' /mnt/HDD2/conservation//genomesDB/genbankseqs.lookup) conservation//genomesDB/genbankseqs /mnt/HDD2/conservation//genomesDB/genbankseqs_0'
sh: 1: Syntax error: ")" unexpected

This is what the input species.txt file looks like:

chinese_hamster conservation/fastas/criGri1.fa
mouse conservation/fastas/Mus_musculus.GRCm39.dna.primary_assembly.fa
rat conservation/fastas/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa
human conservation/fastas/Homo_sapiens.GRCh38.dna.primary_assembly.fa
naked_mole_rat conservation/fastas/Heterocephalus_glaber_female.HetGla_female_1.0.dna.toplevel.fa
guinea_pig conservation/fastas/Cavia_porcellus.Cavpor3.0.dna.toplevel.fa
squirrel conservation/fastas/Ictidomys_tridecemlineatus.SpeTri2.0.dna.toplevel.fa
rabbit conservation/fastas/Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa
pika conservation/fastas/Ochotona_princeps.OchPri2.0-Ens.dna.toplevel.fa

I downloaded the reference GTF file from https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/genes/criGri1.refGene.gtf.gz and the reference FASTA file from https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/criGri1.fa.gz

Thanks so much,

Marina
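
The logged command uses the <( ... ) construct, which is bash process substitution, while the error message is prefixed with "sh:". That suggests the line was parsed by a POSIX sh (for example dash) rather than bash; this is an inference from the log, not something confirmed in this thread. As a possible manual workaround, the same step can be run without any shell. The Python sketch below assumes the lookup file has the file number in its third whitespace-separated column (as the awk filter implies) and that mmseqs createsubdb accepts a subset file as its first argument; select_lookup_lines and create_subdb are hypothetical names.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def select_lookup_lines(lookup_path: Path, file_number: int) -> List[str]:
    """Return lookup lines whose third column equals file_number (like awk '$3 == N')."""
    selected = []
    with open(lookup_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 3 and fields[2] == str(file_number):
                selected.append(line.rstrip("\n"))
    return selected


def create_subdb(db: Path, file_number: int, mmseqs: str = "mmseqs") -> None:
    """Write the selected lines to a temp file and call mmseqs without a shell."""
    lookup = Path(str(db) + ".lookup")
    lines = select_lookup_lines(lookup, file_number)
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
        tmp.write("\n".join(lines) + "\n")
    # Passing an argument list means neither sh nor bash parses the command line.
    subprocess.run(
        [mmseqs, "createsubdb", tmp.name, str(db), f"{db}_{file_number}"],
        check=True,
    )
```

For the run above, this would be called as create_subdb(Path("/mnt/HDD2/conservation/genomesDB/genbankseqs"), 0). It only reproduces this one step; PhyloCSF++ would still need to pick up the resulting database on its own.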

TODO

  • add clang support
  • add tests for gtf annotation to travis
  • Option --dry: report which species cannot be mapped, suggest the best-fitting model, and output a --species string to speed up computation
  • weighted mean PhyloCSF scores instead of the mean score in annotation tools
  • maf format: support tab-separation and missing scores (i.e. "a\n" lines)
  • print output file path instead of just "Done"
  • Check libraries other than cblas, such as OpenBLAS

@martin-steinegger Can you look into this:

  • FAQ in wiki for annotate-with-mmseqs
  • add mmseqs params in phylocsf++ and forward them to mmseqs

OpenBLAS Warning creating heavy logs

Hello,

I've installed PhyloCSF++ using conda in a newly created environment. I then launched it using

phylocsf++ build-tracks --output-raw-phylo 1 \
--output-regions 1 --genome-length 143726002 --coding-exons ./model/coding_exons.txt \
--threads 24 --output coding_regions_test ./model/mwga /path/to/maf

It hasn't finished running yet, but it has already produced a very large log file (several tens of GB) containing the same line, repeated over and over:
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

Do you have any idea what could cause this and how to solve it?
Thanks in advance!

Antonin
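
The warning text comes from OpenBLAS itself. A commonly suggested workaround for it is to restrict OpenBLAS to one thread through the OPENBLAS_NUM_THREADS environment variable; whether that silences the warning for the conda build used here is an assumption. The Python sketch below launches a command with that variable set and collapses runs of identical log lines so the log stays small either way. collapse_repeats and run_single_threaded_blas are hypothetical names.

```python
import os
import subprocess
from typing import Iterable, Iterator, List


def collapse_repeats(lines: Iterable[str]) -> Iterator[str]:
    """Yield each run of identical consecutive lines once, with a repeat count."""
    previous = None
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line == previous:
            count += 1
            continue
        if previous is not None:
            yield previous if count == 1 else f"{previous} [repeated {count} times]"
        previous, count = line, 1
    if previous is not None:
        yield previous if count == 1 else f"{previous} [repeated {count} times]"


def run_single_threaded_blas(command: List[str]) -> int:
    """Run command with OPENBLAS_NUM_THREADS=1 and print a de-duplicated log."""
    env = dict(os.environ, OPENBLAS_NUM_THREADS="1")
    process = subprocess.Popen(
        command, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in collapse_repeats(process.stdout):
        print(line)
    return process.wait()
```

It would be called with the full argument list of the run, for example run_single_threaded_blas(["phylocsf++", "build-tracks", ...]) with the options from the command above. The --threads option of PhyloCSF++ is left untouched; only the BLAS library's own threading is limited.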

Issue in assert(lpr_per_codon[thread_id].size() * 3 <= bls_per_bp[thread_id].size());

Hi,
I was trying to run PhyloCSF++ with the build-tracks command and ran into a problem; could you take a look at what might be causing it?

First of all, the command I used was
build-tracks
--threads 8
--output-phylo 1
--genome-length 23513712
--coding-exons codingexon_for_drosophila.txt
23flies
chr2L.maf

For the MAF, FASTA, and GTF files, I used chr2L.maf.gz, dm6.fa.gz, and dm6.ncbiRefSeq.gtf.gz from UCSC, which you can find at
https://hgdownload.soe.ucsc.edu/goldenPath/dm6/multiz27way/maf/
https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/
https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/genes/ respectively,
and I created codingexon_for_drosophila.txt with the command you provided (awk -F'\t' 'BEGIN { OFS="\t" } ($3 == "CDS") { print $1, $7, $8, $4, $5 }').
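
For reference, the awk command keeps only CDS rows and prints GTF columns 1, 7, 8, 4, and 5 (sequence name, strand, frame, start, end). A minimal Python sketch of the same filter is shown here; coding_exons is a hypothetical name, and the sample row in the usage note is made up for illustration.

```python
from typing import Iterable, Iterator


def coding_exons(gtf_lines: Iterable[str]) -> Iterator[str]:
    """Yield 'seqname, strand, frame, start, end' (tab-separated) for each CDS row."""
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # GTF columns: 1 seqname, 3 feature, 4 start, 5 end, 7 strand, 8 frame.
        if len(fields) >= 8 and fields[2] == "CDS":
            yield "\t".join([fields[0], fields[6], fields[7], fields[3], fields[4]])
```

Given a row such as chr2L, ncbiRefSeq, CDS, 100, 250, ., +, 0, gene_id "g1"; (tab-separated), the function yields chr2L, +, 0, 100, 250. Comment lines and non-CDS features are skipped, as with the awk version.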

When I run the program with the command above, it always stops at line 152 of phylocsf++build_tracks.hpp, which reads assert(lpr_per_codon[thread_id].size() * 3 <= bls_per_bp[thread_id].size());.

For lpr_per_codon and bls_per_bp, I got the following values:
(lldb) p lpr_per_codon[thread_id].size() * 3
(unsigned long) $2 = 43596
(lldb) p bls_per_bp[thread_id].size()
(vector<double, allocator<double> >::size_type) $4 = 5207

I also tried MAF files from which I had removed parts of chr2L.maf: rows for the species triCas2, musDom2, apiMel4, and anoGam1, which are not in the 23flies model, and alignment blocks that contain only a single species with score=0. I got exactly the same error in these cases too.

I also tried running with --threads 1 on the pruned files, but that produced the same error.

If you need any more information about the issue, please leave a comment below and I will answer as soon as possible.

Thank you.

Cannot generate scores for hg38 maf

Hi,

I cannot run the program with a FASTA alignment file; should I use only the MAF format? Also, if I use the publicly available https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/maf/chr1_KI270707v1_random.maf.gz as the MAF file and run phylocsf++ build-tracks 58mammals maf, it does not generate scores. It prints a long log like this:

OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.
Done!
WARNING: gorilla in the model does not occur in alignment file(s). Check --species to select a subset (this affects the power/confidence track).
WARNING: squirrel_monkey in the model does not occur in alignment file(s). Check --species to select a subset (this affects the power/confidence track).
WARNING: bushbaby in the model does not occur in alignment file(s). Check --species to select a subset (this affects the power/confidence track).
WARNING: chinese_tree_shrew in the model does not occur in alignment file(s). Check --species to select a subset (this affects the power/confidence track).

I ignored the first warning and tried to fix the second by adding species name mappings, but I still get the same warnings. Could you check whether you can reproduce the error? Also, could you add an hg38.100way.maf file to the example folder so that I can check the file structure? I want to try the mammals model with a human MAF: the bird example runs without errors, but I cannot generate scores for the human MAF file above.

Thanks!
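
The warnings name species as they appear in the model (gorilla, squirrel_monkey, ...), so it can help to list which names actually occur in the MAF file before writing a mapping. The Python sketch below relies only on the MAF convention that sequence rows start with "s" and that the source field has the form name.sequence; maf_species is a hypothetical helper. How PhyloCSF++ resolves these names to model species is not covered here.

```python
from collections import Counter
from typing import Iterable


def maf_species(lines: Iterable[str]) -> Counter:
    """Count alignment rows per species prefix (text before the first '.')."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        # MAF sequence rows look like: s <src> <start> <size> <strand> <srcSize> <text>
        if len(fields) >= 2 and fields[0] == "s":
            counts[fields[1].split(".", 1)[0]] += 1
    return counts
```

For a gzipped file, the lines can be read with gzip.open(path, "rt") and passed in directly. Names that are missing from the resulting counter are the ones the WARNING lines would be expected to mention.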

creating multiple sequence alignments for plants

Hi,

I work on plants, for which precomputed MSAs seem to be lacking.

What software would you recommend for creating genome-scale MSAs of plants? I can think of pangenome software like PGGB and minigraph-cactus, and converting from odgi to MSA formats, but I don't know of other reliable non-graph-based alternatives.

Thanks
