cov-lineages / pangolin


Software package for assigning SARS-CoV-2 genome sequences to global lineages.

License: GNU General Public License v3.0

Python 98.98% Dockerfile 1.02%

pangolin's Introduction

pangolin

Phylogenetic Assignment of Named Global Outbreak LINeages

Web application

The pangolin web app is maintained by the Centre for Genomic Pathogen Surveillance

pangolin's People

Contributors

Stargazers

Watchers

Forkers

antunderwood lisaa610 jtmccr1 richardharrigan druvus ohell ambarishk viralverity aamcgenomics theboocock maxibor biocyberman vikash84 emos8710 artpoon thebready channing-zeng gregcaporaso omarai rintukutum wook2014 zhemingfan pvanheus peterk87 ericfournier3 taneung haroon123 ewouth fmaguire shelly77 winterki dridk ingramoralesclaro angiehinrichs appliedmicrobiologyresearch jd2112 mikeinnes pastvir donalbonny apetkau delfair maikschr hsnguyen bgruening kevinlibuit fredericlemoine doytsujin tgallagh hkeward victormaricato saravananpsg buingocniem rdeborja mabvakureb sudiptochoudhury aristchatz71 javiertognarelli shulp2211 flufighter wisplinghoffgenomics smu19github alphadera-labs jdavisturak danam-molbiol karyakarte matt-sd-watson azrirosehaizat corneliusroemer monsanto-pinheiro florianzwagemaker josielikescats c4fun msrcos3s marynjerey lrq29134 panariellofrancesco juanmaria-rr laudarch suhasmhaske scudodipietra carlottaolivero t227766 baileyglen fanninpm hivdb xuxingyu821 pcjentsch wm75 mindoftea wook2013 qianqli sashka3076 cenisera taolabuv lvreynoso paultcn lovelifelovecode tianhutang mouhuihui afdhalrashid

pangolin's Issues

Error in header line of lineages.csv

Spelling mistake: amiguity -> ambiguity

lineage_report.csv can't be found.

I am running pangolin version 1.1.4. After several tries with specific temp directories (--tempdir xxx), I still can't find the lineage_report.csv output file under outdir. Could pangolin be removing the output before it finishes (i.e. cleaning up the file from tempdir by mistake)?

The job ends with messages like the following:

[Mon May  4 15:32:30 2020]
Finished job 2.
1486 of 1488 steps (100%) done

[Mon May  4 15:32:30 2020]
rule add_failed_seqs:
    input: /user/pangtmp/tmpzs4hfj4m/lineage_report.pass_qc.csv, /user/pangtmp/tmpzs4hfj4m/query.failed_qc.fasta
    output: TestPang/lineage_report.csv
    jobid: 1

Job counts:
        count   jobs
        1       add_failed_seqs
        1
[Mon May  4 15:32:31 2020]
Finished job 1.
1487 of 1488 steps (100%) done

[Mon May  4 15:32:31 2020]
localrule all:
    input: TestPang/lineage_report.csv
    jobid: 0

tag release?

Would it be possible to start tagging releases so the tool could be added to Bioconda?

Could you share the code?

This is related to issue #2.

It would be useful if there was a tool that would automatically place new sequences in their lineages, or name a new lineage.

Thank you.

following on from issue #19

continuing the discussion from #19
OK, thanks, I'm also asking about the way your output is presented. 95 in the last column corresponds to 95% support from what I interpret in your answer -- is that correct?

When you say "quite high", what kind of cutoff would you suggest?

Thank you.

Awesome package, do you think you could put it on pypi?

Hey, great work on this package. I wondered whether you could put it on PyPI; I was hoping to include it in a conda environment for a pipeline I have been working on.

Thanks again for your great work!

temp files not cleaned up

The temp folder is never deleted after the final output file is created?

% pangolin -t 4 -o sampledir/pangolin sampledir/genome.fasta

% find sampledir

sampledir/pangolin
sampledir/pangolin/temp
sampledir/pangolin/temp/expanded_query
sampledir/pangolin/temp/query_alignments
sampledir/pangolin/temp/query_alignments/tax1tax.aln.fasta.log
sampledir/pangolin/temp/query_alignments/tax1tax.aln.fasta.parstree
sampledir/pangolin/temp/query_alignments/tax1tax.aln.fasta.treefile
sampledir/pangolin/temp/query_alignments/tax1tax.aln.fasta.splits.nex
sampledir/pangolin/temp/query_alignments/tax1tax.aln.fasta.contree
sampledir/pangolin/temp/query_alignments/tax1tax.aln.fasta.iqtree
sampledir/pangolin/temp/query_alignments/tax1tax.aln.fasta.ckp.gz
sampledir/pangolin/lineage_report.csv

ISO ID Country, Province, city, lat, long missing.

Hi, I made a desktop app that uses the Johns Hopkins dataset and your data.
It imports your data and exports it to an xlsx file with geolocation info added, so your data can be sorted and compared more easily.

lineages.xlsx
This way it can be easily imported into QGIS, ArcGIS or other platforms for geographic visualization.
I can send you the app if you want (I have to translate some words first, though).
(It's a csv/xls file; change the extension to csv or xls before opening. I had to use the xlsx extension in order to drag the file here.)
I'm a clinical biochemistry graduate (Licenciado en Bioquímica Clínica) working here in Argentina.
Thanks for all your hard work.

Some entries in the lineages.csv bootstrap column look like dates

e.g. Apr-99

Maybe this got auto-converted from 04/99 e.g. in Excel?

Also, is it documented anywhere how to interpret this column?
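
A quick way to find the affected rows is to scan the column for values shaped like a spreadsheet date. The sketch below is only an illustration: `looks_like_date` is a hypothetical helper, not part of pangolin, and it assumes the damaged entries all follow the `Mon-YY` pattern seen in `Apr-99`.

```python
import re

# Matches spreadsheet-style dates such as "Apr-99" (three-letter month, two-digit year).
DATE_LIKE = re.compile(
    r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2}$"
)

def looks_like_date(value):
    """Return True if a bootstrap entry appears to have been auto-converted to a date."""
    return bool(DATE_LIKE.match(value.strip()))

# Example entries of the kinds reported for the bootstrap column.
entries = ["90/96", "Apr-92", "98", "Apr-99"]
suspect = [entry for entry in entries if looks_like_date(entry)]
print(suspect)
```

Flagging the rows is as far as this goes; recovering the original value would need the unconverted source file.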

Error in rule assign_lineages:

Hi, thank you for your wonderful work and documentation! I was able to follow along. However, I was not able to run my file and got this error message. I would also like to know whether pangolin only supports GISAID sequences and not sequences from NCBI.

Thanks and I appreciate your help.

Error in rule assign_lineages:
    jobid: 0
    output: /Users/Swan/Documents/_ResearchProjects/15. Covid-19/lineage_report.csv

RuleException:
CalledProcessError in line 87 of /Users/Swan/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin/scripts/assign_query_file.smk:
Command 'set -euo pipefail;  touch /Users/Swan/Documents/_ResearchProjects/15. Covid-19/lineage_report.csv' returned non-zero exit status 1.
  File "/Users/Swan/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin/scripts/assign_query_file.smk", line 87, in __rule_assign_lineages
  File "/Users/Swan/miniconda3/envs/pangolin/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Exiting because a job execution failed. Look above for error message

Add data prep command as flag to alternate pipeline

There are lineages with only one representative defined.

As of 14/3/2020, B.4.1 only has one representative defined, EPI_ISL_413597. If we are being strict, doesn't this mean that a query sequence can never get assigned to this lineage?

B.4 cannot be parented by both B and B.4.1!

This is a snippet from anonymised.aln.fasta.treefile:

The location of 175_B.4 implies (as we would expect) that lineage B.4 is parented by B.
But the location of 72_B.4 implies that B.4 is parented by B.4.1.

Something is wrong, I think.

I wonder if the assignment of 72_B.4 might be a mistake here. The bootstraps indicate it could be a member of B.4.2.

Could you make a documentation of how to use the program?

Hi,
Thank you for making this program. It is helpful in this global epidemic situation. I was trying to assign some of the genomes to a clade and got this error:

pangolin query.fasta
Found the snakefile
The query file is /labs/pathogen/data/software/pangolin/EPI_ISL_417156.fasta
Number of threads is 1
MissingInputException in line 3 of /python3.7/site-packages/pangolin/scripts/assign_query_file.smk:
Missing input files for rule decrypt_aln:
/python3.7/site-packages/pangolin/scripts/../data/anonymised.encrypted.aln.fasta

Is the query a fasta file? I have checked the folder, but there is no anonymised.encrypted.aln.fasta file in it.

Thank you.

assign_lineage.py

I had to add: #!/usr/bin/env python on the first line of assign_lineage.py

Enhancement offer: CIPRES analyses?

Hi,

I've been working on some tooling to have the sequence alignment (MAFFT) and tree inference (IQTree) steps delegated to the CIPRES REST portal. The general idea is that the alignment step in particular is a bit expensive for people to run locally, so they might want to offload it to the cloud, and I understand from the CIPRES PI that they give preferential treatment to SARS-CoV-2 analyses right now.

With a bit of effort, I should be able to make it so that the tool is a bioconda package that you could use in place of the local steps you have here and here, but the upshot is that users would have to get a user account and app key registration on the CIPRES server.

In other words: better performance but with somewhat more complexity. Is this something you care for?

Compare to a solution provided by "curated clades.tsv + nextstrain/augur clade" ?

Hi
I am trying to understand the use case of this package. From what I can see, the ncov build of nextstrain/augur can use a "clades.tsv" file to annotate the whole tree.

Pros: annotates the whole tree, auspice ready.

Cons: One must have a curated clades.tsv. However, a well-synced, curated clades.tsv file is also very readable and maintainable (except that there should be a discussion on what to put there).

From what I can see with pangolin so far: it annotates the query sequences, but doesn't annotate the internal nodes of my own tree (not the guide tree provided by pangolin). This annotation is useful indeed. However, what should I do if I want to annotate the internal nodes (the whole tree) as well?

Or can I generate clades.tsv from another guide tree and alignment, and use that for my tree and genomes?

Reference genome and description data source

It would be very helpful to describe which reference sequence was used and how the guide tree was prepared.

Thanks

Pangolin assignments are vastly different from those in Nextstrain

I am running assignment using the "2020-04-27" version of the software. I found that the assignments I get are very different from those on the Nextstrain website. For example, using pangolin, the sequence "Brazil/SPBR-02/2020, EPI_ISL_413016" is assigned to the "B.2" lineage with high confidence (it is also B.2 in the 'lineages.2020-04-27.csv' file), but in Nextstrain (https://nextstrain.org/ncov/global?c=clade_membership&s=Brazil/SPBR-02/2020) it is "A1a". This is not an isolated case; I have found many more. I am wondering what is causing the difference: is it just notation?

Lineage B.11 in lineages.csv has no representatives.

$ grep "B\.11" lineages.csv
Netherlands/NoordBrabant_28/2020|EPI_ISL_414537||Netherlands|Noord_Brabant||2020-03-08,B.11,90/96,0.01,
Netherlands/NoordBrabant_30/2020|EPI_ISL_414539||Netherlands|Noord_Brabant||2020-03-08,B.11,Apr-92,0.01,
Netherlands/NoordBrabant_31/2020|EPI_ISL_414540||Netherlands|Noord_Brabant||2020-03-08,B.11,90/95,2.15,
Netherlands/NoordBrabant_32/2020|EPI_ISL_414541||Netherlands|Noord_Brabant||2020-03-08,B.11,90/99,0.64,
Netherlands/NoordBrabant_35/2020|EPI_ISL_414544||Netherlands|Noord_Brabant||2020-03-09,B.11,90/98,0.01,
Netherlands/Utrecht_14/2020|EPI_ISL_414553||Netherlands|Utrecht||2020-03-09,B.11,98,1.3,

attributeError?

Hello, would be grateful if you could troubleshoot this error preventing successful analysis. The empty command runs fine in a conda environment.
AttributeError in line 59 of /Users/cooperv/anaconda3/lib/python3.7/site-packages/pangolin/scripts/assign_query_file.smk:
'Workflow' object has no attribute 'cores'
File "/Users/cooperv/anaconda3/lib/python3.7/site-packages/pangolin/scripts/Snakefile", line 24, in
File "/Users/cooperv/anaconda3/lib/python3.7/site-packages/pangolin/scripts/assign_query_file.smk", line 59, in

Reason for status=fail / Lineage=None

Was wondering what the main reason for the following fail situation might be?

It's only 8 SNPs and 0 indels from WUHAN-1 so was a bit surprised.

taxon,lineage,SH-alrt,UFbootstrap,lineages_version,status,note
XXXX,None,0,0,2020-04-27,fail,N_content:0.78

A       T       C       G       N       K       M       R       W       Y
8229    8852    5048    5401    2341    4       4       4       2       18

I notice N_content:0.78 is neither a proportion nor a percentage; is it out by a factor of 10?
It should be 7.8% N.

2341/29903 = .07828645955255325552

https://github.com/hCoV-2019/pangolin/blob/ae8cf33001a7df33048e74c027562d2d9fcbee6b/pangolin/command.py#L99-L105
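
The arithmetic can be checked directly. This minimal sketch uses only the base counts quoted above; the variable names are illustrative and nothing here is taken from pangolin's own code.

```python
# Base counts as reported in the issue above.
counts = {"A": 8229, "T": 8852, "C": 5048, "G": 5401, "N": 2341,
          "K": 4, "M": 4, "R": 4, "W": 2, "Y": 18}

total = sum(counts.values())       # sequence length implied by the counts
n_fraction = counts["N"] / total   # proportion of N, between 0 and 1

print(f"total bases:  {total}")
print(f"N proportion: {n_fraction:.4f}")
print(f"N percent:    {n_fraction:.1%}")
```

The counts sum to 29903, so the N proportion is about 0.078 (7.8%), which supports the suspicion that the reported 0.78 is off by a factor of ten.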

pangolin --version should return 0 and go to stdout

% pangolin -v
pangolin: 1.1.4
%  echo $?
255

The GNU standard is:

--version returns 0 (success)
prints pangolin x.y.z (without the colon)
prints to stdout (not stderr)
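
For reference, Python's argparse can produce this behaviour with its built-in `version` action. The sketch below is an assumption about how it could be wired up, not pangolin's actual command-line code; the `prog` name and the version string are placeholders.

```python
import argparse
import contextlib
import io

def build_parser():
    # prog and the version string are placeholders for illustration.
    parser = argparse.ArgumentParser(prog="pangolin")
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s 1.1.4")
    return parser

def run(argv):
    """Parse argv and return (exit_code, text_written_to_stdout)."""
    captured = io.StringIO()
    code = 0
    with contextlib.redirect_stdout(captured):
        try:
            build_parser().parse_args(argv)
        except SystemExit as exc:
            # The version action prints its message and then exits.
            code = exc.code
    return code, captured.getvalue()

code, text = run(["--version"])
print(code, repr(text))
```

On current Python 3 releases the `version` action writes to stdout and exits with status 0, which matches the three points listed above.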

Installation doesn't include the config.yaml file in release 1.0

Installation with pip install doesn't include the config file leading to a workflow error

WorkflowError in line 2 of /home/CSCScience.ca/dhole/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin/scripts/Snakefile:
Config file /home/CSCScience.ca/dhole/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin/scripts/../config.yaml not found.
  File "/home/CSCScience.ca/dhole/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin/scripts/Snakefile", line 2, in <module>

Easy fix, just add the config to the setup.py file as such which worked for me

setup(name='pangolin',
      version=__version__,
      packages=find_packages(),
      scripts=['pangolin/scripts/assign_query_file.smk',
                'pangolin/scripts/assign_query_lineage.smk',
                'pangolin/scripts/prepare_package_data.smk',
                'pangolin/scripts/Snakefile',
                'pangolin/scripts/assign_lineage.py',
                'pangolin/scripts/lineage_finder.py',
                'pangolin/scripts/utils.py',
                'pangolin/scripts/defining_snps.py',
                'pangolin/scripts/prepare_package_data.smk',
                'pangolin/config.yaml'
                ],

Lots of "Tree doesn't exist here" errors

rule iqtree_with_guide_tree:
    input: deleteme1/temp/query_alignments/tax1tax.aln.fasta, 
/opt/python/lib/python3.7/site-packages/pangolin/scripts/../data/anonymised.aln.fasta.treefile
    output: deleteme1/temp/query_alignments/tax1tax.aln.fasta.treefile
    jobid: 4
    wildcards: query=tax1tax

Job counts:
        count   jobs
        1       iqtree_with_guide_tree
        1
Tree doesn't exist here deleteme1/temp/query_alignments/tax1tax.aln.fasta.treefile

I still seem to get a lineage output

deleteme1/lineage_report.csv 
taxon,lineage,SH-alrt,UFbootstrap
2020-17937,B.1.13,100,32

persistent_dict.py:709: UserWarning: could not obtain lock--delete

I think this error is related to snakemake and a job being interrupted?

pangolin -t 36 -o delme  genome.fa
Found the snakefile
The query file is genome.fa
Number of threads is 36
Looking in /opt/python/lib/python3.7/site-packages/lineages/data for data files...

Data files found
Sequence alignment:     /opt/python/lib/python3.7/site-packages/lineages/data/anonymised.encrypted.aln.fasta
Guide tree:             /opt/python/lib/python3.7/site-packages/lineages/data/anonymised.aln.fasta.treefile
Lineages csv:          /opt/python/lib/python3.7/site-packages/lineages/data/lineages.2020-04-27.csv
Job counts:
        count   jobs
        1       all
        1       assign_lineages
        1       decrypt_aln
        1       pass_query_hash
        4
Job counts:
        count   jobs
        1       pass_query_hash
        1
Job counts:
        count   jobs
        1       decrypt_aln
        1
2 hashed sequences written
Decrypted 261 sequences
/opt/python/lib/python3.7/site-packages/pytools/persistent_dict.py:709: UserWarning: could not obtain lock--delete
 '/home/tseemann/.cache/pytools/pdict-v2-query_store-py3.7.7.final.0/a718b23febb31f030bc71ed884bc027868fb4a8d62ff2d5186df9aafa0c6e8f1.lock' if necessary
  1 + _stacklevel)

Add --version or -V flag

% pangolin --version 2> /dev/null
pangolin 1.2.3

% echo $?
0

Should B.10 be renamed B.7.1?

B.10 is wholly contained within B.7 in the tree. I would've thought, following the naming convention, it would be named B.7.1. It's the only case of this in the whole tree. What's the justification? Is this likely to happen much in the future?

Would you consider adding the version number to the output csv?

Thanks!

ModuleNotFoundError: No module named 'lineages'

pangolin-1.1 master HEAD

  File "/opt/bin/pangolin", line 5, in <module>
    from pangolin.command import main
  File "//opt/python/lib/python3.7/site-packages/pangolin/command.py", line 10, in <module>
    import lineages
ModuleNotFoundError: No module named 'lineages'

installation woes

Hi,
I'm trying to install on ubuntu 16.06.6 LTS, python 3.6.10

I ran the command
python3 setup.py install

and got the following error when I tried running pangolin (after activating the environment):
Traceback (most recent call last):
  File "/usr/local/bin/pangolin", line 9, in <module>
    load_entry_point('pangolin==0.1.0', 'console_scripts', 'pangolin')()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python3.5/dist-packages/pangolin-0.1.0-py3.5.egg/pangolin/command.py", line 45
    print(f"The query file is {query}")
    ^
SyntaxError: invalid syntax
Any ideas what could be going wrong?
Thanks.
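
One likely reading of the traceback: the failing line is an f-string, and the package was installed under python3.5, while f-strings only exist from Python 3.6 onwards. The guard below is a hypothetical sketch of an early version check (the function name and message are made up for illustration); it deliberately avoids f-strings so that older interpreters can still parse it.

```python
import sys

MIN_PYTHON = (3, 6)  # f-strings were introduced in Python 3.6

def supports_fstrings(version_info=None):
    """Return True if the given (major, minor, ...) version can parse f-strings."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON

if not supports_fstrings():
    # str.format is used here because this line must parse on old interpreters.
    sys.exit("pangolin needs Python {}.{} or newer".format(*MIN_PYTHON))

print(supports_fstrings((3, 5, 2)), supports_fstrings((3, 6, 10)))
```

Checking which interpreter `python3` resolves to inside the activated environment would confirm whether this is the cause.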

questions about how to interpret the bootstrap column

Hi,

I think that a bootstrap of 95.0 means that 95% of the bootstrapped trees support the assignment -- is that correct? If so, what do 100/92 and 22/88 mean?

Thanks.
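
One possible reading, offered as an assumption rather than a confirmed answer: the iqtree command logged elsewhere on this page is run with both -alrt and -bb, and IQ-TREE labels branches as SH-aLRT support followed by ultrafast bootstrap support. Under that assumption, a paired value could be split as in this sketch (`parse_support` is a hypothetical helper, and treating a lone number as the bootstrap value is a further guess).

```python
def parse_support(value):
    """Split a support entry into (sh_alrt, ufboot).

    Assumes the order SH-aLRT/UFBoot; a single number is treated as
    the bootstrap value only, which is also an assumption.
    """
    parts = value.strip().split("/")
    if len(parts) == 2:
        return float(parts[0]), float(parts[1])
    if len(parts) == 1:
        return None, float(parts[0])
    raise ValueError("unrecognised support value: {!r}".format(value))

for raw in ["95.0", "100/92", "22/88"]:
    print(raw, "->", parse_support(raw))
```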

/tmp and .snakemake clash when running two at once

Need to use tmpdir() (import tempfile?) for each run - do not put directly into $TMPDIR
https://docs.python.org/3/library/tempfile.html
Otherwise I can't run > 1 easily when using -o folder/folder2 etc

Also can you put the .snakemake folder in the isolated temp dir too?
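
A minimal sketch of the per-run isolation being asked for, using the standard `tempfile` module. The `pangolin_` prefix and the `fake_run` stand-in are hypothetical; the real pipeline would be called in its place.

```python
import os
import tempfile

def run_in_private_tempdir(work):
    """Create a unique scratch directory, pass it to work(), then remove it."""
    # TemporaryDirectory picks a fresh random name under the system temp
    # location, so concurrent runs do not share intermediate files.
    with tempfile.TemporaryDirectory(prefix="pangolin_") as tmp:
        return work(tmp)

def fake_run(tmp):
    # Stand-in for the real pipeline: write one intermediate file.
    path = os.path.join(tmp, "query.aln.fasta")
    with open(path, "w") as handle:
        handle.write(">query\nACGT\n")
    return tmp

first = run_in_private_tempdir(fake_run)
second = run_in_private_tempdir(fake_run)
print(first != second, os.path.exists(first))
```

Each call gets its own directory, and the directory is deleted when the run finishes, which would also address the leftover temp folder reported in another issue.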

Will the nomenclature be updated periodically?

Great work! Thanks for providing this.
Will the nomenclature be updated periodically? Like every week?
Regards,
Shaokang

Alternative handling of 'bad' sequences

When I run on a sequence with >50% N, pangolin exits with error code 1 and outputs nothing.

Would it be possible to add an option to put this in the report instead?

% cat lineage_report.csv

ID,lineage,BS,ALRT
good,A.1,84,98
bad,-,0,0

maybe even add a Note column saying why it failed?
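
The request could look something like the sketch below. It is a hypothetical illustration, not pangolin's implementation: `report_row` and the `note` text are made up, the column names follow the example report above, and the 0.5 threshold comes from the ">50% N" case described.

```python
import csv
import io

def report_row(seq_id, sequence, max_n=0.5):
    """Return a report row; sequences with too many Ns get a placeholder and a note."""
    seq = sequence.upper()
    n_fraction = seq.count("N") / len(seq) if seq else 1.0
    if n_fraction > max_n:
        return {"ID": seq_id, "lineage": "-", "BS": 0, "ALRT": 0,
                "note": "N_content:{:.2f}".format(n_fraction)}
    # Lineage assignment is out of scope here; None marks "still to be assigned".
    return {"ID": seq_id, "lineage": None, "BS": None, "ALRT": None, "note": ""}

rows = [report_row("good", "ACGTACGTAC"), report_row("bad", "NNNNNNNACG")]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ID", "lineage", "BS", "ALRT", "note"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Writing a row for every input keeps the report the same length as the query file, so a failed sequence is visible instead of aborting the whole run.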

Output tree files for queries

It would be nice to have an option to output a tree file with the query placed in the context of the representative lineages.

Could you share the full genome fasta file before trimming

Please share the full fasta file.

iq-tree error

'pangolin hCoV-19AustraliaNSW022020EPI_ISL_4089762020-01-22.fasta -o out -t 16
Found the snakefile
The query file is /media/crl-kims/Data_Vol_3/Varun/covid-19/ncbi_india/all/hCoV-19AustraliaNSW022020EPI_ISL_4089762020-01-22.fasta
Number of threads is 16
Job counts:
count jobs
1 all
1 assign_lineages
1 decrypt_aln
1 pass_query_hash
4
Job counts:
count jobs
1 decrypt_aln
1
Job counts:
count jobs
1 pass_query_hash
1
2 hashed sequences written
Decrypted 261 sequences
Job counts:
count jobs
1 assign_lineages
1
Passing 1 into processing pipeline.
snakemake --nolock --snakefile /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/assign_query_lineage.smk --configfile /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/../config.yaml --config query_sequences=tax1tax outdir=out query_fasta=out/temp/query.fasta representative_aln=out/temp/anonymised.aln.fasta guide_tree=/home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/../data/anonymised.aln.fasta.treefile key=out/temp/query_key.csv --cores 16
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 assign_lineage
1 expand_query_fasta
1 gather_reports
1 iqtree_with_guide_tree
1 profile_align_query
1 to_nexus
7

[Tue Apr 28 11:19:53 2020]
rule expand_query_fasta:
input: out/temp/query.fasta
output: out/temp/expanded_query/tax1tax.fasta
jobid: 6

Job counts:
count jobs
1 expand_query_fasta
1
[Tue Apr 28 11:19:54 2020]
Finished job 6.
1 of 7 steps (14%) done

[Tue Apr 28 11:19:54 2020]
rule profile_align_query:
input: out/temp/anonymised.aln.fasta, out/temp/expanded_query/tax1tax.fasta
output: out/temp/query_alignments/tax1tax.aln.fasta
jobid: 5
wildcards: query=tax1tax

tbitr = 0, tbrweight = 3, tbweight = 0
####### in galn
file1 = out/temp/anonymised.aln.fasta
file2 = out/temp/expanded_query/tax1tax.fasta
generating a scoring matrix for nucleotide (dist=200) ... done
Constructing dendrogram ...
done. 262
GroupAglin..
group-to-group 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 / 262 17052203.752760

mafft-profile (nuc) Version 7.464
alg=A, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
1 thread(s)

Removing temporary output file out/temp/expanded_query/tax1tax.fasta.
[Tue Apr 28 11:19:55 2020]
Finished job 5.
2 of 7 steps (29%) done

[Tue Apr 28 11:19:55 2020]
rule iqtree_with_guide_tree:
input: out/temp/query_alignments/tax1tax.aln.fasta, /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/../data/anonymised.aln.fasta.treefile
output: out/temp/query_alignments/tax1tax.aln.fasta.treefile, out/temp/query_alignments/tax1tax.aln.fasta.parstree, out/temp/query_alignments/tax1tax.aln.fasta.splits.nex, out/temp/query_alignments/tax1tax.aln.fasta.contree, out/temp/query_alignments/tax1tax.aln.fasta.log, out/temp/query_alignments/tax1tax.aln.fasta.ckp.gz, out/temp/query_alignments/tax1tax.aln.fasta.iqtree
jobid: 4
wildcards: query=tax1tax

Job counts:
count jobs
1 iqtree_with_guide_tree
1
For AU test please specify number of bootstrap replicates via -zb option
[Tue Apr 28 11:19:56 2020]
Error in rule iqtree_with_guide_tree:
jobid: 0
output: out/temp/query_alignments/tax1tax.aln.fasta.treefile, out/temp/query_alignments/tax1tax.aln.fasta.parstree, out/temp/query_alignments/tax1tax.aln.fasta.splits.nex, out/temp/query_alignments/tax1tax.aln.fasta.contree, out/temp/query_alignments/tax1tax.aln.fasta.log, out/temp/query_alignments/tax1tax.aln.fasta.ckp.gz, out/temp/query_alignments/tax1tax.aln.fasta.iqtree

RuleException:
CalledProcessError in line 50 of /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/assign_query_lineage.smk:
Command 'set -euo pipefail; iqtree -s out/temp/query_alignments/tax1tax.aln.fasta -bb 1000 -au -alrt 1000 -m HKY -g /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/../data/anonymised.aln.fasta.treefile -quiet -o 'outgroup_A'' returned non-zero exit status 2.
File "/home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/assign_query_lineage.smk", line 50, in __rule_iqtree_with_guide_tree
File "/home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /media/crl-kims/Data_Vol_3/Varun/covid-19/ncbi_india/all/.snakemake/log/2020-04-28T111953.703347.snakemake.log
[Tue Apr 28 11:19:56 2020]
Error in rule assign_lineages:
jobid: 0
output: out/lineage_report.csv

RuleException:
CalledProcessError in line 68 of /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/assign_query_file.smk:
Command 'set -euo pipefail; snakemake --nolock --snakefile /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/assign_query_lineage.smk --configfile /home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/../config.yaml --config query_sequences=tax1tax outdir=out query_fasta=out/temp/query.fasta representative_aln=out/temp/anonymised.aln.fasta guide_tree=/home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/../data/anonymised.aln.fasta.treefile key=out/temp/query_key.csv --cores 16' returned non-zero exit status 1.
File "/home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/site-packages/pangolin-0.1.1_2020_04_27-py3.6.egg/pangolin/scripts/assign_query_file.smk", line 68, in __rule_assign_lineages
File "/home/crl-kims/miniconda3/envs/pangolin-2/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message'

Please help me with the error

Differing results between lineages.csv and output lineage

Hi
Excellent work - thank you!
I downloaded lineages.csv and selected the ~147 ones with "Representative = 1". When I processed these, there were twelve (see attached discordant.txt) which changed lineage between the input lineages.csv and the output lineage_report.csv. (This may have something to do with an alignment step I did.)

The median bootstrap values were lower in those with different reported lineages vs those that agreed (89.5 vs 94), but I was struck by the fact that all the discordant ones were either originally B10 (N=5), B1.10 (N=4) or B3.1 (N=3), and there did not appear to be any from those three lineages which DID agree.

Cheers

Add lineages version as a flag to pangolin.

Trouble with pangolin env

Very basic problem at my end - following installing and setting up miniconda for windows and cloning the pangolin git repo, I get the following error

(base) C:\Users\charl\COVID19\pangolin>conda env create -f environment.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

mafft
iqtree

Do I need to have mafft and iqtree git repos (from elsewhere) in the same folder as pangolin?

Thanks,
Charlotte

Honour $TMPDIR or allow user to choose it?

As I understand it, you are using temp in -o (outdir) for all the temp files.

Ideally Python's tempfile module (e.g. tempfile.gettempdir()) would be used so that it would honour $TMPDIR, which is usually super fast storage.

Would this be possible?
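
A sketch of how both options could coexist, assuming an explicit user choice should take priority over the environment. `resolve_tempdir` and its `cli_value` parameter are hypothetical names, not part of pangolin.

```python
import os
import tempfile

def resolve_tempdir(cli_value=None):
    """Pick the scratch location: an explicit user choice wins, else the system default."""
    if cli_value:
        return os.path.abspath(cli_value)
    # gettempdir() consults the TMPDIR, TEMP and TMP environment variables
    # before falling back to platform defaults.
    return tempfile.gettempdir()

print(resolve_tempdir())
print(resolve_tempdir("scratch"))
```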

mafft-profile will be defunct, use mafft --addprofile

See bottom of https://mafft.cbrc.jp/alignment/software/addsequences.html

Difference from the mafft-profile program

The --addprofile option covers all the situations where the mafft-profile program was used. Moreover, the former is applicable to larger datasets than the latter. Therefore, the mafft-profile program will be deleted in future releases.

Temp name collision

I started two jobs at the same time, and didn't choose a specific temp directory. This led to a name collision and a 'race condition' in temp files: Run 1 creates temp files with /tmp/tax6358tax.aln.*
Run 2 can create files with the same names at the same time. Specifying tempdir is a workaround, but it would be safer with unique temp file names.

'NoneType' object has no attribute 'annotations'

Traceback (most recent call last):
File "/home/xxx/miniconda3/envs/pangolin/bin/assign_lineage.py", line 86, in
main()
File "/home/xxx/miniconda3/envs/pangolin/bin/assign_lineage.py", line 80, in main
lineage = finder.get_lineage()
File "/home/xxx/miniconda3/envs/pangolin/bin/lineage_finder.py", line 67, in get_lineage
grandparent_lineage = self.query_node_parent.parent_node.annotations.get_value("lineage")
AttributeError: 'NoneType' object has no attribute 'annotations'
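
The traceback suggests the query's parent node has no parent of its own, so asking for the grandparent's annotations fails. The sketch below shows one way to guard such a lookup; it uses a tiny stand-in `Node` class and a hypothetical `ancestor_lineage` helper rather than the tree library or the code in lineage_finder.py.

```python
class Node:
    """Tiny stand-in for a tree node, for illustration only."""
    def __init__(self, lineage=None, parent=None):
        self.lineage = lineage
        self.parent_node = parent

def ancestor_lineage(node, levels):
    """Walk up `levels` parents; return None instead of raising past the root."""
    current = node
    for _ in range(levels):
        if current is None:
            return None
        current = current.parent_node
    return current.lineage if current is not None else None

root = Node("A")
query_parent = Node("A.1", parent=root)

parent_lineage = ancestor_lineage(query_parent, 1)
grandparent_lineage = ancestor_lineage(query_parent, 2)
print(parent_lineage, grandparent_lineage)
```

Returning None lets the caller decide how to report a query that sits directly under the root, instead of crashing the run.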

So now, does the number in the bootstrap column mean "the number of bootstrap trees, out of 100 tested, that support this lineage assignment"?

Originally posted by @rainwala in #18 (comment)

--version works but not in --help

@aineniamh thanks for adding --version !
needs to be in help still?

optional arguments:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output directory
  -d DATA, --data DATA  Data directory minimally containing a fasta alignment
                        and guide tree
  -n, --dry-run         Go through the motions but don't actually run
  -f, --force           Overwrite all output
  -t THREADS, --threads THREADS
                        Number of threads
pangolin: 0.1.1-2020-04-27

Why don't you describe each lineage by a set of mutations

For example the Italian-European lineage is C241T C3037T A23403G C14408T
The French New York lineage is C241T C3037T A23403G C14408T G25563T
Some trees might set G25563T as a subtree of C1059T but that's the game and people reporting why they chose alternative mutations/lineages pairs would be interesting.

sequence with all Ns gets a lineage assignment of A.1 with bootstrap 63

Hi, I thought I'd share that with you: I had a failed sequence that had all Ns (100% in the non-UTR regions), and pangolin outputted A.1 with bootstrap 63. Is this expected behaviour?

rule pass_query_hash

Hi,

Thank you for developing this tool. It'll be really helpful, especially when new lineage names are proposed, as more genomes are released.

Just to let you know: the command below crashes the pipeline when a space is found in the path.

touch /Users/username/Team Dropbox/User name/working/directory/temp/temp.txt
touch: Dropbox/User: No such file or directory
touch: name/working/directory/temp/temp.txt: No such file or directory

Cheers,
Anderson
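
The failure comes from the shell splitting the unquoted path at each space. The sketch below shows the effect and one standard fix using Python's `shlex.quote`; it illustrates the general technique and is not a patch to the pipeline's rule.

```python
import shlex

path = "/Users/username/Team Dropbox/User name/working/directory/temp/temp.txt"

unquoted = "touch " + path             # the shell splits this at every space
quoted = "touch " + shlex.quote(path)  # the path stays a single argument

# shlex.split mimics how a POSIX shell would break the command into arguments.
print(shlex.split(unquoted))
print(shlex.split(quoted))
```

With quoting, `touch` receives the full path as one argument, so directories with spaces in their names no longer break the command.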
