neherlab / pangraph

A bioinformatic toolkit to align genome assemblies into pangenome graphs

Home Page: https://neherlab.github.io/pangraph

License: MIT License

Makefile 0.71% Shell 1.72% Julia 63.80% Dockerfile 0.49% Python 33.27%
bioinformatics genome julia pangraph genome-assembly pangenome bacteria

pangraph's Introduction

pangraph


a bioinformatic toolkit to align large sets of closely related genomes into a graph data structure

Overview

pangraph provides both a command line interface and a Julia library to find homology amongst large collections of closely related genomes. The core of the algorithm partitions each genome into pancontigs, each of which represents a sequence interval related by vertical descent. Each genome is then an ordered walk along pancontigs; the collection of all genomes forms a graph that captures all observed structural diversity. pangraph is a standalone tool useful to parsimoniously infer horizontal gene transfer events within a community; perform comparative studies of genome gain, loss, and rearrangement dynamics; or simply to compress many related genomes.
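To make the "ordered walk" picture concrete, here is a minimal Python sketch with toy data. It is not pangraph's internal representation or its JSON schema; the block names, the `paths` dictionary and the `graph_edges` helper are all hypothetical and only illustrate how a set of walks over shared blocks implies a graph.

```python
from collections import Counter

# Hypothetical toy data: each genome is an ordered walk over (block_id, strand) pairs.
paths = {
    "genome_A": [("b1", "+"), ("b2", "+"), ("b3", "+")],
    "genome_B": [("b1", "+"), ("b3", "+")],               # b2 is absent
    "genome_C": [("b1", "+"), ("b2", "-"), ("b3", "+")],  # b2 is inverted
}

def graph_edges(paths):
    """Count how often each adjacency between oriented blocks is traversed."""
    edges = Counter()
    for walk in paths.values():
        # consecutive blocks on a walk define one edge of the graph
        for left, right in zip(walk, walk[1:]):
            edges[(left, right)] += 1
    return edges

edges = graph_edges(paths)
```

In this toy example the deletion in genome_B and the inversion in genome_C each show up as adjacencies that the other genomes do not share, which is the kind of structural diversity the graph is meant to capture.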

Installation

The core algorithm and command line tools are self-contained and require no additional dependencies. The library is written in Julia and thus requires Julia to be installed on your machine.

pangraph is available:

  • as a Julia library
  • as a Docker container
  • as a relocatable binary that can be compiled locally

For more extended instructions on installation please refer to the documentation.

Note: pangraph was written in Julia version 1.7.2. Compatibility with more recent versions of Julia is not guaranteed.

Julia Library

To install pangraph as a julia library in a local environment:

    # clone the repository
    git clone https://github.com/neherlab/pangraph.git && cd pangraph
    # build the package
    julia --project=. -e 'using Pkg; Pkg.build()'

The library can be accessed directly by entering the REPL:

    julia --project=.

Alternatively, command-line functionalities can be accessed by running the main src/PanGraph.jl script:

    # example: build a graph from E.coli genomes
    julia --project=. src/PanGraph.jl build -c example_datasets/ecoli.fa.gz > graph.json

Note that to access the complete set of functionalities, the optional dependencies must be installed and available in your $PATH.

Docker container

PanGraph is available as a Docker container:

    docker pull neherlab/pangraph:latest

See the documentation for extended instructions on its usage.

Relocatable binary

pangraph can be built locally on your machine by running the following inside the cloned repository:

    export jc="path/to/julia/executable"
    make pangraph && make install

This will build the executable and place a symlink into bin/. If jc is not explicitly set, it defaults to vendor/julia-$VERSION/bin/julia. If this file does not exist, Julia is downloaded automatically, provided the host system is Linux or macOS. For the compilation to work, MAFFT and mmseqs2 must be available in your $PATH; see optional dependencies.

Note: the PackageCompiler.jl documentation recommends using the officially distributed Julia binaries rather than those packaged by your Linux distribution. Compilation may not work with distribution-packaged binaries.

Note, pangraph was developed in Julia v.1.7.2. Compatibility with more recent versions of Julia is not guaranteed.

Optional dependencies

pangraph can use mash, MAFFT, mmseqs2 or fasttree for some optional functionalities, as explained in the documentation. To use these functionalities, install the corresponding tools and make them available on your $PATH.

Alternatively, a script bin/setup-pangraph is provided to install both tools into bin/ for Linux-based operating systems.
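As a quick sanity check before running the optional functionalities, the following Python sketch reports which tools cannot be found on the current $PATH. The executable names in `OPTIONAL_TOOLS` are assumptions and may differ on your system (for example in capitalization), and `missing_tools` is a hypothetical helper, not part of pangraph.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
OPTIONAL_TOOLS = ["mash", "mafft", "mmseqs", "fasttree"]

def missing_tools(tools=OPTIONAL_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` returns an empty list when every listed executable is reachable.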

Examples

Please refer to the tutorials within the documentation for an in-depth usage guide. For a quick reference, see below.

Align a multi-fasta sequence.fa and realign each pancontig with MAFFT

	pangraph build sequence.fa | pangraph polish > graph.json

Export a graph graph.json into export/pangraph.gfa as GFA for visualization

	pangraph export graph.json

Compute all pairwise graphs and estimate parsimonious number of events between strains. Output all computed data to directory pairs

	pangraph marginalize graph.json -o pairs

Citing

PanGraph: scalable bacterial pan-genome graph construction Nicholas Noll, Marco Molari, Liam P. Shaw, Richard Neher Microbial Genomics 9.6 (2023); doi: https://doi.org/10.1099/mgen.0.001034

License

MIT License

pangraph's People

Contributors

dependabot[bot], ivan-aksamentov, liampshaw, mmolari, nnoll, pierrebarrat, plaquette, rneher

pangraph's Issues

Unable to build pangraph

Hello, I am encountering the following error when attempting to build the pangraph on a monkeypox dataset. I am using this command: pangraph build in.fa

ERROR: BoundsError: attempt to access 1-element Vector{PanGraph.Graphs.Align.Clade} at index [2]

I would greatly appreciate any assistance.

Thanks,
Sumit

Mismatching SARS-CoV-2 Sequences

Hi there,

I created a PanGraph of 200 SARS-CoV-2 sequences using FASTA sequences as input, and it seems that eleven of them aren't represented correctly in the JSON file. I have uploaded the data here. The original FASTA file is sars_200_orig.fa. The sequences as represented by the PanGraph (reconstructed by me) are in sars_200_pangraph.fa, and the PanGraph JSON file is sars_200.json. The sequences that we believe aren't matching are England/BRBR-2B7C38D/2021|OV263009.1|2021-11-22, IMS-10178-CVDP-0E892CAB-4101-45AD-A5AB-82C23A77B85B|OX112182.1|2021-10-14, Denmark/DCGC-179132/2021|OW435830.1|2021-10-02, SouthAfrica/NHLS-UCT-GS-AD95/2021|OM739820.1|2021-08-30, IMS-10150-CVDP-7250DCF0-8B47-40DA-89AF-8E56669A8CB5|OU964784.1|2021-10-12, USA/CA-CDC-FG-175698/2021|OL666921.1|2021-11-18, Denmark/DCGC-196557/2021|OW446795.1|2021-10-24, Denmark/DCGC-151767/2021|OV830941.1|2021-08-12, USA/MA-CDCBI-CRSP_4TOCNN2I3HYX32WD/2021|MZ752955.1|2021-08-02, England/LOND-12FD57B/2021|OU391062.1|2021-05-23 and RNA|OX380648.1|2022-10-22.
Can you please look into it?
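For anyone who wants to repeat this kind of comparison, a minimal Python sketch is given here. It assumes both the original and the reconstructed sequences are available as FASTA text and that record names are the full header lines; `read_fasta` and `mismatching_records` are hypothetical helpers, not part of pangraph, and the reconstruction of sequences from the JSON file is not shown.

```python
def read_fasta(text):
    """Parse FASTA text into {record name: sequence}; the name is the full header line."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # join the sequence lines and normalise case so only real differences count
    return {n: "".join(parts).upper() for n, parts in records.items()}

def mismatching_records(original, reconstructed):
    """Names whose sequence is missing from, or differs in, the reconstruction."""
    return sorted(n for n, seq in original.items() if reconstructed.get(n) != seq)
```

Passing the parsed contents of the two FASTA files to `mismatching_records` yields the names of the records that differ.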

Best,
Harsh

Automate release routine

We could introduce a release script, which would perform all the steps required for a release, such that no step could be forgotten. This could include:

  • setting a version in the manifest. The tool called dasel can be used for manipulating TOML safely, or sed could be used for simplicity.
  • extracting changelog snippet to set text for a GitHub release. Can be as simple as splitting the string on ## and taking the first item (example)
  • checking that the changelog entry has a correct version
  • creating a GitHub release (which will also create a git tag, triggering CI build and release). For example using gh tool (example)

Example usage of such script:

./tools/release 1.2.3

Motivation: even though the current version is 0.6.0, the manifest still says 0.5.0:

version = "0.5.0"

It is easy to forget a step in the multi-step process. Introducing a script makes it possible to automate and enforce the release routine.
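A rough Python sketch of the checking steps is given here, only to illustrate the idea; it is not an existing tool in the repository. The function names are hypothetical, the manifest is assumed to contain a single top-level `version = "..."` line, and the changelog is assumed to use `## ` headings with the newest entry first. Creating the GitHub release itself is left out.

```python
import re

def manifest_version(manifest_text):
    """Return the first `version = "..."` value found in a TOML manifest, or None."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', manifest_text, flags=re.MULTILINE)
    return match.group(1) if match else None

def latest_changelog_entry(changelog_text):
    """Return the first section that starts with a level-2 heading ('## ')."""
    sections = re.split(r"^## ", changelog_text, flags=re.MULTILINE)
    # sections[0] is whatever precedes the first '## ' heading (title, preamble)
    return sections[1].strip() if len(sections) > 1 else ""

def check_release(version, manifest_text, changelog_text):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if manifest_version(manifest_text) != version:
        problems.append("manifest version does not match")
    entry = latest_changelog_entry(changelog_text)
    heading = entry.splitlines()[0] if entry else ""
    if version not in heading:
        problems.append("latest changelog entry does not mention the version")
    return problems
```

With a manifest that still says 0.5.0, `check_release("0.6.0", ...)` would report the mismatch instead of letting the release proceed.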

Dockerfile/pangraph not building

wget  https://github.com/neherlab/pangraph/archive/refs/tags/0.5.0.tar.gz
tar xzvf 0.5.0.tar.gz
cd pangraph-0.5.0
docker build . -t pangraph/pangraph
# All requested packages already installed.

ERROR: InitError: PyError (PyImport_ImportModule) <class 'NameError'>
NameError("name 'ListType' is not defined")
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/__init__.py", line 61, in <module>
    from .clustering.clustertree import *
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/__init__.py", line 40, in <module>
    from .clustertree import *
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/clustertree.py", line 43, in <module>
    from . import clustvalidation
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/clustvalidation.py", line 199, in <module>
    from . import stats
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/stats.py", line 1933, in <module>
    geometricmean = Dispatch ( (lgeometricmean, (ListType, TupleType)), )

Stacktrace:
 [1] pyimport(name::String)
   @ PyCall ~/.julia/packages/PyCall/3fwVL/src/PyCall.jl:550
 [2] pyimport_conda(modulename::String, condapkg::String, channel::String)
   @ PyCall ~/.julia/packages/PyCall/3fwVL/src/PyCall.jl:714
 [3] __init__()
   @ PanGraph.PanX.Phylo /build_dir/src/panX.jl:21
 [4] _include_from_serialized(path::String, depmods::Vector{Any})
   @ Base ./loading.jl:768
 [5] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
   @ Base ./loading.jl:854
 [6] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1097
 [7] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1013
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:997
during initialization of module Phylo

caused by: PyError (PyImport_ImportModule) <class 'NameError'>
NameError("name 'ListType' is not defined")
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/__init__.py", line 61, in <module>
    from .clustering.clustertree import *
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/__init__.py", line 40, in <module>
    from .clustertree import *
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/clustertree.py", line 43, in <module>
    from . import clustvalidation
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/clustvalidation.py", line 199, in <module>
    from . import stats
  File "/root/.julia/conda/3/lib/python3.9/site-packages/ete3/clustering/stats.py", line 1933, in <module>
    geometricmean = Dispatch ( (lgeometricmean, (ListType, TupleType)), )

Stacktrace:
 [1] pyimport(name::String)
   @ PyCall ~/.julia/packages/PyCall/3fwVL/src/PyCall.jl:550
 [2] pyimport_conda(modulename::String, condapkg::String, channel::String)
   @ PyCall ~/.julia/packages/PyCall/3fwVL/src/PyCall.jl:708
 [3] __init__()
   @ PanGraph.PanX.Phylo /build_dir/src/panX.jl:21
 [4] _include_from_serialized(path::String, depmods::Vector{Any})
   @ Base ./loading.jl:768
 [5] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
   @ Base ./loading.jl:854
 [6] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1097
 [7] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1013
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:997
make: *** [Makefile:34: data/synthetic/test.fa] Error 1
The command 'bash -c set -euxo pipefail && cd /build_dir && make' returned a non-zero code: 2

Inaccurate output for Klebsiella pneumoniae dataset

Hi All,

We attempted to construct a Pangraph using the Klebsiella pneumoniae dataset. The raw sequences and Pangraph output JSON file are available here. However, we encountered the following issue in the output file:

In the klebs_100 PanGraph JSON file, we believe that the following five sequences are not represented correctly: NZ_CP013985.1 NZ_LR607362.1 NZ_CAKACX010000001.1 NZ_QIXX01000100.1 NZ_JARAMW010000001.1
We noticed that the unique string of eight nucleotides (TGCTTTTT or its reverse complement AAAAAGCA) is missing from these sequences, despite being present in the raw sequence. For example, in the sequence, 'NZ_CP013985.1', this string should be present in the block 'STETJDHNZS' (the 199th block on the path). This block is represented by the reverse strand. We manually reconstructed this block using the mutations in the PanGraph and found that the string of 8 nucleotides (TGCTTTTT) is missing from the reverse complement. We recommend that you do the same to verify. We believe the block is missing an 'insertion' at position 2775 for the sequence 'NZ_CP013985.1'. The nucleotides of the insertion should be AAAAAGCA.
We have used the following command to build the Pangraph: pangraph --circular klebs_100.fa
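The strand bookkeeping in this check can be reproduced with a few lines of Python. This is only a sketch of the k-mer test described above, with hypothetical helper names; it does not read the PanGraph JSON or reconstruct the block.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

def contains_on_either_strand(sequence, kmer):
    """True if the k-mer or its reverse complement occurs in the sequence."""
    return kmer in sequence or reverse_complement(kmer) in sequence
```

Here `reverse_complement("TGCTTTTT")` gives `AAAAAGCA`, so a block stored on the reverse strand should contain `AAAAAGCA` wherever the raw sequence contains `TGCTTTTT`.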

Thanks,
Sumit

The process gets into sleep status without reporting

Hi,

Thanks for developing PanGraph, I am looking forward to using it! After some complications, I managed to install it and run it successfully using the E. coli dataset. I'm now trying to run it on my dataset, with 163 strains of the same bacterial species. Unfortunately, it gets to only approximately 70% of the aligning pairs step and does not progress anymore (it stays in S status for days). Any advice on how to figure out where it gets stuck and how to solve it?

This is what the top line looks like:

    PR  NI     VIRT     RES     SHR  S  %CPU  %MEM      TIME+  COMMAND
    20   0  9516696  4.244g  164740  S   0.0   0.2  320:22.34  pangraph

Thanks!

Problems installing..

Excited to try pangraph but I'm struggling to install at the moment - any chance you could make a conda package?


julia --version
julia version 1.6.5
export jc="/well/bag/users/lipworth/miniconda3/bin/julia" make pangraph && make install
julia -q --project=. -e 'import Pkg; Pkg.instantiate(); Pkg.build()'
ERROR: The manifest file you are using was most likely generated by a different version of Julia and is not compatible with this Julia version
Stacktrace:
 [1] load_urls(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec})
   @ Pkg.Operations /home/conda/feedstock_root/build_artifacts/julia_1641439007442/work/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:545
 [2] #download_source#57
   @ /home/conda/feedstock_root/build_artifacts/julia_1641439007442/work/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:733 [inlined]
 [3] download_source
   @ /home/conda/feedstock_root/build_artifacts/julia_1641439007442/work/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:732 [inlined]
 [4] instantiate(ctx::Pkg.Types.Context; manifest::Nothing, update_registry::Bool, verbose::Bool, platform::Base.BinaryPlatforms.Platform, allow_build::Bool, allow_autoprecomp::Bool, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Pkg.API /home/conda/feedstock_root/build_artifacts/julia_1641439007442/work/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1408
 [5] instantiate
   @ /home/conda/feedstock_root/build_artifacts/julia_1641439007442/work/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1325 [inlined]
 [6] #instantiate#252
   @ /home/conda/feedstock_root/build_artifacts/julia_1641439007442/work/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321 [inlined]
 [7] instantiate()
   @ Pkg.API /home/conda/feedstock_root/build_artifacts/julia_1641439007442/work/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321
 [8] top-level scope
   @ none:1
make: *** [data/synthetic/test.fa] Error 1

size of docker image

The docker image has a size of ~1.5 GB and takes ~30 min to compile. Half of this size is due to the ete3 Python library, which requires packaging a Python distribution with conda in the docker image.
This only gets used in the export command, and in particular for the PanX-compatible export (file src/panX.jl).

A possible solution to reduce compilation time and docker image size is to use TreeTools instead of ete3. In particular we need functions to:

  • import a tree from a newick file, and export a tree to newick format
  • binarize a tree by removing polytomies
  • rescale all branch lengths by a constant factor
  • root the tree at the midpoint

check input fasta file for records with same name

If the input fasta file contains records with the same name, pangraph does not fail, but the second of the duplicated records will not be present in the output pangraph.
A simple preliminary check could be added to make sure that no records with the same name exist, and make pangraph fail if they do.
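Until such a check exists, something along these lines could be run on the input beforehand. This is a Python sketch rather than the proposed Julia implementation; `duplicate_record_names` is a hypothetical helper, and taking the first whitespace-delimited token of the header as the record name is an assumption about how names are derived.

```python
from collections import Counter

def duplicate_record_names(fasta_lines):
    """Return record names that occur more than once, in sorted order.

    Assumption: the record name is the first whitespace-delimited
    token of each header line (the text right after '>').
    """
    names = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    )
    return sorted(name for name, count in names.items() if count > 1)
```

For an uncompressed file, `duplicate_record_names(open("sequence.fa"))` returns an empty list when all record names are unique.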

Incorrect E. coli sequences being represented by PanGraph (large dataset)

Hi there,

We want to report an issue with a PanGraph that we generated on a dataset representing 1000 E. coli sequences. We believe that 64 of these sequences are not represented correctly by the PanGraph.

Thankfully, since we think the sequence lengths are also wrong, we manually verified the issue by simply computing the lengths of one of the mismatching sequences. We did this by adding up the lengths of the consensus sequences of the blocks on its path and adding the lengths of the insertions in the sequences and subtracting the lengths of the deletions on the path.

We find that the sequence length of the sequence ‘NZ_AP019856.1’ is computed by the PanGraph to be 4800017 bases. However, its true length is 4800098 bases.
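The length computation described above amounts to the following arithmetic, sketched in Python. The dictionary keys are hypothetical and do not mirror the PanGraph JSON schema; extracting the per-node values from the file is not shown.

```python
def path_length(nodes):
    """Sum consensus lengths along a path, adding insertions and subtracting deletions.

    Each node is a dict with hypothetical keys:
      consensus_length: length of the block consensus sequence
      insertions: lengths of the insertions this genome carries in the block
      deletions: lengths of the deletions this genome carries in the block
    """
    total = 0
    for node in nodes:
        total += node["consensus_length"]
        total += sum(node["insertions"])
        total -= sum(node["deletions"])
    return total
```

For NZ_AP019856.1 the two numbers quoted above differ by 4800098 - 4800017 = 81 bases.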

We have uploaded the three relevant files to the following folder: https://drive.google.com/drive/folders/1JAliSaWokYX2i5KaUjQiOPnCdL_uyZqG?usp=sharing

We believe the mismatching sequences are: NZ_AP019856.1, NZ_CP054407.1, NZ_CP010219.1, NZ_CP036202.1, NZ_CP014583.1, NZ_CP027587.1, NZ_CP027325.1, NZ_CP013029.1, NZ_CP027459.1, NZ_CP050865.1, NZ_CP050862.1, NZ_CP027534.1, NZ_CP014316.1, NZ_CP015085.1, NZ_CP018970.1, NZ_CP023826.1, NZ_CP032201.1, NZ_CP023844.1, NZ_CP015138.1, NZ_CP018983.1, NZ_CP018991.1, NZ_CP049077.2, NZ_CP010876.1, NZ_CP036245.1, NZ_CP049085.2, NZ_CP035476.1, NZ_CP035477.1, NZ_CP014522.1, NZ_CP014495.1, NZ_CP024720.1, NZ_CP024717.1, NZ_CP021207.1, NZ_CP019008.1, NZ_CP019020.1, NZ_CP035498.1, NZ_CP053245.1, NZ_CP037449.1, NZ_CP048304.1, NZ_CP048920.1, NZ_CP040456.1, NZ_CP024886.1, NZ_CP051700.1, NZ_CP030111.1, NZ_AP022650.1, NZ_CP053251.2, NZ_CP051688.1, NZ_CP033762.1, NZ_CP019273.1, NZ_AP017610.1, NZ_CP033850.1, NZ_CP019029.1, NZ_CP015834.1, NZ_CP009859.1, NZ_CP040919.1, NZ_CP023366.1, NZ_CP041300.1, NZ_CP033605.1, NZ_CP041452.1, NZ_CP041448.1, NZ_CP028166.1, NZ_AP021896.1, NZ_CP031833.1

Thanks,
Harsh

GFA export edge case

I think in pangraph export there is an edge case for gfa export of circular genomes.

If a circular genome consists of two blocks, it will only get one edge between these blocks, when there should be the reverse edge added to make the genome circular (e.g. when viewed in bandage.)

Example attached but the issue is that for a genome

P	genome1	IRXQDAJTWV+,PLHWRNNGJC+	TP:Z:circular*

pangraph exports only one edge:

L	IRXQDAJTWV	+	PLHWRNNGJC	+	*	RC:i:2

when for bandage to plot properly it should also export

L	IRXQDAJTWV	-	PLHWRNNGJC	-	*	RC:i:2

fake_example.tar.gz
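To spell out the expected behaviour, here is a small Python sketch (not pangraph's exporter; `path_links` and `flip` are hypothetical helpers). A circular path should contribute one extra link from its last block back to its first, and that closing link read from the opposite strand is exactly the second L line shown above.

```python
def path_links(path, circular):
    """Links between consecutive oriented blocks of a path.

    A circular path also gets the closing link from the last block
    back to the first one.
    """
    links = [(a, sa, b, sb) for (a, sa), (b, sb) in zip(path, path[1:])]
    if circular and path:
        (a, sa), (b, sb) = path[-1], path[0]
        links.append((a, sa, b, sb))
    return links

def flip(link):
    """The same adjacency read from the opposite strand."""
    a, sa, b, sb = link
    other = {"+": "-", "-": "+"}
    return (b, other[sb], a, other[sa])

path = [("IRXQDAJTWV", "+"), ("PLHWRNNGJC", "+")]
links = path_links(path, circular=True)
```

For this two-block genome, `links[0]` corresponds to the exported edge and `flip(links[1])` corresponds to the missing one.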

installation issue

I tried to install pangraph

after running the command

    julia --project=. -e 'using Pkg; Pkg.build()'

I got:

    Building Conda ───→/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/6cdc8832ba11c7695f494c9d9a1c31e90959ce0f/build.log
    Building PyCall ──→/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/4ba3651d33ef76e24fef6a598b63ffd1c5e1cd17/build.log
    Building PanGraph →~/Desktop/pangraph/deps/build.log

Now, entering the REPL through

    julia --project=.

starts Julia:

                   _
       _       _ _(_)_     |  Documentation: https://docs.julialang.org
      (_)     | (_) (_)    |
       _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 1.7.2 (2022-02-06)
     _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
    |__/                   |

    julia>

now typing pangraph gives

ERROR: UndefVarError: pangraph not defined

Could you suggest a solution?

Thank you

Bypassing ordering step

I am building a PanGraph for SARS-CoV-2 sequences (around 50k). I observed that the ordering step is a bottleneck here. Is it possible to pass a tree as an argument and bypass the ordering step in PanGraph? Also, I have tried adding "-d mash" as an argument to check the speedup, but it throws an error instead.

Any help would be appreciated.

Thanks

Non-boolean error in pangraph export

Hi,

I am getting the following error when I run the latest Pangraph:
julia --threads 15 --project=. src/PanGraph.jl export --no-duplications --export-panX --output-directory pangraph_export pangraph.json

Aligning blocks. Building trees... 74%|████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | ETA: 0:55:55
ERROR: LoadError: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
[1] find_midpoint(t::TreeTools.Tree{TreeTools.EmptyData}; topological::Bool)
@ TreeTools /mnt/storage/conda/cmeehan/pangraph/share/julia/packages/TreeTools/wxYIX/src/methods.jl:721
[2] root_midpoint!(t::TreeTools.Tree{TreeTools.EmptyData}; topological::Bool)
@ TreeTools /mnt/storage/conda/cmeehan/pangraph/share/julia/packages/TreeTools/wxYIX/src/methods.jl:625
[3] #root!#32
@ /mnt/storage/conda/cmeehan/pangraph/share/julia/packages/TreeTools/wxYIX/src/methods.jl:611 [inlined]
[4] (::Main.PanGraph.PanX.var"#13#14"{TreeTools.Tree{TreeTools.EmptyData}})()
@ Main.PanGraph.PanX /mnt/storage/tools/pangraph/src/panX.jl:122
[5] with_logstate(f::Function, logstate::Any)
@ Base.CoreLogging ./logging.jl:511
[6] with_logger
@ ./logging.jl:623 [inlined]
[7] produce_tree(alignment::String, scale::Int64)
@ Main.PanGraph.PanX /mnt/storage/tools/pangraph/src/panX.jl:121
[8] emitblock(block::Main.PanGraph.Graphs.Blocks.Block, root::String, prefix::String, identifier::Main.PanGraph.PanX.var"#11#12"{Dict{Main.PanGraph.Graphs.Nodes.Node, String}}; reduced::Bool)
@ Main.PanGraph.PanX /mnt/storage/tools/pangraph/src/panX.jl:160
[9] emit(G::Main.PanGraph.Graphs.Graph, root::String)
@ Main.PanGraph.PanX /mnt/storage/tools/pangraph/src/panX.jl:283
[10] (::Main.PanGraph.var"#43#48")(args::Vector{String})
@ Main.PanGraph /mnt/storage/tools/pangraph/src/export.jl:173
[11] run(cmd::Main.PanGraph.Commands.Command, args::Vector{String})
@ Main.PanGraph.Commands /mnt/storage/tools/pangraph/src/args.jl:182
[12] main(args::Vector{String})
@ Main.PanGraph /mnt/storage/tools/pangraph/src/PanGraph.jl:162
[13] top-level scope
@ /mnt/storage/tools/pangraph/src/PanGraph.jl:179
in expression starting at /mnt/storage/tools/pangraph/src/PanGraph.jl:1

I had run exactly the same dataset through pangraph export before (around 4 months ago) and it worked fine so seems to be a problem with the latest version. Any advice you have on how to fix would be great.

Thanks,
Conor

Can I use pangraph for MAGs?

I wonder if it is possible to use pangraph for metagenome assembled genomes which are usually highly fragmented.

Incorrect HIV Sequences being represented by PanGraph

Hi there,

Our lab has been working with PanGraphs for a while now and we've found them to be very useful. We believe that we've recently noticed some sequences being incorrectly represented by the PanGraph. The simplest dataset that we've found the issue on is a dataset of 2000 HIV sequences. We have attached a Google Drive link with our files. The HIV_2000.fa file stores the true sequences. The hiv_2000_pangraph.fa file consists of the sequences that we believe the PanGraph represents.

We've found seven sequences of the presumed PanGraph output to not match with the raw sequences. These are: B.RU.2004.04RU128005.AY682547, B.US.2000.14302_1.DQ853450, B.US.2000.14294_1.DQ853436, B.US.2000.14303_1.DQ853451, B.US.1998.15388_1.DQ853456, B.US.1998.15385_1.DQ853464 and B.US.1998.15386_1.DQ853460.

One thing to note is that most of these mismatches occur towards the ends of the sequences. The Google Drive link also contains the PanGraph that these sequences were derived from.

Soon, the data for 20000 HIV sequences will also be uploaded where 88/20000 sequences don't match. You can use those for testing.

Drive

Thanks,
Harsh Motwani
Turakhia Lab, UC San Diego

Add --version argument

Currently there does not seem to be a way to query a version of the CLI, so users are completely in the dark about what they are using:

$ docker run -it --rm neherlab/pangraph:latest pangraph --version        
pangraph --version: unknown command
Run 'pangraph help' for usage.


$ docker run -it --rm neherlab/pangraph:latest pangraph version
pangraph version: unknown command
Run 'pangraph help' for usage.

$  docker run -it --rm neherlab/pangraph:latest pangraph help   
usage: pangraph <command> [arguments]

    a tool for aligning large sets of closely related genomes in the presence of horizontal gene transfer

The available commands are:

  build       align genomes into a multiple sequence alignment graph
  generate    outputs a simulated sequence alignment graph
  help        prints usage information for a given command
  polish      realigns pancontigs of multiple sequence alignment graph
  marginalize computes all pairwise marginalizations of a multiple sequence alignment graph
  export      exports a pangraph to a chosen file format(s)

The expected arguments are:

  passed directly to the chosen command

Would be nice to add a --version flag or a version command.

Extra points for synchronizing the version reported with the version in the manifest.

Is there any plan to add annotations to the block/horizontal alignments?

Hello,

I am trying to look at the upstream and downstream features around an AMR gene. I managed to get the block alignments but is there a way to get annotations from prokka into the plots? If you don't have bandwidth right now - please point to a starting point to how I can visualise the annotations if you can. That would be really helpful.

Thank you.

GFAv1 output

Is it possible to obtain the paths as directed walks through the graph in GFAv1 format?

PanX export fails

I got an error while exporting a graph in PanX-compatible format with the following command:

pangraph export \
    --export-panX \
    --no-duplications \
    --output-directory coli_export \
    graph.json

The error seems to happen when midpoint-rooting with TreeTools.

LoadError: AssertionError: Issue with time on the branch above midpoint
Stacktrace:
  [1] root_midpoint!(t::TreeTools.Tree{TreeTools.EmptyData}; topological::Bool)
    @ TreeTools ~/.julia/packages/TreeTools/B7XJF/src/methods.jl:625
  [2] #root!#31
    @ ~/.julia/packages/TreeTools/B7XJF/src/methods.jl:591 [inlined]
  [3] (::Main.PanGraph.PanX.var"#13#14"{TreeTools.Tree{TreeTools.EmptyData}})()
    @ Main.PanGraph.PanX ~/Downloads/pangraph-test/pangraph/src/panX.jl:124
  [4] with_logstate(f::Function, logstate::Any)
    @ Base.CoreLogging ./logging.jl:511
  [5] with_logger
    @ ./logging.jl:623 [inlined]
  [6] produce_tree(alignment::String, scale::Int64)
    @ Main.PanGraph.PanX ~/Downloads/pangraph-test/pangraph/src/panX.jl:123
  [7] emitblock(block::Main.PanGraph.Graphs.Blocks.Block, root::String, prefix::String, identifier::Main.PanGraph.PanX.var"#11#12"{Dict{Main.PanGraph.Graphs.Nodes.Node, String}}; reduced::Bool)
    @ Main.PanGraph.PanX ~/Downloads/pangraph-test/pangraph/src/panX.jl:159
  [8] emit(G::Main.PanGraph.Graphs.Graph, root::String)
    @ Main.PanGraph.PanX ~/Downloads/pangraph-test/pangraph/src/panX.jl:278
  [9] (::Main.PanGraph.var"#39#44")(args::Vector{String})
    @ Main.PanGraph ~/Downloads/pangraph-test/pangraph/src/export.jl:173
 [10] run(cmd::Main.PanGraph.Commands.Command, args::Vector{String})
    @ Main.PanGraph.Commands ~/Downloads/pangraph-test/pangraph/src/args.jl:182
 [11] main(args::Vector{String})
    @ Main.PanGraph ~/Downloads/pangraph-test/pangraph/src/PanGraph.jl:162
 [12] top-level scope
    @ ~/Downloads/pangraph-test/pangraph/src/PanGraph.jl:177
in expression starting at /home/marco/Downloads/pangraph-test/pangraph/src/PanGraph.jl:1

The tree in question seems to have a large clade:

                               , NZ_CP013034.1#1
  _____________________________|
 |                             | NZ_CP013036.1#1
 |
 |                                                          , NZ_CP017865.1#1
 |                                                          |
 |                                                          | NZ_CP007183.1#1
 |                                                          |
 |                                                          | NZ_CP027638.1#1
 |                                                          |
 |                                                          | NZ_CP023545.1#1
 |                                                          |
 |                                                          | NZ_CP017868.1#1
 |                                                          |
 |                                                          | NZ_CP017878.1#1
 |                                                          |
 |                                                          | NZ_CP017873.1#1
 |                                                          |
 |                                                          | NZ_CP011015.1#1
 |                                                          |
 |                              ____________________________| CP011777.1#1
_|                             |                            |
 |                             |                            | NZ_CP015528.1#1
 |                             |                            |
 |                             |                            | NC_022347.1#1
 |                             |                            |
 |                             |                            | NZ_CP017871.1#1
 |                             |                            |
 |                             |                            | NZ_CP013733.1#1
 |_____________________________|                            |
 |                             |                            | NZ_CP017025.1#1
 |                             |                            |
 |                             |                            | NC_022660.2#1
 |                             |                            |
 |                             |                            | NZ_CP027634.1#1
 |                             |                            |
 |                             |                            | NZ_CP018900.1#1
 |                             |
 |                             |_____________________________ NZ_CP013032.1#1
 |
 |                              _____________________________ NZ_CP027639.1#1
 |_____________________________|
                               |_____________________________ NZ_CP007179.1#1

The newick file in question reads:

((NZ_CP013034.1#1:0.0,NZ_CP013036.1#1:0.0):0.000000005,((NZ_CP017865.1#1:0.0,NZ_CP007183.1#1:0.0,NZ_CP027638.1#1:0.0,NZ_CP023545.1#1:0.0,NZ_CP017868.1#1:0.0,NZ_CP017878.1#1:0.0,NZ_CP017873.1#1:0.0,NZ_CP011015.1#1:0.0,CP011777.1#1:0.0,NZ_CP015528.1#1:0.0,NC_022347.1#1:0.0,NZ_CP017871.1#1:0.0,NZ_CP013733.1#1:0.0,NZ_CP017025.1#1:0.0,NC_022660.2#1:0.0,NZ_CP027634.1#1:0.0,NZ_CP018900.1#1:0.0):0.000000005,NZ_CP013032.1#1:0.000000005)0.745:0.000000005,(NZ_CP027639.1#1:0.000000005,NZ_CP007179.1#1:0.000000005)0.000:0.000000005);

And the error is reproducible if I load this tree and run:

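using TreeTools  # provides parse_newick_string, binarize! and root!

# read the Newick string from the first line of the file
line = open("tree.nwk") do f
    readline(f)
end
tree = parse_newick_string(line)
# resolve polytomies, then midpoint-root; the last call raises the error
TreeTools.binarize!(tree)
TreeTools.root!(tree; method=:midpoint)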
