BioJulia / BioAlignments.jl

Sequence alignment tools

License: MIT License

Julia 98.33% Roff 0.99% Perl 0.68%
bam-files sam-files sequence-alignment sequence-analysis dna-sequences biology bioinformatics smith-waterman-alignment high-throughput-sequencing biojulia

bioalignments.jl's Introduction

Bio

This package has been deprecated. Full details are available here. You can still download and use this package, as we don't want old scripts to break. Going forward, however, this repository is archived and read-only: no further updates or fixes will be committed. You should use the packages that replace Bio.jl - a list is available here.

bioalignments.jl's People

Contributors

alyst, bicycle1885, ciaranomara, dcjones, femtocleaner[bot], github-actions[bot], jakobnissen, kaparanewbie, kescobo, millironx, mortenpi, tanhevg, timholy

bioalignments.jl's Issues

Alignment of 2 LongAminoAcidSeq with AffineGapScoreModel from docs goes out of bounds

Affine alignment is not working, but simple edit distance works.

strucseq = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL"

msaseq = "KIFERCEFARTLKRNGMGGYHGIRLADWVCLARWESSYNTKATNYNSKSTDYGIFQINSRYWCNDGKTPGAVNACGISCNVLLQDDITQAIACAKRVVDPQGIRAWVAWKKHCEQDLTQYQGC"

Expected Behavior

Something like

costmodel = CostModel(match=0, mismatch=1, insertion=1, deletion=1)
alignment = pairalign(EditDistance(), strucseq, msaseq, costmodel)

provides

PairwiseAlignmentResult{Int64, LongAminoAcidSeq, LongAminoAcidSeq}:
  distance: 62
  seq:   1 KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS  60
           | | ||| |   || |   | |  |  ||| |  ||  || ||| |   |||||| ||||
  ref:   1 KIFERCEFARTLKRNGMGGYHGIRLADWVCLARWESSYNTKATNYN-SKSTDYGIFQINS  59

  seq:  61 RWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDV 120
           | ||||| |||  | | | |  ||  |||    || | | |  |  |||||   |   |
  ref:  60 RYWCNDGKTPGAVNACGISCNVLLQDDITQAIACA-KRVVDPQGIRAWVAWKKHC-EQD- 116

  seq: 121 QAWIRGCRL 129
                ||
  ref: 117 LTQYQGC-- 123

Current Behavior

costmodel = AffineGapScoreModel(BLOSUM62, gap_open=-10, gap_extend=-1)
alignment = pairalign(GlobalAlignment(), strucseq, msaseq, costmodel)

gives

ERROR: BoundsError: attempt to access 27×27 Matrix{Int64} at index [12, 28]
Stacktrace:
 [1] getindex
   @ .\array.jl:862 [inlined]
 [2] getindex
   @ C:\Users\kool7\.julia\packages\BioAlignments\t4D8A\src\submat.jl:82 [inlined]
 [3] run!(nw::BioAlignments.NeedlemanWunsch{Int64}, a::LongAminoAcidSeq, b::LongAminoAcidSeq, submat::SubstitutionMatrix{AminoAcid, Int64}, start_gap_open_a::Int64, start_gap_extend_a::Int64, middle_gap_open_a::Int64, middle_gap_extend_a::Int64, end_gap_open_a::Int64, end_gap_extend_a::Int64, start_gap_open_b::Int64, start_gap_extend_b::Int64, middle_gap_open_b::Int64, middle_gap_extend_b::Int64, end_gap_open_b::Int64, end_gap_extend_b::Int64)
   @ BioAlignments C:\Users\kool7\.julia\packages\BioAlignments\t4D8A\src\pairwise\algorithms\needleman_wunsch.jl:41
 [4] pairalign(::OverlapAlignment, a::LongAminoAcidSeq, b::LongAminoAcidSeq, score::AffineGapScoreModel{Int64}; score_only::Bool)
   @ BioAlignments C:\Users\kool7\.julia\packages\BioAlignments\t4D8A\src\pairwise\pairalign.jl:137
 [5] pairalign(::OverlapAlignment, a::LongAminoAcidSeq, b::LongAminoAcidSeq, score::AffineGapScoreModel{Int64})
   @ BioAlignments C:\Users\kool7\.julia\packages\BioAlignments\t4D8A\src\pairwise\pairalign.jl:134
 [6] top-level scope
   @ c:\Users\kool7\Google Drive\BioMakie.jl\_research\mitos2.jl:66

`AlignmentAnchor` method signature

The following was caught as an unrelated error in a BGZFStreams.jl downstream test (https://github.com/BioJulia/BGZFStreams.jl/runs/5163487858?check_suite_focus=true#step:6:243):

MethodError: no method matching AlignmentAnchor(::Int64, ::Int64, ::BioAlignments.Operation)
  Closest candidates are:
    AlignmentAnchor(::Any, ::Any, ::Any, ::Any) at ~/.julia/packages/BioAlignments/EyHqv/src/anchors.jl:13
    AlignmentAnchor(::Int64, ::Int64, ::Int64, ::BioAlignments.Operation) at ~/.julia/packages/BioAlignments/EyHqv/src/anchors.jl:13

Let us yank v2.0.1 and v2.1.0 of BioAlignments at JuliaRegistries/General.

Then I think the easiest fix is to go straight to v3, or we would need to resupply the v2 method signature AlignmentAnchor(::Int64, ::Int64, ::BioAlignments.Operation) and tag as v2.1.1.


Edited for clarity.

Bug in `OverlapAlignment`

Align these two sequences

a = AGTAAAGATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCCTGATAACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATAATAGAAAGAAACATAACAGAGATAGTGTATTTGACCAACACCACCATAGAGAGGGAAATATGCCCCAAACCAGCAGAATACAGAAATTGGTCAAAACCGCAATGTGGCATTACAGGATTTGCACCTTTCTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACAATAAACAACGTGCATTCAAATAACACAGCACGTGATAGGACCCCTCATCGGACTCTATTGATGAATGAGTTGGGTGTTCCTTTCCATCTGGGGACCAAGCAAGTGTGCATAGCATGGTCCAGCTCAAGTTGTCACGATGGAAAAGCATGGCTGCATGTTTGTATAACGGGGGATGATAAAAATGCAACTGCTAGTTTCATTTACAATGGGAGGCTTGTAGATAGTGTTGTTTCATGGTCCAAAGATATTCTCAGGACCCAGGAGTCAGAATGCGTTTGTATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAATGCTACAGGAAAAGCTGATACTAAAATATTATTCATTGAGGAGGGGAAAATCGTTCATACTAGCAAATTGTCAGGAAGTGCTCAGCATGTCGAAGAGTGCTCTTGCTATCCTCGATACCCTGGTGTCAGATGTGTCTGCAGAGACAACTGGAAAGGATCCAACCGGCCCATCGTAGATATAAACATAAAGGATCATAGCATTGTTTCCAGTTATGTGTGTTCAGGACTTGTTGGAGACACACCCAGAAAAACCGACAGCTCCAGCAGCAGCCATTGCTTGAATCCTAACAATGAAAAAGGTGGTCATGGAGTGAAAGGCTGGGCCTTTGATGATGGAAATGACGTGTGGATGGGGAGAACAATCAACGAGACGTCACGCTTAGGGTATGAAACCTTCAAAGTCGTTGAAGGCTGGTCCAACCCTAAGTCCAAATTGCAGATAAATAGGCAAGTCATAGTTGACAGAGGTGATAGGTCCGGTTATTCTGGTATTTTCTCTGTTGAAGGCAAAAGCTGCATCAATCGGTGCTTTTATGTGGAGTTGATTAGGGGAAGAAAAGAGGAAACTGAAGTCTTGTGGACCTCAAACAGTATTGTTGTGTTTTGTGGCACCTCAGGTACATATGGAACAGGCTCATGGCCTGATGGGGCGGACCTCAATCTCATGCATATATAAGCTTTCGCAATTTTAGAAAAAACT

b = ACTGAGGCAAATAGGCCAAAAATGAACAATGCTACCTTCAACTATACAAACGTTAACCCTATTTCTCACATCAGGGGGAGTGTTATTATCACTATATGTGTCAGCTTCATTGTCATACTTACTATATTCGGATATATTGCTAAAATTTTCACAAACAGAAATAACTGCACCAATAATGCCATTGGATTGTGCAAACGCATCAAATGTTCAGGCTGTGAACCGTTCTGCAACAAAAGGGGTGACACTTCTTCTCCCAGAACCGGAGTGGACGTATCCTCGTTTATCTTGCCCGGGCTCAACCTTTCAGAAAGCACTCCTAATTAGCCCCCATAGATTCGGAGAAACCAAAGGAAACTCAGCTCCCTTGATAATAAGGGAACCTTTTATTGCTTGTGGACCAAAGGAATGCAAACACTTTGCTCTAACCCATTATGCAGCTCAGCCAGGGGGATACTACAATGGAACAAGAGAAGACAGAAACAAGCTGAGGCATCTAATTTCAGTCAAATTGGGCAAAATCCCAACAGTAGAAAACTCTATTTTCCACATGGCAGCTTGGAGCGGGTCCGCATGCCATGATGGTAGAGAATGGACATACATCGGAGTTGATGGCCCCGACAGTAATGCATTGCTCAAAATAAAATATGGAGAAGCATATACTGACACATACCATTCCTATGCAAAAAACATCCTAAGGACACAAGAAAGTGCCTGCAATTGCATCGGGGGAGATTGTTATCTTATGATAACTGATGGCCCAGCTTCAGGGATTAGTGAATGCAGATTCCTTAAGATTCGAGAGGGCCGAATAATAAAAGAAATATTTCCAACAGGAAGAGTAAAACATACTGAGGAATGCACATGCGGATTTGCCAGCAACAAAACCATAGAATGTGCCTGTAGAGATAACAATTACACAGCAAAAAGACCCTTTGTCAAATTAAATGTGGAGACTGATACAGCGGAAATAAGATTGATGTGCACAGAGACTTATTTGGACACCCCCAGACCAAATGATGGAAGCATAACAGGGCCTTGCGAATCTAATGGGGACAAAGGGAGTGGAGGCATCAAAGGAGGATTTGTTCATCAAAGAATGGCATCCAAGATTGGAAGGTGGTACTCTCGAACGATGTCTAAAACTAAAAGAATGGGGATGGGACTGTATGTAAAGTATGATGGAGACCCATGGACTGATAGTGAAGCCCTTGCTCTTAGTGGAGTAATGGTTTCAATGGAAGAACCTGGTTGGTATTCCTTTGGCTTCGAAATAAAAGATAAGAAATGTGATGTCCCCTGTATTGGGATAGAAATGGTACATGATGGTGGGAAAACGACTTGGCACTCAGCAGCAACAGCCATTTACTGTTTAATGGGCTCAGGACAACTGCTGTGGGACACTGTCACAGGTGTTGATATGGCTCTGTAATGGAGGAATGGTTGAGTCTGTTCTAAACCCTTTGTTCCTATTTTGTTTGAACAATTGTCCTTACTGAACTTAATTGTTTCTGAA

With the following code

pairalign(OverlapAlignment(), a, b, AffineGapScoreModel(EDNAFULL, gap_open=-25, gap_extend=-2))

The resulting alignment only aligns part of a, and is therefore not a global-global alignment.

Lack of an article to cite BioAlignments.jl

Hi everyone! I recently built a secondary structure prediction workflow (check it out!) and I'm submitting a manuscript about it. I like to give due credit to every piece of open-source software I used during elaboration and BioAlignments.jl was one of them. I reached out to @kescobo and he suggested I open an issue.

Perhaps the maintainers/authors can work something out, maybe initially a Zenodo release so that at least there's a title, authors and a DOI, and in the future a short communication.

Thanks for your attention!

[Feature]: Add method to remove sequence match info from cigar

Expected Behavior

I would like a way to tell if the two alignments are the same regardless of sequence, such that

Alignment("1=1X") == Alignment("2M")

Current Behavior

Alignment("1=1X") == Alignment("2M")

# returns false

☝️ This behavior is correct!

Possible Solution / Implementation

The current behavior is entirely correct, but it doesn't let me compare alignments that record sequence match/mismatch information against those that don't. I propose a new function

remove_match_ops(::T) where {T<:Union{String,Alignment,AlignedSequence,PairwiseAlignment,PairwiseAlignmentResult}}

that would remove the = (sequence match) and X (sequence mismatch) operations from the CIGAR of the alignment and return a T where those operations would be replaced by M (match) operations and adjacent match operations merged.
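The String case of this proposal can be sketched in plain Julia. This is a minimal sketch, not BioAlignments code: the function name is taken from the proposal above, but the implementation is an assumption and handles only CIGAR strings, not Alignment or the other listed types. Anything in the input that does not look like a length followed by an operation is silently skipped.

```julia
# Hypothetical helper (not part of BioAlignments): normalize a CIGAR string by
# turning `=` and `X` into `M` and merging adjacent runs of the same operation.
function remove_match_ops(cigar::AbstractString)
    ops = Tuple{Int,Char}[]
    for m in eachmatch(r"(\d+)([MIDNSHP=X])", cigar)
        len = parse(Int, m.captures[1])
        op = first(m.captures[2])
        # Collapse sequence match (=) and mismatch (X) into alignment match (M).
        if op == '=' || op == 'X'
            op = 'M'
        end
        if !isempty(ops) && ops[end][2] == op
            # Same operation as the previous run: extend that run.
            ops[end] = (ops[end][1] + len, op)
        else
            push!(ops, (len, op))
        end
    end
    return join(string(len, op) for (len, op) in ops)
end
```

With this, remove_match_ops("1=1X") and remove_match_ops("2M") produce the same string, so the two can be compared after normalization.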

LocalAlignment finds a lower score than the biopython/scikit-bio implementations

While I was benchmarking Julia's pairalign (318 μs) against the BioPython (7310 µs) and scikit-bio (925 µs) implementations, I found that Julia's local alignment returns a solution with score 63 while the Python solutions have a score of 65. That makes me think that we have a hidden bug there and are not finding the optimal solution.

Python 3.5.2

from Bio import pairwise2
from Bio.SubsMat import MatrixInfo as matlist
matrix = matlist.blosum50
pairwise2.align.localds("ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR", "KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI", matrix, -10, -1)
[('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP--------TASEIRNPVFSGKMDGNSLKQADSTSTR',
  '--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI----',
  65.0,
  6,
  63),
 ('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP-TASEIRNPVFSGKMDGNSLKQADSTSTR----',
  '--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRV-DCTASGAENSGSLGKVDMPCSSKVDI',
  65.0,
  6,
  56),
 ('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP-TASEIRNPVFSGKMDGNSLKQADSTSTR----',
  '--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTA-SGAENSGSLGKVDMPCSSKVDI',
  65.0,
  6,
  56)]
from skbio.alignment import local_pairwise_align_ssw
from skbio.alignment._pairwise import blosum50
from skbio import Protein
local_pairwise_align_ssw(Protein("ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR"), Protein("KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI"), gap_open_penalty=10, gap_extend_penalty=1, substitution_matrix=blosum50)
(TabularMSA[Protein]
 --------------------------------------------------
 Stats:
     sequence count: 2
     position count: 50
 --------------------------------------------------
 GALTSDNSGTTLFGKPEPMSSAEA-TPTASEIRNPVFSGKMDGNSLKQAD
 GSLRNEES--MLKGKAEPMIYGKGEPGTVGR-VDCTASGAENSGSLGKVD, 65, [(6, 54), (4, 50)])

Julia 1.0.2

using BioAlignments
pairalign(LocalAlignment(), "ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR", "KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI", AffineGapScoreModel(BLOSUM50, gap_open=-10, gap_extend=-1))
PairwiseAlignmentResult{Int64,String,String}:
  score: 63
  seq:  7 GALTSDNSGTTLFGKPEPMSSAEATP--------TASEIRNPVFSGKMDGNSLKQAD 55
          | |    |   | || |||      |        |||   |    || |       |
  ref:  5 GSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVD 59

In this case, BioJulia's solution is part of BioPython's first solution.

julia> versioninfo()
Julia Version 1.0.2
Commit d789231e99 (2018-11-08 20:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

(v1.0) pkg> st
    Status `~/.julia/environments/v1.0/Project.toml`
  [c7e460c6] ArgParse v0.6.1
  [c52e3926] Atom v0.7.12
  [6e4b80f9] BenchmarkTools v0.4.1
  [e334fcbd] Bio3DView v0.0.0 #master (https://github.com/jgreener64/Bio3DView.jl)
  [00701ae9] BioAlignments v1.0.0
  [7e6ae17a] BioSequences v1.1.0
  [3c28c6f8] BioSymbols v3.1.0
  [336ed68f] CSV v0.4.3
  [a93c6f00] DataFrames v0.16.0
  [864edb3b] DataStructures v0.15.0
  [ab62b9b5] DeepDiffs v1.1.0
  [e30172f5] Documenter v0.21.0
  [5789e2e9] FileIO v1.0.5
  [1fa38f19] Format v0.7.2 [`~/.julia/dev/Format`]
  [59287772] Formatting v0.3.5
  [28b8d3ca] GR v0.37.0
  [92cae0f8] GaussDCA v0.0.0 #master (https://github.com/carlobaldassi/GaussDCA.jl)
  [c27321d9] Glob v1.2.0
  [bd48cda9] GraphRecipes v0.4.0
  [a2bd30eb] Graphics v0.4.0
  [4c0ca9eb] Gtk v0.16.4
  [7073ff75] IJulia v1.15.2
  [18364772] IPython v0.4.0
  [6218d12a] ImageMagick v0.7.1
  [86fae568] ImageView v0.8.2
  [916415d5] Images v0.17.0
  [d0351b0e] InspectDR v0.3.3
  [c601a237] Interact v0.9.0
  [e5e0dc1b] Juno v0.5.3
  [b964fa9f] LaTeXStrings v1.0.3
  [51bafb47] MIToS v2.3.3+ [`~/.julia/dev/MIToS`]
  [86f7a689] NamedArrays v0.9.2
  [47be7bcc] ORCA v0.2.0
  [3b7a836e] PGFPlots v3.0.1
  [f9da4da7] PairwiseListMatrices v0.7.0+ #master (https://github.com/diegozea/PairwiseListMatrices.jl.git)
  [0e44f1d2] PlotRecipes v0.3.0+ [`~/.julia/dev/PlotRecipes`]
  [f0f68f2c] PlotlyJS v0.12.2
  [91a5bcdd] Plots v0.22.5
  [c46f51b8] ProfileView v0.4.0
  [438e738f] PyCall v1.18.5
  [d330b81b] PyPlot v2.7.0
  [f535d66d] ROCAnalysis v0.3.0
  [3cdcf5f2] RecipesBase v0.6.0
  [60ddc479] StatPlots v0.8.2
  [88034a9c] StringDistances v0.3.0
  [37b6cedf] Traceur v0.2.0
  [b8865327] UnicodePlots v1.0.1

Multiple alignment manipulation

We need the capability to manipulate multiple sequence alignments, and this package seems like the right place to put it.

I started working on this but it may need more thought about how it will play nice with the pairwise alignment-oriented code.

Version trouble with IntervalTrees. Only able to install "BioAlignments" after removing package "Bio"

I have a problem when installing the package. I realize the problem is a version constraint on the IntervalTrees package (a multi-package problem), but I don't know where better to post this issue.

If I understand correctly, BioAlignments requires version 0.5.0 of the IntervalTrees package, but does that version even exist? If I go to https://github.com/BioJulia/IntervalTrees.jl/releases, the latest is 0.4.1.

Also, package Bio has the following version requirements for package IntervalTrees: [0.0.1,0.1.1).

Code

Pkg.add("BioAlignments")

Then I get this error message:

julia> Pkg.add("BioAlignments")
ERROR: Unsatisfiable requirements detected for package IntervalTrees:
├─version range [0.0.0-,∞) set by an explicit requirement
├─version range [0.0.0-,∞) required by package Bio, whose allowed version range is [0.0.0-,∞):
│ └─version range [0.0.0-,∞) set by an explicit requirement
├─version range [0.0.0-,∞) required by package BioSequences, whose allowed version range is [0.5.0,∞):
│ ├─version range [0.0.0-,∞) set by an explicit requirement
│ └─version range [0.5.0,∞) required by package BioAlignments, whose allowed version range is [0.0.0-,∞):
  │ └─version range [0.0.0-,∞) set by an explicit requirement
├─version range [0.0.0-,∞) required by package BioAlignments, whose allowed version range is [0.0.0-,∞):
│ └─[see above for BioAlignments backtrace]
└─version range [0.2.0,∞) required by package GenomicFeatures, whose allowed version range is [0.1.0,∞):
  └─version range [0.1.0,∞) required by package BioAlignments, whose allowed version range is [0.0.0-,∞):
    └─[see above for BioAlignments backtrace]
The intersection of the requirements is empty.
filter_versions(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{AbstractString,Base.Pkg.Types.ResolveBacktraceItem}) at .\pkg\query.jl:299
prune_versions(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{AbstractString,Base.Pkg.Types.ResolveBacktraceItem}) at .\pkg\query.jl:328
prune_dependencies(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{AbstractString,Base.Pkg.Types.ResolveBacktraceItem}) at .\pkg\query.jl:546
resolve(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{String,Tuple{VersionNumber,Bool}}, ::Dict{String,Base.Pkg.Types.Fixed}, ::Dict{String,VersionNumber}, ::Set{String}) at .\pkg\entry.jl:498
resolve(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{String,Tuple{VersionNumber,Bool}}, ::Dict{String,Base.Pkg.Types.Fixed}) at .\pkg\entry.jl:479
edit(::Function, ::String, ::Base.Pkg.Types.VersionSet, ::Vararg{Base.Pkg.Types.VersionSet,N} where N) at .\pkg\entry.jl:30
(::Base.Pkg.Entry.##1#3{String,Base.Pkg.Types.VersionSet})() at .\task.jl:335
Stacktrace:
 [1] sync_end() at .\task.jl:287
 [2] macro expansion at .\task.jl:303 [inlined]
 [3] add(::String, ::Base.Pkg.Types.VersionSet) at .\pkg\entry.jl:51
 [4] (::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}})() at .\pkg\dir.jl:36
 [5] cd(::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}}, ::String) at .\file.jl:59
 [6] #cd#1(::Array{Any,1}, ::Function, ::Function, ::String, ::Vararg{String,N} where N) at .\pkg\dir.jl:36
 [7] add(::String) at .\pkg\pkg.jl:117

After removing package Bio, the installation works fine.

julia> Pkg.rm("Bio")
INFO: Upgrading IntervalTrees: v0.1.0 => v0.4.1
INFO: Removing BGZFStreams v0.2.0
INFO: Removing Bio v0.4.7
INFO: Removing Calculus v0.4.0
INFO: Removing CodecZlib v0.4.3
INFO: Removing ColorTypes v0.6.7
INFO: Removing Colors v0.8.2
INFO: Removing CommonSubexpressions v0.1.0
INFO: Removing DiffResults v0.0.3
INFO: Removing DiffRules v0.0.5
INFO: Removing Distributions v0.15.0
INFO: Removing FixedPointNumbers v0.4.6
INFO: Removing ForwardDiff v0.7.5
INFO: Removing Iterators v0.3.1
INFO: Removing LibExpat v0.4.2
INFO: Removing Libz v0.2.4
INFO: Removing LightGraphs v0.12.0
INFO: Removing LightXML v0.6.0
INFO: Removing MacroTools v0.4.2
INFO: Removing Missings v0.2.10
INFO: Removing NaNMath v0.3.1
INFO: Removing PDMats v0.8.0
INFO: Removing QuadGK v0.2.1
INFO: Removing Reexport v0.1.0
INFO: Removing Rmath v0.4.0
INFO: Removing Roots v0.6.0
INFO: Removing SimpleTraits v0.6.0
INFO: Removing SortingAlgorithms v0.2.1
INFO: Removing SpecialFunctions v0.6.0
INFO: Removing StaticArrays v0.7.2
INFO: Removing StatsBase v0.23.1
INFO: Removing StatsFuns v0.6.0
INFO: Removing WinRPM v0.3.2
WARNING: The following packages have been updated but were already imported:
- IntervalTrees
Restart Julia to use the updated versions.
INFO: Package database updated

julia> Pkg.add("BioAlignments")
INFO: Cloning cache of BioAlignments from https://github.com/BioJulia/BioAlignments.jl.git
INFO: Cloning cache of GenomicFeatures from https://github.com/BioJulia/GenomicFeatures.jl.git
INFO: Installing BGZFStreams v0.2.0
INFO: Installing BioAlignments v0.3.0
INFO: Installing ColorTypes v0.6.7
INFO: Installing FixedPointNumbers v0.4.6
INFO: Installing GenomicFeatures v0.2.1
INFO: Installing Libz v0.2.4
INFO: Package database updated

Your Environment

  • Julia 0.6.0 (same error on 0.6.3)
  • Windows 10

Installed packages:

show(STDOUT, "text/plain", sort(collect(Pkg.installed())))
57-element Array{Pair{String,VersionNumber},1}:
 "Automa"=>v"0.6.1"
 "BGZFStreams"=>v"0.2.0"
 "BinDeps"=>v"0.8.8"
 "BinaryProvider"=>v"0.3.2"
 "Bio"=>v"0.4.7"
 "BioCore"=>v"1.4.0"
 "BioSequences"=>v"0.8.3"
 "BioSymbols"=>v"2.0.0"
 "BufferedStreams"=>v"0.4.0"
 "Calculus"=>v"0.4.0"
 "CodecZlib"=>v"0.4.3"
 "ColorTypes"=>v"0.6.7"
 "Colors"=>v"0.8.2"
 "Combinatorics"=>v"0.6.0"
 "CommonSubexpressions"=>v"0.1.0"
 "Compat"=>v"0.69.0"
 "Conda"=>v"0.8.1"
 "DataStructures"=>v"0.8.3"
 "DiffResults"=>v"0.0.3"
 "DiffRules"=>v"0.0.5"
 "Distributions"=>v"0.15.0"
 "FixedPointNumbers"=>v"0.4.6"
 "ForwardDiff"=>v"0.7.5"
 "IJulia"=>v"1.8.0"
 "IndexableBitVectors"=>v"0.1.2"
 "IntervalTrees"=>v"0.1.0"
 "IterTools"=>v"0.2.1"
 "Iterators"=>v"0.3.1"
 "JSON"=>v"0.17.2"
 "LibExpat"=>v"0.4.2"
 "Libz"=>v"0.2.4"
 "LightGraphs"=>v"0.12.0"
 "LightXML"=>v"0.6.0"
 "MacroTools"=>v"0.4.2"
 "MbedTLS"=>v"0.5.11"
 "Missings"=>v"0.2.10"
 "NaNMath"=>v"0.3.1"
 "Nullables"=>v"0.0.5"
 "PDMats"=>v"0.8.0"
 "Polynomials"=>v"0.3.2"
 "QuadGK"=>v"0.2.1"
 "Reexport"=>v"0.1.0"
 "Rmath"=>v"0.4.0"
 "Roots"=>v"0.6.0"
 "SHA"=>v"0.5.7"
 "SimpleTraits"=>v"0.6.0"
 "SortingAlgorithms"=>v"0.2.1"
 "SpecialFunctions"=>v"0.6.0"
 "StaticArrays"=>v"0.7.2"
 "StatsBase"=>v"0.23.1"
 "StatsFuns"=>v"0.6.0"
 "TranscodingStreams"=>v"0.5.2"
 "Twiddle"=>v"0.4.0"
 "URIParser"=>v"0.3.1"
 "VersionParsing"=>v"1.1.1"
 "WinRPM"=>v"0.3.2"
 "ZMQ"=>v"0.6.2"

Inconsistent when aligning distinct sequence types

So I was surprised to find you can align Strings to each other:

julia> x= pairalign(LocalAlignment(), "ACA", "AAA", model)
PairwiseAlignmentResult{Int64, String, String}:
  score: 6
  seq: 1 ACA 3
         | |
  ref: 1 AAA 3

It's kind of cool that all it needs is a sequence of elements that implements convert(T, x) to the right type. But when displaying the sequence, it does not recognize that DNA_A == 'A'.

julia> x= pairalign(LocalAlignment(), "ACA", dna"AAA", model)
PairwiseAlignmentResult{Int64, String, LongDNASeq}:
  score: 6
  seq: 1 ACA 3

  ref: 1 AAA 3

We should also think about how to handle alignments of distinct sequence types. For example, how do you align two RNA sequences to each other? There is no substitution model, though obviously the DNA models could work. But since Strings are allowed to be used, we have an inconsistency: pairalign(LocalAlignment(), "AUA", dna"AAA", model) errors, but pairalign(LocalAlignment(), rna"AUA", dna"AAA", model) works, simply because convert(DNA, 'U') is an error, whereas convert(DNA, RNA_U) isn't.

Outdated documentation for SAM/BAM in BioAlignments is first hit in search

I've just tried following the example at https://biojulia.net/BioAlignments.jl/latest/hts-files.html and got the error that BAM is not defined in BioAlignments. After some searching, I found that I have to use the package XAM instead, i.e., in the example code given, using BioAlignments should be replaced with using XAM.

How come the BioAlignments documentation lists the API for SAM and BAM and gives examples even though these are not part of BioAlignments? Could it be that they used to be, and you recently moved them out into their own package, XAM, but forgot to update the documentation?

If so, please take this Issue as a gentle reminder. :-)

(My problem is solved, but being new to BioJulia, I was just thoroughly confused for a bit.)

Bug in extracting auxiliary data fields from BAM file

For some BAM records, looking up a nonexistent BAM auxiliary data field raises an unexpected error. This happens because getindex(record) calls getauxvalue, which calls findauxtag, which calls next_tag_position, which does not check data_size(record) and so runs "off the edge". As a result, the bytes after the proper end of the record are parsed as if they were auxiliary data, which results in an error.

Expected Behavior

julia> haskey(record, "QQ")
false

Current Behavior

julia> haskey(record, "QQ")
ERROR: invalid type tag: ''
Stacktrace:
 [1] next_tag_position(::Array{UInt8,1}, ::Int64) at /home/jakni/.julia/dev/BioAlignments/src/bam/auxdata.jl:150
 [2] haskey(::BioAlignments.BAM.Record, ::String) at /home/jakni/.julia/dev/BioAlignments/src/bam/auxdata.jl:109
 [3] top-level scope at none:0

Possible Solution / Implementation

I propose one of two changes:

  1. Changing AuxData so that it contains a view of the data rather than a copy. This makes it very cheap to instantiate, so the AuxData can be a field of the BAM record, instantiated when the BAM file is instantiated. All operations on the aux fields then simply operate on the already-existing lightweight AuxData object.
  2. Removing the AuxData object and rewriting all the aux operations to work directly on the record.

I can pitch a PR with either of those changes. I'd also like to add a get method to BAM records, similar to that of Dicts, to make it less awkward to tell Julia to "get this aux field if it exists, but don't raise an error".
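Independently of which of the two designs is chosen, the missing bounds check can be illustrated on a raw byte vector. This is a hedged sketch only: fixed_size, next_field and has_aux_tag are hypothetical names, not BioAlignments internals, and stop stands in for the last valid index of the record's auxiliary data. The field layout assumed here is the usual BAM one (two tag bytes, one type byte, then the value).

```julia
# Hypothetical sketch; none of these names exist in BioAlignments.

# Size in bytes of a fixed-width aux value type, or 0 if the type is not fixed-width.
function fixed_size(t::UInt8)
    t in (UInt8('A'), UInt8('c'), UInt8('C')) && return 1
    t in (UInt8('s'), UInt8('S')) && return 2
    t in (UInt8('i'), UInt8('I'), UInt8('f')) && return 4
    return 0
end

# Index of the field that follows the one starting at `pos`, never reading past `stop`.
function next_field(data::Vector{UInt8}, pos::Int, stop::Int)
    pos + 2 <= stop || error("truncated aux field")
    t = data[pos + 2]
    vpos = pos + 3                      # first byte of the value
    n = fixed_size(t)
    if n > 0
        return vpos + n
    elseif t == UInt8('Z') || t == UInt8('H')
        # NUL-terminated string: scan for the terminator, but only inside the record.
        while vpos <= stop && data[vpos] != 0x00
            vpos += 1
        end
        vpos <= stop || error("unterminated string value")
        return vpos + 1
    elseif t == UInt8('B')
        # Array: element type byte, little-endian 32-bit count, then the elements.
        vpos + 4 <= stop || error("truncated array header")
        elsize = fixed_size(data[vpos])
        elsize > 0 || error("invalid array element type")
        count = Int(data[vpos + 1]) | (Int(data[vpos + 2]) << 8) |
                (Int(data[vpos + 3]) << 16) | (Int(data[vpos + 4]) << 24)
        return vpos + 5 + elsize * count
    else
        error("invalid type tag")
    end
end

# True if a two-character tag occurs in data[start:stop].
function has_aux_tag(data::Vector{UInt8}, start::Int, stop::Int, tag::AbstractString)
    t1, t2 = UInt8(tag[1]), UInt8(tag[2])
    pos = start
    while pos + 2 <= stop               # room for tag (2 bytes) and type (1 byte)
        data[pos] == t1 && data[pos + 1] == t2 && return true
        pos = next_field(data, pos, stop)
    end
    return false
end
```

The loop condition is the important part: once pos passes stop the search ends and returns false, instead of interpreting whatever bytes follow the record in the buffer.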

Steps to Reproduce (for bugs)

  1. Open any BAM file (you can use this small BAM file)
  2. Run the following function to obtain a bad record:
function getbad(reader)
    record = BAM.Record()
    while !eof(reader)
        read!(reader, record)
        try
            x = haskey(record, "PW")
        catch
            close(reader)
            return record
        end
    end
end
  3. You can then provoke the error easily:
    haskey(record, "PW")

Friendly request to `@deprecate` API changes via semver

Hi BioAlignments,

Just wanted to put in a friendly request that API changes follow Julia's @deprecate pattern. For example, in package v0.1.x:

# src/AwesomeCode.jl
oldsignature(x::Int) = # does thing

Then the API change goes into v0.2.0 along with the deprecation

# src/AwesomeCode.jl
newsignature(x::Float64) = # does similar thing

# src/Deprecated.jl
@deprecate oldsignature(x::Int) newsignature(float(x))

This will give users a chance to update their scripts with a useful auto-generated warning, which tells them how to replace old calls with new ones.

Then remove the deprecation in v0.3

# src/AwesomeCode.jl
newsignature(x::Float64) = # does similar thing

Also consider listing breaking changes in a NEWS.md. Based on a quick search it looks like the @deprecate feature is not used here much.
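For reference, the v0.2.0 step can be tried as a runnable snippet. This is a minimal sketch: the function bodies are placeholders invented for illustration, and only the @deprecate line mirrors the pattern described above. Note that recent Julia versions (1.5 and later) hide deprecation warnings unless Julia is started with --depwarn=yes.

```julia
# Placeholder implementation of the new API (hypothetical body).
newsignature(x::Float64) = 2x

# Keep the old call working for one release cycle by forwarding it to the new API.
@deprecate oldsignature(x::Int) newsignature(float(x))

# Old callers still get a result; with --depwarn=yes they also get a warning
# that tells them to call newsignature(float(x)) instead.
result = oldsignature(3)
```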

Thanks for the great package!

How to get the score of two alignments?

Is there a way to get only the score of an existing alignment? Let's say that I have two aligned sequences, "MQD--RV--KRP" and "MKKL-K-K-H-P", and I want to get their score. Currently, I'm trying to do:

scoremodel = AffineGapScoreModel(BLOSUM62, gap_open=-3, gap_extend=-1)
res = pairalign(GlobalAlignment(), S1, S2, scoremodel, score_only=true)

But this gives me an error because of the - characters present in the S1 and S2 sequences. Is there a way to compute only the score of existing alignments, without actually aligning them? I have a list of alignments already computed and I want to compute their scores only. Any help/guidance would be much appreciated.
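One way to do this without pairalign is to walk the two gapped strings column by column. The sketch below is an illustration under stated assumptions, not a BioAlignments API: score_aligned and toy are hypothetical names, a gap of length k is assumed to score gap_open + k * gap_extend (my reading of the AffineGapScoreModel convention), columns with a gap in both rows are skipped, and toy is a stand-in for a real substitution matrix lookup such as BLOSUM62.

```julia
# Hypothetical helper: score two already-aligned, equal-length gapped strings.
# `subst` is any function (a, b) -> score for a pair of residues.
function score_aligned(s1::AbstractString, s2::AbstractString, subst;
                       gap_open::Int, gap_extend::Int)
    length(s1) == length(s2) ||
        throw(ArgumentError("aligned sequences must have equal length"))
    total = 0
    gap1 = false   # currently inside a gap in s1
    gap2 = false   # currently inside a gap in s2
    for (a, b) in zip(s1, s2)
        if a == '-' && b == '-'
            continue                       # gap in both rows: no pairwise information
        elseif a == '-'
            total += (gap1 ? 0 : gap_open) + gap_extend
            gap1, gap2 = true, false
        elseif b == '-'
            total += (gap2 ? 0 : gap_open) + gap_extend
            gap1, gap2 = false, true
        else
            total += subst(a, b)
            gap1 = gap2 = false
        end
    end
    return total
end

# Toy scoring function for illustration only: +2 for identity, -1 otherwise.
toy(a, b) = a == b ? 2 : -1

s = score_aligned("MQD--RV--KRP", "MKKL-K-K-H-P", toy; gap_open=-3, gap_extend=-1)
```

To use a real matrix, subst would need to look up the score for each residue pair; how to index BLOSUM62 from characters is left out here because it depends on the BioAlignments version.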

Registering BioAlignments.jl to BioRegistry?

Hi there!

Today I tried to use BioAlignments.jl along with BioSequences.jl 2.0 and other packages. However, this package is not registered in BioJuliaRegistry and thus it failed to install.
I've been away for a while, so I'm not quite sure about the current release cycle of BioJulia.
@benjward, could you tell me what I should do here?

BTW, I found BioSequences.jl seems to have the same UUID in both General and BioJuliaRegistry. Is it okay?

(v1.2) pkg> add BioSequences FASTX
 Resolving package versions...
 Installed BioSymbols ─── v4.0.1
 Installed FASTX ──────── v1.1.0
 Installed BioSequences ─ v2.0.0
  Updating `~/.julia/environments/v1.2/Project.toml`
  [7e6ae17a] + BioSequences v2.0.0
  [c2308a5c] + FASTX v1.1.0
  Updating `~/.julia/environments/v1.2/Manifest.toml`
  [67c07d97] + Automa v0.7.0
  [47718e42] + BioGenerics v0.1.0
  [7e6ae17a] + BioSequences v2.0.0
  [3c28c6f8] + BioSymbols v4.0.1
  [861a8166] + Combinatorics v0.7.0
  [864edb3b] + DataStructures v0.17.0
  [c2308a5c] + FASTX v1.1.0
  [1cb3b9ac] + IndexableBitVectors v1.0.0
  [bac558e1] + OrderedCollections v1.1.0
  [f27b6e38] + Polynomials v0.5.2
  [3bb67fe8] + TranscodingStreams v0.9.5
  [7200193e] + Twiddle v1.1.1
  [2a0f44e3] + Base64
  [8ba89e20] + Distributed
  [b77e0a4c] + InteractiveUtils
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [de0858da] + Printf
  [9a3f8284] + Random
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [8dfed614] + Test
  [4ec0a83e] + Unicode

(v1.2) pkg> add BioAlignments
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package BioAlignments [00701ae9]:
 BioAlignments [00701ae9] log:
 ├─possible versions are: [0.1.0, 0.2.0, 0.3.0, 1.0.0] or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions [0.1.0, 0.2.0, 0.3.0, 1.0.0]
 ├─restricted by julia compatibility requirements to versions: 1.0.0 or uninstalled, leaving only versions: 1.0.0
 └─restricted by compatibility requirements with BioSequences [7e6ae17a] to versions: uninstalled — no versions left
   └─BioSequences [7e6ae17a] log:
     ├─possible versions are: [0.5.0, 0.6.0-0.6.3, 0.7.0, 0.8.0-0.8.3, 1.0.0, 1.1.0, 2.0.0] or uninstalled
     └─restricted to versions 2.0.0 by an explicit requirement, leaving only versions 2.0.0

The API of accessing fields of BioAlignments

For context, please read https://discourse.julialang.org/t/accessing-type-internal-fields-in-package-interfaces/70263

In brief, I was teaching some students about BioAlignments recently. My fellow teachers, who did not know about the package, were surprised at the workflow a user is faced with. Let me give an example:

The pairalign function returns a PairwiseAlignmentResult, which contains a PairwiseAlignment, which again contains an Alignment. A user may, in the same breath, be interested in extracting information from all these three types, for example the score from the first, the number of matches of the second, and the CIGAR string from the third.

However, it is quite unclear, and undocumented, how the user is supposed to get these internal types from their initial PairwiseAlignmentResult. Both my fellow teachers and I thought that accessing a type's undocumented fields is messing with internal behaviour, but then how are you supposed to get e.g. the Alignment object? Further, how is a user even supposed to know they are to use dump to inspect the fields of the returned object? Weirdly, the alignment function, despite its name, does not actually return an Alignment object.

Here is what I propose:

  • We should improve the docstrings and docs for each type. What exactly is the difference between PairwiseAlignmentResult, PairwiseAlignment and Alignment?
  • We should add documented, exported functions to extract the relevant types, and have them displayed at the top of the docs for each type. Fields that are NOT directly relevant to the user should not have these getter functions.
  • When adding these features, we should have some simple tests that exercise basic functionality and make sure the user does not need to reach into internal struct layouts to perform simple tasks.

New BAM files can support more than 65535 cigar opcodes

samtools/hts-specs#40

There was discussion on the samtools-dev mailing list about this last year: http://sourceforge.net/p/samtools/mailman/message/30672431/ The main crux of the discussion there was to reuse the 16bit bin field to act as the top 16 bits of ncigar, possibly using the top flag bit as an indicator. There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely just a note to track the issue and collate ideas/fixes. Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not too far away with actual sequence reads too.

Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths, but also higher indel rates, leading to substantially more cigar opcodes. A 320Kb read with a 10% indel rate would lead to 2 changes (D/I to M) every 10 bases, giving 64k cigar ops. (Those aren't figures for real technologies, but it's not inconceivable.)
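
The estimate in the paragraph above can be checked with a few lines of Julia. This is only the arithmetic from the text; the variable names are illustrative, and the limit shown is simply the largest value a 16-bit unsigned field can hold.

```julia
# Rough estimate of the CIGAR operation count for a long, indel-rich read.
read_length = 320_000        # bases (the hypothetical read from the text)
changes_per_10_bases = 2     # one switch into a D/I and one switch back to M
cigar_ops = read_length ÷ 10 * changes_per_10_bases

max_ops_16bit = Int(typemax(UInt16))   # largest count a 16-bit field can store

println("estimated CIGAR ops: ", cigar_ops)
println("16-bit limit:        ", max_ops_16bit)
println("headroom:            ", max_ops_16bit - cigar_ops)
```

With these numbers the estimate sits just below the limit, which is why slightly longer or noisier reads would overflow the field.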

htslib spec:
samtools/hts-specs#227
samtools/hts-specs@ff8b54d

htslib patch:
samtools/htslib@aea349a

bamtools patch:

Support BAMs with >65535 CIGAR operations

Due to a design flaw, the original BAM format is unable to store an alignment
with >65535 CIGAR operations. The SAM/BAM specification maintainers have
decided to move the actual CIGAR to a CG optional tag and write a fake CIGAR
<readLen>S<refLen>N
at the original CIGAR place.

This PR recognizes the CG tag, seamlessly moves the real CIGAR back to its
right place, and updates the bin field accordingly. Library users need not take
any action.
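
As a rough illustration of the placeholder described above, the sketch below derives the `<readLen>S<refLen>N` string from a real CIGAR and recognizes it again. It is a standalone example in plain Julia, not the bamtools or BioAlignments implementation; the function names `parse_cigar`, `placeholder_cigar` and `is_placeholder` are hypothetical, and the example CIGAR is made up.

```julia
# Parse a CIGAR string such as "5S3M1I2M2D4M" into (length, op) pairs.
function parse_cigar(cigar::AbstractString)
    ops = Tuple{Int,Char}[]
    for m in eachmatch(r"(\d+)([MIDNSHP=X])", cigar)
        push!(ops, (parse(Int, m.captures[1]), m.captures[2][1]))
    end
    return ops
end

# Operations that consume read bases and reference bases (SAM conventions).
consumes_read(op::Char) = op in ('M', 'I', 'S', '=', 'X')
consumes_ref(op::Char)  = op in ('M', 'D', 'N', '=', 'X')

# Build the placeholder "<readLen>S<refLen>N" for a real CIGAR (hypothetical helper).
function placeholder_cigar(cigar::AbstractString)
    ops = parse_cigar(cigar)
    readlen = sum(Int[n for (n, op) in ops if consumes_read(op)])
    reflen  = sum(Int[n for (n, op) in ops if consumes_ref(op)])
    return string(readlen, "S", reflen, "N")
end

# A CIGAR looks like a placeholder if it is exactly "<seqlen>S<anything>N".
function is_placeholder(cigar::AbstractString, seqlen::Integer)
    ops = parse_cigar(cigar)
    return length(ops) == 2 && ops[1] == (seqlen, 'S') && ops[2][2] == 'N'
end

real_cigar = "5S3M1I2M2D4M"              # made-up example
fake_cigar = placeholder_cigar(real_cigar)
println(fake_cigar)
println(is_placeholder(fake_cigar, 15))
```

A reader that sees a placeholder of this shape would then look for the real CIGAR in the CG tag; that lookup is not shown here.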

The convert and sort commands of command-line bamtools have been tested on BAMs
containing the CG tag.

pezmaster31/bamtools@cbfd3e8

The CIGAR length problem is specific to the BAM format. The SAM format is not affected, since it can handle arbitrarily long CIGAR strings.

If you add support for long CIGAR strings, you should probably use @HD VN:1.6 in the header too:
jmarshall/hts-specs@3f5a63e

Bump version to VN:1.6 due to >65535 CIGAR strings

The new CG tag field BAM representation for long CIGAR strings
(PR #227, merged as dab57f4)
will be unnoticed by older code. Such code will see the placeholder
CIGAR string, so it needs to be possible to signal the presence of
CG tags via the @HD-VN version number header field.

https://github.com/BioJulia/BioAlignments.jl/blob/master/src/bam/record.jl contains a typo:

fixed-length fields (see BMA specs for the details)

Do these methods work and are they tested?

Do the following methods work?

"""
seq2ref(aln::Union{Alignment, AlignedSequence, PairwiseAlignment}, i::Integer)::Tuple{Int,Operation}
Map a position `i` from sequence to reference.
"""
seq2ref(aln::Alignment, i::Integer) = pos2pos(aln, i, seqpos, refpos)
"""
ref2seq(aln::Union{Alignment, AlignedSequence, PairwiseAlignment}, i::Integer)::Tuple{Int,Operation}
Map a position `i` from reference to sequence.
"""
ref2seq(aln::Alignment, i::Integer) = pos2pos(aln, i, refpos, seqpos)
"""
seq2aln(aln::Union{Alignment, AlignedSequence, PairwiseAlignment}, i::Integer)::Tuple{Int,Operation}
Map a position `i` from the input sequence to the alignment sequence.
"""
seq2aln(aln::Alignment, i::Integer) = pos2pos(aln, i, seqpos, alnpos)
"""
ref2aln(aln::Union{Alignment, AlignedSequence, PairwiseAlignment}, i::Integer)::Tuple{Int,Operation}
Map a position `i` from the reference sequence to the alignment sequence.
"""
ref2aln(aln::Alignment, i::Integer) = pos2pos(aln, i, refpos, alnpos)
"""
aln2seq(aln::Union{Alignment, AlignedSequence, PairwiseAlignment}, i::Integer)::Tuple{Int,Operation}
Map a position `i` from the alignment sequence to the input sequence.
"""
aln2seq(aln::Alignment, i::Integer) = pos2pos(aln, i, alnpos, seqpos)
"""
aln2ref(aln::Union{Alignment, AlignedSequence, PairwiseAlignment}, i::Integer)::Tuple{Int,Operation}
Map a position `i` from the alignment sequence to the reference sequence.
"""
aln2ref(aln::Alignment, i::Integer) = pos2pos(aln, i, alnpos, refpos)

I don't see what provides variables refpos, seqpos, and alnpos.

`ref2seq` returns a position outside of the sequence length

Expected Behavior

The ref2seq function should give a valid index that can be used to find the corresponding base in a BAM.sequence.

Current Behavior

ref2seq returns an invalid index position that is greater than the total length of the sample sequence.

Possible Solution / Implementation

My hunch is that ref2seq isn't smart enough to deal with the 'Hard Clip' (H) operation when counting anchors, as I've only encountered this bug on BAM alignments that start with H. I've looked at the source, but can't determine without attaching a debugger if that is indeed the case.
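
To make the hunch concrete, here is a minimal standalone sketch of a reference-to-read coordinate walk over a CIGAR. It is not the BioAlignments implementation: `ref_to_seq`, the `count_hardclips` switch and the example CIGAR are all hypothetical. It only shows that counting hard-clipped bases (which are absent from the stored read) as read bases would yield an index past the end of the stored sequence.

```julia
# Map a 1-based reference position to an index into the stored read.
# `cigar` is a vector of (length, op) pairs; `refstart` is the leftmost
# reference position of the alignment. Setting `count_hardclips = true`
# mimics the suspected bug of treating H as consuming read bases.
function ref_to_seq(cigar, refstart::Int, refpos::Int; count_hardclips::Bool = false)
    refpos < refstart && return nothing
    seqi = 0              # read bases consumed so far
    refi = refstart - 1   # last reference position consumed so far
    for (len, op) in cigar
        uses_seq = op in ('M', 'I', 'S', '=', 'X') || (count_hardclips && op == 'H')
        uses_ref = op in ('M', 'D', 'N', '=', 'X')
        if uses_ref && refpos <= refi + len
            # The target lies inside this operation. For D/N there is no read
            # base, so report the last read position before the gap.
            return uses_seq ? seqi + (refpos - refi) : seqi
        end
        uses_seq && (seqi += len)
        uses_ref && (refi += len)
    end
    return nothing        # refpos is not covered by the alignment
end

cigar = [(200, 'H'), (100, 'M')]   # made-up alignment starting with a hard clip
stored_length = 100                # hard-clipped bases are not stored in the read

good = ref_to_seq(cigar, 1, 50)
bad  = ref_to_seq(cigar, 1, 50; count_hardclips = true)
println("ignoring H: ", good, " (valid: ", good <= stored_length, ")")
println("counting H: ", bad, " (valid: ", bad <= stored_length, ")")
```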

Steps to Reproduce (for bugs)

  1. Download the offending reference.fasta and sample.bam
  2. Run the following
using BioAlignments
using FASTX
using XAM

# Get the reference sequence
reference_reader   = open(FASTA.Reader, "reference.fasta")
reference_record   = read(reference_reader)
reference_sequence = sequence(reference_record)
close(reference_reader)

# Read the problematic alignment
bam_reader = open(BAM.Reader, "sample.bam")
bam_record = read(bam_reader)
aligned_sequence = AlignedSequence(reference_sequence, BAM.alignment(bam_record))

# Let's look at position 50 in the reference
reference_sequence[50]

# Is reference position 50 within the BAM sequence?
(50 in BAM.position(bam_record):BAM.rightposition(bam_record)) && println("it is")

# Find where reference position 50 is relative to the sequence
pos = ref2seq(aligned_sequence, 50)[1]

# Look at position 50 within the sequence
BAM.sequence(bam_record)[pos]

close(bam_reader)

Output

DNA_G
it is
ERROR: BoundsError: attempt to access BioSequences.LongDNASeq at index [284]
Stacktrace:
 [1] checkbounds
   @ ~/.julia/packages/BioSequences/k4j4J/src/biosequence/indexing.jl:117 [inlined]
 [2] getindex(seq::BioSequences.LongDNASeq, i::Int64)
   @ BioSequences ~/.julia/packages/BioSequences/k4j4J/src/biosequence/indexing.jl:142
 [3] top-level scope
   @ REPL[37]:1

Context

I'm trying to identify and compare variants at specific locations along the reference genome. Basically, I'd like to call a function with the reference genome position and the BAM alignment, and get the base that was called at that position. The ultimate goal is using this data in variant calling-type applications.

Your Environment

  • Package Version used: 2.0.0
  • Julia Version used: 1.6.1
  • Operating System and version (desktop or mobile): Fedora 34 (x86_64)
  • Link to your project: N/A
  • Installed packages
    • BioAlignments
    • BioSequences
    • BioSymbols
    • CSV
    • Combinatorics
    • DataFrames
    • FASTX
    • HypothesisTests
    • XAM

"master" vs "develop" branches

What is the status of these branches?
Right now PRs are merged into "master", but docs/make.jl, for example, is still configured to treat "develop" as the development branch, and the docs and readme also mention "develop" as the development branch.

I think it should be configured to be consistent with the rest of BioJulia, the ambiguous branch should be removed, and docs updated.

Base.convert(BAM.Record, x::Vector{UInt8}) not updated to Julia v 1.0

Cannot convert Vector{UInt8} to BAM.Record

Summary
The Base.convert(::Type{Record}, data::Vector{UInt8}) method in src/bam/record.jl relies on the now-removed function unsafe_copy!. This means that it's no longer possible to instantiate a BAM.Record from a Vector{UInt8}:

julia> x = BAM.Record(rand(UInt8, 1000))
ERROR: UndefVarError: unsafe_copy! not defined
Stacktrace:
 [1] convert(::Type{BioAlignments.BAM.Record}, ::Array{UInt8,1}) at /home/jakni/.julia/dev/BioAlignments/src/bam/record.jl:38
 [2] BioAlignments.BAM.Record(::Array{UInt8,1}) at /home/jakni/.julia/dev/BioAlignments/src/bam/record.jl:33
 [3] top-level scope at none:0

Solution
Replace the current function in src/bam/record.jl (line 36) with this:

function Base.convert(::Type{Record}, data::Vector{UInt8})
    record = Record()
    dst_pointer = Ptr{UInt8}(pointer_from_objref(record))
    unsafe_copyto!(dst_pointer, pointer(data), FIXED_FIELDS_BYTES)
    dsize = data_size(record)
    resize!(record.data, dsize)
    unsafe_copyto!(record.data, 1, data, FIXED_FIELDS_BYTES + 1, dsize)
    return record
end

It would probably also be good to add some bounds checks to this function:

function Base.convert(::Type{Record}, data::Vector{UInt8})
    length(data) < FIXED_FIELDS_BYTES && throw(ArgumentError("data too short"))
    record = Record()
    dst_pointer = Ptr{UInt8}(pointer_from_objref(record))
    unsafe_copyto!(dst_pointer, pointer(data), FIXED_FIELDS_BYTES)
    dsize = data_size(record)
    resize!(record.data, dsize)
    length(data) < dsize + FIXED_FIELDS_BYTES && throw(ArgumentError("data too short"))
    unsafe_copyto!(record.data, 1, data, FIXED_FIELDS_BYTES + 1, dsize)
    return record
end
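
For comparison, the same kind of copy can be written with the bounds-checked `copyto!` from Base, which throws instead of reading past the end of the input. The snippet below is a standalone sketch on plain byte vectors; `header_bytes` is a made-up stand-in for the fixed-length prefix and has nothing to do with the real BAM record layout.

```julia
# Split a byte vector into a fixed-size prefix and a variable-size payload.
src = UInt8[0x01, 0x02, 0x03, 0x04, 0x05, 0x06]
header_bytes = 2    # hypothetical size of the fixed-length part

payload = Vector{UInt8}(undef, length(src) - header_bytes)
# Same argument order as unsafe_copyto!, but with bounds checks.
copyto!(payload, 1, src, header_bytes + 1, length(payload))
println(payload)

# Asking for more bytes than are available raises an error instead of
# silently reading out of bounds.
too_many = length(src)
overflow_caught = try
    copyto!(payload, 1, src, header_bytes + 1, too_many)
    false
catch err
    err isa BoundsError
end
println("overflow caught: ", overflow_caught)
```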

OP_BACK has no use?

@MillironX discovered that OP_BACK is not used in BioAlignments.jl, but listed as being present for compatibility with the SAM specs. However, as he also discovered, the operation is not listed in the specs either.

What's the deal with it? If no-one remembers, let's just remove it.

Obscure error when aligning sequences with gaps

MWE:

model = AffineGapScoreModel(EDNAFULL, gap_open=-12, gap_extend=-2)
pairalign(LocalAlignment(), dna"A-A", dna"AAA", model)

Errors with:
ERROR: LoadError: BoundsError: attempt to access 15×15 Matrix{Int64} at index [0, 1]

It seems clear to me that the problem is that the score table has no entry for matching against a gap (a gap being encoded as 0), so the lookup goes out of bounds.
The solution here should be to throw an error if the sequence contains any gaps.
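
A minimal sketch of such a check, written against plain strings so that it runs without any packages. The function name `assert_gapless` and the `isgap` keyword are hypothetical; a real fix inside the package would presumably use the gap predicate of the sequence alphabet instead of comparing against `'-'`.

```julia
# Reject inputs that contain gap symbols before they reach the aligner.
# `isgap` decides what counts as a gap; the default is an assumption for strings.
function assert_gapless(seq; isgap = c -> c == '-')
    for (i, symbol) in enumerate(seq)
        if isgap(symbol)
            throw(ArgumentError("sequence contains a gap at position $i; remove gaps before aligning"))
        end
    end
    return seq
end

assert_gapless("AAA")            # passes and returns the input

message = try
    assert_gapless("A-A")
    "no error"
catch err
    sprint(showerror, err)
end
println(message)
```

The point of the design is that the user sees an ArgumentError naming the offending position rather than a BoundsError from deep inside the scoring matrix.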

Weird error with `AffineGapScoreModel()`

julia> AffineGapScoreModel()
ERROR: KeyError: key :match not found
Stacktrace:
 [1] getindex(h::Dict{Symbol, Union{}}, key::Symbol)
   @ Base ./dict.jl:498
 [2] AffineGapScoreModel(; scores::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ BioAlignments ~/code/BioAlignments.jl/src/models.jl:97
 [3] AffineGapScoreModel()
   @ BioAlignments ~/code/BioAlignments.jl/src/models.jl:95
 [4] top-level scope
   @ REPL[7]:1
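
One way to turn the KeyError above into a readable message is to validate the keyword arguments before indexing into them. The sketch below is a standalone illustration, not the package's constructor: `checked_scores` is a hypothetical helper, and the list of required keys is an assumption based on the keywords that appear in these issues.

```julia
# Collect keyword scores and fail early with a descriptive error.
function checked_scores(; scores...)
    required = (:match, :mismatch, :gap_open, :gap_extend)   # assumed key names
    missing_keys = [k for k in required if !haskey(scores, k)]
    if !isempty(missing_keys)
        throw(ArgumentError("missing required keyword argument(s): " * join(missing_keys, ", ")))
    end
    return (match = scores[:match], mismatch = scores[:mismatch],
            gap_open = scores[:gap_open], gap_extend = scores[:gap_extend])
end

params = checked_scores(match = 5, mismatch = -4, gap_open = -12, gap_extend = -2)
println(params)

empty_call_message = try
    checked_scores()
    "no error"
catch err
    sprint(showerror, err)
end
println(empty_call_message)
```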

find() instead of findall() in hts-files.md

There seems to be a typo in the file BioAlignments.jl/docs/src/hts-files.md

Expected Behavior

I expected this code to work in Julia 1.1.0:

find(header(reader), "SQ")

I suggest using findall() instead of find(), like this:

findall(header(reader), "SQ")

Current Behavior

The current code gives this error:

UndefVarError: find not defined

The suggested change instead results in this:

2-element Array{BioAlignments.SAM.MetaInfo,1}:
 BioAlignments.SAM.MetaInfo:
    tag: SQ
  value: SN=chr1 LN=1575
 BioAlignments.SAM.MetaInfo:
    tag: SQ
  value: SN=chr2 LN=1584

Possible Solution / Implementation

Replace find() with findall() for Julia 1.x.

Steps to Reproduce (for bugs)

Download the file ex1_header.sam, then run this code in Atom

using BioAlignments
reader = open(SAM.Reader, "ex1_header.sam");
find(header(reader), "SQ")

Context

Obviously, I am a Julia beginner. I am trying to learn how to read a SAM file and a BAM file using BioAlignments. The documentation refers to "data.sam" and "data.bam" which are only generic hypothetical files. I would like to have a fully reproducible example with a real file instead such as ex1_header.sam from https://github.com/BioJulia/BioFmtSpecimens/tree/master/SAM

I tried to add Bio but received this error message:

(v1.1) pkg> add Bio
  Updating registry at `C:\Users\aalexandersson\.julia\registries\General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Bio [3637df68]:
 Bio [3637df68] log:
 ├─possible versions are: [0.1.0, 0.2.0-0.2.3, 0.3.0, 0.4.0-0.4.7] or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions [0.1.0, 0.2.0-0.2.3, 0.3.0, 0.4.0-0.4.7]
 └─restricted by julia compatibility requirements to versions: uninstalled - no versions left

Then I successfully installed BioAlignments individually.

Your Environment

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, nehalem)
Environment:
  JULIA_EDITOR = "C:\Users\aalexandersson\AppData\Local\atom\app-1.29.0\atom.exe" -a
  JULIA_NUM_THREADS = 4
(v1.1) pkg> status
    Status `C:\Users\aalexandersson\.julia\environments\v1.1\Project.toml`
  [c52e3926] Atom v0.7.14
  [9e28174c] BinDeps v0.8.10
  [00701ae9] BioAlignments v1.0.0
  [37cfa864] BioCore v2.0.5
  [7e6ae17a] BioSequences v1.1.0
  [159f3aea] Cairo v0.5.6
  [34da2185] Compat v1.5.1
  [a93c6f00] DataFrames v0.17.0
  [186bb1d3] Fontconfig v0.2.0
  [c91e804a] Gadfly v1.0.1
  [c43c736e] Genie v0.1.0 #master (https://github.com/essenciary/Genie.jl)
  [e5e0dc1b] Juno v0.5.4
  [ce6b1742] RDatasets v0.6.1
  [b8865327] UnicodePlots v1.0.1

Can't be compiled by PackageCompiler

Expected Behavior

The package is compiled into a system image.

Current Behavior

Compilation fails.

Possible Solution / Implementation

From the error report, it seems that something is wrong in the file .julia/packages/BioSequences/7i86L/test/kmers/shuffle.jl

Steps to Reproduce (for bugs)

using PackageCompiler
compile_package("BioAlignments")

Context

I am trying to speed up Julia startup by compiling the packages I use into a system image. All other packages compiled fine except BioAlignments. The error report is listed below:

  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/.julia/dev/PackageCompiler/packages/BioSequences/Project.toml`
  [67c07d97] + Automa v0.8.0
  [37cfa864] + BioCore v2.0.5
  [3c28c6f8] + BioSymbols v3.1.0
  [e1450e63] + BufferedStreams v1.0.0
  [861a8166] + Combinatorics v0.7.0
  [1cb3b9ac] + IndexableBitVectors v1.0.0
  [524e6230] + IntervalTrees v1.0.0
  [7200193e] + Twiddle v1.1.0
  [37e2e46d] + LinearAlgebra 
  [de0858da] + Printf 
  [9a3f8284] + Random 
  Updating `~/.julia/dev/PackageCompiler/packages/BioSequences/Manifest.toml`
  [67c07d97] + Automa v0.8.0
  [37cfa864] + BioCore v2.0.5
  [3c28c6f8] + BioSymbols v3.1.0
  [e1450e63] + BufferedStreams v1.0.0
  [19ecbf4d] + Codecs v0.5.0
  [861a8166] + Combinatorics v0.7.0
  [34da2185] + Compat v2.1.0
  [864edb3b] + DataStructures v0.15.0
  [1cb3b9ac] + IndexableBitVectors v1.0.0
  [524e6230] + IntervalTrees v1.0.0
  [bac558e1] + OrderedCollections v1.1.0
  [f27b6e38] + Polynomials v0.5.2
  [3bb67fe8] + TranscodingStreams v0.9.4
  [7200193e] + Twiddle v1.1.0
  [ddb6d928] + YAML v0.3.2
  [2a0f44e3] + Base64 
  [ade2ca70] + Dates 
  [8bb1440f] + DelimitedFiles 
  [8ba89e20] + Distributed 
  [b77e0a4c] + InteractiveUtils 
  [76f85450] + LibGit2 
  [8f399da3] + Libdl 
  [37e2e46d] + LinearAlgebra 
  [56ddb016] + Logging 
  [d6f4376e] + Markdown 
  [a63ad114] + Mmap 
  [44cfe95a] + Pkg 
  [de0858da] + Printf 
  [9abbd945] + Profile 
  [3fa0cd96] + REPL 
  [9a3f8284] + Random 
  [ea8e919c] + SHA 
  [9e88b42a] + Serialization 
  [1a1011a3] + SharedArrays 
  [6462fe0b] + Sockets 
  [2f01184e] + SparseArrays 
  [10745b16] + Statistics 
  [8dfed614] + Test 
  [cf7118a7] + UUIDs 
  [4ec0a83e] + Unicode 
 Resolving package versions...
  Updating `~/.julia/dev/PackageCompiler/packages/BioSequences/Project.toml`
  [9b87118b] + PackageCompiler v0.6.3
  [44cfe95a] + Pkg 
  Updating `~/.julia/dev/PackageCompiler/packages/BioSequences/Manifest.toml`
  [c7e460c6] + ArgParse v0.6.2
  [9e28174c] + BinDeps v0.8.10
  [b99e7846] + BinaryProvider v0.5.4
  [0862f596] + HTTPClient v0.2.1
  [b27032c2] + LibCURL v0.5.0
  [522f3ed2] + LibExpat v0.5.0
  [2ec943e9] + Libz v1.0.0
  [9b87118b] + PackageCompiler v0.6.3
  [b718987f] + TextWrap v0.3.0
  [30578b45] + URIParser v0.4.0
  [c17dfb99] + WinRPM v0.4.2

signal (11): Segmentation fault
in expression starting at /home/chensy/.julia/packages/BioSequences/7i86L/test/kmers/shuffle.jl:1
jl_static_show_x_ at /buildworker/worker/package_linux64/build/src/rtutils.c:657
jl_static_show_x at /buildworker/worker/package_linux64/build/src/rtutils.c:1030 [inlined]
jl_static_show_x_ at /buildworker/worker/package_linux64/build/src/rtutils.c:702
jl_static_show_x at /buildworker/worker/package_linux64/build/src/rtutils.c:1030 [inlined]
jl_static_show at /buildworker/worker/package_linux64/build/src/rtutils.c:1035
jl_mt_assoc_by_type at /buildworker/worker/package_linux64/build/src/gf.c:1130
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2187
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:323
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:411
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:362 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:773
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:689
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:689
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:689
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:885
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7ff0a474e2bf)
unknown function (ip: 0x66)
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:894
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:764
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:883
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:826
include at ./boot.jl:326 [inlined]
include_relative at ./loading.jl:1038
include at ./sysimg.jl:29 [inlined]
include at /home/chensy/.julia/packages/BioSequences/7i86L/test/runtests.jl:1
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2219
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:323
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:411
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:362 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:773
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:689
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:689
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:885
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7ff0cbab617f)
unknown function (ip: 0x33)
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:894
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:764
jl_eval_module_expr at /buildworker/worker/package_linux64/build/src/toplevel.c:179
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:614
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:883
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:826
include at ./boot.jl:326 [inlined]
include_relative at ./loading.jl:1038
include at ./sysimg.jl:29
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2219
include at ./client.jl:403
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1864
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2219
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:323
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:411
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:362 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:773
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:689
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:885
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7ff0cb28a49f)
unknown function (ip: (nil))
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:894
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:764
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:883
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:826
include at ./boot.jl:326 [inlined]
include_relative at ./loading.jl:1038
include at ./sysimg.jl:29
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2219
exec_options at ./client.jl:267
_start at ./client.jl:436
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2219
jl_apply at /buildworker/worker/package_linux64/build/ui/../src/julia.h:1571 [inlined]
true_main at /buildworker/worker/package_linux64/build/ui/repl.c:96
main at /buildworker/worker/package_linux64/build/ui/repl.c:217
__libc_start_main at /build/eglibc-ripdx6/eglibc-2.19/csu/libc-start.c:287
_start at /home/chensy/julia-1.1.0/bin/julia (unknown line)
Allocations: 140966623 (Pool: 140931781; Big: 34842); GC: 278
[ Info: Registered package BioSequences, using already given UUID: 7e6ae17a-c86d-528c-b3b9-7f778a29fe59
ERROR: LoadError: failed process: Process(`/home/chensy/julia-1.1.0/bin/julia --compile=all --optimize=0 -g1 --trace-compile=/home/chensy/.julia/dev/PackageCompiler/packages/precompile_tmp.jl --history-file=yes --code-coverage=none --inline=yes --math-mode=ieee --handle-signals=yes --warn-overwrite=no --compile=yes --depwarn=yes --cpu-target=native --track-allocation=none --sysimage-native-code=yes --sysimage=/home/chensy/julia-1.1.0/lib/julia/sys.so -g1 --compiled-modules=yes --optimize=2 /home/chensy/.julia/dev/PackageCompiler/sysimg/run_julia_code.jl`, ProcessSignaled(11)) [0]
Stacktrace:
 [1] error(::String, ::Base.Process, ::String, ::Int64, ::String) at ./error.jl:42
 [2] pipeline_error at ./process.jl:785 [inlined]
 [3] #run#515(::Bool, ::Function, ::Cmd) at ./process.jl:726
 [4] #run_julia#1 at ./process.jl:724 [inlined]
 [5] (::getfield(PackageCompiler, Symbol("#kw##run_julia")))(::NamedTuple{(:compile, :O, :g, :trace_compile),Tuple{String,Int64,Int64,String}}, ::typeof(PackageCompiler.run_julia), ::String) at ./none:0
 [6] snoop(::Symbol, ::String, ::String, ::String, ::Bool, ::Array{Any,1}) at /home/chensy/.julia/dev/PackageCompiler/src/snooping.jl:34
 [7] (::getfield(PackageCompiler, Symbol("##35#37")){Array{Any,1},Tuple{Symbol},Dict{Any,Any},String,Dict{String,Array{Dict{String,Any},1}}})(::IOStream) at /home/chensy/.julia/dev/PackageCompiler/src/snooping.jl:124
 [8] #open#310(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(PackageCompiler, Symbol("##35#37")){Array{Any,1},Tuple{Symbol},Dict{Any,Any},String,Dict{String,Array{Dict{String,Any},1}}}, ::String, ::Vararg{String,N} where N) at ./iostream.jl:369
 [9] open at ./iostream.jl:367 [inlined]
 [10] #snoop_packages#34 at /home/chensy/.julia/dev/PackageCompiler/src/snooping.jl:110 [inlined]
 [11] #snoop_packages at ./none:0 [inlined]
 [12] #compile_package#67(::Bool, ::Bool, ::Bool, ::Nothing, ::Bool, ::Function, ::Tuple{String,String}) at /home/chensy/.julia/dev/PackageCompiler/src/PackageCompiler.jl:122
 [13] #compile_package#64 at /home/chensy/.julia/dev/PackageCompiler/src/PackageCompiler.jl:116 [inlined]
 [14] compile_package(::String) at /home/chensy/.julia/dev/PackageCompiler/src/PackageCompiler.jl:88
 [15] top-level scope at none:0
 [16] include at ./boot.jl:326 [inlined]
 [17] include_relative(::Module, ::String) at ./loading.jl:1038
 [18] include(::Module, ::String) at ./sysimg.jl:29
 [19] exec_options(::Base.JLOptions) at ./client.jl:267
 [20] _start() at ./client.jl:436
in expression starting at /home/chensy/jatt/generate_log.jl:2

Your Environment

  • Package Version used: BioAlignments 1.0.0
  • Julia Version used: 1.1.0
  • Operating System and version (desktop or mobile): Ubuntu 16.04
  • Link to your project:

TagBot not triggering.

When v2.0.1 was released for #66, no GitHub release was created. I tried running the workflow manually, to no avail.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Extract start and end positions from a PairwiseAlignment

Hello, I have a feature request that is related to some other open issues about the API of this package.

Currently, there is no API to extract the start and end positions from a PairwiseAlignment. It would be great if the client could easily extract the location of matches (start, end) between the reference and the sequence of the PairwiseAlignment object.

Here is the hack that I wrote to solve this problem for myself. It would be great if the API could expose an abstraction for the locations of matches, or some analogous functions, so that the client doesn't have to mess with internal "private" fields.

function GetPairwiseAlignmentSeqStart(aln::PairwiseAlignment)
    return aln.a.aln.anchors[1].seqpos + 1
end

function GetPairwiseAlignmentRefStart(aln::PairwiseAlignment)
    return aln.a.aln.anchors[1].refpos + 1
end

function GetPairwiseAlignmentSeqEnd(aln::PairwiseAlignment)
    return aln.a.aln.anchors[end].seqpos
end

function GetPairwiseAlignmentRefEnd(aln::PairwiseAlignment)
    return aln.a.aln.anchors[end].refpos
end

Thanks for your consideration, hope this suggestion is helpful for both users as well as the designers and implementors of this package.

Support the mpileup format

Currently we don't have any support for mpileup files.

The format is simple enough that an Automa machine should be able to handle it easily.
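
To give a feel for how simple the format is, here is a minimal sketch that parses one single-sample pileup line by splitting on tabs. It deliberately does not use Automa and is not a proposal for the final design; `PileupRecord` and `parse_pileup_line` are hypothetical names, the column order (chromosome, position, reference base, depth, read bases, qualities) is the usual single-sample layout, and the example line is made up.

```julia
# One line of a single-sample pileup file.
struct PileupRecord
    chrom::String
    pos::Int          # 1-based reference position
    refbase::Char
    depth::Int        # number of reads covering the position
    bases::String     # read bases column, left undecoded here
    quals::String     # base qualities column, left undecoded here
end

function parse_pileup_line(line::AbstractString)
    fields = split(chomp(line), '\t')
    length(fields) >= 6 ||
        throw(ArgumentError("expected at least 6 tab-separated fields, got $(length(fields))"))
    return PileupRecord(String(fields[1]), parse(Int, fields[2]), fields[3][1],
                        parse(Int, fields[4]), String(fields[5]), String(fields[6]))
end

record = parse_pileup_line("seq1\t272\tT\t3\t.,.\t<<;\n")
println(record.chrom, ":", record.pos, " ref=", record.refbase, " depth=", record.depth)
```

Decoding the read bases column (indels, read start and end markers) is where a state machine would earn its keep; that part is left out of this sketch.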
