pacificbiosciences / pbcore

A Python library for reading and writing PacBio® data files

License: Other

Makefile 0.24% Python 99.42% Shell 0.32% M4 0.02%

pbcore's Introduction

pbcore

The pbcore package provides Python APIs for interacting with PacBio data files and writing bioinformatics applications.

Availability

The latest version can be installed via the bioconda package pbcore.

Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.

Documentation:

http://pacificbiosciences.github.io/pbcore/

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

pbcore's People

Contributors

0xaf1f, armintoepfer, bli2023, dalexander, derekwbarnett, llimeht, marklakata, mdsmith, mpkocher, natechols, nlhepler, pb-dseifert, pbjd, ylipacbio


pbcore's Issues

Installing pbcore

I had no errors installing pbcore. However, upon running the Falcon scripts which require pbcore, Python throws an error that there is no pbcore.io from which to import FastaReader.

python /Desktop/program/FALCON-master/src/py_scripts/falcon_overlap.py
Traceback (most recent call last):
  File "/Desktop/program/FALCON-master/src/py_scripts/falcon_overlap.py", line 41, in <module>
    from pbcore.io import FastaReader
ImportError: No module named io
Thanks

1.3.0 planning

I want 1.3.0 to be about

  • major cleanups, including at least one that could break compatibility to a small degree.
  • removal of deprecated functions (like movieTable accessors, etc.)

BAM support originally incubated in the context of aligned BAM files; the support predated our move to using BAM for raw basecalls.

We see this today in the names of some of the classes: BamAlignment, for example, really ought to be BamRecord now. I want to fix all of these. The compatibility impact should be low because clients don't construct these objects directly; they get them from slicing a reader object.

There are also a couple of API calls that are really awkward to use on unaligned BAM files. In particular, access to raw read data is awkward. To get the basecalls in an unaligned BAM record, you'd really like to be able to do:

    >>> record.read()

but that fails because the default args are aligned=True, orientation="native", and aligned=True raises an exception on unmapped reads. You need to call

   >>> record.read(aligned=False, orientation="native")

I would like these to be the new defaults, because they work well for both aligned and unaligned records.

I could add an alignedRead accessor as well, to make it a little more convenient.
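
The default-argument problem can be illustrated with a minimal stand-in for the record class (FakeBamRecord is a hypothetical toy, not pbcore API):

```python
class FakeBamRecord:
    """Toy stand-in for BamAlignment to illustrate the default-args issue."""

    def __init__(self, seq, is_mapped):
        self.seq = seq
        self.is_mapped = is_mapped

    def read(self, aligned=True, orientation="native"):
        # mirrors the current behaviour: aligned=True on an unmapped
        # record raises instead of returning the basecalls
        if aligned and not self.is_mapped:
            raise ValueError("read is unmapped; call with aligned=False")
        return self.seq

rec = FakeBamRecord("ACGT", is_mapped=False)
# rec.read() would raise here; the verbose call is currently required:
print(rec.read(aligned=False, orientation="native"))  # ACGT
```

Flipping the default to aligned=False would make the bare rec.read() call work for both mapped and unmapped records.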

inputs to genomicconsensus with arrow algorithm

I have assembled a de novo genome (1.98 Gb) with canu v1.6 using PacBio reads. I am at the polishing stage and have aligned the raw subreads.bam to the assembly with blasr in batches, as the process ran out of allocated walltime when all reads were submitted at once. I therefore have 6 large alignment.bam files: 342G, 284G, 240G, 117G, 154G, and 78G.

I was trying to merge the BAM files using pbmerge; however, this too was a very long process and at one stage it failed. Is it possible to run these individual alignment.bam files as inputs to the arrow algorithm and get 6 fasta outputs? My thinking is that these would be much smaller files to merge, but I am not sure if this is valid.

pyxb no longer supported upstream

Hello,
pabigot/pyxb#100 seeks a new maintainer. Linux distributions have since started to remove it (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946602). pbcore declares a dependency on it ('pyxb == 1.2.6') for validation, if I understand correctly.
What should happen now? The best possible answer would probably be for Pacific Biosciences to start maintaining that package, or to fund someone else to maintain it. What should we users do in the meantime?
Cheers,
Steffen

Systematic tagging of new releases and upload to PyPI

Please consider systematically tagging releases and uploading tarballs to PyPI in the future. None of 1.2.11, 1.2.12, 1.3.0, and now 1.4.0 were tagged or pushed to PyPI, which makes tracking new releases downstream more difficult than it should be. Right now, anything requiring pbcore still downloads the outdated 1.2.10 version from PyPI.

Thanks,

"only PacBio BAM files version >= 3.0b3 are supported"

Hello, I've tried to use pbcore to read an aligned-reads BAM file from smrtportal after an HGAP assembly, but receive the error: pbcore.io.align._BamSupport.IncompatibleFile: This BAM file is incompatible with this API (only PacBio BAM files version >= 3.0b3 are supported)

It seems to me that this should Just Work, so I'm wondering if I'm missing something.

  1. smrtanalysis_2.3.0.140936.p2.144836
  2. HGAP assembly finishes
  3. download the Aligned Reads *.bam and *.bai files
  4. try to do a simple open with the following script
  5. get error message
import pbcore.io

with pbcore.io.BamReader("016496_data_aligned_reads.bam"):
    pass

Traceback (most recent call last):
  File "read-bam.py", line 3, in <module>
    with pbcore.io.BamReader("016496_data_aligned_reads.bam"):
  File "/Users/agnor/.virtualenvs/tmp-5cfd6e743e152dc4/lib/python2.7/site-packages/pbcore-1.0.0-py2.7.egg/pbcore/io/align/BamIO.py", line 308, in __init__
    super(BamReader, self).__init__(fname, referenceFastaFname)
  File "/Users/agnor/.virtualenvs/tmp-5cfd6e743e152dc4/lib/python2.7/site-packages/pbcore-1.0.0-py2.7.egg/pbcore/io/align/BamIO.py", line 166, in __init__
    self._checkFileCompatibility()
  File "/Users/agnor/.virtualenvs/tmp-5cfd6e743e152dc4/lib/python2.7/site-packages/pbcore-1.0.0-py2.7.egg/pbcore/io/align/BamIO.py", line 161, in _checkFileCompatibility
    "(only PacBio BAM files version >= 3.0b3 are supported)")
pbcore.io.align._BamSupport.IncompatibleFile: This BAM file is incompatible with this API (only PacBio BAM files version >= 3.0b3 are supported)
$ pip freeze
Cython==0.22
h5py==2.5.0
numpy==1.9.2
pbcore==1.0.0
pysam==0.8.1
six==1.9.0
wsgiref==0.1.2

how to generate bam.pbi (PacBio BAM index)?

in the pbcore docs, I've seen references to readers that require a "bam.pbi (PacBio BAM index)".

But how do I create this? I see nothing in the docs, and I'm assuming this is different from the usual bam.bai index file.

Thank you.

Trouble building this with anaconda

On trying to build this with anaconda, I get an error message saying that pysam >= 0.8.4 is needed. I do have pysam 0.8.4 on my machine. I'd appreciate any help with this.

After running

conda skeleton pypi pbcore

I get the following error while running

conda build pbcore

[screenshot: conda build error message]

I do have pysam 0.8.4 in my conda list

Thanks,
Govinda.

[Issue] Distribution of encoded IPDs does not match the technical specification

Hello everyone,

I am currently working on the IPDs of PacBio experiments.

I have read much about the lossy 8 bits encoding of the IPDs so I thought it would be easy for me to decode/re-encode if needed (and I really need it actually).

Before doing so, I had the idea of plotting the coded IPD's distribution for both Total and every base separated as follows

[figure: distribution of encoded IPD codes, total and per base]

It looks like a bug to me, because if I understand the codec correctly, my distribution plot should show four artefacts, one at each change of the codec: at 64, 128, 192, and 255.

But the expected artefact at 128 is clearly missing.

I obtained the figure above from a .csv like the one below, which I computed myself and checked carefully (this one is for thymine):

[table: per-IPD counts for thymine]

As you can see in this .csv, after IPD 64 the IPDs go up 2 by 2, so odd values are absent. I don't show it here, but after 196 they go 4 by 4 until reaching 255.

These data come from an analysis of 6 GB of raw encoded data directly out of the sequencer. I double-checked, and it seems I made no mistake in my own code for extracting the IPDs, plotting them, or building the .csv.

Such results don't correspond at all to the encoding as presented in the technical doc (http://pacbiofileformats.readthedocs.io/en/3.0/BAM.html), so I checked the code.

Do you think the following Python snippet from pbcore might be involved?

#
# Kinetics: decode the scheme we are using to encode approximate frame
# counts in 8-bits.
#
def _makeFramepoints():
    B = 2
    t = 6
    T = 2**t

    framepoints = []
    next = 0
    for i in range(256/T):
        grain = B**i
        nextOnes = next + grain * np.arange(0, T)
        next = nextOnes[-1] + grain
        framepoints = framepoints + list(nextOnes)
    return np.array(framepoints, dtype=np.uint16)

def _makeLookup(framepoints):
    # (frame -> code) involves some kind of rounding
    # basic round-to-nearest
    frameToCode = np.empty(shape=max(framepoints)+1, dtype=int)
    for i, (fl, fu) in enumerate(zip(framepoints, framepoints[1:])):
        if (fu > fl + 1):
            m = (fl + fu)/2
            for f in xrange(fl, m):
                frameToCode[f] = i
            for f in xrange(m, fu):
                frameToCode[f] = i + 1
        else:
            frameToCode[fl] = i
    # Extra entry for last:
    frameToCode[fu] = i + 1
    return frameToCode, fu
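
For reference, here is a Python 3 port of the framepoint construction above (integer division substituted for the Python 2 `/`). It computes where the codec's grain actually changes: the spacing doubles at codes 64, 128, and 192, which correspond to decoded frame values 64, 192, and 448 — consistent with the step-2 and step-4 runs observed in the data:

```python
import numpy as np

def make_framepoints(B=2, t=6):
    # Python 3 port of pbcore's _makeFramepoints (// instead of py2 /)
    T = 2 ** t
    framepoints = []
    next_ = 0
    for i in range(256 // T):
        grain = B ** i                          # 1, 2, 4, 8
        next_ones = next_ + grain * np.arange(0, T)
        next_ = next_ones[-1] + grain
        framepoints.extend(next_ones)
    return np.array(framepoints, dtype=np.uint16)

fp = make_framepoints()
steps = np.diff(fp.astype(int))
# codes at which the step size changes, and the frames they decode to
change_codes = (np.where(np.diff(steps) != 0)[0] + 1).tolist()
print(change_codes)               # [64, 128, 192]
print(fp[change_codes].tolist())  # [64, 192, 448]
```

So the artefact boundaries live at decoded frame values 64, 192, and 448, not at 64, 128, 192, 255 — which may explain why no artefact appears at frame 128.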

I would be very glad to have your opinion on it :)

Have a nice day !
Guillaume

hdf5.h missing h5py fails to install

Following the directions for install:
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1760:0,

             from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:17,

             from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,

             from h5py/api_compat.h:21,

             from h5py/defs.c:259:

/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]

In file included from h5py/defs.c:259:0:

h5py/api_compat.h:22:18: fatal error: hdf5.h: No such file or directory

compilation terminated.

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1


Can't roll back h5py; was not uninstalled
Command /mnt/myenv2/bin/python -c "import setuptools;__file__='/mnt/myenv2/build/h5py/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-pf_SBW-record/install-record.txt --single-version-externally-managed --install-headers /mnt/myenv2/include/site/python2.7 failed with error code 1 in /mnt/myenv2/build/h5py
Storing complete log in /home/ubuntu/.pip/pip.log

FASTA sequence normalization

This feature has been wanted for a long time.

FASTA readers should have the ability to normalize the returned sequence according to a desired alphabet (Dna, Dna5, IUPAC, etc., in SeqAn's terminology). Dna5 (perhaps DnaN?) would be the most sensible default.

As it stands, we reinvent this functionality in various ways in all our downstream applications, sometimes with difficulty (as in GenomicConsensus).
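
A minimal sketch of the kind of Dna5-style normalization being proposed (the function name and alphabet choice are illustrative, not pbcore API):

```python
import re

def normalize_dna5(seq: str) -> str:
    # Uppercase, then collapse anything outside A/C/G/T to N (Dna5-style)
    return re.sub(r"[^ACGT]", "N", seq.upper())

print(normalize_dna5("acgtRYswN-"))  # ACGTNNNNNN
```

A reader could apply such a function to every sequence it yields, so downstream applications no longer reimplement it.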

Include filename in exception message?

https://github.com/PacificBiosciences/pbcore/blob/master/pbcore/io/align/BamIO.py#L372

I need to know the filename which is missing an index. This could be provided by the caller of pbcore.io.SubreadSet, but it could also be provided in BamIO.py. Which would make more sense?

  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbsmrtpipe/lib/python2.7/site-packages/pbsmrtpipe/tools/chunk_utils.py", line 318, in _to_chunked_dataset_files
    dset = dataset_type(dataset_path, strict=True)
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2221, in __init__
    super(SubreadSet, self).__init__(*files, **kwargs)
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1705, in __init__
    super(ReadSet, self).__init__(*files, **kwargs)
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 389, in __init__
    self.updateCounts()
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2090, in updateCounts
    self.assertIndexed()
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1939, in assertIndexed
    self._assertIndexed((IndexedBamReader, CmpH5Reader))
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1644, in _assertIndexed
    self._openFiles()
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1758, in _openFiles
    referenceFastaFname=refFile)
  File "/mnt/secondary/builds/full/3.0.4/prod/smrtanalysis_3.0.4.170728/private/pacbio/pythonpkgs/pbcore/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 372, in __init__
    raise IOError, "IndexedBamReader requires bam.pbi index file"
IOError: IndexedBamReader requires bam.pbi index file

I'm happy to provide a pull-request, if you prefer me to propose the change.

Trying to run Quiver, I get this error

I am trying to run Quiver; however, I get the following error. My input is a BAM file and a reference.fasta file.

Running job quiver-lv (7370641) on
'HD'
Traceback (most recent call last):
File "/home/l/python2.7/lib/python2.7/site-packages/pbcommand-0.6.1-py2.7.egg/pbcommand/cli/core.py", line 136, in _pacbio_main_runner
return_code = exe_main_func(*args, **kwargs)
File "/home/l/python2.7/lib/python2.7/site-packages/GenomicConsensus-2.2.0-py2.7.egg/GenomicConsensus/main.py", line 381, in args_runner
return tr.main()
File "/home/l/python2.7/lib/python2.7/site-packages/GenomicConsensus-2.2.0-py2.7.egg/GenomicConsensus/main.py", line 297, in main
with AlignmentSet(options.inputFilename) as peekFile:
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 2649, in init
super(AlignmentSet, self).init(*files, **kwargs)
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 1936, in init
super(ReadSet, self).init(*files, **kwargs)
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 477, in init
self.updateCounts()
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 2490, in updateCounts
self.assertIndexed()
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 2320, in assertIndexed
self._assertIndexed((IndexedBamReader, CmpH5Reader))
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 1893, in _assertIndexed
self._openFiles()
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 2017, in _openFiles
resource = IndexedBamReader(location)
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/align/BamIO.py", line 385, in init
super(IndexedBamReader, self).init(fname, referenceFastaFname)
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/align/BamIO.py", line 196, in init
self._checkFileCompatibility()
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/align/BamIO.py", line 185, in _checkFileCompatibility
checkedVersion = self.version
File "/home/l/python2.7/lib/python2.7/site-packages/pbcore-1.4.0-py2.7.egg/pbcore/io/align/BamIO.py", line 271, in version
return self.peer.header["HD"]["pb"]
KeyError: 'HD'
[ERROR] 'HD'

Python 3 support

Hi,
Are there any plans for supporting Python 3 in this and the other PacBio python tools?

Consistent Model of empty Readers and DataSets

I've talked with @dalexander and he suggested opening up an issue to define a general model for handling empty files, both at the file level (i.e., Reader) and DataSet level.

Misc comments:

  • RS-era: Empty cmp.h5 files were handled with special logic, specifically in the gather task.
  • RS-era: Similar cmp.h5 "is empty" logic was in different tools/libs.
  • RS-era: Filtering was a core task and as such, the downstream reports had special logic to generate empty reports
  • the current mapping_stats works fine with an empty AlignmentSet (<ExternalResources /> <DataSets />). It might work with an empty bam and pbi file.
  • Some empty files don't make sense; for example, the ReferenceSet must have a non-empty fasta file. This is constrained via fasta-to-reference but not fundamentally at the DataSet XSD level.
  • Clarify model for Importing empty DataSet(s) at the service level from "successful" jobs
  • Empty AlignmentSets in the context of Multi-job analysis might create headaches. I would have to look into the R code to see how empty files are handled in the RS-era.

Numpy 1.11 throws VisibleDeprecationWarnings

Our GenomicConsensus circleci tests now throw these:

/home/ubuntu/virtualenvs/venv-system/local/lib/python2.7/site-packages/pbcore/io/align/_bgzf.py:641: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
    data = self._buffer[self._within_block_offset:self._within_block_offset + size]
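
A minimal reproduction of the pattern behind the warning, and the usual fix (the variable names are illustrative; the real code lives in _bgzf.py):

```python
import numpy as np

buf = b"abcdefgh"
size = np.array([3])  # a size-1 array, e.g. produced by an earlier numpy op

# buf[0:size] is what NumPy 1.11 started warning about: using an
# array with ndim > 0 as a slice index. Casting to a plain int fixes it:
chunk = buf[0:int(size[0])]
print(chunk)  # b'abc'
```

The fix is local: wherever a slice bound may be a NumPy array rather than a scalar, cast it to int before slicing.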

standardize hqRegionSnr channel order

In the BasH5Reader, hqRegionSnr is in channel order; in BamReader it's standardized order as ACGT. Need to switch BasH5Reader to use the latter convention.
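
A sketch of the proposed convention: permute a channel-order SNR vector into fixed ACGT order, given the instrument's channel-to-base map (the channel order and values here are purely illustrative):

```python
import numpy as np

channel_order = "TGCA"                       # hypothetical per-movie channel map
snr_by_channel = np.array([7.1, 9.3, 5.2, 8.4])

# Permute into fixed A, C, G, T order, matching the BamReader convention
acgt = [channel_order.index(base) for base in "ACGT"]
snr_acgt = snr_by_channel[acgt]
print(snr_acgt.tolist())  # [8.4, 5.2, 9.3, 7.1]
```

With this in place, BasH5Reader.hqRegionSnr would return values in the same order regardless of the instrument's channel layout.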

Install Quiver

We're told to install quiver from github in order to use .fofn for multiple bam files.

I found an online document mentioning that some environment variables need to be different from the rest of the package. It also has extra instructions that would probably corrupt the present installation if we followed them. Could you let us know which environment variables need to differ from the rest of the package? We can use the module system to manage different environments.

/tools/bioinfo/app/smrtanalysis-2.3.0.140936/install/smrtanalysis_2.3.0.140936/analysis/lib/python2.7/GenomicConsensus/quiver

Thanks,
Jean

move test data out into separate library

We have discussed a medium-term goal of moving the test data in pbcore and pbbam into a unified "sample PacBio data" package.

Goals:

  • Tiny files
  • Represent a few realistic use cases, maybe some corner cases
  • Data files locations accessible by Python and C++ API
  • Data files locations accessible by command line tools, e.g.
  $ pacbio-sample-file tiny-aligned-bam

KeyError BASECALLERVERSION

Steps to reproduce:

  1. Starting with unaligned PacBio BAM files from sequencing service
  2. Performed de novo assembly with CANU, proceed to polish using arrow...
  3. Convert unaligned BAM file to FASTA using bamtools convert version 2.4.1
  4. Align FASTA reads to de novo assembly using pbalign v0.3.1 (which used BLASR v5.3 internally)
  5. Give aligned BAM file from pbalign and assembly FASTA from CANU to arrow v2.2.2.
$ arrow --numWorkers 32 -r canu_de_novo.fasta -o canu_de_novo_pbalign_arrow.fasta canu_de_novo_pbalign_mapped.bam
'BASECALLERVERSION'
Traceback (most recent call last):
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcommand/cli/core.py", line 137, in _pacbio_main_runner
    return_code = exe_main_func(*args, **kwargs)
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/GenomicConsensus/main.py", line 383, in args_runner
    return tr.main()
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/GenomicConsensus/main.py", line 297, in main
    with AlignmentSet(options.inputFilename) as peekFile:
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/dataset/DataSetIO.py", line 2723, in __init__
    super(AlignmentSet, self).__init__(*files, **kwargs)
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/dataset/DataSetIO.py", line 1987, in __init__
    super(ReadSet, self).__init__(*files, **kwargs)
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/dataset/DataSetIO.py", line 477, in __init__
    self.updateCounts()
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/dataset/DataSetIO.py", line 2541, in updateCounts
    self.assertIndexed()
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/dataset/DataSetIO.py", line 2371, in assertIndexed
    self._assertIndexed((IndexedBamReader, CmpH5Reader))
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/dataset/DataSetIO.py", line 1944, in _assertIndexed
    self._openFiles()
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/dataset/DataSetIO.py", line 2068, in _openFiles
    resource = IndexedBamReader(location)
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/align/BamIO.py", line 388, in __init__
    super(IndexedBamReader, self).__init__(fname, referenceFastaFname)
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/align/BamIO.py", line 202, in __init__
    self._loadReadGroupInfo()
  File "/opt/pacbio/smrtlink/install/smrtlink-release_5.1.0.26412/bundles/smrttools/install/smrttools-release_5.1.0.26366/private/thirdparty/
python/python_2.7.9/site-packages/pbcore/io/align/BamIO.py", line 115, in _loadReadGroupInfo
    basecallerVersion = ".".join(ds["BASECALLERVERSION"].split(".")[0:2])
KeyError: 'BASECALLERVERSION'
[ERROR] 'BASECALLERVERSION'

Actual Results

Cryptic exception from pbcore/io/align/BamIO.py, which assumes certain BAM metadata will be present.

Expected Result

Polished assembly, or at least a user-friendly error message explaining what BAM metadata is missing and ideally how to deal with it.

Notes

From reading https://github.com/PacificBiosciences/pitchfork/issues/316 (found via Google), it seems that the KeyError 'BASECALLERVERSION' problem is due to missing PacBio headers in the aligned BAM file?

In my case the unmapped BAM files have an @RG (read group) header line which includes BASECALLERVERSION=5.0.0 as part of the DS key's value. My mapped BAM file from pbalign is missing the header. I checked this with the following (actual library name obscured):

Raw BAM header:

$ samtools view -h m12345_123456_123456.subreads.bam | head -n 100 | grep "^@"
@HD	VN:1.5	SO:unknown	pb:3.0.5
@RG	ID: f5b4ffb6	PL:PACBIO	DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-365-900;SEQUENCINGKIT=101-309-500;BASECALLERVERSION=5.0.0;FRAMERATEHZ=80.000000	PU:m12345_123456_123456	PM:SEQUEL
@PG	ID:baz2bam	PN:baz2bam	VN:5.1.0.26367	CL:/opt/pacbio/ppa-5.1.0/bin/baz2bam /data/pa/m12345_123456_123456.baz -o /data/pa/m12345_123456_123456 --metadata /data/pa/.m12345_123456_123456.metadata.xml -j 12 -b 12 --progress --silent --minSubLength 50 --minSnr 3.750000 --adapters /data/pa/m12345_123456_123456.adapters.fasta --controlAdapters /data/pa/m12345_123456_123456.controls.adapters.fasta --controls /data/pa/m54047_180428_220428.controls.fasta 
@PG	ID:bazFormat	PN:bazformat	VN:1.3.0
@PG	ID:bazwriter	PN:bazwriter	VN:5.1.0.26367

Mapped BAM file from pbalign has no SAM header:

$ samtools view -h canu_de_novo_pbalign_mapped.bam | head -n 100 | grep "^@"

I might be able to combine the header-less mapped BAM files with header information from the unmapped BAM to restore the metadata...?
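One way to restore the metadata along those lines, sketched below: extract the header text from both BAMs with `samtools view -H`, merge the unmapped BAM's @RG lines into the mapped BAM's header, and apply the result with `samtools reheader`. The helper function is mine, not part of pbcore, and only manipulates plain header text:

```python
def merge_rg_lines(mapped_header, unmapped_header):
    """Append @RG lines from the unmapped BAM's header text to the
    mapped BAM's header text, so tools that need BASECALLERVERSION
    (inside the DS tag) can find it. Both arguments are plain-text
    headers as printed by `samtools view -H`."""
    merged = mapped_header.rstrip("\n").splitlines()
    for line in unmapped_header.splitlines():
        # Copy read-group lines only, skipping any already present.
        if line.startswith("@RG") and line not in merged:
            merged.append(line)
    return "\n".join(merged) + "\n"
```

The merged header could then be applied with something like `samtools reheader merged_header.sam mapped.bam > mapped_with_rg.bam` (check your samtools version's exact usage).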

Frustratingly, blasr v5.3 says it takes FASTA or bax.h5 files and recommends against FASTA because it has no quality values, yet also says bax.h5 support is deprecated.

Do I need to go back to our sequencing provider to get the bax.h5 files as well as the BAM files?

SMRT Analysis 1.4 - QV is not the "-33" value.

This was reported by a customer:

bash5tools produces a QV report in a CSV file.

This QV is wrong. There are some QVs over 100.
All QVs should have 33 subtracted.

The function “baseCallsDG.getQVForZMW()” adds 33 to produce ASCII codes (33-126) in BasH5IO:

https://github.com/PacificBiosciences/pbcore/blob/master/src/pbcore/io/BasH5IO.py#L91
Therefore you need to subtract 33 from “baseCallsDG.getQVForZMW()” when:

I. Filtering with –minMeanQV : https://github.com/PacificBiosciences/pbh5tools/blob/master/src/python/pbtools/pbh5tools/BasH5ToData.py#L44
II. Output csv Raw : https://github.com/PacificBiosciences/pbh5tools/blob/master/src/python/pbtools/pbh5tools/BasH5ToData.py#L126
III. Output csv CCS : https://github.com/PacificBiosciences/pbh5tools/blob/master/src/python/pbtools/pbh5tools/BasH5ToData.py#L137
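For reference, the +33 in question is the standard ASCII Phred offset; a minimal sketch of decoding it back to numeric QVs (the function name is mine, not pbcore's):

```python
def decode_phred33(ascii_qvs):
    """Convert ASCII-encoded quality characters (offset 33, printable
    range 33-126) back to numeric Phred QVs (range 0-93)."""
    return [ord(c) - 33 for c in ascii_qvs]
```

A QV above 93 in the CSV output is therefore a telltale sign that the +33 offset was never removed.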

Test failures with NumPy 1.14

Hi,
the Debian packaged version of pbcore fails to build since the latest version of NumPy was uploaded.
The log says:

======================================================================
FAIL: test_referenceInfo (test_pbdataset.TestDataSet)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest-lxc.agmyzyvx/downtmp/build.owB/src/tests/test_pbdataset.py", line 1407, in test_referenceInfo
    "(27, 27, 'E.faecalis.1', 'E.faecalis.1', 1482, "
AssertionError: "(27, 27, 'E.faecalis.1', 'E.faecalis.1', 1482, 'a1a59c267ac1341e5a12bce7a7d37bcb', 0, 0)" != "(27, 27, 'E.faecalis.1', 'E.faecalis.1', 1482, 'a1a59c267ac1341e5a12bce7a7d37bcb', 0L, 0L)"
-------------------- >> begin captured logging << --------------------
pbcore.io.dataset.DataSetIO: DEBUG: Opening ReadSet resources
pbcore.io.dataset.DataSetIO: DEBUG: Done opening resources
pbcore.io.dataset.DataSetIO: DEBUG: Updating counts
pbcore.io.dataset.DataSetIO: DEBUG: Populating index
pbcore.io.dataset.DataSetIO: DEBUG: Processing resource indices
pbcore.io.dataset.DataSetIO: DEBUG: Filtering reference entries
pbcore.io.dataset.DataSetIO: DEBUG: Done populating index
--------------------- >> end captured logging << ---------------------

Please let me know if you need further information about this issue.
Kind regards, Andreas.
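The underlying cause is that the test compares the string repr of a structured-array record, and NumPy 1.14 changed how integer fields print (the Python 2 `L` suffix disappears). Comparing field values instead of reprs sidesteps the problem; the dtype below is a hypothetical stand-in for pbcore's real referenceInfo record:

```python
import numpy as np

# Hypothetical dtype standing in for the referenceInfo record.
rec = np.array([(27, 27, "E.faecalis.1", 1482, 0, 0)],
               dtype=[("ID", "<i8"), ("RefInfoID", "<i8"), ("Name", "O"),
                      ("Length", "<i8"), ("StartRow", "<u4"), ("EndRow", "<u4")])[0]

# Fragile: asserting on str(rec) breaks across numpy/Python versions.
# Robust: compare plain Python values field by field.
assert (int(rec["ID"]), int(rec["Length"])) == (27, 1482)
assert (int(rec["StartRow"]), int(rec["EndRow"])) == (0, 0)
```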

Deprecating the h5py dependency

As a first step towards moving away from this altogether, would it be possible to make the dependency optional so that pbcore can be installed without hdf5 support? Removing the automatic h5py installation would not prevent us from continuing to support it in pbcore (or smrtanalysis) indefinitely, but it is unnecessary for many of our standalone internal use cases (e.g. pbcommand testing) and the extra overhead makes them considerably more complicated to set up.

test failure with numpy 1.16: dtype of to_begin must be compatible with input ary

pbcore tests fail under numpy 1.16 (currently 1.16.0.rc2), with message "arraysetops.py TypeError: dtype of to_begin must be compatible with input ary". It involves BasH5IO.py, invoking numpy arraysetops.

Failing test logs can be seen at https://ci.debian.net/packages/p/python-pbcore/unstable/amd64/

The same failure is seen in build logs at https://tests.reproducible-builds.org/debian/rbuild/unstable/amd64/python-pbcore_1.5.0+git20180606.43fcd9d+dfsg-2.rbuild.log.gz

Those logs are for pbcore 1.5.0(git20180606), but the error still occurs with 1.6.5.

A sample from a 1.6.5 build is:

ERROR: test_pbcore_io_AlnFileReaders.TestBasicBam.testBaxAttaching

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/projects/misc/build/python-pbcore/tests/test_pbcore_io_AlnFileReaders.py", line 293, in testBaxAttaching
    self.f.attach(self.BAX_FILE)
  File "/home/projects/misc/build/python-pbcore/pbcore/io/align/_AlignmentMixin.py", line 22, in attach
    self.basH5Collection = BasH5Collection(fofnFilename)
  File "/home/projects/misc/build/python-pbcore/pbcore/io/BasH5IO.py", line 818, in __init__
    for k, v in groupby(movieNamesAndFiles, lambda t: t[0]) ])
  File "/home/projects/misc/build/python-pbcore/pbcore/io/BasH5IO.py", line 662, in __init__
    self._parts = [ BaxH5Reader(self.filename) ]
  File "/home/projects/misc/build/python-pbcore/pbcore/io/BasH5IO.py", line 328, in __init__
    self._loadRegions(self.file)
  File "/home/projects/misc/build/python-pbcore/pbcore/io/BasH5IO.py", line 355, in _loadRegions
    self._regionTableIndex = _makeRegionTableIndex(self.regionTable.holeNumber)
  File "/home/projects/misc/build/python-pbcore/pbcore/io/BasH5IO.py", line 284, in _makeRegionTableIndex
    to_begin=[1], to_end=[1])
  File "/usr/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 111, in ediff1d
    raise TypeError("dtype of to_begin must be compatible "
TypeError: dtype of to_begin must be compatible with input ary
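The failure comes from `np.ediff1d(..., to_begin=[1], to_end=[1])`: since NumPy 1.16, `to_begin`/`to_end` must be safely castable to the input array's dtype, and a plain `[1]` defaults to a signed integer while hole numbers are unsigned. A sketch of the fix, building the sentinel values with the array's own dtype:

```python
import numpy as np

# Toy stand-in for the region table's holeNumber column (unsigned ints).
hole_numbers = np.array([1, 1, 2, 2, 3], dtype=np.uint32)

# Before 1.16 this worked; now the signed default dtype of [1] raises
# "dtype of to_begin must be compatible with input ary":
#   np.ediff1d(hole_numbers, to_begin=[1], to_end=[1])

# Fix: cast the sentinels explicitly to the input array's dtype.
one = np.ones(1, dtype=hole_numbers.dtype)
diffs = np.ediff1d(hole_numbers, to_begin=one, to_end=one)
```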

Please upload pbcore to PyPI

❯❯❯ pip install pbcore
Collecting pbcore
  Could not find a version that satisfies the requirement pbcore (from versions: )
No matching distribution found for pbcore

Please port to Python3

Hello,
the Debian Med team is maintaining pbcore for official Debian. The recently released Debian 10 was the last Debian release featuring Python 2, since that language is EOL. If you would like us to continue maintaining pbcore in official Debian (and users of other modern distributions to have no problems installing pbcore on their systems), I'd recommend you port your code to Python 3. The 2to3 tool might be of great help here.
Kind regards, Andreas.

InvalidDataSetIOError

When I run arrow, I get an error:
InvalidDataSetIOError: No reference tables found, are these input files aligned?
My command is: arrow -j8 --referenceFilename contig.fasta -o arrow.fasta subreads.bam
What's wrong?
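The error hints at the cause: arrow needs a mapped (aligned) BAM, while `subreads.bam` off the instrument is unaligned. One way to produce the aligned input, sketched with pbmm2 (PacBio's minimap2 wrapper; assumes it is installed, and file names are taken from the question above):

```shell
# Align the raw subreads to the draft assembly first.
pbmm2 align contig.fasta subreads.bam aligned.subreads.bam --sort

# Then run arrow on the *aligned* BAM.
arrow -j8 --referenceFilename contig.fasta -o arrow.fasta aligned.subreads.bam
```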

Test failures on Debian

I'm updating the pbcore Debian package to your latest release, but two of the build-time tests are failing:

FAIL: test_create_cli (test_pbdataset.TestDataSet)

Traceback (most recent call last):
  File "/«BUILDDIR»/python-pbcore-1.2.3+dfsg/tests/test_pbdataset.py", line 103, in test_create_cli
    self.assertEqual(r, 0)
AssertionError: 127 != 0
-------------------- >> begin captured logging << --------------------
test_pbdataset: DEBUG: Absolute
test_pbdataset: DEBUG: dataset.py create --type AlignmentSet /tmp/tmpTG7Mv1dataset-unittest/pbalchemysim.alignmentset.xml /«BUILDDIR»/python-pbcore-1.2.3+dfsg/pbcore/data/datasets/pbalchemysim0.alignmentset.xml /«BUILDDIR»/python-pbcore-1.2.3+dfsg/pbcore/data/datasets/pbalchemysim1.alignmentset.xml
--------------------- >> end captured logging << ---------------------

FAIL: test_split_cli (test_pbdataset.TestDataSet)

Traceback (most recent call last):
  File "/«BUILDDIR»/python-pbcore-1.2.3+dfsg/tests/test_pbdataset.py", line 89, in test_split_cli
    self.assertEqual(r, 0)
AssertionError: 127 != 0
-------------------- >> begin captured logging << --------------------
test_pbdataset: DEBUG: dataset.py split --outdir /tmp/tmpePANm1dataset-unittest --contigs --chunks 2 /«BUILDDIR»/python-pbcore-1.2.3+dfsg/pbcore/data/datasets/pbalchemysim0.alignmentset.xml
--------------------- >> end captured logging << ---------------------


Ran 241 tests in 32.003s

FAILED (failures=2, skipped=8)

The versions of dependencies in use are:
pysam 0.8.3
h5py 2.5.0
numpy 1.9.2

Do you know what could be going on here?

Github doesn't seem to have a way of attaching log files, but I can provide the full build log if you tell me where to send it.
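One clue in the logs: exit status 127 from a shell means "command not found", so `127 != 0` suggests `dataset.py` isn't on PATH in the build environment rather than a genuine pbcore failure. A small stdlib helper to make that diagnosis explicit (function name is mine):

```python
import shutil

def explain_exit_status(cmd, returncode):
    """Translate a shell exit status into a hint; 127 is the shell's
    'command not found' status."""
    if returncode == 127:
        location = shutil.which(cmd)
        if location is None:
            return "%s: not found on PATH" % cmd
        return "%s is at %s but still exited 127" % (cmd, location)
    return "%s exited with status %d" % (cmd, returncode)
```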

Please use proper release tags

Hi,
the release tags list version 1.2.10 as the latest release, but setup.py specifies version 1.5.0. It would be really helpful if you could keep the release tags in sync with your release versions.
Thanks a lot, Andreas.

BamAlignment class feature transparency

Make the available class features easy to find.

The availability of some object-instance features is hidden from the user unless they dive into the source. Attached is an example of accessing IPDs from a BamAlignment instance through pbcore: you can't see that IPDs are available via dir(), but you can when you look at the BamAlignment class in the source code.

One possibility: write a built-in dir-type method that lists all features. Or figure out how to include all features in a call to Python's built-in dir().

Toy .ipynb example located at Q:\Labs\Kristofor\analyses\pbcore_BAM_toy.html (pdf attached)
pbcore_BAM_toy.pdf
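One way to implement the "builtin dir-type" idea is to override `__dir__`, which Python's `dir()` consults. A sketch with made-up feature names, not the real BamAlignment internals:

```python
class LazyFeatureMixin:
    """Expose lazily-decoded per-read features (e.g. IPD, PulseWidth)
    in dir() so users can discover them interactively."""

    # Hypothetical feature names; a real class would derive these from
    # the BAM read group's DS tag.
    _LAZY_FEATURES = ("IPD", "PulseWidth", "DeletionQV")

    def __dir__(self):
        # Merge the lazy feature names into the normal attribute list.
        return sorted(set(super().__dir__()) | set(self._LAZY_FEATURES))

aln = LazyFeatureMixin()
assert "IPD" in dir(aln)  # now discoverable without reading the source
```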

SMRT Analysis - QUIVER - pbtranscript.py ", line 35, in <module> from pbcore.util.ToolRunner import PBMultiToolRunner

Hi,
I want to run ICE and Quiver with the command line: pbtranscript.py cluster
I have installed smrt_analysis, the installation was successful.

But when I run:
pbtranscript.py cluster isoseq_flnc.fasta final.consensus.fa
--nfl_fa isoseq_nfl.fasta -d cluster --ccs_fofn /reads_of_insert.fofn
--bas_fofn input.fofn --cDNA_size under1k --quiver --use_sge
--max_sge_jobs 40 --unique_id 300 --blasr_nproc 24 --quiver_nproc 8

I have this error:
Traceback (most recent call last):
  File "./pbtranscript.py", line 35, in <module>
    from pbcore.util.ToolRunner import PBMultiToolRunner

I'm stuck; I don't know what to do.
Cheers,
Virginie

quiver complains about BAM files generated by blasr

I'm trying to use quiver to polish contigs assembled by falcon. I aligned *.bax.h5 reads to the contigs to create a SAM file, used samtools to convert the SAM to BAM, and then used the BAM file as input for quiver. The run failed immediately with the error: IOError: Invalid or nonexistent cmp.h5 file. I was told that old quiver requires cmp.h5 as its input file but that new quiver should accept BAM files as input. Is that correct? Our smrtanalysis is smrtanalysis-2.3.0.140936. Any suggestion would be greatly appreciated.

util/Process.py could be a little better

As it stands, it drops stderr on the floor if there was an error and merge_stderr=False, returning only stdout. Also, it could potentially block, particularly if the command called tried to read from stdin.

--- Process.py  2017-01-10 13:02:38.000000000 -0600
+++ Process.py  2017-03-30 14:13:08.835189623 -0500
@@ -47,22 +47,12 @@
                           stdout=subprocess.PIPE, stderr=_stderr,
                           close_fds=True )

-    out = [ l[:-1] for l in p.stdout.readlines() ]
-
-    p.stdout.close()
-    if not merge_stderr:
-        p.stderr.close()
-
-    # need to allow process to terminate
-    p.wait()
+    (out, errorMessage) = p.communicate(None)

     errCode = p.returncode and p.returncode or 0
     if p.returncode>0:
-        errorMessage = os.linesep.join(out)
-        output = []
-    else:
-        errorMessage = ''
-        output = out
+        errorMessage += out
+        out = ''
 
-    return output, errCode, errorMessage
+    return str.splitlines(out), errCode, errorMessage
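For readers who just want the end state rather than the diff, here is a self-contained version of the patched helper built on `subprocess.communicate()`, which avoids both the dropped stderr and the pipe-deadlock risk (a sketch of the proposal, not pbcore's exact code):

```python
import subprocess

def backticks(cmd, merge_stderr=True):
    """Run a shell command, returning (output_lines, exit_code, error_message).

    communicate() drains stdout and stderr concurrently, so large
    outputs cannot deadlock; stdin is closed so the child cannot block
    waiting for input.
    """
    stderr = subprocess.STDOUT if merge_stderr else subprocess.PIPE
    p = subprocess.Popen(cmd, shell=True,
                         stdin=subprocess.DEVNULL,
                         stdout=subprocess.PIPE, stderr=stderr,
                         universal_newlines=True, close_fds=True)
    out, err = p.communicate()
    errCode = p.returncode or 0
    if errCode > 0:
        # Keep stderr (when captured separately) instead of dropping it.
        return [], errCode, (err or "") + out
    return out.splitlines(), errCode, ""
```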

help with reprocessMotifSites.py?

Hi, I'm trying to use reprocessMotifSites.py and I've run into this error message. Any help is appreciated! Thanks.

usc-secure-wireless-034-135:Scripts FinkelLabAir$ python reprocessMotifSites.py --reference path/to/rev_compPACBIO.fasta --motifs /path/to/-home-smrtanalysis-smrtanalysis-current-common-jobs-016-016796-data-motifs.gff.gz --motif_summary /path/to/-home-smrtanalysis-smrtanalysis-current-common-jobs-016-016796-data-motif_summary.csv --undetected --gff ~/Desktop/new.gff ~/Downloads/-home-smrtanalysis-smrtanalysis-current-common-jobs-016-016796-data-aligned_reads.cmp.h5
Traceback (most recent call last):
  File "reprocessMotifSites.py", line 552, in <module>
    kt.start()
  File "/Library/Python/2.7/site-packages/pbcore-0.8.5-py2.7.egg/pbcore/util/ToolRunner.py", line 105, in start
    return self.run()
  File "reprocessMotifSites.py", line 280, in run
    ret = self._mainLoop()
  File "reprocessMotifSites.py", line 490, in _mainLoop
    self.loadReferenceAndModel(self.args.reference, self.args.infile)
  File "reprocessMotifSites.py", line 431, in loadReferenceAndModel
    contigs = ReferenceUtils.loadReferenceContigs(referencePath, cmpH5Path)
  File "/Library/Python/2.7/site-packages/kineticsTools-0.4.0-py2.7-macosx-10.9-intel.egg/kineticsTools/ReferenceUtils.py", line 60, in loadReferenceContigs
    contigDict[x.MD5].id = x.ID
KeyError: 'ab1a06dc3709a6f554551cc5df814102'

API inconsistencies in BasH5IO

I just wanted to write a note about this here. This is a superficial issue; the question is whether it is annoying enough that we should fix it, or whether to carry on and preserve compatibility.

  • Naming inconsistency
    Why ccsRead but subreads? This kind of makes sense to me because of how I parse those words, but might not make sense to other users.
  • Properties vs methods
    The ccsRead, subreads, etc are properties, not methods, while read is a method. This confused Appslab back in 2.0. I wish I had just kept them all as methods.
  • sequencingZmws/allSequencingZmws
    These return vectors of hole numbers, not iterators of Zmw objects, as people might expect.

Replace IndexedFastaReader with pyfaidx.Fasta

It looks like the interface to IndexedFastaReader is similar to pyfaidx and that one of your tools already uses pyfaidx for indexing. There are a few advantages to using the pyfaidx codebase:

  • It's pretty well tested, with good test coverage (which you can freely improve on)
  • I've implemented index creation and validation so you don't have to require users to have samtools
  • The API is nicer (though that's subjective)

Would there be some interest in this?
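For comparison, the pyfaidx interface the proposal refers to looks roughly like this (based on pyfaidx's documented usage; requires `pip install pyfaidx` and an indexable FASTA, with the file and contig names below purely illustrative):

```python
from pyfaidx import Fasta

# Opening the file builds (or reuses) a .fai index; no samtools needed.
genome = Fasta("reference.fasta")

# Random access by contig name and slice, much like IndexedFastaReader:
seq = genome["E.faecalis.1"][0:50]
print(seq.name, len(seq), seq.seq)
```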
