biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences

License: Apache License 2.0

Makefile 4.68% Python 91.82% Perl 0.81% Shell 2.69%

bioinformatics genome-analysis genomics sequencing variant-analysis variation

biocommons.seqrepo's Introduction

biocommons.seqrepo

SeqRepo is a Python package for storing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.

Introduction

Specific, named biological sequences provide the reference and coordinate system for communicating variation and consequential phenotypic changes. Several databases of sequences exist, with significant overlap, all using distinct names. Furthermore, these systems are often difficult to install locally.

SeqRepo provides an efficient, non-redundant and indexed storage system for biological sequences. Clients refer to sequences and metadata using familiar identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based identifiers. The interface supports fast slicing of arbitrary regions of large sequences.

A "fully-qualified" identifier includes a namespace to disambiguate accessions from different origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the namespace is provided, seqrepo uses it as-is; if the namespace is not provided and the unqualified identifier refers to a unique sequence, that sequence is returned; otherwise, the ambiguous identifier raises an error.
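The resolution rule described above can be sketched with a toy alias table. This is a minimal illustration only, not SeqRepo's implementation: the `ALIASES` table, the sequence ids, and the `resolve` helper are all hypothetical.

```python
# Toy alias table: (namespace, alias) -> internal sequence id.
# All entries and ids are made up for illustration.
ALIASES = {
    ("GRCh37", "1"): "seq-a",
    ("GRCh38", "1"): "seq-b",
    ("refseq", "NM_000551.3"): "seq-c",
}


def resolve(identifier, aliases=ALIASES):
    """Return the sequence id for 'namespace:alias' or for a bare alias."""
    if ":" in identifier:
        namespace, alias = identifier.split(":", 1)
        return aliases[(namespace, alias)]  # KeyError if unknown
    # No namespace: accept only if every match points at the same sequence.
    seq_ids = {sid for (_ns, a), sid in aliases.items() if a == identifier}
    if not seq_ids:
        raise KeyError(f"{identifier}: not found")
    if len(seq_ids) > 1:
        raise KeyError(f"{identifier}: not unique; add a namespace")
    return seq_ids.pop()
```

With this table, `resolve("GRCh38:1")` and `resolve("NM_000551.3")` succeed, while the bare `resolve("1")` raises because it matches two different sequences.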

SeqRepo favors namespaces from identifiers.org whenever available. Examples include refseq and ensembl.

seqrepo-rest-service provides a REST interface and docker image.

Released under the Apache License, 2.0.

Citation

Hart RK, Prlić A (2020). SeqRepo: A system for managing local collections of biological sequences. PLoS ONE 15(12): e0239883. https://doi.org/10.1371/journal.pone.0239883

Features

Timestamped, read-only snapshots.
Space-efficient storage of sequences within a single snapshot and across snapshots.
Bandwidth-efficient transfer of incremental updates.
Fast fetching of sequence slices on chromosome-scale sequences.
Precomputed digests that may be used as sequence aliases.
Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences.

Deployment Scenarios

Local read-only archive, mirrored from public site, accessed via Python API (see Mirroring documentation)
Local read-write archive, maintained with command line utility and/or API (see Command Line Interface documentation).
Docker data-only container that may be linked to application container.
SeqRepo and refget REST API for local or remote access (see seqrepo-rest-service)

Technical Quick Peek

Within a single snapshot, sequences are stored non-redundantly and compressed in an add-only journalled filesystem structure. A truncated SHA-512 hash is used to assess uniqueness and as an internal id. (The digest is truncated for space efficiency.)
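A truncated digest of this kind can be sketched in a few lines. The 24-byte truncation and base64url encoding below follow the `sha512t24u` name that appears in the issues further down this page; treat those details, and the helper itself, as assumptions rather than a description of SeqRepo internals.

```python
import base64
import hashlib


def sha512t24u(sequence: str) -> str:
    """Base64url-encode the first 24 bytes of the SHA-512 digest."""
    digest = hashlib.sha512(sequence.encode("ascii")).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


print(sha512t24u("ACGT"))
```

Identical sequences always produce the same 32-character id, which is what lets a repository detect that a sequence is already stored.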

Sequences are compressed using the Block GZipped Format (BGZF), which enables pysam to provide fast random access to compressed sequences. (Variable compression typically makes random access impossible.)
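The idea behind block compression can be shown with the standard library alone. The sketch below is not BGZF and does not use pysam; it only illustrates why compressing fixed-size blocks independently makes random access cheap. The block size and function names are arbitrary choices for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (arbitrary choice)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compress each fixed-size chunk independently."""
    return [
        zlib.compress(data[i : i + block_size])
        for i in range(0, len(data), block_size)
    ]


def fetch(blocks: list, start: int, end: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[start:end], decompressing only the blocks that overlap it."""
    if start >= end:
        return b""
    first, last = start // block_size, (end - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first : last + 1])
    offset = first * block_size  # position of chunk[0] in the original data
    return chunk[start - offset : end - offset]
```

Fetching 20 bytes from a chromosome-sized input touches at most two blocks here, whereas a single gzip stream would have to be decompressed from the beginning.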

Sequence files are immutable, thereby enabling the use of hardlinks across snapshots and eliminating redundant transfers (e.g., with rsync).
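Because the files never change, a new snapshot can share them with an older one through hardlinks. The following is a minimal sketch of that idea using temporary directories; the `snapshot` helper and the file names are hypothetical and do not reproduce the seqrepo command of the same name.

```python
import os
import tempfile


def snapshot(src_dir: str, dst_dir: str) -> None:
    """Recreate src_dir at dst_dir, hardlinking files instead of copying them."""
    for root, _dirs, files in os.walk(src_dir):
        target = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        os.makedirs(target, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target, name))


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "2018-10-03")
    os.makedirs(os.path.join(first, "sequences"))
    with open(os.path.join(first, "sequences", "a.fa.bgz"), "wb") as fh:
        fh.write(b"placeholder bytes")
    second = os.path.join(tmp, "2018-11-26")
    snapshot(first, second)
    # Both paths now name the same inode, so the data is stored once.
    same_inode = os.path.samefile(
        os.path.join(first, "sequences", "a.fa.bgz"),
        os.path.join(second, "sequences", "a.fa.bgz"),
    )
    print(same_inode)
```

Tools such as rsync can preserve these links (the `-H` option), which is why unchanged sequence files need not be transferred again.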

Each sequence id is associated with namespaced aliases in a sqlite database, such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <NCBI,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, and <ensembl-85,ENSP00000354464.4>. The sqlite database is mutable across releases.

For calibration, recent releases that include 3 human genome assemblies (including patches) and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consume approximately 8GB. The minimum marginal size for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).

For more information, see docs/design.rst.

Requirements

Reading a sequence repository requires several Python packages, all of which are available from pypi. Installation should be as simple as pip install biocommons.seqrepo.

Writing sequence files also requires bgzip, which is provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.

Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.

Quick Start

OS X

$ brew install python libpq

Ubuntu

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix

All platforms

$ python -m venv venv
$ source venv/bin/activate
$ pip install seqrepo
$ sudo mkdir -p /usr/local/share/seqrepo
$ sudo chown $USER /usr/local/share/seqrepo
$ seqrepo pull -i 2018-11-26
$ seqrepo show-status -i 2018-11-26
seqrepo 0.2.3.post3.dev8+nb8298bd62283
root directory: /usr/local/share/seqrepo/2018-11-26, 7.9 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 773587 sequences, 93051609959 residues, 192 files
aliases: 5579572 aliases, 5480085 current, 26 namespaces, 773587 sequences

# Simple Pythonic interface to sequences
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo("/usr/local/share/seqrepo/latest")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'

# Or, use the seqrepo shell for even easier access
$ seqrepo start-shell -i 2018-11-26
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

# N.B. The following output is edited for simplicity
$ seqrepo export -i 2018-11-26 | head -n100
>SHA1:9a2acba3dd7603f... SEGUID:mirLo912A/MppLuS1cUyFMduLUQ Ensembl-85:GENSCAN00000003538 ...
MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
>SHA1:ca996b263102b1... SEGUID:yplrJjECsVqQufeYy0HkDD16z58 NCBI:XR_001733142.1 gi:1034683989
TTTACGTCTTTCTGGGAATTTATACTGGAAGTATACTTACCTCTGTGCAAAATTGCAAATATATAAGGTAATTCATTCCAGCATTGCTTATATTAGGTTG
AACTATGTAACATTGACATTGATGTGAATCAAAAATGGTTGAAGGCTGGCAGTTTCATATGATTCAGCCTATAATAGCAAAAGATTGAAAAAATCCATTA
ATACAGTGTGGTTCAAAAAAATTTGTTGTATCAAGGTAAAATAATAGCCTGAATATAATTAAGATAGTCTGTGTATACATCGATGAAAACATTGCCAATA

See Installation and Mirroring for more information.

Environment Variables

SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite query response caching. It defaults to 1 million but can also be set to "none" to be unlimited.

SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handler caching during FASTA sequence retrievals. It defaults to 0 to disable any caching, but can be set to a specific value or "none" to be unlimited. Using a moderate value (>10) will greatly increase performance of sequence retrieval.
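One plausible way to turn such a variable into an `lru_cache` size is sketched below. The `cache_maxsize` helper and the `lookup` function are hypothetical; only the variable name, its default of 1 million, and the "none" convention come from the text above.

```python
import os
from functools import lru_cache


def cache_maxsize(var_name: str, default: int):
    """Read a cache size from the environment; "none" means unlimited."""
    raw = os.environ.get(var_name)
    if raw is None:
        return default
    if raw.strip().lower() == "none":
        return None  # functools.lru_cache treats None as unbounded
    return int(raw)


# Hypothetical use: size a query cache from the documented variable.
@lru_cache(maxsize=cache_maxsize("SEQREPO_LRU_CACHE_MAXSIZE", 1_000_000))
def lookup(alias: str) -> str:
    return alias.upper()  # stand-in for an expensive database query
```

Because the decorator is evaluated when the module is imported, a variable handled this way must be set before the import for it to take effect.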

Developing

Developing on OS X

brew install python libpq bash

If you get "xcrun: error: invalid active developer path", you need to install XCode. See this StackOverflow answer.

Developing on Ubuntu

sudo apt install -y python3-dev gcc zlib1g-dev tabix

Here's how to get started developing:

make devready
source venv/bin/activate
seqrepo --version

Code reformatting:

make reformat

Install pre-commit hook:

# included in `make devready`, not necessary for new installations
pre-commit install

Building a docker image

Docker images are available at https://hub.docker.com/r/biocommons/seqrepo. Tags correspond to the version of data, not the version of seqrepo, because the intent is to make it easy to depend on a local version of seqrepo files. Each docker image is an installation of seqrepo that downloads the corresponding version of seqrepo data. When used in conjunction with docker volumes for persistence, this provides an easy way to incorporate seqrepo data into a docker stack.

Building

cd misc/docker
make 2021-01-29.log  # builds and pushes to hub.docker.com (i.e., you need creds)

biocommons.seqrepo's Issues

Update seqrepo to support new NCBI Fasta header format

NCBI changed the FASTA headers for /refseq/H_sapiens/mRNA_Prot/human.*.rna.fna.gz files.

What used to be
>gi|295424141|ref|NM_000439.4| Homo sapiens proprotein convertase subtilisin/kexin type 1 (PCSK1), transcript variant 1, mRNA

now looks like
>NM_000439.4 Homo sapiens proprotein convertase subtilisin/kexin type 1 (PCSK1), transcript variant 1, mRNA

I can provide a PR for this.

Q: Are there any implications for any other downstream systems, if the gi alias is missing for newer transcripts?
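A parser that accepts both header styles might look like the sketch below. It is an illustration built from the two example lines in this issue, not the change that was eventually made in seqrepo; the regular expressions and the `parse_header` name are assumptions.

```python
import re

# Old style: >gi|295424141|ref|NM_000439.4| description
OLD_HEADER = re.compile(r"^>gi\|(?P<gi>\d+)\|ref\|(?P<ac>[^|\s]+)\|\s*(?P<desc>.*)$")
# New style: >NM_000439.4 description
NEW_HEADER = re.compile(r"^>(?P<ac>[^|\s]+)\s*(?P<desc>.*)$")


def parse_header(line: str) -> dict:
    """Return accession, description, and gi (None when absent)."""
    line = line.rstrip("\n")
    match = OLD_HEADER.match(line)  # try the stricter pattern first
    if match:
        return match.groupdict()
    match = NEW_HEADER.match(line)
    if match:
        return {"gi": None, **match.groupdict()}
    raise ValueError(f"unrecognized FASTA header: {line!r}")
```

Returning `gi` as `None` for new-style headers makes the missing alias explicit to callers instead of silently dropping the key.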

Seqrepo docker image install problem

I am trying to spin up a container from the seqrepo image from docker hub and getting the error below.
The image pulls fine. I get the same error with the other image tag as well (20161004).
Thanks.

docker run --name seqrepo_test1 biocommons/seqrepo:20161213

Traceback (most recent call last):
File "/usr/local/bin/seqrepo", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.5/dist-packages/biocommons/seqrepo/cli.py", line 436, in main
opts.func(opts)
File "/usr/local/lib/python3.5/dist-packages/biocommons/seqrepo/cli.py", line 301, in pull
raise KeyError("{}: not in list of remote instance names".format(instance_name))
KeyError: '20161213: not in list of remote instance names'

Keep a `latest` symlink that points to most recent instance

It's handy to have a stable name to refer to the most recent instance. Implement support for a latest symlink that points to the most recent instance.
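A sketch of how such a symlink could be maintained follows. It assumes instance directories are named YYYY-MM-DD, as in the examples on this page; the `point_latest_at_newest` helper is hypothetical and is not the seqrepo implementation.

```python
import os
import re

# Assumes instance directories are named YYYY-MM-DD.
INSTANCE_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def point_latest_at_newest(root_dir: str) -> str:
    """Point root_dir/latest at the newest dated instance and return its name."""
    instances = sorted(n for n in os.listdir(root_dir) if INSTANCE_NAME.match(n))
    if not instances:
        raise FileNotFoundError(f"no instances under {root_dir}")
    newest = instances[-1]  # ISO dates sort chronologically as strings
    link = os.path.join(root_dir, "latest")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(newest, tmp_link)  # relative target keeps the tree relocatable
    os.replace(tmp_link, link)  # swap into place in one rename on POSIX
    return newest
```

Creating the link under a temporary name and renaming it means readers never see a moment when `latest` is missing.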

Permission issue when run using docker

Hi Reece,

I tried to get a local version of seqrepo setup using docker (under OSX). When trying to share my local seqrepo folder using
docker run -v /usr/local/share/seqrepo:/usr/local/share/seqrepo biocommons/seqrepo
rsync is running into permission problems:
rsync: recv_generator: mkdir "/usr/local/share/seqrepo/2018-08-21.et6u24wj/sequences" failed: Permission denied (13)

After some googling I managed to work around this by using:
docker run -v /usr/local/share/seqrepo:/usr/local/share/seqrepo --user 1000:50 biocommons/seqrepo

With this, seqrepo starts to download files. However, after the downloads are complete, the renaming of the folder fails with
PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2018-08-21.a6_3p6gn' -> '/usr/local/share/seqrepo/2018-08-21'

Note: I made sure that /usr/local/share/seqrepo/ is configured in docker as a shared folder and the seqrepo folder has global write permissions.

Any thoughts?

Thanks,
Andreas

simplify pulling new releases

Mirroring seqrepo requires something like

rsync -HRavP rsync.biocommons.org::seqrepo/2016082[78] /tmp/seqrepo/

rsync -HavP --link-dest=/tmp/seqrepo/20160827/ rsync.biocommons.org::seqrepo/20160828/ /tmp/seqrepo/20160828/

Use seqrepo cli to simplify this process. This will likely force a rethink about the distinction between seqrepo root (e.g., /usr/local/share/seqrepo) and seqrepo dir (/usr/local/share/seqrepo/master).

eliminate mirrors requirement and "tighten" update cycle

The seqrepo loading process currently expects data from mirrors-ncbi. That process is cumbersome and requires lots of disk space.

To improve currency of seqrepo data, implement a new loading process that downloads data, loads, and then discards the download. The goal is to eliminate long-term mirroring.

Ideas to consider:

streaming loading
how to support multiple sources (e.g., NCBI, Ensembl-xx, LRG)
builder as docker image
exact inputs and outputs to process. e.g., does this operate on master, or on a snapshot? Does it generate a new snapshot?
where to build?

write script to fetch and load sequences from remote sources

Depends on seqfetcher moving to bioutils

Standardize on unicode internally + ascii encoding

That is:

store(seq: unicode)
fetch(...): unicode

Translate VMC identifiers to sha512t24u

The ga4gh sequence digest is based on the sha512t24us used in seqrepo.

SeqRepo currently has alias entries for VMC. Translate these (as with identifiers in #31) to sha512t24us.

RFC: Origin and accession versioning for Ensembl

Two related questions:

Should SeqRepo support an unversioned "ensembl" namespace that refers to the most recent ensembl release? This effectively creates an alias for the most recent ensembl release.
Should SeqRepo support unversioned ensembl accessions? An unversioned ensembl accession is not globally unique: it must be qualified by the ensembl release. For example, accessions like ENST1234 are NOT unique; ENST1234.5, ensembl:ENST1234.5, and ensembl-99:ENST1234 uniquely refer to the same sequence. (Note that the unversioned accession is qualified by the ensembl release.)

"fix" namespace casing

I chose to use (mostly) lower case namespaces in seqrepo. That's unfortunate because most people will expect upper for abbreviations like NCBI.

This issue will update code and database to reflect casing like:

GRCh37
Ensembl-00
NCBI
LRG
SHA512

requires snapshot names to match instance_name_re

and -f to force

rethink upcase flag

seqrepo should not manipulate sequences on loading.

This issue should remove the flag and action. Then:

nothing else
raise exception if sequences don't smell like sequences
as above, but flag to disable

add command to list local and remote instances

Use namespaces from identifiers.org

The goal of this issue is to migrate namespaces to those used in identifiers.org. Those namespaces will become the standard in seqrepo. The namespaces are:

NCBI → refseq (for most accessions)
Ensembl → ensembl
LRG → lrg

Implementation options:

Abrupt db migration. Migrate namespaces in database in-place without a transition and release. Clients using old namespaces would break.
Use optional API translation to transition. Use three releases in which the new namespaces are 1) optional w/default off (no behavior change), 2) optional w/default on, 3) updated in db and API translation is removed.
Automatic translation. In the first release (e.g., 0.5.7), add obligatory translation from new to old namespaces. Storing aliases would continue to use the old namespace. Finding would translate new namespaces to old. Fetching alias records would add a synthetic record for the new namespace. In a subsequent release (e.g., 0.6), invert the previous translation.

Option 3 seems pretty clearly the superior choice: it's easier to understand, requires less communication, and obviates client switches.

Translate from NCBI to refseq
Translate Ensembl to ensembl
Translate LRG to lrg

create new "copy" cli subcommand

The snapshot command hardlinks sequences and copies databases, then makes the snapshot not writable.
When I screw up master, it's nice to have the reverse.

This issue will refactor snapshot into a copy command that accepts an optional mode (--writeable and --readonly?). Then snapshot will use copy and make readonly, and a new command (unsnapshot?) will copy and make writeable (as master, presumably).

A variation on this theme is to introduce two new commands, lock and unlock, to make read-only and writeable respectively.

Reenable seqaliasdb testing on travis-ci

seqaliasdb requires sqlite3 >= 3.8.0 (for conditional indexes). travis-ci is currently at 3.7.15 (travis-ci/apt-package-safelist#3295), which causes these tests to fail.

The regrettable solution is to disable testing in tests/test_seqaliasdb.py (on travis-ci only), which increases the onus on developers to test locally.

This issue is a reminder to work out a better solution, hopefully by upgrading packages on travis-ci.

Permission denied on pull

Hello,

I am trying to download seqrepo file with the command : seqrepo -r seqrepo/ pull
The rsync is failing with error message :

rsync: change_dir "/2018-10-03" (in seqrepo) failed: Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [Receiver=3.1.2]

Apparently there is a permissions problem on the server.

Best regards,

Vivien

Release last 2.7 version with pinned dependencies

implement identifier translator

Given a namespace and accession, return all related (namespace, accession) pairs.
For example, this could be used like so:

vmc_identifiers = seqrepo.translate_identifier("RefSeq:NM_000551.3", "VMC")

to get the VMC identifier for a RefSeq sequence. (The function will generally return a list.)

NameError: global name 'FileNotFoundError' is not defined, line 521, in update_latest

Trying a little self-help here:

(pipecycle) [jfreidin@USAE1CBIOINTP04 jfreidin]$ python -m pdb `which seqrepo` -r seqrepo show-status
> /compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/bin/seqrepo(4)<module>()
-> import re
(Pdb) c
/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/bioutils/_versionwarning.py:12: UserWarning: Support for Python < 3.6 is now deprecated and will be dropped on 2019-03-31. See https://github.com/biocommons/org/wiki/Migrating-to-Python-3.6
  "Support for Python < 3.6 is now deprecated and"
Traceback (most recent call last):
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/pdb.py", line 1314, in main
    pdb._runscript(mainpyfile)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/pdb.py", line 1233, in _runscript
    self.run(statement)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/bdb.py", line 400, in run
    exec cmd in globals, locals
  File "<string>", line 1, in <module>
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/bin/seqrepo", line 4, in <module>
    import re
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/cli.py", line 534, in main
    opts.func(opts)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/cli.py", line 409, in show_status
    sr = SeqRepo(seqrepo_dir)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/seqrepo.py", line 55, in __init__
    self.sequences = FastaDir(self._seq_path, writeable=self._writeable)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/fastadir/fastadir.py", line 67, in __init__
    self._db = sqlite3.connect(self._db_path)
OperationalError: unable to open database file
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/fastadir/fastadir.py(67)__init__()
-> self._db = sqlite3.connect(self._db_path)
(Pdb) p self._db_path
'seqrepo/latest/sequences/db.sqlite3'
(Pdb) q
Post mortem debugger finished. The /compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/bin/seqrepo will be restarted
> /compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/bin/seqrepo(4)<module>()
-> import re
(Pdb)

Which led me to the following bug?

(pipecycle) [jfreidin@USAE1CBIOINTP04 jfreidin]$ python -m pdb `which seqrepo` -v -r seqrepo update-latest
> /compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/bin/seqrepo(4)<module>()
-> import re
(Pdb) c
/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/bioutils/_versionwarning.py:12: UserWarning: Support for Python < 3.6 is now deprecated and will be dropped on 2019-03-31. See https://github.com/biocommons/org/wiki/Migrating-to-Python-3.6
  "Support for Python < 3.6 is now deprecated and"
Traceback (most recent call last):
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/pdb.py", line 1314, in main
    pdb._runscript(mainpyfile)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/pdb.py", line 1233, in _runscript
    self.run(statement)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/bdb.py", line 400, in run
    exec cmd in globals, locals
  File "<string>", line 1, in <module>
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/bin/seqrepo", line 4, in <module>
    import re
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/cli.py", line 534, in main
    opts.func(opts)
  File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/cli.py", line 521, in update_latest
    except (OSError, FileNotFoundError):  # OSError on Py2, FNF on Py3
NameError: global name 'FileNotFoundError' is not defined
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
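One way to avoid naming `FileNotFoundError` at all is to catch `OSError` and inspect `errno`, which behaves the same on Python 2 and Python 3. This is a sketch of that approach under the assumption that the failing code removes a path that may not exist; it is not necessarily the fix that was applied, and `remove_if_present` is a hypothetical helper.

```python
import errno
import os


def remove_if_present(path: str) -> bool:
    """Remove path; return False instead of raising when it does not exist."""
    try:
        os.remove(path)
        return True
    except OSError as exc:  # FileNotFoundError subclasses OSError on Python 3
        if exc.errno != errno.ENOENT:
            raise  # permission problems and other errors still surface
        return False
```

Checking `errno.ENOENT` keeps the handler as narrow as the original `except` clause intended, without depending on a name that only exists on Python 3.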

add assembly names to seqrepo

When sequences exist, add assembly name aliases.

Probably use bioutils for now to attach aliases like GRCh38:1 to ncbi:NC_000001.11.
Might as well do patch levels too.

Test for bgzip only when opened for writing (not needed for reading)

Generate "RefSeq" alias records in find_aliases()

#38 permits searching for aliases with a "RefSeq" namespace. This issue is to implement the return of alias records with "NCBI" replaced with "RefSeq". In the 0.4 series, the default will be to preserve current ("NCBI") behavior and enable opt-in translation; in 0.5, seqrepo will return RefSeq namespaces and drop this flag entirely. In other words, this issue enables developers to adopt future behavior now.

move seqrepo-push code into cli

sbin/seqrepo-push pushes a seqrepo instance to dl.biocommons.org. This is the converse of the cli pull command.

This issue will move the push into cli as well (and remove the script).

Provide range checking and negative coordinates a la Python

Seqrepo currently chokes on negative coordinates and doesn't provide range checking.
Provide both.
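Python's own `slice.indices` already implements the negative-coordinate convention, so a small wrapper can add the range check. The sketch below is one possible reading of this request; the `normalize_interval` name and the choice of exceptions are assumptions.

```python
def normalize_interval(start, end, length):
    """Map Python-style (possibly negative or None) bounds onto [0, length]."""
    if start is not None and not -length <= start <= length:
        raise IndexError(f"start {start} out of range for length {length}")
    if end is not None and not -length <= end <= length:
        raise IndexError(f"end {end} out of range for length {length}")
    # slice.indices resolves None and negative values the way str slicing does.
    start_i, end_i, _step = slice(start, end).indices(length)
    if start_i > end_i:
        raise ValueError(f"start {start_i} is after end {end_i}")
    return start_i, end_i
```

For a sequence of length 20, `normalize_interval(-5, None, 20)` yields the last five residues as `(15, 20)`, while `normalize_interval(0, 25, 20)` raises instead of silently truncating.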

improve test coverage

Test coverage is a measly 63%. Write tests to get to at least 90%.

Current coverage on master branch (live update):

separate build and deploy steps on travis

See the hgvs .travis.yml for how-to

write documentation, deploy to RTD

Document modules and classes, and write sphinx docs.
When that's done, arrange for RTD builds.

add JRGv1 & v2 sequences

Add Japanese reference genome (JRG) sequences
https://jmorp.megabank.tohoku.ac.jp/201902/downloads/

Implement GA4GH RefGet digest and REST interface

Implement https://samtools.github.io/hts-specs/refget.html digests and REST API.

add seqrepo rest interface

Add a simple read-only REST interface, initiated by seqrepo cli

This will still require that users have Python to run the REST interface, but will allow non-python programs to access.

e.g., seqrepo rest -i instance_name -p port -listen '*'

REST endpoints ideas:

/0/
  status
  sequence/:seq_id[/start_i,end_i]
  sequence/:nsa[/start_i,end_i]
  sequences?namespace=,accession=,nsa=
  sequence-aliases/:seq_id
  sequence-aliases?namespace=,accession=

or
/1/
  status
  sequences/:id (get)
  sequences?identifier=RefSeq:NM_01234.5
      ?alias=NM_01234.5
      ?namespace=RefSeq
      ?instance=20180101
  info?instance=
  instances/

Instructions should include running this as a docker container.

Can't install on Windows due to pysam dependency

The pysam library currently does not support Windows, which creates issues when attempting to install seqrepo and libraries that use it. (Such as hgvs - biocommons/hgvs#522)
Possible workaround may be to use pysam-win.

fastadir and seqaliasdb can't open ro files because they always try to upgrade

... which means that sqlite3 files must be left writable during cloning. Fix this.

Upgrade SQLite idioms to be DB-API compliant

The code uses expressions that work for SQLite but (per @teemuvesala) are not DB-API compliant. This prevents porting to other DB-API compliant backends.

Add sequence aliases from NCBI Assembly records

seqrepo add-assembly-aliases adds assembly names like "GRCh38:19" to the same sequence referred to as "NC_000019.10". This information comes straight from GRCh assembly info, via bioutils like this:

[{'aliases': ['chr19'],
  'assembly_unit': 'Primary Assembly',
  'genbank_ac': 'CM000681.2',
  'length': 58617616,
  'name': '19',
  'refseq_ac': 'NC_000019.10',
  'relationship': '=',
  'sequence_role': 'assembled-molecule'}]

Consider whether/how to add aliases to these records. One issue is that there may not be any guarantee that alias entries are unique within the namespace.

Support RefSeq namespaces for searching

SeqRepo currently uses NCBI as the namespace for what should be called RefSeq. We currently have NCBI in use in other biocommons packages, but that should not be further entrenched. To enable developers to use the proper RefSeq namespace, manually translate "RefSeq" to "NCBI" when finding accessions (i.e., SeqAliasDB::find_aliases()).

This action is a stop-gap until we address #31.

rework fetch_aliases method for consistency

SeqAliasDB.find_aliases() currently returns an iterator over rows whereas other query methods return dicts or lists of dicts. Make find_aliases consistent with other functionality by:

Renaming find_aliases to find_aliases_iter
Adding a new find_aliases that returns a list of dicts from the iter version.
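The iterator-plus-list pairing proposed here can be sketched against an in-memory SQLite table. The `AliasDB` class and its three-column schema are hypothetical stand-ins, not the real SeqAliasDB.

```python
import sqlite3

COLUMNS = ("namespace", "alias", "seq_id")


class AliasDB:
    """Toy stand-in for an alias store; the schema here is hypothetical."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.row_factory = sqlite3.Row
        self._db.execute("create table aliases (namespace text, alias text, seq_id text)")

    def store(self, namespace, alias, seq_id):
        self._db.execute("insert into aliases values (?, ?, ?)", (namespace, alias, seq_id))

    def find_aliases_iter(self, **criteria):
        """Lazily yield matching rows as dicts."""
        unknown = set(criteria) - set(COLUMNS)
        if unknown:
            raise TypeError(f"unknown criteria: {sorted(unknown)}")
        # Column names come from the whitelist above; values are bound parameters.
        where = " and ".join(f"{col} = ?" for col in criteria) or "1"
        cursor = self._db.execute(
            f"select namespace, alias, seq_id from aliases where {where}",
            tuple(criteria.values()),
        )
        for row in cursor:
            yield dict(row)

    def find_aliases(self, **criteria):
        """Eager variant: a list of dicts built from the iterator."""
        return list(self.find_aliases_iter(**criteria))
```

Keeping the iterator as the primitive lets large result sets stream, while the list wrapper gives callers the same return shape as the other query methods.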

Create abstract interface

seqrepo currently has distinct backends for aliases and sequences. Both are essentially key-value stores.

In order to open up other possibilities (redis, elasticache, federated repo proxy, local cache, etc.), implement abstract classes and subclass for each backend type.

Wishlist

Completely hide internal keys. They are an implementation detail that should be hidden.
Reimplement sequence proxy
Abstract classes for backends: sqlite, postgresql, dynamo/redis
abstract interfaces for RW backends + RO clients, incl REST
Standardize alias/identifier language
more tests
Rename VMC prefix
iterate over prefix
find_ not found → [] okay (ie none); fetch_ not found → KeyError
hosting
automated loading
stream-based loading

Investigate/Consider:

binary internal key → faster joins (?), smaller tables
unify alias + fasta database files (using existing db connection)

Functions:

alias → k
k → aliases
k → sequence
k → sequence info

seqrepo pull exits without creating the database.

I've been unable to install seqrepo on CentOS Linux release 7.5.1804 (Core).
It seems to download from the mirror and then exits without any indication of what went wrong.
I don't have root access, but can write files locally.

All files and directories are created ug-w btw.
I don't know if that's related to the problem.

[jfreidin@USAE1CBIOINTP04 jfreidin]$ seqrepo -r seqrepo pull
/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/bioutils/_versionwarning.py:12: UserWarning: Support for Python < 3.6 is now deprecated and will be dropped on 2019-03-31. See https://github.com/biocommons/org/wiki/Migrating-to-Python-3.6
"Support for Python < 3.6 is now deprecated and"
receiving incremental file list
created directory seqrepo/2018-11-26.YscIqz
./
aliases.sqlite3
2,391,415,808 100% 25.68MB/s 0:01:28 (xfr#1, ir-chk=1003/1005)
sequences/
sequences/db.sqlite3
182,138,880 100% 26.94MB/s 0:00:06 (xfr#2, ir-chk=1004/1008)
sequences/2016/
sequences/2016/0824/
sequences/2016/0824/045923/
sequences/2016/0824/045923/1472014763.7728612.fa.bgz
1,356,587 100% 323.44MB/s 0:00:00 (xfr#3, to-chk=881/1080)
sequences/2016/0824/045923/1472014763.7728612.fa.bgz.fai
412,267 100% 2.41MB/s 0:00:00 (xfr#4, to-chk=880/1080)
...
sequences/2018/1126/0628/1543213728.5207198.fa.bgz.fai
5,045 100% 4.81MB/s 0:00:00 (xfr#805, to-chk=1/1080)
sequences/2018/1126/0628/1543213728.5207198.fa.bgz.gzi
214,408 100% 2.25MB/s 0:00:00 (xfr#806, to-chk=0/1080)
(pipecycle) [jfreidin@USAE1CBIOINTP04 jfreidin]$ du -s seqrepo/*
12509709 seqrepo/2018-11-26
(pipecycle) [jfreidin@USAE1CBIOINTP04 jfreidin]$ seqrepo -r seqrepo/2018-11-26 show-status
/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/bioutils/_versionwarning.py:12: UserWarning: Support for Python < 3.6 is now deprecated and will be dropped on 2019-03-31. See https://github.com/biocommons/org/wiki/Migrating-to-Python-3.6
"Support for Python < 3.6 is now deprecated and"
Traceback (most recent call last):
File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/bin/seqrepo", line 11, in
sys.exit(main())
File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/cli.py", line 534, in main
opts.func(opts)
File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/cli.py", line 409, in show_status
sr = SeqRepo(seqrepo_dir)
File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/seqrepo.py", line 55, in __init__
self.sequences = FastaDir(self._seq_path, writeable=self._writeable)
File "/compbio/development/sandbox/jfreidin/miniconda2/envs/pipecycle/lib/python2.7/site-packages/biocommons/seqrepo/fastadir/fastadir.py", line 67, in __init__
self._db = sqlite3.connect(self._db_path)
sqlite3.OperationalError: unable to open database file

Enable translation of unqualified identifiers

The first implementation of identifier translation required a fully-qualified (namespaced) identifier. Relax that to support bare accessions, as long as each refers to only one seq_id across all namespaces.

SeqRepo Query Produces KeyError (namespace not unique)

Hello,

I was playing around with extracting sequences from SeqRepo for ensembl transcripts and noticed that a KeyError is raised when a lookup matches multiple sequences.

Successful example with BRAF ensembl transcript:

In [18]: sr['ENST00000288602']
INFO:biocommons.seqrepo.fastadir.fastadir:Opening for reading: /usr/local/share/seqrepo/2017-07-04/sequences/2016/0824/050656/1472015216.5390692.fa.bgz
Out[18]: 'CGCCTCCCTTCCCCCTCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCC .... (shortened for readability)'

Unsuccessful example with IGHM transcript

In [19]: sr['ENST00000390559']

KeyError                                  Traceback (most recent call last)
/Users/admin/.pyenv/versions/3.6.5/lib/python3.6/site-packages/hgvs/shell.py in <module>()
----> 1 sr['ENST00000390559']

/Users/admin/.pyenv/versions/3.6.5/lib/python3.6/site-packages/biocommons/seqrepo/seqrepo.py in __getitem__(self, nsa)
     65         # lookup aliases, optionally namespaced, like NM_01234.5 or NCBI:NM_01234.5
     66         ns, a = nsa.split(nsa_sep) if nsa_sep in nsa else (None, nsa)
---> 67         return self.fetch(alias=a, namespace=ns)
     68
     69     def __iter__(self):

/Users/admin/.pyenv/versions/3.6.5/lib/python3.6/site-packages/biocommons/seqrepo/seqrepo.py in fetch(self, alias, start, end, namespace)
     87         if len(seq_ids) > 1:
     88             # This should only happen when namespace is None
---> 89             raise KeyError("Alias {} (namespace: {}): not unique".format(alias, namespace))
     90
     91         return self.sequences.fetch(seq_ids.pop(), start, end)

KeyError: 'Alias ENST00000390559 (namespace: None): not unique'

It appears that the key is not actually missing; rather, the same alias (e.g. ENST00000390559) exists in multiple namespaces (i.e. Ensembl versions). This is important because when trying to annotate using the HGVS package, the namespace is not specified. Is there a specific reason for this behavior?

Remove support for Python 2.7

using decode() rather than six.u()

We're using six.u() rather than decode() to convert byte strings to unicode strings (i.e., str→unicode in py 2.7, and bytes→str in py 3.x).

This is a bug because six.u() does more than just decode -- it also interprets escapes. Although this probably doesn't hurt us in seqrepo, it's not the right way to do this.

snafu$ python2.7 -c 'import six; t="line\\nseparated?"; print(t); print(t.decode()); print(six.u(t))'
line\nseparated?
line\nseparated?
line
separated?

Note that t.decode() and six.u(t) are not doing the same thing above.
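The same distinction can be reproduced on Python 3 without six. In this sketch the `unicode_escape` codec stands in for the escape interpretation that the issue attributes to six.u() on Python 2; that mapping is an approximation made for illustration.

```python
import codecs

raw = b"line\\nseparated?"  # a backslash followed by "n", not a newline

decoded = raw.decode("ascii")  # bytes -> text, escapes left alone
escaped = codecs.decode(raw, "unicode_escape")  # also interprets \n

print(repr(decoded))
print(repr(escaped))
```

Only the second result contains a real newline, which is the behavior difference the issue describes.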

add "VMC" namespace, drop "sh" namespace

Add the VMC namespace (e.g., VMC:GS_...) and accessions for all existing sequences
Add code to create VMC accession
Drop sh namespace from db and loader.

See https://github.com/ga4gh/vmc

Support circular sequences

Seqrepo does not currently provide special support for circular sequences. It should.

The easier implementation is with a boolean that assumes the origin at interbase position 0.
An alternative is to support arbitrary origin locations by storing the origin position. For linear sequences, the value is unset. All coordinate computations would need to be modified to circularly permute based on this origin.
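The simpler of the two options, with the origin fixed at interbase position 0, could be sketched as follows. The `fetch_circular` helper is hypothetical, and it treats start == end as an empty interval rather than a full turn, which is one of the design questions a real implementation would need to settle.

```python
def fetch_circular(seq: str, start: int, end: int) -> str:
    """Slice seq in interbase coordinates, wrapping past the origin at 0."""
    n = len(seq)
    if not (0 <= start <= n and 0 <= end <= n):
        raise IndexError("coordinates must lie within [0, len(seq)]")
    if start <= end:
        return seq[start:end]
    return seq[start:] + seq[:end]  # start > end: interval spans the origin
```

For the ten-residue sequence "ACGTACGTAA", `fetch_circular(seq, 8, 2)` returns the last two residues followed by the first two.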

Allow default seqrepo location via SEQREPO_ROOT_DIR environment variable

multi-thread read support

When HGVS functions are called from a multithreaded program (e.g., a Django-based web app), the calls fail with this exception:

ProgrammingError('SQLite objects created in a thread can only be used in that same thread.The object was created in thread id {foo} and this is thread id {bar}',)

seealso: https://groups.google.com/forum/#!topic/hgvs-discuss/MeHe7jtRIPo
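A common way around this sqlite3 restriction is to give every thread its own connection. The sketch below uses `threading.local` for that; the `PerThreadDB` class is hypothetical and is not how seqrepo or hgvs resolved the issue.

```python
import os
import sqlite3
import tempfile
import threading


class PerThreadDB:
    """Give each thread its own sqlite3 connection to the same database file."""

    def __init__(self, path: str):
        self._path = path
        self._local = threading.local()

    def connection(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:  # first use in this thread
            conn = sqlite3.connect(self._path)
            self._local.conn = conn
        return conn


with tempfile.TemporaryDirectory() as tmp:
    db = PerThreadDB(os.path.join(tmp, "demo.sqlite3"))
    main_conn = db.connection()
    main_conn.execute("create table t (x integer)")
    main_conn.execute("insert into t values (42)")
    main_conn.commit()

    results = []

    def worker():
        # Each worker opens its own connection, so no cross-thread use occurs.
        conn = db.connection()
        results.append(conn.execute("select x from t").fetchone()[0])
        conn.close()

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    main_conn.close()
    print(results)
```

For read-only workloads this costs one extra file handle per thread and avoids sharing any SQLite object between threads.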

Allow bgzip to be found in PATH

fabgz.py hardcodes the path for bgzip. Enable it to be found in PATH or perhaps via configuration.

Also, bgzip version header has changed. Will need to adapt regexp appropriately.
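Looking the executable up on PATH can be done with `shutil.which` from the standard library. The `find_executable` wrapper below is a hypothetical sketch of that lookup and does not address the version-header parsing mentioned above.

```python
import shutil


def find_executable(name: str = "bgzip", path: str = None) -> str:
    """Return the full path to name, searching PATH (or the given path string)."""
    found = shutil.which(name, path=path)
    if found is None:
        raise FileNotFoundError(f"{name} not found; is htslib/tabix installed?")
    return found
```

Passing `path` explicitly would allow a configured directory to take precedence over the environment, which covers the "perhaps via configuration" part of the request.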