althonos / pyhmmer

Cython bindings and Python interface to HMMER3.

Home Page: https://pyhmmer.readthedocs.io

License: MIT License

hmmer hmmer3 bioinformatics cython-library python-bindings python-library hidden-markov-model sequence-analysis

pyhmmer's Introduction

🐍🟡♦️🟦 PyHMMER

Cython bindings and Python interface to HMMER3.

🗺️ Overview

HMMER is a biological sequence analysis tool that uses profile hidden Markov models to search for sequence homologs. HMMER3 is developed and maintained by the Eddy/Rivas Laboratory at Harvard University.

pyhmmer is a Python package, implemented using the Cython language, that provides bindings to HMMER3. It directly interacts with the HMMER internals, which has the following advantages over CLI wrappers (like hmmer-py):

  • single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyhmmer as a dependency to your project, and stop worrying about the HMMER binaries being properly set up on the end-user machine.
  • no intermediate files: Everything happens in memory, in Python objects you have control over, making it easier to pass your inputs to HMMER without needing to write them to a temporary file. Output retrieval is also done in memory, via instances of the pyhmmer.plan7.TopHits class.
  • no input formatting: The Easel object model is exposed in the pyhmmer.easel module, and you have the possibility to build a DigitalSequence object yourself to pass to the HMMER pipeline. This is useful if your sequences are already loaded in memory, for instance because you obtained them from another Python library (such as Pyrodigal or Biopython).
  • no output formatting: HMMER3 is notorious for its numerous output files and its fixed-width tabular output, which is hard to parse (even Bio.SearchIO.HmmerIO struggles with some sequences).
  • efficient: Using pyhmmer to launch hmmsearch on sequences and HMMs in disk storage is typically as fast as directly using the hmmsearch binary (see the Benchmarks section). pyhmmer.hmmer.hmmsearch uses a different parallelisation strategy compared to the hmmsearch binary from HMMER, which can help get the most out of multiple CPUs when annotating smaller sequence databases.

This library is still a work-in-progress, and in an experimental stage, but it should already pack enough features to run biological analyses or workflows involving hmmsearch, hmmscan, nhmmer, phmmer, hmmbuild and hmmalign.

🔧 Installing

pyhmmer can be installed from PyPI, which hosts some pre-built CPython wheels for Linux and macOS on x86-64 and Arm64, as well as the code required to compile from source with Cython:

$ pip install pyhmmer

Compilation for UNIX PowerPC is not tested in CI, but should work out of the box. Note that non-UNIX operating systems (such as Windows) are not supported by HMMER.

A Bioconda package is also available:

$ conda install -c bioconda pyhmmer

🔖 Citation

PyHMMER is scientific software, with a published paper in Bioinformatics. Please cite both PyHMMER and HMMER if you are using it in academic work, for instance as:

PyHMMER (Larralde et al., 2023), a Python library binding to HMMER (Eddy, 2011).

Detailed references are available on the Publications page of the online documentation.

📖 Documentation

A complete API reference can be found in the online documentation, or directly from the command line using pydoc:

$ pydoc pyhmmer.easel
$ pydoc pyhmmer.plan7

💡 Example

Use pyhmmer to run hmmsearch to search for Type 2 PKS domains (t2pks.hmm) inside proteins extracted from the genome of Anaerococcus provencensis (938293.PRJEB85.HG003687.faa). This will produce an iterable over TopHits that can be used for further sorting/querying in Python. Processing happens in parallel using Python threads, and a TopHits object is yielded for every HMM passed in the input iterable.

import pyhmmer

with pyhmmer.easel.SequenceFile("pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    sequences = list(seq_file)

with pyhmmer.plan7.HMMFile("pyhmmer/tests/data/hmms/txt/t2pks.hmm") as hmm_file:
    for hits in pyhmmer.hmmsearch(hmm_file, sequences, cpus=4):
        print(f"HMM {hits.query_name.decode()} found {len(hits)} hits in the target sequences")

Have a look at more in-depth examples such as building an HMM from an alignment, analysing the active site of a hit, or fetching marker genes from a genome on the Examples page of the online documentation.

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

⏱️ Benchmarks

Benchmarks were run on an i7-10710U CPU running @1.10GHz with 6 physical / 12 logical cores, using a FASTA file containing 4,489 protein sequences extracted from the genome of Escherichia coli (562.PRJEB4685) and version 33.1 of the Pfam HMM library containing 18,259 domains. Commands were run 3 times on a warm SSD. Plain lines show the times for pressed HMMs, and dashed lines the times for HMMs in text format.

[Figure: Benchmarks]

Raw numbers can be found in the benches folder. They suggest that phmmer should be run with the number of logical cores, while hmmsearch should be run with the number of physical cores (or less). A possible explanation for this observation would be that HMMER platform-specific code requires too many SIMD registers per thread to benefit from simultaneous multi-threading.
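
As a rough illustration of that recommendation, a thread count could be picked per tool. This is only a sketch: `os.cpu_count()` reports logical cores, the Python standard library has no portable way to count physical cores, and the `threads_per_core=2` default is an assumption about simultaneous multi-threading that should be checked for the machine at hand. The helper name `suggest_cpus` is hypothetical and not part of pyhmmer.

```python
import os

def suggest_cpus(tool, logical=None, threads_per_core=2):
    """Suggest a thread count for `tool` ("phmmer" or "hmmsearch").

    `threads_per_core` is an assumption about the CPU, not something
    that this function detects.
    """
    if logical is None:
        logical = os.cpu_count() or 1
    # Estimate the physical core count from the logical one.
    physical = max(1, logical // threads_per_core)
    # phmmer: logical cores; anything else (e.g. hmmsearch): physical cores.
    return logical if tool == "phmmer" else physical
```

The returned value could then be passed as the `cpus` argument shown in the example above.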

To read more about how PyHMMER achieves better parallelism than HMMER for many-to-many searches, have a look at the Performance page of the documentation.

🔍 See Also

Building an HMM from scratch? Then you may be interested in the pyfamsa package, providing bindings to FAMSA, a very fast multiple sequence aligner. In addition, you may want to trim alignments: in that case, consider pytrimal, which wraps trimAl 2.0.

If, despite all the advantages listed earlier, you would rather use HMMER through its CLI, this package will not be of great help. You can instead check the hmmer-py package developed by Danilo Horta at the EMBL-EBI.

⚖️ License

This library is provided under the MIT License. The HMMER3 and Easel code is available under the BSD 3-clause license. See vendor/hmmer/LICENSE and vendor/easel/LICENSE for more information.

This project is in no way affiliated, sponsored, or otherwise endorsed by the original HMMER authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

pyhmmer's People

Contributors

althonos, arajkovic, halanzi, rtviii, tmsincomb, valentynbez, zdk123

pyhmmer's Issues

Recommended settings for searching a small number of profiles against a large database

I'm using pyhmmer to search a small number of profiles (for example custom profiles, not from Pfam) against large databases, such as UniProt. Here are a few things I have questions about:

  1. Because the databases are too large to fit into memory, I am chunking the sequences into groups of 10,000 proteins at a time, and making multiple calls to hmmsearch. In order to get the e-value calculations right, I have to parse the protein file multiple times. First to count the number of sequences, to get the proper Z value. Then a second time to pass sequences in batches to hmmsearch, along with the pre-calculated Z, otherwise it will use Z=10,000. Is there a way to hold off calculating e-values until after the last batch of sequences is passed, so I only have to make one pass through the input (this seems like it's probably not possible)?
  2. As mentioned in the documentation, the parallelization strategy is not optimal when the number of profiles is less than the number of threads. When I run with a single hmm, I'm seeing usage of a single core. Are there any plans to implement a parallelization strategy that would be more appropriate for these kinds of workflows? If you don't have time, but can point me in the right direction, I might be able to submit a pull request to do this.

I like using pyhmmer instead of the hmmsearch executable for this, because my input files are actually GenBank genome files with CDS annotations, so with pyhmmer I can read the files, translate the CDSs and run hmmsearch all in memory without ever writing the protein sequences to disk, which is great.
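
The two-pass approach described in point 1 can be sketched with the standard library alone. This is a minimal illustration, not pyhmmer code: the helper names `count_fasta_records` and `batched` are hypothetical, the in-memory FASTA text stands in for a real file, and each batch would still have to be converted to sequences and passed to hmmsearch together with the pre-computed Z.

```python
import io
from itertools import islice

def count_fasta_records(handle):
    """First pass: count FASTA header lines to obtain Z."""
    return sum(1 for line in handle if line.startswith(">"))

def batched(iterable, size):
    """Second pass: yield lists of at most `size` consecutive items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Toy input standing in for a large protein FASTA file.
fasta = io.StringIO(">a\nMKV\n>b\nMLL\n>c\nMAA\n")
z = count_fasta_records(fasta)
batches = list(batched(range(5), 2))
```

With a batch size of 10,000 the same generator would produce the groups described above, and `z` would be reused for every call so that E-values stay comparable across batches.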

Could not parse file: Line 20: illegal character -

It can't parse FASTA files containing the "-" symbol, though that symbol is present in hmm.alphabet.symbols

with pyhmmer.plan7.HMMFile(model_filepath) as hmm_file:
    hmm = next(hmm_file)

for record in SeqIO.parse(sequences_filepath, "fasta"):
    # Can't create Digital Sequence here :(
    pass

with pyhmmer.easel.SequenceFile(sequences_filepath) as seq_file:
    seq_file.set_digital(hmm.alphabet)
    sequences = list(seq_file) # Producing error

pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
hits = pipeline.search_hmm(hmm, sequences)

Traceback (most recent call last):
  File "/home/jediknight/Documents/Nonribosomal-Peptides/hmm.py", line 110, in <module>
    acc = test_models()
  File "/home/jediknight/Documents/Nonribosomal-Peptides/hmm.py", line 76, in test_models
    sequences = list(seq_file)
  File "pyhmmer/easel.pyx", line 3683, in pyhmmer.easel.SequenceFile.__next__
  File "pyhmmer/easel.pyx", line 3724, in pyhmmer.easel.SequenceFile.read
  File "pyhmmer/easel.pyx", line 3791, in pyhmmer.easel.SequenceFile.readinto
ValueError: Could not parse file: Line 20: illegal character -
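
One possible workaround, sketched under the assumption that the file is a gapped (aligned) FASTA and that the gaps can simply be discarded before searching: strip the gap characters from the sequence lines first. The helper `strip_gaps` is hypothetical and uses only the standard library; it does not touch header lines, which may legitimately contain "-".

```python
def strip_gaps(fasta_text, gap_chars="-."):
    """Return FASTA text with gap characters removed from sequence lines."""
    table = str.maketrans("", "", gap_chars)
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # keep headers untouched
        else:
            cleaned.append(line.translate(table))
    return "\n".join(cleaned) + "\n"
```

The cleaned text could then be written to a temporary file or used to build the sequences directly, depending on how the rest of the script loads them.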

Error raised from C code: fseeko() failed, eslESYS (status code 12)

Hello Martin!
I've encountered an error while trying to load a pressed HMM database into memory. I'm using the current version of pyhmmer (0.7.4).
To reproduce this error you can take these steps:

  1. Download and unzip http://prodata.swmed.edu/ecod/distributions/ecodf.hmm.tar.gz
  2. Press this database with the following:
    pyhmmer.hmmer.hmmpress(pyhmmer.plan7.HMMFile("ecodf.hmm"), "ecod")
  3. Try to load it into memory:
with pyhmmer.hmmer.HMMFile("ecodf") as hmm_db:
    models = list(hmm_db.optimized_profiles())

The last one results in:

EaselError: Error raised from C code: fseeko() failed, eslESYS (status code 12)
SystemError:  returned a result with an exception set

If I am doing something wrong, could you please give me an example of loading a pressed database into RAM?
Thanks in advance for your reply ;)

Error while running nhmmer

Hello! I am trying to use this package for searching DNA queries against a sequence database, and I am running into an error I am unable to debug. I am running the function pyhmmer.hmmer.nhmmer(msa, sequences, cpus=1) and am encountering the following error - TypeError: Cannot convert pyhmmer.easel.DigitalMSA to pyhmmer.easel.DigitalSequence.

Further, when I convert the msa to an hmm and pass it to the function, I get a different error: UnexpectedError: Unexpected error occurred in 'idlen_list_assign': eslENOTFOUND (status code 6).

The code I am running is given below. I am running it on Google Colab:

import pyhmmer

alphabet = pyhmmer.easel.Alphabet.dna()

with pyhmmer.easel.MSAFile("output.msa", digital=True, alphabet=alphabet) as msa_file:
    msa = msa_file.read()

msa.name = b"output_mafft"

builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
hmm, _, _ = builder.build_msa(msa, background)

with pyhmmer.easel.SequenceFile("all.fasta", digital=True) as seq_file:
    sequences = seq_file.read_block()

hits = next(pyhmmer.hmmer.nhmmer(msa, sequences, cpus=1)) # or hmm instead of msa

I am attaching the MSA file and the FASTA sequence file herewith. Could you please help me understand what could be going wrong? Thank you! (PS: the MSA is of type aligned FASTA and was created using mafft, in case that is useful information.)

nhmmer.zip

Aborted (core dumped) occurred while doing translate

Hi - I was trying to load a DNA coding sequence and then translate it to a peptide sequence for hmmsearch. I loaded the gzipped FASTA DNA sequence with pyhmmer.easel.SequenceFile and everything was fine up to that point: the Alphabet was correctly identified as DNA. However, when I call translate, my Jupyter kernel keeps crashing. I also tried it with a regular script but still got an Aborted (core dumped) failure.

I did a quick test to find out what is going on. I found that I can successfully translate the SequenceBlock with the first 5 entries. I can also successfully translate the block from the 6th to the 20th entries. But when I do it with 0 to 6, it crashes. I also checked the FASTA file but could not find anything suspicious.

This failure also seems to vary between machines. I ran the test on another host and it succeeded for the 0 to 6 block. But when I tried to translate the entire block, it eventually failed.

I am not sure whether there is anything I missed in my FASTA file, so I have included it below.
Actinomucor_elegans_CBS_100.09.cds.fasta.gz

And here is my test script. It usually fails at the last line, but it fails at seqs[0:6].translate() when running in Jupyter. I've already updated pyhmmer to the latest version, 0.10.1.

import pyhmmer

print(pyhmmer.__version__)
seq_file = pyhmmer.easel.SequenceFile("Actinomucor_elegans_CBS_100.09.cds.fasta.gz", digital=True)
seqs = seq_file.read_block()
peps = seqs[0:5].translate()
print("Try to print peptide")
print(peps[-1].sequence)
peps = seqs[6:20].translate()
print("Try to print peptide")
print(peps[0].sequence)
peps = seqs[0:6].translate()
print("Try to print peptide")
print(peps[-1].sequence)

[Feature Request] Internally assigning Z for the HMM class

Hello!

Thank you for the awesome and much-needed package.

I wanted to ask if there could be an internal way to determine the Z value, or if it could be stored as an internal attribute of the plan7.HMMFile class. For example:

from pyhmmer.easel import SequenceFile
from pyhmmer.plan7 import HMMFile, Pipeline

with HMMFile("path/to/hmm/db.hmm") as hmm_file:
    hmms = hmm_file.read()
    Z_val = hmms.Z  # proposed attribute, does not exist yet

    with SequenceFile(proteins, digital=True) as seq_file:
        prots = seq_file.read_block()
        pli = Pipeline(hmms.alphabet, Z=Z_val, bit_cutoffs="gathering")
        hits = pli.search_hmm(hmms, prots)

Sometimes I am working with HMM databases where I don't know the Z value in advance. I usually find out using

z_score=$(grep -c "NAME" ${input.db})

but it would be nice if it could be easily internally handled!

Best,

Erfan
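
For reference, the grep command above can be mirrored in plain Python. This is a minimal sketch that assumes a text-format HMM file in which every profile has exactly one line starting with NAME; it would not work on a pressed (binary) database. The helper name `count_profiles` is hypothetical, and the in-memory text stands in for a real file.

```python
import io

def count_profiles(handle):
    """Count profiles in a text-format HMM file by counting NAME lines."""
    return sum(1 for line in handle if line.startswith("NAME"))

# Toy stand-in for a two-profile HMM file (only a few header lines shown).
example = io.StringIO(
    "HMMER3/f [3.3 | Nov 2019]\nNAME  fam1\nLENG  10\n//\n"
    "HMMER3/f [3.3 | Nov 2019]\nNAME  fam2\nLENG  12\n//\n"
)
z_val = count_profiles(example)
```

With a real file, `count_profiles(open(path))` would give the value to pass as `Z` when building the pipeline.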

ValueError: Could not build HMM: Unable to name the HMM.

When trying to start a search with pyhmmer.plan7.Pipeline.search_msa I get the following error:

ValueError: Could not build HMM: Unable to name the HMM.

After some digging, I realised the pyhmmer.easel.MSA object has a name attribute that I could just fill with a b'placeholder', which makes the error go away. I suspect this attribute should've been set by pyhmmer.easel.MSAFile, which I am using to read an alignment in Stockholm format.

Is this a bug?

`pyhmmer.nhmmer` throws error when the number of FASTA seqs is larger than 64

Hi @althonos, this is a bug report about pyhmmer.nhmmer.

I encountered an error when passing a multi-FASTA file to pyhmmer.nhmmer via pybarrnap.
After investigating the conditions for reproducing the error, it appears that the error occurs when the number of FASTA seqs is greater than 64. The error message is also not always the same: Segmentation fault, corrupted size vs. prev_size, or munmap_chunk(): invalid pointer.

You can reproduce the error with the following example data and commands.

examples.zip

pybarrnap examples/64seqs_success.fa
pybarrnap examples/65seqs_failed.fa
pybarrnap examples/91seqs_failed.fa

new empty HMM segfaults when saved to file

import pyhmmer
from pyhmmer.plan7 import HMMFile, HMM
from pyhmmer.easel import Alphabet
print(pyhmmer.__version__)
newhmm = HMM(10, Alphabet.dna())
with open("newhmm.hmm", "w+b") as fh:
  newhmm.write(fh)

Output:

0.3.0
Segmentation fault (core dumped)

I don't know if this is a bug or not, but this seems a bit counterintuitive at least. If this is a bug, is there any chance that I can know why this is segfaulting?

CPUs not being utilized

Hello!

I have a use case where I am running hmmscan for a number of proteins against Pfam.

I have prefetched Pfam by pressing it and then loading an OptimizedProfileBlock, to be my targets.

I then loaded my sequences into DigitalSequenceBlock.

I am not very memory limited, so I figured prefetching the hmms and loading the sequences as a block (as opposed to SequenceFile) was the right strat, LMK if not.

Regardless, when I specify cpus=32, I see only one CPU being used (100% CPU) in top.

Thanks for any help!

search_hmm finds 0 hits

Hi! Thanks for your tool, it's very useful to have a Python interface to HMMER.

I'm having some problems with search_hmm.
I can reproduce the example in the Documentation with LuxC.hmm and LuxC.faa, but when using different HMM models search_hmm does not find any hits.

I'm using the MHC_I.hmm model (downloaded from http://pfam.xfam.org/family/PF00129/hmm) and a set of sequences belonging to the same domain (downloaded from http://pfam.xfam.org/family/PF00129#tabview=tab3, RP15 with no gaps).
Running hmmsearch from the terminal I find exactly one hit for each sequence (as expected), but pipeline.search_hmm doesn't find any.

Here is my code to reproduce the error:

import pyhmmer

# load hmm
path_hmm = './test/MHC_I.hmm'
hmm_file = pyhmmer.plan7.HMMFile(path_hmm)
hmm = next(hmm_file)
# pipeline
alphabet = pyhmmer.easel.Alphabet.amino()
background = pyhmmer.plan7.Background(alphabet)
pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background, report_e=1e-5)
# apply hmm to a sequence dataset
seq_file = pyhmmer.easel.SequenceFile("./test/PF00129_rp15.faa")
seq_file.set_digital(alphabet)
hits = pipeline.search_hmm(query=hmm, sequences=seq_file)
print(len(hits))  # prints 0

Am I missing something?
Thanks!

Hmmsearch callback tqdm update

Hi,
I have a question about the callback in the hmmsearch function. I would like to update my progress bar after each query, but my code does not work as expected.

bar = tqdm(range(len(hmm_list)), unit="hmm", desc="Align gene families to HMM", disable=disable_bar)
options = {"bit_cutoffs": bit_cutoffs, 'callback': lambda p: bar.update()}
for top_hits in pyhmmer.hmmsearch(hmm_list, gf_sequences, cpus=threads, **options):
    ...  # loop body omitted in the original post

Maybe I do not understand how to use it.

I update it manually at the end of the for loop to make it work for the time being, but I would also like to use this to write the name of the HMM in a debug message (with the logging package). So, it seems a good idea to define a callback function.

Thanks for your help
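
A sketch of the callback pattern, assuming (as the pyhmmer documentation describes for hmmsearch) that the callback is called with two arguments, the query and the total number of queries; under that assumption a one-argument lambda would fail when invoked. The function `run_queries` is a hypothetical stand-in for the search loop so that the example runs without pyhmmer or tqdm.

```python
import logging

def run_queries(queries, callback=None):
    """Hypothetical stand-in for a search loop that reports progress."""
    total = len(queries)
    for query in queries:
        result = query.upper()  # placeholder for the real search work
        if callback is not None:
            callback(query, total)  # two arguments: the query and the total
        yield result

seen = []

def on_query_done(query, total):
    """Record the finished query and log its name at debug level."""
    seen.append(query)
    logging.debug("processed %s (%d/%d)", query, len(seen), total)

results = list(run_queries(["hmm_a", "hmm_b"], callback=on_query_done))
```

With tqdm, the equivalent callback would be something like `lambda query, total: bar.update()`, again assuming the two-argument form.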

segmentation fault in hmmalign

Hi there

This might be my fault (I'm just getting started with pyhmmer) or related to issue

new empty HMM segfaults when saved to file

but when I execute this code

import pyhmmer as phmm
with phmm.easel.SequenceFile(shared_mount+"DB/pfam_IPR002213_top19.fasta") as sf:
    sequences = sf.read_block()
sequences = sequences.digitize(phmm.easel.Alphabet.amino())
hmm = phmm.plan7.HMM(100, phmm.easel.Alphabet.amino())
align = phmm.hmmer.hmmalign(hmm, sequences)

I'm getting a "Segmentation fault (core dumped)".

Given that I would expect a meaningful error if my syntax were incorrect, the only way I can get help is by letting you know.

The pfam_IPR002213_top19.fasta file contains the top 19 sequences of a much larger file.

Regards

Christoph

Generating sequences

Is it possible to generate sequences based on the hmm profile transition probabilities? (e.g. sample from the alphabet based on the vocabulary)
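
To make the question concrete, here is a toy sketch of what sampling from a profile means. It is not pyhmmer's API and not a full Plan7 model: it assumes a simplified profile with one emission distribution per position and a single probability `p_match` of emitting at each position (a skipped position mimics a delete state; insert states are ignored). All names are hypothetical.

```python
import random

def sample_sequence(emissions, p_match, rng):
    """Sample one sequence from a toy profile.

    emissions: list of {symbol: probability} dicts, one per model position.
    p_match:   probability of emitting at a position; otherwise the
               position is skipped, mimicking a delete state.
    """
    residues = []
    for column in emissions:
        if rng.random() < p_match:
            symbols = list(column)
            weights = [column[s] for s in symbols]
            residues.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(residues)

# Three-position DNA profile favouring A, then C, then G.
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
rng = random.Random(0)
samples = [sample_sequence(profile, 0.9, rng) for _ in range(5)]
```

Each sampled string has at most one residue per model position, drawn from that position's emission probabilities.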

HMMSearch block and freeze

Hi

As explained in your documentation, I'm trying to align multiple HMMs (736) against a digitized database of sequences (32196). I don't know why, but it looks like pyhmmer is freezing during hmmsearch. When I force the command to stop, I get this message:

Traceback (most recent call last):
  File "/env/cns/home/jarnoux/.conda/envs/panorama/bin/panorama", line 8, in <module>
    sys.exit(main())
  File "/env/export/cns_n02_agc/agc/proj/PANORAMA/panorama/main.py", line 168, in main
    panorama.annotate.launch(args)
  File "/env/export/cns_n02_agc/agc/proj/PANORAMA/panorama/annotate/annotate.py", line 276, in launch
    annot_pangenomes(pangenomes=pangenomes, source=args.source, table=args.table,
  File "/env/export/cns_n02_agc/agc/proj/PANORAMA/panorama/annotate/annotate.py", line 256, in annot_pangenomes
    pangenomes2metadata = annot_pangenomes_with_hmm(pangenomes, hmm, mode, bit_cutoffs, threads, disable_bar)
  File "/env/export/cns_n02_agc/agc/proj/PANORAMA/panorama/annotate/annotate.py", line 224, in annot_pangenomes_with_hmm
    pangenome2annot[pangenome.name] = annot_with_hmm(pangenome, hmms, hmm_df, mode, bit_cutoffs,
  File "/env/export/cns_n02_agc/agc/proj/PANORAMA/panorama/annotate/hmm_search.py", line 261, in annot_with_hmm
    res = annot_with_hmmsearch(hmms, gf_sequences, meta, bit_cutoffs, threads, disable_bar)
  File "/env/export/cns_n02_agc/agc/proj/PANORAMA/panorama/annotate/hmm_search.py", line 195, in annot_with_hmmsearch
    for top_hits in pyhmmer.hmmsearch(hmm_list, gf_sequences, cpus=threads, **options):
  File "/env/cns/home/jarnoux/.conda/envs/panorama/lib/python3.10/site-packages/pyhmmer/hmmer.py", line 505, in _multi_threaded
    query_queue.put(chore)  # <-- blocks if too many chores in queue
  File "/env/cns/home/jarnoux/.conda/envs/panorama/lib/python3.10/queue.py", line 140, in put
    self.not_full.wait()
  File "/env/cns/home/jarnoux/.conda/envs/panorama/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt

I tried another HMM database, and everything works well. So, it could come from the HMM profiles, but they are from CasFinder. That's why I do not understand where it's coming from.

I also tried to launch HMMER directly and it looks like everything works as expected.

how to search HMM file with multiple profiles, i.e. Pfam?

I'm probably missing something obvious, but I can't seem to figure out how to use pyhmmer to search against a database of HMM profiles, i.e. Pfam. When I try to read the file, it does not seem to iterate over the individual profiles.

When loading sequences you can do this, which creates a list of the digitized sequences:

with pyhmmer.easel.SequenceFile('inputfile.fasta', digital=True, alphabet=alphabet) as seq_file:
    sequences = list(seq_file)

I naively assumed that the HMMFile would do the same?

with pyhmmer.plan7.HMMFile('/path/to/Pfam-A.hmm') as hmm_file:
    hmm = list(hmm_file.read())

This errors out with TypeError: 'pyhmmer.plan7.HMM' object is not an iterator.

If I just run this:

with pyhmmer.plan7.HMMFile('/path/to/Pfam-A.hmm') as hmm_file:
    hmm = hmm_file.read()

Then it is just the single HMM profile (the first one in the Pfam database 1-cysPrx_C). So what am I doing wrong?

`incE` not being respected

pyhmmer version 0.8.1

Running hmmscan on some proteins vs Pfam.

e.g.

seqs = pyhmmer.easel.SequenceFile('test.fasta', format='fasta', digital=True, alphabet=pyhmmer.easel.Alphabet.amino())
targets = pyhmmer.plan7.HMMFile('Pfam.hmm')
all_hits = pyhmmer.hmmer.hmmscan(seqs, targets, cpus=cpu, incE=eval_con)

The following then raises:

for top_hits in all_hits:
    for hit in top_hits:
        assert hit.evalue < top_hits.incE

LMK if inclusion of additional info would help.

EDIT: If this filtering happens in Python, I am happy to find it and make a pull request, though if it's down in the Cython I will have a hard time :(
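
One likely explanation, offered as a hedged sketch rather than a diagnosis: HMMER distinguishes a reporting threshold (E, default 10) from an inclusion threshold (incE, default 0.01), so a result list can contain hits that are reported but not included. The toy code below illustrates that distinction with a hypothetical `ToyHit` class; it does not use pyhmmer, whose hit objects may expose a comparable flag (check the API reference).

```python
from dataclasses import dataclass

@dataclass
class ToyHit:
    name: str
    evalue: float

def split_hits(hits, E=10.0, incE=0.01):
    """Return (reported, included) hits for the given thresholds."""
    reported = [h for h in hits if h.evalue <= E]
    included = [h for h in reported if h.evalue <= incE]
    return reported, included

hits = [ToyHit("a", 1e-30), ToyHit("b", 0.5), ToyHit("c", 50.0)]
reported, included = split_hits(hits)
```

Here hit "b" is reported but not included, which is the situation in which an assertion like the one above would fail.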

Pass query name to translated sequences

Current behaviour
When using translate, translated sequences come without names:

import pyhmmer 

with pyhmmer.easel.SequenceFile("COG0012.fna", digital=True) as seqs_file:
    nucleotides = seqs_file.read_block()
    
proteins = nucleotides.translate()
print(nucleotides[0].name)
print(proteins[0].name)  
b'1089456.NCGM2_0913'
b''

Proposed change
Pass query names to the translated sequences

Save the DigitalSequenceBlock as pickle

Hi Martin!

In one of my projects, I would like to save a list of pyhmmer.easel.DigitalSequenceBlock objects to a pickle file as intermediate data. However, an error occurred stating TypeError: no default __reduce__ due to non-trivial __cinit__. Is there any other way to save a pyhmmer.easel.DigitalSequenceBlock to a file that can be loaded later?

In terms of (sub)sequence retrieval, we were previously using esl-sfetch from the original HMMER suite. I also noticed there are pyhmmer.easel.SSIReader and pyhmmer.easel.SSIWriter classes, but I'm not quite sure whether they are related to esl-sfetch?

I would also like to report a glitch on the webpage. I was not able to reach the documentation page for the latest version 0.10.3: when I click on the tab it always leads me back to the landing page.

Best way to Read in/Search Over Many Many HMMs?

Hi Martin,

Pyhmmer is awesome - just trying to play around with PHROGs and build it into some tooling.

Using v0.9.0.

One question - what do you think the best way is to read in lots of HMMs? Like 38000? I've made a bunch with pyhmmer really easily.

In the example (https://pyhmmer.readthedocs.io/en/stable/examples/recipes.html#Loading-multiple-HMMs) the HMMs were hardcoded. I've tried a few approaches to get around this but am running into a strange error.

For example after tweaking the class to take a list

import contextlib
import itertools
import os
import typing

from pyhmmer.plan7 import HMM, HMMFile

class HMMFiles(typing.ContextManager[typing.Iterable[HMM]]):
    def __init__(self, files: list['os.PathLike[bytes]']) -> None:
        self.stack = contextlib.ExitStack()
        self.hmmfiles = [self.stack.enter_context(HMMFile(f)) for f in files]

    def __enter__(self) -> typing.Iterable[HMM]:
        return itertools.chain.from_iterable(self.hmmfiles)

    def __exit__(self, exc_type: object, exc_value: object, traceback: object) -> None:
        self.stack.close()

Then specifying the files and reading them in

from pathlib import Path
import glob

# MSA_Phrogs_M50_HMM is the directory in the working dir containing all the .hmms
HMM_dir = Path("MSA_Phrogs_M50_HMM")
pattern = "*.hmm"  # Replace with your desired file pattern
files = HMM_dir.glob(pattern)

with HMMFiles(files) as hmm_files:
    all_hits = list(pyhmmer.hmmsearch(hmm_files, targets))

But this throws a very weird error:

FileNotFoundError: [Errno 2] no such file or directory: PosixPath('MSA_Phrogs_M50_HMM/phrog_29267.hmm')

when this file does definitely exist.

George

Multiple HMMFiles

This is related to issue #23, but rather than having multiple HMMs in a single file, I'd like to treat the HMMs across multiple files as a single iterable to hmmsearch. This would reduce the overall memory footprint (again, similar to the motivations discussed in #23).

Here's a wrapper class that does the trick and also works as a context manager:

from pyhmmer.plan7 import HMMFile
from itertools import chain

class HMMFiles():
    def __init__(self, files):
        self.hmmfiles = [HMMFile(f) for f in files]

    def __enter__(self):
        return chain.from_iterable(self.hmmfiles)

    def __exit__(self, *args):
        for f in self.hmmfiles:
            f.close()

Usage:

with pyhmmer.easel.SequenceFile('queries.fasta') as seq_file, HMMFiles(['1.hmm', '2.hmm']) as hmm_file:
    hits = list(pyhmmer.hmmsearch(hmm_file, list(seq_file)))

Thought I would leave this here for posterity even if it doesn't fit in the repo.

ValueError: Could not determine format of file: '/dbfs/mnt/LuxC.sto'

Hi,
I was following your tutorial on Multiple sequence alignment (MSA) to HMM.
I have downloaded your example data into my working directory, and I can see the two files (LuxC.faa and LuxC.sto) there:
[FileInfo(path='dbfs:/mnt/LuxC.faa', name='LuxC.faa', size=153510),
FileInfo(path='dbfs:/mnt/LuxC.sto', name='LuxC.sto', size=150686),

when I tried to run this code:

with pyhmmer.easel.MSAFile("/dbfs/mnt/LuxC.sto") as msa_file:
    msa_file.set_digital(alphabet)
    msa = next(msa_file)

It gives me an error like this:
ValueError: Could not determine format of file: '/dbfs/mnt/LuxC.sto'

I am not sure where it went wrong; the installation and the first two commands in the tutorial work fine.
Thanks for your help

zip argument #1 must support iteration error + other question

Hey Martin,

Thanks for making this tool, I'm finding it very useful for my current project.

I have a profile hmm database obtained from CONJScan that I want to use to scan through a fasta file containing multiple sequences.
I am running into some issues that I can't seem to figure out a workaround for.

To preface my issue, let me explain what I am trying to do:
Using the CONJScan database and python, I am iterating over the profile hmms in a for-loop. Each loop, I am using a profile hmm to scan through a fasta file containing multiple sequences. Then at the end of each loop, I output a graphic via dna_features_viewer with a unique name containing a visualization of my alignments.

There are two problems I am encountering:

  1. Occasionally, I receive an error saying that zip argument #1 must support iteration. It refers to for ax, hit in zip(axes, hits):..., where argument #1 of zip(axes, hits) is not iterable. I am not sure why this happens, because aside from the ad-hoc loop I created to go through each profile HMM in my database, everything mimics the example provided on the readthedocs.io page.

  2. At the end of the process, I will have multiple hits from different HMM profiles on the same fasta sequence. However, I would like to visualize them together, rather than separately. I am unsure if I am using the tool incorrectly or if this is currently unsupported.

Copied below is my code, excuse me for the messiness, I am still testing things out.


import pyhmmer
import os
from dna_features_viewer import GraphicFeature, GraphicRecord
import matplotlib.pyplot as plt

directory = 'profiles'
#iterate over profiles in folder
#this is to iterate over a folder containing many profile Hmm (CONJScan database)
for hmmprofile in os.listdir(directory):
    f = os.path.join(directory, hmmprofile)
    if os.path.isfile(f):
        try:
            with pyhmmer.plan7.HMMFile(f) as hmm_file:
                hmm = next(hmm_file) 
            with pyhmmer.easel.SequenceFile("test.fasta", digital=True) as seq_file: #test.fasta contains many sequences in amino acid format
                sequences = list(seq_file)
            pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
            hits = pipeline.search_hmm(hmm, sequences)
            ali = hits[0].domains[0].alignment
            hmm_name = (ali.hmm_name.decode()) #storing the name of the hmm profile in the event that a search succeeds
            # create an index so we can retrieve a Sequence from its name
            seq_index = { seq.name:seq for seq in sequences }

            fig, axes = plt.subplots(nrows=len(hits), figsize=(30, 30), sharex=True)
            try:
                for ax, hit in zip(axes, hits):
                    # add one feature per domain
                    features = [
                        GraphicFeature(start=d.alignment.target_from-1, end=d.alignment.target_to, color='#00FF00', label=hmm_name) #using the hmm_name to create labels for the graphic feature
                        for d in hit.domains
                    ]
                    length = len(seq_index[hit.name])
                    desc = seq_index[hit.name].description.decode()

                    # render the feature records
                    record = GraphicRecord(sequence_length=length, features=features)
                    record.plot(ax=ax)
                    ax.set_title(desc)
                    try:
                        ax.figure.tight_layout()
                        ax.figure.savefig(desc + hmm_name + ".png") #using both the descriptor + hmm_name to create a unique result and saving the graphic as a png
                    except Exception as e:
                        # print(e)
                        continue
            except Exception as e:
                # print(e)
                continue
        except Exception as e:
            # print(e)
            continue

Any advice you can provide would help immensely.
Thank you.
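
For the second question, one possible approach is to stop plotting inside the per-profile loop and instead collect the domain coordinates of every profile per target sequence, then draw one record per sequence at the end. The sketch below shows only the grouping step with the standard library; the tuples in `domain_hits` and the profile and sequence names are made-up placeholders for values that would come from each hit's domains.

```python
from collections import defaultdict

# Hypothetical (hmm_name, target_name, start, end) tuples, standing in for
# values collected from every hit of every profile.
domain_hits = [
    ("profile_A", "seq1", 10, 120),
    ("profile_B", "seq1", 200, 340),
    ("profile_A", "seq2", 5, 90),
]

# Group the features by target sequence, regardless of which profile hit it.
features_by_target = defaultdict(list)
for hmm_name, target, start, end in domain_hits:
    features_by_target[target].append((start, end, hmm_name))

for target, features in sorted(features_by_target.items()):
    features.sort()  # order features along the sequence
    print(target, features)
```

Each list in `features_by_target` could then be turned into GraphicFeature objects and passed to a single GraphicRecord for that sequence.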

PyHmmer installation error

C:\Users\Irish\AppData\Local\Programs\Python\Python311\Lib\site-packages>pip install pyhmmer
Collecting pyhmmer
Downloading pyhmmer-0.8.2.tar.gz (11.0 MB)
---------------------------------------- 11.0/11.0 MB 9.5 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: psutil~=5.8 in c:\users\irish\appdata\local\programs\python\python311\lib\site-packages (from pyhmmer) (5.9.5)
Building wheels for collected packages: pyhmmer
Building wheel for pyhmmer (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for pyhmmer (pyproject.toml) did not run successfully.
│ exit code: 1

Is this a me issue or a PyHMMER issue? It seems to download fine until the pyproject.toml build step, which can't build the wheel properly. What is the best way to handle this issue?

Production stable or WIP

Hi Martin,
Thanks a lot for this wonderful Python bindings project (and all the others). We're keen to use this to replace HMMER in Bakta and we're quite happy with the results so far. However, I have one last question regarding your statement in the readme: "This library is still a work-in-progress, and in an experimental stage, ..."

Since this is already published, could you comment on and maybe clarify whether we could/should use this for production purposes in Bakta? We're only using the hmmsearch mode. So if this is stable, we'd be happy to use this.

Again, thanks a lot and best regards,
Oliver

`hmmscan` gets confused about short monotone sequences' alphabet (assumes `Amino`, actually `RNA`)

I suppose this is an extremely unlikely use case in the first place and I can circumvent it in my work, but I decided to write it up anyway:

I have some 3-20 nucleotide long tRNA/mRNA fragments sprinkled in with the longer RNA and protein chains I'm scanning against. Is there a way to enforce pyhmmer.easel.Alphabet.rna() for those? For example, it infers amino for CCA, and I'm not sure where to override that, though I see the FIXME in the wrapper.

Minimum reproducible example:

alphabet = pyhmmer.easel.Alphabet.rna()
sequence = "CCA"
query_seq  = pyhmmer.easel.TextSequence(name=bytes(seq_record.id,'utf-8'), sequence=sequence)
scans = list(pyhmmer.hmmscan([query_seq.digitize(alphabet)],[ <VALID 2000+ RNA HMMS> ] ))

The error I'm getting:

Traceback (most recent call last):
  File "/home/rtviii/dev/riboxyz/scripts/test.py", line 448, in <module>
    hmms.scan(alphabet, [ SeqRecord(cca['entity_poly_seq_one_letter_code_can']) ] )
  File "/home/rtviii/dev/riboxyz/ribctl/lib/classification.py", line 308, in scan
    scans = list(pyhmmer.hmmscan(query_seqs,[*self.class_hmms_registry.values()] ))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtviii/dev/ribxzvenv/lib/python3.11/site-packages/pyhmmer/hmmer.py", line 1481, in hmmscan
    profile.configure(item, _background)
  File "pyhmmer/plan7.pyx", line 7272, in pyhmmer.plan7.Profile.configure
  File "pyhmmer/plan7.pyx", line 7304, in pyhmmer.plan7.Profile.configure
pyhmmer.errors.AlphabetMismatch: Expected RNA alphabet, found amino alphabet

cannot reproduce example for `pyhmmer.hmmer.hmmsearch`

Hello and hope all is well. I'm running into issues reproducing this code (https://pyhmmer.readthedocs.io/en/stable/examples/performance_tips.html#Search-and-scan):

import time
import pyhmmer

t1 = time.time()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
        print(f"- hmmsearch without prefetching took {time.time() - t1:.3} seconds")

# pre-fetching targets - fast, but needs the whole target database in memory
t1 = time.time()
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
    print(f"- hmmsearch with prefetching took {time.time() - t1:.3} seconds")

My code is:

import os
import sys
import psutil
import argparse

import pyhmmer
from pyhmmer.easel import SequenceFile
from pyhmmer.plan7 import HMMFile
from pyhmmer.hmmer import hmmsearch
from pyhmmer.hmmer import hmmscan


def run_pyhmmer(proteins, hmmdb, tblout, domtblout, bitcutoff, cores_n):

        available_memory = psutil.virtual_memory().available
        proteins_size = os.stat(proteins).st_size
        database_size = os.stat(hmmdb).st_size  # size of the HMM database, not of the proteins file
        input_size = proteins_size + database_size

        with HMMFile(hmmdb) as hmm_file:
                hmms = list(hmm_file)
                Z_val=len(hmms)

                with SequenceFile(proteins, digital=True) as seq_file:
                        if input_size < available_memory * 0.1:
                                print("Enough available memory!")
                                print("Pre-fetching targets into memory...")
                                seqs = seq_file.read_block()

                        else:
                                seqs = seq_file

                        print("Performing pyhmmer hmmsearch...")
                        all_hits = hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff)
                        print(type(all_hits)) # This output prints <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>

The documentation states that hmmsearch yields a TopHits object, but I seem to be getting <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>.

I also wanted to ask if I can run hmmsearch(hmms, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff) instead of hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff) if my system has enough memory to store the db.
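
Two general Python behaviours seem relevant here and can be shown without pyhmmer. First, a function that yields results lazily returns an iterator or generator object, and the individual results only appear once it is iterated. Second, a one-shot iterator such as an open file handle is exhausted after `list(...)` has consumed it, so reusing the handle afterwards gives nothing, whereas the list can be reused. In the sketch below, `fake_search` is a hypothetical stand-in for a search call and is not the pyhmmer API.

```python
def fake_search(queries, targets):
    """Hypothetical stand-in: lazily yield one result list per query."""
    for query in queries:
        yield [t for t in targets if query in t]


queries = iter(["ab", "cd"])   # one-shot iterator, like an open file handle
loaded = list(queries)         # reading everything exhausts the handle
leftover = list(queries)       # nothing is left if the handle is reused

results = fake_search(loaded, ["xxab", "cdyy", "abcd"])
kind = type(results).__name__  # the call itself returns a generator

per_query = list(results)      # iterating produces the actual results
total = sum(len(hits) for hits in per_query)
print(kind, leftover, per_query, total)
```

By analogy, iterating over the object returned by hmmsearch (for example with `for hits in all_hits:`) is what should produce the individual results, and passing the already loaded `hmms` list instead of the consumed `hmm_file` looks like the safer choice when memory allows.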

My HMM database is large, with 31,150 HMM profiles (~3 GB), and my protein file is a FAA output from prodigal with 208,993 proteins (76 MB).

pip install pyhmmer fail for M1

HMMER has a branch that can be built from source for an M1 chip, https://github.com/EddyRivasLab/hmmer/tree/h3-arm, but it looks like they are holding off on merging it to main for a while. This seems to be confirmed by the closed pull request EddyRivasLab/hmmer#232, where they merged it by accident.

It looks like the pyhmmer pip install is using the main branch of HMMER, which currently doesn't support M1. If there is a workaround, or if this issue has been fixed, please let me know.
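
For anyone comparing notes on this, the architecture that the build complains about is the one reported by the running Python interpreter. A quick standard-library check is sketched below; the helper name `describe_interpreter` is made up for illustration. On Apple Silicon, a native interpreter reports `arm64`, while an x86_64 interpreter running under Rosetta 2 reports `x86_64`.

```python
import platform
import sys


def describe_interpreter():
    """Collect the platform details a source build would see."""
    return {
        "system": platform.system(),    # e.g. 'Darwin' or 'Linux'
        "machine": platform.machine(),  # e.g. 'arm64' or 'x86_64'
        "python": "%d.%d" % sys.version_info[:2],
    }


info = describe_interpreter()
print(info)
```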

Collecting pyhmmer
  Using cached pyhmmer-0.5.0.tar.gz (9.8 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting psutil~=5.8
  Using cached psutil-5.9.0-cp310-cp310-macosx_11_0_arm64.whl
Building wheels for collected packages: pyhmmer
  Building wheel for pyhmmer (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for pyhmmer (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [794 lines of output]
      pyHMMER is not supported on CPU architecture: 'arm64'
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-11.0-arm64-cpython-310
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/hmmer.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/__init__.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/utils.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      running egg_info
      writing pyhmmer.egg-info/PKG-INFO
      writing dependency_links to pyhmmer.egg-info/dependency_links.txt
      writing requirements to pyhmmer.egg-info/requires.txt
      writing top-level names to pyhmmer.egg-info/top_level.txt
      reading manifest file 'pyhmmer.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no previously-included files matching '*.ai' found under directory 'vendor/hmmer'
      warning: no previously-included files matching '*.pl' found under directory 'vendor'
      no previously-included directories found matching 'vendor/easel/demotic'
      no previously-included directories found matching 'vendor/easel/documentation'
      no previously-included directories found matching 'vendor/easel/esl_msa_testfiles'
      no previously-included directories found matching 'vendor/easel/miniapps'
      no previously-included directories found matching 'vendor/easel/testsuite'
      no previously-included directories found matching 'vendor/hmmer/autobuild'
      no previously-included directories found matching 'vendor/hmmer/documentation'
      no previously-included directories found matching 'vendor/hmmer/contrib'
      no previously-included directories found matching 'vendor/hmmer/profmark'
      no previously-included directories found matching 'vendor/hmmer/testsuite'
      no previously-included directories found matching 'vendor/hmmer/test-speed'
      no previously-included directories found matching 'vendor/hmmer/tutorial'
      adding license file 'COPYING'
      writing manifest file 'pyhmmer.egg-info/SOURCES.txt'
      copying pyhmmer/easel.pxd -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/easel.pyi -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/easel.pyx -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/errors.pyi -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/errors.pyx -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/exceptions.pxi -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/plan7.pxd -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/plan7.pyi -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/plan7.pyx -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      copying pyhmmer/py.typed -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/fileobj
      copying pyhmmer/fileobj/bsd.pxi -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/fileobj
      copying pyhmmer/fileobj/linux.pxi -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/fileobj
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      copying pyhmmer/reexports/__init__.pxd -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      copying pyhmmer/reexports/esl_sqio_ascii.h -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      copying pyhmmer/reexports/esl_sqio_ascii.pxd -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      copying pyhmmer/reexports/p7_hmmfile.h -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      copying pyhmmer/reexports/p7_hmmfile.pxd -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      copying pyhmmer/reexports/p7_tophits.h -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      copying pyhmmer/reexports/p7_tophits.pxd -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/reexports
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests
      copying pyhmmer/tests/__init__.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests
      copying pyhmmer/tests/requirements.txt -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests
      copying pyhmmer/tests/test_doctest.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests
      copying pyhmmer/tests/test_errors.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests
      copying pyhmmer/tests/test_hmmer.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests
      copying pyhmmer/tests/utils.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data
      copying pyhmmer/tests/data/README.md -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms
      copying pyhmmer/tests/data/hmms/make.sh -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/bin
      copying pyhmmer/tests/data/hmms/bin/PF02826.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/bin
      copying pyhmmer/tests/data/hmms/bin/PKSI-AT.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/bin
      copying pyhmmer/tests/data/hmms/bin/Thioesterase.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/bin
      copying pyhmmer/tests/data/hmms/bin/t2pks.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/bin
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PF02826.hmm.h3f -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PF02826.hmm.h3i -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PF02826.hmm.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PF02826.hmm.h3p -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PKSI-AT.hmm.h3f -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PKSI-AT.hmm.h3i -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PKSI-AT.hmm.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/PKSI-AT.hmm.h3p -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/Thioesterase.hmm.h3f -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/Thioesterase.hmm.h3i -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/Thioesterase.hmm.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/Thioesterase.hmm.h3p -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/t2pks.hmm.h3f -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/t2pks.hmm.h3i -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/t2pks.hmm.h3m -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      copying pyhmmer/tests/data/hmms/db/t2pks.hmm.h3p -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/db
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      copying pyhmmer/tests/data/hmms/txt/LuxC.hmm -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      copying pyhmmer/tests/data/hmms/txt/PF02826.hmm -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      copying pyhmmer/tests/data/hmms/txt/PKSI-AT.hmm -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      copying pyhmmer/tests/data/hmms/txt/Thioesterase.hmm -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      copying pyhmmer/tests/data/hmms/txt/bmyD.hmm -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      copying pyhmmer/tests/data/hmms/txt/laccase.hmm -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      copying pyhmmer/tests/data/hmms/txt/t2pks.hmm -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt2
      copying pyhmmer/tests/data/hmms/txt2/PF02826.hmm2 -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt2
      copying pyhmmer/tests/data/hmms/txt2/PKSI-AT.hmm2 -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt2
      copying pyhmmer/tests/data/hmms/txt2/Thioesterase.hmm2 -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt2
      copying pyhmmer/tests/data/hmms/txt2/t2pks.hmm2 -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/hmms/txt2
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/msa
      copying pyhmmer/tests/data/msa/LuxC.faa -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/msa
      copying pyhmmer/tests/data/msa/LuxC.hmmalign.sto -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/msa
      copying pyhmmer/tests/data/msa/LuxC.sto -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/msa
      copying pyhmmer/tests/data/msa/laccase.clw -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/msa
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      copying pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      copying pyhmmer/tests/data/seqs/BGC0001090.gbk -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      copying pyhmmer/tests/data/seqs/CP000560.2.fna -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      copying pyhmmer/tests/data/seqs/CP040672.1.genes_100.fna -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      copying pyhmmer/tests/data/seqs/LuxC.faa -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      copying pyhmmer/tests/data/seqs/PKSI.faa -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      copying pyhmmer/tests/data/seqs/bmyD.fna -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/seqs
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/A0A089QRB9.domtbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/PF02826.domtbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/PF02826.tbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/bmyD1.tbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/bmyD2.tbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/bmyD3.tbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/t2pks.domtbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      copying pyhmmer/tests/data/tables/t2pks.tbl -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/data/tables
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/__init__.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_alphabet.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_bitfield.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_keyhash.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_matrix.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_msa.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_msafile.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_randomness.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_sequence.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_sequencefile.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_ssi.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      copying pyhmmer/tests/test_easel/test_vector.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_easel
      creating build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/__init__.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_background.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_builder.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_hmm.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_hmmfile.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_optimizedprofile.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_pipeline.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_profile.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_tophits.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_tracealigner.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      copying pyhmmer/tests/test_plan7/test_traces.py -> build/lib.macosx-11.0-arm64-cpython-310/pyhmmer/tests/test_plan7
      running build_clib
      creating build/temp.macosx-11.0-arm64-cpython-310
      generating "divsufsort.h" for divsufsort library
      copying vendor/hmmer/libdivsufsort/divsufsort.h.in -> build/temp.macosx-11.0-arm64-cpython-310/divsufsort.h
      generating "esl_config.h" for easel library
      checking whether <stdio.h> can be included... yes
      checking whether <stdlib.h> can be included... yes
      checking whether <string.h> can be included... yes
      checking whether <inttypes.h> can be included... yes
      checking whether <stdint.h> can be included... yes
      checking whether <strings.h> can be included... yes
      checking whether <sys/stat.h> can be included... yes
      checking whether <sys/types.h> can be included... yes
      checking whether <unistd.h> can be included... yes
      checking whether <endian.h> can be included... no
      checking whether <netinet/in.h> can be included... yes
      checking whether <sys/param.h> can be included... yes
      checking whether <sys/sysctl.h> can be included... yes
      checking whether function 'aligned_alloc' is available... no
      checking whether function 'erfc' is available... no
      checking whether function 'getpid' is available... no
      checking whether function '_mm_malloc' is available... no
      checking whether function 'popen' is available... no
      checking whether function 'posix_memalign' is available... no
      checking whether function 'strcasecmp' is available... no
      checking whether function 'strsep' is available... no
      checking whether function 'sysconf' is available... no
      checking whether function 'sysctl' is available... no
      checking whether function 'times' is available... no
      checking whether function 'fseeko' is available... no
      generating "p7_config.h" for hmmer library
      checking whether <endian.h> can be included... no
      checking whether <inttypes.h> can be included... yes
      checking whether <stdint.h> can be included... yes
      checking whether <unistd.h> can be included... yes
      checking whether <sys/types.h> can be included... yes
      checking whether <netinet/in.h> can be included... yes
      checking whether <sys/param.h> can be included... yes
      checking whether <sys/sysctl.h> can be included... yes
      checking whether function 'mkstemp' is available... no
      checking whether function 'popen' is available... no
      checking whether function 'putenv' is available... no
      checking whether function 'strcasecmp' is available... no
      checking whether function 'strsep' is available... no
      checking whether function 'times' is available... no
      checking whether function 'getpid' is available... no
      checking whether function 'sysctl' is available... no
      checking whether function 'sysconf' is available... no
      checking whether function 'getcwd' is available... no
      checking whether function 'chmod' is available... no
      checking whether function 'stat' is available... no
      checking whether function 'fstat' is available... no
      checking whether function 'erfc' is available... no
      generating build/temp.macosx-11.0-arm64-cpython-310/esl_sqio_ascii.c from vendor/easel/esl_sqio_ascii.c
      generating build/temp.macosx-11.0-arm64-cpython-310/p7_hmmfile.c from vendor/hmmer/src/p7_hmmfile.c
      generating build/temp.macosx-11.0-arm64-cpython-310/libdivsufsort.a from vendor/hmmer/libdivsufsort/divsufsort.c
      creating build/temp.macosx-11.0-arm64-cpython-310/vendor
      creating build/temp.macosx-11.0-arm64-cpython-310/vendor/hmmer
      creating build/temp.macosx-11.0-arm64-cpython-310/vendor/hmmer/libdivsufsort
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -Ibuild/temp.macosx-11.0-arm64-cpython-310 -c vendor/hmmer/libdivsufsort/divsufsort.c -o build/temp.macosx-11.0-arm64-cpython-310/vendor/hmmer/libdivsufsort/divsufsort.o
      ar rcs build/temp.macosx-11.0-arm64-cpython-310/libdivsufsort.a build/temp.macosx-11.0-arm64-cpython-310/vendor/hmmer/libdivsufsort/divsufsort.o
      ranlib build/temp.macosx-11.0-arm64-cpython-310/libdivsufsort.a
      generating build/temp.macosx-11.0-arm64-cpython-310/libeasel.a from vendor/easel/esl_exponential.c, vendor/easel/esl_msashuffle.c, vendor/easel/esl_msafile_stockholm.c, vendor/easel/esl_msafile_a2m.c, vendor/easel/esl_ssi.c, vendor/easel/esl_subcmd.c, vendor/easel/esl_sq.c, vendor/easel/esl_matrixops.c, vendor/easel/esl_getopts.c, vendor/easel/esl_randomseq.c, vendor/easel/esl_hmm.c, vendor/easel/esl_stopwatch.c, vendor/easel/esl_neon.c, vendor/easel/esl_mpi.c, vendor/easel/esl_gencode.c, vendor/easel/esl_stats.c, vendor/easel/esl_avx512.c, vendor/easel/esl_alloc.c, vendor/easel/easel.c, vendor/easel/esl_distance.c, vendor/easel/esl_histogram.c, vendor/easel/esl_gev.c, vendor/easel/esl_paml.c, vendor/easel/esl_msafile_phylip.c, vendor/easel/esl_msa.c, vendor/easel/esl_tree.c, vendor/easel/esl_arr2.c, vendor/easel/esl_vectorops.c, vendor/easel/esl_ratematrix.c, vendor/easel/esl_msaweight.c, vendor/easel/esl_sqio_ncbi.c, vendor/easel/esl_mixgev.c, vendor/easel/esl_msafile_psiblast.c, vendor/easel/esl_minimizer.c, vendor/easel/esl_dsqdata.c, vendor/easel/esl_varint.c, vendor/easel/esl_swat.c, vendor/easel/esl_gamma.c, vendor/easel/esl_msafile_clustal.c, vendor/easel/esl_huffman.c, vendor/easel/esl_stack.c, vendor/easel/esl_msafile_afa.c, vendor/easel/esl_avx.c, vendor/easel/esl_dirichlet.c, vendor/easel/esl_regexp.c, vendor/easel/esl_stretchexp.c, vendor/easel/esl_mem.c, vendor/easel/esl_keyhash.c, vendor/easel/esl_arr3.c, vendor/easel/esl_vmx.c, vendor/easel/esl_recorder.c, vendor/easel/esl_heap.c, vendor/easel/esl_cluster.c, vendor/easel/esl_sse.c, vendor/easel/esl_random.c, vendor/easel/esl_alphabet.c, vendor/easel/esl_graph.c, vendor/easel/esl_weibull.c, vendor/easel/esl_composition.c, vendor/easel/esl_red_black.c, vendor/easel/esl_scorematrix.c, vendor/easel/interface_gsl.c, vendor/easel/esl_rand64.c, vendor/easel/esl_quicksort.c, vendor/easel/esl_sqio.c, vendor/easel/esl_fileparser.c, vendor/easel/esl_json.c, vendor/easel/esl_rootfinder.c, 
vendor/easel/esl_hyperexp.c, vendor/easel/esl_workqueue.c, vendor/easel/esl_msafile2.c, vendor/easel/esl_buffer.c, vendor/easel/esl_mixdchlet.c, vendor/easel/esl_dmatrix.c, vendor/easel/esl_gumbel.c, vendor/easel/esl_msafile.c, vendor/easel/esl_cpu.c, vendor/easel/interface_lapack.c, vendor/easel/esl_normal.c, vendor/easel/esl_wuss.c, vendor/easel/esl_bitfield.c, vendor/easel/esl_threads.c, vendor/easel/esl_msacluster.c, vendor/easel/esl_msafile_selex.c, build/temp.macosx-11.0-arm64-cpython-310/esl_sqio_ascii.c
      creating build/temp.macosx-11.0-arm64-cpython-310/vendor/easel
      creating build/temp.macosx-11.0-arm64-cpython-310/build
      creating build/temp.macosx-11.0-arm64-cpython-310/build/temp.macosx-11.0-arm64-cpython-310
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -Ivendor/easel -Ibuild/temp.macosx-11.0-arm64-cpython-310 -c vendor/easel/esl_exponential.c -o build/temp.macosx-11.0-arm64-cpython-310/vendor/easel/esl_exponential.o
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -Ivendor/easel -Ibuild/temp.macosx-11.0-arm64-cpython-310 -c vendor/easel/esl_msashuffle.c -o build/temp.macosx-11.0-arm64-cpython-310/vendor/easel/esl_msashuffle.o
      vendor/easel/esl_msashuffle.c:449:17: warning: comparison of integers of different signs: 'unsigned long' and 'int' [-Wsign-compare]
        if (strlen(y) != L) ESL_XEXCEPTION(eslEINVAL, "sequences of different lengths in qrna shuffle");
            ~~~~~~~~~ ^  ~
      1 warning generated.
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -Ivendor/easel -Ibuild/temp.macosx-11.0-arm64-cpython-310 -c vendor/easel/esl_msafile_stockholm.c -o build/temp.macosx-11.0-arm64-cpython-310/vendor/easel/esl_msafile_stockholm.o
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -Ivendor/easel -Ibuild/temp.macosx-11.0-arm64-cpython-310 -c vendor/easel/esl_msafile_a2m.c -o build/temp.macosx-11.0-arm64-cpython-310/vendor/easel/esl_msafile_a2m.o
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -fPIC -O2 -isystem /Users/tmsincomb/miniforge3/envs/sadie/include -arch arm64 -Ivendor/easel -Ibuild/temp.macosx-11.0-arm64-cpython-310 -c vendor/easel/esl_ssi.c -o build/temp.macosx-11.0-arm64-cpython-310/vendor/easel/esl_ssi.o
      vendor/easel/esl_ssi.c:261:12: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'uint64_t' (aka 'unsigned long long') [-Wsign-compare]
        if (nkey >= ssi->nprimary) { status = eslENOTFOUND; goto ERROR; }
            ~~~~ ^  ~~~~~~~~~~~~~
      vendor/easel/esl_ssi.c:728:13: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
        if ((n+1) > ns->flen) ns->flen = n+1;
             ~~~  ^ ~~~~~~~~
      vendor/easel/esl_ssi.c:862:9: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
        if (n > ns->plen) ns->plen = n;
            ~ ^ ~~~~~~~~
      ... 

      fatal error: too many errors emitted, stopping now [-ferror-limit=]
      20 errors generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyhmmer
Failed to build pyhmmer
ERROR: Could not build wheels for pyhmmer, which is required to install pyproject.toml-based projects

Work with multiprocessing

Hi,
I would like to use multiple CPUs, but I don't understand how to give pyhmmer more than one CPU.
So I tried the multiprocessing package, but pyhmmer objects cannot be pickled because they have a non-trivial __cinit__.
Example:
multiprocessing.pool.MaybeEncodingError: Error sending result: '<pyhmmer.plan7.TopHits object at 0x561959114ad0>'. Reason: 'TypeError('no default __reduce__ due to non-trivial __cinit__')'

Could you give me an example of using pyhmmer with more than one CPU, if that is possible?
Thanks
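
The error above comes from sending an object that cannot be pickled back from a worker process. One possible workaround, sketched below, is to copy the fields of interest into plain Python tuples inside the worker and return those instead. `FakeHits` is a hypothetical stand-in for the unpicklable result object, and `summarize` and `worker` are illustrative names, not part of pyhmmer.

```python
import pickle


class FakeHits:
    """Hypothetical stand-in for an extension object that refuses to be pickled."""

    def __init__(self, rows):
        self.rows = rows

    def __reduce__(self):
        # Mimic the failure reported in the traceback above.
        raise TypeError("no default __reduce__ due to non-trivial __cinit__")


def summarize(hits):
    """Copy the fields of interest into plain tuples, which pickle fine."""
    return [(name, score) for name, score in hits.rows]


def worker(rows):
    """What a pool worker could return instead of the raw result object."""
    hits = FakeHits(rows)  # in real code this would come from the search
    return summarize(hits)


def is_picklable(obj):
    """Return True if pickle can serialize the object."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False
```

A function like `worker` can be handed to `multiprocessing.Pool.map`, because only lists of tuples cross the process boundary.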

Scoring a set of sequences with HMM

Thanks for this great library! I only recently started using HMMs, so my question is very basic. I have built an HMM from an alignment of a specific protein family, and now I want to use the HMM to score a list of other protein sequences (all of them).

I tried using the TraceAligner with something like this because hmmalign() would not give the desired metrics (?):

    seq_block = pyhmmer.easel.DigitalSequenceBlock(alphabet, seq)
    aligner = pyhmmer.plan7.TraceAligner(posteriors=True)
    traces = aligner.compute_traces(hmm=hmm, sequences=seq_block)
    msa = aligner.align_traces(hmm=hmm, sequences=seq_block, traces=traces)
    scores = [trace.expected_accuracy() for trace in traces]

Is there an alternative that you recommend to get various alignment scores for my query sequences? I need all of them scored and typically I am interested in alignment scores, e-values, and other easily available metrics.

Great appreciation!

Peace
Paos

Possibility to filter alignment results on coverage between HMM and target

Hi !

I was wondering whether it is possible to report the alignment coverage of the HMM and of the target. I'm using PADLOC-DB as my HMM database, and I noticed that its metadata file includes hmm.coverage.threshold and target.coverage.threshold to filter results.

So I looked for these values on the Hit object, but they are not there, and I can't find them in the documentation. Did I miss something?

Thanks
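
Coverage can be derived from alignment coordinates even when it is not reported directly. The sketch below assumes 1-based, inclusive start and end positions on both the HMM and the target, plus their total lengths. All function and parameter names are illustrative; which attributes carry these coordinates in a given pyhmmer version needs to be checked against its documentation.

```python
def coverage(start, end, length):
    """Fraction of `length` covered by the 1-based, inclusive span [start, end]."""
    if length <= 0:
        raise ValueError("length must be positive")
    if start < 1 or end < start or end > length:
        raise ValueError("span must lie within 1..length")
    return (end - start + 1) / length


def passes_thresholds(hmm_span, hmm_length, target_span, target_length,
                      hmm_min, target_min):
    """Keep a hit only if both coverages reach their thresholds."""
    hmm_cov = coverage(hmm_span[0], hmm_span[1], hmm_length)
    target_cov = coverage(target_span[0], target_span[1], target_length)
    return hmm_cov >= hmm_min and target_cov >= target_min
```

For example, an alignment spanning HMM positions 11 to 90 of a 100-node model gives an HMM coverage of 0.8.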

search_hmm results in segmentation fault

Hi,

I have created an HMM file and stored it on disk. I read it back later to score a specific sequence using search_hmm, but that results in a segmentation fault. See the code snippet below:

    with pyhmmer.plan7.HMMFile(targetfile) as hmmfilehandler:
        hmm = next(hmmfilehandler)

        seq = pyhmmer.easel.DigitalSequence(alphabet=hmm.alphabet, name=str.encode("seq1"), sequence=str.encode(sequence))

        pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
        hits = pipeline.search_hmm(query=hmm, sequences=[seq])  # this line results in a segmentation fault

        for hit in hits:
            print("Score: ", hit.score, tag)

Any help debugging this issue is appreciated. Thank you.

[QUESTION] How to run hmmsearch and save results using pyhmmer ?

Dear @althonos,
could you please show me how to write pyhmmer code that accomplishes the task below?

hmmsearch \
      --cpu {threads} \
      -E {params.hmmsearch_evalue} \
      -o {params.tmp} \
      --tblout {output.hmm} \
      {input.db} \
      {input.pep} \
      >> {log} 2>&1

I have tried the code below:

import pyhmmer
import sys

alphabet = pyhmmer.easel.Alphabet.amino()

with pyhmmer.easel.SequenceFile(sys.argv[1], digital=True, alphabet=alphabet) as seq_file:
    sequences = list(seq_file)

with pyhmmer.plan7.HMMFile(sys.argv[2]) as hmm_file:
    all_hits = list(pyhmmer.hmmsearch(hmm_file, sequences, cpus=int(sys.argv[3])))  # command-line arguments are strings

But I don't know how to set the -E parameter or how to save the results from all_hits to {output.hmm}.

Thanks!

Load more than 1024 HMM files

Hi - I have a related question about loading multiple HMM files mentioned in #24.

I'm working on a tool that uses the pyhmmer hmmsearch module to find orthologs among several sequences against the BUSCO dataset. It worked great when searching against the fungi_odb10 dataset (758 markers). But when I tested with the mammalia_odb10 (9,226 markers) and vertebrata_odb10 (3,354 markers) marker sets, a file-not-found error occurred even though the file does exist.

I ran a number of tests and found that the file-not-found error always occurred on the 1020th marker. I later realized it might be related to a system constraint: many systems limit a user to 1024 open files at a time (according to ulimit -a).

Although this might not be directly related to your package, do you have any suggestions for opening more than 1024 files through the context manager? Or is it possible to keep the HMM information after closing the file?
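
A general way around the open-file limit is to open each file in its own `with` block, keep only the loaded content, and let the file close before the next one is opened. The sketch below shows that pattern with plain bytes. It assumes the parsed object stays usable after its file is closed, which would need to be confirmed for HMM objects. `load_all` and `open_file_limit` are illustrative names.

```python
import resource


def open_file_limit():
    """Return the soft limit on open file descriptors (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def load_all(paths):
    """Read every file, holding at most one descriptor open at a time."""
    contents = {}
    for path in paths:
        with open(path, "rb") as handle:
            # Keep the data, not the handle: the file is closed
            # as soon as this block ends.
            contents[path] = handle.read()
    return contents
```

Because only one descriptor is open at any moment, the number of files that can be loaded this way is bounded by memory rather than by the `ulimit -n` setting.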

nhmmer reports Alphabet mismatch

I'm trying to run nhmmer function but can't seem to get the alphabet right. Running

   import pyhmmer    
   seq1 = pyhmmer.easel.TextSequence(name=b"seq1", sequence="ACCGACA")
   seq2 = pyhmmer.easel.TextSequence(name=b"seq2", sequence="GGGCCAACA")
   rna = pyhmmer.easel.Alphabet.rna()
   dig1, dig2 = [s.digitize(rna) for s in [seq1, seq2]]
   builder = pyhmmer.plan7.Builder(rna, prior_scheme="alphabet")
   
   gen = pyhmmer.hmmer.nhmmer([dig1], [dig2], builder=builder)
   next(gen)

results in

    Traceback (most recent call last):
      File "pyhmmer_test.py", line 10, in <module>
        next(gen)
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 310, in _multi_threaded
        raise thread.error from None
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 112, in run
        self.process(index, query)
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 125, in process
        hits = self.search(query)
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 166, in search
        return self.pipeline.search_seq(query, self.sequences, self.builder)
      File "pyhmmer/plan7.pyx", line 3777, in pyhmmer.plan7.Pipeline.search_seq
      File "pyhmmer/plan7.pyx", line 3819, in pyhmmer.plan7.Pipeline.search_seq
    pyhmmer.errors.AlphabetMismatch: Expected Alphabet.amino(), found Alphabet.rna()

Am I using it correctly? Does the alphabet need to be set somewhere else as well?

As far as I could trace it, the following line might be the cause

self.pipeline = Pipeline(alphabet=Alphabet.amino(), **options)

Query type is mis-specified in `iterate_hmm`

with easel.SequenceFile('tests/data/proteins.fasta', digital=True) as sf:
    proteins = sf.read_block()

hmm = next(plan7.HMMFile('tests/data/hmms/pfam/Pfam-A.hmm'))

abc = easel.Alphabet.amino()
pli = plan7.Pipeline(abc, incE=1e-3, incdomE=1e-3)
iterator = pli.iterate_hmm(hmm, proteins)
TypeError: Argument 'query' has incorrect type (expected pyhmmer.easel.DigitalSequence, got pyhmmer.plan7.HMM)

Fixing the types in plan7 isn't quite enough though, since if you want more than one iteration:

max_iterations = 1
for n in range(max_iterations):
    iteration = next(iterator)
    if iteration.converged:
        break
  File "pyhmmer/plan7.pyx", line 3788, in pyhmmer.plan7.IterativeSearch.__next__
TypeError: Argument 'sequence' has incorrect type (expected pyhmmer.easel.Sequence, got pyhmmer.plan7.HMM)

which is this line:

            extra_traces = [Trace.from_sequence(self.query)]

Which makes sense, since the query should be an HMM type.

I have a fix for both issues in a branch here, I'll open a PR.

Conflict with numba

I wrote a package to run the Viterbi hmm-profile alignment algorithm on hmmer3 profiles. The package uses pyhmmer to parse the hmm files and get numpy arrays from the values. https://github.com/seanrjohnson/hmmer_compare/tree/numba

In pure python, the algorithm is very slow, so I experimented with speeding it up using numba, which leads to huge improvements in performance (50-100x in the few examples I tried).

However, it also gives me a strange warning from Numpy.

/home/sean/miniconda3/envs/hmmer_compare/lib/python3.10/site-packages/numpy/core/getlimits.py:500: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/sean/miniconda3/envs/hmmer_compare/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/sean/miniconda3/envs/hmmer_compare/lib/python3.10/site-packages/numpy/core/getlimits.py:500: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/sean/miniconda3/envs/hmmer_compare/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)

Using the snippet below, there are three ways I can get rid of that warning:

  • Explicitly suppress the warning using warnings.filterwarnings
  • Don't import pyhmmer
  • Don't use the numba @jit decorator (and therefore don't use Numba).

import warnings
#warnings.filterwarnings("ignore", category=UserWarning, module='numpy') ## Uncommenting this will suppress the warning
from numba import jit
import numba as nb
import numpy as np
import pyhmmer   ## commenting this will suppress the warning
from typing import Tuple


@jit(nb.types.Tuple((nb.float32,nb.uint64))(nb.float32, nb.float32, nb.uint64, nb.uint64),cache=True) ## commenting this will suppress the warning
def max2(sMM:float, sXY:float, layer1:int, layer2:int) -> Tuple[float, int]:
    if sMM > sXY:
        score = sMM
        bt = layer1
    else:
        score = sXY
        bt = layer2
    return score, bt

if __name__ == "__main__":
    print(max2(1,2,3,4))

I seem to get the same output as I did before switching to Numba, so I think for practical purposes, it's not actually a problem, I can just suppress the warning. It's just kind of confusing and annoying, so I thought I'd bring it to your attention in case there is an easy fix.
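
NumPy emits this warning when the smallest subnormal of a float type evaluates to zero, which usually means the CPU's flush-to-zero mode has been switched on somewhere in the process. Which import enables it here is not established by the report, so the sketch below only checks whether subnormals are currently being flushed. The function name is illustrative.

```python
import sys


def subnormals_flushed():
    """Return True if halving the smallest normal double yields exactly zero.

    With default floating-point settings the result is a subnormal number,
    so this returns False. Under flush-to-zero the result is 0.0.
    """
    smallest_normal = sys.float_info.min
    return smallest_normal / 2 == 0.0
```

Calling it before and after each import in the snippet above would show which import changes the floating-point mode.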

Reported/Included domain counts are doubled after TopHits merge

To replicate:

thioesterase = HMMFile("tests/data/hmms/db/Thioesterase.hmm").read()

with SequenceFile('tests/data/seqs/938293.PRJEB85.HG003687.faa', digital=True) as seqfile:
    proteins = seqfile.read_block()
pli = Pipeline(thioesterase.alphabet, T=1, domT=1, incT=1, incdomT=1)
hits1 = pli.search_hmm(thioesterase, proteins[:1000],)
hits2 = pli.search_hmm(thioesterase, proteins[1000:2000])
hits3 = pli.search_hmm(thioesterase, proteins[2000:])
merged = hits1.merge(hits2, hits3)

with open("test.tblout", "wb") as f:
    merged.write(f, format="targets")
with open("test.domtblout", "wb") as f:
    merged.write(f, format="domains")

If you inspect the target table you see the following reported for domain number estimation for the hit 938293.PRJEB85.HG003687_113:

exp reg clu  ov env dom rep inc
--- --- --- --- --- --- --- ---
1.2   1   0   0   1   1   2   2

The rep/inc count is double what is expected. The counts are correct when all proteins are searched at once, without merging.

Bioconda Install Error: Unsatisfiable Error

I'm trying to download pyhmmer via bioconda with this line:

conda install -c bioconda pyhmmer

but I keep getting the following error:

UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions

Notably, it fails to list/provide detail for what specifications were incompatible. I've tried doing it in previous python versions in case that impacts it, but I get the same error for every version above 3.5. I asked another friend of mine to attempt the install because my computer is a mac from early 2014 on outdated software (Catalina 10.15.7), but he too ran into the error.

The pip install largely works and I'm trying to work through that for now, but do you know if this error has appeared among other users and if a fix exists?

Suggestion: Add esl-reformat as method to convert MSA to stockholm format

Hi,

I'm using pyhmmer to build HMMs from MSAs, but the tool I use to build the MSAs doesn't produce output in Stockholm format.
I would like to suggest adding the esl-reformat command from HMMER as a method in pyhmmer.easel. It would also really help if a random #=GF ID could be generated (perhaps with options for a prefix and a suffix).

I tried to do it myself so I could open a PR, but I'm not very good with Cython. Sorry.

Thanks
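
Until something like this exists in the library, a small pure-Python converter may be enough for simple cases. The sketch below is not esl-reformat: it only turns an aligned FASTA string into a minimal single-block Stockholm file with a `#=GF ID` line. The function names and the use of `uuid` for random identifiers are choices made for this example.

```python
import uuid


def random_id(prefix="", suffix=""):
    """Build a random identifier with optional prefix and suffix."""
    return f"{prefix}{uuid.uuid4().hex[:8]}{suffix}"


def fasta_to_stockholm(fasta_text, identifier):
    """Convert aligned FASTA text into a minimal Stockholm 1.0 string."""
    names, chunks = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            names.append(line[1:].split()[0])
            chunks.append([])
        elif names:
            chunks[-1].append(line)
    rows = ["".join(parts) for parts in chunks]
    if not rows:
        raise ValueError("no sequences found")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    width = max(len(name) for name in names)
    lines = ["# STOCKHOLM 1.0", f"#=GF ID {identifier}"]
    lines += [f"{name:<{width}} {row}" for name, row in zip(names, rows)]
    lines.append("//")
    return "\n".join(lines) + "\n"
```

Calling `fasta_to_stockholm(text, random_id(prefix="fam_"))` would give each alignment a fresh identifier.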

Is it possible to get/apply cutoffs?

From the code it looks like it should be possible to obtain the PFAM gathering/noise/trusted cutoffs from the HMM class
e.g.
https://github.com/althonos/pyhmmer/blob/master/pyhmmer/plan7.pyx#L623
https://github.com/althonos/pyhmmer/blob/master/pyhmmer/plan7.pyx#L2259
https://github.com/althonos/pyhmmer/blob/master/pyhmmer/plan7.pyi#L218

However:

from pyhmmer.plan7 import HMMFile, HMM

with HMMFile("path/to/Pfam-A.hmm") as hmms:
   hmm = next(hmms)

dir(hmm) ## I don't see any attributes that seem to correspond to cutoffs

Is there any way to obtain the cutoffs or, even better, apply the cutoff in a search_hmm pipeline?

Using pyhmmer v0.4.5.

Thanks!

Batch the hmmsearch output

Hi - I'm wondering if it is possible to return the hmmsearch results in batches?

The size of the result can't be calculated from the size of the input or database, because it depends on the number of matches found. As matches keep accumulating, the process can hit the memory limit I set initially and fail with an out-of-memory error.

Is it possible to batch the output results, so that I can extract the necessary information and then save it to disk?
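
When results arrive as an iterator, memory can be bounded by consuming them in fixed-size groups and writing each group to disk before pulling the next. The sketch below shows that pattern on generic `(name, score)` rows. `batched` and `write_in_batches` are illustrative helpers, and the tab-separated row format is an assumption for the example.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_in_batches(rows, handle, size=1000):
    """Write (name, score) rows to `handle` one batch at a time."""
    written = 0
    for batch in batched(rows, size):
        handle.writelines(f"{name}\t{score}\n" for name, score in batch)
        written += len(batch)  # the batch can be garbage-collected after this
    return written
```

This only helps if the results are never collected into a list first. Wrapping the result iterator in `list(...)` would bring the memory problem back.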

ImportError: Cannot install pyhmmer in ubuntu

Hi, I was trying to install pyhmmer on Ubuntu. Because I have both Python 2 and Python 3 installed, I used the command below to install it. The package downloads, but installation fails with an import error: "ImportError: cannot import name 'errors'"

Could you tell me which additional package I need to install before installing pyhmmer?

Thanks for your help!

python3 -m pip install pyhmmer

Collecting pyhmmer
  Using cached https://files.pythonhosted.org/packages/65/f3/1de70443d58ccf843aaaa01393d640452e26b60912c0998302a0b8fea021/pyhmmer-0.4.11.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-1585x62q/pyhmmer/setup.py", line 636, in <module>
        sdist=sdist,
      File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 129, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib/python3.6/distutils/core.py", line 121, in setup
        dist.parse_config_files()
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 494, in parse_config_files
        ignore_option_errors=ignore_option_errors)
      File "/usr/lib/python3/dist-packages/setuptools/config.py", line 106, in parse_configuration
        meta.parse()
      File "/usr/lib/python3/dist-packages/setuptools/config.py", line 382, in parse
        section_parser_method(section_options)
      File "/usr/lib/python3/dist-packages/setuptools/config.py", line 355, in parse_section
        self[name] = value
      File "/usr/lib/python3/dist-packages/setuptools/config.py", line 173, in __setitem__
        value = parser(value)
      File "/usr/lib/python3/dist-packages/setuptools/config.py", line 430, in _parse_version
        version = self._parse_attr(value)
      File "/usr/lib/python3/dist-packages/setuptools/config.py", line 305, in _parse_attr
        module = import_module(module_name)
      File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 994, in _gcd_import
      File "<frozen importlib._bootstrap>", line 971, in _find_and_load
      File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 678, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/tmp/pip-build-1585x62q/pyhmmer/pyhmmer/__init__.py", line 19, in <module>
        from . import errors
    ImportError: cannot import name 'errors'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-1585x62q/pyhmmer/

pyhmmer.hmmer.hmmsearch is always slower than using hmmsearch directly

Hello, Martin!
Thank you for your work. I'm trying to use pyhmmer instead of the hmmsearch binary on Linux. However, I find that pyhmmer always runs slower than calling hmmsearch directly. My test is based on 13 gene HMMs and several FASTA files of already identified genes.

Here is my script:

    hmm_file= "mito_CDS.hmm.h3m"
    seq_file = "ATP8.fasta"  # and others
    seq_num = sum(1 for line in open(seq_file).readlines() if line.startswith(">"))

    t1 = time.time()  
    with pyhmmer.easel.SequenceFile(seq_file, digital=True) as f:
        seqs = f.read_block()
    hmms = pyhmmer.plan7.HMMFile(hmm_file)
    for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs, cpus=48,callback=lambda x,y:print(x,y)):
        pass
    t2 = time.time()
    print(f"pyhmmer-hmmsearch search {seq_file}({seq_num}) using: {t2-t1}")
    
    t1 = time.time()
    cmd = f"hmmsearch {hmm_file} {seq_file}"
    subprocess.run(cmd, shell=True, capture_output=True)
    t2 = time.time()
    print(f"hmmsearch search {seq_file} using({seq_num}) : {t2-t1}")

And here is the result:

; about hmms
<HMM alphabet=Alphabet.dna() M=285 name=b'ND4L'> 13
<HMM alphabet=Alphabet.dna() M=786 name=b'COX3'> 13
<HMM alphabet=Alphabet.dna() M=672 name=b'ATP6'> 13
<HMM alphabet=Alphabet.dna() M=1536 name=b'COX1'> 13
<HMM alphabet=Alphabet.dna() M=1137 name=b'CYTB'> 13
<HMM alphabet=Alphabet.dna() M=948 name=b'ND1'> 13
<HMM alphabet=Alphabet.dna() M=351 name=b'ND3'> 13
<HMM alphabet=Alphabet.dna() M=153 name=b'ATP8'> 13
<HMM alphabet=Alphabet.dna() M=1335 name=b'ND4'> 13
<HMM alphabet=Alphabet.dna() M=516 name=b'ND6'> 13
<HMM alphabet=Alphabet.dna() M=1017 name=b'ND2'> 13
<HMM alphabet=Alphabet.dna() M=1713 name=b'ND5'> 13
<HMM alphabet=Alphabet.dna() M=681 name=b'COX2'> 13
; ATP8
pyhmmer-hmmsearch search ATP8.fasta(2570) using: 4.567137002944946
hmmsearch search ATP8.fasta using(2570) : 3.0275957584381104
; COX2
pyhmmer-hmmsearch search COX2.fasta(8291) using: 538.6848599910736
hmmsearch search COX2.fasta using(8291) : 49.312217235565186
; ND3
pyhmmer-hmmsearch search ND3.fasta(2817) using: 38.611982345581055
hmmsearch search ND3.fasta using(2817) : 19.30312705039978
; ATP6
pyhmmer-hmmsearch search ATP6.fasta(2788) using: 185.83723640441895
hmmsearch search ATP6.fasta using(2788) : 28.809940576553345

I want to know what happened. Is it because of the method I am using? Thank you very much for your help.
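
To put the numbers above side by side, the short script below computes how many times longer the pyhmmer run took than the hmmsearch binary for each file. The timings are copied from the report. Note that the pyhmmer timing in the script also includes reading the sequence file and printing from the callback, so the ratios are only a rough comparison.

```python
# Wall-clock timings in seconds, copied from the report above:
# (pyhmmer run, hmmsearch binary run)
timings = {
    "ATP8": (4.567137002944946, 3.0275957584381104),
    "COX2": (538.6848599910736, 49.312217235565186),
    "ND3": (38.611982345581055, 19.30312705039978),
    "ATP6": (185.83723640441895, 28.809940576553345),
}


def slowdown(pyhmmer_seconds, cli_seconds):
    """Ratio of pyhmmer time to command-line time (above 1 means slower)."""
    return pyhmmer_seconds / cli_seconds


ratios = {name: round(slowdown(*pair), 2) for name, pair in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: pyhmmer took {ratio}x as long as the hmmsearch binary")
```

The gap ranges from roughly 1.5x for ATP8 to roughly 11x for COX2, so the slowdown is not a constant overhead.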

PyHMMER 0.8.0 TMD should be 0 for last node

Hi,

We use your excellent software in ModelAngelo! Ever since the upgrade to pyHMMER v0.8.0, we are getting the following error during HMMAlign:

ValueError: Invalid HMM: TMD should be 0 for last node

I have attached an example of the HMM files we write. Could you please advise how we should proceed so that it works with the new version?
0.hmm.txt
