iqtlabs / hypothesis-bio

Hypothesis extension for computational biology

Home Page: https://lab41.github.io/hypothesis-bio

License: Apache License 2.0

Python 100.00%
hypothesis property-based-testing computational-biology bioinformatics fuzzing

hypothesis-bio's Introduction

Hypothesis-Bio

Hypothesis-Bio is a Hypothesis extension for property-based testing of bioinformatic software.

It automates testing of bioinformatics tools by generating a wide range of test cases, including ones a human tester would be unlikely to write. When a test fails, it reports the minimal test case that triggers the failure.

Features

This module provides Hypothesis strategies for generating data in common biological formats. You can use them to test your code efficiently and thoroughly.

Currently supports DNA, RNA, protein, CDS, k-mers, FASTA, & FASTQ formats.

Quick Start

Basic Example

So what exactly does Hypothesis-Bio do? Let's look at some example code that calculates GC-content:

def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

(Can you spot the bug in the code?)

Now let's use Hypothesis-Bio to find the bug. To do so, we specify a property about our code that we expect to hold true over all examples. In this case, GC-content is a fraction, so we know it should always be between 0 and 1. We can encode that requirement into a test:

from hypothesis import given
from hypothesis_bio import dna


@given(dna())
def test_gc_content(seq):
    assert 0 <= gc_content(seq) <= 1

When we run the test (by calling test_gc_content), we get the following output:

Falsifying example: test_gc_content(seq='')

ZeroDivisionError: division by zero

Aha! When given an empty sequence, our simple gc_content calculator raises an error. This simple example shows the power of property-based testing. Instead of hard-coding example inputs and outputs, we can let Hypothesis-Bio do the hard work for us.
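
One way to fix the function is sketched below. It rests on two assumptions that the example above does not settle: that an empty sequence should have a GC-content of 0.0 (raising a ValueError would be an equally defensible choice), and that lowercase bases should be counted.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in seq, between 0 and 1."""
    # Assumption: an empty sequence is defined to have a GC-content of 0.0.
    if not seq:
        return 0.0
    # Assumption: lowercase bases count too, so normalize the case first.
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```

With this version, the property 0 <= gc_content(seq) <= 1 also holds for the empty sequence that Hypothesis-Bio reported.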

Another Example

We saw that Hypothesis-Bio can catch simple bugs like a division by zero error, but it can do so much more than that. Let's consider another function that translates from DNA to protein:

genetic_code = {
    "ATA": "I", "ATC": "I", "ATT": "I", "ATG": "M", "ACA": "T", "ACC": "T", "ACG": "T", "ACT": "T",
    "AAC": "N", "AAT": "N", "AAA": "K", "AAG": "K", "AGC": "S", "AGT": "S", "AGA": "R", "AGG": "R",
    "CTA": "L", "CTC": "L", "CTG": "L", "CTT": "L", "CCA": "P", "CCC": "P", "CCG": "P", "CCT": "P",
    "CAC": "H", "CAT": "H", "CAA": "Q", "CAG": "Q", "CGA": "R", "CGC": "R", "CGG": "R", "CGT": "R",
    "GTA": "V", "GTC": "V", "GTG": "V", "GTT": "V", "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
    "GAC": "D", "GAT": "D", "GAA": "E", "GAG": "E", "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
    "TCA": "S", "TCC": "S", "TCG": "S", "TCT": "S", "TTC": "F", "TTT": "F", "TTA": "L", "TTG": "L",
    "TAC": "Y", "TAT": "Y", "TAA": "*", "TAG": "*", "TGC": "C", "TGT": "C", "TGA": "*", "TGG": "W",
}


def translate(dna):
    protein = ""
    for codon_start_index in range(0, len(dna), 3):
        codon = dna[codon_start_index : codon_start_index + 3]
        protein += genetic_code[codon]
    return protein

This looks pretty good, right? (Hint: nope! Can you find all the bugs?) For our testing code, we can rely on the property that the protein translated from a DNA sequence is always a third the length of that DNA sequence (since three DNA bases code for each amino acid in the protein):

from hypothesis import given
from hypothesis_bio import dna


@given(dna())
def test_translate(seq):
    assert len(translate(seq)) == len(seq) / 3

When we run it, we get the following error:

Falsifying example: test_translate(seq='A')

KeyError: 'A'

It turns out that our translation function never actually checked to ensure that the DNA sequence was a coding sequence. If the sequence isn't at least three letters long, there's no way to convert it into a protein. We should fix our function, but to see just what Hypothesis-Bio can do, we'll tell it the minimum length DNA sequence we want via the min_size argument:

@given(dna(min_size=3))
def test_translate(seq):
    assert len(translate(seq)) == len(seq) / 3

Now we get this error:

Falsifying example: test_translate(seq='AA-')

KeyError: 'AA-'

Whoops, we forgot to take gap characters into account! Note that Hypothesis didn't just find any example that triggered the bug; it found the smallest falsifying example. Again, while we should fix the translate function, let's just ignore the issue to see what else Hypothesis will find:

@given(dna(min_size=3, allow_gaps=False))
def test_translate(seq):
    assert len(translate(seq)) == len(seq) / 3

Now we get:

Falsifying example: test_translate(seq='AAB')

KeyError: 'AAB'

It turns out we also forgot the ambiguous nucleotides. What else can we find if we ignore ambiguous nucleotides?

@given(dna(min_size=3, allow_gaps=False, allow_ambiguous=False))
def test_translate(seq):
    assert len(translate(seq)) == len(seq) / 3

Now we get:

Falsifying example: test_translate(seq='AAa')

KeyError: 'AAa'

We also forgot to handle lowercase characters! By passing the argument uppercase_only=True to dna, we can tell Hypothesis-Bio to only generate uppercase DNA sequences:

@given(dna(min_size=3, allow_gaps=False, allow_ambiguous=False, uppercase_only=True))
def test_translate(seq):
    assert len(translate(seq)) == len(seq) / 3

And now we get:

Falsifying example: test_translate(seq='AAAA')

KeyError: 'A'

We now see another bug: a sequence whose length isn't divisible by 3 results in a KeyError, since the last codon is only partial. Gaps and ambiguous bases and lowercase letters, oh my! Thankfully, Hypothesis-Bio will generate all of these weird edge cases so you don't have to write them by hand.
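
Putting these findings together, a more defensive translate might look like the sketch below. The design choices are assumptions for illustration, not part of Hypothesis-Bio: the input is uppercased, a length that is not a multiple of three raises a ValueError, and any codon containing a gap or an ambiguous base is translated to X. The codon table is the same standard genetic code as above, built compactly from the conventional TCAG ordering.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into a one-letter protein string."""
    # Accept lowercase input by normalizing the case first.
    seq = dna.upper()
    # Assumption: a trailing partial codon is an error, not silently dropped.
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Assumption: codons with gaps or ambiguous bases become "X" (unknown).
    return "".join(
        GENETIC_CODE.get(seq[start : start + 3], "X")
        for start in range(0, len(seq), 3)
    )
```

With this version, translate("AA-") and translate("AAB") both return "X", while translate("AAAA") raises a ValueError, so the length property only needs to be checked for sequences whose length is a multiple of three.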

Installation

Hypothesis-Bio will be available from PyPI via:

pip install hypothesis-bio

And Conda using:

conda install -c [CHANNEL GOES HERE] hypothesis-bio

Documentation

The documentation for Hypothesis-Bio is available at https://lab41.github.io/hypothesis-bio.

Citation

If you use Hypothesis-Bio, please cite it as:

Hypothesis-Bio. https://github.com/Lab41/hypothesis-bio

or, for BibTeX:

@misc{hypothesis_bio,
  author    = {Benjamin Lee and Reva Shenwai and Zongyi Ha and Michael B. Hall and Vaastav Anand},
  title     = {{Hypothesis-Bio}},
  publisher = {GitHub},
  url       = {https://github.com/Lab41/hypothesis-bio}
}

hypothesis-bio's People

Contributors

arjunrajaram, benjamin-lee, luizirber, mbhall88, scatcher125, vaastav, zyha

hypothesis-bio's Issues

Please share Hypothesis success stories!

Hi all! (@luizirber, @Benjamin-Lee, etc.)

I'm one of the Hypothesis core developers, and have a request:

I'm currently writing up some case studies about Hypothesis-in-science for a SciPy proceedings paper, and would love to add some in biology / bioinformatics - I've got plenty from foundational tools like Numpy and Pandas and a nice one about time in Astropy, but the more diverse the better. So:

  • have you found any cool bugs with hypothesis-bio?
  • has property-based testing changed your workflow? (for better or worse!)
  • do you think the quality and/or quantity of code you write has changed?
  • any challenges or difficulties getting started?
  • is there anyone else I should contact for this?

Happy to discuss here or by email, though preferably soon... a draft is due Friday 😅

Add support for VCF

Generate VCF entries.

This could be quite complex. I will think over the strategies for this in the next day or two.

Potential logo for Hypothesis-bio

I made this initial draft of a logo for Hypothesis-Bio, based on the Hypothesis package logo. I don't have any image editing software, so I just used MS PowerPoint to put the image together, with a DNA image taken from Google (Wikipedia). Let me know what you think and whether I should make any changes!
[Image: draft logo]

FASTQ

We should have a strategy that generates a FASTQ string such as:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
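
To make the target format concrete, here is a minimal sketch of a formatter for a single record. The function name format_fastq and its checks are hypothetical and are not the strategy's actual interface; it only encodes the four-line layout shown above and the rule that the quality string is as long as the sequence.

```python
def format_fastq(seq_id, sequence, quality):
    """Format one FASTQ record (hypothetical helper, for illustration only)."""
    # Each base needs exactly one quality character.
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    # Quality characters are printable ASCII, codes 33 ('!') to 126 ('~').
    if any(not 33 <= ord(char) <= 126 for char in quality):
        raise ValueError("quality characters must be ASCII 33 to 126")
    return f"@{seq_id}\n{sequence}\n+\n{quality}\n"
```

For example, format_fastq("SEQ_ID", "GATT", "!''*") produces a record with the same layout as the one above.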

Unable to run tox fix_lint

I've installed tox and flit in my virtualenv and am getting the following error:

~/D/P/R/hypothesis-bio ❯❯❯ pip install tox flit
Requirement already satisfied: tox in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (3.14.0)
Requirement already satisfied: flit in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (1.3)
Requirement already satisfied: importlib-metadata<1,>=0.12; python_version < "3.8" in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (0.23)
Requirement already satisfied: py<2,>=1.4.17 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (1.8.0)
Requirement already satisfied: six<2,>=1.0.0 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (1.12.0)
Requirement already satisfied: toml>=0.9.4 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (0.10.0)
Requirement already satisfied: filelock<4,>=3.0.0 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (3.0.12)
Requirement already satisfied: pluggy<1,>=0.12.0 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (0.13.0)
Requirement already satisfied: virtualenv>=14.0.0 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (16.7.6)
Requirement already satisfied: packaging>=14 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from tox) (19.2)
Requirement already satisfied: requests in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from flit) (2.22.0)
Requirement already satisfied: docutils in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from flit) (0.15.2)
Requirement already satisfied: pytoml in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from flit) (0.1.21)
Requirement already satisfied: zipp>=0.5 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from importlib-metadata<1,>=0.12; python_version < "3.8"->tox) (0.6.0)
Requirement already satisfied: pyparsing>=2.0.2 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from packaging>=14->tox) (2.4.2)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from requests->flit) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from requests->flit) (1.25.6)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from requests->flit) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from requests->flit) (2019.9.11)
Requirement already satisfied: more-itertools in /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata<1,>=0.12; python_version < "3.8"->tox) (7.2.0)
~/D/P/R/hypothesis-bio ❯❯❯ tox fix_lint
.package create: /Users/BenjaminLee/Desktop/Python/Research/hypothesis-bio/.tox/.package
.package installdeps: flit
ERROR: invocation failed (exit code 1), logfile: /Users/BenjaminLee/Desktop/Python/Research/hypothesis-bio/.tox/.package/log/.package-1.log
================================================================== log start ==================================================================
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x106d4acc0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /simple/flit/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x106d4ab38>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /simple/flit/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x106d4a8d0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /simple/flit/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x106d4add8>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /simple/flit/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x106d1d470>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /simple/flit/
ERROR: Could not find a version that satisfies the requirement flit (from versions: none)
ERROR: No matching distribution found for flit

=================================================================== log end ===================================================================
ERROR: could not install deps [flit]; v = InvocationError('/Users/BenjaminLee/Desktop/Python/Research/hypothesis-bio/.tox/.package/bin/python -m pip install flit', 1)
ERROR: invocation failed (exit code 1), logfile: /Users/BenjaminLee/Desktop/Python/Research/hypothesis-bio/.tox/.package/log/.package-2.log
================================================================== log start ==================================================================
Traceback (most recent call last):
  File "/Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages/tox/helper/build_requires.py", line 7, in <module>
    backend = __import__(backend_spec, fromlist=[None])
ModuleNotFoundError: No module named 'flit'

=================================================================== log end ===================================================================
ERROR: FAIL could not package project - v = InvocationError("/Users/BenjaminLee/Desktop/Python/Research/hypothesis-bio/.tox/.package/bin/python /Users/BenjaminLee/.virtualenvs/hypothesis_bio/lib/python3.7/site-packages/tox/helper/build_requires.py flit.buildapi ''", 1)
~/D/P/R/hypothesis-bio ❯❯❯

@luizirber, do you have any ideas what is going wrong?

Generic wrapping function

Create a generic function that takes a string s and an integer i and wraps s to maximum line length of i.
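
A minimal sketch of such a function, assuming hard wrapping at exactly i characters with no regard for word boundaries (which suits sequence data). The name wrap is a placeholder, and rejecting non-positive widths is an added assumption.

```python
def wrap(s, i):
    """Wrap s so that no line is longer than i characters."""
    # Assumption: a non-positive width is a caller error.
    if i <= 0:
        raise ValueError("line length must be positive")
    # Slice the string into consecutive chunks of at most i characters.
    return "\n".join(s[start : start + i] for start in range(0, len(s), i))
```

For example, wrap("ATGCATGC", 3) gives three lines: ATG, CAT, and GC.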

hackseq review

Cool idea!! I can't even count how many times scripts haven't worked because of tiny bugs that are hard to catch by eye. The examples were simple and concise, which helped with understanding the concept.

A few questions:
What may be some limitations of hypothesis-bio? Are there any situations that it is not able to test for?
I'm also curious how many givens you can provide to hypothesis-bio, and whether at any point it becomes repetitive to provide every possible given.

Looking forward to the updates!

Trying to improve unit tests by mocking

  • hypothesis-bio version:
  • Python version: 3.7.3
  • Operating System: Ubuntu Linux

Description

Hello. I am interested in this project, but can't attend the hackseq, so I am unsure if I can contribute.

I tried to improve unit tests, and found a test that actually exposes a bug. It is hard to test this code, and although the functions you are implementing can generate a diverse range of objects (e.g. very diverse fasta/protein sequences), the tests are biased towards the minimal generated values (but maybe I misunderstood the tests, sorry if this is the case). Instead of letting hypothesis draw the random elements for us, which is what makes some tests hard I think, we can mock the draw function and fix the element we want to draw, and this might or might not help testing (the mocking can also inspect the variables passed to e.g. hypothesis's text() or draw(), making sure the arguments are correct, e.g. when you want a random dna sequence with no gaps, you call dna(allow_gaps=False), then you can check if the alphabet given to text() indeed has no gaps).

Going more direct to the point: this commit adds two more tests to test_protein.py: https://github.com/leoisl/hypothesis-bio/commit/28e5316f0ade1e5f242c77f2e4387e0fc34c370d .

The first test:
https://github.com/leoisl/hypothesis-bio/blob/28e5316f0ade1e5f242c77f2e4387e0fc34c370d/tests/test_protein.py#L60-L70
fixes the drawn protein to ACD and transforms it to AlaCysAsp, and everything works as expected.

However, the second test: https://github.com/leoisl/hypothesis-bio/blob/28e5316f0ade1e5f242c77f2e4387e0fc34c370d/tests/test_protein.py#L73-L80 fixes the drawn protein to acd (since we can generate lowercase protein sequences), and tries to transform it to the 3-letter format, but fails because utilities.protein_1to3 does not contain lowercase entries.
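
To illustrate one possible fix, the sketch below normalizes the case before the lookup. The PROTEIN_1TO3 table is a hypothetical three-entry stand-in for utilities.protein_1to3, and to_three_letter is an invented name, so this shows the idea rather than the library's code.

```python
# Hypothetical stand-in for utilities.protein_1to3, limited to three residues.
PROTEIN_1TO3 = {"A": "Ala", "C": "Cys", "D": "Asp"}


def to_three_letter(protein):
    """Convert a one-letter protein string to three-letter codes."""
    # Uppercase each residue so that lowercase input finds its table entry.
    return "".join(PROTEIN_1TO3[residue.upper()] for residue in protein)
```

Here both to_three_letter("ACD") and to_three_letter("acd") return AlaCysAsp.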

If you think this could be useful to the other tests (this should be interpreted just as an example that it could), I can make a PR, otherwise I will close this issue.

Add support for multi-FASTA

Currently, our FASTA strategy only generates one FASTA comment and sequence:

>foo
ATGC...

We should have it default to multi-FASTA:

>foo
ATGC...
>bar
ATGG...
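
As a rough sketch of the output shape, multi-FASTA is just single records concatenated. The helper below is hypothetical (it is not the strategy itself) and assumes each record is a (comment, sequence) pair written on two lines.

```python
def format_multi_fasta(records):
    """Join (comment, sequence) pairs into one multi-FASTA string."""
    # Each record is a '>' header line followed by its sequence line.
    return "".join(f">{comment}\n{sequence}\n" for comment, sequence in records)
```

Calling format_multi_fasta([("foo", "ATGC"), ("bar", "ATGG")]) yields the two-record layout described above, and a single-element list reproduces the current single-record behavior.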

List affiliations in paper.md

Just thinking ahead to if/when we publish this as a paper, to include y'all as authors, we'll need to list your affiliations. To make it less of a hassle in the future, can you list your affiliations in paper.md and submit a PR?

  • Jessica
  • Michael
  • Luiz
  • Vaastav
  • Reva

Add separate options for generating Illumina, Sanger, and Solexa FASTQ strings

The description of the FASTQ format at https://en.wikipedia.org/wiki/FASTQ_format suggests that quality scores are encoded as characters with ASCII codes 33 to 126.
But according to @mbhall88 in commit e91a86c4f191c8c1130f30938b50dd6629fb5db, the Illumina scores use codes 64 to 126 instead.

As @mbhall88 pointed out, there are 3 different quality string formats and we should support all of them: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

I think we should instead have a separate option for generating Illumina sequences, as opposed to generic FASTQ sequences. The reason is that sequences from the Illumina software use a systematic identifier instead of the random one we are generating.
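
A small sketch of how the three variants could be told apart by their quality alphabets. The ASCII ranges are the ones I understand the linked paper to give (Sanger 33 to 126, Solexa 59 to 126, Illumina 1.3+ 64 to 126) and are worth double-checking against it. The names QUALITY_RANGES and quality_alphabet are hypothetical.

```python
# ASCII code ranges per FASTQ variant (assumed from the linked paper).
QUALITY_RANGES = {
    "sanger": (33, 126),
    "solexa": (59, 126),
    "illumina": (64, 126),  # Illumina 1.3+ encoding
}


def quality_alphabet(variant):
    """Return every quality character allowed for the given variant."""
    low, high = QUALITY_RANGES[variant]
    # range() excludes its end point, so add 1 to include the highest code.
    return "".join(chr(code) for code in range(low, high + 1))
```

A quality-string strategy could then draw its characters from quality_alphabet(variant), with the variant exposed as an option.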

Can we drop Blast6 for hypothesis-csv?

It seems that hypothesis-csv already has pretty good support for generating tab-delimited files. Would it make sense to drop our blast6 implementation and add a section in the docs referring users to hypothesis-csv (maybe with an example)?
