bionitio-team / bionitio

Demonstrating best practices for bioinformatics command line tools

License: MIT License

bionitio's Introduction

Overview

Bionitio provides a template for command line bioinformatics tools in various programming languages.

In each language we implement a simple tool that carries out a basic bioinformatics task. The program reads one or more input FASTA files, computes a variety of simple statistics on each file, and prints a tabulated output.
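
As a rough illustration of the task described above, the following Python sketch parses FASTA text and computes per-file statistics. It is not taken from any bionitio implementation: the function names, the column order, the hand-written parser, and the use of floor division for the average are all assumptions made for this example. The real implementations use library code for FASTA parsing.

```python
from typing import Iterable, List, Optional, Tuple

Stats = Tuple[int, int, Optional[int], Optional[int], Optional[int]]


def sequence_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of each sequence in FASTA-formatted lines.

    Text before the first '>' header is ignored (a simplification).
    """
    lengths: List[int] = []
    current: Optional[int] = None  # None until the first '>' header is seen
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarise(lengths: List[int]) -> Stats:
    """Return (NUMSEQ, TOTAL, MIN, AVG, MAX); MIN/AVG/MAX are None if empty."""
    if not lengths:
        return (0, 0, None, None, None)
    total = sum(lengths)
    # Floor division is an assumption; it sidesteps language-specific rounding.
    return (len(lengths), total, min(lengths), total // len(lengths), max(lengths))


def format_row(filename: str, stats: Stats) -> str:
    """Render one tab-separated output row, using '-' for undefined values."""
    return "\t".join([filename] + ["-" if v is None else str(v) for v in stats])


if __name__ == "__main__":
    example = [">seq1", "ACGT", "AC", ">seq2", "GG"]
    print(format_row("example.fasta", summarise(sequence_lengths(example))))
```

Keeping parsing, summarising, and formatting in separate functions makes each step easy to unit test on its own.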

The purpose of the tool is to provide an easy-to-understand working example that is built on best-practice software engineering principles. It can be used as a basis for learning and as a solid foundation for starting new projects. We provide a script called bionitio-boot.sh for starting new projects from bionitio, which saves time and ensures good programming practices are adopted from the beginning (see below for details).

An additional advantage of bionitio is that it allows us to compare programming styles in different languages and programming paradigms.

Bionitio is intended to work on POSIX-like operating systems (such as Linux and OSX). It has not been tested extensively on variants of the Windows operating system.

Please see our publication, "Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software", in GigaScience, which provides a detailed discussion of the tool.

Languages

Language	Repository
C	https://github.com/bionitio-team/bionitio-c
C++	https://github.com/bionitio-team/bionitio-cpp
C#	https://github.com/bionitio-team/bionitio-csharp
Clojure	https://github.com/bionitio-team/bionitio-clojure
Java	https://github.com/bionitio-team/bionitio-java
Javascript	https://github.com/bionitio-team/bionitio-js
Haskell	https://github.com/bionitio-team/bionitio-haskell
Perl 5	https://github.com/bionitio-team/bionitio-perl5
Python 3	https://github.com/bionitio-team/bionitio-python
R	https://github.com/bionitio-team/bionitio-r
Ruby	https://github.com/bionitio-team/bionitio-ruby
Rust	https://github.com/bionitio-team/bionitio-rust

Basic functionality of bionitio

Bionitio is intended to be a simple prototypical bioinformatics tool that is easy to understand and modify. Therefore it has only minimal functionality; just enough to demonstrate all the key features of a real bioinformatics command line program without becoming distracted by unnecessary complexity.

If you use bionitio as the starting point for a new project we expect that you will rewrite it to implement your own desired functionality. However, much of the boilerplate is already provided for you; modifying the program should be significantly easier than starting from scratch.

All implementations of bionitio implement the same functionality and provide the same command line interface. Specific details of bionitio's behaviour, usage, and installation, can be found in the README for each implementation.

Key features of the tool include:

Command line argument parsing and usage information.
Reading input from multiple files or optionally from standard input.
The use of library code for parsing a common bioinformatics file format (FASTA).
Progress and error logging.
Defined exit status values.
A test suite (unit testing and integration testing).
A version number.
Standardised software building and packaging using programming language specific mechanisms.
A standard open-source software license.
User documentation.
Code documentation.
Docker container.
Common Workflow Language (CWL) wrapper.

Where possible we follow the recommended conventions for programming style for each implementation language.

License

The bionitio project is released as open source software under the terms of the MIT License. However, we grant permission to users who derive their own projects from bionitio to apply their own license to their derived works. Licenses applied to projects deriving from bionitio do not affect in any way the license of the overall bionitio project, or licenses applied to other independent derivations.

Starting a new project from bionitio

How to set up a new bionitio project, step-by-step.

In the examples below $ indicates the Unix prompt.

One of the main goals of bionitio is to provide a good place to start writing bioinformatics command line tools. To make that easy we've provided a shell script called bionitio-boot.sh to help you start a new project, which is run like so:

$ boot/bionitio-boot.sh -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]'

The example above starts a fresh project called skynet under the BSD-3-Clause license, using Python as the implementation language. A new git repository will be created in a sub-directory called skynet which will be initialised with a copy of bionitio and a blank revision history. All references to bionitio in the source code are replaced with skynet. Finally, the code is pushed to a new repository on www.github.com for the username cyberdyne.

You should replace skynet with a project name of your choice, and cyberdyne with your github username, if you have a github account. You may be asked to enter your github username. This assumes you do not already have a github project of the given name. If you don't have a github account, do not use the -g option.

After you have started a new project from bionitio you are free to modify it as you see fit, modifying its functionality to suit your own requirements.

When setting up a new project using bionitio-boot.sh, you can specify the following options:

Required:

-i LANGUAGE: the programming language you want to use (one of: c, clojure, cpp, csharp, haskell, java, js, perl5, python, r, ruby, rust)
-n NAME: the name of your new project.

If you are new to programming, and do not know which programming language to use, then we recommend picking one of the high-level interpreted languages that are popular in Bioinformatics, such as Python or R. You may also need to seek advice from your peers about which language(s) are most appropriate for your purposes. We have tried to cover as many popular languages as possible, and apologise if your preference is not currently available. However, we also welcome new implementations of Bionitio in languages not already covered.

Optional:

-c LICENSE: the license that you want to assign to your new project (one of: Apache-2.0, BSD-2-Clause, BSD-3-Clause, GPL-2.0, GPL-3.0, MIT). If you do not specify a license then it defaults to the MIT license.
-g GITHUB-USERNAME: create a new remote repository in github and push new project to that repository. Replace GITHUB-USERNAME with your actual github user name. You may be prompted for your github password. This assumes you do not already have a repository in github with the same name as specified by the -n NAME option.
-a AUTHOR-NAME: Use this name for the author of the code (will appear in source code headers and other places where a name is appropriate).
-e AUTHOR-EMAIL: Use this string for the email address of author of the code (will appear in source code headers and other places where an email address is appropriate).
-v: enable verbose mode; the script will print a lot more information about what it is doing. This is mostly useful for debugging if it does not work as expected.
-l LOGFILE: log progress information to the file named LOGFILE. This may be useful for debugging purposes.

If you don't have a local copy of the script, you can run it from the web like so, using curl:

$ URL=https://git.io/bionitio-boot
$ curl -sSfL $URL \
 | bash -s -- -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]'

Note that https://git.io/bionitio-boot redirects to the bionitio bootstrap script on GitHub: https://raw.githubusercontent.com/bionitio-team/bionitio/master/boot/bionitio-boot.sh.

Or if you have Docker installed on your computer, you can run the Docker container like so:

$ docker run -it -v "$(pwd):/out" --rm bionitio/bionitio-boot \
  -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]'

Or you can make a local copy of the bionitio-boot.sh script, and run it locally, as shown below:

# Copy the script to your local computer
$ URL=https://git.io/bionitio-boot
$ curl -sSfL $URL > bionitio-boot.sh

# Inspect the script to ensure you are happy with the commands it will execute on your system.

# Run the script on your local computer
$ bash bionitio-boot.sh -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]'

Authors

Alphabetically:

Jessica Chung
Harriet Dashnow
Peter Georgeson
Andrew Lonsdale
Michael Milton
Bernie Pope
David R Powell
Torsten Seemann
Clare Sloggett
Anna Syme

bionitio's Issues

bionitio perl5 and ruby need a licence file

Invalid fasta file behaviour

At the moment, behaviour is inconsistent if the input file exists but is not a FASTA file. I suspect the program should exit with a nonzero exit status in this situation, but many of the implementations do not.

Consistent handling of empty files

What is the expected behaviour on files with no sequences?

Currently the perl impl prints to stderr : "Skipping $file - doesn't seem to be FASTA?\n";
The haskell version silently skips.

Consistent error handling

I think a useful contribution would be to demonstrate good error handling practices.

One thing I was thinking of is a command line argument to decide whether the program will stop if it encounters a bad input FASTA file, or whether it should continue processing the remaining files.

We should try to make the error handling and output of the different versions the same.

Consider mounting functional_tests directory in Docker, rather than including it in the containers

We could mount in the functional_tests directory, so that it doesn't have to be part of the Docker image itself (like it is now). However this would mean the users can't run the tests solely using the Docker image.

Consider using GitHub actions instead of Travis

Set repos up as GitHub Template repos

GitHub recently added a new repository setting to mark a repo as a "template repository".

It's quite easy to set up (see docs) and basically just adds a little badge to the repo which allows people to copy the repo to their account (NB: not fork it, there's no link here).

This will obviously skip some of the bionitio functionality such as renaming the project within the code, however it could offer a second alternative route for users.

biotool-js : failing on empty file test

Bug in upstream library. Submitted a pull request with the fix.

Test GitHub integration on Windows

A user has reported that GitHub integration failed on Windows under PowerShell.

Running the bootstrap script:

a new/different command-line window was opened
this asked for the user's GitHub password
the password was entered in plaintext (maybe it was the username instead)?
nothing happened when the enter key was pressed
the GitHub repository was not created

bootstrap script should remove the readme_includes directory in newly created projects

readme_includes is only needed for making the README.md files using bionitio-readme, and should be removed from newly created projects generated by the bootstrap script.

Underscore in NAME clashes with R package syntax

Just noting that _ in program name will clash with the R implementation when using biotool-boot.sh and produce an error:

Malformed package name

Not sure whether to enforce for all languages a minimal common set of characters, or add a check for R only
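
One way to enforce a minimal common character set is to validate the name before doing anything else. The sketch below is only an illustration: the function name and the chosen rule (ASCII letters and digits, starting with a letter, at least two characters) are assumptions, picked to be conservative enough that an underscore is rejected up front. R itself also permits dots in package names, which this rule deliberately excludes.

```python
import re

# Assumed rule: a letter followed by one or more letters or digits.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]+")


def is_portable_name(name: str) -> bool:
    """Return True if name uses only the assumed common character set."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["skynet", "sky_net", "2fast"]:
        print(candidate, is_portable_name(candidate))
```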

Standardise prompt icon

The READMEs are inconsistent, using either $ or % as the Unix prompt.

We should standardise this to $

Rounding differences in languages when computing average sequence length

I noticed a difference in the average output between the Haskell and Python implementations.

Haskell:
round 0.5 == 0

Python 2.7:
round(0.5) == 1.0

Python 3.3:
round(0.5) == 0

This will have consequences for the output of our programs, if we compare them for computing exactly the same answer.

Should we standardise on the output?
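
To make the difference concrete: Python 3's built-in round resolves ties to the nearest even integer, so the result depends on the language's tie-breaking rule. One option, sketched here under the assumption that the average is computed from a non-negative integer total and a positive integer count, is to avoid floating point altogether and fix the rule explicitly. The function names are hypothetical.

```python
def mean_floor(total: int, count: int) -> int:
    """Average rounded down, using integer arithmetic only."""
    return total // count


def mean_half_up(total: int, count: int) -> int:
    """Average rounded to nearest, with ties rounded upward.

    Equivalent to floor(total / count + 1/2) for total >= 0 and count > 0.
    """
    return (2 * total + count) // (2 * count)


if __name__ == "__main__":
    # Ties: 1/2 = 0.5 and 5/2 = 2.5
    for total, count in [(1, 2), (5, 2)]:
        print(total, count, round(total / count),
              mean_floor(total, count), mean_half_up(total, count))
```

Either rule can be reproduced exactly in every implementation language, which is what matters if outputs are to be compared byte for byte.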

Can we add some logging to the program?

I think it would be nice to demonstrate logging as well.

Almost all my bioinf programs have optional logging to a file.

perl version generates warnings/errors on single_greather_than.fasta

$ ./biotool.pl ../test_data/single_greather_than.fasta
FILENAME TOTAL NUMSEQ MIN AVG MAX
Use of uninitialized value $top in pattern match (m//) at /Library/Perl/5.18/Bio/SeqIO/fasta.pm line 147, chunk 1.
Use of uninitialized value $L in numeric lt (<) at ./biotool.pl line 67, line 1.
Use of uninitialized value in addition (+) at ./biotool.pl line 70, line 1.
Use of uninitialized value $L in numeric lt (<) at ./biotool.pl line 71, line 1.
Use of uninitialized value $L in numeric gt (>) at ./biotool.pl line 72, line 1.
Use of uninitialized value in join or string at ./biotool.pl line 86.
../test_data/single_greather_than.fasta 1 0 0 0

Typo in file name (within test_data)

test_data/single_greather_than.fasta

correct to "greater"

Consider making installation instructions for each language more detailed

For beginners, I think the installation instructions should say:

git clone [the repo URL]
move into that folder
then follow the installation instructions

[or as per the correct steps]

What do you think?

Use -g info to project name plus `-team` rather than user or local

When the -g option is used, a github name is provided, so we could update templated links. For example, for username with project code-review, links would change from

https://github.com/code-review-team/code-review

to

https://github.com/username/code-review

biotool-c not handling empty file correctly

Peter - need to figure out kseq diffs between empty file and no file?

$ ./.travis/test-all.sh
Invalid fasta file format 'test_data/empty_file'
Test Failed: c/biotool-c test_data/empty. Expected 'test_data/empty_file    0   0   -   -   -'
There were 1 errors found

Make it more useful

I know these are only meant to be templates for good programming practices but could we make the scripts a bit more useful by making them calculate things like GC content, n50 and number of Ns? Then the code would be much more useful in itself.

Restructure repository by splitting into separate repos

Currently bionitio consists of one single repo with each language implementation inside. This has a number of disadvantages, especially:

Commit history for each language is intermingled
Travis tests must test every implementation every time code is changed
Travis dependencies for each implementation must be installed in order to test a single implementation

I suggest:

Splitting the repo into separate repositories, so that each language is in a separate repo within the project. This should be done by forking the current repo once for each language implementation, and then deleting the code not related to this implementation in order to retain the commit history.
Retaining a separate master repo remaining for the other code, such as the boot script, licenses, test data etc.
Since it seems that the only data that needs to be shared between language implementation repos is the test data, this can be included in the implementation repos either by including the test data as a submodule in each language, or if preferred, the test data can be cloned from the master repo during the builds.

Check that renaming directories and files in biotool-boot.sh is safe and does not clash with existing names

In biotool-boot.sh we rename any files or directories in the source tree that might contain the word "biotool" and replace it with whatever the new project is called.

However, during this renaming, I don't think that we check whether a file already exists with the new name. This is extremely unlikely, but still possible. We should check for this and abort the program if it occurs.

Perhaps the simplest way to do this is to check the exit status of the mv command?
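
Note that a plain mv typically replaces an existing file without reporting an error, so the exit status alone may not reveal a clash. The logic of an explicit check can be sketched as follows. This is written in Python purely for illustration (the boot script itself is a shell script), and safe_rename is a hypothetical helper name.

```python
import os


def safe_rename(src: str, dst: str) -> None:
    """Rename src to dst, refusing to replace anything already at dst.

    The check and the rename are separate steps, so this is not safe
    against another process creating dst in between.
    """
    if os.path.lexists(dst):
        raise FileExistsError(f"refusing to overwrite existing path: {dst}")
    os.rename(src, dst)
```

A caller would catch FileExistsError and abort with a clear message rather than continuing with a half-renamed tree.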

README.md wrong?

Looks like the TOTAL and NUMSEQ columns are swapped in all the examples?

Python - setuptools

Your setup in the Python template probably won't work if the user doesn't have setuptools installed, which is actually common.

Something like this in setup.py would be good:

try:
    from setuptools import setup
except ImportError:
    from ez_setup import use_setuptools
    use_setuptools()

I have a copy of ez_setup here: https://github.com/BeatsonLab-MicrobialGenomics/samplemod/blob/master/ez_setup.py

biotool-java failing on empty file test

Spec is now defined for empty files

URL in setup.py points to bjpop GitHub

After creating a new python project with bionitio my setup.py contains:

url='https://github.com/bjpop/biodemo',

Is this intended behaviour?

Command run:

URL=https://raw.githubusercontent.com/bionitio-team/bionitio/master/boot/bionitio-boot.sh
curl -sSf $URL | bash -s -- -i python -n biodemo -a 'Harriet Dashnow' -e [email protected] -g hdashnow

Consider making github and travis instructions more detailed

For beginners. For example:

After setting up the project template:

Github

Go to your Github account; create a new repo; copy the URL
git remote add origin [url] # sets a new remote
git remote -v # verifies the new remote
git push -u origin master # pushes the folder to your repo (=origin) on its master branch
make a text edit # it seems like this is necessary to get the project to show in travis?
git add, commit, push

Travis

Go to your Travis page
Check for the new repo in your list
Make sure it's ticked
Make another change in the README, git add, commit, push
Travis should now test whenever a new change is pushed; click on the repo name to check

This is just an idea - it may be too simple/unnecessary.

Bug in Perl code (and perhaps others): minlen can cause the program to claim input file is not FASTA if there are 0 reads greater than the minimum length

When minlen causes 0 reads to be counted the program claims the input is not FASTA:

$ ./biotool.pl --minlen 1000 ../test_data/*
FILENAME TOTAL NUMSEQ MIN AVG MAX
Skipping ../test_data/empty_file - doesn't seem to be FASTA?
Skipping ../test_data/one_sequence.fasta - doesn't seem to be FASTA?
Skipping ../test_data/two_sequence.fasta - doesn't seem to be FASTA?

I haven't tested all the other implementations, but something to watch out for.

Looks similar to cookie-cutter

Maria just told me about this project. Looks excellent.

Have you seen Audrey's CookieCutter?

It does something very similar and may be worth looking at for ideas.

Create placeholder for license name, similar to BIONITIO_AUTHOR

This will allow us to use the correct license name in module headers and documentation.

bionitio.pl only runs with the .pl extension

but the aim is that it can run with "bionitio" only, as this is what is written in the "Help message" in the README

Docker container

Might be useful to have a simple docker script to make containers out of biotool.

Will need one for each language implementation.

boot.sh fails if name contains bionitio

Using a name that already contains bionitio in it causes the search/replace step at renaming files and directories to fail

mv: cannot move 'mybionitio/LICENSE' to 'mymybionitio/LICENSE': No such file or directory

Should try and find a workaround, or since we're going to encourage the use of -g to push to GitHub, and developing own tools, just prevent this with a check on the -n option and just enforce that the name can't contain bionitio?
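
One plausible reading of the error above is that the search/replace was applied to the whole path, including the new project directory, rather than to the file name alone. That is a guess about the cause, not a description of what the boot script actually does. Under that assumption, a possible workaround is to rewrite only the final path component, sketched here in Python with a hypothetical helper name:

```python
import posixpath


def renamed_path(path: str, old: str, new: str) -> str:
    """Replace old with new in the last path component only."""
    directory, basename = posixpath.split(path)
    return posixpath.join(directory, basename.replace(old, new))


if __name__ == "__main__":
    path = "mybionitio/LICENSE"
    # Naive replacement also rewrites the directory name.
    print(path.replace("bionitio", "mybionitio"))
    # Restricting the replacement to the basename leaves the directory alone.
    print(renamed_path(path, "bionitio", "mybionitio"))
```

Directories whose own names contain the old project name would still need renaming, for example by processing the deepest paths first.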

Functional tests missing with `git clone`

A user doing a plain clone (git or http address) will be missing the functional tests content unless the --recurse-submodules option is used. None of the READMEs seem to mention this option. If we are encouraging use of the boot script this should be fine, but if we need to add git clone instructions as per issue #53 from @AnnaSyme then we should mention it there.

README for new bootstrapped projects contains bad travis URL

Due to the bionitio-team in our original URLs, these are not correctly patched by the bootstrap script, so newly bootstrapped projects contain bad URLs.

We should update the bootstrap script to fix this.

What is the intended semantics of the verbose flag?

We should work this out soon. It has significant impact on the Haskell code, due to purity.

Perl program uses /dev/stdin to read from stdin, I'm not sure that this is portable

Won't work on Windows (I realise we might not care about Windows) but there are probably ways we could make this feature portable.
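
The general idea of a portable approach is to use the language's own standard input handle instead of opening a device path. As an illustration only, here is how that looks in Python; the Perl implementation would need its own equivalent. The helper name and the convention that "-" means standard input are assumptions for this sketch.

```python
import sys
from contextlib import contextmanager
from typing import Iterator, Optional, TextIO


@contextmanager
def open_input(path: Optional[str] = None) -> Iterator[TextIO]:
    """Yield a readable text handle: stdin for None or '-', else the file."""
    if path is None or path == "-":
        # Use the interpreter's stdin handle; do not close it on exit.
        yield sys.stdin
    else:
        with open(path) as handle:
            yield handle
```

Calling code can then write `with open_input(name) as handle:` and treat both cases identically.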

bootstrap script should respond better to github errors

Currently if there is an error when trying to create a new github repository the bootstrap script just terminates (although it does at least say something went wrong with the git command).

This is particularly annoying if you happen to type your github password incorrectly.

Maybe it could respond better by looping and trying again?
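
The retry idea can be sketched as a small loop with a fixed number of attempts. The example is in Python for illustration (the boot script is shell), and the function name, the default of three attempts, and the choice of RuntimeError as the retryable error type are all assumptions.

```python
import sys
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3,
          retry_on: Type[Exception] = RuntimeError) -> T:
    """Call action until it succeeds or attempts is exhausted.

    Only exceptions of type retry_on trigger another attempt; the last
    failure is re-raised so the caller can report it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({error}); trying again",
                  file=sys.stderr)
    raise AssertionError("unreachable")
```

Bounding the number of attempts avoids looping forever when the failure is not a mistyped password.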

Missing expected file still passes

Using the Python mode, I found that if the expected file was missing that the test still passes.

The line starting with "diff" indicates a file is missing, but the tests still ran to completion and passed on Travis CI.

(biodemo_dev) ubuntu@None:~/code/biodemo/functional_tests$ ./biodemo-test.sh -p biodemo -d test_data -v
biodemo-test.sh Testing stdout and exit status: biodemo one_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 < two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo empty_file
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 1000 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --maxlen 170 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --maxlen 170 < two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --maxlen 10 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 10 --maxlen 1000 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 --maxlen 10 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 --maxlen 1000 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 10 --maxlen 170 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 50 --maxlen 340 < two_sequence.fasta
diff: two_sequence.fasta.stdin.expected: No such file or directory
biodemo-test.sh Testing exit status: biodemo --this_is_not_a_valid_argument > /dev/null 2>&1
biodemo-test.sh Testing exit status: biodemo this_file_does_not_exist.fasta > /dev/null 2>&1
biodemo passed all 16 successfully

A (semi) reproducible example is at:
trickytank/biodemo@d6d3f4a

Travis CI:
https://travis-ci.org/trickytank/biodemo/builds/461596203
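
A test harness can guard against this by treating a missing expected-output file as a failure in its own right, rather than relying on the comparison tool. The following Python sketch shows the logic only; it is not the code of the test script, and the function name is hypothetical.

```python
import os
import sys


def matches_expected(actual: str, expected_path: str) -> bool:
    """Return True only if the expected file exists and equals actual."""
    if not os.path.isfile(expected_path):
        # A missing expected file counts as a failed test, not a skipped one.
        print(f"expected output file not found: {expected_path}",
              file=sys.stderr)
        return False
    with open(expected_path) as handle:
        return handle.read() == actual
```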

GitHub API is deprecating password access. Use a personal access token (PAT) instead.

Currently the bionitio-boot.sh script uses curl to communicate with the GitHub API.

Password authentication via this API is being deprecated and the recommendation from GitHub is to use a personal access token instead.

See: https://github.com/settings/tokens
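
For illustration, token-based authentication means sending the token in an Authorization header instead of a username and password. The Python sketch below only builds a request for GitHub's "create a repository for the authenticated user" endpoint and does not send it. The function name is hypothetical, and how the token is obtained (for example from an environment variable) is left to the caller.

```python
import json
import urllib.request


def build_create_repo_request(repo_name: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request to create a GitHub repository."""
    body = json.dumps({"name": repo_name}).encode("utf-8")
    return urllib.request.Request(
        "https://api.github.com/user/repos",
        data=body,
        headers={
            "Authorization": f"token {token}",  # PAT instead of a password
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request would be sent with urllib.request.urlopen; the token should never be written to logs or echoed to the terminal.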

Document generation for Sphinx

Consider adding example document generation for Sphinx.

JS does not return correct exit status on command line error

According to our README.md, biotool should return an exit status of 2 when there is a command line argument error.

The JS implementation returns an exit status of 0 when given the argument:

--this_is_not_a_valid_argument

This is causing the JS library to fail on a travis test.

Handling of a dodgy file when multiple files supplied

For errors 1 and 3 below, do I skip over the dodgy file and continue processing, then return the exit code at the end? Or do I die immediately with that exit code?

1: File I/O error. This can occur if at least one of the input FASTA files cannot be opened for reading. This can occur because the file does not exist at the specified path, or biotool does not have permission to read from the file.

3: Input FASTA file is invalid. This can occur if biotool can read an input file but the file format is invalid.
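
Both behaviours can be supported with a single switch. The sketch below is one possible design, not the agreed specification: the exit codes 1 and 3 come from the descriptions above, while the function name, the stop_on_error parameter, the InvalidFastaError exception, and the choice to report the first failure's code when continuing are assumptions.

```python
import sys
from typing import Callable, Iterable

EXIT_OK = 0
EXIT_FILE_IO_ERROR = 1     # input file cannot be opened for reading
EXIT_FASTA_FILE_ERROR = 3  # input file is readable but not valid FASTA


class InvalidFastaError(Exception):
    """Hypothetical exception raised by a parser for malformed input."""


def process_files(paths: Iterable[str],
                  process_one: Callable[[str], None],
                  stop_on_error: bool = False) -> int:
    """Process each path and return the exit status for the whole run."""
    status = EXIT_OK
    for path in paths:
        try:
            process_one(path)
            continue
        except OSError as error:
            code, message = EXIT_FILE_IO_ERROR, str(error)
        except InvalidFastaError as error:
            code, message = EXIT_FASTA_FILE_ERROR, str(error)
        print(f"error: {path}: {message}", file=sys.stderr)
        if stop_on_error:
            return code
        if status == EXIT_OK:
            status = code  # remember the first failure, keep going
    return status
```

A program would finish with sys.exit(process_files(...)), so that a run with any bad file still ends with a nonzero status even when the remaining files were processed.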

Point to location of bionitio-boot.sh in README

As a new user I read the README and went looking for the bionitio-boot.sh script. I found it (it's in boot), but I was initially confused. Suggest explicitly mentioning /boot in the README.

Make all implementations call their executable bionitio (and remove any extensions in the name)

This will help homogenise our README.md

All implementations should be called bionitio, and e.g. not bionitio-py or whatever.

biotool-cpp doesn't handle empty file

Fails with the error "Could not open the file"; it should instead produce the output defined in the README.

Overview page: clarify tool use vs template tool setup

Idea for restructure of Overview page.

After the licence section, add in:

How to use bionitio as a tool
- git clone [repo URL for the language you want]
- install as per instructions
- run and/or modify the code
How to make a new bioinformatics tool
(= Starting a new project from bionitio)
- curl the boot script
- run the boot script, specify language
- this makes the project template folders and files
- you will then need to install the bionitio-language script # is this correct?
- run and/or modify the code

Is this correct? Would this be useful?

Seems simple now I've written it down but I was confused about this distinction for a while.

biotool-R failing tests

Test Failed: r/biotool.R --minlen 200 < test_data/two_sequence.fasta. Expected ' 1 237 237 237 237'
Test Failed 'r/biotool.R --this_is_not_a_valid_argument'. Exit status was 1. Expected 2

Package up Perl code?

Hi Torsten,

I tried to run the Perl code, but I don't have the necessary Bio library installed on my computer.

Is there a way to package up the perl code so that it will download the necessary dependencies, preferably into a local package database? Similar to pip, or the stack tool in Haskell?

Version of python in travis.yml - multiple?

Python 3.4 is currently specified in the Python travis.yml. Do we need version 2 specified here, and/or can multiple versions be specified?