bionitio-team / bionitio

Demonstrating best practices for bioinformatics command line tools

License: MIT License


bionitio's Introduction

Overview

Bionitio provides a template for command line bioinformatics tools in various programming languages.

In each language we implement a simple tool that carries out a basic bioinformatics task. The program reads one or more input FASTA files, computes a variety of simple statistics on each file, and prints a tabulated output.
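As a rough illustration of the task described above, the following Python sketch computes per-file statistics from FASTA text. It is a minimal sketch under stated assumptions, not the code of any bionitio implementation: the names `fasta_lengths` and `summarise` are hypothetical, the parser is hand-rolled only to keep the example self-contained (the real implementations use library code for FASTA parsing), the column order is illustrative, and the average is assumed to be rounded down.

```python
from typing import Iterable, List, Optional, Tuple


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of each sequence found in FASTA-formatted lines.

    Hand-rolled for illustration only; a real tool should use a library parser.
    """
    lengths: List[int] = []
    current: Optional[int] = None  # None until the first '>' header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarise(lengths: List[int]) -> Tuple[int, int, str, str, str]:
    """Return (NUMSEQ, TOTAL, MIN, AVG, MAX); '-' marks undefined values."""
    if not lengths:
        return (0, 0, "-", "-", "-")
    total = sum(lengths)
    # Assumption: the average is rounded down using integer division.
    average = total // len(lengths)
    return (len(lengths), total, str(min(lengths)), str(average), str(max(lengths)))


# Hypothetical two-sequence input, printed as one tab-separated row.
example = [">seq1", "ACGT", "AC", ">seq2", "GGGGGGGG"]
print(*summarise(fasta_lengths(example)), sep="\t")
```

Returning dashes for a file with no sequences avoids reporting a misleading minimum, average, or maximum of zero.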

The purpose of the tool is to provide an easy-to-understand working example that is built on best-practice software engineering principles. It can be used as a basis for learning and as a solid foundation for starting new projects. We provide a script called bionitio-boot.sh for starting new projects from bionitio, which saves time and ensures good programming practices are adopted from the beginning (see below for details).

An additional advantage of bionitio is that it allows us to compare programming styles in different languages and programming paradigms.

Bionitio is intended to work on POSIX-like operating systems (such as Linux and OSX). It has not been tested extensively on variants of the Windows operating system.

Please see our publication, "Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software", in GigaScience, which provides a detailed discussion of the tool.

Languages

Language Repository Travis Testing Status
C https://github.com/bionitio-team/bionitio-c travis
C++ https://github.com/bionitio-team/bionitio-cpp travis
C# https://github.com/bionitio-team/bionitio-csharp travis
Clojure https://github.com/bionitio-team/bionitio-clojure travis
Java https://github.com/bionitio-team/bionitio-java travis
Javascript https://github.com/bionitio-team/bionitio-js travis
Haskell https://github.com/bionitio-team/bionitio-haskell travis
Perl 5 https://github.com/bionitio-team/bionitio-perl5 travis
Python 3 https://github.com/bionitio-team/bionitio-python travis
R https://github.com/bionitio-team/bionitio-r travis
Ruby https://github.com/bionitio-team/bionitio-ruby travis
Rust https://github.com/bionitio-team/bionitio-rust travis

Basic functionality of bionitio

Bionitio is intended to be a simple prototypical bioinformatics tool that is easy to understand and modify. Therefore it has only minimal functionality; just enough to demonstrate all the key features of a real bioinformatics command line program without becoming distracted by unnecessary complexity.

If you use bionitio as the starting point for a new project we expect that you will rewrite it to implement your own desired functionality. However, much of the boilerplate is already provided for you; modifying the program should be significantly easier than starting from scratch.

All implementations of bionitio implement the same functionality and provide the same command line interface. Specific details of bionitio's behaviour, usage, and installation, can be found in the README for each implementation.

Key features of the tool include:

  • Command line argument parsing and usage information.
  • Reading input from multiple files or optionally from standard input.
  • The use of library code for parsing a common bioinformatics file format (FASTA).
  • Progress and error logging.
  • Defined exit status values.
  • A test suite (unit testing and integration testing).
  • A version number.
  • Standardised software building and packaging using programming language specific mechanisms.
  • A standard open-source software license.
  • User documentation.
  • Code documentation.
  • Docker container.
  • Common Workflow Language (CWL) wrapper.

Where possible we follow the recommended conventions for programming style for each implementation language.

License

The bionitio project is released as open source software under the terms of MIT License. However, we grant permission to users who derive their own projects from bionitio to apply their own license to their derived works. Licenses applied to projects deriving from bionitio do not affect in any way the license of the overall bionitio project, or licenses applied to other independent derivations.

Starting a new project from bionitio

How to set up a new bionitio project, step-by-step.

In the examples below $ indicates the Unix prompt.

One of the main goals of bionitio is to provide a good place to start writing bioinformatics command line tools. To make that easy we've provided a shell script called bionitio-boot.sh to help you start a new project, which is run like so:

$ boot/bionitio-boot.sh -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]' 

The example above starts a fresh project called skynet under the BSD-3-Clause license, using Python as the implementation language. A new git repository will be created in a sub-directory called skynet which will be initialised with a copy of bionitio and a blank revision history. All references to bionitio in the source code are replaced with skynet. Finally, the code is pushed to a new repository on www.github.com for the username cyberdyne.

You should replace skynet with a project name of your choice, and cyberdyne with your github username, if you have a github account. You may be asked to enter your github username. This assumes you do not already have a github project of the given name. If you don't have a github account, do not use the -g option.

After you have started a new project from bionitio you are free to modify it as you see fit, modifying its functionality to suit your own requirements.

When setting up a new project using bionitio-boot.sh you can specify the following options:

Required:

  • -i LANGUAGE: the programming language you want to use (one of: c, clojure, cpp, csharp, haskell, java, js, perl5, python, r, ruby, rust)
  • -n NAME: the name of your new project.

If you are new to programming, and do not know which programming language to use, then we recommend picking one of the high-level interpreted languages that are popular in Bioinformatics, such as Python or R. You may also need to seek advice from your peers about which language(s) are most appropriate for your purposes. We have tried to cover as many popular languages as possible, and apologise if your preference is not currently available. However, we also welcome new implementations of Bionitio in languages not already covered.

Optional:

  • -c LICENSE: the license that you want to assign to your new project (one of: Apache-2.0, BSD-2-Clause, BSD-3-Clause, GPL-2.0, GPL-3.0, MIT). If you do not specify a license then it defaults to the MIT license.
  • -g GITHUB-USERNAME: create a new remote repository in github and push new project to that repository. Replace GITHUB-USERNAME with your actual github user name. You may be prompted for your github password. This assumes you do not already have a repository in github with the same name as specified by the -n NAME option.
  • -a AUTHOR-NAME: Use this name for the author of the code (will appear in source code headers and other places where a name is appropriate).
  • -e AUTHOR-EMAIL: Use this string for the email address of the author of the code (will appear in source code headers and other places where an email address is appropriate).
  • -v: enable verbose mode; the script will print a lot more information about what it is doing. This is mostly useful for debugging if it does not work as expected.
  • -l LOGFILE: log progress information to the file named LOGFILE. This may be useful for debugging purposes.

If you don't have a local copy of the script, you can run it from the web like so, using curl:

$ URL=https://git.io/bionitio-boot
$ curl -sSfL $URL \
 | bash -s -- -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]'

Note that https://git.io/bionitio-boot redirects to the bionitio bootstrap script on GitHub: https://raw.githubusercontent.com/bionitio-team/bionitio/master/boot/bionitio-boot.sh.

Or if you have Docker installed on your computer, you can run the Docker container like so:

$ docker run -it -v "$(pwd):/out" --rm bionitio/bionitio-boot \
  -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]'

Or you can make a local copy of the bionitio-boot.sh script, and run it locally, as shown below:

# Copy the script to your local computer
$ URL=https://git.io/bionitio-boot
$ curl -sSfL $URL > bionitio-boot.sh

# Inspect the script to ensure you are happy with the commands it will execute on your system.

# Run the script on your local computer
$ bash bionitio-boot.sh -i python -n skynet -c BSD-3-Clause -g cyberdyne -a 'Miles Bennett Dyson' -e '[email protected]'

Authors

Alphabetically:

  • Jessica Chung
  • Harriet Dashnow
  • Peter Georgeson
  • Andrew Lonsdale
  • Michael Milton
  • Bernie Pope
  • David R Powell
  • Torsten Seemann
  • Clare Sloggett
  • Anna Syme


bionitio's Issues

Invalid fasta file behaviour

At the moment, behaviour is inconsistent if the input file exists but is not a FASTA file. I suspect the program should exit with a nonzero exit status in this situation, but many of the implementations do not.

Consistent handling of empty files

What is the expected behaviour on files with no sequences?

Currently the Perl implementation prints to stderr: "Skipping $file - doesn't seem to be FASTA?\n";
The Haskell version silently skips the file.

Consistent error handling

I think a useful contribution would be to demonstrate good error handling practices.

One thing I was thinking of is a command line argument to decide whether the program will stop if it encounters a bad input FASTA file, or whether it should continue processing the remaining files.

We should try to make the error handling and output of the different versions the same.

Set repos up as GitHub Template repos

GitHub recently added a new repository setting to mark a repo as a "template repository".

It's quite easy to set up (see docs) and basically just adds a little badge to the repo which allows people to copy the repo to their account (NB: not fork it, there's no link here).

This will obviously skip some of the bionitio functionality such as renaming the project within the code, however it could offer a second alternative route for users.

Test GitHub integration on Windows

A user has reported that GitHub integration failed on Windows under PowerShell.

Running the bootstrap script:

  • a new/different command-line window was opened
  • this asked for the user's GitHub password
  • the password was entered in plaintext (maybe it was the username instead)?
  • nothing happened when the enter key was pressed
  • the GitHub repository was not created

Underscore in NAME clashes with R package syntax

Note that an underscore (_) in the program name clashes with the R implementation when using biotool-boot.sh and produces an error:

Malformed package name

Not sure whether to enforce for all languages a minimal common set of characters, or add a check for R only
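One way to picture the R-only check is the sketch below. It assumes the naming rule documented for R packages (ASCII letters, digits and dots only, at least two characters, starting with a letter and not ending with a dot). The function name `is_valid_r_package_name` is hypothetical and is not part of the boot script.

```python
import re

# R package names: ASCII letters, digits and dots only, at least two
# characters, starting with a letter and not ending with a dot.
_R_PACKAGE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]")


def is_valid_r_package_name(name: str) -> bool:
    """Return True if `name` satisfies the R package naming rule above."""
    return _R_PACKAGE_NAME.fullmatch(name) is not None
```

Applying the same rule to every language would be the "minimal common set of characters" option; applying it only when R is selected would be the narrower one.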

Standardise prompt icon

The READMEs are inconsistent: some use $ and others use % as the Unix prompt.

We should standardise this to $.

Rounding differences in languages when computing average sequence length

I noticed a difference in the average output between the Haskell and Python implementations.

Haskell:
round 0.5 == 0

Python 2.7:
round(0.5) == 1.0

Python 3.3:
round(0.5) == 0

This will affect the output of our programs if we expect all implementations to compute exactly the same answer.

Should we standardise on the output?
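To make the candidate rules concrete, here is a small Python sketch that states each rounding mode explicitly instead of relying on a language's default `round`. The helper names are hypothetical, and nothing here reflects a decision taken by the project. It only shows three options an implementation could standardise on.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value: str) -> int:
    """Round to the nearest integer, with ties going away from zero."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def round_half_even(value: str) -> int:
    """Round to the nearest integer, with ties going to the even neighbour."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


def floor_average(total: int, count: int) -> int:
    """Average of non-negative lengths, always rounded down."""
    return total // count
```

Integer floor division is the easiest of the three to reproduce identically in every language, because it never depends on floating-point tie-breaking.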

perl version generates warnings/errors on single_greather_than.fasta

$ ./biotool.pl ../test_data/single_greather_than.fasta
FILENAME TOTAL NUMSEQ MIN AVG MAX
Use of uninitialized value $top in pattern match (m//) at /Library/Perl/5.18/Bio/SeqIO/fasta.pm line 147, chunk 1.
Use of uninitialized value $L in numeric lt (<) at ./biotool.pl line 67, line 1.
Use of uninitialized value in addition (+) at ./biotool.pl line 70, line 1.
Use of uninitialized value $L in numeric lt (<) at ./biotool.pl line 71, line 1.
Use of uninitialized value $L in numeric gt (>) at ./biotool.pl line 72, line 1.
Use of uninitialized value in join or string at ./biotool.pl line 86.
../test_data/single_greather_than.fasta 1 0 0 0

biotool-c not handling empty file correctly

Peter - need to figure out kseq diffs between empty file and no file?

$ ./.travis/test-all.sh
Invalid fasta file format 'test_data/empty_file'
Test Failed: c/biotool-c test_data/empty. Expected 'test_data/empty_file    0   0   -   -   -'
There were 1 errors found

Make it more useful

I know these are only meant to be templates for good programming practices, but could we make the scripts a bit more useful by having them calculate things like GC content, N50 and the number of Ns? Then the code would be much more useful in itself.

Restructure repository by splitting into separate repos

Currently bionitio consists of one single repo with each language implementation inside. This has a number of disadvantages, especially:

  • Commit history for each language is intermingled
  • Travis tests must test every implementation every time code is changed
  • Travis dependencies for each implementation must be installed in order to test a single implementation

I suggest:

  • Splitting the repo into separate repositories, so that each language is in a separate repo within the project. This should be done by forking the current repo once for each language implementation, and then deleting the code not related to this implementation in order to retain the commit history.
  • Retaining a separate master repo for the other code, such as the boot script, licenses, test data etc.
  • Since it seems that the only data that needs to be shared between language implementation repos is the test data, this can be included in the implementation repos either by including the test data as a submodule in each language, or if preferred, the test data can be cloned from the master repo during the builds.

Check that renaming directories and files in biotool-boot.sh is safe and does not clash with existing names

In biotool-boot.sh we rename any files or directories in the source tree that might contain the word "biotool" and replace it with whatever the new project is called.

However, during this renaming, I don't think that we check whether a file already exists with the new name. This is extremely unlikely, but still possible. We should check for this and abort the program if it occurs.

Perhaps the simplest way to do this is to check the exit status of the mv command?
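The check itself can be sketched in Python as below. This is only an illustration of the idea, since the boot script is a shell script, and the function name `rename_no_clobber` is hypothetical. Note that plain `mv` will normally overwrite an existing file and still exit with status 0, so the exit status alone may not reveal a clash, which is why the sketch tests for the destination explicitly.

```python
import os


def rename_no_clobber(old_path: str, new_path: str) -> None:
    """Rename old_path to new_path, refusing to overwrite anything.

    There is a small window between the check and the rename, so this is
    a safeguard against accidents rather than a guarantee under concurrency.
    """
    if os.path.lexists(new_path):
        raise FileExistsError(f"refusing to overwrite existing path: {new_path}")
    os.rename(old_path, new_path)
```

A caller would abort the whole bootstrap when `FileExistsError` is raised, matching the suggestion above to stop the program if a clash occurs.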

README.md wrong?

Looks like the TOTAL and NUMSEQ columns are swapped in all the examples?

URL in setup.py points to bjpop GitHub

After creating a new python project with bionitio my setup.py contains:

url='https://github.com/bjpop/biodemo',

Is this intended behaviour?

Command run:

URL=https://raw.githubusercontent.com/bionitio-team/bionitio/master/boot/bionitio-boot.sh
curl -sSf $URL | bash -s -- -i python -n biodemo -a 'Harriet Dashnow' -e [email protected] -g hdashnow

Consider making github and travis instructions more detailed

For beginners. For example:

After setting up the project template:

Github

  • Go to your Github account; create a new repo; copy the URL
  • git remote add origin [url] # sets a new remote
  • git remote -v # verifies the new remote
  • git push -u origin master # pushes the folder to your repo (=origin) on its master branch
  • make a text edit # it seems like this is necessary to get the project to show in travis?
  • git add, commit, push

Travis

  • Go to your Travis page
  • Check for the new repo in your list
  • Make sure it's ticked
  • Make another change in the README, git add, commit, push
  • Travis should now test whenever a new change is pushed; click on the repo name to check

This is just an idea - it may be too simple/unnecessary.

Bug in Perl code (and perhaps others): minlen can cause the program to claim input file is not FASTA if there are 0 reads greater than the minimum length

When minlen causes 0 reads to be counted the program claims the input is not FASTA:

$ ./biotool.pl --minlen 1000 ../test_data/*
FILENAME TOTAL NUMSEQ MIN AVG MAX
Skipping ../test_data/empty_file - doesn't seem to be FASTA?
Skipping ../test_data/one_sequence.fasta - doesn't seem to be FASTA?
Skipping ../test_data/two_sequence.fasta - doesn't seem to be FASTA?

I haven't tested all the other implementations, but something to watch out for.
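The distinction at stake can be sketched as follows: a file that failed to parse is different from a valid file in which no sequence passed the length filter. This is a hedged Python illustration, not the Perl code. The function name `filtered_row` and the convention of passing `None` for a parse failure are assumptions made for the example.

```python
from typing import List, Optional


def filtered_row(lengths: Optional[List[int]], minlen: int = 0) -> Optional[str]:
    """Format one tab-separated statistics row, or return None for invalid input.

    `lengths` is None only when parsing failed; an empty list means the file
    parsed correctly but contained no sequences.
    """
    if lengths is None:
        return None  # only this case should be reported as "not FASTA"
    kept = [n for n in lengths if n >= minlen]
    if not kept:
        # Valid input, but nothing passed the filter: zeros and dashes.
        return "0\t0\t-\t-\t-"
    total = sum(kept)
    fields = (len(kept), total, min(kept), total // len(kept), max(kept))
    return "\t".join(str(field) for field in fields)
```

Deciding validity before filtering keeps a large `--minlen` from being misreported as a format problem.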

Docker container

Might be useful to have a simple docker script to make containers out of biotool.

Will need one for each language implementation.

boot.sh fails if name contains bionitio

Using a name that already contains bionitio in it causes the search/replace step at renaming files and directories to fail

mv: cannot move 'mybionitio/LICENSE' to 'mymybionitio/LICENSE': No such file or directory

We should find a workaround. Alternatively, since we are going to encourage using -g to push to GitHub and developing one's own tools, we could add a check on the -n option that rejects any name containing bionitio.
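The error message above is consistent with the replacement being applied to the whole path, including the already-renamed project directory. One possible workaround, sketched in Python under that assumption, is to substitute only within the final path component. The function name `renamed_path` is hypothetical and this is not how the boot script is known to work.

```python
import os


def renamed_path(path: str, old: str, new: str) -> str:
    """Replace `old` with `new` in the last path component only."""
    head, tail = os.path.split(path)
    return os.path.join(head, tail.replace(old, new))
```

With this approach a file such as LICENSE inside a project directory called mybionitio keeps its path unchanged, while a file named after bionitio inside that directory is still renamed. Directories would need to be processed deepest-first so that parent paths stay valid during the walk.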

Functional tests missing with `git clone`

A user making a plain clone (git or http address) will be missing the functional tests content unless they use the --recurse-submodules option. None of the READMEs seem to mention this option. If we are encouraging the boot script this should be fine, but if we need to add git clone instructions as per issue #53 from @AnnaSyme then we should include this option in them.

bootstrap script should respond better to github errors

Currently if there is an error when trying to create a new github repository the bootstrap script just terminates (although it does at least say something went wrong with the git command).

This is particularly annoying if you happen to type your github password incorrectly.

Maybe it could respond better by looping and trying again?

Missing expected file still passes

Using the Python mode, I found that the test still passes if the expected file is missing.

The line starting with "diff" indicates a file is missing, but the test run still reported success and passed Travis CI.

(biodemo_dev) ubuntu@None:~/code/biodemo/functional_tests$ ./biodemo-test.sh -p biodemo -d test_data -v
biodemo-test.sh Testing stdout and exit status: biodemo one_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 < two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo empty_file
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 1000 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --maxlen 170 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --maxlen 170 < two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --maxlen 10 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 10 --maxlen 1000 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 --maxlen 10 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 200 --maxlen 1000 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 10 --maxlen 170 two_sequence.fasta
biodemo-test.sh Testing stdout and exit status: biodemo --minlen 50 --maxlen 340 < two_sequence.fasta
diff: two_sequence.fasta.stdin.expected: No such file or directory
biodemo-test.sh Testing exit status: biodemo --this_is_not_a_valid_argument > /dev/null 2>&1
biodemo-test.sh Testing exit status: biodemo this_file_does_not_exist.fasta > /dev/null 2>&1
biodemo passed all 16 successfully

A (semi) reproducible example is at:
trickytank/biodemo@d6d3f4a

Travis CI:
https://travis-ci.org/trickytank/biodemo/builds/461596203
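The behaviour one would want from the comparison step can be sketched in Python as below. The actual test driver is a shell script, so this is only an illustration of the rule that a missing expected file counts as a failure. The function name `output_matches` is hypothetical.

```python
from pathlib import Path


def output_matches(actual: str, expected_path: str) -> bool:
    """Return True only if the expected file exists and equals `actual`."""
    expected = Path(expected_path)
    if not expected.is_file():
        # A missing expected file is a failure, never a silent pass.
        return False
    return expected.read_text() == actual
```

In a shell driver the equivalent is to treat any nonzero status from diff as a failed test, because diff uses one nonzero status for differing files and another when it hits trouble such as a missing file.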

JS does not return correct exit status on command line error

According to our README.md, biotool should return an exit status of 2 when there is a command line argument error.

The JS implementation returns an exit status of 0 when given the argument:

--this_is_not_a_valid_argument

This is causing the JS library to fail on a travis test.
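For comparison, the expected behaviour can be demonstrated with Python's standard argparse module, which exits with status 2 when it rejects the command line. This sketch is not the JS implementation. The function name `exit_status_for` and the parser layout are assumptions chosen to resemble the options mentioned in these issues.

```python
import argparse
from typing import List


def exit_status_for(argv: List[str]) -> int:
    """Return the exit status argparse would use for the given arguments."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--minlen", type=int, default=0)
    parser.add_argument("fasta_files", nargs="*")
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        # argparse reports usage errors by exiting; capture the status code.
        return int(exc.code)
    return 0
```

An implementation whose argument-parsing library exits with a different status on error needs to catch the failure and exit with 2 explicitly.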

Handling of a dodgy file when multiple files supplied

For errors 1 and 3 below, do I skip over the dodgy file and continue processing, then return the exit code at the end? Or do I die immediately with that exit code?

1: File I/O error. This can occur if at least one of the input FASTA files cannot be opened for reading. This can occur because the file does not exist at the specified path, or biotool does not have permission to read from the file.

3: Input FASTA file is invalid. This can occur if biotool can read an input file but the file format is invalid.
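Both policies can be expressed with a single switch, as in the Python sketch below. It is a hedged illustration, not an agreed design: `process_all`, `keep_going` and `InvalidFastaError` are hypothetical names, and reporting the first failure's code when continuing is just one possible choice.

```python
from typing import Callable, Iterable

EXIT_OK = 0
EXIT_FILE_IO_ERROR = 1  # file missing or unreadable
EXIT_FASTA_ERROR = 3    # file readable but not valid FASTA


class InvalidFastaError(Exception):
    """Raised by the per-file handler when the input is not valid FASTA."""


def process_all(paths: Iterable[str],
                process_one: Callable[[str], None],
                keep_going: bool) -> int:
    """Process each file and return the exit status for the whole run."""
    status = EXIT_OK
    for path in paths:
        try:
            process_one(path)
        except OSError:
            code = EXIT_FILE_IO_ERROR
        except InvalidFastaError:
            code = EXIT_FASTA_ERROR
        else:
            continue  # this file was fine
        if not keep_going:
            return code  # die immediately with that exit code
        if status == EXIT_OK:
            status = code  # remember the first failure, keep processing
    return status
```

Whichever policy is chosen, applying it identically in every implementation is what makes the exit status useful to callers.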

Point to location of bionitio-boot.sh in README

As a new user I read the README and went looking for the bionitio-boot.sh script. I found it (it's in boot), but I was initially confused. Suggest explicitly mentioning /boot in the README.

Overview page: clarify tool use vs template tool setup

Idea for restructure of Overview page.

After the licence section, add in:

  • How to use bionitio as a tool

    • git clone [repo URL for the language you want]
    • install as per instructions
    • run and/or modify the code
  • How to make a new bioinformatics tool
    (= Starting a new project from bionitio)

    • curl the boot script
    • run the boot script, specify language
    • this makes the project template folders and files
    • you will then need to install the bionitio-language script # is this correct?
    • run and/or modify the code

Is this correct? Would this be useful?

Seems simple now I've written it down but I was confused about this distinction for a while.

biotool-R failing tests

Test Failed: r/biotool.R --minlen 200 < test_data/two_sequence.fasta. Expected ' 1 237 237 237 237'
Test Failed 'r/biotool.R --this_is_not_a_valid_argument'. Exit status was 1. Expected 2

Package up Perl code?

Hi Torsten,

I tried to run the Perl code, but I don't have the necessary Bio library installed on my computer.

Is there a way to package up the perl code so that it will download the necessary dependencies, preferably into a local package database? Similar to pip, or the stack tool in Haskell?
