This project forked from rrwick/compare-annotations

A script for comparing old vs new versions of genome annotations

License: GNU General Public License v3.0

Compare annotations

This repo contains a script (compare_annotations.py) for quantifying the improvement in an annotation when a genome is reassembled and/or reannotated.

For example, imagine you had an annotated bacterial genome that's a couple of years old. You've now come back to this genome with new versions of the assembler and annotator and made an updated version. This script can tell you how things changed at the gene level. Hopefully they got better!

This script works by doing a global alignment of the genes of one annotation to the other. This means the two genomes must be roughly aligned at the gene level, i.e. they should start and end at the same places. If your new genome contains structural rearrangements, that will break this script!
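To illustrate what a global alignment of two gene lists means, here is a minimal Needleman-Wunsch sketch in Python. It is not the code used by compare_annotations.py: the function name `global_align`, the scoring values and the equality-based `is_match` test are all assumptions made for illustration (the real script also has to recognise similar-but-not-identical genes).

```python
def global_align(old, new, is_match=lambda a, b: a == b,
                 match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch over two ordered gene lists (illustrative only).

    Returns (old_gene, new_gene) pairs; None marks a gene with no partner.
    """
    def pair_score(a, b):
        return match if is_match(a, b) else mismatch

    n, m = len(old), len(new)
    # score[i][j] = best score for aligning old[:i] with new[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(old[i - 1], new[j - 1]),
                score[i - 1][j] + gap,   # old gene left unpaired
                score[i][j - 1] + gap,   # new gene left unpaired
            )

    # Trace back from the bottom-right corner to recover the pairing.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + pair_score(old[i - 1], new[j - 1])
            if score[i][j] == diag:
                pairs.append((old[i - 1], new[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((old[i - 1], None))
            i -= 1
        else:
            pairs.append((None, new[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical gene names, used only to show the shape of the result.
old_genes = ["dnaA", "dnaN", "recF", "gyrB"]
new_genes = ["dnaA", "recF", "gyrB", "gyrA"]
for old_gene, new_gene in global_align(old_genes, new_genes):
    print(old_gene, new_gene)
```

Because the alignment runs from the first gene to the last in both lists, genes can only be paired in order. That is why a rearranged genome breaks the approach: a block of genes that has moved can no longer be paired up and is reported as missing from one side and new on the other.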

A few other things to note:

  • The input annotated genomes must be in GenBank format.
  • This script only looks at 'CDS' features in the genomes, nothing else.
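As a rough sketch of what "only CDS features" means in Biopython terms, the snippet below filters a record's features by type. The helper names `cds_features` and `load_cds` are made up for this example and are not taken from the script; `SeqIO.parse(path, "genbank")` is the standard Biopython way to read GenBank files.

```python
from types import SimpleNamespace


def cds_features(record):
    """Return only the features of a record whose type is 'CDS'."""
    return [feature for feature in record.features if feature.type == "CDS"]


def load_cds(path):
    """Collect the CDS features from every record in a GenBank file."""
    # Imported here so cds_features() can be tried without Biopython installed.
    from Bio import SeqIO
    return [feature
            for record in SeqIO.parse(path, "genbank")
            for feature in cds_features(record)]


# Stand-in record (not real Biopython objects) to show the filtering.
demo_record = SimpleNamespace(features=[
    SimpleNamespace(type="gene"),
    SimpleNamespace(type="CDS"),
    SimpleNamespace(type="tRNA"),
])
print(len(cds_features(demo_record)))
```

Anything that is not a CDS (genes, tRNAs, rRNAs and so on) is simply ignored by a filter like this.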

Requirements

This script uses Python 3 and Biopython. If you can run python3 -c "import Bio" without getting an error, you should be good to go!

No installation is required: just clone the repo and run the script:

git clone https://github.com/rrwick/Compare-annotations.git
Compare-annotations/compare_annotations.py --help

2023 update: When I tried this script with Biopython v1.81, it no longer worked. It still works with Biopython v1.78, so you might need to make a conda environment like this to run the script:

conda create -n old_biopython biopython=1.78
conda activate old_biopython
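If you are not sure which Biopython you have, a small check like the following can tell you. This is only a sketch: the helper names are made up, and the only facts it relies on are the two data points above (v1.78 works, v1.81 does not). Versions in between are untested here, so the check just warns about anything newer than v1.78.

```python
def version_tuple(version):
    """Turn '1.78' into (1, 78); non-numeric parts such as 'dev0' are dropped."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def installed_biopython_version():
    """Return Biopython's version string, or None if it is not installed."""
    try:
        import Bio
    except ImportError:
        return None
    return Bio.__version__


KNOWN_GOOD = "1.78"  # last version reported as working above

found = installed_biopython_version()
if found is None:
    print("Biopython is not installed")
elif version_tuple(found) > version_tuple(KNOWN_GOOD):
    print(f"Biopython {found} is newer than {KNOWN_GOOD}; the script may not work")
else:
    print(f"Biopython {found} should be fine")
```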

Example usage

Here are two versions of a genome you can try this script on: CP001172.1 and CP001172.2.

Download them in GenBank format and then run the script like this:

compare_annotations.py CP001172.1.gb CP001172.2.gb > results
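If you would rather fetch the two files from the command line than from the browser, something like this should work. It assumes the accessions are served by NCBI's E-utilities `efetch` endpoint from the `nuccore` database and that `gbwithparts` gives the full GenBank record; the function names are hypothetical and nothing here comes from the script itself.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an efetch URL for one nucleotide accession in GenBank format."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "gbwithparts",  # assumption: full record with features and sequence
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download_genbank(accession):
    """Save the record as <accession>.gb in the current directory."""
    filename = accession + ".gb"
    urlretrieve(efetch_url(accession), filename)
    return filename


# Example (needs network access, so it is left commented out):
# for accession in ("CP001172.1", "CP001172.2"):
#     download_genbank(accession)
print(efetch_url("CP001172.1"))
```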

Output

This script outputs a gene-by-gene analysis.

When two CDSs are identical, you'll see something like this:

Exact match
  old: Anhydro-N-acetylmuramic acid kinase(AnhMurNAc kinase) (14970-16098 -, 1128 bp)
  new: Anhydro-N-acetylmuramic acid kinase (14970-16098 -, 1128 bp)

Or if the CDSs are similar (i.e. the same gene) but not identical, you'll see something like this:

Inexact match (98.54% ID, old seq longer)
  old: tyrosyl-tRNA synthetase (16159-17392 +, 1233 bp)
  new: Tyrosine--tRNA ligase (16177-17392 +, 1215 bp)

And if a CDS is only in one of the two assemblies, you might see stuff like this:

In old but not in new:
  Glutathione S-transferase, C-terminal domain protein (24047-24413 +, 366 bp)

In new but not in old:
  Sodium:neurotransmitter symporter family protein (25765-27244 +, 1479 bp)
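If you want to work with these results programmatically, the lines shown above are regular enough to parse. The sketch below is based only on the examples in this section, so treat the regular expressions and the names `parse_feature` and `parse_inexact_header` as assumptions rather than a specification of the output format.

```python
import re

# e.g. "tyrosyl-tRNA synthetase (16159-17392 +, 1233 bp)"
FEATURE = re.compile(
    r"^(?P<product>.*) \((?P<start>\d+)-(?P<end>\d+) (?P<strand>[+-]), "
    r"(?P<length>\d+) bp\)$"
)
# e.g. "Inexact match (98.54% ID, old seq longer)"
INEXACT = re.compile(r"^Inexact match \((?P<identity>[\d.]+)% ID, (?P<note>.+)\)$")


def parse_feature(line):
    """Parse one feature line, with or without an 'old: ' / 'new: ' prefix."""
    text = line.strip()
    for prefix in ("old: ", "new: "):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    found = FEATURE.match(text)
    if found is None:
        return None
    return {
        "product": found["product"],
        "start": int(found["start"]),
        "end": int(found["end"]),
        "strand": found["strand"],
        "length": int(found["length"]),
    }


def parse_inexact_header(line):
    """Return (percent identity, note) for an 'Inexact match' line, else None."""
    found = INEXACT.match(line.strip())
    if found is None:
        return None
    return float(found["identity"]), found["note"]


old = parse_feature("  old: tyrosyl-tRNA synthetase (16159-17392 +, 1233 bp)")
print(old["product"], old["end"] - old["start"], old["length"])
print(parse_inexact_header("Inexact match (98.54% ID, old seq longer)"))
```

In every example above, the reported length equals end minus start (17392 - 16159 = 1233), which makes a handy sanity check when parsing.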

Summarising output

Here are a few lines of Bash to generate a summary file. Note that they append to summary (so delete any old copy first) and that grep -P needs GNU grep:

printf "Features in old assembly: %5s\n" $(grep "Features in old assembly" results | grep -oP "\d+") >> summary
printf "Features in new assembly: %5s\n" $(grep "Features in new assembly" results | grep -oP "\d+") >> summary
printf "\n" >> summary
printf "Exact match:              %5s\n" $(grep -c "Exact match" results) >> summary
printf "\n" >> summary
printf "Inexact match:            %5s\n" $(grep -c "Inexact match" results) >> summary
printf "  same length:            %5s\n" $(grep -c "same length" results) >> summary
printf "  new seq longer:         %5s\n" $(grep -c "new seq longer" results) >> summary
printf "  old seq longer:         %5s\n" $(grep -c "old seq longer" results) >> summary
printf "\n" >> summary
printf "In new but not in old:    %5s\n" $(grep -c "In new but not in old" results) >> summary
printf "In old but not in new:    %5s\n" $(grep -c "In old but not in new" results) >> summary
printf "\n" >> summary
printf "No longer hypothetical:   %5s\n" $(grep -c "no longer hypothetical" results) >> summary
printf "Still hypothetical:       %5s\n" $(grep -c "still hypothetical" results) >> summary
printf "Became hypothetical:      %5s\n" $(grep -c "became hypothetical" results) >> summary

The resulting summary file should look something like this:

Features in old assembly:  3458
Features in new assembly:  3427

Exact match:               2937

Inexact match:              354
  same length:                3
  new seq longer:           286
  old seq longer:            65

In new but not in old:      136
In old but not in new:      167

No longer hypothetical:     239
Still hypothetical:         638
Became hypothetical:         84
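The same counts can be produced in Python if you don't have GNU grep handy. This sketch mirrors the Bash above by counting lines that contain each phrase. It also adds a consistency check which is not part of the script: the three identities it tests are inferred from the example numbers above (for instance 2937 + 354 + 167 = 3458), so treat them as an assumption. The function names are made up.

```python
import re

COUNTED = [
    "Exact match", "Inexact match",
    "same length", "new seq longer", "old seq longer",
    "In new but not in old", "In old but not in new",
    "no longer hypothetical", "still hypothetical", "became hypothetical",
]
TOTALS = ["Features in old assembly", "Features in new assembly"]


def summarise(text):
    """Count matching lines per phrase, like `grep -c`, and read the totals."""
    lines = text.splitlines()
    counts = {phrase: sum(phrase in line for line in lines) for phrase in COUNTED}
    for label in TOTALS:
        for line in lines:
            number = re.search(r"\d+", line) if label in line else None
            if number:
                counts[label] = int(number.group())  # first number on the line
                break
    return counts


def is_consistent(c):
    """Check identities inferred from the example summary (an assumption)."""
    matched = c["Exact match"] + c["Inexact match"]
    return (
        c["same length"] + c["new seq longer"] + c["old seq longer"]
        == c["Inexact match"]
        and matched + c["In old but not in new"] == c["Features in old assembly"]
        and matched + c["In new but not in old"] == c["Features in new assembly"]
    )


# The numbers from the example summary above.
example = {
    "Features in old assembly": 3458, "Features in new assembly": 3427,
    "Exact match": 2937, "Inexact match": 354,
    "same length": 3, "new seq longer": 286, "old seq longer": 65,
    "In new but not in old": 136, "In old but not in new": 167,
}
print(is_consistent(example))

# Usage on a real results file:
# with open("results") as handle:
#     print(summarise(handle.read()))
```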

License

GNU General Public License, version 3

