nlabrad / cims-cyanobacterial-its-motif-slicer

A tool to identify and extract the commonly used ITS folding motifs from a 16s-23s rRNA sequence.

License: GNU General Public License v3.0

Python 100.00%
16s-rrna 16s-seq 23s bacteria bioinformatics cyanobacteria phycology rna rnafolding rrna

cims-cyanobacterial-its-motif-slicer's Introduction

CIMS: Cyanobacterial ITS Motif Slicer

CIMS is a tool to extract the commonly used ITS folding motifs from a 16s-23s rRNA sequence. It takes a fasta file or one or more Genbank accession numbers and returns a list of labeled motifs for each of the sequences provided. Dedicated to the cyanobacteria researchers who spend many hours highlighting motifs in MS Word.

Why did we make this tool?


The 16S-23S rRNA internal transcribed spacer (ITS) is a commonly employed phylogenetic marker in cyanobacterial systematics. Examining ITS regions allows researchers to discover congruencies and apomorphies between species of cyanobacteria. This gives the researcher more evidence when erecting new cryptic taxa or analyzing previously unresolved taxonomic relationships. The challenge, however, is that researchers have historically had to dig through sequence data manually to find and identify ITS sequence motifs by eye. This painstaking process deters researchers from using ITS motifs, leads to errors and, not to mention, causes headaches.

We knew there was a better way to do this, so after dissecting the manual process, we created CIMS.

CIMS finds the commonly used ITS folding motifs such as D1-D1’, Box B, tRNA-ile and tRNA-ala to ensure researchers are using homologous operons when comparing ITS secondary structures between taxa.

What does it do again?


  • Runs as a terminal application written in Python.
  • Processes one or more Genbank accession numbers or a fasta file with one or more sequences.
  • Automatically talks to Genbank for you so you don't have to download the fasta files yourself.
  • Returns text output with the motifs identified and their lengths for you to use as you please.

In the current version of the software, the motifs included in the standard output are:

  • Leader
  • D1-D1’
  • Spacer – D2 – Spacer
  • tRNA-ala
  • Spacer – V2 – Spacer
  • tRNA-ile
  • Box B
  • D4
  • Box A
  • V3

Installation


Pre-Requisites

We get it, you're a biologist, we got you. All you need is a beginner-level knowledge of the terminal... maybe not even that much. If you know how to browse to a directory (cd) and run an executable (./cims), you're good to go.

Simple Method: Download the pre-packaged files from Releases.

To keep things simple, we pre-packaged CIMS with all its dependencies into a single file and compiled it for Windows, Linux and MacOS. These files are available under Releases.

  • Download the zip file that corresponds to your system.
  • Unzip it in whichever directory you'd like.
  • You're done! 👐
  • To run, open your favorite Terminal, cd to that directory, and run CIMS as an executable, usually by typing ./cims.

To keep things simple, we suggest saving CIMS to the directory where you'll have the fasta files you want to process. If you're pulling your sequences straight from Genbank, it doesn't really matter.

Advanced Method: Download the Python script.

If you want to adjust the flanking regions or make other changes to the code, you can simply download cims.py and run it with Python. (But you probably already knew that if that's what you wanted.)

To run CIMS you will need:

  • Python 3
  • Biopython: $ pip install biopython
  • Colorama: $ pip install colorama

Biopython allows CIMS to communicate with Genbank to download sequences. Colorama allows us to easily output the motifs in pretty colors.

Once you have those dependencies installed (either globally or in a virtual environment), simply run cims.py.

Usage


CIMS runs in the terminal. It gets its sequences either from a fasta file or by fetching them from Genbank based on accession numbers. The input must therefore be either a fasta file with one or more properly formatted 16s-23s ITS sequences, or one or more Genbank accession numbers that each point to a 16s-23s ITS sequence.
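As a rough illustration of what "properly formatted" means here, the sketch below reads fasta-style text into a dictionary of sequences using only the standard library. It is not CIMS's own parser (CIMS depends on Biopython), and the function name read_fasta and the example records are made up for this illustration.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect fasta records as {header: sequence}; hypothetical helper."""
    records: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines may wrap; upper-case them and join later.
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up input: two records, the first wrapped over two lines.
example = [">seq1", "ACGTAC", "gttt", "", ">seq2", "CCTCCTT"]
print(read_fasta(example))
```

Because an open file is also an iterable of lines, the same function could be called as read_fasta(open("myfasta.fasta")).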

Navigate to the location where CIMS was saved.

For example, in Windows, you'd use cd to move to a directory as such:

cd C:/Users/{your-username}/Desktop/PathtoFile

Or in Linux/Mac:

cd /home/{your username}/{where you downloaded cims}

To run CIMS, simply execute it by running ./cims or python cims.py from the directory where it was saved.

When you run CIMS in your terminal, the output includes all motifs found in the sequences given to the program. To save the output of a run, redirect it to a text file with “>>”, which appends to the file if it already exists (use “>” instead to overwrite it):

cims -f myfasta.fasta >> motifs.txt

The list of flags, arguments and their descriptions are below:

Usage: cims [-f or -g] [file or accession number] [OPTIONS]

Options:
-f, --fasta PATH-TO-FASTA-FILE                                             Provide FASTA to be processed.
-g, --genbank ACCESSION1 [ACCESSION2 ...]                                  Provide one or more Genbank Accession Numbers to fetch and process.
-s, --select {leader,d1d1,sp_v2_sp,trna_ile,trna_ala,boxa,boxb,d4,v3,all}  Select which motifs to print out. By default it prints all.
-e, --email                                                                Provide an email to be used when querying Genbank. An NCBI requirement.
-j, --json                                                                 Create a json file in the working directory with the output.
-t, --trna                                                                 Returns ONLY how many tRNAs were found per sequence. 
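For readers who want to see how an option table like this maps onto code, here is a minimal argparse sketch that mirrors the documented flags. It is an illustration only, not the parser that ships with CIMS: the mutual exclusion of -f and -g, the space-separated -s values, and the stripping of trailing commas are all assumptions made for this example.

```python
import argparse

# Motif names copied from the documented -s/--select choices.
MOTIFS = ["leader", "d1d1", "sp_v2_sp", "trna_ile", "trna_ala",
          "boxa", "boxb", "d4", "v3", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser resembling the documented CIMS options (sketch)."""
    parser = argparse.ArgumentParser(
        prog="cims", description="Extract ITS folding motifs.")
    # Assumption: exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--fasta", metavar="PATH-TO-FASTA-FILE",
                        help="Provide FASTA to be processed.")
    source.add_argument("-g", "--genbank", nargs="+", metavar="ACCESSION",
                        help="Genbank accession numbers to fetch.")
    # Assumption: tolerate trailing commas such as "d1d1," in the list.
    parser.add_argument("-s", "--select", nargs="+", choices=MOTIFS,
                        default=["all"],
                        type=lambda value: value.strip(",").lower(),
                        help="Select which motifs to print out.")
    parser.add_argument("-e", "--email",
                        help="Email to use when querying Genbank.")
    parser.add_argument("-j", "--json", action="store_true",
                        help="Write a json file with the output.")
    parser.add_argument("-t", "--trna", action="store_true",
                        help="Only report how many tRNAs were found.")
    return parser


args = build_parser().parse_args(["-g", "KU574618.1", "-s", "d1d1,", "boxb"])
print(args.genbank, args.select)
```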

Examples:

cims -f allmycyanos.fasta

Result: CIMS will process the provided fasta file and return all the motifs it finds.

cims -f /home/me/fasta/limnothrix_16-23_ITS.fasta -s d1d1, trna_ile, trna_ala, boxb

Result: Processes the limnothrix_16-23_ITS.fasta file stored in a directory that resides in /home/me/fasta and asks CIMS to only output d1d1, the tRNAs and BoxB motifs.

cims -g KU574618.1 -e {your-email}

Result: Fetches the sequence of KU574618.1 from Genbank (providing an email that is required by NCBI), processes the sequence, and returns the motifs.
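CIMS relies on Biopython for this step. As a dependency-free illustration of what such a request involves, the sketch below builds a query for NCBI's public E-utilities efetch endpoint with the standard library. The function names build_efetch_url and fetch_fasta are hypothetical, and this is not necessarily how CIMS itself issues the request.

```python
from typing import Iterable
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions: Iterable[str], email: str) -> str:
    """Return an efetch URL asking for fasta text (hypothetical helper)."""
    query = {
        "db": "nucleotide",          # nucleotide sequence database
        "id": ",".join(accessions),  # one or more accession numbers
        "rettype": "fasta",
        "retmode": "text",
        "email": email,              # NCBI asks callers to identify themselves
    }
    return f"{EFETCH_URL}?{urlencode(query)}"


def fetch_fasta(accessions: Iterable[str], email: str) -> str:
    """Download the records as fasta text; needs network access."""
    with urlopen(build_efetch_url(accessions, email), timeout=30) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_fasta() to actually download.
print(build_efetch_url(["KU574618.1"], "me@example.org"))
```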

cims -f allmycyanos.fasta -t

Result: Processes the provided fasta file and returns how many tRNAs were found in each sequence. This makes it easy to check whether the sequences in the fasta file come from homologous operons.

Note: If you ever get lost, you can always run cims -h or python cims.py -h to get a quick reference of the available options.

Possible errors:


1. “Could not find the end of 16S to determine the ITS region boundaries”

This error means that the sequence given to the software did not contain the sequence that represents the end of the 16S region (CCTCCTT). If you fed the program the ITS region only, you may proceed with the run and everything will run as normal. Otherwise, abort the run for that sequence by typing “N” when prompted “Proceed with search anyway? (Y/N)”. This allows the program to move on to the next sequence in the fasta file, or lets you try again with another file or accession number.

2. “Region length too short. Skipped.”

This will be printed if the ITS region after the end of the 16S gene is under 20 bp. This check removes sequences with ITS regions that are too short to contain any of the motifs.

3. “Not found in this sequence.”

This output will be printed when a particular motif was not found in the ITS sequence. This could be because the flanking regions are unique or otherwise rare, so the software did not find them. If this happens frequently in your dataset, please report it to us on the “Issues” page of the GitHub repository so that we can address it and improve the code.

4. “Not present in this operon”

This will be printed only for tRNAs. If the program does not find tRNA-ala or tRNA-ile, it will assume that this operon does not contain one or both tRNAs. Remember, it is best to use homologous operons when comparing ITS motifs between taxa (i.e., operons containing the same number of tRNAs).
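The first two checks above can be sketched in a few lines of Python. This is a simplified illustration, not the CIMS source: the function name slice_its is hypothetical, using the first occurrence of CCTCCTT is an assumption, and CIMS prompts the user where this sketch simply raises an error.

```python
END_OF_16S = "CCTCCTT"  # end-of-16S marker named in the README
MIN_ITS_LENGTH = 20     # minimum ITS length (bp) named in the README


def slice_its(sequence: str) -> str:
    """Return the part of the sequence after the 16S end (sketch)."""
    seq = sequence.upper()
    # Assumption: the first occurrence of the marker is the 16S end.
    position = seq.find(END_OF_16S)
    if position == -1:
        # CIMS asks "Proceed with search anyway? (Y/N)" at this point.
        raise ValueError("Could not find the end of 16S to determine "
                         "the ITS region boundaries")
    its = seq[position + len(END_OF_16S):]
    if len(its) < MIN_ITS_LENGTH:
        raise ValueError("Region length too short. Skipped.")
    return its


# Made-up sequence: a short 16S tail followed by a 25 bp "ITS".
print(slice_its("aaaCCTCCTT" + "G" * 25))
```

Taking the first occurrence means that an earlier, unrelated CCTCCTT in the input would shift the boundary, so a real implementation may need a more careful search.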

cims-cyanobacterial-its-motif-slicer's People

Contributors

callmcgovern, nlabrad

Forkers

callmcgovern

cims-cyanobacterial-its-motif-slicer's Issues

earlier CCTCCTT prevents finding leader/D1

Code worked and found boxb but couldn't find leader or D1. D1 has typical start and end sequences, so I think it may be because there is an earlier "CCTCCTT" in the sequence.

accession number:
MT135015.1

Two ending sequences of D1D1

Sometimes there are two D1D1 endings close together and the program takes the first one, but a few times (especially when the sequence was 30-40ish base pairs) it needed the second D1D1 ending. Sometimes it will still fold nicely (but will only have one bubble), but usually the structure is broken. Some sequences I ran that had this issue were MT764787.1, MG255294.1, KF941246.1, KF941239.1. I'm not sure if it's always the case that the second one should be chosen or just sometimes when the sequence is too short, but I saw it happen a few times.

Set limit to how long BoxB can be

Please set a limit to how long boxB can be, because currently there are times when 7 possible boxB sequences will be in the output and some of them are wayyyy too long. Let's cap it at 80 for now. Thanks

Do double output for Box B when needed.

Sometimes it finds the beginning of BoxB too early. There can be another true beginning for boxB a couple of basepairs away from the one it grabs. Can we return both possibilities to the user?

BoxB missed when beginning w CAGCAT

Ran this Rivularia ITS sequence (fasta file attached) through the script, and found that it correctly identified everything except boxB--which surprised me, because boxB in this sequence begins and ends with the familiar [CAGCA]...[TGCTG] pattern. But then I looked at the script and saw that it's looking for CAGCA(AorC) at the beginning of boxB, and wouldn't ya know it: this boxB begins with CAGCAT!
rivularia_input.txt
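To make the report above concrete, here is a small regex sketch. It is not the CIMS code: it only assumes what the issues in this list describe, namely a boxB that starts with CAGCA plus one restricted base, ends with TGCTG, and (as requested in an earlier issue) is capped at 80 bases. The function name find_boxb and the test sequence are made up.

```python
import re
from typing import List

MAX_BOXB_LENGTH = 80  # cap suggested in the "Set limit" issue


def find_boxb(its: str, sixth_base: str = "AC",
              max_length: int = MAX_BOXB_LENGTH) -> List[str]:
    """Return candidate boxB sequences (hypothetical sketch)."""
    # The fixed start (6 bases) and end (5 bases) use up 11 positions.
    inner = max_length - 11
    pattern = re.compile(
        rf"CAGCA[{sixth_base}][ACGT]{{0,{inner}}}?TGCTG")
    return [match.group(0) for match in pattern.finditer(its.upper())]


# Made-up fragment whose boxB begins with CAGCAT.
fragment = "GGCAGCATTTAAATGCTGAA"
print(find_boxb(fragment))         # strict start: CAGCA followed by A or C
print(find_boxb(fragment, "ACT"))  # relaxed start: also allow T
```

With the strict start nothing is reported for this fragment, while allowing T as the sixth base recovers the candidate. Whether relaxing the start produces false positives in real data would need checking.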

BoxB sequences not found in fischerella spp

Can't figure out why this BoxB sequence couldn't be found in two Fischerella spp (order Nostocales, both spp have the same boxb sequence): TAGCATCTGAATGAAAATATTCAGGCTGCTG
Accession numbers for sequences: DQ786173.1 and DQ786171.1

wrong boxb found with newest code

previous version(s?) of the code identified boxb correctly, but now identifies incorrect sequences as boxb; all begin with "TAGCA"
accession numbers for sequences with this issue (all in order Nostocales):
KF417427.1
MK953008.1
MN15981.1

quick fix for when using the -t flag

When using the -t flag I found that the output says "tRNA1:" and "tRNA1:" instead of "tRNA1:" and "tRNA2:" or, better yet, "tRNAile:" and "tRNAala:".

Monolithic file

Change layout of the project to be modular instead of a monolithic single file.

Synechococcus and Cyanobium contain unique 16s ending and d1d1 beginning!

I have found that these two genera (which are commonly examined) have "CCTCCTA" as the end of their 16s and "GACAA" as the beginning of their D1D1 region. Perhaps we can code these possibilities into the software, or even have a question that asks, "Is this a sequence from Synechococcus or Cyanobium?" when the software finds CCTCCTA instead of CCTCCTT.
