
nlabrad / cims-cyanobacterial-its-motif-slicer


A tool to identify and extract the commonly used ITS folding motifs from a 16S-23S rRNA sequence.

License: GNU General Public License v3.0

Python 100.00%
16s-23s 16s-rrna 16s-seq 23s bacteria bioinformatics cyanobacteria help-wanted phycology phylogeny rna rna-folding rnafolding rrna

cims-cyanobacterial-its-motif-slicer's Issues

Synechococcus and Cyanobium contain a unique 16S ending and D1D1 beginning!

I have found that these two genera (which are commonly examined) have "CCTCCTA" as the end of their 16S and "GACAA" as the beginning of their D1D1 region. Perhaps we can code these possibilities into the software, or even have a prompt that asks, "Is this a sequence from Synechococcus or Cyanobium?" when the software finds CCTCCTA instead of CCTCCTT.
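A minimal sketch of the first suggestion, assuming the 16S end is located with a regular expression. The names `SIXTEEN_S_END` and `find_16s_end` are hypothetical and do not come from the project's code; the returned variant lets the caller decide whether to ask the Synechococcus/Cyanobium question.

```python
import re

# Hypothetical pattern: accept the usual CCTCCTT or the CCTCCTA variant
# reported for Synechococcus and Cyanobium.
SIXTEEN_S_END = re.compile(r"CCTCCT[TA]")

def find_16s_end(seq):
    """Return (end_index, matched_variant) for the first 16S-end match, or None."""
    match = SIXTEEN_S_END.search(seq.upper())
    if match is None:
        return None
    return match.end(), match.group()

result = find_16s_end("ggatcacctcctagacaattt")
if result is not None and result[1] == "CCTCCTA":
    # The caller could warn or prompt the user here.
    print("Variant 16S ending found; D1D1 may begin with GACAA.")
```

Returning the matched variant instead of silently accepting it keeps the decision visible to the user.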

Two ending sequences of D1D1

Sometimes there are two D1D1 endings close together, and the program takes the first one. A few times, though (especially when the sequence was around 30-40 base pairs), it needed the second D1D1 ending. Sometimes it will still fold nicely (but will only have one bubble), but usually the structure is broken. Some sequences I ran that had this issue were MT764787.1, MG255294.1, KF941246.1, and KF941239.1. I'm not sure whether the second one should always be chosen or only when the sequence is too short; I just saw it happen a few times.

Earlier CCTCCTT prevents finding leader/D1

The code worked and found BoxB but couldn't find the leader or D1. D1 has typical start and end sequences, so I think the cause may be an earlier "CCTCCTT" that occurs in the sequence.

accession number:
MT135015.1
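One way to investigate a case like this is to list every occurrence of the 16S-end motif instead of stopping at the first. The sketch below is only an illustration: `all_16s_end_positions` is a hypothetical helper, and which occurrence is the true 16S end is left to the caller, since the issue does not say.

```python
import re

def all_16s_end_positions(seq, motif="CCTCCTT"):
    """Return the index just past each occurrence of motif, in order."""
    seq = seq.upper()
    # A zero-width lookahead reports overlapping occurrences as well.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [m.start() + len(motif) for m in pattern.finditer(seq)]

positions = all_16s_end_positions("AACCTCCTTGGGCCTCCTTAA")
if len(positions) > 1:
    print(f"Multiple candidate 16S ends at {positions}; leader/D1 search may need a later one.")
```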

Set limit to how long BoxB can be

Please set a limit on how long BoxB can be. Currently there are times when 7 possible BoxB sequences appear in the output, and some of them are way too long. Let's cap it at 80 for now. Thanks
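A small sketch of such a cap, assuming the candidates are available as plain strings and that "80" means 80 bases. `MAX_BOXB_LEN` and `filter_boxb` are hypothetical names, not part of the tool.

```python
MAX_BOXB_LEN = 80  # cap suggested in the issue; unit assumed to be bases

def filter_boxb(candidates, max_len=MAX_BOXB_LEN):
    """Keep only BoxB candidates no longer than max_len."""
    return [c for c in candidates if len(c) <= max_len]

kept = filter_boxb(["A" * 30, "A" * 80, "A" * 81])
print([len(c) for c in kept])
```

Keeping the limit as a parameter makes it easy to adjust later if 80 turns out to be too strict.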

Do double output for Box B when needed.

Sometimes it finds the beginning of BoxB too early. There can be another, true beginning for BoxB a couple of base pairs away from the one it grabs. Can we return both possibilities to the user?
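A rough sketch of returning every possibility, assuming BoxB is delimited by the CAGCA...TGCTG pattern mentioned in another issue on this page. `boxb_candidates` is a hypothetical helper, and the real start and end motifs used by the tool may differ.

```python
def boxb_candidates(seq, start="CAGCA", end="TGCTG"):
    """Return one candidate per occurrence of start, each running to the next end motif."""
    seq = seq.upper()
    candidates = []
    pos = seq.find(start)
    while pos != -1:
        stop = seq.find(end, pos + len(start))
        if stop != -1:
            candidates.append(seq[pos:stop + len(end)])
        # Step by one so nearby or overlapping starts are not skipped.
        pos = seq.find(start, pos + 1)
    return candidates

for candidate in boxb_candidates("AACAGCATTCAGCAAAATGCTGCC"):
    print(candidate)
```

Each candidate could then be shown to the user, who picks the one that folds correctly.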

Monolithic file

Change layout of the project to be modular instead of a monolithic single file.

BoxB missed when beginning w CAGCAT

I ran this Rivularia ITS sequence (FASTA file attached) through the script and found that it correctly identified everything except BoxB, which surprised me, because BoxB in this sequence begins and ends with the familiar [CAGCA]...[TGCTG] pattern. But then I looked at the script and saw that it's looking for CAGCA(A or C) at the beginning of BoxB, and wouldn't ya know it: this BoxB begins with CAGCAT!
rivularia_input.txt
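To illustrate the mismatch, here is a sketch comparing the start pattern as described above with a widened one. Both regular expressions are assumptions based on this report, not copied from the script, and the test fragment is made up.

```python
import re

# Start pattern as described by the reporter: CAGCA followed by A or C.
DESCRIBED_START = re.compile(r"CAGCA[AC]")
# Hypothetical widened pattern that also accepts T in the sixth position.
WIDENED_START = re.compile(r"CAGCA[ACT]")

fragment = "GGCAGCATTTGAAATGCTGAA"  # made-up fragment whose BoxB starts with CAGCAT

print(DESCRIBED_START.search(fragment) is None)   # True: the described pattern misses it
print(WIDENED_START.search(fragment).group())     # CAGCAT
```

Widening the pattern could also produce more false starts, so it may be worth combining with a length limit.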

BoxB sequences not found in Fischerella spp.

I can't figure out why this BoxB sequence couldn't be found in two Fischerella spp. (order Nostocales; both species have the same BoxB sequence): TAGCATCTGAATGAAAATATTCAGGCTGCTG
Accession numbers for the sequences: DQ786173.1 and DQ786171.1

Wrong BoxB found with newest code

Previous version(s?) of the code identified BoxB correctly, but the newest code identifies incorrect sequences as BoxB; all of them begin with "TAGCA".
Accession numbers for sequences with this issue (all in order Nostocales):
KF417427.1
MK953008.1
MN15981.1

Quick fix for when using the -t flag

When using the -t flag, I found that the output says "tRNA1:" and "tRNA1:" instead of "tRNA1:" and "tRNA2:" or, better yet, "tRNAile:" and "tRNAala:".
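A minimal sketch of the labelling fix, assuming the tRNA sequences are collected in a list in the order they were found. `format_trna_lines` is a hypothetical function; passing names such as "tRNAile" and "tRNAala" assumes the first tRNA found is always the isoleucine one, which the tool would need to verify.

```python
def format_trna_lines(trna_seqs, names=None):
    """Label each tRNA with a supplied name, or with tRNA1, tRNA2, ... by position."""
    lines = []
    for i, seq in enumerate(trna_seqs, start=1):
        label = names[i - 1] if names and i <= len(names) else f"tRNA{i}"
        lines.append(f"{label}: {seq}")
    return lines

print("\n".join(format_trna_lines(["GGG", "CCC"])))
print("\n".join(format_trna_lines(["GGG", "CCC"], names=["tRNAile", "tRNAala"])))
```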
