straightlab / flypipe

Pipeline to process single end ChARseq reads to a combined, filtered and annotated matrix of RNA to DNA contacts

Home Page: https://elifesciences.org/articles/27024

License: GNU General Public License v3.0

Python 44.68% Shell 55.32%
char-seq drosophila

flypipe's Introduction

README

Usage:
bash charseq_flypipe.sh [/path/to/R01.fastq.gz] [/complete/desired/output/path/]
[# of cores to use] [dirty or cleanup]

The first argument is the single-end gzipped fastq file.
The second is the directory where you would like your output files written.
The third is the number of cores you'd like to use
    [enter 1 if you don't want to parallelize].
The fourth is optional: type cleanup to automatically gzip intermediate files
    after the run to save space.

e.g.:
bash charseq_flypipe.sh /data/cl8-test.fastq.gz /outputdirectory 4
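
If you launch runs from a script, the argument order above can be wrapped in a small helper. This is only a sketch and not part of flypipe: the function name build_flypipe_command and its keyword arguments are hypothetical, and it assumes charseq_flypipe.sh is reachable at the path you pass in.

```python
import subprocess


def build_flypipe_command(fastq, outdir, cores=1, cleanup=False,
                          script="charseq_flypipe.sh"):
    """Return the argument list for one run, in the order shown under Usage."""
    if not fastq.endswith(".gz"):
        raise ValueError("expected a gzipped fastq file: %s" % fastq)
    if cores < 1:
        raise ValueError("cores must be at least 1")
    cmd = ["bash", script, fastq, outdir, str(cores)]
    if cleanup:
        # Optional fourth argument: gzip intermediate files after the run.
        cmd.append("cleanup")
    return cmd


if __name__ == "__main__":
    cmd = build_flypipe_command("/data/cl8-test.fastq.gz", "/outputdirectory", cores=4)
    print(" ".join(cmd))
    # subprocess.check_call(cmd)  # uncomment to actually start the pipeline
```

Building the list first and running it separately makes it easy to log or inspect the exact command before a long run.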

Output files are:
.ChARdnaRna.prioritized.sense.txt / .ChARdnaRna.prioritized.antisense.txt
    where RNA contacts are prioritized by annotation in the following order:
	tRNA miscRNA ncRNA transcript three_prime_UTR five_prime_UTR exon intron miRNA gene gene_extended2000
	e.g.: a read matching both an ncRNA and a transcript would receive the ncRNA annotation
	sense contains RNAs aligned in the sense direction of the annotated transcript
	antisense contains RNAs that did not initially align in the sense direction but did align antisense
	Header:
	DNA chrom	DNA pos	DNA posplus1	unique ID	DNA adjusted pos	DNA strand	DNA MAPQ	DNA CIGAR	RNA Fb..(gn/tr/...)	RNA pos in transcript ref	RNA MAPQ	RNA CIGAR	RNAref type	RNAref chrom	RNAref bed start	RNAref bed stop	RNAref name (transcriptome specific)	RNAref pos adjusted	RNA chrom	RNAref TSS	RNAref TTS	RNAref Splice sites	RNA chrom	RNA chrom pos	RNA chrom posplus1	RNA chrom pos adjusted	RNA chrom (read) strand	RNA gene name	RNA transcript strand	RNA transcript length	RNA read length	Parentgn	RNA gene TSS	RNA gene TTS	RNA ref chrom	RNA gene bed start	RNA gene bed stop	RNA gene length
.ChARdnaRna.ribominus.sense.txt / .ChARdnaRna.ribominus.antisense.txt
	as above but without 'rRNA' or 'FBgn0085813',
	ribosomal genes that occupy much of the dataset
.ChARdnaRna.notranscriptome
	contains RNA to DNA contacts where the RNA was not assigned to a transcript from flybase
	Header:
	UniqueID	DNA chrom	DNA pos 	DNA MAPQ	DNA CIGAR	DNA length	DNA strand	RNA chrom	RNA pos 	RNA MAPQ	RNA CIGAR	RNA length	RNA strand
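
The prioritization and ribominus rules described above can be illustrated with a short Python sketch. It is not the pipeline's own code: the names pick_annotation and is_ribosomal are hypothetical, the (annotation_type, record) pair format is an assumption, and treating the ribominus rule as a plain substring match on each line is also an assumption.

```python
# Annotation types in the priority order listed above (first wins).
PRIORITY = ["tRNA", "miscRNA", "ncRNA", "transcript", "three_prime_UTR",
            "five_prime_UTR", "exon", "intron", "miRNA", "gene",
            "gene_extended2000"]
RANK = {name: i for i, name in enumerate(PRIORITY)}


def pick_annotation(hits):
    """Return the highest-priority hit for one read, or None.

    hits is an iterable of (annotation_type, record) pairs; this pair
    format is an assumption made for the example.
    """
    known = [h for h in hits if h[0] in RANK]
    if not known:
        return None
    return min(known, key=lambda h: RANK[h[0]])


def is_ribosomal(line):
    """Assumed ribominus rule: the line mentions 'rRNA' or 'FBgn0085813'."""
    return "rRNA" in line or "FBgn0085813" in line
```

For example, pick_annotation([("transcript", "t1"), ("ncRNA", "n1")]) returns the ncRNA hit, matching the rule that a read matching both an ncRNA and a transcript receives the ncRNA annotation.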


Dependencies:
	You must have the following programs in your path
		python			https://www.python.org/downloads/
		bowtie2			http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
		samtools		http://www.htslib.org/
		super_deduper	https://github.com/dstreett/Super-Deduper
	And the following Python packages:
		os
		re
		sys
		bz2
		time
		gzip
		string
		argparse
		itertools
		optparse
		biopython AlignIO and SeqIO
	The program will automatically check for these and notify you if you're missing one.
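
	If you want to run a similar check yourself before starting, one way to do it is sketched below in Python 3. This is not the check flypipe performs: the function name missing_dependencies is hypothetical, and it assumes the executables are named exactly as listed above (for example, super_deduper).

```python
import importlib
import shutil

# Names taken from the dependency list above; executable names are assumed.
PROGRAMS = ["python", "bowtie2", "samtools", "super_deduper"]
MODULES = ["os", "re", "sys", "bz2", "time", "gzip", "string", "argparse",
           "itertools", "optparse", "Bio.AlignIO", "Bio.SeqIO"]


def missing_dependencies(programs=PROGRAMS, modules=MODULES):
    """Return programs not found on PATH, then modules that fail to import."""
    missing = [p for p in programs if shutil.which(p) is None]
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_dependencies():
        print("missing: %s" % name)
```

	Note that this sketch checks the interpreter it runs under, so run it with the same Python you intend to use for the pipeline if you want the module results to be meaningful.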


Description of code:
	Beginning with gzipped fastq files: PCR duplicates are removed using Super Deduper (Petersen, 2015)
	using the entire read length. Adapters are trimmed with Trimmomatic (Bolger, 2014) using a 
	composite set of Illumina adapters, leading/trailing cut length of 3, slidingwindow of 4:15, 
	and a minimum length of 36. RNA and DNA portions of the read are identified and split by a 
	custom python program that finds the bridge, uses its polarity to identify which portion is RNA 
	and which is DNA, and verifies that there is only one copy of the bridge. DNA is aligned to 
	dm3/release 5 using bowtie2 (Langmead, 2012) with the --very-sensitive option, except that one 
	mismatch is allowed. RNA is aligned to all transcriptomes from flybase (version r5.57, downloaded 
	January 2016) with the same parameters, first forcing forward strandedness and then, for reads 
	that didn't align in the sense direction, permitting antisense alignment. Reads in which both the 
	sense RNA and the DNA aligned are linked, prioritizing the transcriptome annotations in the order 
	tRNA, miscRNA, ncRNA, transcript, three_prime_UTR, five_prime_UTR, exon, intron, miRNA, gene, 
	gene_extended2000 (e.g.: a read matching both an ncRNA and a transcript would receive the ncRNA 
	annotation). If a read is not caught by a sense alignment but is caught by an antisense alignment, 
	it is linked and ordered in the same manner. If an RNA read did not map to a transcriptome but did 
	map to the genome, it is written to a separate file with no annotation. Annotated (sense and 
	antisense) reads then have rRNAs removed.
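
	The bridge-splitting step can be pictured with the following Python sketch. It is an illustration only and not the custom program used by the pipeline: the function names are hypothetical, it matches the bridge exactly (no mismatches), and it returns the two flanking pieces as left and right without saying which is RNA and which is DNA, because that mapping from bridge orientation is not spelled out here. The bridge sequence must be supplied by the caller.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T/N sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_occurrences(seq, pattern):
    """Count occurrences of pattern in seq, including overlapping ones."""
    count, start = 0, seq.find(pattern)
    while start != -1:
        count += 1
        start = seq.find(pattern, start + 1)
    return count


def split_on_bridge(read, bridge):
    """Split a read around a single exact copy of the bridge.

    Returns (orientation, left, right), or None when the read contains
    no bridge or more than one copy (counting both orientations).
    """
    rc = reverse_complement(bridge)
    forward = count_occurrences(read, bridge)
    reverse = 0 if rc == bridge else count_occurrences(read, rc)
    if forward + reverse != 1:
        return None
    if forward:
        pattern, orientation = bridge, "forward"
    else:
        pattern, orientation = rc, "reverse"
    i = read.find(pattern)
    return orientation, read[:i], read[i + len(pattern):]
```

	Rejecting reads with zero or multiple bridge copies mirrors the requirement above that there be only one copy of the bridge in each read.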

	super deduper paper: http://dl.acm.org/citation.cfm?id=2811568
	trimmomatic paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/
	bowtie2: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322381/
	
	NOTE:   Requires Python 2.7 (does not work with Python 3) and is intended to run in a Linux environment, 
	        but can be modified to run on a mac by replacing "zcat" with "gunzip -c"

flypipe's People

Contributors

afstraight


flypipe's Issues

Why doesn't bowtie2 produce SAM files directly?

Hi,
In your process, after splitting the raw fastqs into dna.fasta and rna.fastq and mapping them against genome.fasta, you use bowtie2 and immediately convert the output to BAM with samtools.
My question is: why don't you produce SAM files and use 'samtools view -F 4' and awk to make them simple.txt files?
I think doing this would save time and memory. Looking forward to your reply!
Any help would be appreciated!
Haifeng Sun
China NanJing Medical University
20181114

dmel_coord.dict directory: how can this be generated for another model organism?

In the code, the add_gene_info_to_transcriptome.py file looks for a directory called /dmel_coords.dict/ and a file called dmel-all-$transcriptome-r5.57.coords.dict. It's unclear whether this file is generated by the pipeline or whether we need to provide it.

If it is something we need to provide, how would we generate it for a different model organism, e.g. the mouse?

Thanks for the clarification!
