chasewnelson / ebt: Evolutionary Bioinformatics Toolkit (EBT)
License: GNU General Public License v3.0
Y/N?
Dear SNPGenie/CHASeq authors:
I would like to use CHASeq to prepare GTF, FASTA, and VCF files for reverse-strand genes/products as input to SNPGenie. However, I got the message below when I ran the Perl script. I was wondering whether I did something wrong; if not, I hope it can be fixed soon.
seq length is 51304566
Two products have the same starting position, causing an error.
Please contact script author for a revision.
1). The GTF was downloaded from an old version of the Ensembl annotation (ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens/Homo_sapiens.GRCh37.73.gtf.gz) and filtered for CDS records whose total length is a multiple of 3. This was done with the following commands (e.g. for chr22):
## Get the IDs of only those ORFs/CDSs that have both a start codon and a stop codon. These ORFs/CDSs should also be annotated as protein-coding genes. The final IDs have the form gene(ENSG???????????)_transcript(ENST???????????)
zcat Homo_sapiens.GRCh37.73.gtf.gz | awk '{if($3=="start_codon"){a[$12]=0;print $12"\tstart";} if($3=="stop_codon"){b[$12]=0;print $12"\tstop";} } ' | sort -u | cut -f1| uniq -d | sed 's/"//g;s/;//;' > tmp
zcat Homo_sapiens.GRCh37.73.gtf.gz | awk 'BEGIN{OFS="\t"} FILENAME==ARGV[1]{aa[$1]=0;} $1==22 && $2=="protein_coding" && $3=="CDS"{split($10,a,"\"");split($12,b,"\"");if(b[2] in aa){print $1,$2,$3,$4,$5,$6,$7,$8,"gene_id \""a[2]"_"b[2]"\"";}}' tmp - > chr22.gtf
## Filter for ORFs/CDSs whose total lengths are multiples of 3.
sort -k9 -k4 chr22.gtf | awk ' {if($10==a){len=$5-$4+1+len;}else{if(len % 3 != 0 ){print a;}a=$10;len=$5-$4+1; } } END{if(len % 3 != 0 ){print a;}}' > tmp
grep -v -f tmp chr22.gtf > chr22.filter.gtf
2). The VCF was downloaded from the 1000 Genomes Project phase 3 (http://www.internationalgenome.org/data, and more specifically ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
3). I got the FASTA file from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz).
4). I ran vcf2revcom.pl:
vcf2revcom.pl chr22.fa chr22.filter.gtf ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf
P.S. The input order is not VCF, FASTA, and then GTF as written on the GitHub page (https://github.com/chasewnelson/CHASeq), but rather the order shown above.
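For anyone hitting the same message, here is a minimal sketch of a check on the filtered GTF. It assumes, without confirmation from the script itself, that "same starting position" refers to CDS records from different IDs (column 9) sharing a start coordinate (column 4) on the same chromosome and strand. The function name `shared_cds_starts` is hypothetical and not part of CHASeq.

```shell
# Hypothetical helper (not part of CHASeq): list start coordinates that are
# shared by CDS records belonging to more than one ID in column 9.
# Assumption: the error message refers to such shared start coordinates.
# Input: a tab-separated GTF file such as the one produced in step 1.
shared_cds_starts() {
    awk -F'\t' '
        $3 == "CDS" {
            key = $1 "\t" $4 "\t" $7          # chromosome, start, strand
            if (!((key, $9) in seen)) {       # count each ID once per start
                seen[key, $9] = 1
                n[key]++
            }
        }
        END {
            # Print chromosome, start, strand, and number of distinct IDs
            for (key in n)
                if (n[key] > 1)
                    print key "\t" n[key]
        }
    ' "$1" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

Calling `shared_cds_starts chr22.filter.gtf` prints one line per shared start. Any output would suggest that several transcripts of the same gene begin a CDS segment at the same coordinate, which may be what the script rejects.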