GithubHelp home page GithubHelp logo

bioawk's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

bioawk's Issues

bioawk trimming the protein sequences in the start

Hello,

I am trying to convert the .faa format protein sequences into OrthoMCL readable format (organism_ID|protein_ID) using the bioawk -c fastx '{ print ">GMI1000|"$name; print $seq }'. I am only getting the results with around 900 sequences out of 4000. I found that bioawk is not reading the sequences from first 3000 proteins in the .faa format.
Is there any way to solve this problem?

Thank you very much in advance!!

Make array of VCF info and GFF group field.

GTF file:

$ zcat ./test.gtf.gz
chr21   hg19_knownGene  exon    33026871    33027740    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33030247    33030540    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33031710    33031813    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  start_codon 33032083    33032085    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33032083    33032154    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33031935    33032154    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33036103    33036199    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33036103    33036199    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33038762    33038831    0.000000    +   2   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33038762    33038831    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33039571    33039688    0.000000    +   1   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33039571    33039688    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33040784    33040888    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  stop_codon  33040889    33040891    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33040784    33041243    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

List some fields of GTF with bioawk:

$ ./bioawk -c gff '{ print $feature,$start,$group }' ./test.gtf.gz
exon    33026871    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33030247    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33031710    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
start_codon 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33031935    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
stop_codon  33040889    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

It would be nice if for the GFF group field and the VCF info field, $group and $info,
is an array which used each subfeature as key:

$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz
exon    33026871    uc002yoz.1  uc002yoz.1
exon    33030247    uc002yoz.1  uc002yoz.1
exon    33031710    uc002yoz.1  uc002yoz.1
start_codon 33032083    uc002ypa.3  uc002ypa.3
CDS 33032083    uc002ypa.3  uc002ypa.3
exon    33031935    uc002ypa.3  uc002ypa.3
CDS 33036103    uc002ypa.3  uc002ypa.3
exon    33036103    uc002ypa.3  uc002ypa.3
CDS 33038762    uc002ypa.3  uc002ypa.3
exon    33038762    uc002ypa.3  uc002ypa.3
CDS 33039571    uc002ypa.3  uc002ypa.3
exon    33039571    uc002ypa.3  uc002ypa.3
CDS 33040784    uc002ypa.3  uc002ypa.3
stop_codon  33040889    uc002ypa.3  uc002ypa.3
exon    33040784    uc002ypa.3  uc002ypa.3

At the moment it is not supported and I get the following error:

./bioawk: can't assign to group; it's an array name.
 source line number 1

bioawk

how to solve this problem when trying to run make in bioawk file?
yacc -d awkgram.y
awkgram.y: warning: 43 shift/reduce conflicts [-Wconflicts-sr]
awkgram.y: warning: 85 reduce/reduce conflicts [-Wconflicts-rr]
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -c ytab.c
make: gcc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127

FR: add asorti function

Dear,

if possible, could asort and asorti be added to the bioawk source, these functions would be very interesting when parsing large biodata files and producing hash during the process to finally output the sorted by value hash in the END{} block.

thanks in advance

How to cite bioawk?

Hi,
How can I cite bioawk? I am using it for my analysis.
Best Regards
Zillur

addon.c:274:3: warning: implicit declaration of function ‘fileno’ ==> POSIX extension

I get some warnings when compiling bioawk:

$ make
yacc -d awkgram.y
conflicts: 43 shift/reduce, 85 reduce/reduce
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -ansi -c ytab.c
gcc -g -Wall -O2 -ansi   -c -o b.o b.c
gcc -g -Wall -O2 -ansi   -c -o main.o main.c
gcc -g -Wall -O2 -ansi   -c -o parse.o parse.c
gcc -g -Wall -O2 -ansi maketab.c -o maketab
./maketab >proctab.c
gcc -g -Wall -O2 -ansi   -c -o proctab.o proctab.c
gcc -g -Wall -O2 -ansi   -c -o tran.o tran.c
gcc -g -Wall -O2 -ansi   -c -o lib.o lib.c
gcc -g -Wall -O2 -ansi   -c -o run.o run.c
gcc -g -Wall -O2 -ansi   -c -o lex.o lex.c
lex.c: In function ‘gettok’:
lex.c:157:9: warning: ignoring return value of ‘strtod’, declared with attribute warn_unused_result
gcc -g -Wall -O2 -ansi   -c -o addon.o addon.c
addon.c: In function ‘bio_getrec’:
addon.c:274:3: warning: implicit declaration of function ‘fileno’
gcc -g -Wall -O2 -ansi ytab.o b.o main.o parse.o proctab.o tran.o lib.o run.o lex.o addon.o  -o awk -lm -l

fileno is a POSIX extension:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37771
http://stackoverflow.com/questions/9427145/how-do-i-remove-the-following-implicit-declaration-of-function-warnings

Compiling bioawk with the following flags (without -ansi, doesn't give the addon.c:274:3: warning: implicit declaration of function ‘fileno’ error.

make CFLAGS='-g -Wall -O2'

ARGIND not support by bioawk?

Hi?

ARGIND is not support by bioawk?

If we support ARGIND , it will be a easy way to process one file based on another file.

such as finding the intersect part of them.

Best Regards

bioawk does not stop parsing a file on `nextfile`

Scenario: I have 4 FASTQ files in my current directory, and I want to print the first 10 seq names of each file. Ideally, the following command should work very quickly:

bioawk -c fastx 'FNR<11 {print $name} FNR==11{nextfile}' *.fastq

For example, with regular awk, the following works extremely fast:

awk 'FNR==1{print} FNR>1{nextfile}' *.fastq

But bioawk does not stop processing the file when it encounters nextfile. Instead, it seems to continue parsing the whole file, which is crazy when dealing with gigantic files. Plus, it seems to skip a file when running into the nextfile command, which is weird.

match with regexp does not work for me

Dear,

I try to extract fastq reads that start with a 7-base UMI followed by a primer sequence.

They should match the regexp /^[ATGC]{7}GCCCAGGTGCAGCTGCAG/ but it does not work when I add the UMI part on the left as a regexp.

It seems that the match with // fails for regexp expression.
Is it me or do I need to write this differently?
Thanks for help

# fails
bioawk -c fastx 'match($seq, /^[ATGC]{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

# fails
bioawk -c fastx 'match($seq, /^[:alpha:]{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

# fails
bioawk -c fastx 'match($seq, /^.{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

while 

# works
bioawk -c fastx 'match($seq, /GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

# also works 
# but I would prefer to use a true regexp for more readability
bioawk -c fastx 'match($seq, /^.......GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

sampleR1.fq

@M00270:128:000000000-JV5LR:1:1101:21126:1000 1:N:0:1
NGGCTGGGCCCAGGTGCAGCTGCAGGAGTCTGGGGGAGGATTGGTACAGGCTGGGGGCTCTCTGAGACTCTCCTGTGTAGCCTCTGGAGGCACCTTCAGTACCTATGCCATGGGCTGGTTCCGCCAGGCTCCAGGGAAGGAGCGTGAGTTTGTAACACGTATTAACCGGAGTAATGATCGCACATTGTATTCGGACTCCGTGAAGGGCCGATTCACCATTTTGAGAGACAACGCCAAGAGCACAGTGTATCTACAAATGAACAGCCTGAAAACTGAGGACACGGCCGTTTATTATTGTTC
+
!8@BCFEGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGG:FGGGGCFGGFGGGGGGGGGFGGGGGGGGGGGGGGGGGEGGGGGGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDFGGGGGGGGFFGGGGGFGFD@FEFFGGGDFGGG7F:BG,@B9DCCDGGGGGDBDFGFGGECG?*<BCCFEFGACCECCFEFF6BDCGGGGGCE6FF60CGGCFE7CG6CC7+<FG?8CGFCFG+CFFFGF0<1>E3D>EECF@7CFGFBF*)

bioawk vs awk give different results (locale-specific)

Hi,

I'm trying to check my VCF filtering and I wanted to play around with bioawk but I noticed it gives different results than regular awk present on my system (it's a server that should have OpenSUSE according to admin).

See this example:

grep -v '#' r123_newrbmcl100_fb.vcf | awk '$6>20 {print $1,$2,$6}' | head
E57432_L179 11 7368.02
E57432_L179 19 8461.42
E57432_L179 35 12953.8
E57432_L179 43 14922.7
E57432_L179 51 19613.4
E57432_L179 62 20413.3
E57432_L179 70 20282.5
E57432_L179 108 25482.3
E57432_L179 118 27017.1
E57432_L179 149 24639.4

compared to:

grep -v '#' r123_newrbmcl100_fb.vcf | bioawk '$6>20 {print $1,$2,$6}' | head
E57432_L179 11 7368.02
E57432_L179 19 8461.42
E57432_L179 62 20413.3
E57432_L179 70 20282.5
E57432_L179 108 25482.3
E57432_L179 118 27017.1
E57432_L179 124 9.09527e-05
E57432_L179 128 4.50905e-05
E57432_L179 129 3.73425e-05
E57432_L179 138 4.61921e-05

Note the only difference between the commands is replacing awk with bioawk - no special features of bioawk are called. However I noticed this originally when I did use the features (namely -tc hdr and calling variables with colnames) and I got the same wrong result. I then tested it as above.

For some reason bioawk doesn't print some lines that pass the $6>20 filter (like positions 35, 43, and 51) while it prints possitions that should not pass the filter (positions 124-138). System awk prints it correctly.

I installed bioawk using conda manager and bioconda channel. The version is 1.0 and build h84994c4_4.

Thanks for any input on what could go wrong.

[notbug] unexpected termination on fasta.gz

Unexpected termination returns

2 �
9590 >

by the command:

~/miniconda3/bin/bioawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}' SILVA_128_SSURef_tax_silva_full_align_trunc.fasta.gz

Reference gzip and mawk commands (doing the same count) are OK:

time zcat *fasta.gz|./bioawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}'
1922213 >

real	47m44.564s
user	46m22.997s
sys	2m48.813s

time zcat *fasta.gz|mawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}'
1922213 >

real	10m47.346s
user	11m31.835s
sys	1m44.985s

Is it similar to zgrep, which can search both in compressed and not compressed, or -c fastx should be explicit?

header does not respect FS flag in BEGIN

this will demonstrate the problem:

Given his text file

echo $'a space\tb\n1\t2\n1\t2' > t.txt

and this code:

./awk 'BEGIN{FS="\t"}{ print $b }' t.txt

I expect:

b
2
2

but it instead gives nothing because it splits the headers on space sets them as are a, space,b
even though the separator is '\t'. This is because the headers are parsed before the BEGIN block
sets FS.

fastq comment line

Hej,
thanks a lot for the tool, I really love it! But I have one question I couldn't find a proper way to solve it:
How can I work on the third line in a fastq file (the +-comment line)? When manipulating fastx one can access the 4 variables 1:name 2:seq 3:qual 4:comment, of which the $comment-variable does not mean the comment line but everything behind the name in the read-header. Is it possible to access the values in the comment-line though?

Thanks a lot and best regards!
Philipp

bioawk -c fastx sets OFS=\\t with no way to change it.

Hello,

When using bioawk with the -c fastx option the OFS parameter is set to a tab (\t) with no way to change it.

Reproduced by building from source.

Test:

bioawk -c fastx -v OFS=" " '{print OFS}' some.fasta | head -1 | od -bc

Will output:

0000000 011 012
         \t  \n
0000002

Note: some.fasta can be anything (if at least it is a fasta file), but not providing any file will not trigger this.

make fails

make
yacc -d awkgram.y
make: yacc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127

no ytab.o file while Makefile is including one

solved after installing bison (not present under ubuntu by default

Maybe nice to add to the readme.

Thanks for the great tool

header: multiple input files causes multiple headers

Multiple input files causes the header to be printed multiple times in the output. The desired output would have a single header. Ideally, bioawk would check that the headers of each input file are identical.

bioawk -c header 1 foo bar

Segmentation fault on empty files

bioawk segfaults when asked to parse an empty files

$ touch test.fastq
$ gzip test.fastq 
$ bioawk -c fastx '{print}' test.fastq.gz 
Segmentation fault

Actually, it also segfaults on non-gzipped input:

$ touch test.fastq
$ bioawk -c fastx '{print}' test.fastq
Segmentation fault

Is it something easily fixable?

Installation issue

I installed it as below...

  1. sudo apt-get install bison
  2. git clone https://github.com/lh3/bioawk.git
  3. cd bioawk
  4. make

Everything works well, but when I typed "bioawk" even under bioawk folder, got a error message "bioawk: command not found". How can I fix this?

test

Are there some test? I want run some tests on different OS tocompare their running time.thanks very much!

revcomp bug?

Hi,
May be I am missing something. But below seemed weird to me:

bioawk 'BEGIN{a="TTTT" ; print a , revcomp(a) }'
TTTT AAAA

bioawk 'BEGIN{a="TTTT" ; b=revcomp(a); print a,b }'
AAAA AAAA

I am using a Dell PE2900 server, running CentOS 6.

Bernardo

recipe for target 'ytab.o' failed

Hi
I want to install biohawk on the latest ubuntu (18.04) but I am getting this error after running make;

~/bioawk$ make
yacc -d awkgram.y
make: yacc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127

Any thoughts? (sorry if it is an easy fix)
Thanks

Segmentation fault (core dumped)

When i run a test order:
bioawk -c fastx 'END{print NR}' chr17_fragm.fasta
it says:
Segmentation fault (core dumped)
How should i do? Thanks very much for every help.

compilation fatal error - missing zlib.h

Hi,

I tried to compile bioawk after cloning it from this repo and installing byacc (using sudo apt install byacc on elementary OS Hera / Ubuntu 18.04), but I got fatal error:

jena@cloudbook:~/bin/bioawk$ make
yacc -d awkgram.y
yacc: 43 shift/reduce conflicts, 85 reduce/reduce conflicts.
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -c ytab.c
gcc -g -Wall -O2   -c -o b.o b.c
gcc -g -Wall -O2   -c -o main.o main.c
gcc -g -Wall -O2   -c -o parse.o parse.c
gcc -g -Wall -O2 maketab.c -o maketab
./maketab >proctab.c
gcc -g -Wall -O2   -c -o proctab.o proctab.c
gcc -g -Wall -O2   -c -o tran.o tran.c
gcc -g -Wall -O2   -c -o lib.o lib.c
gcc -g -Wall -O2   -c -o run.o run.c
gcc -g -Wall -O2   -c -o lex.o lex.c
gcc -g -Wall -O2   -c -o addon.o addon.c
addon.c:241:10: fatal error: zlib.h: Adresář nebo soubor neexistuje
 #include <zlib.h> /* FIXME: it would be better to drop this dependency... */
          ^~~~~~~~
compilation terminated.
make: *** [<vestavěný>: addon.o] Chyba 1

Any help would be welcome.

help for gff format

Hi- First, thanks for developing bioawk!
I noticed that the description of the gff format from bioawk -c help has 10 fields, last two being "group" and "attribute".

bioawk -c help
...
gff:
    1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute 
...

Shouldn't there be 9 fields, where the 9th is either called "attribute" or "group"? For reference: https://www.sanger.ac.uk/resources/software/gff/spec.html#t_2 or http://genome.ucsc.edu/FAQ/FAQformat.html#format3

Dario

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.