Comments (8)

snayfach commented on May 11, 2024

If you run Prodigal on your .fna file, it will produce the necessary genes files (.ffn and .faa) as well as a .gff file that contains the coordinates of the genes. You can then convert the .gff to the .genes format. When I get the chance, I can add support for .gff files instead of the non-standard .genes files.

Thanks,
Stephen

from midas.

palomo11 commented on May 11, 2024

Hi Stephen,

Thanks for the advice. That's what I've done and it is working so far.

cirodri1 commented on May 11, 2024

Hello,
I am also trying to build my own database, and I was curious how you convert the .gff file into the .genes file. Is it just a matter of renaming the file and changing the extension from ".gff" to ".genes"? Or does MIDAS now support .gff files instead of .genes files?

I would appreciate any guidance on this!

palomo11 commented on May 11, 2024

Hi,

I'm not sure whether MIDAS now supports .gff files. If it doesn't, you can generate the .genes files as follows (there is probably a simpler way of doing it, but this one works well for me):

#Starting with ".faa" generated with Prodigal
for file in ./Genome_*/*.faa

do

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

rm ./Genome_*/*.txt

done

cirodri1 commented on May 11, 2024

Dear @palomo11, thank you so much for your help! I tried the code you kindly shared, but I got the following errors:

"sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _
awk: invalid -v option
awk: invalid -v option"

The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was:
`for file in *.faa

do

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

done`

I am attaching the files I got by running that code. Do you have any more guidance on this? Any help would be greatly appreciated!

8_1C_n2-B.animalis_P19_OK2.txt
8_1C_n2-B.animalis_P19_OK.txt
8_1C_n2-B.animalis_P19_OK3.txt
8_1C_n2-B.animalis_P19_OK4.txt
8_1C_n2-B.animalis_P19.txt

I also get files with names
8_1C_n2-B.animalis_P19_OK3_OK.txt
8_1C_n2-B.animalis_P19.genes
but these are empty

palomo11 commented on May 11, 2024

I think there should be a space after -v:

awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

cirodri1 commented on May 11, 2024

Dear @palomo11,

Thank you so much for your help!

I was able to obtain a ".genes" file that is not empty, but I still get errors at the end:

"sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _
sed: 1: "8_1C_n2-B.animalis_P19. ...": invalid command code _"

The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was:

`for file in *.faa

do

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

done`

I am attaching the .genes file I obtained below (I changed the extension to .txt so I could attach it here). I still do not see the columns that MIDAS asks for:
gene_id (CHAR)
scaffold_id (CHAR)
start (INT)
end (INT)
strand (+ or -)
gene_type (CDS or RNA)
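
A quick way to check a file against the six columns listed above is a small awk filter. This is only a sketch: `check_genes` is a hypothetical helper name, and the rules it applies (tab-separated, one header line, integer start/end) are assumptions drawn from the list, not from the MIDAS source.

```shell
# Hypothetical helper (not part of MIDAS): flag rows of a .genes file that
# do not have the six tab-separated columns listed in this thread.
check_genes() {
    awk -F '\t' '
        NR == 1 { next }    # assume the first line is the header
        NF != 6 || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ ||
        ($5 != "+" && $5 != "-") ||
        ($6 != "CDS" && $6 != "RNA") {
            printf "line %d: unexpected row: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }    # non-zero status if any row failed
    ' "$1"
}
```

Running `check_genes 8_1C_n2-B.animalis_P19.genes` prints each offending row and returns a non-zero status if any row fails.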

It seems I only have 3 columns: one with the genome ID and a number attached to it, one with a possible protein name, and one with the gene type "CDS".
8_1C_n2-B.animalis_P19.genes.txt

Do you have any insights into what could be going wrong with the sed command?

cirodri1 commented on May 11, 2024

Hello! I just wanted to post an update for those struggling to make a ".genes" file. I was able to convert a .gff file obtained from Prodigal to a .genes file with the correct columns by using the program csvtk and doing the steps below:

To count number of columns in a .gff file:

csvtk dim --cols -t 8_1C_n2-B.animalis_P19.gff

Notes:
The CSV parser requires all lines to have the same number of fields/columns. Even lines containing only spaces will cause an error. Use '-I/--ignore-illegal-row' to skip these lines if necessary.
By default, csvtk handles CSV files; use the -t flag for tab-delimited files.

To select fields/columns:

csvtk cut -f 1,3-5,7,9 --ignore-illegal-row -t 8_1C_n2-B.animalis_P19.gff > 8_1C_n2-B.animalis_P19.genes

Check the number of columns on the resulting .genes file:

csvtk dim --cols -t 8_1C_n2-B.animalis_P19.genes
Result: 6

To rename fields/columns in genes file:

csvtk rename -f 1-6 -t -n scaffold_id,gene_type,start,end,strand,gene_id 8_1C_n2-B.animalis_P19.genes > 8_1C_n2-B.animalis_P19_renamed_columns.genes
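
The csvtk steps above keep the columns in GFF order and store the whole attribute string in gene_id. As an alternative, here is a minimal awk sketch that writes the columns in the order listed earlier in the thread. It rests on two assumptions: that column 9 of Prodigal's GFF starts with `ID=<contig index>_<gene index>`, and that the gene names in the .faa/.ffn headers are `<scaffold>_<gene index>`. `gff_to_genes` is a hypothetical helper name, and whether MIDAS accepts the result has not been verified here.

```shell
# Hypothetical helper: convert a Prodigal GFF file to a tab-separated
# .genes table with a header line.
gff_to_genes() {
    awk -F '\t' -v OFS='\t' '
        BEGIN { print "gene_id", "scaffold_id", "start", "end", "strand", "gene_type" }
        /^#/ || NF < 9 { next }            # skip comment and malformed lines
        {
            id = $9
            sub(/^ID=[0-9]+_/, "", id)     # drop "ID=<contig index>_"
            sub(/;.*/, "", id)             # keep only the gene index
            print $1 "_" id, $1, $4, $5, $7, $3
        }
    ' "$1"
}
```

Usage would be along the lines of `gff_to_genes 8_1C_n2-B.animalis_P19.gff > 8_1C_n2-B.animalis_P19.genes`.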
