Hi, I would like to use MIDAS with my own genomes as they are not present in the r

Dear <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url

I think there should be a space after -v: awk -v OFS=' ' '{$5 = "CD

Dear <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url

Build own database: <genome_id>.genes,about snayfach/midas

Comments (8)

snayfach commented on May 11, 2024

If you run Prodigal on your .fna file, it will produce the necessary genes files (.ffn and .faa) as well as a .gff file that contains the coordinates of the genes. You can then convert the .gff to the .genes format. When I get the chance, I can add support for .gff files instead of the non-standard .genes files.

Thanks,
Stephen

from midas.

palomo11 commented on May 11, 2024

Hi Stephen,

Thanks for the advice. That's what I've done and it is working so far.

from midas.

cirodri1 commented on May 11, 2024

Hello,
I am also trying to build my own database, I was just curious how can you convert the .gff file into the .genes file, is it just by renaming the file and changing the extension name from ".gff" to ".genes"? Or does MIDAS now support for .gff files instead of the .genes files?

I would appreciate any guidance on this!

from midas.

palomo11 commented on May 11, 2024

Hi,

Not sure if MIDAS now support *gff, otherwise you can get the .genes files as follows (I guess there must be simplest way of doing it, but this one works well for me):

#Starting with ".faa" generated with Prodigal
for file in ./Genome_/.faa

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

rm ./Genome_/.txt

done

from midas.

cirodri1 commented on May 11, 2024

Dear @palomo11 Thank you so much for your help! I tried the code you kindly shared with me but I got the following error: "sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _
awk: invalid -v option

awk: invalid -v option"

The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was:
`for file in *.faa

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

done`

I am attaching the files I got by running that code. Do you have any more guidance on this? Any help would be greatly appreciated!

8_1C_n2-B.animalis_P19_OK2.txt
8_1C_n2-B.animalis_P19_OK.txt
8_1C_n2-B.animalis_P19_OK3.txt
8_1C_n2-B.animalis_P19_OK4.txt
8_1C_n2-B.animalis_P19.txt

I also get files with names
8_1C_n2-B.animalis_P19_OK3_OK.txt
8_1C_n2-B.animalis_P19.genes
but these are empty

from midas.

palomo11 commented on May 11, 2024

I think there should be a space after -v:

awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

from midas.

cirodri1 commented on May 11, 2024

Dear @palomo11 ,

Thank you so much for your help!

I was able to obtain a ".genes" file that is not empty, but I still get an error at the end that says "sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _
sed: 1: "8_1C_n2-B.animalis_P19. ...": invalid command code _ "

The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was:

`for file in *.faa

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

done`

I am attaching below the .genes file I obtained (I changed the extension to .txt so I could attach it here), I still do not see the correct columns that MIDAS asks for:
gene_id (CHAR)
scaffold_id (CHAR)
start (INT)
end (INT)
strand (+ or -)
gene_type (CDS or RNA)

It seems I only have 3 columns, one with genome ID and a number attached to it, a possible protein name, and the gene type "CDS."
8_1C_n2-B.animalis_P19.genes.txt

Do you have any insights into what could be going wrong with the sed command?

from midas.

cirodri1 commented on May 11, 2024

Hello! I just wanted to post an update for those struggling to make a ".genes" file. I was able to convert a .gff file obtained from Prodigal to a .genes file with the correct columns by using the program csvtk and doing the steps below:

To count number of columns in a .gff file:

csvtk dim --cols -t 8_1C_n2-B.animalis_P19.gff

Notes:
The CSV parser requires all the lines have same number of fields/columns. Even lines with spaces will cause error. Use '-I/--ignore-illegal-row' to skip these lines if necessary.
By default, csvtk handles CSV files, use flag -t for tab-delimited files.

To select fields/columns:

csvtk cut -f 1,3-5,7,9 --ignore-illegal-row -t 8_1C_n2-B.animalis_P19.gff > 8_1C_n2-B.animalis_P19.genes

Check the number of columns on the resulting .genes file:

csvtk dim --cols -t 8_1C_n2-B.animalis_P19.genes
Result: 6

To rename fields/columns in genes file:

csvtk rename -f 1-6 -t -n scaffold_id,gene_type,start,end,strand,gene_id 8_1C_n2-B.animalis_P19.genes > 8_1C_n2-B.animalis_P19_renamed_columns.genes

from midas.

Build own database: <genome_id>.genes about midas HOT 8 OPEN

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs