Comments (8)
If you run Prodigal on your .fna file, it will produce the necessary genes files (.ffn and .faa) as well as a .gff file that contains the coordinates of the genes. You can then convert the .gff to the .genes format. When I get the chance, I can add support for .gff files instead of the non-standard .genes files.
Thanks,
Stephen
from midas.
Hi Stephen,
Thanks for the advice. That's what I've done and it is working so far.
from midas.
Hello,
I am also trying to build my own database. How do you convert the .gff file into the .genes file? Is it just a matter of renaming the file and changing the extension from ".gff" to ".genes"? Or does MIDAS now support .gff files instead of .genes files?
I would appreciate any guidance on this!
from midas.
Hi,
I am not sure whether MIDAS now supports .gff files. If it does not, you can generate the .genes files as follows (there is probably a simpler way of doing it, but this one works well for me):
#Starting with ".faa" generated with Prodigal
for file in ./Genome_*/*.faa
do
awk 'sub(/^>/, "")' <
awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt
awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt
awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt
awk 'match(
sed -i 's/.$//' ${file%.faa}_OK4.txt
awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt
awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes
sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes
rm ./Genome_*/*.txt
done
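The multi-step pipeline above can probably be collapsed into a single awk pass. The following is a minimal sketch rather than a drop-in replacement: it assumes Prodigal's usual .faa header layout (`>scaffold_1 # start # end # strand # ID=...`), that each gene ID is the scaffold ID plus a trailing `_<number>`, and that every gene is a CDS. The function name `faa_to_genes` is made up for illustration.

```shell
# Sketch: build a MIDAS-style .genes table from the headers of a Prodigal .faa file.
# Assumed header layout: >scaffold_1 # 3 # 98 # 1 # ID=1_1;partial=10;...
faa_to_genes() {
  awk -F ' # ' -v OFS='\t' '
    BEGIN { print "gene_id", "scaffold_id", "start", "end", "strand", "gene_type" }
    /^>/ {
      gene = substr($1, 2)               # drop the leading ">"
      scaffold = gene
      sub(/_[0-9]+$/, "", scaffold)      # strip the trailing gene index
      strand = ($4 == "-1") ? "-" : "+"  # Prodigal writes 1 or -1
      print gene, scaffold, $2, $3, strand, "CDS"
    }
  ' "$1"
}
```

It could be used as `for file in *.faa; do faa_to_genes "$file" > "${file%.faa}.genes"; done`. Because it writes no intermediate files and does not call `sed -i`, it should behave the same with GNU and BSD tools.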
from midas.
Dear @palomo11 Thank you so much for your help! I tried the code you kindly shared with me but I got the following error: "sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _
awk: invalid -v option
awk: invalid -v option"
The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was:
`for file in *.faa
do
awk 'sub(/^>/, "")' <
awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt
awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt
awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt
awk 'match(
sed -i 's/.$//' ${file%.faa}_OK4.txt
awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt
awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes
sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes
done`
I am attaching the files I got by running that code. Do you have any more guidance on this? Any help would be greatly appreciated!
8_1C_n2-B.animalis_P19_OK2.txt
8_1C_n2-B.animalis_P19_OK.txt
8_1C_n2-B.animalis_P19_OK3.txt
8_1C_n2-B.animalis_P19_OK4.txt
8_1C_n2-B.animalis_P19.txt
I also get files with names
8_1C_n2-B.animalis_P19_OK3_OK.txt
8_1C_n2-B.animalis_P19.genes
but these are empty
from midas.
I think there should be a space after -v:
awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt
awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes
sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes
from midas.
Dear @palomo11 ,
Thank you so much for your help!
I was able to obtain a ".genes" file that is not empty, but I still get an error at the end that says "sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _
sed: 1: "8_1C_n2-B.animalis_P19. ...": invalid command code _ "
The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was:
`for file in *.faa
do
awk 'sub(/^>/, "")' <
awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt
awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt
awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt
awk 'match(
sed -i 's/.$//' ${file%.faa}_OK4.txt
awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt
awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes
sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes
done`
I am attaching below the .genes file I obtained (I changed the extension to .txt so I could attach it here). I still do not see the columns that MIDAS asks for:
gene_id (CHAR)
scaffold_id (CHAR)
start (INT)
end (INT)
strand (+ or -)
gene_type (CDS or RNA)
It seems I only have 3 columns: one with the genome ID and a number attached to it, one with a possible protein name, and one with the gene type "CDS".
8_1C_n2-B.animalis_P19.genes.txt
Do you have any insights into what could be going wrong with the sed command?
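On the sed error: "invalid command code" is what BSD sed (the default on macOS) prints in this situation. With BSD sed, `-i` expects a backup suffix as its next argument, so in `sed -i 's/.$//' file` the expression is taken as the suffix and the file name is then parsed as the sed script. The one-line `1 i\text` form is also a GNU extension. The sketch below avoids `sed -i` altogether by going through a temporary file. The helper names `strip_last_char` and `prepend_header` are hypothetical, and it assumes the two sed lines are only meant to remove the last character of every line and to add the header line.

```shell
# Remove the last character of every line without relying on sed -i.
strip_last_char() {
  tmp=$(mktemp) || return 1
  sed 's/.$//' "$1" > "$tmp" && cat "$tmp" > "$1"
  rm -f "$tmp"
}

# Put the MIDAS column names on the first line of a .genes file.
prepend_header() {
  tmp=$(mktemp) || return 1
  {
    printf 'gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type\n'
    cat "$1"
  } > "$tmp" && cat "$tmp" > "$1"
  rm -f "$tmp"
}
```

Inside the loop these would replace the two `sed -i` lines, for example `strip_last_char "${file%.faa}_OK4.txt"` and `prepend_header "${file%.faa}.genes"`. With GNU sed the original commands should work unchanged; with BSD sed, `sed -i '' 's/.$//' file` is the usual spelling of the first one.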
from midas.
Hello! I just wanted to post an update for those struggling to make a ".genes" file. I was able to convert a .gff file obtained from Prodigal to a .genes file with the correct columns by using the program csvtk and doing the steps below:
To count the number of columns in a .gff file:
csvtk dim --cols -t 8_1C_n2-B.animalis_P19.gff
Notes:
The CSV parser requires all lines to have the same number of fields/columns. Even lines with spaces will cause an error. Use '-I/--ignore-illegal-row' to skip these lines if necessary.
By default, csvtk handles CSV files; use the -t flag for tab-delimited files.
To select fields/columns:
csvtk cut -f 1,3-5,7,9 --ignore-illegal-row -t 8_1C_n2-B.animalis_P19.gff > 8_1C_n2-B.animalis_P19.genes
Check the number of columns in the resulting .genes file:
csvtk dim --cols -t 8_1C_n2-B.animalis_P19.genes
Result: 6
To rename fields/columns in genes file:
csvtk rename -f 1-6 -t -n scaffold_id,gene_type,start,end,strand,gene_id 8_1C_n2-B.animalis_P19.genes > 8_1C_n2-B.animalis_P19_renamed_columns.genes
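For comparison, the same column selection can be sketched with awk alone, writing the columns in the order listed earlier in the thread (gene_id, scaffold_id, start, end, strand, gene_type). This is only a sketch and has not been checked against MIDAS. It assumes that the ninth GFF column starts with `ID=<n>_<gene index>`, as in Prodigal output, and that the gene ID used in the matching .faa/.ffn headers is `<scaffold>_<gene index>`. The function name `gff_to_genes` is hypothetical.

```shell
# Sketch: convert a Prodigal .gff file to a six-column .genes table.
gff_to_genes() {
  awk -F '\t' -v OFS='\t' '
    BEGIN { print "gene_id", "scaffold_id", "start", "end", "strand", "gene_type" }
    /^#/ || NF < 9 { next }        # skip comment lines and short rows
    $3 == "CDS" {
      id = $9
      sub(/^ID=/, "", id)          # assumes ID= is the first attribute
      sub(/;.*/, "", id)           # drop the remaining attributes
      sub(/^[0-9]+_/, "", id)      # keep only the gene index
      print $1 "_" id, $1, $4, $5, $7, "CDS"
    }
  ' "$1"
}
```

For the file above this would be run as `gff_to_genes 8_1C_n2-B.animalis_P19.gff > 8_1C_n2-B.animalis_P19.genes`. The header line is printed explicitly and comment lines are skipped, so every CDS row in the .gff ends up as a data row.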
from midas.