Comments (5)
Regardless, I've added a sanity check for this case (in anvio-dev) so that you get a nice output error message and not a code traceback. Now when I run
anvi-gen-contigs-database -f GCF_005281615.1-renamed.fasta -o Rhodo_def.db --external-gene-calls GCF_005281615_genecall-external-gene-calls_modified.txt
I get the following error:
Config Error: Something is wrong with your external gene calls file. It seems that the gene
with gene callers id 2792, on contig c_000000000002, has positions that go
beyond the length of the contig. Specifically, the length of the contig is
50020, but the gene starts at position 1057049 and goes to position 1057235.
We've removed the partially-created contigs database for you (but you can see if
if you re-run your command with the `--debug` flag).
from anvio.
Hi @ivagljiva, it went perfectly after correcting the contig number. Thank you so much for your help! I still have genome files with more than 200 contigs, so I will try anvi-script-reformat-fasta --simplify-names --report-file to get the label file. I really appreciate your assistance!
Hey @smta11, thanks for the test dataset. I was able to reproduce your error in both anvi'o v8 and anvi'o-dev.
I did some debugging and got the following information about the gene call that is failing:
gene caller id: ...................................: 2,792
gene start: .................................: 1,057,049
gene stop: ..................................: 1,057,235
contig: .....................................: c_000000000002
contig length: ..............................: 50,020
Clearly, the start and stop positions of the gene (taken from the external gene calls file) are much larger than the length of the contig it is on, which is causing an IndexError because this part of the code stores nucleotide information at the per-contig level. That is, our list of nucleotide info has a length of 50,020, but we are trying to place information at positions between 1,057,049 and 1,057,235 (which does not work).
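To make the failure mode concrete, here is a minimal sketch of what goes wrong. It is not anvi'o's actual code: `nt_info` and `mark_gene` are hypothetical names, and the only facts taken from the case above are the contig length and the gene coordinates.

```python
# Hypothetical illustration, not anvi'o's actual implementation.
contig_length = 50_020


def mark_gene(nt_info, start, stop):
    """Flag every position in [start, stop) as belonging to a gene."""
    for pos in range(start, stop):
        nt_info[pos] = True  # raises IndexError once pos >= len(nt_info)


# One slot per nucleotide of the contig.
nt_info = [False] * contig_length

# A gene that fits on the contig works as expected.
mark_gene(nt_info, 100, 400)

# The failing gene call: coordinates far beyond the end of the contig.
try:
    mark_gene(nt_info, 1_057_049, 1_057_235)
    failed = False
except IndexError:
    failed = True
```

Any per-contig structure indexed by position fails in the same way, which is why the mismatch shows up as an IndexError rather than as a message about contig names.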
Not all genes on contig c_000000000002 have start and stop positions greater than the contig's length. The first 48 genes on the contig have expected ranges, but the next 988 genes are all out of range:
$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start<50020" | wc -l
48
$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start>=50020" | wc -l
988
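The same comparison can be made before a contigs database exists, straight from the FASTA file and the external gene calls file. The sketch below is one way to do that, under a few assumptions: the gene calls file is tab-delimited with a header row containing `gene_callers_id`, `contig`, `start` and `stop` columns, and `stop` is treated as an end-exclusive coordinate (so `stop == contig length` is allowed). All function names are hypothetical helpers, not part of anvi'o.

```python
import csv


def contig_lengths_from_fasta(lines):
    """Return {contig_name: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the first word of the defline
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def find_out_of_range_genes(gene_call_lines, contig_lengths):
    """Return (gene_callers_id, contig, stop, contig_length) for each bad call.

    A call is reported when its contig is missing from the FASTA
    (contig_length is None) or when its stop exceeds the contig length.
    """
    problems = []
    for row in csv.DictReader(gene_call_lines, delimiter="\t"):
        length = contig_lengths.get(row["contig"])
        stop = int(row["stop"])
        if length is None or stop > length:
            problems.append((row["gene_callers_id"], row["contig"], stop, length))
    return problems


def check_files(fasta_path, gene_calls_path):
    """Convenience wrapper that runs the check on two files on disk."""
    with open(fasta_path) as fasta, open(gene_calls_path, newline="") as calls:
        return find_out_of_range_genes(calls, contig_lengths_from_fasta(fasta))
```

Calling `check_files` on the renamed FASTA and the modified gene calls file from this thread would be expected to list the gene calls on c_000000000002 whose stop positions lie past the end of the contig, assuming the file follows the column layout described above.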
The next question is, is this issue coming from anvi-script-process-genbank, or from the original GenBank file? I downloaded the GenBank file for accession GCF_005281615.1 from NCBI and ran the program like this:
anvi-script-process-genbank -i ncbi_dataset/data/GCF_005281615.1/genomic.gbff -O GCF_005281615
I verified that the only difference between the contig sequences provided in the test datapack and in the output FASTA file from this program had to do with the names of the contig headers (since the former is reformatted and the latter is not):
diff GCF_005281615-contigs.fa ../report/GCF_005281615.1-renamed.fasta
1c1
< >NZ_SZZM01000001
---
> >c_000000000001
3c3
< >NZ_SZZM01000010
---
(......)
So I am working with the same input contig sequences, at least.
I directly fed the output of anvi-script-process-genbank into anvi-gen-contigs-database like this:
anvi-gen-contigs-database -f GCF_005281615-contigs.fa -o CONTIGS.db --external-gene-calls GCF_005281615-external-gene-calls.txt
And that command worked perfectly on all 20 contigs. :/
This investigation makes me think that something went wrong during the reformatting process, @smta11. How did you update the external gene calls file with the new contig headers after you ran anvi-script-reformat-fasta? Is it possible that two different contigs could have been mismatched? If so, then this error could be coming from a gene call that has been incorrectly assigned to a much shorter contig, leading to the IndexError we see above.
Thank you so much, @ivagljiva, for your extra work on this problem. Following your comments, I checked my outputs and now understand why I got this problem: the order of the contigs changed after running anvi-script-reformat-fasta with the -l 0 --simplify-names options. For example, a contig originally labeled NZ_SZZM01000002.1 (contig 2) was renamed to c_000000000012 (contig 12). Could you please show me how you match these labels after running anvi-script-reformat-fasta with the --simplify-names option?
Great, I'm so glad that we figured out the problem. :)
Usually it's best to run anvi-script-reformat-fasta --simplify-names with the --report-file flag, so that you also get a text output that matches the original contig names to their new labels. For example, when I run that on my GCF_005281615 FASTA file, it looks like this:
c_000000000001 NZ_SZZM01000001
c_000000000002 NZ_SZZM01000010
c_000000000003 NZ_SZZM01000011
(.....)
(If you didn't do that, it is probably okay, because this genome is small enough that you could match up the 20 contigs manually.)
Then, you can use the report file to replace each old contig name with its corresponding new name in the external gene calls file. I'd probably do it in Python (but you could use any strategy you feel comfortable with). For example, here is how I did it for GCF_005281615 using the Pandas package:
import pandas as pd

# load the external gene calls, indexed by gene caller id
ext_genes = pd.read_csv("GCF_005281615-external-gene-calls.txt", sep="\t", index_col=0)

# the report file lists the new name first and the original name second
old_name_to_new_name = {}
with open('reformat_report.txt', 'r') as f:
    for line in f:
        columns = line.strip().split()
        old_name_to_new_name[columns[1]] = columns[0]

# assign the result back instead of replacing in place on the column
ext_genes["contig"] = ext_genes["contig"].replace(old_name_to_new_name)
ext_genes.to_csv("reformatted_external_gene_calls.txt", sep="\t")
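If pandas is not available, the same renaming can be sketched with the standard library alone. The version below also refuses to continue when a contig in the gene calls file has no entry in the report, which is the kind of mismatch that caused the error in this thread. The function names are hypothetical, and the sketch assumes the report format shown above (new name first, original name second) and a tab-delimited gene calls file whose header includes a `contig` column.

```python
import csv


def load_name_map(report_lines):
    """Build {original_name: new_name} from the lines of a reformat report."""
    mapping = {}
    for line in report_lines:
        columns = line.split()
        if len(columns) >= 2:  # skip blank or malformed lines
            new_name, old_name = columns[0], columns[1]
            mapping[old_name] = new_name
    return mapping


def rename_contigs(gene_call_lines, mapping):
    """Return gene-call rows (header first) with the contig column renamed.

    Raises KeyError naming any contig that is absent from the mapping.
    """
    reader = csv.reader(gene_call_lines, delimiter="\t")
    header = next(reader)
    contig_col = header.index("contig")
    rows = [row for row in reader if row]
    missing = sorted({row[contig_col] for row in rows} - set(mapping))
    if missing:
        raise KeyError(f"contigs missing from the report file: {missing}")
    for row in rows:
        row[contig_col] = mapping[row[contig_col]]
    return [header] + rows
```

To use it on real files, open the report and the gene calls file, pass the file objects to `load_name_map` and `rename_contigs`, and write the returned rows with `csv.writer(out, delimiter="\t")`.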
I hope that helps. I will close this issue for now, but feel free to reopen it if any of your questions weren't addressed :)