
Comments (5)

ivagljiva avatar ivagljiva commented on August 30, 2024 2

Regardless, I've added a sanity check for this case (in anvio-dev) so that you get a nice output error message and not a code traceback. Now when I run anvi-gen-contigs-database -f GCF_005281615.1-renamed.fasta -o Rhodo_def.db --external-gene-calls GCF_005281615_genecall-external-gene-calls_modified.txt, I get the following error:

Config Error: Something is wrong with your external gene calls file. It seems that the gene
              with gene callers id 2792, on contig c_000000000002, has positions that go
              beyond the length of the contig. Specifically, the length of the contig is
              50020, but the gene starts at position 1057049 and goes to position 1057235.
              We've removed the partially-created contigs database for you (but you can see if
              if you re-run your command with the `--debug` flag).

from anvio.

smta11 avatar smta11 commented on August 30, 2024 1

Hi @ivagljiva, it went perfectly after correcting the contig number. Thank you so much for your help! I still have genome files which have more than 200 contigs, so I will try anvi-script-reformat-fasta --simplify-names --report-file to get the label file. I really appreciate your assistance!

ivagljiva avatar ivagljiva commented on August 30, 2024

Hey @smta11, thanks for the test dataset. I was able to reproduce your error in both anvi'o v8 and anvi'o-dev.

I did some debugging and got the following information about the gene call that is failing:

gene caller id:  ...................................: 2,792
gene start:  .................................: 1,057,049
gene stop:  ..................................: 1,057,235
contig:  .....................................: c_000000000002
contig length:  ..............................: 50,020

Clearly, the start and stop positions of the gene (taken from the external gene calls file) are much larger than the length of the contig it is on, which is causing an IndexError because this part of the code stores nucleotide information at the per-contig level. That is, our list of nucleotide info has a length of 50,020, but we are trying to place information at positions between 1,057,049 and 1,057,235 (which does not work).
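
To make the failure mode concrete, here is a minimal sketch of the kind of bounds check described above. It is not anvi'o's actual implementation: the function name `find_out_of_range_genes` and the tuple layout are made up for illustration, and it assumes that a valid gene call satisfies `0 <= start` and `stop <= contig length`.

```python
def find_out_of_range_genes(contig_lengths, gene_calls):
    """Return gene calls whose coordinates do not fit on their contig.

    contig_lengths: dict mapping contig name -> length in nucleotides
    gene_calls: iterable of (gene_callers_id, contig, start, stop) tuples
    (both layouts are assumptions made for this sketch)
    """
    problems = []
    for gene_id, contig, start, stop in gene_calls:
        length = contig_lengths.get(contig)
        # An unknown contig, or a coordinate past the contig end, cannot be
        # stored in a per-contig list of that length.
        if length is None or start < 0 or stop > length:
            problems.append((gene_id, contig, start, stop, length))
    return problems


# Values from the debugging output above, plus one made-up in-range gene.
contig_lengths = {"c_000000000002": 50020}
gene_calls = [
    (1, "c_000000000002", 100, 400),             # hypothetical, fits on the contig
    (2792, "c_000000000002", 1057049, 1057235),  # the failing gene call
]

for gene_id, contig, start, stop, length in find_out_of_range_genes(contig_lengths, gene_calls):
    print(f"gene {gene_id} on {contig}: {start}-{stop} does not fit contig length {length}")
```

Running a check like this on the external gene calls file before building the database would flag the mismatch without waiting for the IndexError.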

Not all genes on contig c_000000000002 have start and stop positions greater than the contig's length. The first 48 genes on the contig have expected ranges, but the next 988 genes are all out of range:

$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start<50020" | wc -l
      48
$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start>=50020" | wc -l
     988

The next question is whether this issue comes from anvi-script-process-genbank or from the original GenBank file. I downloaded the GenBank file for accession GCF_005281615.1 from NCBI and ran the program like this:

anvi-script-process-genbank -i ncbi_dataset/data/GCF_005281615.1/genomic.gbff -O GCF_005281615

I verified that the only difference between the contig sequences provided in the test datapack and in the output FASTA file from this program had to do with the names of the contig headers (since the former is reformatted and the latter is not):

$ diff GCF_005281615-contigs.fa ../report/GCF_005281615.1-renamed.fasta
1c1
< >NZ_SZZM01000001
---
> >c_000000000001
3c3
< >NZ_SZZM01000010
---
(......)

So I am working with the same input contig sequences, at least.

I directly fed the output of anvi-script-process-genbank into anvi-gen-contigs-database like this:

anvi-gen-contigs-database -f GCF_005281615-contigs.fa -o CONTIGS.db --external-gene-calls GCF_005281615-external-gene-calls.txt

And that command worked perfectly on all 20 contigs. :/

This investigation makes me think that something went wrong during the reformatting process, @smta11. How did you update the external gene calls file with the new contig headers after you ran anvi-script-reformat-fasta? Is it possible that two different contigs could have been mismatched? If so, then this error could be coming from a gene call that has been incorrectly assigned to a much shorter contig, leading to the IndexError we see above.

smta11 avatar smta11 commented on August 30, 2024

Thank you so much, @ivagljiva, for your extra work on this problem. Following your comments, I checked my outputs and now understand the cause: the order of the contigs changed after running anvi-script-reformat-fasta with the -l 0 --simplify-names options. For example, a contig originally labeled NZ_SZZM01000002.1 (contig 2) was renamed to c_000000000012 (contig 12). Could you please show me how you match these labels after running anvi-script-reformat-fasta with the --simplify-names option?

ivagljiva avatar ivagljiva commented on August 30, 2024

Great, I'm so glad that we figured out the problem. :)

Usually it's best to run anvi-script-reformat-fasta --simplify-names with the --report-file flag, so that you also get a text output that matches the original contig names to their new labels. For example, when I run that on my GCF_005281615 FASTA file, it looks like this:

c_000000000001	NZ_SZZM01000001
c_000000000002	NZ_SZZM01000010
c_000000000003	NZ_SZZM01000011
(.....)

(if you didn't do that, it is probably okay, because this is a small enough genome that you could probably match up the 20 contigs manually).
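
For genomes with too many contigs to match by hand, one option is to pair old and new names by sequence identity. The sketch below is not part of anvi'o. It assumes that reformatting left the sequences themselves unchanged (as the diff earlier in this thread suggests for this dataset) and that no two contigs share an identical sequence. The function names `read_fasta` and `map_names_by_sequence` are made up for illustration.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of {contig name: sequence}."""
    chunks = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use only the first word of the header as the contig name.
                header = line[1:].split()
                name = header[0] if header else ""
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {contig: "".join(parts) for contig, parts in chunks.items()}


def map_names_by_sequence(original_fasta, reformatted_fasta):
    """Return {old name: new name} for contigs whose sequences are identical."""
    new_names_by_sequence = {}
    for new_name, sequence in read_fasta(reformatted_fasta).items():
        new_names_by_sequence.setdefault(sequence.upper(), []).append(new_name)

    mapping = {}
    for old_name, sequence in read_fasta(original_fasta).items():
        matches = new_names_by_sequence.get(sequence.upper(), [])
        # Accept only unambiguous matches; anything else needs a manual look.
        if len(matches) == 1:
            mapping[old_name] = matches[0]
    return mapping
```

With the files from this thread, a call such as `map_names_by_sequence("GCF_005281615-contigs.fa", "GCF_005281615.1-renamed.fasta")` would return the old-to-new dictionary. Contigs missing from the result are those that had no unique match, for example because a length filter removed them.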

Then, you can use the report file to replace each old contig name with its corresponding new name in the external gene calls file. I'd probably do it in Python (but you could use any strategy you feel comfortable with). For example, here is how I did it for GCF_005281615 using the Pandas package:

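import pandas as pd

# Load the external gene calls table; the first column (gene caller ids) becomes the index
ext_genes = pd.read_csv("GCF_005281615-external-gene-calls.txt", sep="\t", index_col=0)

# Map each original contig name (column 2 of the report) to its new name (column 1)
old_name_to_new_name = {}
with open('reformat_report.txt', 'r') as f:
  for line in f:
    columns = line.strip().split()
    if len(columns) >= 2:
      old_name_to_new_name[columns[1]] = columns[0]

# Assign the result back instead of relying on inplace=True through attribute access
ext_genes["contig"] = ext_genes["contig"].replace(old_name_to_new_name)
ext_genes.to_csv("reformatted_external_gene_calls.txt", sep="\t")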
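
One caveat: `replace` leaves any contig name with no entry in the mapping unchanged, so a name that differs slightly from the report would pass through silently. The following standard-library variant is a sketch that raises an error in that case. The function name `rename_contigs_in_gene_calls` is hypothetical, and the code assumes a tab-delimited gene calls file with a `contig` column and a two-column, tab-delimited report (new name, then original name).

```python
import csv


def rename_contigs_in_gene_calls(gene_calls_path, report_path, output_path):
    """Rewrite the 'contig' column of a tab-delimited gene calls file.

    Returns the number of gene calls written. Raises ValueError if any
    contig name in the gene calls file is missing from the report.
    """
    # Assumed report layout: new name <TAB> original name, one pair per line.
    old_to_new = {}
    with open(report_path, newline="") as report:
        for row in csv.reader(report, delimiter="\t"):
            if len(row) >= 2:
                old_to_new[row[1]] = row[0]

    with open(gene_calls_path, newline="") as src:
        reader = csv.DictReader(src, delimiter="\t")
        rows = list(reader)
        fieldnames = reader.fieldnames

    # Fail loudly on unmapped names instead of leaving them unchanged.
    unmapped = sorted({row["contig"] for row in rows} - set(old_to_new))
    if unmapped:
        raise ValueError("No new name for contig(s): " + ", ".join(unmapped))

    with open(output_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
        writer.writeheader()
        for row in rows:
            row["contig"] = old_to_new[row["contig"]]
            writer.writerow(row)
    return len(rows)
```

Stopping on the first unmapped name is a deliberate choice here: a partially renamed file is the situation that produced the original error.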

I hope that helps. I will close this issue for now, but feel free to reopen it if you feel all your questions weren't addressed :)
