GithubHelp home page GithubHelp logo

FaToVcf reference issue. about kent HOT 4 CLOSED

Salvobioinfo avatar Salvobioinfo commented on August 19, 2024
FaToVcf reference issue.

from kent.

Comments (4)

AngieHinrichs avatar AngieHinrichs commented on August 19, 2024 1

Hi @Salvobioinfo -- sorry to hear that faToVcf isn't working as you expect. faToVcf uses the first sequence in the input fasta file as reference, unless you add the -ref=seqName option (where seqName is the name of the reference sequence, which must be included in the input fasta file). In order to help us diagnose, I have a few questions:

  1. How did you install faToVcf? bioconda? Download executable? Build from source? When did you last install or update it?
  2. What is the exact faToVcf command that you ran? (It will help to see the exact options used)
  3. Better yet, if you can run the same faToVcf command but add -verbose=2, and then copy-paste both the command and the stderr messages from the terminal into a reply, that will provide additional clues.
  4. Best of all, if you can share the fasta input then I can try to run the same command -- but I understand that it's not always possible to share.

from kent.

AngieHinrichs avatar AngieHinrichs commented on August 19, 2024 1

Thanks @Salvobioinfo -- it helps to know that you're using the GISAID MSA. That file uses a masking character that faToVcf does not expect: it uses '-' instead of 'N' to mask sites that are known to be problematic when building phylogenetic trees. When faToVcf sees a '-', it interprets that as an alignment gap not an N or masked base. When it sees '-' in the reference sequence, it assumes that some other sequence has an insertion at that location, and it skips over that base in all input sequences. However, if '-' is used as a masking character not an alignment gap, then faToVcf's assumption is incorrect and it results in incorrect coordinates in the output.

Here's how to work around that.

  1. Make a fasta file with only unmasked EPI_ISL_402124 sequence (not extracted from the MSA file -- the original sequence from GISAID), say EPI_ISL_402124.fa.
  2. Edit EPI_ISL_402124.fa to make sure that the sequence name after > is not exactly "EPI_ISL_402124". (The long hCoV-19/.... name is fine, or only "WIV04" -- just not "EPI_ISL_402124".)
  3. Don't use -includeRef and -ref; instead, concatenate EPI_ISL_402124.fa with 2021-11-09_masked.fa like this:
/dir/faToVcf -vcfChrom=NC_045512.2 -noGenotypes -ambiguousToN -verbose=2 \
    <(cat EPI_ISL_402124.fa 2021-11-09_masked.fa) \
    21-11-09_masked_EPI_ISL_402124.vcf > fatovcf_NC_045512.2.log 2>&1 &

EPI_ISL_402124 will still be included in the VCF because it is in 2021-11-09_masked.fa, but the directly downloaded, unmasked sequence of EPI_ISL_402124 will be used as the reference, resulting in a masking behavior instead of gap-skipping behavior.

Hope that helps, and please let us know if any of that is unclear or if it doesn't work.

from kent.

Salvobioinfo avatar Salvobioinfo commented on August 19, 2024

Hi @AngieHinrichs,
1.Question:
I downloaded the executable following the instruction at this UCSC_link, the 13.11.2021.
2-3. Question:
The command I used is:
/dir/faToVcf -includeRef -ref=EPI_ISL_402124 -vcfChrom=NC_045512.2 -noGenotypes -ambiguousToN -verbose=2 2021-11-09_masked.fa 21-11-09_masked_EPI_ISL_402124.vcf > fatovcf_NC_045512.2.log 2>&1 &
The log file is:
fatovcf.log
4. Question:
I cannot share the file, but it’s possible to get it from GISAID. It’s the MSA (masked).
Thank you in advance.
Best,
Salvo

from kent.

Salvobioinfo avatar Salvobioinfo commented on August 19, 2024

Dear @AngieHinrichs.
The tool works well now.
Thank you.
Salvo

from kent.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.