This project forked from kkulkarni1/capg

https://doi.org/10.1093/bioinformatics/btac729

CAPG - Comprehensive Allopolyploid Genotyper

This software genotypes targeted genomic regions in allotetraploids using whole genome sequencing (WGS) reads aligned to both reference subgenomes. (For a version that handles targeted amplicons, see capg_amp.) The main genotyper is written in C as a standalone executable.

Table of Contents

  1. Prerequisites
  2. Installation
  3. Required Inputs
  4. Output
  5. Other Command-Line Options
  6. Tutorial
  7. How to Cite
  8. Contact
  9. Amplicon Version

Prerequisites

The genotyper capg_wgs requires the C compiler from the GNU Compiler Collection (GCC), CMake, and the Samtools executable to be installed on your system.

There is also a data simulator capg_sim available. It additionally requires the R Standalone Math Library. Often, the Rmath library (libRmath.a or libRmath.so for Linux or libRmath.dylib for MacOS) will be installed with R, but not always. Here are some other locations for the library.

If RMathLib is not installed on your system, everything should be fine except capg_sim will not be compiled.
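Before compiling, you may want to confirm that the required executables are on your PATH. The following is a minimal Python sketch and is not part of CAPG: the function name missing_tools and its lookup parameter are hypothetical, and it only checks that executables named gcc, cmake, and samtools can be found. It does not check versions or look for the Rmath library.

```python
import shutil

# Executables listed above as prerequisites for capg_wgs.
REQUIRED_TOOLS = ("gcc", "cmake", "samtools")


def missing_tools(tools=REQUIRED_TOOLS, lookup=shutil.which):
    """Return the tools that `lookup` cannot find on the PATH.

    `lookup` defaults to shutil.which; it is a parameter only so the
    function can be exercised without the tools installed.
    """
    return [tool for tool in tools if lookup(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

If the script reports a missing tool, install it with your system's package manager before running cmake.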

Installation

  1. Clone the repository.

    git clone https://github.com/Kkulkarni1/CAPG.git
  2. Compile CAPG. The executable is called capg_wgs. It will appear in the CAPG/src/wgs directory.

    cd CAPG/src/wgs
    cmake .
    make
  3. Install CAPG. Copy the executable to wherever you need it. If you have root privileges, you can install it into the system path, for example:

    sudo cp capg_wgs /usr/local/bin

Required Inputs

The software requires multiple input files.

  1. Fasta files containing subgenomic references, one subgenome per file. Pass them in via the --fsa_files command-line option.

  2. SAM files containing the reads aligned to each subgenomic reference separately. Pass them in via the --sam_files command-line option.

  3. SAM file containing the alignments of selected target regions in each subgenome to each other. Pass it in via the --geno command-line option.

It also requires one command-line option.

You must name the target regions to genotype, including the chromosome name and the start and end positions relative to the whole chromosome. For example, chr1:1-10 means you want to genotype from position 1 to 10 (1-based) in chromosome 1. Use ':' to separate the chromosome name from the region and '-' to separate the start and end positions. Pass these in via the --ref_names command-line option.
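If you generate region strings from a script, it can help to validate them before launching a run. The sketch below parses the NAME:START-END form described above. It is an illustration only and not how capg_wgs parses its arguments: the names Region and parse_region are hypothetical, splitting on the last ':' is an assumption, and treating the end position as inclusive is also an assumption.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based
    end: int    # 1-based; assumed inclusive


def parse_region(text: str) -> Region:
    """Split a string such as 'chr1:1-10' into name, start, and end."""
    # Assumption: split on the last ':' so that a chromosome name
    # containing ':' would still parse.
    chrom, sep, span = text.rpartition(":")
    if not sep or not chrom:
        raise ValueError(f"expected NAME:START-END, got {text!r}")
    start_text, sep, end_text = span.partition("-")
    if not sep:
        raise ValueError(f"expected START-END after ':', got {span!r}")
    start, end = int(start_text), int(end_text)  # ValueError if not integers
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based range {start}-{end}")
    return Region(chrom, start, end)


if __name__ == "__main__":
    print(parse_region("chr1:1-10"))
```

Calling parse_region on a malformed string raises ValueError, so a wrapper script can stop before invoking the genotyper.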

We have used MUMmer4 to produce the SAM file for the alignment of the targeted region(s). For example, the following command will output a SAM file called ref.sam.

nucmer --sam-long=ref --mum target_A.fa target_B.fa

Output

The genotyping output for each subgenome is stored in VCF files, one per subgenome, if the --vcf_files command-line option is used. The name used to identify the current individual in the VCF output can be provided with the --name option. In addition, the program extracts the target regions into FASTA files, by default called extracted0.fsa and extracted1.fsa; you can change the prefix extracted with the -j command-line option. These files are currently deleted unless the program terminates unexpectedly, so you can place them in a temporary directory. The command also currently produces a lot of output to stderr that you may wish to capture and examine.
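One way to keep the extracted FASTA files out of your working directory and to save the stderr output is to wrap the call in a small script. The sketch below is a hedged example, not part of CAPG: build_capg_command and run_capg are hypothetical names, only options mentioned in this README are used, and it assumes that -j accepts a prefix that includes a directory path.

```python
import subprocess
import tempfile
from pathlib import Path


def build_capg_command(regions, sam_files, fsa_files, ref_sam, vcf_files,
                       sample_name, extract_prefix, exe="./capg_wgs"):
    """Assemble the argument list for one capg_wgs run."""
    return [
        exe,
        "--ref_names", *regions,
        "--sam_files", *sam_files,
        "--fsa_files", *fsa_files,
        "--geno", ref_sam,
        "--vcf_files", *vcf_files,
        "--name", sample_name,
        # Assumption: the -j prefix may include a directory.
        "-j", str(extract_prefix),
    ]


def run_capg(log_path, **kwargs):
    """Run capg_wgs, writing stderr to log_path; return the exit code."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "extracted"
        cmd = build_capg_command(extract_prefix=prefix, **kwargs)
        with open(log_path, "w") as log:
            return subprocess.run(cmd, stderr=log, check=False).returncode
```

Because the temporary directory is removed when run_capg returns, any extracted files left behind by an unexpected termination are removed with it. Use a persistent directory instead if you want to inspect them.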

Command-Line Options

Please run ./capg_wgs -h for detailed information about all available options.

Tutorial

All the files used and created in this tutorial are in the data folder. In this example we will genotype positions 1 to 5000 of both subgenomes assuming they are homoeologous. Finally, we store the output in VCF files, whose names are provided via the --vcf_files option.

From the src/wgs directory, the command line for genotyping is:

./capg_wgs --ref_names Genome_A:1-5000 Genome_B:1-5000 --sam_files ../../data/aln0A.sam ../../data/aln0B.sam --fsa_files ../../data/refA.fa ../../data/refB.fa --geno ../../data/ref.sam -equal --vcf_files ../../data/A.vcf ../../data/B.vcf

Positions with no coverage in the first genome are not output.
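To see which positions in the requested range have no record in a VCF file, you can compare the POS column against the range. This is a minimal sketch that relies only on the standard VCF layout (tab-separated data lines with CHROM and POS as the first two columns, header lines starting with '#'). The function names are hypothetical, and it assumes POS uses the same coordinates as the region you requested. A position may also be absent for reasons other than missing coverage.

```python
def vcf_positions(lines):
    """Collect the POS column from the data lines of a VCF file."""
    positions = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.rstrip("\n").split("\t")
        positions.add(int(fields[1]))  # VCF columns: CHROM, POS, ...
    return positions


def absent_positions(lines, start, end):
    """Return positions in [start, end] that have no VCF record."""
    present = vcf_positions(lines)
    return [pos for pos in range(start, end + 1) if pos not in present]
```

For the tutorial output, you could open ../../data/A.vcf and pass the file handle to absent_positions together with 1 and 5000.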

More detailed tutorials, demonstrating analysis of real peanut data and an extensive simulation, can be found here.

How to Cite

  • This work is under review. Please see bioRxiv.

Contact

If you have any problems with this software, please contact:

Roshan Kulkarni ([email protected]) or Karin S. Dorman ([email protected])

Amplicon Version

We also provide similar software for genotyping amplicon sequences. Its use is described briefly here.

Installation

Compile the amplicon version of CAPG. The executable is called capg_amp. It will appear in the CAPG/src/amplicon directory.

cd CAPG/src/amplicon
make capg_amp

Command-Line Options

CAPG_AMP(1)

NAME
	capg_amp - genotype tetraploids

SYNOPSIS
	capg_amp --sam_files SAM1 SAM2 --fasta_files FSA1 FSA2 --ref_names REF1 REF2
		[[--genotype_by_clustering [--alignment FILE1 FILE2]]
		[--sample INT --min-subgenomic-coverage FLOAT]
		[--min INT --max INT --expected-errors FLOAT --indel INT --loglik FLOAT
		 --min-posterior FLOAT --secondary --soft-clipped INT]
		[--coverage FLOAT --biallelic FLOAT --equal_coverage_test [FLOAT1 FLOAT2]]
		[--drop INT --amplici EXE [--amplici-f FILE --amplici-o STRING --amplici-l FLOAT]]
		[--error_file|--error_data FILE] ...]

DESCRIPTION
	capg_amp genotypes allotetraploids using reads in SAM1 and SAM2 aligned to
	REF1 and REF2 references from fasta files FSA1 FSA2.
	SAM files typically contain reads from a single individual, genotype, or
	accession aligned to multiple amplified targets, but capg_amp genotypes
	one individual at one amplicon.

OPTIONS

Input (required):
	--fasta_files FILE1 FILE2
		Subgenomic reference fasta files (Default: none).
		DEPRECATED: see --ref_fasta_files
	--ref_fasta_files FILE1 FILE2
		Subgenomic reference fasta files (Default: none).
	--sam_files FILE1 FILE2
		SAM files with reads aligned to each subgenome (Default: none)
	--ref_names STRING1 STRING2
		Names of subgenomic reference target regions (Default: none)
	--ref_alignment FILE
		SAM file containing alignment of references (Default: none)

Output (optional):
	--display_alignment
		Display alignments in stderr output (Default: no).
	--vcf_files FILE1 FILE2
		Genotyping output in one vcf file per subgenome (Default: none).
	--subref_fasta_files FILE|FILE1 FILE2
		Subsetted reference regions output to these FASTA files (Default: subsetted_refs[12].fsa)
	--gl
		Toggle GL output to vcf files (Default: yes).
	--name STRING
		Name of accession/individual/genotype; used in vcf header (Default: sample).

Estimation/Inference (optional):
	--genotype_by_clustering
		Genotype by clustering (Default: no).
		Requires command-line arguments --amplici and --clustalo or --mafft.
	--clustalo FILE
		The clustal omega executable (Default: (null)).
	--mafft FILE
		The mafft executable (Default: (null)).
	--alignment FILE1 FILE2
		Alignment input FILE1 and output FILE2 (Default: selected_haplotypes.fa selected_haplotypes.co.fa)
	--misalignment_rate FLOAT
		Maximum allowed subgenomic misalignment rate in [0, 0.5) (Default: 0.30).
		Tolerated proportion of reads from one subgenome aligning to the other.

Screening paralogs and other contaminants (optional):
	-p, --drop INT
		Drop reads aligning to paralogs; -1 to automate (Default: -1)
		Drop INT of four most abundant haplotypes if specified.
		Requires command-line argument --amplici.
	--amplici EXE
		The amplicon denoiser software (Default: none).
		Writes auxiliary files "amplici.fastq", "amplici.fa", and "amplici.out".
		See https://github.com/DormanLab/AmpliCI for more information.
	--amplici_fastq FILE
		Selected reads output to this FASTQ file for denoising (Default: amplici.fastq).
	--write-fastq [FILE]
		Write fastq file for AmpliCI and quit (Default: no).
		See --amplici_fastq to name the file.
		With optional argument write named fastq file after AmpliCI paralog filtering (Default: none).
