This repository contains example low-level applets that developers can use as inspiration when building their own.
More information about these applets is available at wiki.dnanexus.com.
The applets below are labeled with brief "tags" indicating the significant features each applet demonstrates:
- bash: the applet is written in bash (shell script).
- Invokes other applets: the applet invokes other applets.
- Multiple entry points: the applet makes use of multiple entry points to spawn subjobs (frequently used to facilitate parallelism across multiple computers).
- Python: the applet is written in Python.
These applets implement simple data transformations that are frequently useful as parts of larger pipelines.
contigset_to_fasta_gz: Takes a ContigSet object (one of the DNAnexus object types found in the public project "Reference Genomes") and produces a gzipped fasta file. Illustrates: Python
fastq_splitter: Takes a gzipped fastq file and splits it by number of reads into several smaller files. Useful to parallelize aligning reads. Illustrates: bash
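Splitting "by number of reads" relies on the fact that a FASTQ record is exactly four lines. A minimal sketch of that chunking logic in Python (the function name and chunk size are illustrative, not the applet's actual interface):

```python
# Group a stream of FASTQ lines into chunks of at most reads_per_chunk records.
# Each FASTQ record occupies exactly 4 lines, so a chunk is 4 * reads_per_chunk lines.
from itertools import islice

def split_fastq(lines, reads_per_chunk):
    """Yield lists of lines, each holding at most reads_per_chunk records."""
    block = 4 * reads_per_chunk
    it = iter(lines)
    while True:
        chunk = list(islice(it, block))
        if not chunk:
            break
        yield chunk

# Tiny demonstration: 5 reads split into chunks of 2 reads each.
records = []
for i in range(5):
    records += [f"@read{i}", "ACGT", "+", "IIII"]
chunks = list(split_fastq(records, reads_per_chunk=2))
# 5 reads at 2 reads/chunk -> 3 chunks holding 2, 2, and 1 reads
```

The real applet streams from a gzipped file and recompresses each chunk, but the grouping step is the same.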
sff_splitter: Takes a gzipped sff file and splits it by number of reads into several smaller files. Useful to parallelize aligning reads. Illustrates: Python
flexbar_read_demultiplexer: Demultiplexes indexed (barcoded) reads. Illustrates: bash
flexbar_read_trimmer: Trims reads by quality score and/or position. Illustrates: bash
picard_calculate_hs_metrics: Calculates hybrid selection (target enrichment) metrics using the Picard CalculateHsMetrics tool. Illustrates: bash
picard_mark_duplicates: Runs MarkDuplicates on a BAM file. Defaults to discarding duplicate reads instead of marking them. Illustrates: Python
picard_merge_sam_files: Runs the Picard module of the same name. Useful as the reduce step of a map-reduce strategy. Illustrates: Python
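The "reduce" role this applet plays, stripped to its essence, is merging several coordinate-sorted chunks back into one sorted stream. A sketch using plain integers as stand-ins for sorted BAM records:

```python
# Merge several already-sorted streams into one sorted stream, lazily.
# This mirrors what a merge step does with coordinate-sorted BAM chunks.
import heapq

mapped_chunks = [[1, 4, 9], [2, 3, 10], [5, 6]]  # stand-ins for sorted chunk outputs
merged = list(heapq.merge(*mapped_chunks))
# -> [1, 2, 3, 4, 5, 6, 9, 10]
```

Because each mapped chunk is sorted on its own, the merge is linear in the total number of records and never needs to re-sort.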
picard_sam_to_fastq: Wraps Picard module of the same name. Produces gzipped fastq files. Can convert either BAM or SAM files to fastq. Useful for those who receive sequence data in BAM format. Illustrates: Python
samtools_merge: Runs the merge module of samtools. Useful as the reduce step of a map-reduce strategy. Illustrates: bash
samtools_view: Runs the view module of samtools. Used in the pipeline to split a BAM by region. Could also be used to extract mappings of a certain flag or convert from BAM to SAM. Illustrates: Python
split_bam_interchromosomal_pairs: This is the only applet that wraps custom-written code instead of an open-source program. It splits a BAM file into intrachromosomal and interchromosomal mappings. This is required by picard_mark_duplicates in a map-reduce strategy that splits by genome region, because all interchromosomally mapped read pairs must be considered together. Illustrates: Python
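The split criterion itself is simple: a pair is interchromosomal when the read and its mate map to different reference sequences. A minimal sketch of that test (the real applet operates on BAM records, e.g. via a BAM-reading library; plain chromosome-name tuples are used here for illustration):

```python
# Classify read pairs by whether both mates map to the same reference sequence.
def is_interchromosomal(read_chrom, mate_chrom):
    """True when the read and its mate map to different chromosomes."""
    return read_chrom != mate_chrom

# Stand-ins for (read chromosome, mate chromosome) taken from BAM records.
pairs = [("chr1", "chr1"), ("chr1", "chr2"), ("chr2", "chr2")]
intra = [p for p in pairs if not is_interchromosomal(*p)]
inter = [p for p in pairs if is_interchromosomal(*p)]
# intra -> [("chr1", "chr1"), ("chr2", "chr2")]; inter -> [("chr1", "chr2")]
```

Routing the `inter` pairs into their own file is what lets the per-region duplicate-marking jobs run independently.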
bowtie_indexer: Takes a gzipped fasta file and produces a .tar.gz file that bowtie_mapper will take as input. Illustrates: bash
bowtie_mapper: Takes one or two arrays of gzipped fastq files for paired or unpaired reads and an indexed genome from the bowtie_indexer applet and maps them with bowtie. Illustrates: bash
bowtie_mapper_parallel: Wraps fastq_splitter, bowtie_mapper, and samtools_merge to align reads in parallel. Defaults to providing a worker for every 25,000,000 reads. Illustrates: bash, invokes other applets, multiple entry points
bwa_indexer: Takes a gzipped fasta file and produces a .tar.gz file that bwa_aligner will take as input. Defaults to using bwtsw. Illustrates: Python
bwa_aligner: Takes one or two arrays of gzipped fastq files for paired or unpaired reads, and runs bwa aln followed by bwa samse/sampe. Illustrates: Python
bwa_recalibration_pipeline: Wraps parallel_bwa, split_bam_interchromosomal_pairs, picard_mark_duplicates, gatk_realign_and_recalibrate_applet, and picard_merge_sam_files to align and recalibrate reads in parallel. Illustrates: Python
parallel_bwa: Wraps fastq_splitter, bwa_aligner, and picard_merge_sam_files to align reads in parallel. Defaults to providing a worker for every 10,000,000 reads. Illustrates: Python, multiple entry points
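The stated default of one worker per 10,000,000 reads implies a scatter width computed by ceiling division over the input size. A sketch of how that count could be derived (function name and the minimum-of-one floor are assumptions for illustration):

```python
# Derive a scatter width from a total read count and a per-worker quota.
import math

def num_workers(total_reads, reads_per_worker=10_000_000):
    """One worker per reads_per_worker reads, rounded up, at least one."""
    return max(1, math.ceil(total_reads / reads_per_worker))

# e.g. 25,000,000 reads at the default quota -> 3 workers
```

The same arithmetic applies to bowtie_mapper_parallel with its quota of 25,000,000 reads per worker.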
tmap_indexer: Takes a gzipped fasta file and produces a .tar.gz file that tmap_aligner will take as input. Illustrates: bash
tmap_aligner: Takes an SFF file to be mapped and a reference genome indexed by the 'tmap_indexer' applet. The reads are chunked into smaller parts by the 'sff_splitter' applet and mapped in parallel. The number of reads per mapping job is a parameter of the applet. Illustrates: bash, using a scatter-gather template
gatk_apply_variant_recalibration: Runs GATK's ApplyRecalibration on a VCF and a model produced by the gatk_variant_recalibrator applet. This recalibrates the quality of a VCF file and applies a filter to the data. Default filter level is 95% specificity. Illustrates: Python
gatk_realign_and_recalibrate_applet: Performs indel realignment and quality recalibration on a BAM file. Requires dbSNP and known indel files which are available in the datasets directory of the Developer Applets project. Illustrates: Python
gatk_variant_annotator_applet: Wraps the GATK module of the same name. Can be used to annotate with dbSNP, annotation or comparison VCFs, or a snpEff VCF. Illustrates: Python
gatk_variant_caller_applet: Runs GATK's UnifiedGenotyper on a BAM file to produce a VCF file. Illustrates: Python
gatk_variant_recalibration_pipeline: Wraps the gatk_variant_recalibrator and gatk_apply_variant_recalibration applets to build a model from one or more VCFs and apply it to recalibrate their quality. Illustrates: Python
gatk_variant_recalibrator: Runs GATK's VariantRecalibrator on one or more VCFs to produce model files to use in variant recalibration. Illustrates: Python
tumor_normal_snp_pipeline: Wraps bwa_recalibration_pipeline and somatic_sniper to align, recalibrate, and call tumor vs. normal SNPs on two sets of paired reads. Defaults to outputting VCF. Illustrates: Python
somatic_sniper: Runs the program of the same name, which takes a tumor BAM file and a normal BAM file and produces variant calls. Defaults to producing output in VCF format. Illustrates: Python