usda-vs / vsnp3 Goto Github PK

vSNP -- validate SNPs

License: GNU General Public License v3.0

Python 97.76% Shell 2.24%

bam-files snps analysis-pipeline brucella freebayes genomics high-throughput-sequencing influenza mycobacterium-tuberculosis-complex phylogenetics

vsnp3's Introduction

vSNP3

This is an introduction to vSNP version 3. A comprehensive overview of vSNP version 2 is here with version 2 GitHub repo here.

vSNP3 generates BAM, VCF and annotated SNP tables and corresponding phylogenetic trees to achieve a high resolution SNP analysis.

Whole genome sequencing for disease tracing and outbreak investigations is routinely required for high consequence diseases. vSNP is an accreditation-friendly and robust tool, designed for easy error correction and SNP validation... vSNP. vSNP generates annotated SNP tables and corresponding phylogenetic trees that can be scaled for reporting purposes. It is able to process large scale datasets, and can accommodate multiple references.

Conda installation

Tested with Python 3.8 - 3.9.

Anaconda setup

conda create -c conda-forge -c bioconda -n vsnp3 vsnp3=3.23

Installation test

which vsnp3_step1.py

vsnp3_step1.py -h

vsnp3_step2.py -h

Test workflow

Download test files

cd ~; git clone https://github.com/USDA-VS/vsnp3_test_dataset.git

Add reference:

cd ~/vsnp3_test_dataset/vsnp_dependencies

vsnp3_path_adder.py -d `pwd`

Run test with AF2122 (Mycobacterium bovis)

Test step 1:

cd ~/vsnp3_test_dataset/AF2122_test_files/step1

vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t Mycobacterium_AF2122

Test step 2:

cd ~/vsnp3_test_dataset/AF2122_test_files/step2

vsnp3_step2.py -wd . -a -t Mycobacterium_AF2122

Run test with NC_045512 (SARS-CoV-2)

Test step 1:

cd ~/vsnp3_test_dataset/NC_045512_test_files/step1

for i in *.fastq.gz; do n=`echo $i | sed 's/[_.].*//'`; echo "$i in directory $n"; mkdir -p $n; mv $i $n/; done

cdir=`pwd`; for f in *; do echo $f; cd ./$f; vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t NC_045512_wuhan-hu-1; cd $cdir; done

mkdir stats; cp **/*stats.xlsx stats; cd stats; vsnp3_excel_merge_files.py #see combined_excelworksheets-*.xlsx stat summary

Test step 2:

cd ~/vsnp3_test_dataset/NC_045512_test_files/step2

vsnp3_step2.py -a -t NC_045512_wuhan-hu-1 -remove

Description

See -h option for detail and usage

Step 1

vsnp3_step1.py # main entry for step 1
vsnp3_alignment_vcf.py
vsnp3_assembly.py
vsnp3_best_reference_sourmash.py
vsnp3_fastq_stats_seqkit.py
vsnp3_group_reporter.py
vsnp3_vcf_annotation.py
vsnp3_zero_coverage.py

Step 2

vsnp3_step2.py, # main entry for step 2
vsnp3_fasta_to_snps_table.py
vsnp3_group_on_defining_snps.py
vsnp3_html_step2_summary.py
vsnp3_remove_from_analysis.py

Step 1 and 2

vsnp3_annotation.py
vsnp3_file_setup.py
vsnp3_reference_options.py

Utility scripts

vsnp3_path_adder.py
vsnp3_bruc_mlst.py
vsnp3_download_fasta_gbk_gff_by_acc.py
vsnp3_excel_merge_files.py
vsnp3_filter_finder.py
vsnp3_spoligotype.py

Input Summary

Map

Version Enhancements

Python 3.9 support.
Modularity improvements allowing for better usage of functions.
Sourmash to select best reference.
All dependency files can be provided as individual options, ie. file options can be explicit or based on reference directory.
All thresholds added as options.
FASTAs and GBKs can remain separate.
GBK download utility script to get full versus partial file.
Updated annotation descriptions.
Enhanced annotation when "no annotation provided".
Indels called as N at group SNP positions.
VCF files in set with "wrong" chromosome will halt script.
Defining snps and filters zipped with starting VCF files for better repeatability.
Ability to test defining SNPs individually.
With only a FASTA alignment a cascading SNP table can be generated.
Version and option capturing enhancements.
Enhanced file input options.
Oxford Nanopore read compatible (beta).
all_vcf named as defining SNP column label.
Removed Picard which drops Java requirements thus providing easier conda installation.
Strict sample/VCF file naming requirements when matching metadata names. Using the two column metadata Excel worksheet first column must match VCF file exactly, else minus ".vcf", else minus "_zc.vcf", else no match.
Improved organization of cascading SNP tables.
Unmapped reads not assembled by default. New flag to request assembly of unmapped reads.
Detailed steps recorded in run_log.txt.
Spoligotype for TB complex isolates not found by default. -s option required with step 1 or find using vsnp3_spoligotype.py.
Brucella MLST removed from step 1. Find Brucella MLST using vsnp3_bruc_mlst.py.

Additional Tools

vsnp3's People

Contributors

Stargazers

Watchers

Forkers

duceppemo

vsnp3's Issues

macOS installation when using the M chip

For the issue where conda/mamba says "package not found" when using a Mac that is looking for osx-64arm and not the original osx-64 (Intel) architecture.

Error in Test step 2 with AF2122 (Mycobacterium bovis)

Hello!

I have downloaded vsnp3 on my MacOS (M2 Chip) using conda and I am currently trying to Run test with AF2122 (Mycobacterium bovis).

I was able to download the test files, add references (with vsnp3_path_adder.py), and complete step 1, as instructed.

However, I am running into an error when I run:

vsnp3_step2.py -wd . -a -t Mycobacterium_AF2122

Error message:

Traceback (most recent call last):
  File "/Users/sh/opt/miniconda3/envs/bioinfo/bin/vsnp3_step2.py", line 405, in <module>
    group = Group(cwd=global_working_dir, metadata=args.metadata, defining_snps=args.defining_snps, excel_remove=args.remove_by_name, gbk_list=args.gbk, dataframes=vcf_to_df.dataframes, all_vcf=args.all_vcf, find_new_filters=args.find_new_filters, no_filters=args.no_filters, qual_threshold=int(args.qual_threshold), n_threshold=int(args.n_threshold), mq_threshold=int(args.mq_threshold), abs_pos=args.abs_pos, group=args.group, debug=args.debug)
  File "/Users/sh/opt/miniconda3/envs/bioinfo/bin/vsnp3_group_on_defining_snps.py", line 363, in __init__
    tree = self.raxml_table_build(group)
  File "/Users/sh/opt/miniconda3/envs/bioinfo/bin/vsnp3_group_on_defining_snps.py", line 534, in raxml_table_build
    tables.build_tables()
  File "/Users/sh/opt/miniconda3/envs/bioinfo/bin/vsnp3_fasta_to_snps_table.py", line 182, in build_tables
    with open(self.tree, 'rt') as tree_file: #must be the single line newick format.  Not Nexus which will be mutliline often with formating
FileNotFoundError: [Errno 2] No such file or directory: '/Users/sh/vsnp3_test_dataset/AF2122_test_files/step2/Mbovis-01/Mbovis-01_2024-06-13_14-32-24.tre'

How can I fix this? Thanks for your help!

vcflib unable find libraries

Fix by downgrading to previous version:
conda install vcflib=1.0.1 -c bioconda

Error while running vSNP 3.20 "step1" with nanopore data

Command used:
vsnp3_step1.py \ -r1 "$r1" \ --spoligo \ --nanopore

Error message:
Traceback (most recent call last): File "/home/bioinfo/miniconda3/envs/vsnp3.20/bin/vsnp3_step1.py", line 273, in <module> vsnp.run() File "/home/bioinfo/miniconda3/envs/vsnp3.20/bin/vsnp3_step1.py", line 158, in run if alignment.zero_coverage.ave_coverage < 40 or alignment.DUPLICATION_RATIO > .80 or float(fastq_stats.R1.passQ20) < 50.0: ValueError: could not convert string to float: '11,015'
Looks like it's caused by the use of a comma (",") as thousands separator. It might be link to the host system language preferences if you don't get that error on your dev / production machine.

mpilup takes 5h per sample with nanopore data

Hi Tod,

I've been trying the nanopore support of vSNP3 and I think it still needs to be optimized.

First, when installing vSNP3 with conda, it lacks 2 dependencies: vcftools and bcftools. To get a fully working pipeline (I only tested step 1 so far), I had to run:

conda create -y -n vsnp3 -c bioconda vsnp3=3.06
conda install -c bioconda vcftools bcftools
# I had a problem with some vcftools library that could be solved by creating a symbolic link
ln -s /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.1 /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.0.0

There are a few Warnings printed in the terminal while the step1 runs. The command I used:

$ vsnp3_step1.py -n -r1 '/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz' -f /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta         -b /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk

The terminal output:

vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)



Best Reference Finding with Sourmash 
2022-05-19 14:51:17

== This is sourmash version 4.4.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: /home/bioinfo/analyses/mbovis_... (k=31, DNA)
loaded 1 databases.

WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
11 matches; showing first 3:
similarity   match
----------   -----
  6.3%       NC_002945.4 Mycobacterium bovis AF2122/97 genome assembly...
  6.2%       NZ_CP041790.1 Mycobacterium tuberculosis strain SEA170200...
  6.2%       CP016401.1 Mycobacterium caprae strain Allgaeu genome

Sample: MBWGS036
Top Sourmash Finding: NC_002945.4 
Reference Set: Mycobacterium_AF2122 
Top reference that is automatically available: /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta

#############


Spoligotype 
2022-05-19 14:51:23

Align and make VCF file 
2022-05-19 14:52:36
[M::mm_idx_gen::0.136*1.01] collected minimizers
[M::mm_idx_gen::0.160*1.95] sorted minimizers
[M::main::0.160*1.95] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.184*1.83] mid_occ = 11
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.194*1.79] distinct minimizers: 770441 (96.15% are singletons); average occurrences: 1.053; average spacing: 5.362; total length: 4349904
[M::worker_pipeline::3.132*5.40] mapped 9607 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-ont -R @RG\tID:MBWGS036\tSM:MBWGS036\tPL:ILLUMINA\tPI:250 -t 8 -o MBWGS036.sam /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/MBWGS036.fastq.gz
[M::main] Real time: 3.149 sec; CPU: 16.941 sec; Peak RSS: 0.794 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[markdup] warning: unable to calculate estimated library size. Read pairs 0 should be greater than duplicate pairs 0, which should both be non zero.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf temp1.vcf
	--recode-INFO-all
	--out temp2
	--recode
	--remove-indels

Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
After filtering, kept 516 out of a possible 611 Sites
Run Time = 0.00 seconds

Zero Coverage 
2022-05-19 17:12:07
	Positions with no coverage: 12,953, 0.297777% of reference

MBWGS036 Poor FASTQ Usability
MBWGS036 Acceptable Reference Usability

As you can notice, the top reference has a very low % value. It still picks the right one, but this part of the pipeline is not optimized for Nanopore. Also, why is it still looking for the best reference is we already told which one to use?

The log file looks like this:


vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='MBWGS009.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)

Call Summary:
SYSTEM CALL: minimap2 -a -x map-ont -R "@RG\tID:MBWGS009\tSM:MBWGS009\tPL:ILLUMINA\tPI:250" -t 8 /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/MBWGS009.fastq.gz -o MBWGS009.sam -- 2022-05-19_17:45:11
SYSTEM CALL: samtools fixmate -O bam,level=1 -m MBWGS009.sam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools sort -l 1 -@8 -o MBWGS009_pos_srt.bam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools markdup -f markduplicate_stats.txt -r -O bam,level=1 MBWGS009_pos_srt.bam MBWGS009_nodup.bam -- 2022-05-19_17:45:24
NOTE: Read stats gathered by markduplicate_stats.txt -- 2022-05-19_17:45:24
NOTE: Nanopore - bcftools mpileup used to call SNPs and make VCF files *** -- 2022-05-20_00:17:23
SYSTEM CALL: bcftools mpileup --threads 16 -Ou -f /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta MBWGS009_nodup.bam | bcftools call --threads 16 -mv -v -Ov -o MBWGS009_unfiltered_hapall.vcf -- 2022-05-20_00:17:23
SYSTEM CALL: vcffilter -f "QUAL > 20" MBWGS009_unfiltered_hapall.vcf > temp1.vcf -- 2022-05-20_00:17:23
NOTE: Nanopore QUAL values increased by 100 to obtain closer values seen with Illumina reads, and allowing VCF files from both platforms to be ran together. -- 2022-05-20_00:17:23
NOTE: Skipped unmapped read assembly -- 2022-05-20_00:17:23
IMPORT: VCF_Annotation(gbk_list=self.gbk, vcf_file=filtered_hapall) -- 2022-05-20_00:17:25
IMPORT: Zero_Coverage(FASTA=reference, bam=nodup_bamfile, vcf=filtered_hapall,) -- 2022-05-20_00:17:41
NOTE: Files moved to temp_dir and removed: *_unmapped*.fastq.gz, *_all.bam, *_fixmate.bam, *_pos_srt.bam, markduplicate_stats.txt, *.bai, *_filtered_hapall.vcf, *_mapfix_hapall.vcf, *_unfiltered_hapall.vcf, *_filtered_hapall_nanopore.vcf, *.sam, *.amb, *.ann, *.bwt, *.pac, *.fasta.sa, *_sorted.bam, *.dict, chrom_ranges.txt, *.fai, dup_metrics.csv -- 2022-05-20_00:17:41

Versions:
vSNP3: 3.06
Bio, 1.79
numpy, 1.22.3
pandas, 1.4.2
Minimap2: 2.24-r1122
Freebayes: v1.3.6
samtools 1.15
Using htslib 1.14

The main issue right now is that the mpileup step (using bcftools) takes about 5h per sample. I just can rerun all my samples with vSNP3 if it takes that long!

Here's the content of the Excel stats file:

sample	date	FASTA/s	Sourmash Sequence Similarity	Found_Reference_Set	FASTQ_R1	R1 File Size	R1 Read Count	R1 Length Sum	R1 Min Length	R1 Ave Length	R1 Max Length	R1 Passing Q20	R1 Passing Q30	R1 Read Quality Ave	Spoligotype Spacer Counts	Spoligotype Binary Code	Spoligotype Octal Code	Spoligotype SB Number	Groups	Aligner	Mapped Paired Reads	Mapped Single Reads	Unmapped Reads	Unmapped Percent	Unmapped Assembled Contigs	Duplicate Paired Reads	Duplicate Single Reads	Duplicate Percent of Mapped Reads	BAM/Reference File	Reference Length	Genome with Coverage	Average Depth	No Coverage Bases	Percent Ref with Zero Coverage	Quality SNPs
MBWGS009	2022-05-19_17-40-28	NC_002945v4.fasta	3.9%:3b48a55512e8dedc2b8d6e33699893bd	Mycobacterium_AF2122	MBWGS009.fastq.gz	248.4 MB	74,874	262,465,587	1	3,505.4	36,224	65.27%	36.07%	13.8717	20:23:0:27:0:24:26:24:0:28:0:16:26:26:26:0:28:32:0:23:27:28:36:32:36:43:35:35:31:0:0:0:0:0:32:38:36:35:0:0:0:0:0	binary-1101011101011110110111111111100000111100000	octal-656573377603600	SB1071	group file not provided	Minimap2	0	74,847	2,725	3.5%	skipped assembly	0	442	0.6%	MBWGS009_nodup.bam made with NC_002945v4	4,349,904	99.81%	59.1X	8,295	0.190694%	596

So any plans on improving support for Nanopore? I actually haven't tested vSNP3 on paired end data yet, so I don't know if the speed problem is only Nanopore related or not. Let me know if you need more info.

Thanks!
Marco

Best_ref error

vSNP3

I tried to download vSNP3 in MacOS Monterey, ver 12.6.7 and it does not work. Do you have any suggestions? I followed the installation instructions that are provided for Mac users. Thanks!

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs

Jooble