suwonglab / arcsv

Complex structural variant detection from WGS data

License: MIT License

Topics: genomics, variant-calling, structural-variation

arcsv's Introduction

ARC-SV: Automated Reconstruction of Complex Structural Variants

ARC-SV is a structural variant caller for paired-end, whole-genome sequencing data. For methodological details, please see our preprint: https://doi.org/10.1101/200170

This software was developed in the Wong Lab at Stanford University with funding from NSF grant DGE-114747 and NIH grants T32-GM096982, P50-HG007735, and R01-HG007834.

Table of Contents

  Installation
  Getting reference resources
  Usage
  Description of output

Installation

ARC-SV and its dependencies can be installed as follows:


git clone https://github.com/SUwonglab/arcsv.git
cd arcsv
pip3 install --user .

macOS users with a Homebrew-installed Python should omit --user above.

The installed location of the main script, arcsv, must be on your PATH. The relevant folder is probably /usr/bin, /usr/local/bin, or ~/.local/bin.
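
A quick way to check that the script is visible (arcsv call -h is the help command documented under Usage below):

which arcsv
arcsv call -h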

Example: Using conda

If you want an isolated environment, or if installing ARC-SV with pip causes problems, we recommend using conda:

conda create -n arcsv --strict-channel-priority -c conda-forge -c bioconda \
  python=3 pysam numpy scipy scikit-learn matplotlib python-igraph
conda activate arcsv

cd /path/to/arcsv
pip3 install .

To run ARC-SV in future login sessions, first activate the environment with conda activate arcsv.

Example: installing system dependencies without conda

The following commands should install ARC-SV and all dependencies on a fresh copy of Ubuntu:


# update packages
sudo apt-get update

# install pip and setuptools
sudo apt install python3-pip
pip3 install -U pip setuptools

# extra requirements needed for igraph
sudo apt install libxml2-dev zlib1g-dev

# arcsv setup
sudo apt install git
git clone https://github.com/SUwonglab/arcsv.git
cd arcsv
pip3 install --user .

# add this to your .bash_profile
export PATH="$HOME/.local/bin:$PATH"
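
To confirm the setup in the same session, reload the profile and check that the script is found:

source ~/.bash_profile
which arcsv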

Getting reference resources

You will need a BED file containing the locations of assembly gaps in your reference genome. The resources/ folder contains files for hg19, GRCh37, and hg38, which were retrieved as follows:

curl http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/gap.txt.gz | \
     zcat | \
     cut -f2-4 > hg19_gap.bed
     
curl http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/gap.txt.gz | \
     zcat | \
     cut -f2-4 > hg38_gap.bed

or for the NCBI reference (with "2" instead of "chr2"):


curl http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/gap.txt.gz | \
     zcat | \
     cut -f2-4 | \
     sed 's/^chr//' \
     > GRCh37_gap.bed
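
As a sanity check (head is the only extra tool assumed here), each output file should contain one tab-separated chrom/start/end triple per line:

# each line should look like: <chrom><TAB><start><TAB><end>
head -3 GRCh37_gap.bed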

Usage

Calling SVs

To call SVs:

arcsv call -i reads.bam -r chrom[:start-end] -R reference.fasta -G reference_gaps.bed -o output_dir

# To see more detailed documentation on all possible arguments
arcsv call -h
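
Since ARC-SV processes one chromosome at a time (see Filtering and merging below), a whole-genome run is typically a loop over chromosomes. A minimal sketch, assuming bare chromosome names ("1", "2", ...) as in an NCBI-style reference:

# call each autosome separately; folder names match the arcsv_chr* pattern used below
for c in $(seq 1 22); do
  arcsv call -i reads.bam -r $c -R reference.fasta -G reference_gaps.bed -o arcsv_chr$c
done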

Running the example

The folder example/ in this repository contains files to test the ARC-SV installation; if everything is working, the final diff below should produce no output:

arcsv call -i example/input.bam -r 20:0-250000 -o my_example_output \
  -R example/reference.fa -G example/gaps.bed
  
diff my_example_output/arcsv_out.tab example/expected_output.tab

Filtering and merging output files

ARC-SV works on a single chromosome at a time. Supposing your output folders are named "arcsv_chr#", you can merge and/or filter the results as follows:


# Recommended settings
arcsv filter-merge --min_size 50 --no_insertions arcsv_chr*

# If no filtering is desired
arcsv filter-merge arcsv_chr*

Description of output

For each cluster of candidate breakpoints, ARC-SV attempts to resolve the local structure of both haplotypes. The output file arcsv_out.tab contains one line for each non-reference haplotype called. A call typically consists of a single SV (simple or complex), but some calls contain multiple variants that were called together.

Where multiple values are given, as in svtype, the order is left to right in the alternate haplotype, which is shown in the rearrangement column.

All genomic positions in arcsv_out.tab are 0-indexed for compatibility with BED files. (arcsv_out.vcf is still 1-indexed as required.)

chrom: chromosome name
minbp: position of the first novel adjacency
maxbp: position of the last novel adjacency
id: identifier based on the region in which the event was called
svtype: classification of each simple SV/complex breakpoint in this event
complextype: complex SV classification
num_sv: number of simple SVs plus complex SV breakpoints in this call
bp: all breakpoints, i.e., the boundaries of the blocks in the "reference" column (including the flanking blocks)
bp_uncertainty: width of the uncertainty interval around each breakpoint, in base pairs. For odd widths, there is 1 bp more uncertainty on the right side of the breakpoint
reference: configuration of genomic blocks in the reference. Blocks are named A through Z, then a through z, then A1 through Z1, etc.
rearrangement: predicted configuration of genomic blocks in the sample. Inverted blocks are followed by a tick mark, e.g., A', and insertions are represented by underscores (_)
len_affected: length of reference sequence affected by this rearrangement (plus the length of any novel insertions). For complex SVs with no novel insertions, this is often smaller than maxbp - minbp, i.e., the "span" of the rearrangement in the reference
filter: currently, INSERTION if an insertion is present, otherwise PASS
sv_bp: breakpoint positions for each simple SV/complex breakpoint in the event (there are num_sv pairs of non-adjacent reference positions, each describing a novel adjacency)
sv_bp_uncertainties: breakpoint uncertainties for each simple SV/complex breakpoint in the event
gt: genotype (either HET or HOM)
af: allele fraction for the called variant (either 0.5 or 1.0, unless --allele_fraction_list was set)
inslen: length of each insertion in the call
sr_support: number of supporting split reads for each simple SV and complex breakpoint (length = num_sv)
pe_support: number of supporting discordant read pairs for each simple SV and complex breakpoint (length = num_sv)
score_vs_ref: log-likelihood ratio score for the call versus the reference: log( p(data | called haplotype) / p(data | reference) )
score_vs_next: log-likelihood ratio score for the call versus the next best call: log( p(data | called haplotype) / p(data | next best call) )
rearrangement_next: configuration of genomic blocks for the next best call (may contain more blocks than the "reference" and "rearrangement" columns)
num_paths: number of paths through this portion of the adjacency graph; the called haplotype corresponds to one such path
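
Because positions in arcsv_out.tab are 0-indexed, calls convert directly to BED intervals. A minimal sketch, assuming the file carries a header row naming the fields above (an assumption, not verified against the source):

# keep PASS calls and write chrom/minbp/maxbp as a BED file
awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
            $col["filter"] == "PASS" { print $col["chrom"], $col["minbp"], $col["maxbp"] }' \
    OFS='\t' arcsv_out.tab > arcsv_calls.bed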

arcsv's People

Contributors: jgarthur

arcsv's Issues

arcsv container?

Hi,

I am very interested in testing arcsv, which I heard about at ASHG 2022 in LA, CA. However, I am struggling to install it on a cluster where I have little support and no admin rights. I was wondering if a container version of arcsv is available?

Thanks for your help,
Best regards,
Tatiana

cram?

Hi, this looks very useful. Will it work with CRAM files as input? Thanks.

Can arcsv genotype samples using a reference complex-SV VCF file?

Hi,

Thank you for developing this wonderful tool.

I have learned that arcsv can call complex SVs in each sample. I was wondering whether it is possible to merge all samples' complex-SV calls into one file and use the merged SVs to genotype all the samples. Or, if I identified some complex SVs by comparing two genomes, could I use arcsv to genotype those SVs in population samples?

Thank you

Best wishes,
Songtao Gui

UnicodeEncodeError: 'ascii' codec can't encode character

Hi,

I am running arcsv to call complex SVs. The error is raised at line 22 of the function below, which I have highlighted with stars.

def sv_affected_len(path, blocks):
    # ref_path = list(range(0, 2 * len(blocks)))
    n_ref = len([x for x in blocks if not x.is_insertion()])
    ref_block_num = list(range(n_ref))
    ref_string = ''.join(chr(x) for x in range(ord('A'), ord('A') + n_ref))

    print('ref_string: {0}'.format(ref_string))

    path_block_num = []
    path_string = ''
    for i in path[1::2]:
        block_num = int(np.floor(i / 2))
        path_block_num.append(block_num)
        if i % 2 == 1:          # forward orientation
            path_string += chr(ord('A') + block_num)
        else:                   # reverse orientation
            path_string += chr(ord('A') + block_num + 1000)

    **print('path_string: {0}'.format(path_string))**

    affected_idx_1, affected_idx_2 = align_strings(ref_string, path_string)
    affected_block_1 = set(ref_block_num[x] for x in affected_idx_1)
    affected_block_2 = set(path_block_num[x] for x in affected_idx_2)
    affected_blocks = affected_block_1.union(affected_block_2)
    
    affected_len = sum(len(blocks[i]) for i in affected_blocks)
    return affected_len

Please help, thanks!
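
A likely cause, judging from the code: the reverse-orientation branch builds path_string from code points above ASCII (ord('A') + block_num + 1000), so the highlighted print fails whenever stdout uses the 'ascii' codec. A possible workaround, assuming the locale rather than the algorithm is at fault, is to force UTF-8 I/O before re-running:

# make Python use UTF-8 for stdout/stderr regardless of locale
export PYTHONIOENCODING=utf-8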

bamparser_streaming.py typo at line 56

.local/lib/python3.4/site-packages/arcsv/bamparser_streaming.py", line 56
if not_primary(aln) or aln.mpos < start or aln.mpos >= endor aln.is_duplicate:
^
SyntaxError: invalid syntax
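
(The evident fix is the missing space before "or": the condition should read aln.mpos >= end or aln.is_duplicate.)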

`ValueError: start out of range (-1)` and `imp module is deprecated`

I have been using arcsv to genotype SVs in a series of samples aligned using BWA without problems. But now I'm processing a series of samples generated using 10X and aligned with emerald, and I got the following error:

/home/carleshf/miniconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
[run] ref files {'reference': '/media/NFS2/refdata-b37-2.1.0/fasta/genome.fa', 'gap': '/media/NFS/Carles/SV/tools/arcsv/resources/GRCh37_gap.bed'}
[run] calling SVs in 2:0-243199373

Traceback (most recent call last):
  File "/home/carleshf/miniconda2/envs/py36/bin/arcsv", line 156, in <module>
    main()
  File "/home/carleshf/miniconda2/envs/py36/bin/arcsv", line 26, in main
    run(args)
  File "/home/carleshf/miniconda2/envs/py36/lib/python3.6/site-packages/arcsv/call_sv.py", line 93, in run
    call_sv(opts, inputs, reference_files)
  File "/home/carleshf/miniconda2/envs/py36/lib/python3.6/site-packages/arcsv/call_sv.py", line 161, in call_sv
    pb_out = parse_bam(opts, reference_files, bamfiles)
  File "/home/carleshf/miniconda2/envs/py36/lib/python3.6/site-packages/arcsv/bamparser_streaming.py", line 109, in parse_bam
    bam_has_unmapped = has_unmapped_records(bam)
  File "/home/carleshf/miniconda2/envs/py36/lib/python3.6/site-packages/arcsv/bamparser_streaming.py", line 491, in has_unmapped_records
    if any([a.is_unmapped and a.qname == aln.qname for a in alns]):
  File "/home/carleshf/miniconda2/envs/py36/lib/python3.6/site-packages/arcsv/bamparser_streaming.py", line 491, in <listcomp>
    if any([a.is_unmapped and a.qname == aln.qname for a in alns]):
  File "/home/carleshf/miniconda2/envs/py36/lib/python3.6/site-packages/arcsv/bamparser_streaming.py", line 430, in <genexpr>
    return itertools.chain.from_iterable(b.fetch(*o1, **o2) for b in self.bamlist)
  File "pysam/libcalignmentfile.pyx", line 855, in pysam.libcalignmentfile.AlignmentFile.fetch (pysam/libcalignmentfile.c:11188)
  File "pysam/libcalignmentfile.pyx", line 783, in pysam.libcalignmentfile.AlignmentFile.parse_region (pysam/libcalignmentfile.c:10755)
ValueError: start out of range (-1)

I don't think the warning has any impact on the caller, but I don't understand what the problem with the BAM files is. Any help is welcome!

Impossible to run example

I am facing some difficulties when trying to run the example provided on GitHub; I get the following error:
[run] ref files {'reference': 'example/reference.fa', 'gap': 'example/gaps.bed'}
[run] calling SVs in 20:0-250000

[parse_bam] extracting approximate library stats
[parse_bam] read_len: 100; rough_insert_median: 367.0
[library_stats] processed 200000 reads (75932 chunks) for each lib
[library_stats] processed 400000 reads (145049 chunks) for each lib
[library_stats] processed 600000 reads (210832 chunks) for each lib
[library_stats] processed 800000 reads (272950 chunks) for each lib
[library_stats] processed 1000000 reads (345156 chunks) for each lib
Traceback (most recent call last):
File "/Users/ebattist/Library/Python/3.11/bin/arcsv", line 156, in
main()
File "/Users/ebattist/Library/Python/3.11/bin/arcsv", line 26, in main
run(args)
File "/Users/ebattist/Library/Python/3.11/lib/python/site-packages/arcsv/call_sv.py", line 93, in run
call_sv(opts, inputs, reference_files)
File "/Users/ebattist/Library/Python/3.11/lib/python/site-packages/arcsv/call_sv.py", line 161, in call_sv
pb_out = parse_bam(opts, reference_files, bamfiles)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ebattist/Library/Python/3.11/lib/python/site-packages/arcsv/bamparser_streaming.py", line 122, in parse_bam
als = extract_approximate_library_stats(opts, bam, rough_insert_median)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ebattist/Library/Python/3.11/lib/python/site-packages/arcsv/bamparser_streaming.py", line 87, in extract_approximate_library_stats
insert_pmf = [pmf_kernel_smooth(il, 0, opts['insert_max_mu_multiple'] * mu,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ebattist/Library/Python/3.11/lib/python/site-packages/arcsv/bamparser_streaming.py", line 87, in
insert_pmf = [pmf_kernel_smooth(il, 0, opts['insert_max_mu_multiple'] * mu,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ebattist/Library/Python/3.11/lib/python/site-packages/arcsv/bamparser_streaming.py", line 467, in pmf_kernel_smooth
pct = np.percentile(a_trunc, (25, 75))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<array_function internals>", line 200, in percentile
File "/opt/homebrew/lib/python3.11/site-packages/numpy/lib/function_base.py", line 4205, in percentile
return _quantile_unchecked(
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/numpy/lib/function_base.py", line 4473, in _quantile_unchecked
return _ureduce(a,
^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/numpy/lib/function_base.py", line 3752, in _ureduce
r = func(a, **kwargs)
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/numpy/lib/function_base.py", line 4639, in _quantile_ureduce_func
result = _quantile(arr,
^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/numpy/lib/function_base.py", line 4756, in _quantile
result = _lerp(previous,
^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/numpy/lib/function_base.py", line 4575, in _lerp
lerp_interpolation = asanyarray(add(a, diff_b_a * t, out=out))
~~~~~~~~~^~~
File "/opt/homebrew/lib/python3.11/site-packages/numpy/matrixlib/defmatrix.py", line 218, in mul
return N.dot(self, asmatrix(other))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<array_function internals>", line 200, in dot
ValueError: shapes (2,976007) and (2,1) not aligned: 976007 (dim 1) != 2 (dim 0)

It is very likely due to a package version difference. Could you send me the exact package requirements and the version of Python used? The setup.py only specifies the version of pysam.
