chunked-scatter and scatter-regions

The chunked-scatter tool takes a bed file, fasta index, sequence dictionary or vcf file as input and divides the contigs/chromosomes into overlapping chunks of a given size. These chunks are then placed in new bed files, one chromosome per file. Small chromosomes are grouped together to avoid the creation of thousands of files.

The scatter-regions tool works in a similar way but with defaults and flags tuned towards creating genome scatters for GATK tools.

The safe-scatter tool produces a more even distribution of sizes in the output bed files, and guarantees that none of the scatters are smaller than --min-scatter-size.

Installation
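
The package is released to PyPI and bioconda (see the release checklist below), so either channel should work; a sketch, assuming the package name matches the tool name chunked-scatter:

```shell
# From PyPI:
pip install chunked-scatter

# Or from bioconda, in a conda environment with the bioconda channel configured:
conda install -c bioconda chunked-scatter
```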

Usage

chunked-scatter

usage: chunked-scatter [-h] [-p PREFIX] [-S] [-P] [-c SIZE]
                       [-m MINIMUM_BP_PER_FILE] [-o OVERLAP]
                       INPUT

Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Each contig/region will be split into multiple overlapping
regions, which will be written to a new bed file. Each contig will be placed
in a new file, unless the length of the contigs/regions doesn't exceed a given
number.

positional arguments:
  INPUT                 The input file. The format is detected by the
                        extension. Supported extensions are: '.bed', '.dict',
                        '.fai', '.vcf', '.vcf.gz', '.bcf'.

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        The prefix of the output files. Output will be named
                        like: <PREFIX><N>.bed, in which N is an incrementing
                        number. Default 'scatter-'.
  -S, --split-contigs   If set, contigs are allowed to be split up over
                        multiple files.
  -P, --print-paths     If set, prints the paths of the output files to
                        STDOUT. This makes the program usable in scripts and
                        workflows.
  -c SIZE, --chunk-size SIZE
                        The size of the chunks. The first chunk in a region or
                        contig will be exactly SIZE in length, subsequent
                        chunks will be SIZE + OVERLAP long, and the final
                        chunk may be anywhere from 0.5 to 1.5 times SIZE plus
                        overlap. If a region (or contig) is smaller than SIZE
                        the original region will be returned. Defaults to 1e6.
  -m MINIMUM_BP_PER_FILE, --minimum-bp-per-file MINIMUM_BP_PER_FILE
                        The minimum number of bases represented within a
                        single output bed file. If an input contig or region
                        is smaller than this MINIMUM_BP_PER_FILE, then the
                        next contigs/regions will be placed in the same file
                        until this minimum is met. Defaults to 45e6.
  -o OVERLAP, --overlap OVERLAP
                        The number of bases which each chunk should overlap
                        with the preceding one. Defaults to 150.

scatter-regions

usage: scatter-regions [-h] [-p PREFIX] [-S] [-P] [-s SCATTER_SIZE] INPUT

Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Creates a bed file where the contigs add up approximately to
the given scatter size.

positional arguments:
  INPUT                 The input file. The format is detected by the
                        extension. Supported extensions are: '.bed', '.dict',
                        '.fai', '.vcf', '.vcf.gz', '.bcf'.

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        The prefix of the output files. Output will be named
                        like: <PREFIX><N>.bed, in which N is an incrementing
                        number. Default 'scatter-'.
  -S, --split-contigs   If set, contigs are allowed to be split up over
                        multiple files.
  -P, --print-paths     If set, prints the paths of the output files to
                        STDOUT. This makes the program usable in scripts and
                        workflows.
  -s SCATTER_SIZE, --scatter-size SCATTER_SIZE
                        The maximum size for the regions over which to
                        scatter. If contigs are not split, and a contig is
                        bigger than the maximum size, the contig will be
                        placed in its own file. Default: 1000000000.
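
The no-split grouping described for --scatter-size can be illustrated with a small Python sketch (an illustration of the documented behaviour, not the tool's source; `group_contigs` is a hypothetical helper):

```python
def group_contigs(contigs, scatter_size):
    """Group (name, length) contigs into scatters whose total length stays
    at or under scatter_size; an oversized contig gets its own scatter
    (contigs are not split in this sketch)."""
    scatters, current, current_size = [], [], 0
    for name, length in contigs:
        # Flush the current scatter before it would overflow.
        if current and current_size + length > scatter_size:
            scatters.append(current)
            current, current_size = [], 0
        current.append((name, length))
        current_size += length
    if current:
        scatters.append(current)
    return scatters
```

For example, grouping chr1 (3 Mb) and chr2 (0.5 Mb) with a 3 Mb scatter size yields two scatters; with the default 1e9 both contigs share one scatter.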

safe-scatter

usage: safe-scatter [-h] [-p PREFIX] [-P] [-c SCATTER_COUNT]
                    [-m MIN_SCATTER_SIZE] [--mix-small-regions]
                    INPUT

Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Creates a bed file where the contigs add up to the average
scatter size to within min_scatter_size. Note that this tool always splits up
contigs.

positional arguments:
  INPUT                 The input file. The format is detected by the
                        extension. Supported extensions are: '.bed', '.dict',
                        '.fai', '.vcf', '.vcf.gz', '.bcf'.

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        The prefix of the output files. Output will be named
                        like: <PREFIX><N>.bed, in which N is an incrementing
                        number. Default 'scatter-'. (default: scatter-)
  -P, --print-paths     If set, prints the paths of the output files to
                        STDOUT. This makes the program usable in scripts and
                        workflows. (default: False)
  -c SCATTER_COUNT, --scatter-count SCATTER_COUNT
                        The number of chunks to scatter the regions in. All
                        chunks will be within --min-scatter-size of each other
                        except for the final chunk. (default: 50)
  -m MIN_SCATTER_SIZE, --min-scatter-size MIN_SCATTER_SIZE
                        The minimum size of a scatter. This tool will never
                        generate regions smaller than this value, unless the
                        original regions are smaller. (default: 10000)
  --mix-small-regions   Mix small regions with regular regions in the input.
                        This can be useful when there is a bias in the
                        composition of the regions. For example, the human
                        reference genome lists all unplaced contigs (which
                        are small and difficult to process) at the end of the
                        file, which means they would all end up in the same
                        bed file. Enabling mixing prevents this. (default:
                        False)
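
The even-split contract described above can be sketched in a few lines of Python (an illustration of the documented behaviour, not the tool's actual implementation; `even_split` is a hypothetical helper):

```python
def even_split(length, count, min_size):
    """Split a contig of `length` bases into about `count` chunks, none
    smaller than `min_size` (unless the contig itself is smaller)."""
    size = max(length // count, min_size)
    splits = []
    pos = 0
    while pos < length:
        end = min(pos + size, length)
        # Merge a would-be final fragment smaller than min_size into
        # the current chunk instead of emitting it separately.
        if length - end < min_size:
            end = length
        splits.append((pos, end))
        pos = end
    return splits

# A 100 kb contig split in 4 with a 10 kb minimum:
print(even_split(100_000, 4, 10_000))
# [(0, 25000), (25000, 50000), (50000, 75000), (75000, 100000)]
```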

Examples

bed file

Given a bed file located at /data/regions.bed:

chr1	100	1000
chr1	2000	16000
chr2	5000	10000

The command:

chunked-scatter -p /data/scatter_ -m 1000 -c 5000 /data/regions.bed

Will produce the following two output files:

  • /data/scatter_0.bed:
    chr1	100	1000
    chr1	2000	7000
    chr1	6850	12000
    chr1	11850	16000
    
  • /data/scatter_1.bed:
    chr2	5000	10000
    
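The boundaries in scatter_0.bed follow from the chunking rules quoted in the help text above; a minimal Python sketch of that logic (an illustration, not the tool's source):

```python
def chunk_region(start, end, size=5000, overlap=150):
    """Split [start, end) per the chunked-scatter rules: the first chunk is
    SIZE long, later chunks SIZE + OVERLAP, and the final chunk absorbs any
    remainder shorter than half a chunk."""
    chunks = []
    pos = start
    first = True
    while pos < end:
        chunk_end = pos + size + (0 if first else overlap)
        # A short remainder is folded into this (final) chunk.
        if end - chunk_end < 0.5 * size:
            chunk_end = end
        chunks.append((pos, chunk_end))
        if chunk_end == end:
            break
        pos = chunk_end - overlap  # next chunk starts OVERLAP bases earlier
        first = False
    return chunks

# Reproduces the chr1 2000-16000 lines of scatter_0.bed:
print(chunk_region(2000, 16000))
# [(2000, 7000), (6850, 12000), (11850, 16000)]
```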

dict file

Given a dict file located at /data/ref.dict:

@SQ	SN:chr1	LN:3000000
@SQ	SN:chr2	LN:500000

The command:

chunked-scatter -p /data/scatter_ /data/ref.dict

Will produce the following output file at /data/scatter_0.bed:

chr1	0	1000000
chr1	999850	2000000
chr1	1999850	3000000
chr2	0	500000

chunked-scatter's People

Contributors

davycats, redmar-van-den-berg, rhpvorderman
chunked-scatter's Issues

Release 1.0.0

Release checklist

  • Check outstanding issues on JIRA and GitHub.
  • Check that the latest documentation looks fine.
  • Create a release branch.
    • Set the version to a stable number.
    • Change the current development version in CHANGELOG.rst to the stable
      version.
  • Merge the release branch into master.
  • Create a test PyPI package from the master branch. (Instructions.)
  • Install the packages from the test PyPI repository to see if they work.
  • Create an annotated tag with the stable version number. Include changes
    from history.rst.
  • Push the tag to the remote.
  • Push the tested packages to PyPI.
  • Merge the master branch back into develop.
  • Add the updated version number to develop.
  • Create a new release on GitHub.
  • Update the package on bioconda.

