assembly_alignment's Introduction

assembly_alignment

Preliminary assembly-assembly alignment analysis

Features

  • Creates a stats table with 1 row per assembly sequence with numbers per alignment discrepancy type
  • Makes graphs, if the assembly has chromosomes
  • Creates per assembly bed files for each discrepancy type

Future work

  • Update stats to add the top ten by size for each event type

Install

Ensure you're using Python 2.7 and have bedtools installed, then:

$ git clone https://github.com/deannachurch/assembly_alignment
$ cd assembly_alignment

# Work around a pysam issue, https://github.com/pysam-developers/pysam/issues/247
$ export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"

# Create a virtualenv and activate it
$ virtualenv env
$ source env/bin/activate

# Install dependencies
$ pip install -r requirements.txt

Run

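Create a config file that specifies the paths to the input files and the output files to create, then pass it to assm_align.py with --config:

$ ./assm_align.py --config resources/assm_align_cfg.yml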

assembly_alignment's People

Contributors

deannachurch, eweitz

assembly_alignment's Issues

Stats files completely empty for CHM1_1.1

Not sure what is going on with CHM1_1.1, but regardless of which assembly I run it against (as assm1 or assm2), the stats report is always full of 0s. Compare the results for CHM1_1.1 vs. GRCh38:

/am/ftp-pub-remap/Homo_sapiens/1.7/CHM1_1.1-GRCh38.report
/am/ftp-pub-remap/Homo_sapiens/1.7/GRCh38-CHM1_1.0.report

The reports aren't empty for CHM1_1.1, so I think there's a bug in the stats calculations here.

head CHM1_1.1_GRCh38_alignment_stats.txt

CHM1_1.1 vs GRCh38 assembly alignment report

2016-02-02

Overall stats

Sequence NoHit UnGap_NoHit Collapse(SP) Expansion(SP Only) Inversion Mix

chr1 0 0 0 0 0 0
chr10 0 0 0 0 0 0
chr11 0 0 0 0 0 0
chr12 0 0 0 0 0 0
chr13 0 0 0 0 0 0
chr14 0 0 0 0 0 0

head GRCh38_CHM1_1.1_alignment_stats.txt

GRCh38 vs CHM1_1.1 assembly alignment report

2016-02-02

Overall stats

Sequence NoHit UnGap_NoHit Collapse(SP) Expansion(SP Only) Inversion Mix

1 22628242 4357742 418963 573968 97068 21697
2 4531867 3007867 368258 1011210 124002 11579
3 3590224 3419924 30850 332937 0 2866
4 2979453 2628491 151853 477392 0 3909
5 4170177 3996977 184226 1862318 89276 367
6 3021341 2401341 140231 866387 299095 198

Image creation fails if chr lists aren't an exact match.

Image writing failed for a CHM1_1.1/GRCh38 run because CHM1_1.1 doesn't have a Y chromosome. Not urgent, but figured I'd report it.

Not attaching any files, since these assemblies and align reports are both available from GenBank/Remap.

./assm_align.py --config resources/assm_align_cfg.CHM1_1.1_GRCh38.yml
INFO: ================assm_align.py started: log file=log/assm_align_2016-02-02_1120.log================
INFO: Read CHM1_1.1, sequences: 24, chromosomes: 23
INFO: Read GRCh38, sequences: 455, chromosomes: 24
INFO: Processing CHM1_1.1
INFO: Procssing GRCh38
INFO: Writing stats file: /panfs/pan1.be-md.ncbi.nlm.nih.gov/genome_maint/work/CG-3580/CHM1_1.1_GRCh38/stats/CHM1_1.1_GRCh38_alignment_stats.txt
INFO: Writing stats file: /panfs/pan1.be-md.ncbi.nlm.nih.gov/genome_maint/work/CG-3580/CHM1_1.1_GRCh38/stats/GRCh38_CHM1_1.1_alignment_stats.txt
INFO: Making assembly 1 beds
INFO: Making assembly 2 beds
INFO: Starting image production as both assemblies have chromosomes
INFO: Making collapse image
CRITICAL: chrX not in both lists
ERROR: Chromosome lists not the same, not making graphic
INFO: Making expansion image
CRITICAL: chrX not in both lists
ERROR: Chromosome lists not the same, not making graphic
INFO: Making no hit image
CRITICAL: chrX not in both lists
ERROR: Chromosome lists not the same, not making graphic
INFO: Making ungap no hit image
CRITICAL: chrX not in both lists
ERROR: Chromosome lists not the same, not making graphic
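
One possible workaround, sketched here under stated assumptions rather than taken from assm_align.py: plot only the chromosomes that both assemblies share instead of requiring identical lists. The stats files in the issue above name sequences as chr1 in one assembly and 1 in the other, so this sketch also strips a leading "chr" before comparing. The helper names normalize_chrom and shared_chromosomes are hypothetical.

```python
def normalize_chrom(name):
    """Strip a leading 'chr' so that 'chr1' and '1' compare equal.

    Assumption: the only naming difference between the two assemblies
    is the optional 'chr' prefix.
    """
    return name[3:] if name.startswith('chr') else name


def shared_chromosomes(chroms1, chroms2):
    """Return normalized names found in both lists, in the order of chroms1."""
    in_second = set(normalize_chrom(c) for c in chroms2)
    return [normalize_chrom(c) for c in chroms1
            if normalize_chrom(c) in in_second]


# Hypothetical example: the first assembly has no Y chromosome.
assm1_chroms = ['chr1', 'chr2', 'chrX']
assm2_chroms = ['1', '2', 'X', 'Y']
print(shared_chromosomes(assm1_chroms, assm2_chroms))
```

Chromosomes present in only one assembly (Y in this example) are simply left out of the image rather than aborting it.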

Bug report from Valerie

Hi Deanna:

I ran into an issue with this tonight when I tried to run it with one of the Assemblathon assemblies. Not quite sure what that error means, but it clearly didn't like one of the unitigs. Maybe an issue with the name? Any chance you can take a look?

In case you want to try this yourself:
seq report: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCA_001297185.1.assembly.txt

Align reports are attached.

Thanks,
Valerie

./assm_align.py --config resources/assm_align_cfg.yml
INFO: ================assm_align.py started: log file=log/assm_align_2016-02-01_2238.log================
INFO: Read PacBioCHM1_r2_GenBank_08312015, sequences: 3641, chromosomes: 0
INFO: Read GRCh38, sequences: 455, chromosomes: 24
INFO: Processing PacBioCHM1_r2_GenBank_08312015
INFO: Procssing GRCh38
INFO: Writing stats file: /panfs/pan1.be-md.ncbi.nlm.nih.gov/genome_maint/work/CG-3580/GCA_001297185.1_GRCh38/stats/GCA_001297185.1_GRCh38_alignment_stats.txt
Traceback (most recent call last):
  File "./assm_align.py", line 432, in <module>
    main()
  File "./assm_align.py", line 376, in main
    writeStats(stats1_out, assm1['name'], assm2['name'], assm1_dict)
  File "./assm_align.py", line 261, in writeStats
    sort_seq_list=sorted(seq_list, key=lambda item: (int(item.partition(' ')[0]) if item[0].isdigit() else float('inf'), item))
  File "./assm_align.py", line 261, in <lambda>
    sort_seq_list=sorted(seq_list, key=lambda item: (int(item.partition(' ')[0]) if item[0].isdigit() else float('inf'), item))
ValueError: invalid literal for int() with base 10: '001041F_p_quiver_quiver'
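
The traceback points at the sort key on line 261. It checks only the first character with item[0].isdigit() and then calls int() on the whole first token, so a name such as '001041F_p_quiver_quiver' passes the check and fails the conversion. Below is a minimal sketch of a more tolerant key, assuming that names which are not purely numeric should sort after the numeric ones. sequence_sort_key is a hypothetical helper, not code from assm_align.py.

```python
def sequence_sort_key(item):
    """Sort purely numeric names numerically and everything else after them.

    Assumption: the name is the text before the first space, as in the
    original lambda, and names use plain ASCII characters.
    """
    token = item.partition(' ')[0]
    if token.isdigit():
        # The whole token is digits, so int() cannot raise here.
        return (int(token), item)
    # Mixed or non-numeric names sort last, ordered alphabetically.
    return (float('inf'), item)


# Hypothetical sequence names, including the one from the traceback.
seq_list = ['10', '2', '001041F_p_quiver_quiver', 'X', '1']
sort_seq_list = sorted(seq_list, key=sequence_sort_key)
print(sort_seq_list)
```

The only change from the original key is that the digit test covers the whole token rather than its first character.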

Streamline installation

Adding a requirements.txt file and some set-up instructions would help streamline installation. I'll open a pull request with updates that have helped us get up and running with this package at NCBI.

Code clean up

Need to clean up the code to remove the repeated sorting of sequence IDs.
