jason-weirather / alignqc

Long read alignment analysis. Generates reports on sequence alignments covering mappability vs read sizes, error patterns, annotations, and rarefaction curve analysis. The most basic analysis only requires a BAM file, and outputs a web-browser-compatible xhtml file to visualize/share/store/extract analysis results.

License: Apache License 2.0

Python 79.86% CSS 0.93% R 19.20% Shell 0.01%

alignqc's Issues

Reference sequence basename and comment

Hi, Jason

Running AlignQC with a reference sequence whose header contains a comment (separated from the sequence basename by a space) fails with a KeyError.

For example, a reference sequence with the following name:

>gi|556503834|ref|NC_000913.3| Escherichia coli str. K-12 substr. MG1655, complete genome

generates the following error:

Traceback (most recent call last):
  File ~/AlignQC/bin/alignqc, line 44, in <module>
    main()
  File ~/AlignQC/bin/alignqc, line 24, in main
    analyze.external_cmd(" ".join(operable_argv),version=version)
  File ~/AlignQC/utilities/analyze.py, line 77, in external_cmd
    main(args)
  File ~/AlignQC/utilities/analyze.py, line 44, in main
    prepare_all_data.external(args)
  File ~/AlignQC/utilities/prepare_all_data.py, line 569, in external
    main(args)
  File ~/AlignQC/utilities/prepare_all_data.py, line 65, in main
    make_data_bam_reference(args)
  File ~/AlignQC/utilities/prepare_all_data.py, line 300, in make_data_bam_reference
    bam_to_context_error_plot.external_cmd(cmd)
  File ~/AlignQC/utilities/bam_to_context_error_plot.py, line 140, in external_cmd
    main(args)
  File ~/AlignQC/utilities/bam_to_context_error_plot.py, line 48, in main
    epf.add_alignment(e)
  File ~/AlignQC/pylib/Bio/Errors.py, line 46, in add_alignment
    ae = AlignmentErrors(align)
  File ~/AlignQC/pylib/Bio/Errors.py, line 453, in __init__
    astrings = self._alignment.get_alignment_strings(min_intron_size=self._min_intron_size)
  File ~/AlignQC/pylib/Bio/Align.py, line 122, in get_alignment_strings
    tdone += textra+ref[t.chr][t.start-1:t.end].upper()
  File ~/AlignQC/pylib/Bio/Format/Fasta.py, line 74, in __getitem__
    return self._seqs[key]
KeyError: 'gi|556503834|ref|NC_000913.3|'

The reason seems to be that AlignQC handles reference names together with their comments, whereas alignment tools write only the basenames to the output SAM file (tested with BWA-MEM and GraphMap; my mapping tool minialign is designed the same way). How about changing AlignQC to handle reference sequences by their basenames only? And would that cause any problems, such as name conflicts?

Here is my quick fix for this, based on master: https://github.com/ocxtal/AlignQC/tree/basename

Thanks,

Hajime
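
For readers following the issue above, here is a minimal sketch of the basename idea, not the code in the linked branch. It assumes the basename is the header text up to the first whitespace, and the function name `read_fasta_by_basename` is hypothetical. It raises on duplicate basenames, which is one way to surface the name-conflict concern raised in the question.

```python
def read_fasta_by_basename(lines):
    """Parse FASTA lines into {basename: sequence}.

    Assumption: the basename is the header text up to the first
    whitespace, which is what aligners typically write to SAM.
    """
    seqs = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                seqs[name] = "".join(chunks)
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            if name in seqs:
                # Two headers share a basename: refuse rather than overwrite.
                raise ValueError("duplicate reference basename: %s" % name)
            chunks = []
        elif line:
            chunks.append(line.strip())
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


if __name__ == "__main__":
    example = [
        ">gi|556503834|ref|NC_000913.3| Escherichia coli str. K-12 substr. MG1655, complete genome\n",
        "ACGT\n",
    ]
    print(list(read_fasta_by_basename(example)))
```

With this keying, a lookup by the name found in the SAM file (`gi|556503834|ref|NC_000913.3|`) would succeed even when the FASTA header carries a comment.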

Homopolymer Errors

Hi Jason,

Would you mind clarifying how the program identifies homopolymer indels? For example, how many bases are required on either side of the indel for it to be classified as one? Many thanks!

Jean-Michel

Adapter sequences in reads

Hi, do adapter sequences in the reads affect, for example, the evaluation of the percentage of the read that aligns to the reference? And what about the poly(A) tail: does it have an effect, given that Nanopore can now basecall very long homopolymers (>30)?

Short reads analysis

Is there some reason why AlignQC would not work if short read data is used instead of long read data?

`bioconda` package

Hi @jason-weirather ,

It seems that you already have a conda package on your own channel, vacation, but I have trouble installing from there ("solving environment" takes forever...).
Would you mind submitting your recipe to bioconda?

IOError: Not a gzipped file

Hello Jason,

I've been trying to use alignqc analyze on my mapped PacBio long reads (aligned by GMAP and sorted) with a mouse genome reference FASTA and GTF file. However, I keep running into the same issue: IOError: Not a gzipped file.

Any help will be greatly appreciated. Thank you!

Best,
Szi Kay

(anaCogent) -bash-4.2$ alignqc analyze /mnt/data1/Szi/C21/GMAP/C21.hq_isoforms.fastq.sorted.bam -g /mnt/data1/Szi/reference/GRCm38.p4.genome.fa -t /mnt/data1/Szi/reference/GRCm38.p4.gtf -o C21.alignqc.xhtml --output_folder /mnt/data1/Szi/C21/align
Using Rscript version:
R scripting front-end version 3.3.2 (2016-10-31)
Traceback (most recent call last):
  File "/home/sLeung/.conda/envs/anaCogent/bin/alignqc", line 11, in <module>
    sys.exit(entry_point())
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/site-packages/alignqc/alignqc.py", line 47, in entry_point
    main(args,operable_argv)
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/site-packages/alignqc/alignqc.py", line 17, in main
    analyze.external_cmd(operable_argv,version=version)
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/site-packages/alignqc/analyze.py", line 88, in external_cmd
    main(args)
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/site-packages/alignqc/analyze.py", line 48, in main
    gobj = GTFFile(ginf)
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/site-packages/seqtools/format/gtf.py", line 19, in __init__
    for line in filehandle:
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/gzip.py", line 464, in readline
    c = self.read(readsize)
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/gzip.py", line 268, in read
    self._read(readsize)
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/gzip.py", line 303, in _read
    self._read_gzip_header()
  File "/home/sLeung/.conda/envs/anaCogent/lib/python2.7/gzip.py", line 197, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
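
The traceback above shows the GTF being read through Python's gzip module although the file passed on the command line has no .gz extension. As a hedged illustration only (this is not AlignQC's or seqtools' code, it targets Python 3 rather than the Python 2.7 in the log, and the helper name `open_maybe_gzipped` is hypothetical), a reader can check the two gzip magic bytes before deciding how to open the file:

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzipped(path):
    """Return a text-mode handle, decompressing only when the file
    actually starts with the gzip magic bytes."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")
```

Checking content rather than trusting an extension or an option means a plain-text annotation and a compressed one are both read without raising this error.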

Isoforms

Hi Jason,

Happy new year!

Is there any way to get the isoform information from AlignQC? For all annotated isoforms, not just the top 5.

Best,
Rojin

Hi @rojinsafavi, sorry for the delay. I'm busy these days, so if I lose track of these I appreciate getting the reminder :) It looks like a problem streaming the data. Is your alignment file sorted by genomic position? If it is, there is a second, more complicated problem that this may be due to. If it is sorted by position, do you know whether the chromosomes in your index are in alphabetical order? I've noticed that different aligners behave differently when it comes to sorting: sometimes they sort chromosomes alphabetically, sometimes they do other things. And I may be making the alphabetical assumption in the ordering check. Something you can try is my sort tool that's in seqtools:

seq-tools sort --bam yourbam -o newbam

If this is the cause I may rethink my order check because I don't want to require another sort before running. Thanks for your help in figuring this out.

Originally posted by @jason-weirather in #7 (comment)

Great

It's a good tool

Error X11 module cannot be loaded

Hello,
I'm running alignqc with Python 2.7 and R 3.4.1.
I get the following error message during .png creation for 2 R scripts (plot_base_error_context.r and plot_alignment_errors.r):

Error in png(args[2]) : X11 module cannot be loaded
In addition: Warning message:
In png(args[2]) :
  unable to load shared object '/opt/conda/envs/alignqc/lib/R/modules//R_X11.so':
  libXt.so.6: cannot open shared object file: No such file or directory
Execution halted

I don't have this problem with png output from plot_exon_distro.r or plot_chr_depth.r
Could it come from the R version I am using?
Thanks in advance for your help

input bam

Hi,

I am new to Iso-Seq analysis. Which of the following should I use as input for the alignment at the first stage:

the raw subread data?
the ccs data?
isoforms produced with PacBio Iso-Seq packages?
collapsed isoforms produced by Cupcake-tofu2?

Error while reading index

Hi,

While running AlignQC, I got the following error:

Reading reference fasta
Reading index
Traceback (most recent call last):
  File "/usr/local/bin/alignqc", line 11, in <module>
    sys.exit(entry_point())
  File "/usr/local/lib/python2.7/dist-packages/alignqc/alignqc.py", line 47, in entry_point
    main(args,operable_argv)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/alignqc.py", line 17, in main
    analyze.external_cmd(operable_argv,version=version)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/analyze.py", line 88, in external_cmd
    main(args)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/analyze.py", line 54, in main
    prepare_all_data.external(args)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/prepare_all_data.py", line 844, in external
    main(args)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/prepare_all_data.py", line 65, in main
    make_data_bam_reference(args)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/prepare_all_data.py", line 404, in make_data_bam_reference
    bam_to_context_error_plot.external_cmd(cmd)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/bam_to_context_error_plot.py", line 146, in external_cmd
    main(args)
  File "/usr/local/lib/python2.7/dist-packages/alignqc/bam_to_context_error_plot.py", line 44, in main
    epf.add_alignment(e)
  File "/usr/local/lib/python2.7/dist-packages/seqtools/errors.py", line 74, in add_alignment
    ae = AlignmentErrors(align)
  File "/usr/local/lib/python2.7/dist-packages/seqtools/errors.py", line 698, in __init__
    self._hpas = self._misalign_split(alns) # split alignment into homopolymer groups
  File "/usr/local/lib/python2.7/dist-packages/seqtools/errors.py", line 1105, in _misalign_split
    tchar = x['target'][i]
IndexError: string index out of range.

Can you please help me?
Thanks in advance,
Luis Alfonso.

ValueError: Expected lines to be ordered but they appear not to be ordered on line 49447

Hi,
Sorry to disturb you. I have used AlignQC to analyze full-length transcriptome (CCS) data. Unfortunately, the same problem always comes up, and I am confused. Could you please give me some advice? Thanks so much!

Tony

Adjust sort requirement to be compatible with any samtools sorted file

Current requirements for position sort require the chromosomes to be alphabetically sorted, but the samtools convention just requires them to be in the order defined in the sam header. AlignQC should follow this convention since sorting with samtools is the most common way to sort a sam/bam file.
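
To make the difference concrete, here is a rough sketch of an order check that ranks chromosomes by their position in the SAM header instead of alphabetically. It is an illustration under assumptions, not AlignQC's implementation: the function name `first_unordered` and the `(chrom, start)` record shape are hypothetical.

```python
def first_unordered(records, sq_names):
    """Return the 1-based index of the first record that breaks the
    expected order, or None if all records are ordered.

    records:  iterable of (chrom, start) tuples
    sq_names: chromosome names in the order that defines "sorted"
              (for samtools output, the @SQ order of the header)
    """
    rank = {name: i for i, name in enumerate(sq_names)}
    prev = None
    for n, (chrom, start) in enumerate(records, 1):
        key = (rank[chrom], start)
        if prev is not None and key < prev:
            return n
        prev = key
    return None


if __name__ == "__main__":
    header = ["chr1", "chr2", "chr10"]
    recs = [("chr1", 100), ("chr2", 50), ("chr10", 5)]
    # Ordered when judged by header order.
    print(first_unordered(recs, header))
    # Reported as unordered when judged alphabetically (chr10 < chr2).
    print(first_unordered(recs, sorted(header)))
```

A coordinate-sorted file with a header such as chr1, chr2, ..., chr10 passes the header-order check but trips an alphabetical one as soon as chr10 follows chr2.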

problem with the installation via Conda

Hi, thanks for developing such a nice package.
I have just started learning how it works and wanted to run some tests on both my PacBio and ONT reads, but I fail to install it via conda. I am not sure whether this is related to the tool or just to my conda setup, although I do not have issues installing other packages or creating new environments.
This is my installation command: conda create -n alignqc -c vacation AlignQC
When it starts solving the environment, it fails with the "solving environment" step being killed.
I would appreciate it if you could let me know whether there is something I can do to get this installed via conda. Thanks

ModuleNotFoundError: No module named 'analyze'

Hi !
I just installed alignqc with pip (I tried conda, but solving the environment takes very long).
I get the following error:

alignqc analysis -h
Traceback (most recent call last):
File "/opt/conda/envs/alignqc/bin/alignqc", line 7, in <module>
from alignqc.alignqc import entry_point
File "/opt/conda/envs/alignqc/lib/python3.7/site-packages/alignqc/alignqc.py", line 6, in <module>
import analyze
ModuleNotFoundError: No module named 'analyze'

Any help?
Thanks

annotation error

Hi,

I am trying to use your tool with a RefSeq annotation file in gpd format. It includes non-coding RNAs, so the cdsStart and cdsEnd fields are empty (".").
The process exits with an error, so I am wondering whether the annotation file should exclude non-coding RNAs.
Please find the error below.
Thanks,
Maria Angela

Finding genomic features and assigning reads membership
/home/sw/AlignQC-1.2/utilities/annotate_from_genomic_features.py --output_beds /home/madiroma/tmp/weirathe.Lt1qCw/data/beds /home/madiroma/tmp/weirathe.Lt1qCw/data/best.sorted.gpd.gz /home/madiroma/useful_annotations/refGene_hg19.gpd /home/madiroma/tmp/weirathe.Lt1qCw/data/chrlens.txt -o /home/madiroma/tmp/weirathe.Lt1qCw/data/read_genomic_features.txt.gz
Reading Exons
Traceback (most recent call last):
File "/home/sw/AlignQC-1.2/bin/alignqc", line 44, in <module>
main()
File "/home/sw/AlignQC-1.2/bin/alignqc", line 24, in main
analyze.external_cmd(" ".join(operable_argv),version=version)
File "/home/sw/AlignQC-1.2/utilities/analyze.py", line 77, in external_cmd
main(args)
File "/home/sw/AlignQC-1.2/utilities/analyze.py", line 44, in main
prepare_all_data.external(args)
File "/home/sw/AlignQC-1.2/utilities/prepare_all_data.py", line 764, in external
main(args)
File "/home/sw/AlignQC-1.2/utilities/prepare_all_data.py", line 76, in main
make_data_bam_annotation(args)
File "/home/sw/AlignQC-1.2/utilities/prepare_all_data.py", line 423, in make_data_bam_annotation
annotate_from_genomic_features.external_cmd(cmd)
File "/home/sw/AlignQC-1.2/utilities/annotate_from_genomic_features.py", line 216, in external_cmd
main(args)
File "/home/sw/AlignQC-1.2/utilities/annotate_from_genomic_features.py", line 52, in main
exonbed += [x.get_range() for x in gpd.exons]
File "/home/sw/AlignQC-1.2/pylib/Bio/Format/GPD.py", line 48, in exons
self._initialize()
File "/home/sw/AlignQC-1.2/pylib/Bio/Format/GPD.py", line 21, in _initialize
self._entry = _line_to_entry(self._line)
File "/home/sw/AlignQC-1.2/pylib/Bio/Format/GPD.py", line 79, in _line_to_entry
d['cdsStart'] = int(f[6])
ValueError: invalid literal for int() with base 10: '.'
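
The failure above comes from calling int() on the "." placeholder. A small sketch of a more tolerant parse follows. It is not the project's fix; the helper names are hypothetical, the field positions (5 for txEnd, 6 for cdsStart, 7 for cdsEnd) are inferred from the `int(f[6])` line in the traceback, and falling back to txEnd for non-coding entries is an assumption borrowed from a common genePred convention.

```python
def parse_optional_int(field, default=None):
    """Convert a column to int, treating '' and '.' as missing."""
    field = field.strip()
    if field in ("", "."):
        return default
    return int(field)


def parse_cds_bounds(fields):
    """Return (cdsStart, cdsEnd) from a split gpd line.

    Assumption: when the CDS columns are missing (non-coding RNA),
    both bounds fall back to txEnd, giving an empty coding region.
    """
    tx_end = int(fields[5])
    cds_start = parse_optional_int(fields[6], default=tx_end)
    cds_end = parse_optional_int(fields[7], default=tx_end)
    return cds_start, cds_end


if __name__ == "__main__":
    line = "GENE\tNR_0001\tchr1\t+\t100\t500\t.\t.\t2\t100,300,\t200,500,"
    print(parse_cds_bounds(line.split("\t")))
```

An alternative workaround on the user side would be to rewrite the "." columns in the annotation file before running the tool.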

Partial match/annotation

Hi Jason,
Sorry to bother you again!
I was wondering if you could clarify what exactly partial annotation and partial match are. The manuscript says:

"A read is assigned to a reference transcript if it can cover the first and last exons with any length, and the internal exons with 80% length. When multiple exons are present and both the read and the reference transcript have the same consecutive exons, the match is called as a “full-length” match, otherwise, it is referred to as a “partial” match."

Thanks!
Rojin

Deterministic outcomes of error calculations from randomly selected reads.

The error rate/pattern calculations are supposed to be based on random selections of reads, so the expected behavior is that running the software more than once gives somewhat different error pattern/rate calculations. Currently, however, every run gives the same error rate/pattern. This is because the random class is being set to a seeded random by another class. This is fixed in the dev branch, but I wanted to explain the behavior: a seeded random is currently being used, but the intended default behavior is to use an unseeded one.
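
The behavior described here can be reproduced with Python's standard `random` module. The sketch below is only an illustration, with a hypothetical `sample_reads` helper; it is not the code in AlignQC or its dev branch. Using a private `random.Random` instance also keeps the sampler from being reseeded by other code that touches the module-level generator.

```python
import random


def sample_reads(read_ids, k, seed=None):
    """Pick k reads at random.

    seed=None gives a generator seeded from OS entropy (different
    picks per run); a fixed seed makes the selection reproducible.
    """
    rng = random.Random(seed)  # private generator, not the global one
    return rng.sample(list(read_ids), k)


if __name__ == "__main__":
    reads = ["read%d" % i for i in range(1000)]
    # Same seed, same selection on every run.
    print(sample_reads(reads, 3, seed=1) == sample_reads(reads, 3, seed=1))
    # No seed: the selection will usually differ between calls.
    print(sample_reads(reads, 3))
```

Exposing the seed as an explicit option lets users choose between reproducible reports and fresh random subsets.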

Error while reading reference fasta

Hi, while running AlignQC with the reference added I got an error, pasted below.
(Without the reference it runs perfectly.)

The following is the command line:
~/AlignQC-1.2/bin/alignqc analyze MLDT_finally.bam -r MLDT-gDNA.fasta --no_annotation -o long_reads.alignqc.xhtml

The following is the error:

Reading reference fasta
Reading index
Traceback (most recent call last):
File "/home/bio/AlignQC-1.2/bin/alignqc", line 44, in <module>
main()
File "/home/bio/AlignQC-1.2/bin/alignqc", line 24, in main
analyze.external_cmd(" ".join(operable_argv),version=version)
File "/home/bio/AlignQC-1.2/utilities/analyze.py", line 77, in external_cmd
main(args)
File "/home/bio/AlignQC-1.2/utilities/analyze.py", line 44, in main
prepare_all_data.external(args)
File "/home/bio/AlignQC-1.2/utilities/prepare_all_data.py", line 764, in external
main(args)
File "/home/bio/AlignQC-1.2/utilities/prepare_all_data.py", line 72, in main
make_data_bam_reference(args)
File "/home/bio/AlignQC-1.2/utilities/prepare_all_data.py", line 383, in make_data_bam_reference
bam_to_context_error_plot.external_cmd(cmd)
File "/home/bio/AlignQC-1.2/utilities/bam_to_context_error_plot.py", line 147, in external_cmd
main(args)
File "/home/bio/AlignQC-1.2/utilities/bam_to_context_error_plot.py", line 47, in main
epf.add_alignment(e)
File "/home/bio/AlignQC-1.2/pylib/Bio/Errors.py", line 55, in add_alignment
ae = AlignmentErrors(align)
File "/home/bio/AlignQC-1.2/pylib/Bio/Errors.py", line 499, in __init__
self._context_target_errors = self.get_context_target_errors()
File "/home/bio/AlignQC-1.2/pylib/Bio/Errors.py", line 574, in get_context_target_errors
r[t][tafter]['-']['total'] += 0.5
KeyError: '\r'

Would you please help me?
Thanks,
Luis Alfonso
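
The `KeyError: '\r'` above suggests that a carriage return from the reference FASTA is being treated as a base, which would happen with Windows-style (CRLF) line endings. That diagnosis is an inference from the traceback, not something confirmed in the thread. Under that assumption, a minimal sketch for normalizing line endings before running the tool could look like this (the function name `to_unix_newlines` is hypothetical):

```python
def to_unix_newlines(data):
    """Convert CRLF and bare CR line endings in a bytes object to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    raw = b">contig1\r\nACGT\r\nTTGA\r\n"
    print(to_unix_newlines(raw))
```

The cleaned bytes can then be written to a new FASTA file and passed with -r in place of the original.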

Expected lines to be ordered but they appear not to be ordered

Hi @jason-weirather ,

Thank you for this awesome tool. I want to try it on our PacBio, Illumina, and ONT data. However, I keep getting the error mentioned in the subject line regardless of what I try. Can you please help me figure it out?

I used the following script:

module load anaconda2

cd /stornext/General/data/user_managed/grpu_mritchie_1/Shani/long_read_benchmark/alignqc/
source activate alignqc

module load samtools/1.7

REFERENCE="/stornext/General/data/user_managed/grpu_mritchie_1/Shani/atac-seq/20190529_MiRCL_ATAC/references/genome.fa"

mkdir "/stornext/General/data/user_managed/grpu_mritchie_1/Shani/long_read_benchmark/alignqc_output/ont"
OUT_DIR="/stornext/General/data/user_managed/grpu_mritchie_1/Shani/long_read_benchmark/alignqc_output/ont"

# full BAM files took forever - so trying on the subsample
IN_LOC_ONT="/stornext/General/data/user_managed/grpu_mritchie_1/XueyiDong/long_read_benchmark/ONT/bam_subsample"

find ${IN_LOC_ONT} -name '*.bam' -print0 | while IFS= read -r -d '' BAM
do 
OUT_P=${BAM##*/};OUT_P=${OUT_P%%.sorted*};
echo " ######## --------- processing $BAM in $OUT_P -------- #########################";
seq-tools sort --bam ${BAM} -o ${BAM}.sorted.bam;
samtools index ${BAM}.sorted.bam;
mkdir ${OUT_DIR}/${OUT_P};
echo " ######## --------- results saved in $OUT_DIR/$OUT_P -------- #########################";
alignqc analyze ${BAM}.sorted.bam -g ${REFERENCE} --no_transcriptome --threads 8 --specific_tempdir ${OUT_DIR}/${OUT_P} -o ${OUT_DIR}/${OUT_P}/${OUT_P}.ont.alignqc.xhtml
done

The error I'm getting is as follows:

 ######## --------- processing /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample/bam_subsample/barcode05.sorted.bam in barcode05 -------- #########################
 ######## --------- results saved in /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05 -------- #########################
Using Rscript version:
R scripting front-end version 3.6.1 (2019-07-05)
WARNING: No annotation specified.  Will be unable to report feature specific outputs
Creating initial alignment mapping data
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/bam_preprocess.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample/bam_subsample/barcode05.sorted.bam --minimum_intron_size 68 -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/temp/alndata.txt.gz --threads 8 --specific_tempdir /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/temp/
read basics
6257000
check for best set
6250000/6257982
combining results
6257982
Traverse bam for alignment analysis
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/traverse_preprocessed.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/temp/alndata.txt.gz -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/data/ --specific_tempdir /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/temp/ --threads 8 --min_aligned_bases 50 --max_query_overlap 10 --max_target_overlap 10 --max_target_gap 500000 --required_fractional_improvement 0.2
6257982 alignments   3844424 reads
Writing chromosome lengths from header
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/bam_to_chr_lengths.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample/bam_subsample/barcode05.sorted.bam -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/data/chrlens.txt
Can we find any known read types
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/get_platform_report.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/data/lengths.txt.gz /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/data/special_report
Go through genepred best alignments and make a bed depth file
Generate the depth bed for the mapped reads
gpd_to_bed_depth.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/data/best.sorted.gpd.gz -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode05/data/depth.sorted.bed.gz --threads 8
Traceback (most recent call last):
  File "/home/amarasinghe.s/.conda/envs/alignqc/bin/alignqc", line 11, in <module>
    sys.exit(entry_point())
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/alignqc.py", line 47, in entry_point
    main(args,operable_argv)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/alignqc.py", line 17, in main
    analyze.external_cmd(operable_argv,version=version)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/analyze.py", line 88, in external_cmd
    main(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/analyze.py", line 54, in main
    prepare_all_data.external(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/prepare_all_data.py", line 844, in external
    main(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/prepare_all_data.py", line 60, in main
    make_data_bam(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/prepare_all_data.py", line 184, in make_data_bam
    gpd_to_bed_depth(cmd)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/seqtools/cli/utilities/gpd_to_bed_depth.py", line 60, in external_cmd
    main(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/seqtools/cli/utilities/gpd_to_bed_depth.py", line 27, in main
    for covs in results:
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/multiprocessing/pool.py", line 271, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/multiprocessing/pool.py", line 673, in next
    raise value
ValueError: Expected lines to be ordered but they appear not to be ordered on line 3362988

Then I used the seq-tools sort option to sort the files first, as you mentioned in this issue. However, it still doesn't seem to solve the problem, as seen from the output below.

 ######## --------- processing /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample/bam_subsample/barcode01.sorted.bam in barcode01 -------- #########################
[bam_sort_core] merging from 0 files and 10 in-memory blocks...
 ######## --------- results saved in /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01 -------- #########################
Using Rscript version:
R scripting front-end version 3.6.1 (2019-07-05)
WARNING: No annotation specified.  Will be unable to report feature specific outputs
Creating initial alignment mapping data
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/bam_preprocess.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample/bam_subsample/barcode01.sorted.bam.sorted.bam --minimum_intron_size 68 -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/temp/alndata.txt.gz --threads 8 --specific_tempdir /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/temp/
read basics
5916000
check for best set
5910000/5916804
combining results
5916804
Traverse bam for alignment analysis
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/traverse_preprocessed.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/temp/alndata.txt.gz -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/data/ --specific_tempdir /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/temp/ --threads 8 --min_aligned_bases 50 --max_query_overlap 10 --max_target_overlap 10 --max_target_gap 500000 --required_fractional_improvement 0.2
5916804 alignments   3720827 reads
Writing chromosome lengths from header
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/bam_to_chr_lengths.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample/bam_subsample/barcode01.sorted.bam.sorted.bam -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/data/chrlens.txt
Can we find any known read types
/stornext/HPCScratch/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/get_platform_report.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/data/lengths.txt.gz /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/data/special_report
Go through genepred best alignments and make a bed depth file
Generate the depth bed for the mapped reads
gpd_to_bed_depth.py /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/data/best.sorted.gpd.gz -o /stornext/Projects/promethion/promethion_access/lab_ritchie/transcr_bench_PacBio/short_term/alignqc/ont_bam_subsample//barcode01/data/depth.sorted.bed.gz --threads 8
Traceback (most recent call last):
  File "/home/amarasinghe.s/.conda/envs/alignqc/bin/alignqc", line 11, in <module>
    sys.exit(entry_point())
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/alignqc.py", line 47, in entry_point
    main(args,operable_argv)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/alignqc.py", line 17, in main
    analyze.external_cmd(operable_argv,version=version)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/analyze.py", line 88, in external_cmd
    main(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/analyze.py", line 54, in main
    prepare_all_data.external(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/prepare_all_data.py", line 844, in external
    main(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/prepare_all_data.py", line 60, in main
    make_data_bam(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/alignqc/prepare_all_data.py", line 184, in make_data_bam
    gpd_to_bed_depth(cmd)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/seqtools/cli/utilities/gpd_to_bed_depth.py", line 60, in external_cmd
    main(args)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/site-packages/seqtools/cli/utilities/gpd_to_bed_depth.py", line 27, in main
    for covs in results:
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/multiprocessing/pool.py", line 271, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/home/amarasinghe.s/.conda/envs/alignqc/lib/python2.7/multiprocessing/pool.py", line 673, in next
    raise value
ValueError: Expected lines to be ordered but they appear not to be ordered on line 3364652

I'm attaching the genome file and a small sample of the BAM file here:
barcode05.sorted.bam.first_10_lines.bam.gz

The header of this .bam file is as follows:

@HD	VN:1.5	SO:coordinate
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
@SQ	SN:chr10	LN:133797422
@SQ	SN:chr11	LN:135086622
@SQ	SN:chr12	LN:133275309
@SQ	SN:chr13	LN:114364328
@SQ	SN:chr14	LN:107043718
@SQ	SN:chr15	LN:101991189
@SQ	SN:chr16	LN:90338345
@SQ	SN:chr17	LN:83257441
@SQ	SN:chr18	LN:80373285
@SQ	SN:chr19	LN:58617616
@SQ	SN:chr20	LN:64444167
@SQ	SN:chr21	LN:46709983
@SQ	SN:chr22	LN:50818468
@SQ	SN:chrX	LN:156040895
@SQ	SN:chrY	LN:57227415
@SQ	SN:chrM	LN:16569
@SQ	SN:GL000008.2	LN:209709
@SQ	SN:GL000009.2	LN:201709
@SQ	SN:GL000194.1	LN:191469
@SQ	SN:GL000195.1	LN:182896
@SQ	SN:GL000205.2	LN:185591
@SQ	SN:GL000208.1	LN:92689
@SQ	SN:GL000213.1	LN:164239
@SQ	SN:GL000214.1	LN:137718
@SQ	SN:GL000216.2	LN:176608
@SQ	SN:GL000218.1	LN:161147
@SQ	SN:GL000219.1	LN:179198
@SQ	SN:GL000220.1	LN:161802
@SQ	SN:GL000221.1	LN:155397
@SQ	SN:GL000224.1	LN:179693
@SQ	SN:GL000225.1	LN:211173
@SQ	SN:GL000226.1	LN:15008
@SQ	SN:KI270302.1	LN:2274
@SQ	SN:KI270303.1	LN:1942
@SQ	SN:KI270304.1	LN:2165
@SQ	SN:KI270305.1	LN:1472
@SQ	SN:KI270310.1	LN:1201
@SQ	SN:KI270311.1	LN:12399
@SQ	SN:KI270312.1	LN:998
@SQ	SN:KI270315.1	LN:2276
@SQ	SN:KI270316.1	LN:1444
@SQ	SN:KI270317.1	LN:37690
@SQ	SN:KI270320.1	LN:4416
@SQ	SN:KI270322.1	LN:21476
@SQ	SN:KI270329.1	LN:1040
@SQ	SN:KI270330.1	LN:1652
@SQ	SN:KI270333.1	LN:2699
@SQ	SN:KI270334.1	LN:1368
@SQ	SN:KI270335.1	LN:1048
@SQ	SN:KI270336.1	LN:1026
@SQ	SN:KI270337.1	LN:1121
@SQ	SN:KI270338.1	LN:1428
@SQ	SN:KI270340.1	LN:1428
@SQ	SN:KI270362.1	LN:3530
@SQ	SN:KI270363.1	LN:1803
@SQ	SN:KI270364.1	LN:2855
@SQ	SN:KI270366.1	LN:8320
@SQ	SN:KI270371.1	LN:2805
@SQ	SN:KI270372.1	LN:1650
@SQ	SN:KI270373.1	LN:1451
@SQ	SN:KI270374.1	LN:2656
@SQ	SN:KI270375.1	LN:2378
@SQ	SN:KI270376.1	LN:1136
@SQ	SN:KI270378.1	LN:1048
@SQ	SN:KI270379.1	LN:1045
@SQ	SN:KI270381.1	LN:1930
@SQ	SN:KI270382.1	LN:4215
@SQ	SN:KI270383.1	LN:1750
@SQ	SN:KI270384.1	LN:1658
@SQ	SN:KI270385.1	LN:990
@SQ	SN:KI270386.1	LN:1788
@SQ	SN:KI270387.1	LN:1537
@SQ	SN:KI270388.1	LN:1216
@SQ	SN:KI270389.1	LN:1298
@SQ	SN:KI270390.1	LN:2387
@SQ	SN:KI270391.1	LN:1484
@SQ	SN:KI270392.1	LN:971
@SQ	SN:KI270393.1	LN:1308
@SQ	SN:KI270394.1	LN:970
@SQ	SN:KI270395.1	LN:1143
@SQ	SN:KI270396.1	LN:1880
@SQ	SN:KI270411.1	LN:2646
@SQ	SN:KI270412.1	LN:1179
@SQ	SN:KI270414.1	LN:2489
@SQ	SN:KI270417.1	LN:2043
@SQ	SN:KI270418.1	LN:2145
@SQ	SN:KI270419.1	LN:1029
@SQ	SN:KI270420.1	LN:2321
@SQ	SN:KI270422.1	LN:1445
@SQ	SN:KI270423.1	LN:981
@SQ	SN:KI270424.1	LN:2140
@SQ	SN:KI270425.1	LN:1884
@SQ	SN:KI270429.1	LN:1361
@SQ	SN:KI270435.1	LN:92983
@SQ	SN:KI270438.1	LN:112505
@SQ	SN:KI270442.1	LN:392061
@SQ	SN:KI270448.1	LN:7992
@SQ	SN:KI270465.1	LN:1774
@SQ	SN:KI270466.1	LN:1233
@SQ	SN:KI270467.1	LN:3920
@SQ	SN:KI270468.1	LN:4055
@SQ	SN:KI270507.1	LN:5353
@SQ	SN:KI270508.1	LN:1951
@SQ	SN:KI270509.1	LN:2318
@SQ	SN:KI270510.1	LN:2415
@SQ	SN:KI270511.1	LN:8127
@SQ	SN:KI270512.1	LN:22689
@SQ	SN:KI270515.1	LN:6361
@SQ	SN:KI270516.1	LN:1300
@SQ	SN:KI270517.1	LN:3253
@SQ	SN:KI270518.1	LN:2186
@SQ	SN:KI270519.1	LN:138126
@SQ	SN:KI270521.1	LN:7642
@SQ	SN:KI270522.1	LN:5674
@SQ	SN:KI270528.1	LN:2983
@SQ	SN:KI270529.1	LN:1899
@SQ	SN:KI270530.1	LN:2168
@SQ	SN:KI270538.1	LN:91309
@SQ	SN:KI270539.1	LN:993
@SQ	SN:KI270544.1	LN:1202
@SQ	SN:KI270548.1	LN:1599
@SQ	SN:KI270579.1	LN:31033
@SQ	SN:KI270580.1	LN:1553
@SQ	SN:KI270581.1	LN:7046
@SQ	SN:KI270582.1	LN:6504
@SQ	SN:KI270583.1	LN:1400
@SQ	SN:KI270584.1	LN:4513
@SQ	SN:KI270587.1	LN:2969
@SQ	SN:KI270588.1	LN:6158
@SQ	SN:KI270589.1	LN:44474
@SQ	SN:KI270590.1	LN:4685
@SQ	SN:KI270591.1	LN:5796
@SQ	SN:KI270593.1	LN:3041
@SQ	SN:KI270706.1	LN:175055
@SQ	SN:KI270707.1	LN:32032
@SQ	SN:KI270708.1	LN:127682
@SQ	SN:KI270709.1	LN:66860
@SQ	SN:KI270710.1	LN:40176
@SQ	SN:KI270711.1	LN:42210
@SQ	SN:KI270712.1	LN:176043
@SQ	SN:KI270713.1	LN:40745
@SQ	SN:KI270714.1	LN:41717
@SQ	SN:KI270715.1	LN:161471
@SQ	SN:KI270716.1	LN:153799
@SQ	SN:KI270717.1	LN:40062
@SQ	SN:KI270718.1	LN:38054
@SQ	SN:KI270719.1	LN:176845
@SQ	SN:KI270720.1	LN:39050
@SQ	SN:KI270721.1	LN:100316
@SQ	SN:KI270722.1	LN:194050
@SQ	SN:KI270723.1	LN:38115
@SQ	SN:KI270724.1	LN:39555
@SQ	SN:KI270725.1	LN:172810
@SQ	SN:KI270726.1	LN:43739
@SQ	SN:KI270727.1	LN:448248
@SQ	SN:KI270728.1	LN:1872759
@SQ	SN:KI270729.1	LN:280839
@SQ	SN:KI270730.1	LN:112551
@SQ	SN:KI270731.1	LN:150754
@SQ	SN:KI270732.1	LN:41543
@SQ	SN:KI270733.1	LN:179772
@SQ	SN:KI270734.1	LN:165050
@SQ	SN:KI270735.1	LN:42811
@SQ	SN:KI270736.1	LN:181920
@SQ	SN:KI270737.1	LN:103838
@SQ	SN:KI270738.1	LN:99375
@SQ	SN:KI270739.1	LN:73985
@SQ	SN:KI270740.1	LN:37240
@SQ	SN:KI270741.1	LN:157432
@SQ	SN:KI270742.1	LN:186739
@SQ	SN:KI270743.1	LN:210658
@SQ	SN:KI270744.1	LN:168472
@SQ	SN:KI270745.1	LN:41891
@SQ	SN:KI270746.1	LN:66486
@SQ	SN:KI270747.1	LN:198735
@SQ	SN:KI270748.1	LN:93321
@SQ	SN:KI270749.1	LN:158759
@SQ	SN:KI270750.1	LN:148850
@SQ	SN:KI270751.1	LN:150742
@SQ	SN:KI270752.1	LN:27745
@SQ	SN:KI270753.1	LN:62944
@SQ	SN:KI270754.1	LN:40191
@SQ	SN:KI270755.1	LN:36723
@SQ	SN:KI270756.1	LN:79590
@SQ	SN:KI270757.1	LN:71251
@SQ	SN:chrIS	LN:10567884
@PG	ID:minimap2	PN:minimap2	VN:2.17-r974-dirty	CL:minimap2 -ax splice -uf -k14 --junc-bed /wehisan/home/allstaff/d/dong.x/annotation/HumanSequins/gencode.v33.sequins.junction.bed /wehisan/home/allstaff/d/dong.x/annotation/HumanSequins/GrCh38_sequins.fa /stornext/General/data/user_managed/grpu_mritchie_1/XueyiDong/long_read_benchmark/subsample/ONT010/barcode05.fq.gz
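
The `@HD` line above declares `SO:coordinate`, and the `@SQ` lines fix the reference order (`chr2` comes before `chr10`, which is not lexicographic order). A minimal Python 3 sketch for checking a stream of records against that header order follows. The function names are hypothetical, the records are assumed to be `(reference name, start)` tuples taken from the BAM, and this does not reproduce AlignQC's own ordering check.

```python
def parse_sam_header(header_text):
    """Return (sort order from @HD, reference names in @SQ order)."""
    sort_order = None
    references = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Header fields after the record type look like TAG:VALUE.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")
        elif fields[0] == "@SQ":
            references.append(tags["SN"])
    return sort_order, references


def first_out_of_order(records, references):
    """Return the 1-based index of the first record that breaks
    coordinate order (header reference order, then start), or None."""
    rank = {name: i for i, name in enumerate(references)}
    previous = None
    for n, (chrom, start) in enumerate(records, 1):
        key = (rank[chrom], start)
        if previous is not None and key < previous:
            return n
        previous = key
    return None


header = (
    "@HD\tVN:1.5\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
    "@SQ\tSN:chr10\tLN:133797422\n"
)
sort_order, references = parse_sam_header(header)
# Lexicographic order (chr1, chr10, chr2) disagrees with the header order.
print(sort_order, first_out_of_order(
    [("chr1", 5), ("chr10", 3), ("chr2", 1)], references))
```

If a downstream step sorts or compares reference names as plain strings, a coordinate-sorted BAM with this header could look unordered to it; whether that is what happens here is an assumption.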

Also, I'm attaching the full bam file and a zip file of whatever I got as an output from running the script here:
https://drive.google.com/drive/folders/1HtuIZWOSCh-7PpmxLJyZNy37b8Uo9N6z?usp=sharing

I would really appreciate your help in figuring out what is going on.

Many thanks,
Shani

Illumina alignment

Hi,

I am aligning more than 750 million 100 bp Illumina reads to a reference genome on a system with 128 GB of memory and 48 cores. I have assigned 40 threads, but alignqc seems to fail before it even gets past the first indexing stage. Is this too much data for alignqc, or is there something I am doing wrong?

Thanks

Minimap2 compatibility

Hi,

Although it clearly says that you have tested AlignQC with GMAP, I wonder whether you have also tested Minimap2 and can recommend using it. Minimap2 runs very fast, and Minimap (not exactly Minimap2) has recently been shown to be the most sensitive tool for Nanopore reads (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408847/; the application there was DNA overlaps).

transcript read count from alignQC

Dear Jason,
May I ask how I can export the read counts of all transcripts after running alignQC on a BAM file? Which data file contains the read counts for all transcripts?

For example, for one report, these are all the files that can be dumped:
alignqc dump output_report.xhtml --list

alignment_error_plot.pdf
alignment_stats.txt
alignments.pdf
annot_lengths.pdf
annot_lengths.txt.gz [annot_lengths.txt]
annotbest.txt.gz [annotbest.txt]
best.sorted.bed.gz [best.sorted.bed best.sorted.gpd]
bias.pdf
bias_table.txt.gz [bias_table.txt]
chimera.bed.gz [chimera.bed chimera.gpd]
chrlens.txt
context_error_data.txt
context_plot.pdf
covgraph.pdf
depth.sorted.bed.gz [depth.sorted.bed]
error_data.txt
error_stats.txt
exon_size_distro.pdf
feature_depth.pdf
gapped.bed.gz [gapped.bed gapped.gpd]
gene_full_rarefraction.txt
gene_rarefraction.pdf
gene_rarefraction.txt
junvar.pdf
junvar.txt
lengths.txt.gz [lengths.txt]
pacbio.pdf
params.txt
perchrdepth.pdf
read_genomic_features.pdf
read_genomic_features.txt.gz [read_genomic_features.txt]
techinical_atypical_chimeras.bed.gz [techinical_atypical_chimeras.bed techinical_atypical_chimeras.gpd]
technical_chimeras.bed.gz [technical_chimeras.bed]
transcript_distro.pdf
transcript_full_rarefraction.txt
transcript_rarefraction
transcript_rarefraction.txt
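
Which of the dumped files holds per-read transcript assignments is not stated in this thread. If one of the tab-delimited tables has a transcript identifier column, counts per transcript could be tallied with a short Python 3 sketch like the one below. The function name `count_column`, the choice of file, and the column index are all assumptions to check against the actual file contents.

```python
import gzip
from collections import Counter


def count_column(path, column, sep="\t"):
    """Count how often each value occurs in a 0-based column of a
    delimited text file; files ending in .gz are read through gzip."""
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(sep)
            # Skip short lines rather than guessing at missing fields.
            if len(fields) > column:
                counts[fields[column]] += 1
    return counts
```

For instance, `count_column("annotbest.txt.gz", 3)` would count the values in the fourth column; both the file and the column number are placeholders here, not documented AlignQC behavior.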

Many thanks,

Weihong

ValueError

Hello,
I want to use AlignQC to analyze some Nanopore RNA data, but I keep getting this memory allocation error:

alignqc analyze aln.bam -g ../Mus_musculus.GRCm38.cdna.all.fa -t ../UCSC_Main_on_Mouse__all_mrna.gtf.gz -o report.xhtml --portable_output report.portable.xhtml --threads 10

Exception in thread Thread-4:ext coverage
Traceback (most recent call last):
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/pool.py", line 326, in _handle_workers
pool._maintain_pool()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/pool.py", line 230, in _maintain_pool
self._repopulate_pool()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
w.start()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Exception in thread Thread-1:
Traceback (most recent call last):
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/pool.py", line 326, in _handle_workers
pool._maintain_pool()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/pool.py", line 230, in _maintain_pool
self._repopulate_pool()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
w.start()
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/projects/nanopore-working/rojin/nanoraw-signalAlign-nanopolish/anaconda2/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Killedignments, 3213 min context coverage

I would really appreciate it if you could help me with this.
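
The traceback shows `os.fork()` failing with `ENOMEM` while the multiprocessing pool starts its workers, so one thing to try is a lower `--threads` value. A minimal Python 3 sketch of picking a worker count from a memory budget follows. The helper name `workers_for_budget` and the per-worker figure are hypothetical, not values taken from AlignQC.

```python
GIB = 1024 ** 3


def workers_for_budget(requested, budget_bytes, per_worker_bytes):
    """Return the largest worker count (at least 1, at most `requested`)
    whose estimated total memory fits inside the budget."""
    if per_worker_bytes <= 0:
        raise ValueError("per_worker_bytes must be positive")
    fit = budget_bytes // per_worker_bytes
    return max(1, min(requested, fit))


# Assumed numbers for illustration: 16 GiB free, 4 GiB per worker.
print(workers_for_budget(10, 16 * GIB, 4 * GIB))
```

With these assumed numbers the command above would be rerun with `--threads 4` instead of `--threads 10`.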

Memory Requirements

I keep getting a MemoryError in traverse_preprocessed.py. How can I properly estimate the memory requirements of the AlignQC pipeline for a given number of reads?
I am looking forward to the results ;)
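
One empirical approach is to run the pipeline on subsampled inputs and record the peak memory of each run. A minimal Python 3 sketch for Unix-like systems follows, using the standard `resource` module. The function name `peak_child_rss` is hypothetical, and the command shown is a stand-in to be replaced by the real `alignqc analyze ...` invocation.

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd):
    """Run `cmd` to completion and return ru_maxrss for waited-for
    child processes (kilobytes on Linux, bytes on macOS)."""
    subprocess.check_call(cmd)
    # RUSAGE_CHILDREN reports the maximum over all children waited for
    # so far, so call this from a fresh process for each measurement.
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Stand-in command that allocates roughly 50 MB.
peak = peak_child_rss(
    [sys.executable, "-c", "x = bytearray(50 * 1024 * 1024)"])
print("peak ru_maxrss:", peak)
```

How memory scales with read count is not stated here, so measuring at two or three subsample sizes is safer than assuming a linear relationship.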
