homeveg / nuctools

software for analysis of chromatin feature occupancy profiles from high-throughput sequencing data

License: GNU General Public License v3.0

Perl 65.90% CSS 2.43% R 4.06% Shell 27.62%

nuctools's Introduction

NucTools

NucTools is a software package for the analysis of chromatin feature occupancy profiles from high-throughput sequencing data.

Biomedical applications of high-throughput sequencing methods generate a vast amount of data in which numerous chromatin features are mapped along the genome. The results are frequently analysed by creating binary data sets that link the presence/absence of a given feature to specific genomic loci. However, the nucleosome occupancy or chromatin accessibility landscape is essentially continuous. It is currently a challenge in the field to cope with continuous distributions of deep sequencing chromatin readouts and to integrate the different types of discrete chromatin features to reveal linkages between them. Here we introduce the NucTools suite of Perl scripts as well as MATLAB- and R-based visualization programs for a nucleosome-centred downstream analysis of deep sequencing data. NucTools accounts for the continuous distribution of nucleosome occupancy. It allows calculations of nucleosome occupancy profiles averaged over several replicates, comparisons of nucleosome occupancy landscapes between different experimental conditions, and the estimation of the changes of integral chromatin properties such as the nucleosome repeat length. Furthermore, NucTools facilitates the annotation of nucleosome occupancy with other chromatin features like binding of transcription factors or architectural proteins, and epigenetic marks like histone modifications or DNA methylation.

SYSTEM REQUIREMENTS

Linux (2.6 kernel or later), Windows 7 x32/64 or Mac (OSX 10.6 Snow Leopard or later) operating system; a minimum of 64 GB of RAM is recommended*. Perl v5.8 or above is required.

A C/C++ compiling environment might be required for installing dependencies such as bedtools. Systems may vary. Please ensure that your system has the essential software building packages (e.g. build-essential for Debian/Ubuntu, the "Development Tools" group for Fedora, Xcode for Mac, etc.) installed properly.

NucTools was tested successfully on our Linux servers (CentOS release 6.7 w/ Perl v5.10.1; Fedora release 22 w/ Perl v5.20.3), MacBook Pro laptops (Mac OS X 10.11 w/ Xcode v5.1, 8 GB RAM, 4-core processor) and a Lenovo ThinkPad laptop (Windows 7, 8 GB RAM, 4-core processor).

*Memory requirements depend on the experimental system. For big genomes, performance increases greatly on machines with more memory. For example, processing all mouse chromosomes with a sequencing library of about 30 000 000 reads occupies around 60-70 GB of RAM at peak load. Performance also depends strongly on the HDD read-write speed. Running many samples in parallel is therefore recommended only on server-like systems or computational clusters with RAID arrays, which allow truly parallel read-write access to the disk array.

aggregate_profile.pl script memory usage

We used occupancy values from 7 mouse chromosomes ranging in length from 95 million bp (chromosome 17) to 195 million bp (chromosome 1). With an average sequence coverage of 67% and a sequencing depth of 4.5-fold, the script used about 440 MB of RAM per 1 million mapped bases.
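
As a rough worked example of the figure above, the sketch below scales 440 MB per million mapped bases to a whole chromosome. The function name and the assumption that memory grows linearly with the number of covered bases are illustrative only; they are not taken from the NucTools code.

```python
def estimate_aggregate_memory_gb(chrom_length_bp, coverage_fraction,
                                 mb_per_million_bases=440.0):
    """Rough RAM estimate in GB, ASSUMING linear scaling with covered bases."""
    covered_millions = chrom_length_bp * coverage_fraction / 1e6
    return covered_millions * mb_per_million_bases / 1000.0


# Mouse chromosome 1 (about 195 million bp) at 67% coverage
print(round(estimate_aggregate_memory_gb(195_000_000, 0.67), 1))  # 57.5
```

Under these assumptions chromosome 1 alone needs roughly 57 GB, which is of the same order as the 60-70 GB peak load quoted above.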

QUICK START

This is an example of profiling a "test.bed" file using NucTools. The test BED file comes along with the NucTools package in the "test" directory. More details can be found in the INSTRUCTION section.

Obtaining NucTools package:

 $ git clone https://github.com/homeveg/nuctools.git NucTools

Installing NucTools:

The NucTools package does not require installation. It is a collection of scripts which can be executed individually.

Generate a genome annotation table using the provided R script:
```
 $ Rscript misc/LoadAnnotation.BioMart.R
```

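Prepare BED files from BAM files with external applications (optional):

a. merge multiple replicates, each already sorted by read name, into one BAM file (the -n flag tells samtools merge that the inputs are sorted by read name; it does not sort them):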

 $ samtools merge -n /Path_to_folder_with/BAM/test_sorted.bam /Path_to_folder_with/BAM/test.rp1.bam /Path_to_folder_with/BAM/test.rp2.bam /Path_to_folder_with/BAM/test.rp3.bam

b. convert sorted BAM files to BED using bowtie2bed.pl script (or alternatively with an external package bedTools):

 $ perl -w bowtie2bed.pl -i /Path_to_folder_with/BAM/test_sorted.bam -verbose > /Path_to_folder_with/BED/test_sorted.bed
 $ bedtools bamtobed -i /Path_to_folder_with/BAM/test_sorted.bam | pigz > /Path_to_folder_with/BED/test_sorted.bed.gz

Running NucTools:

a. Extend single-end reads to the average DNA fragment size

 $ extend_SE_reads.pl -in test.bed -out test.ext.bed -fL 147

b. Extract individual chromosomes from the whole-genome BED file

 $ extract_chr_bed.pl -in test.ext.bed -out test -d /Path_to_folder_with/BED/ -p chr

c. Convert all BED files to occupancy OCC files, averaging nucleosome occupancy values over a window of size 10

 $ bed2occupancy_average.pl -in /Path_to_folder_with/BED/ -odir /Path_to_folder_with/OCC -dir -use -w 10

d. Calculate aggregate profiles and aligned occupancy matrices for each chromosome individually

 $ aggregate_profile.pl -reg genome_annotation.tab -idC 0 -chrC 4 -strC 7 -sC 8 -eC 9 -pbN -lsN -lS <SeqLibSize> \
 -chr 1 -al /Path_to_folder_with/OCC/chr1.test.occ_matrix -av /Path_to_folder_with/OCC/chr1.test.aggregate \
 -in /Path_to_folder_with/OCC/chr1.test.w10.occ -upD 1000 -downD 1000

e. Paste together aggregate profiles of each chromosome in one file and add a header

 $ ls /Path_to_folder_with/OCC/*1000_1000.txt | perl -n -e 'if(/.*(chr.*)\.test.*/gm) { print $1, "\t"; }' | \
 perl -n -e 'if( /(.*)\t$/g )  { print $1}' > /Path_to_folder_with/OCC/test.all.occ.txt
 $ echo "" >> /Path_to_folder_with/OCC/test.all.occ.txt

 $ paste /Path_to_folder_with/OCC/*1000_1000.txt >> /Path_to_folder_with/OCC/test.all.occ.txt

Optionally: visualize aggregate profiles and run K-means cluster analysis on aligned occupancy matrices with the MATLAB-based ClusterMaps Building Tool (provided as a part of the NucTools package).

Installation

The NucTools suite for a nucleosome-centered downstream analysis of deep sequencing data is primarily Perl-based and requires at least Perl v5.8 with dependencies installed properly (listed in README_FULL.md). The visualisation program that comes with NucTools is written in MATLAB; it either requires a full MATLAB installation or can be run as a standalone application, provided with a web installer compiled for Windows 7. NucTools works with whole-genome BED files.

Optional external applications:

SamTools - merge, sort and convert BAM files
bedtools - convert BAM to BED
PIGZ - a parallel implementation of gzip for modern multi-processor, multi-core machines

Running NucTools

A typical analysis workflow using NucTools consists of the following steps (see the figure): BAM/SAM files with raw mapped reads are converted to BED format (bowtie2bed.pl), processed to obtain nucleosome-sized reads (extend_SE_reads.pl or extend_PE_reads.pl), and split into chromosomes (extract_chr_bed.pl). Usually, a separate directory with chromosome BED files is created for each sample, similarly to HOMER's approach. Then chromosome-wide occupancies are calculated and averaged using a window size suitable for the subsequent analysis (bed2occupancy_average.pl). Then, for each cell type/state, an average profile is calculated based on the individual replicate profiles (average_replicates.pl). After this point several types of analysis can be performed in parallel: finding stable/unstable regions (stable_nucs_replicates.pl); comparing replicate-averaged profiles in different cell states/types (compare_two_conditions.pl); calculating nucleosome occupancy profiles at individual regions identified based on the intersection of stable/unstable regions or regions with differential occupancy with genomic features such as promoters, enhancers, etc. (extract_rows_occup.pl); calculating the nucleosome repeat length (nucleosome_repeat_length.pl and plotNRL.R); calculating aggregate profiles or visualizing heat maps of nucleosome occupancy at different genomic features (ClusterMap_Builder). The next types of analysis usually involve gene ontology, multiple-dataset correlations and DNA sequence motif analysis, which can be conducted for the genomic regions of interest identified at the previous steps using external software packages.

The examples below refer to an artificially created input BAM file "test.bam" which we use to run through a NucTools pipeline:

    $ samtools sort -n ./test/test.bam ./test/test_sorted
    $ bowtie2bed.pl -i ./test/test_sorted.bam --verbose > ./test/test_sorted.bed.gz
    $ extend_SE_reads.pl -in ./test/test_sorted.bed.gz -out ./test/test.ext.bed.gz -fL 150
    $ extract_chr_bed.pl -in ./test/test.ext.bed.gz -out test/BED -d ./test -p chr 
    $ bed2occupancy_average.pl -in ./test/BED -odir ./test/OCC -dir -use -w 10
    $ aggregate_profile.pl -reg genome_annotation.txt -idC 0 -chrC 4 -strC 7 -sC 8 -eC 9 -pbN -lsN -lS 75000000 -chr 1 -al ./test/OCC/chr1.test.occ_matrix -av ./test/OCC/chr1.test.aggregate -in ./test/OCC/chr1.test.w10.occ.gz -upD 1000 -downD 1000

In this example the test.bam file is sorted by read name and converted to the test_sorted.bed.gz file. Here we are dealing with single-end Illumina sequencing reads with a read length of 100 bp and an average expected DNA fragment length of 150 bp. The reads are extended using extend_SE_reads.pl, then the resulting whole-genome BED file is split into chromosomes with extract_chr_bed.pl and all per-chromosome BED files are converted to OCC files with bed2occupancy_average.pl using a running window of 10 bp. In the last step an aggregate profile around the regions specified in genome_annotation.txt is generated using the aggregate_profile.pl script.

NucTools scripts

Initial data transformation

bowtie2bed.pl

takes a standard SAM, BAM or MAP file as input and converts it to a gzip-compressed BED file. The program requires samtools to be installed in PATH to work with BAM files

    $ perl -w bowtie2bed.pl --input=accepted_hits.bam --output=sample.bed.gz [--verbose --help]

extend_SE_reads.pl

extends single-end reads by the user-defined value of the average DNA fragment length. The script works with compressed or uncompressed BED files and saves the output as compressed *.BED.GZ

    $ perl -w extend_SE_reads.pl -in <in.bed> -out <out.bed> -fL <fragment length> \
    [-cC <column Nr.> -sC <column Nr.> -eC <column Nr.> -strC <column Nr.> ] [--help]
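
The idea of read extension can be sketched in a few lines. This is an illustrative Python sketch, not the Perl code of extend_SE_reads.pl: it assumes a six-column BED line (strand in the last column) and assumes that a read is extended from its 5' end, so plus-strand reads grow to the right and minus-strand reads grow to the left. The function names are hypothetical.

```python
def extend_read(start, end, strand, fragment_length):
    """Extend a single-end read from its 5' end to the expected fragment length."""
    if strand == "+":
        return start, start + fragment_length
    if strand == "-":
        # never let the extended start fall below the chromosome start
        return max(0, end - fragment_length), end
    raise ValueError(f"unexpected strand: {strand!r}")


def extend_bed_line(line, fragment_length, start_col=1, end_col=2, strand_col=5):
    """Return a tab-delimited BED line with extended start/end coordinates."""
    fields = line.rstrip("\n").split("\t")
    new_start, new_end = extend_read(
        int(fields[start_col]), int(fields[end_col]),
        fields[strand_col], fragment_length)
    fields[start_col], fields[end_col] = str(new_start), str(new_end)
    return "\t".join(fields)


print(extend_bed_line("chr1\t1000\t1100\tread1\t60\t+", 147))
print(extend_bed_line("chr1\t1000\t1100\tread2\t60\t-", 147))
```

The column indices are keyword arguments because the real script also lets the user choose the chromosome, start, end and strand columns.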

extend_PE_reads.pl

takes as input a BED file with mapped paired-end reads (two lines per read pair) sorted by read name and reformats it into a smaller BED file with one line per nucleosome in the following format: (1) chromosome, (2) nucleosome start, (3) nucleosome end, (4) nucleosome length

    $ perl -w extend_PE_reads.pl -in <in.bed> -out <out.bed> [--help]
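
The pairing step described above can be illustrated with a short Python sketch. It is not the script's actual implementation: it assumes that mates sit on consecutive lines, that read names may carry a trailing /1 or /2 suffix, and that the fragment spans the smallest start to the largest end of the two mates. The function names and the default length cut-off of 1000 bp are assumptions made for illustration.

```python
import re


def mate_base_name(read_name):
    """Drop a trailing /1 or /2 mate suffix, if present."""
    return re.sub(r"/[12]$", "", read_name)


def merge_paired_reads(lines, max_fragment_length=1000):
    """Yield (chrom, start, end, length) for consecutive mate pairs."""
    records = [ln.split() for ln in lines if ln.strip()]
    i = 0
    while i + 1 < len(records):
        first, second = records[i], records[i + 1]
        same_pair = (first[0] == second[0]
                     and mate_base_name(first[3]) == mate_base_name(second[3]))
        if not same_pair:
            i += 1  # orphan read: skip it and try to pair from the next line
            continue
        start = min(int(first[1]), int(second[1]))
        end = max(int(first[2]), int(second[2]))
        if end - start <= max_fragment_length:
            yield first[0], start, end, end - start
        i += 2


pairs = [
    "chr5\t85343921\t85344071\tread7/1\t60\t+",
    "chr5\t85343965\t85344115\tread7/2\t60\t-",
]
print(list(merge_paired_reads(pairs)))
```

Taking the minimum start and maximum end makes the computed length independent of the order and strand of the two mates, so it can never be negative.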

calc_fragment_length.pl

estimates the mean fragment length for single-end sequencing based on BED file analysis. The value can be used for single-end read extension

    $ perl -w calc_fragment_length.pl --input=<in.bed> --output=<filtered.txt> [--delta=<N> --apply_filter \
    --filtering_threshold=<N> --pile=<N> --fix_pile_size ] [--chromosome_col=<column Nr.> --start_col=<column Nr.> \
    --end_col=<column Nr.> --strand_col=<column Nr.> --help]

extract_chr_bed.pl

splits a whole-genome BED file with mapped reads into smaller BED files, one per chromosome

    $ perl -w extract_chr_bed.pl -in all_data.bed.gz -out output_name_template -p [<pattern>] [--help]

bed2occupancy_average.pl

calculates genome-wide nucleosome occupancy, based on the BED file with sequencing reads. It converts BED files for all or specified chromosomes. The running window occupancy file (*.OCC) is a text file containing normalized reads frequency distribution along each chromosome in the running window.

    $ perl -w bed2occupancy_average.pl --input=<in.bed.gz> --output=<out.occ.gz> \
    [--outdir=<DIR_WITH_OCC> --chromosome_col=<column Nr.> --start_col=<column Nr.> --end_col=<column Nr.> \
    --strand_col=<column Nr.> --window=<running window size> --consider_strand --ConvertAllInDir --help]
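
To make the notion of window-averaged occupancy concrete, here is a minimal Python sketch. It is an assumption-laden illustration rather than the script's algorithm: intervals are treated as half-open, windows are non-overlapping, and no library-size normalization is applied. The function name and output layout are hypothetical and do not reproduce the OCC file format.

```python
def windowed_occupancy(intervals, chrom_length, window=10):
    """Return [(window_start, mean_coverage), ...] for half-open intervals."""
    # difference array: +1 where a fragment starts, -1 just past its end
    diff = [0] * (chrom_length + 1)
    for start, end in intervals:
        start = max(0, start)
        end = min(end, chrom_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # a running sum turns the difference array into per-base coverage
    coverage, running = [], 0
    for pos in range(chrom_length):
        running += diff[pos]
        coverage.append(running)
    # average the per-base coverage inside each window
    profile = []
    for win_start in range(0, chrom_length, window):
        chunk = coverage[win_start:win_start + window]
        profile.append((win_start, sum(chunk) / len(chunk)))
    return profile


print(windowed_occupancy([(0, 15), (5, 20)], chrom_length=30, window=10))
# [(0, 1.5), (10, 1.5), (20, 0.0)]
```

The difference-array approach touches each fragment only twice, which matters when a chromosome carries millions of reads.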

Core scripts

aggregate_profile.pl

Calculates the aggregate profile of sequencing read density around genomic regions. As input it uses a tab-delimited text file or BED file with coordinates of genomic features (promoters, enhancers, chromatin domains, TF binding sites, etc.), and the OCC files with continuous chromosome-wide occupancy (nucleosome occupancy, TF distribution, etc.). Calculates normalized occupancy profiles for each of the features, as well as the aggregate profile representing the average occupancy centered at the middle of the feature

    $ perl -w aggregate_profile.pl --input=<in.occ.gz> --regions=<annotations.txt> [--expression=<gene_expression.rpkm>] \
    --aligned=<output.aligned.tab.gz> --average_aligned=<output.aggregate.txt> \
    [--path2log=<AggregateProfile.log> --region_start_column=<column Nr.> --region_end_column=<column Nr.> \
    --strand_column=<column Nr.> --chromosome_col=<column Nr.> --GeneId_column=<column Nr.> \
    --Expression_columnID=<column Nr.> --Methylation_columnID=<column Nr.> --Methylation_columnID2=<column Nr.> \
    --upstream_delta=<bp> --downstream_delta=<bp> --upper_threshold=<column Nr.> --lower_threshold=<column Nr.> \
    --Methylation_threshold=<value|range_start-range_end> --overlap=<length> --library_size=<Nr.> \
    --remove_minus_strand | --ignore_strand | --fixed_strand=[plus|minus] --invert_strand --input_occ --score --dont_save_aligned \
    --Cut_tail --chromosome=chrN --AgregateProfile --GeneLengthNorm --LibsizeNorm --PerBaseNorm --useCentre \
    --use_default --verbose --help ]
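
The relationship between the aligned occupancy matrix and the aggregate profile can be sketched as follows. This Python sketch is illustrative only and makes several assumptions that are not confirmed by the script: occupancy is given per base for a single chromosome, each feature is represented by one anchor position plus a strand, and minus-strand rows are reversed so that "upstream" keeps its biological meaning. The function names are hypothetical.

```python
def aligned_rows(occupancy, anchors, upstream, downstream):
    """Build one row per anchor: occupancy from -upstream to +downstream.

    occupancy: per-base values for one chromosome
    anchors:   list of (position, strand) tuples
    """
    rows = []
    for position, strand in anchors:
        if strand == "-":
            # walk right-to-left so the row still reads upstream -> downstream
            coords = range(position + upstream, position - downstream - 1, -1)
        else:
            coords = range(position - upstream, position + downstream + 1)
        rows.append([occupancy[c] if 0 <= c < len(occupancy) else None
                     for c in coords])
    return rows


def aggregate_profile(rows):
    """Column-wise mean over rows, skipping positions outside the chromosome."""
    profile = []
    for column in zip(*rows):
        values = [v for v in column if v is not None]
        profile.append(sum(values) / len(values) if values else None)
    return profile


occupancy = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rows = aligned_rows(occupancy, [(3, "+"), (6, "-")], upstream=2, downstream=2)
print(rows)                     # [[1, 2, 3, 4, 5], [8, 7, 6, 5, 4]]
print(aggregate_profile(rows))  # [4.5, 4.5, 4.5, 4.5, 4.5]
```

The list of rows corresponds to what the text calls the aligned matrix, and its column means correspond to the aggregate profile.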

average_replicates.pl

Calculates the average occupancy profile and standard deviation based on several replicate occupancy profiles from the working directory and saves the resulting table, including the input occupancy data of the individual files. Input *.occ files can be flat or compressed. The resulting extended occupancy file is saved compressed

    $ perl -w average_replicates.pl --dir=<path to working dir> --output=<path to results file> --coordsCol=0 \
    --occupCol=1 --pattern="occ.gz" --printData --sum [--help]
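
A minimal Python sketch of replicate averaging is given below. It is not the script's code: it assumes each replicate is a mapping from coordinate to occupancy, keeps only coordinates present in every replicate, and uses the sample standard deviation (n - 1 in the denominator). Which estimator the Perl script uses is not stated here, so treat that choice as an assumption.

```python
from statistics import mean, stdev


def average_replicates(replicates):
    """Return {coordinate: (mean, sd)} for coordinates shared by all replicates.

    replicates: list of dicts mapping coordinate -> occupancy
    """
    shared = set(replicates[0])
    for rep in replicates[1:]:
        shared &= set(rep)
    result = {}
    for coord in sorted(shared):
        values = [rep[coord] for rep in replicates]
        # stdev needs at least two values; a single replicate has no spread
        sd = stdev(values) if len(values) > 1 else 0.0
        result[coord] = (mean(values), sd)
    return result


reps = [{0: 1.0, 10: 2.0}, {0: 3.0, 10: 2.0}, {0: 2.0, 10: 2.0, 20: 5.0}]
print(average_replicates(reps))  # {0: (2.0, 1.0), 10: (2.0, 0.0)}
```

Coordinate 20 is dropped in the example because it appears in only one replicate.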

calc_fragment_length.pl

Estimates the mean fragment length for a single-end sequencing library

    $ perl -w calc_fragment_length.pl --input=<in.bed> --output=<filtered.txt> \
    [--delta=<N> --apply_filter --filtering_threshold=<N> --pile=<N> --fix_pile_size ] \
    [--chromosome_col=<column Nr.> --start_col=<column Nr.> --end_col=<column Nr.> --strand_col=<column Nr.> --help]

nucleosome_repeat_length.pl

Calculates frequency of nucleosome-nucleosome distances to determine the nucleosome repeat length

    $ perl -w nucleosome_repeat_length.pl --input=<in.bed> --output=<filtered.txt> \
    [--delta=<N> --apply_filter --filtering_threshold=<N> --pile=<N> --fix_pile_size ] \
    [--chromosome_col=<column Nr.> --start_col=<column Nr.> --end_col=<column Nr.> --strand_col=<column Nr.> --help]
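
The two ideas behind a repeat-length estimate, a histogram of distances between reads and a linear fit through its peak positions, can be sketched in Python as below. This is an illustration under assumptions, not the implementation of nucleosome_repeat_length.pl or plotNRL.R: distances are counted between read start coordinates, peak positions are supplied by the caller rather than detected, and the repeat length is taken as the least-squares slope of peak position against peak number. The function names are hypothetical.

```python
from collections import Counter


def distance_histogram(starts, max_delta):
    """Count distances (1..max_delta) between all pairs of read starts."""
    starts = sorted(starts)
    counts = Counter()
    for i, left in enumerate(starts):
        for right in starts[i + 1:]:
            delta = right - left
            if delta > max_delta:
                break  # starts are sorted, so later ones are even further away
            if delta > 0:
                counts[delta] += 1
    return counts


def fit_repeat_length(peak_positions):
    """Least-squares slope of peak position versus peak number (1, 2, 3, ...)."""
    n = len(peak_positions)
    if n < 2:
        raise ValueError("need at least two peaks for a linear fit")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(peak_positions) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, peak_positions))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


hist = distance_histogram([0, 190, 380, 570], max_delta=600)
print(dict(hist))                          # {190: 3, 380: 2, 570: 1}
print(fit_repeat_length([190, 380, 570]))  # 190.0
```

In the toy example the reads are spaced exactly 190 bp apart, so the histogram peaks at multiples of 190 and the fitted slope is 190.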

stable_nucs_replicates.pl

Finds stable and fuzzy nucleosomes using all replicates for the same experimental condition

    $ perl -w stable_nucs_replicates.pl --input=<path to input DIR> --output=<out.bed> --chromosome=chr1 \
    [-coordsCol=0 -occupCol=2 -StableThreshold=0.5 --printData ] [--help]

Visualization and additional scripts

LoadAnnotation.BioMart.R

R script to retrieve gene annotation from Ensembl using the Bioconductor biomaRt package. The gene annotation table, particularly the TSS/TTS coordinates, chromosomes and strand information, is used with aggregate_profile.pl as a genomic features table.

plotNRL.R

Peak detection R script to estimate NRL based on nucleosome_repeat_length.pl output.

CMB - Cluster Maps Builder

Aggregate profile and aligned occupancy matrix visualizer. MATLAB-based stand-alone GUI application, compiled to run on Windows (tested on Windows 7)

Additional information

Additional information, publication references and a short description of each script from the toolbox can be found here:

http://www.generegulation.info/index.php/nuctools (external link)
https://homeveg.github.io/nuctools/ (GitHub pages)
http://link.springer.com/article/10.1186/s12864-017-3580-2 (BMC Genomics)

How to cite

Vainshtein, Y., Rippe, K. & Teif, V.B. BMC Genomics (2017) 18: 158. doi:10.1186/s12864-017-3580-2

Future possible modifications

parallel processing (nice code snippets for parallel processing of BAM files with Perl can be found here: https://genomebytes.wordpress.com/2013/07/24/multi-thread-access-of-bam-files-using-perl-and-samtools/ )
NucTools automated package installation with make

Developers:

Yevhen Vainshtein and Vladimir B. Teif

nuctools's Issues

extend_PE_reads.pl warning issues

Hi there,

I am trying to run this suite of tools on paired-end 150 bp MNase data and I'm having trouble when I use the extend_PE_reads.pl script. There are some warnings and I want to know whether they affect the output BED file. I'm not good at Perl programming, so I can't debug this myself.

I processed my data with the following steps:
bwa --> sam --> bam --> sorted bam (by read name) --> bamtobed

Then I got the input BED file (a test file, only 50 lines):
chr17 42933710 42933856 E00572:503:HHFFVCCX2:6:1101:1610:67603 66.52 -
chr17 42933700 42933850 E00572:503:HHFFVCCX2:6:1101:1610:67603 65.94 -
chr8 85652180 85652304 E00572:503:HHFFVCCX2:6:1101:1610:67744 65.90 -
chr8 85652180 85652330 E00572:503:HHFFVCCX2:6:1101:1610:67744 66.82 -

0 150 E00572:503:HHFFVCCX2:6:1101:1610:71822 68.46 -
0 150 E00572:503:HHFFVCCX2:6:1101:1610:71822 67.42 -
chr22 31995093 31995238 E00572:503:HHFFVCCX2:6:1101:1610:72491 65.70 -
chr22 31995172 31995255 E00572:503:HHFFVCCX2:6:1101:1610:72491 63.85 -
chr10 99407979 99408126 E00572:503:HHFFVCCX2:6:1101:1610:73194 66.41 -
chr10 99408062 99408170 E00572:503:HHFFVCCX2:6:1101:1610:73194 65.25 -
chr18 2834899 2835049 E00572:503:HHFFVCCX2:6:1101:1621:55350 67.10 -
chr18 2834899 2835049 E00572:503:HHFFVCCX2:6:1101:1621:55350 66.52 -
chr7 43818217 43818367 E00572:503:HHFFVCCX2:6:1101:1621:55526 66.80 -
chr7 43818240 43818370 E00572:503:HHFFVCCX2:6:1101:1621:55526 65.34 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:56545 67.35 -
0 108 E00572:503:HHFFVCCX2:6:1101:1621:56545 66.40 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:57108 68.80 -
0 149 E00572:503:HHFFVCCX2:6:1101:1621:57108 66.73 -
chr10 1615533 1615683 E00572:503:HHFFVCCX2:6:1101:1621:57319 66.90 -
chr10 1615554 1615704 E00572:503:HHFFVCCX2:6:1101:1621:57319 66.27 -
chr15 70516499 70516649 E00572:503:HHFFVCCX2:6:1101:1621:57389 66.91 -
chr15 70516512 70516662 E00572:503:HHFFVCCX2:6:1101:1621:57389 66.74 -
chr17_KI270729v1_random 177362 177480 E00572:503:HHFFVCCX2:6:1101:1621:57987 65.32 -
chr17_KI270729v1_random 177331 177471 E00572:503:HHFFVCCX2:6:1101:1621:57987 65.97 -
chr6 51261925 51262075 E00572:503:HHFFVCCX2:6:1101:1621:60378 66.85 -
chr6 51261975 51262083 E00572:503:HHFFVCCX2:6:1101:1621:60378 65.25 -
chr1 167255990 167256140 E00572:503:HHFFVCCX2:6:1101:1621:60413 67.45 -
chr1 167255944 167256050 E00572:503:HHFFVCCX2:6:1101:1621:60413 65.43 -
chr13 77902599 77902746 E00572:503:HHFFVCCX2:6:1101:1621:60448 66.00 -
chr13 77902577 77902727 E00572:503:HHFFVCCX2:6:1101:1621:60448 66.02 -
chr8 81181199 81181300 E00572:503:HHFFVCCX2:6:1101:1621:60519 65.49 -
chr8 81181145 81181295 E00572:503:HHFFVCCX2:6:1101:1621:60519 66.17 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:61187 67.22 -
0 120 E00572:503:HHFFVCCX2:6:1101:1621:61187 67.33 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:61257 68.91 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:61257 67.34 -
chr5 126102675 126102825 E00572:503:HHFFVCCX2:6:1101:1621:61503 66.40 -
chr5 126102675 126102825 E00572:503:HHFFVCCX2:6:1101:1621:61503 66.45 -
chr4 108061184 108061334 E00572:503:HHFFVCCX2:6:1101:1621:61608 66.38 -
chr4 108061230 108061380 E00572:503:HHFFVCCX2:6:1101:1621:61608 65.91 -
chr1 145970837 145970935 E00572:503:HHFFVCCX2:6:1101:1621:61714 64.11 -
chr1 145970877 145970993 E00572:503:HHFFVCCX2:6:1101:1621:61714 65.01 -
chr2 164629352 164629502 E00572:503:HHFFVCCX2:6:1101:1621:62030 66.28 -
chr2 164629476 164629527 E00572:503:HHFFVCCX2:6:1101:1621:62030 60.37 -
chr7 119535260 119535410 E00572:503:HHFFVCCX2:6:1101:1621:62277 67.02 -
chr7 119535326 119535476 E00572:503:HHFFVCCX2:6:1101:1621:62277 66.94 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:62417 67.79 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:62417 67.69 -
chr15 40549730 40549864 E00572:503:HHFFVCCX2:6:1101:1621:62874 65.57 -
chr15 40549780 40549930 E00572:503:HHFFVCCX2:6:1101:1621:62874 66.17 -

I used perl -w extend_PE_reads.pl -in input.bed -out out.bed.gz --gzip and the warnings are:
Subroutine main::ctime redefined at extend_PE_reads.pl line 88.
Prototype mismatch: sub main::ctime (;$) vs none at extend_PE_reads.pl line 88.
ZGIP support enabled

Started: 21:8:38

in file:/home/xianjinyuan/ming/MNase-seq/bed/test_sorted2_test3.bed.gz
out file:/home/xianjinyuan/ming/MNase-seq/bed/test_test_out3.bed.gz
maximum fragment length: 1000
Reading /home/xianjinyuan/ming/MNase-seq/bed/test_sorted2_test3.bed.gz file of 0 MBs. Please wait...
..Use of uninitialized value $line2 in scalar chomp at extend_PE_reads.pl line 224, line 3.
Use of uninitialized value $line2 in split at extend_PE_reads.pl line 229, line 3.
Use of uninitialized value in subroutine entry at extend_PE_reads.pl line 234, line 3.
Use of uninitialized value in subroutine entry at extend_PE_reads.pl line 236, line 3.
Use of uninitialized value $read_2 in string eq at extend_PE_reads.pl line 242, line 3.
Use of uninitialized value $chr_name_2 in string eq at extend_PE_reads.pl line 242, line 3.
..from 4 reads 48 reads where saved. 2 reads discarded
job finished! Bye!

As a result, the output bed file seems in the correct format:
chr17 42933700 42933856 156
chr8 85652180 85652330 150

0 150 150
chr22 31995093 31995255 162
chr10 99407979 99408170 191
chr18 2834899 2835049 150
chr7 43818217 43818370 153
0 150 150
0 150 150
chr10 1615533 1615704 171
chr15 70516499 70516662 163
chr17_KI270729v1_random 177331 177480 149
chr6 51261925 51262083 158
chr1 167255944 167256140 196
chr13 77902577 77902746 169
chr8 81181145 81181300 155
0 150 150
0 150 150
chr5 126102675 126102825 150
chr4 108061184 108061380 196
chr1 145970837 145970993 156
chr2 164629352 164629527 175
chr7 119535260 119535476 216
chr15 40549730 40549930 200

These 2 reads:

0 150 E00572:503:HHFFVCCX2:6:1101:1621:62417 67.79 -
0 150 E00572:503:HHFFVCCX2:6:1101:1621:62417 67.69 -
were discarded, but these reads do not seem to have any problems. I want to know the reason. More importantly, do these warnings affect my output?

Please do let me know if I can provide anything else to help explain my issue better. Looking forward to your reply.

Thanks a lot!

Carrie

Cluster Maps Builder v3.2.0 incorrect tick marks

I've got a matrix of 200K rows, from -250 to 250, but:

the Xticks are at -250,-125,0,125,250 but labelled as -250,130,0,130,250 (which is incorrect)
the Yticks in the heatmap are labelled as 0, 50K, 100K, 150K and 199990 (that last one should be 200000)

Need help for the program aggregate_profile.pl

Hi,
Can you please tell me what the -reg genome_annotation.tab option in that program means? It is not clear to me what file and what format are needed. Can I use a GFF file for my species? Please let me know.

Error in extend_PE

Hi Yevhen,

I found several bugs in your script. With real paired reads, your code is not usable in a generic way for me. The main problem is the way you compute the nucleosome length: you do not take into account the strand of each mate when you do the subtraction, which is why you get negative values. Likewise, I think the maximum length of 1000 should be an optional parameter. In my case, I enforce this limit during mapping (mappings with insert size > 400 bp, i.e. di-nucleosomes, are not output). I also hit a bug with the && ($lines[$#lines] =~ /^chr.*/ )) => shift when reading the BED file.
You cannot directly compare read names 1 and 2 (\1 or \2 at the end).
My working code is below:

use List::Util qw(min max);
.
.
.
if(($i==$end_index) && ($end_index % 2 == 0))
            { $last_line= $lines[$#lines]; last; }
        $line1=$lines[$i]; chomp($line1);
        $line2=$lines[$i+1]; chomp($line2);

        my @newline1=split(/\t/, $line1);
        my @newline2=split(/\t/, $line2);

        my $chr_name_1=$newline1[0];
        my $chr_name_2=$newline2[0];

        my $min = min ($newline1[1],$newline2[1]);
        my $max = max ($newline1[2],$newline2[2]);
        my $nuc_length = $max - $min;

        print $OUT_FHs join("\t", $chr_name_1, $min, $max, $nuc_length), "\n";

I hope it helps. I am new to nucleosome analysis and I prefer to code in Python. I will keep you posted if I find other problems. Thanks for the package. I am trying several tools to analyze my data.

NRL script minor edit

maxsimum --> maximum

extend_PE_reads.pl issues

Hi there,

I am trying to run this suite of tools on paired-end 150 bp MNase ChIP-seq data and I'm having trouble getting good data when I use the extend_PE_reads.pl script. The script seems to run okay (with the perl line instantiated Warnings) but the resulting occ files and the calculated avg_profiles don't look like they make much sense...

Interestingly, when I skip the extend_PE_reads.pl step and run all the downstream tools starting with the chromosome splitting step the data looks pretty okay ( the aggregate profile looks reasonable) but I'm worried about how to interpret results when this step is skipped since I'm not imposing any insert length constraints or formatting everything into a single line instead of separate mate pairs.

Perhaps the issue has to do with the way I'm processing my data:
bwa --> sam --> bam --> sorted bam (by read name) --> bamtobed

here's the first few lines of my input bed:
chr5 85343921 85344071 A00564:291:HNMKTDSXY:1:1101:1081:31798/1 60 +
chr5 85343965 85344115 A00564:291:HNMKTDSXY:1:1101:1081:31798/2 60 -
chr8 44643710 44643859 A00564:291:HNMKTDSXY:1:1101:1090:4445/1 0 +
chr8 44643747 44643897 A00564:291:HNMKTDSXY:1:1101:1090:4445/2 0 -
chr4 75704836 75704986 A00564:291:HNMKTDSXY:1:1101:1090:16376/1 60 +
chr4 75704847 75704997 A00564:291:HNMKTDSXY:1:1101:1090:16376/2 60 -
chr2 43733654 43733781 A00564:291:HNMKTDSXY:1:1101:1090:17378/1 60 -
chr2 43733654 43733782 A00564:291:HNMKTDSXY:1:1101:1090:17378/2 60 +
chr2 226656017 226656162 A00564:291:HNMKTDSXY:1:1101:1090:18505/1 60 +
chr2 226656017 226656162 A00564:291:HNMKTDSXY:1:1101:1090:18505/2 60 -

first few lines of my output bed generated by the script:
chr8 23266835 23267419 584
chr19 3875472 3875997 525
chr12 31950179 31950372 193
chr12 31950180 31950372 192
chr1 68616820 68617006 186
chr1 68616820 68617006 186
chr1 53738030 53739009 979
chr7 64952935 64953736 801
chr7 64952896 64953697 801
chr1 43652533 43653018 485

Please do let me know if I can provide anything else to help explain my issue better.

Thanks a lot!

Anu

ClusterMapsBuilder v3.2.0 no individual plots

Just started using this to cluster some (non-nucleosome) data, but I've come up with some issues:

If anything other than Rescale (complete matrix/each row) is used in the Normalisation options box, no individual profiles are plotted - the figure is empty (subplots are generated, but there's no lines)

aggregating occupancy profile around certain features at different resolution to the original occupancy file

Hi,
Is it possible to calculate the average aggregate occupancy of a feature across all regions on a chromosome at a resolution different from that of the occupancy file calculated for that chromosome?
Take, for example, an occupancy file calculated at 1 bp resolution: can one then aggregate the occupancies at 100 bp resolution? I have tried this for all chromosomes as below, but the aggregate files that are produced are wrong...

for CHR in {1..19} X Y
do
      perl /storage/projects/teif/nuctools.3.0/bed2occupancy_average.pl --input=H3K4me1/chr${CHR}.bed --output=1_occup_chr${CHR}.bed --window=1;
      perl /storage/projects/teif/nuctools.2.0/aggregate_profile.pl --window=100 --regions=regions_OI.bed --input=H3K4me1/1_occup_chr${CHR}.bed --chromosome=chr${CHR} --verbose --useCenter --upstream_delta=20000 --downstream_delta=20000 --average_aligned=H3K4me1_chr${CHR}.w100 --chromosome_col=0 --region_start_column=1 --region_end_column=2 --strand_column=2 --GeneId_column=1 --ignore_strand --force --save_aligned --AgregateProfile;
done

head H3K4me1/1_occup_chr3.bed
3000000	379.062885209613
3000001	379.062885209613
3000002	379.062885209613
3000003	379.062885209613
3000004	379.062885209613
3000005	379.062885209613
3000006	379.062885209613
3000007	379.062885209613
3000008	379.062885209613
3000009	379.062885209613

head H3K4me1/H3K4me1_chr4.w100.delta_20000_20000.txt
20000 365.85209613
19900 320.88520961
19800 [no_value]
19700 [no_value]
19600 [no_value]
.........
........

The aggregate files are like this across all of the chromosomes...

Getting a list of warning while running the extend_PE_reads.pl

Please help me to know whether we can proceed with these warnings from the extend_PE_reads.pl program or not

Use of uninitialized value in split at /export/apps/nuctools-master/extend_PE_reads.pl line 179, <$inFH> line 5680.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 184, <$inFH> line 5680.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 185, <$inFH> line 5680.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5680.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5680.
Use of uninitialized value in scalar chomp at /export/apps/nuctools-master/extend_PE_reads.pl line 176, <$inFH> line 5681.
Use of uninitialized value in split at /export/apps/nuctools-master/extend_PE_reads.pl line 179, <$inFH> line 5681.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 184, <$inFH> line 5681.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 185, <$inFH> line 5681.
Use of uninitialized value in subtraction (-) at /export/apps/nuctools-master/extend_PE_reads.pl line 186, <$inFH> line 5681.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5681.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5681.
Use of uninitialized value in scalar chomp at /export/apps/nuctools-master/extend_PE_reads.pl line 176, <$inFH> line 5683.
Use of uninitialized value in split at /export/apps/nuctools-master/extend_PE_reads.pl line 179, <$inFH> line 5683.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 184, <$inFH> line 5683.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 185, <$inFH> line 5683.
Use of uninitialized value in subtraction (-) at /export/apps/nuctools-master/extend_PE_reads.pl line 186, <$inFH> line 5683.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5683.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5683.
Use of uninitialized value in scalar chomp at /export/apps/nuctools-master/extend_PE_reads.pl line 176, <$inFH> line 5684.
Use of uninitialized value in split at /export/apps/nuctools-master/extend_PE_reads.pl line 179, <$inFH> line 5684.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 184, <$inFH> line 5684.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 185, <$inFH> line 5684.
Use of uninitialized value in subtraction (-) at /export/apps/nuctools-master/extend_PE_reads.pl line 186, <$inFH> line 5684.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5684.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5684.
Use of uninitialized value in scalar chomp at /export/apps/nuctools-master/extend_PE_reads.pl line 176, <$inFH> line 5685.
Use of uninitialized value in split at /export/apps/nuctools-master/extend_PE_reads.pl line 179, <$inFH> line 5685.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 184, <$inFH> line 5685.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 185, <$inFH> line 5685.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5685.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5685.
Use of uninitialized value in scalar chomp at /export/apps/nuctools-master/extend_PE_reads.pl line 176, <$inFH> line 5688.
Use of uninitialized value in split at /export/apps/nuctools-master/extend_PE_reads.pl line 179, <$inFH> line 5688.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 184, <$inFH> line 5688.
Use of uninitialized value in subroutine entry at /export/apps/nuctools-master/extend_PE_reads.pl line 185, <$inFH> line 5688.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5688.
Use of uninitialized value in string eq at /export/apps/nuctools-master/extend_PE_reads.pl line 191, <$inFH> line 5688.
Use of uninitialized value in scalar chomp at /export/apps/nuctools-master/extend_PE_reads.pl line 176, <$inFH> line 5689.

aggregate_profile.pl: duplicate GeneID issue

TLDR: Change the default settings to NOT remove regions with duplicate GeneIDs

GeneID is an optional parameter that is missing from most kinds of our input files. By default the script removes regions with duplicate GeneIDs, which is very problematic when the input file has no GeneID and the user selects some other column as the GeneID, e.g. the coordinate column. In this case, regions with non-unique coordinates get filtered out.
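To illustrate the problem, here is a minimal Python sketch of ID-based deduplication. It is not the code of aggregate_profile.pl. The function `filter_regions`, its `id_column` and `remove_duplicates` parameters, and the sample rows are all hypothetical, and keeping the first occurrence of each ID is an assumption about how the filtering works.

```python
def filter_regions(regions, id_column, remove_duplicates=True):
    """Return regions, optionally keeping only the first row per ID value.

    Hypothetical helper: keeping the first occurrence is an assumption,
    not the documented behaviour of aggregate_profile.pl.
    """
    if not remove_duplicates:
        return list(regions)
    seen = set()
    kept = []
    for region in regions:
        key = region[id_column]
        if key in seen:
            continue  # a later row with the same ID is silently dropped
        seen.add(key)
        kept.append(region)
    return kept


# Made-up BED-like rows without a GeneID column: chromosome, start, end, strand.
regions = [
    ("chr1", 1000, 2000, "+"),
    ("chr1", 1000, 3500, "-"),  # shares its start coordinate with the first row
    ("chr2", 5000, 6000, "+"),
]

# Using the start coordinate (column 1) as a stand-in ID loses the second region.
deduplicated = filter_regions(regions, id_column=1)
# With duplicate removal switched off, all regions survive.
unfiltered = filter_regions(regions, id_column=1, remove_duplicates=False)
print(len(deduplicated), len(unfiltered))
```

In this toy input the two chr1 rows are distinct regions, yet one of them disappears as soon as a coordinate column is used as the ID, which is why leaving duplicate removal off by default would be the safer choice.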

Nucleosome Occupancy Value for an Individual Peak

Hello,

Thank you for developing such a useful and accessible tool. I am hoping to estimate nucleosome occupancy of each peak in a curated ChIP file. I was interested in finding the top 100 peaks with the highest nucleosome occupancy.

Is there a way to accomplish this with the aggregate_profile.pl script? It seems to summarize each chromosome, so I lose the occupancy information for each individual peak.

Thank you in advance
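
A generic way to get one value per peak, independent of aggregate_profile.pl, is to average a per-base occupancy track over each peak interval and then sort the peaks by that value. The Python sketch below is only an illustration under stated assumptions: the occupancy is held in memory as one list per chromosome, peaks are 0-based half-open `(chrom, start, end)` tuples, and the names `mean_occupancy` and `top_peaks` are hypothetical, not part of NucTools.

```python
import heapq


def mean_occupancy(occupancy, start, end):
    """Mean per-base occupancy over the half-open interval [start, end)."""
    window = occupancy[start:end]
    return sum(window) / len(window) if window else 0.0


def top_peaks(peaks, occupancy_by_chrom, n=100):
    """Return the n peaks with the highest mean occupancy, highest first."""
    scored = [
        (mean_occupancy(occupancy_by_chrom[chrom], start, end), chrom, start, end)
        for chrom, start, end in peaks
    ]
    return heapq.nlargest(n, scored)


# Made-up data: a 10 bp "chromosome" and three peaks (assumed layout).
occupancy_by_chrom = {"chr1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
peaks = [("chr1", 0, 4), ("chr1", 4, 8), ("chr1", 8, 10)]

best = top_peaks(peaks, occupancy_by_chrom, n=2)
for score, chrom, start, end in best:
    print(chrom, start, end, score)
```

For real data the occupancy lists would have to be filled from the per-chromosome occupancy files first. That loading step is left out here because the file layout is not described in this thread.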
