nloyfer / wgbs_tools

tools for working with Bisulfite Sequencing data while preserving reads' intrinsic dependencies

License: Other

Python 60.92% Perl 0.29% C++ 38.04% Shell 0.44% C 0.03% Dockerfile 0.28%

methylation dna-methylation wgbs-analysis bioinformatics

wgbs_tools's Introduction

wgbstools - suite for DNA methylation sequencing data representation, visualization, and analysis

wgbstools is an extensive computational suite tailored for bisulfite sequencing data. It allows fast access and ultra-compact representation of high-throughput data, as well as machine learning and statistical analysis, and informative visualizations, from fragment-level to locus-specific representations.

It converts data from standard formats (e.g., bam, bed) into tailored compact yet useful and intuitive formats (pat, beta). These can be visualized in terminal, or analyzed in different ways - subsample, merge, slice, mix, segment and more.

This project is developed by Netanel Loyfer and Jonathan Rosenski in Prof. Tommy Kaplan's lab at the Hebrew University, Jerusalem, Israel.

Quick start

Installation

# Clone
git clone https://github.com/nloyfer/wgbs_tools.git
cd wgbs_tools

# compile
python setup.py

Genome configuration

At least one reference genome must be configured (takes a few minutes).

wgbstools init_genome GENOME_NAME
# E.g.,
wgbstools init_genome hg19
wgbstools init_genome mm9

wgbstools downloads the requested reference FASTA file from the UCSC website. If you prefer using your own reference FASTA, specify the path to the FASTA as follows.

wgbstools init_genome GENOME_NAME --fasta_path /path/to/genome.fa

Dependencies

- python 3+, with libraries:
  - pandas version 1.0+
  - numpy
  - scipy
- samtools
- tabix / bgzip

Dependencies for some features:

bedtools
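
As a quick sanity check before running the setup, the dependencies listed above can be probed from Python. This is only a sketch: the function name `check_dependencies` is made up for illustration and is not part of wgbstools, and it checks presence only, not versions.

```python
import importlib.util
import shutil

# Names taken from the dependency list above.
PYTHON_MODULES = ("pandas", "numpy", "scipy")
EXECUTABLES = ("samtools", "tabix", "bgzip", "bedtools")


def check_dependencies():
    """Return {name: bool} saying whether each dependency was found."""
    found = {}
    for module in PYTHON_MODULES:
        # find_spec returns None when the module cannot be located.
        found[module] = importlib.util.find_spec(module) is not None
    for exe in EXECUTABLES:
        # shutil.which returns None when the executable is not on PATH.
        found[exe] = shutil.which(exe) is not None
    return found


for name, ok in check_dependencies().items():
    print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```

Note that bedtools is only needed for some features, so a missing bedtools is not necessarily a problem.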

Usage examples

Now you can generate pat.gz and beta files out of bam files:

wgbstools bam2pat Sigmoid_Colon_STL003.bam
# output:
# Sigmoid_Colon_STL003.pat.gz
# Sigmoid_Colon_STL003.beta

Once you have pat and beta files, you can use wgbstools to visualize them. For example:

wgbstools vis Sigmoid_Colon_STL003.pat.gz -r chr3:119528843-119529245

wgbstools vis *.beta -r chr3:119528843-119529245 --heatmap

Deconvolution

To deconvolve tissues or blood samples, see our UXM software

References

If you are using wgbstools, please cite:
Loyfer et al. (2024) ‘wgbstools: A computational suite for DNA methylation sequencing data representation, visualization, and analysis’, bioRxiv, 2024.
[GEO GSE186458 | Genome browser sessions: hg19 | hg38]

wgbs_tools's Issues

wgbstools segment error

I'm attempting to segment the (hg19) beta files available from here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE186458

I'm getting the following error:

[ add_loci ] Failed! exception: [wt add_loci] line 0: endCpG < startCpG
[wt segment] found 74,349 blocks (dropped 106,621 short blocks)
Traceback (most recent call last):
  File "./wgbs_tools/wgbstools", line 96, in <module>
    main()
  File "./wgbs_tools/wgbstools", line 63, in main
    importlib.import_module(args.command).main()
  File "./wgbs_tools/src/python/segment.py", line 310, in main
    SegmentByChunks(args, betas).run()
  File "./wgbs_tools/src/python/segment.py", line 147, in run
    self.dump_result(df.reset_index(drop=True))
  File "./wgbs_tools/src/python/segment.py", line 183, in dump_result
    add_bed_to_cpgs(temp_path, self.genome.genome, self.args.out_path)
  File "./wgbs_tools/src/python/convert.py", line 245, in add_bed_to_cpgs
    subprocess.check_call(cmd, shell=True)
  File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cat qfp0iepq | ./wgbs_tools/src/cpg2bed/add_loci ./wgbs_tools/references/hg19/CpG.bed.gz ./wgbs_tools/references/hg19/CpG.chrome.size > ./GSE186458/GSE186458.chr20.blocks.small.bed' returned non-zero exit status 1.
Segmenting

Any thoughts on what the problem might be and how to fix it?

Thanks

issues with beta2bw.py

Hello,
I have generated pat and beta files with 'wgbstools bam2pat'. Prior to running this, I had to sort my bam files with 'samtools sort'.

I then issued command
wgbstools beta2bw --dump_cov --genome hg19 -o pat_beta/trackFiles pat_beta/*beta

to generate bw files for both the beta values and the coverage and I found 2 issues:

- I got an error at line 73 (convert bedGraphs to bigWigs) complaining that the bedGraph files needed to be sorted first. I resolved this by modifying the Python code, adding a 'bedtools sort' step prior to line 73.
- No coverage bw files were generated.

I'm not sure if these issues are due to my incorrect usage of the software or reflect small bugs.

Thanks for your attention

Clarify dependencies

Hi, thanks for making this tool!

I installed wgbs_tools on our machine and tried to run the init_genome command. During setup and running, I found a few dependencies that need to be in the environment but are not mentioned in the markdown. I thought it might be useful for you and others to know.

They are:

numpy module (python3)
pandas module (python3)
tabix / bgzip

Workflow question

Hi,
What is the recommended workflow for generating the required BAM files?

Also, what are the recommended tools for clustering WGBS samples? I.e., taking samples from different tissues and clustering them in an unbiased way to see which samples are similar.

Thank you!

split_by_allele and test_bimodal

Hey. Great package. Not so much an issue as a question.
Could you give an example or tutorial on how to use the split_by_allele and test_bimodal functions?
Thanks a lot!

While running bam2pat on a BAM file containing ML and MM tags, read information is printed on the terminal.

Hi, I am getting read information printed on the terminal while running the bam2pat command. Is there any way to stop this printing? Or is this due to the presence of 'Unknown CIGAR character: =' in the BAM file, which is not recognised by wgbstools?

Missing files while compiling

I got an error that the files: 'homog.o', 'homog' and 'main.o' were missing when compiling.
After downloading those files from an old branch (https://github.com/nloyfer/wgbs_tools/tree/e83804a101526d5ddda4837ec56c22b667a50dce/src/homog) I was able to successfully compile wgbstools.

split_by_allele C>T SNPs at CpG sites

Thank you for including the split_by_allele tutorial. Would this work suitably for C>T SNPs at CpG sites as well (using reads on the opposite strand for directional libraries)?

Question about the calculation used to convert bam file methylation values into beta values

Hello everyone,

I am currently thinking about using bam2pat to convert some .bam files into .beta files.

I am using ONT .bam files and was wondering what the exact calculation looks like for converting the read-based information in the .bam files into beta values.

It would be great to understand the procedure before I start using it.

Thanks a lot!

kind regards,
Azlan
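
As a rough illustration of how per-read patterns can be reduced to per-CpG counts: the sketch below assumes a pat layout of four columns (chromosome, index of the first CpG, a pattern string where `C` is methylated, `T` is unmethylated and `.` is missing, and a read count), and that the beta value of a site is simply methylated observations divided by total observations. The function names and sample rows are made up; this is not the actual bam2pat code, and how ONT modification probabilities are thresholded into calls is not covered here.

```python
from collections import defaultdict


def pat_rows_to_counts(rows):
    """Aggregate pat-like rows into {cpg_index: [n_meth, n_total]}."""
    counts = defaultdict(lambda: [0, 0])
    for chrom, start, pattern, n_reads in rows:
        for offset, call in enumerate(pattern):
            if call not in ("C", "T"):   # '.' or anything else: no usable call
                continue
            site = counts[start + offset]
            if call == "C":              # methylated observation
                site[0] += n_reads
            site[1] += n_reads           # both C and T count as coverage
    return dict(counts)


def beta_value(n_meth, n_total):
    """Methylation fraction, or None when the site has no coverage."""
    return n_meth / n_total if n_total else None


# Hypothetical rows: (chrom, first CpG index, pattern, read count)
rows = [("chr1", 46, "CT", 1), ("chr1", 47, "CC.T", 2)]
counts = pat_rows_to_counts(rows)
for index in sorted(counts):
    print(index, counts[index], beta_value(*counts[index]))
```

Site 47 is covered by both rows here, so its counts combine one unmethylated and two methylated observations.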

The .beta files and .bigwig files are inconsistent.

Hi,
I downloaded the data from the GSE186458 and want to find markers per cell type. But the beta files and the bigwig files are inconsistent.

wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5652nnn/GSM5652176/suppl/GSM5652176_Adipocytes-Z000000T7.beta.gz .
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5652nnn/GSM5652176/suppl/GSM5652176_Adipocytes-Z000000T7.bigwig .

md5sum GSM*
a90af84bfa70e783fa540347db1bec41  GSM5652176_Adipocytes-Z000000T7.beta.gz
7ff5b4efdafd8002ebbb639472e6d1be  GSM5652176_Adipocytes-Z000000T7.bigwig

wgbstools beta2bed -o Z000000T7.bed GSM5652176_Adipocytes-Z000000T7.beta.gz
bigWigToBedGraph GSM5652176_Adipocytes-Z000000T7.bigwig Z000000T7.bedGraph

awk '{printf ("%s\t%s\t%s\t%.3f\n",$1,$2,$3,$4/$5)}' Z000000T7.bed > Z000000T7.format.bed
awk '{printf ("%s\t%s\t%s\t%.3f\n",$1,$2,$3,$4)}' Z000000T7.bedGraph > Z000000T7.format.bedGraph

In this case, the intersection of the two files is 26,179,734 records. There are 472,378 records only in beta file, and 1,747,426 records only in bigwig file.

Theoretically, the methylation ratio should be 0 to 1. However, there are more than 5% records with methylation rate of -1 in the bigwig file. The other inconsistencies between two files are chromosomal end sites.

$ awk '$4<0' Z000000T7.format.bedGraph | head
chr1	10983	10985	-1.000
chr1	10989	10991	-1.000
chr1	10997	10999	-1.000

$ grep chr5 Z000000T7.format.bed | grep 141316059
chr5	141316059	141316061	1.000

$ grep chr5 Z000000T7.format.bedGraph | grep 141316059
chr5	141316059	141316063	1.000

Which data is correct? Should I use beta files to find markers?

Thanks.

genome "liftover" for pat format

Hi!
Thank you for developing these great tools!
I was wondering what would be the recommended way to convert pat files from one genome to another? As far as I am aware there is no designated function for this in wgbstools.
I would be interested in this kind of functionality because I would like to transfer the cfDNA pat files from your 2023 Nature paper from hg19 to hg38.
Thank you in advance,
all the best,
Nikolaus

How to open beta file

Hi, could you show me how to read the beta file? In addition, what does each column in the pat files mean?
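
A minimal reading sketch, assuming the layout described in the repository's docs/beta_format.md: a headerless binary file of unsigned 8-bit integers, two per CpG site (methylated count, then total coverage), in CpG-index order. The helper name `read_beta` is made up for illustration, and the demo file is synthetic.

```python
import tempfile
from pathlib import Path


def read_beta(path):
    """Return a list of (n_meth, n_total) pairs, one per CpG site."""
    raw = Path(path).read_bytes()
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes (2 per CpG site)")
    # Indexing a bytes object yields ints in 0..255, i.e. uint8 values.
    return [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]


# Tiny synthetic file so the sketch runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.beta"
    demo.write_bytes(bytes([3, 10, 0, 0, 7, 7]))
    print(read_beta(demo))   # [(3, 10), (0, 0), (7, 7)]
```

With NumPy installed, `np.fromfile(path, dtype=np.uint8).reshape((-1, 2))` gives the same pairs as an array, under the same layout assumption.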

paired test for find_markers

Thank you for this nice software package!
I was looking into the find_markers functionality, which allows defining specific comparisons between two groups of samples via the targets and background options. If I'm not mistaken, at the moment an unpaired t-test seems to be the only option for this type of comparison. I was wondering whether the functionality could be extended to allow paired t-tests, since this is a fairly common scenario. For example, one may have a group of individuals with WGBS data from two different cell types for each individual, and may wish to find DMRs between the two cell types.
Thanks,
Elisabetta
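
To make the request concrete, here is what a paired statistic computes for one block. This is a generic sketch using only the standard library, not a wgbstools feature; the function name and the sample values are made up. `scipy.stats.ttest_rel` provides the same test together with a p-value.

```python
import math
import statistics


def paired_t_statistic(group_a, group_b):
    """Paired t statistic for matched samples (same individuals, same order)."""
    if len(group_a) != len(group_b) or len(group_a) < 2:
        raise ValueError("need two equally sized groups with at least 2 pairs")
    diffs = [a - b for a, b in zip(group_a, group_b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1   # (t, degrees of freedom)


# Hypothetical average methylation of one block in two cell types, three individuals.
cell_type_1 = [0.8, 0.7, 0.9]
cell_type_2 = [0.6, 0.6, 0.7]
t_stat, dof = paired_t_statistic(cell_type_1, cell_type_2)
print(round(t_stat, 3), dof)
```

If all differences are identical the standard error is zero and the division fails, so a real implementation would need to handle that case.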

groups file format

Hello, I am struggling with the groups file.
I tried all sorts of formats (full beta file name in the name column, partial name, prefix only, etc.) and I keep getting the following error:
" Error: 2 prefixes from groups file were not found in input bins:
Adipocytes-Z000000T7
Blood-NK-Z000000UF
Invalid input argument
groups file mismatch binary files"

Any ideas? Can you share an example of a groups file?
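
The error text suggests that each name in the groups file is matched as a prefix of the input file names. Under that assumption (taken from the message, not from the wgbstools source), a quick check like the one below shows which names would fail to match. The helper name is made up, and the file path follows the GEO naming seen elsewhere on this page.

```python
import os


def unmatched_names(group_names, beta_paths):
    """Return the group names that are not a prefix of any beta file name."""
    basenames = [os.path.basename(p) for p in beta_paths]
    return [name for name in group_names
            if not any(base.startswith(name) for base in basenames)]


# File names as downloaded from GEO carry a GSM accession in front.
betas = ["GSE186458_RAW/GSM5652176_Adipocytes-Z000000T7.hg38.beta"]
print(unmatched_names(["Adipocytes-Z000000T7"], betas))
# ['Adipocytes-Z000000T7']
print(unmatched_names(["GSM5652176_Adipocytes-Z000000T7"], betas))
# []
```

If the prefix assumption holds, either renaming the files or including the accession in the name column would make the names line up.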

bam2pat FileNotFoundError

I got the error "FileNotFoundError: [Errno 2] No such file or directory: '/home/qyjing/softwares/wgbs_tools/references/default'", even though I set the parameter "--genome hg19". Before running bam2pat, I had already initialized the genome using my own downloaded FASTA file as follows:
wgbstools init_genome hg19 --fasta_path /path/to/hg19.fa

And yes, there is no "default" folder under the references folder. How should I solve the problem?

convert .fractional.bw data to pat or beta

Hi, I found a large dataset listed on the EpiRR project which contains '.fractional.bw' bigwig files. I'd like to display these samples side-by-side with other samples for which I've already created pat/beta files from the original bam files, but I am not sure how to proceed.

Is there a recommended way to convert '.fractional.bw' data to pat/beta format?

chr8    127738612       127738613       4.34783
chr8    127738613       127738614       3.0303
chr8    127738617       127738618       0
chr8    127738618       127738619       2.77778
chr8    127738621       127738622       0
chr8    127738622       127738623       0
chr8    127738624       127738625       11.1111
chr8    127738625       127738626       0
chr8    127738669       127738670       0
chr8    127738670       127738671       0
chr8    127738677       127738678       0
chr8    127738678       127738679       4.65116
chr8    127738681       127738682       0
chr8    127738682       127738683       0
chr8    127738684       127738685       0
chr8    127738685       127738686       2.17391
chr8    127738693       127738694       0
chr8    127738694       127738695       2.27273

Thanks

beta2blocks

It's been a wee while since I played with this. I'm just looking at it again now, and I'm getting an error when attempting to run the beta_to_table tool:

./wgbs_tools/src/python/beta_to_blocks.py:39: FutureWarning: is_monotonic is deprecated and will be removed in a future version. Use is_monotonic_increasing instead.
  if not pd.Index(df['startCpG']).is_monotonic:
./wgbs_tools/src/python/beta_to_blocks.py:42: FutureWarning: is_monotonic is deprecated and will be removed in a future version. Use is_monotonic_increasing instead.
  if not pd.Index(df['endCpG']).is_monotonic:
Traceback (most recent call last):
  File "./wgbs_tools/wgbstools", line 96, in <module>
    main()
  File "./wgbs_tools/wgbstools", line 63, in main
    importlib.import_module(args.command).main()
  File "./wgbs_tools/src/python/beta_to_table.py", line 154, in main
    for chunk in chunks:
  File "./wgbs_tools/src/python/beta_to_table.py", line 137, in beta2table_generator
    yield get_table(subset_blocks, gf, min_cov, threads, verbose)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./wgbs_tools/src/python/beta_to_table.py", line 99, in get_table
    with np.warnings.catch_warnings():
         ^^^^^^^^^^^
  File "./myconda/envs/meth/lib/python3.11/site-packages/numpy/__init__.py", line 284, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'warnings'. Did you mean: 'hanning'?

Here's my pip list:

numpy 1.24.1
pandas 1.5.3
pip 22.3
python-dateutil 2.8.2
pytz 2022.7.1
scipy 1.10.0
setuptools 65.5.0
six 1.16.0
wheel 0.37.1

Any thoughts?

Cheers
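
The traceback shows that numpy 1.24.1 does not expose `np.warnings`. The standard-library `warnings` module provides the same `catch_warnings()` context manager, so one possible local patch is to import it directly. The sketch below uses a made-up stand-in function rather than the real `get_table`, so it only demonstrates the replacement pattern.

```python
import warnings


def noisy_mean(values):
    """Made-up stand-in for a NumPy call that warns on empty input."""
    if not values:
        warnings.warn("Mean of empty slice", RuntimeWarning)
        return float("nan")
    return sum(values) / len(values)


def quiet_mean(values):
    # Before: `with np.warnings.catch_warnings():` (fails on numpy 1.24.1, see traceback)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return noisy_mean(values)


print(quiet_mean([1, 2, 3]))   # 2.0
print(quiet_mean([]))          # nan, with no RuntimeWarning shown
```

Pinning an older numpy in the environment is the other obvious workaround, at the cost of staying on an old release.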

bam2pat support for nanopore data

Hi guys,

Amazing tools!

Are there any plans to include functionality in bam2pat to handle nanopore modbams with MM and ML tags? If not, what could be a good way to convert bams with MM and ML tags into a bam that could be used as input for wgbstools?

Many thanks,
Rodrigo

mbias_plot missing / broken by commit 3e5b9bf02f1

In the current version, bam2pat silently fails with this error:

[wt bam2pat] failed in mbias
No module named 'mbias_plot'

It looks like that module doesn't exist / isn't committed.

Tabix 1.3.1 fix

Hi,

Thank you for the great tools. A quick fix could make them easier to use: change the tabix command parameters in src/index.py to " -p vcf -C ", since older tabix versions do not recognize "-p bed" correctly, which can cause trouble.

Thanks a lot,
Yi

bam2pat Invalid input

We used bwamem to create a bam file, and then sambamba for duplication removal.
Unfortunately, bam2pat does not work. Would you have any suggestions to move forward?
Thanks!

(base) bash-4.2$ wgbstools bam2pat normal_PA12005_merged.mdup.bam --out_dir $OUTDIR
[wt bam2pat] bam: normal_PA12005_merged.mdup.bam
Invalid input argument
Failed

merge

I saw that the WBC-WGBS (n=23) samples were uploaded. When trying to merge all 23 of them with:
wgbstools merge -T tmp -p WBC-merged *-WBC-WGBS-Rep1.pat.gz
it runs, generates a tmp file with several Gb and finishes writing WBC-merged.pat.gz without error.
However if I check the coverage with beta_cov after generating a .beta from WBC-merged.pat.gz I get
WBC_merged 238.43 and e.g. GSM6810026_CNVS-NORM-110000263-WBC-WGBS-Rep1 82.86

Shouldn't it be much higher (>1'000x) considering 23 samples with a coverage of 50-80x?
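
One possible explanation, offered under two assumptions that are not confirmed on this page: that beta files store both counts as unsigned 8-bit integers (so no site can report coverage above 255), and that larger counts are scaled down proportionally before writing. The helper below is hypothetical and only illustrates how such a cap would bound the average coverage.

```python
def squeeze_to_uint8(n_meth, n_total, cap=255):
    """Scale a (meth, total) pair so that total fits in one byte."""
    if n_total <= cap:
        return n_meth, n_total
    scale = cap / n_total
    return round(n_meth * scale), cap


# Three made-up sites, two of them with merged coverage far above 255.
true_counts = [(900, 1200), (40, 1500), (100, 200)]
stored = [squeeze_to_uint8(m, t) for m, t in true_counts]
print(stored)

mean_cov = sum(total for _, total in stored) / len(stored)
print(mean_cov)   # bounded by the cap, whatever the true depth
```

Under these assumptions the methylation ratio of each site is approximately preserved, but the reported coverage saturates, which would be consistent with an average a little below 255.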

Missing chromosomes after bam2pat

Hi, thanks for developing wgbstools,
I am following your workflow and encounter this problem: when creating pat and beta files from my bam files (generated with bismark), I miss entire chromosomes (everything is NA after collapsing the beta to blocks with wgbstools).

This happens in all my samples. Do you have any idea of what might be the culprit here?
Maybe something related to how the bam is sorted?

The missing chrs are 22, 20, 16, 12, 11, 10, 8, 7, 6, 4, 3, 2 ... I see no pattern.
The genome version matches the bam, so that shouldn't be the problem.

Sorry that I can't provide a reproducible example; I will try to investigate the issue further in the meantime.

File necessary for beta2bam missing

Hi,

I am trying to convert a beta file to a bam using beta2bam. I have set up hg38 as the reference. When I run:
wgbstools beta2bed GSM5652219_Oligodendrocytes-Z000000TK.hg38.beta -o GSM5652219_Oligodendrocytes-Z000000TK.hg38.tsv
I get the following error:

Invalid input argument
Invalid reference path: /wgbs_tools/references/hg38/CpG.bed.gz

When I look in the "/wgbs_tools/references/hg38" folder, the only entry is hg38.fa.gz. What steps are required to generate the CpG.bed.gz file?

Thanks

convert 450k data to beta or pat

Hi, is there a way to convert 450k data to beta or pat format?

I am interested in comparing the data from the Nature atlas paper and the data from TCGA 450k array cancer samples for a specific region (MYC gene and upstream+downstream intergenic region).

Is there a way to convert the "*.level3betas.txt" files from TCGA into a format that can then be analysed side-by-side as the *.hg38.beta files in the Nature atlas paper?

E.g. example first few lines of a ".level3betas.txt" file from TCGA:

cg07549526      NA
cg16670573      NA
cg09969830      0.0165025989557348
cg00179196      0.959862895312011
cg03948744      0.0517703039642285
cg02729269      0.770865197097467
cg10009236      0.0196736740017586
cg10143220      0.963637601928311
cg05791870      0.981329901446056
cg01527023      0.0293196490201621
cg00928894      0.0563106576186536
cg02369618      0.0227020285011311
cg09580244      0.0798872678158596
cg02783232      0.0178798741323557
cg00389577      NA
cg08400316      NA
cg07893512      0.0536192270323858
cg05057452      0.0532595029760251
cg04141813      0.167112256692955
cg15597257      NA
cg00697413      NA

Thanks in advance.

bam2pat for nanopore data has some problem in recognising CIGARX string

It seems bam2pat for nanopore data has some problem in recognising "eXtended CIGAR", or CIGARX string.

8,229,233,213,224,53,229,57,235,252,255,255,115,249,247,185,231,247,57,139,181,235,251,74,126,165,34,254,255,170,186,193,255,255,254,241,235,237,255,255,254,238,225,14,254,238,253,255,66,254,254,247  RG:Z:c969ee83   mc:f:99.5874    mg:f:99.6108    NM:i:72     
[ patter ] [ patter ] Unknown CIGAR character: =                                                                                                                                                      
[ patter ] [ chr9 ] Exception while processing line 798. Line content:                                                                                                                                                                                              
m64101_220704_113609/114427486/ccs      0       chr9    239949  60      8034=1I565=4D2297=      *       0       10900   ATTTCCAGTTATTCACATTAGAAACAGTACACCACTGAATAAATTTATGCATTCATCTTTGCTTACCTCTTTAATGATTCTTCACGATAAATGCTAGAAATAGAACCACAGACTTAAAGGTCTCCATTGATATGTGTGGC
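
For context, the SAM specification defines `=` (sequence match) and `X` (sequence mismatch) as more specific forms of `M` (alignment match). A string-level conversion is sketched below; the function name is made up, and applying it to a BAM file would still require a SAM/BAM library to rewrite the records, which is not shown here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def to_basic_cigar(cigar):
    """Rewrite '=' and 'X' as 'M', merging adjacent runs of the same operation."""
    merged = []
    consumed = 0
    for match in CIGAR_OP.finditer(cigar):
        length, op = int(match.group(1)), match.group(2)
        consumed += len(match.group(0))
        if op in "=X":
            op = "M"
        if merged and merged[-1][1] == op:
            merged[-1][0] += length      # e.g. 5= followed by 2X becomes 7M
        else:
            merged.append([length, op])
    if consumed != len(cigar):
        raise ValueError(f"unparseable CIGAR string: {cigar!r}")
    return "".join(f"{n}{op}" for n, op in merged)


print(to_basic_cigar("8034=1I565=4D2297="))   # 8034M1I565M4D2297M
print(to_basic_cigar("5=2X3="))               # 10M
```

The first call uses the CIGAR string from the log line above.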

Issue when using `beta_to_450k`

Hi, thanks for developing this useful tool!

However, I ran into some problems when converting my WGBS data to 850K array data. I tried to use beta_to_450k for this, but there were some errors.

The command is:

~/software/wgbs_tools/wgbstools beta_to_450k -o test --EPIC ./PR10200180.sorted.beta

The warning message is:

Invalid input argument
Input file is None

Looking forward to your reply. Thanks!!

issue running wgbstools mix_pat

I am attempting to run mix_pat to make different proportions of test data for wgbstools and UXM, but keep hitting the same issue.
When I run the command:

wgbstools mix_pat --rates [[0.0, 1.0] [0.0, 1.0]] GSM5652176_Adipocytes-Z000000T7.hg38.pat.gz GSM5652313_Blood-Granulocytes-Z000000TZ.hg38.pat.gz

I get the following errors:

mix_pat: error: argument --rates: invalid float value: '[[0.0,'

I consistently get this error no matter how I put the rate values & with or without the square brackets.
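
The error message indicates that `--rates` values are parsed as floats, so brackets and commas reach the parser as part of the tokens. The stand-in parser below is not the real mix_pat parser (its exact options are not shown on this page); it only reproduces the float parsing to show which token shapes succeed.

```python
import argparse


def build_parser():
    # Hypothetical stand-in: one or more floats after --rates.
    parser = argparse.ArgumentParser(prog="mix_pat_demo")
    parser.add_argument("--rates", type=float, nargs="+")
    return parser


parser = build_parser()

# Plain, space-separated numbers parse cleanly.
print(parser.parse_args(["--rates", "0.0", "1.0"]).rates)   # [0.0, 1.0]

# The shell splits "[[0.0, 1.0]" into "[[0.0," and "1.0]"; neither is a float.
try:
    parser.parse_args(["--rates", "[[0.0,", "1.0]"])
except SystemExit:
    print("argparse rejected the bracketed tokens")
```

If the real option behaves like this stand-in, passing the rates as bare numbers without brackets or commas is the shape to try first.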

find_markers error

Thank you for the tool!

I ran into some problems when running:

$ wgbstools find_markers -g groups.csv --betas GSE186458_RAW/*.hg38.beta -b blocks..bed.gz --min_cpg 5 --min_bp 10 --max_bp 1500 -c 10

The error is:

py:87: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  blocks_df[b] = dres[b]
Invalid input argument
`popmean.shape[axis]` must equal 1.

I would greatly appreciate it if you could spend some of your time checking the process for me!

OSError: [Errno 22] while bam2pat

Hi I am having a strange problem while running bam2pat on the tutorial data

(ngs_packages) [vin@fe-open-01 wgbs_tools]$ wgbstools bam2pat tutorial/bams/Sigmoid_Colon_STL003.small.bam 
Traceback (most recent call last):
  File "/home/vin/wgbs_tools/wgbstools", line 97, in <module>
    main()
  File "/home/vin/wgbs_tools/wgbstools", line 64, in main
    importlib.import_module(args.command).main()
  File "/home/vin/wgbs_tools/src/python/bam2pat.py", line 429, in main
    Bam2Pat(args, bam)
  File "/home/vin/wgbs_tools/src/python/bam2pat.py", line 192, in __init__
    self.gr = GenomicRegion(args)
  File "/home/vin/wgbs_tools/src/python/genomic_region.py", line 28, in __init__
    self.genome_name = get_genome_name(genome_name)
  File "/home/vin/wgbs_tools/src/python/genomic_region.py", line 21, in get_genome_name
    return os.readlink(refdir)
OSError: [Errno 22] Invalid argument: '/home/vin/wgbs_tools/references/default'


(ngs_packages) [vin@fe-open-01 wgbs_tools]$ tree -C -L 2
.
├── Dockerfile
├── docs
│   ├── beta_format.md
│   ├── img
│   ├── init_genome_ref_wgbs.md
│   ├── pat_format.md
│   ├── README.md
│   └── view.md
├── LICENSE.md
├── poetry.lock
├── pyproject.toml
├── README.md
├── references
│   ├── default
│   ├── hg19
│   └── hg38
├── setup.py
├── src
│   ├── collapse_pat.pl
│   ├── cpg2bed
│   ├── cview
│   ├── homog
│   ├── __init__.py
│   ├── pat2beta
│   ├── pat_sampler
│   ├── pipeline_wgbs
│   ├── __pycache__
│   ├── python
│   ├── segment_betas
│   ├── view_beta.sh
│   └── view_lbeta.sh
├── supplemental
│   ├── find_markers_config.txt
│   ├── find_markers_defaults.txt
│   ├── hg19.annotations.bed.gz
│   ├── hg19.annotations.bed.gz.tbi
│   ├── hg19.ilmn2CpG.tsv.gz
│   └── hg38.ilmn2CpG.tsv.gz
├── tutorial
│   ├── bams
│   ├── images
│   └── README.md
└── wgbstools -> src/python/wgbs_tools.py
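
The traceback shows `os.readlink` failing on references/default, and the tree listing shows `default` without a `->` target, which suggests it is a regular directory or file rather than a symbolic link. On Linux, `os.readlink` raises errno 22 in exactly that case. The sketch below separates the two failure modes; the function name and the choice to return only the last path component are illustrative, not the wgbstools implementation.

```python
import os
import tempfile


def default_genome(references_dir):
    """Resolve references/default to a genome name, with clearer errors."""
    link = os.path.join(references_dir, "default")
    if not os.path.lexists(link):
        raise FileNotFoundError(f"{link} is missing; no default genome is set")
    if not os.path.islink(link):
        raise ValueError(f"{link} exists but is not a symbolic link")
    # The link target may be relative or absolute; keep only its last component.
    return os.path.basename(os.path.normpath(os.readlink(link)))


# Self-contained demo on a temporary directory (POSIX symlinks assumed).
with tempfile.TemporaryDirectory() as refs:
    os.mkdir(os.path.join(refs, "hg19"))
    os.symlink(os.path.join(refs, "hg19"), os.path.join(refs, "default"))
    print(default_genome(refs))   # hg19
```

Symbolic links are often lost when a repository is copied between file systems or unpacked from an archive, which is one way such a state can arise.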

Tutorials for UXM fragment-level deconvolution

Hi,
Thank you for all the codes and tutorials.

I would like to ask if you have any plans to add tutorials on UXM fragment-level deconvolution?

Is the atlas in the unpublished supplementary? Can I use the deconvolution algorithm in Moss et al. (2018) to deconvolve after having the reference atlas from wgbs tools?

Consider more permissive license for .pat/.beta formats?

Congrats on the paper!

These are lovely tools, but the research use only restriction / noncommercial limit for .pat/.beta will likely limit adoption.

Would y'all consider relicensing just that component to something less restrictive?

Missing last CPG site in chromosome

In patter.cpp, lines 95-100, a bool array is constructed. If the last locus is 100 (bsize == 100), the array needs 100 + 1 elements. With only 100 elements, conv[100] cannot be accessed.

bsize = loci.at(loci.size() - 1);    

conv = new bool[bsize]();   -->  conv = new bool[(bsize + 1)]();
for (int locus: loci) {
    conv[locus] = true;
}

In line 254, we also need to add 1 so that the code can access the last locus of the chromosome:

    if ((start_locus + i) > (bsize - 1)) {           --->  if ((start_locus + i) > (bsize - 1 + 1)) {
        continue;  
    }
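
The sizing argument can be checked in a few lines. This is a Python restatement of the reasoning above, not the C++ code from patter.cpp, and the function name is made up: a table indexed directly by locus needs `largest + 1` slots for the largest locus to be addressable.

```python
def build_lookup(loci):
    """Boolean table where table[locus] is True for every listed locus."""
    largest = max(loci)
    table = [False] * (largest + 1)   # indices 0..largest inclusive
    for locus in loci:
        table[locus] = True
    return table


table = build_lookup([3, 7, 100])
print(len(table), table[100])   # 101 True
```

By the same logic, the proposed bound `(bsize - 1 + 1)` in the second snippet is simply `bsize`.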

bam2pat invalid input argument empty bam file

I tried wgbstools bam2pat directory/*.bam -r $region and it gives me 3 lines as a result.
The first line says bam: directory/filename.bam
The second line says Invalid input argument
The third line says Empty bam file

When I check the size of my bam file in that directory, it says 1.85 KB.

I'm not sure why I am getting these errors. Any suggestions?

vis heatmap as a file export (pdf, etc.)

Hi,

I've discovered this tool a few days ago and already become a fan of it, as have other scientists in the organization.

Is there a way to produce a file export of the --heatmap form of wgbstools vis?

Thanks

Issue when collapsing beta files to the blocks

Hi,
I am practicing again with the tutorial data step by step, and I get an error below when I run the following:

wgbstools beta_to_table blocks.small.bed.gz --betas *beta | column -t

Python is: Python 3.7.3 (default, Jan 22 2021, 20:04:44)
Running on HURCS cluster terminal, 8 CPUs, 32GB

It was OK a few weeks ago, Very strange.
I have downloaded again the master version, and ran the setup steps, but nothing changed.

The process references Python3, but when I run Python3 (3.7.3 ) separately, the imported NumPy package version is 1.21.6, which does not support Int64.
On the other hand, the NumPy package version under Python 2 (2.7.16) is 1.16.2, which seems to support Int64.

In both versions I ran:
import numpy as np
np.version.version
arr = np.array([1, 2, 3, 4], dtype='Int64')

What could be the issue?

The error:

Traceback (most recent call last):
  File "/sci/home/aviel_iluz/wdir_fb/software/wgbs_tools/wgbstools", line 96, in <module>
    main()
  File "/sci/home/aviel_iluz/wdir_fb/software/wgbs_tools/wgbstools", line 63, in main
    importlib.import_module(args.command).main()
  File "/sci/backup/iris.lavon/aviel_iluz/software/wgbs_tools/src/python/beta_to_table.py", line 154, in main
    for chunk in chunks:
  File "/sci/backup/iris.lavon/aviel_iluz/software/wgbs_tools/src/python/beta_to_table.py", line 132, in beta2table_generator
    blocks_df = load_blocks_file(blocks)
  File "/sci/backup/iris.lavon/aviel_iluz/software/wgbs_tools/src/python/beta_to_blocks.py", line 82, in load_blocks_file
    header=header, names=names, nrows=nrows, comment='#')
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 487, in pandas._libs.parsers.TextReader.__cinit__
  File "/usr/lib/python3/dist-packages/pandas/core/dtypes/common.py", line 2013, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'Int64' not understood
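
The traceback goes through /usr/lib/python3/dist-packages/pandas, and `'Int64'` with a capital I is a pandas nullable-integer dtype rather than a NumPy one, so an outdated system-wide pandas is a plausible cause. The README above asks for pandas 1.0+. A small version check, as a sketch with made-up helper names:

```python
import re
from importlib import metadata


def version_tuple(version):
    """'1.5.3' -> (1, 5, 3); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="1.0"):
    """Compare dotted version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("pandas")
    print("pandas", installed, "ok" if meets_minimum(installed) else "too old")
except metadata.PackageNotFoundError:
    print("pandas is not installed for this interpreter")
```

Running this with the same interpreter that wgbstools uses matters, since the system Python and a user or conda environment can carry different pandas versions.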

Supplementary Data: Extended Tables

Hi,

Can you please point me towards the extended tables S1-S5 mentioned in your paper?

I cannot seem to find them on bioRxiv.

find_markers for RRBS

Hi,

I have been trying to find U-markers specific to RRBS using
wgbstools find_markers --betas *hg38.beta -b RRBS_regions_hg38 -g Group_file --only_hypo

but I was running into this error: https://stackoverflow.com/questions/75798944/how-to-solve-invalid-input-argument-popmean-shapeaxis-must-equal-1-when-r/76235648#76235648
and it seems to be related to the groups where only one beta file is available (because if I'm ignoring these, the code runs)
I made the group file using the 'Group' or 'refined_group' column in suppl table 1
But in the published markerfiles and atlases regions are available for these groups containing one file (Epid-Kerat, Osteo, etc), how should I generate markers for these entities?

Could more info be given about the block-file for the RRBS regions, how these were made? I now performed an in-silico digest with MspI and restricting the fragment lengths from 20-200bp.

Kind regards,
Andries
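
Since the reported workaround is to leave out groups with a single beta file, it can help to list those groups up front. The sketch below assumes a CSV groups file with columns named `name` and `group`; the column names and the sample rows are assumptions for illustration, and the helper is not part of wgbstools.

```python
import csv
import io
from collections import Counter


def singleton_groups(csv_text, group_column="group"):
    """Return the groups that contain exactly one sample."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sizes = Counter(row[group_column] for row in reader)
    return sorted(g for g, n in sizes.items() if n == 1)


# Made-up groups file content, for illustration only.
example = """name,group
sample_a,Blood-NK
sample_b,Blood-NK
sample_c,Epid-Kerat
"""
print(singleton_groups(example))   # ['Epid-Kerat']
```

This only identifies the affected groups; how markers for single-sample groups were produced for the published atlases is a question for the authors.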

nanopore?

Hi,

looks interesting. Is there any reason it could not handle nanopore data, eg output from modbam2bed ?
https://github.com/epi2me-labs/modbam2bed

We typically convert the output into bedg / bigwig for visualization, so I guess this would be compatible?

Thanks.

pat result for short fragments (<150)

Hi, I have a question about short fragments (<150bp) where R1 and R2 fully overlap (2*150 paired-end data). If R1 and R2 (with the same read ID) show different methylation levels, what would the pat file output? Would it contain two different rows, or just one row that uses the methylation calls of only R1 or R2?

wgbstools convert is not working as expected

Hello,

I'm trying to use wgbstools convert to translate genomic loci to CpG indices, and it's not working as expected.

The command I used: wgbstools convert -L check_head.bed --genome hg38 --debug

Original file:

chr1	10468	10469	22	29
chr1	10470	10471	21	28
chr1	10483	10484	28	29
chr1	10488	10489	21	28
chr1	10492	10493	25	28
chr1	10496	10497	26	27
chr1	10524	10525	26	28
chr1	10541	10542	26	29
chr1	10562	10563	28	31
chr1	10570	10571	29	31

Expected output:

chr1	10468	10469	1	2	22	29
chr1	10470	10471	2	3	21	28
chr1	10483	10484	3	4	28	29
chr1	10488	10489	4	5	21	28
chr1	10492	10493	5	6	25	28
chr1	10496	10497	6	7	26	27
chr1	10524	10525	7	8	26	28
chr1	10541	10542	8	9	26	29
chr1	10562	10563	10	11	28	31
chr1	10570	10571	11	12	29	31

The output I get:

chr1	10468	10469	NA	NA	22	29
chr1	10470	10471	NA	NA	21	28
chr1	10483	10484	NA	NA	28	29
chr1	10488	10489	NA	NA	21	28
chr1	10492	10493	NA	NA	25	28
chr1	10496	10497	NA	NA	26	27
chr1	10524	10525	NA	NA	26	28
chr1	10541	10542	NA	NA	26	29
chr1	10562	10563	NA	NA	28	31
chr1	10570	10571	NA	NA	29	31

I ran with --debug and the command that was printed before the output was:
tabix -R /tmp/tmpfk4kdgtm /path/wgbs_tools/references/hg38/CpG.bed.gz | awk -v OFS='\t' '{print $1,$2,$2+1,$3}' | sort -k1,1 -k2,2n -u | bedtools intersect -sorted -b - -a /tmp/tmpfk4kdgtm -loj | bedtools groupby -g 1,2,3 -c 7,7 -o first,last | awk -v OFS='\t' '{print $1,$2,$3,$4,$5+1;}' | sed 's/\.\t1/NA\tNA/g'

@nloyfer I'd appreciate it if you could take a look.

Thank you!
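
One reading of the debug command, offered as an assumption rather than a diagnosis: the `awk` step turns each reference CpG at position `p` into the interval `[p, p+1)`, and BED intervals are half-open. If the reference stores the first CpG at chr1:10469, an input interval `10468-10469` ends exactly where the reference interval starts, so the two never overlap and the result is NA. The position used below is assumed, and the helper is made up.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


cpg = (10469, 10470)                   # reference CpG after the awk step (assumed position)
print(overlaps(10468, 10469, *cpg))    # False: the intervals only touch
print(overlaps(10469, 10470, *cpg))    # True
print(overlaps(10468, 10470, *cpg))    # True
```

If this reading is right, shifting the input by one base or widening each interval to cover both bases of the CpG would be enough to get indices back.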

init_genome error

When I enter "wgbstools init_genome -f hg38", I get:

[wt init] Setting up genome reference files in ./software/wgbs_tools/references/hg38
[wt init] No reference FASTA provided. Attempting to download from
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 938M 100 938M 0 0 6153k 0 0:02:36 0:02:36 --:--:-- 3578k
[wt init] successfully downloaded FASTA. Now gunzip and bgzip it...
bgzip: invalid option -- '@'
[bgzip] No such file or directory: 12
Traceback (most recent call last):
  File "./software/wgbs_tools/wgbstools", line 96, in <module>
    main()
  File "./software/wgbs_tools/wgbstools", line 63, in main
    importlib.import_module(args.command).main()
  File "./software/wgbs_tools/src/python/init_genome.py", line 297, in main
    InitGenome(args).run()
  File "./software/wgbs_tools/src/python/init_genome.py", line 57, in __init__
    self.get_fasta()
  File "./software/wgbs_tools/src/python/init_genome.py", line 79, in get_fasta
    subprocess.check_call(cmd, shell=True)
  File "./miniconda3/miniconda3/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gunzip ./wgbs_tools/references/hg38/hg38.fa.gz && bgzip -@ 12 ./wgbs_tools/references/hg38/hg38.fa' returned non-zero exit status 1.

My configuration:
python3, pandas version 1.5.1, numpy version 1.23.4, scipy version 1.9.3
htslib version 1.13

Compiling error for homog...

$python3 setup.py
Compiling stdin2beta...
SUCCESS
Compiling stdin2pairs...
SUCCESS
Compiling pat_sampler...
SUCCESS
Compiling patter...
SUCCESS
Compiling bp_patter...
SUCCESS
Compiling snp_patter...
SUCCESS
Compiling match_maker...
SUCCESS
Compiling segmentor...
SUCCESS
Compiling cview...
SUCCESS
Compiling homog...
FAIL
Failed compilation.
Command: g++ -std=c++11 -c -o src/homog/main.o src/homog/main.cpp; g++ -std=c++11 -c -o src/homog/homog.o src/homog/homog.cpp; g++ -std=c++11 -o src/homog/homog src/homog/main.o src/homog/homog.o -lz -lboost_iostreams
return code: 1
stderr:

stdout:
In file included from src/homog/main.cpp:5:0:
src/homog/homog.h:8:48: fatal error: boost/algorithm/string/predicate.hpp: No such file or directory
#include <boost/algorithm/string/predicate.hpp>
^
compilation terminated.
In file included from src/homog/homog.cpp:2:0:
src/homog/homog.h:8:48: fatal error: boost/algorithm/string/predicate.hpp: No such file or directory
#include <boost/algorithm/string/predicate.hpp>
^
compilation terminated.
g++: error: src/homog/main.o: No such file or directory
g++: error: src/homog/homog.o: No such file or directory

Failed compiling homog
Compiling add_cpg_counts...
SUCCESS
Compiling add_loci...
SUCCESS
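
The fatal error above means the compiler could not find boost/algorithm/string/predicate.hpp, so the Boost development headers are probably not installed or not on the include path. Below is a small sketch for checking where (or whether) that header exists. The function name find_header and the candidate directories are assumptions chosen for illustration; the compiler's actual search path may differ on your system.

```python
from pathlib import Path

# Header named in the compiler error.
HEADER = "boost/algorithm/string/predicate.hpp"


def find_header(include_dirs, header=HEADER):
    """Return the first directory in include_dirs containing header, else None."""
    for directory in include_dirs:
        if (Path(directory) / header).is_file():
            return str(directory)
    return None


if __name__ == "__main__":
    # Hypothetical candidate locations; adjust to your system.
    candidates = ["/usr/include", "/usr/local/include", "/opt/homebrew/include"]
    hit = find_header(candidates)
    if hit:
        print(f"Boost headers found under {hit}")
    else:
        print("Boost headers not found in the checked directories")
```

If the header is found in a non-default directory, passing that directory to g++ with -I is one way to let the homog compilation see it.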

About defining cell-type-specific hypermethylated sites.

Can wgbs_tools find cell-type-specific hypermethylated sites? Which subcommand to use?

CpG loci annotation in beta file

Hi,

Your work is fantastic! Thanks for your documentation and tutorial.

I understand that in the beta file the CpGs are indexed in genomic order, and that I can convert a CpG index into a genomic locus with wgbstools. But when I load the beta file into R, I don't have the locus information for each CpG. Would it be possible to provide an annotation file that maps CpG indexes to genomic loci and genes (hg19 and hg38)? That would be really helpful for anyone who wants to process beta files with R or other languages.

kind regards,
Tianyu
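
One way to build such a mapping yourself is to join the beta file with the CpG.bed.gz reference that init_genome creates. The sketch below rests on two assumptions: that a beta file stores two unsigned 8-bit values per CpG (methylated count, then total coverage), as described in the project's documentation, and that CpG.bed.gz has the columns chromosome, position and 1-based CpG index, as the tabix/awk pipeline quoted earlier on this page suggests. The function names are hypothetical, and the code uses only the standard library for clarity; for whole-genome files a numpy or pandas version would be much faster.

```python
import gzip
from pathlib import Path


def read_beta(path):
    """Return a list of (n_meth, n_cov) pairs, one per CpG, in index order."""
    raw = Path(path).read_bytes()
    if len(raw) % 2:
        raise ValueError("beta file length is not a multiple of 2 bytes")
    return [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]


def annotate_beta(beta_path, cpg_bed_gz):
    """Join beta values with loci: rows of (chrom, pos, index, n_meth, n_cov)."""
    pairs = read_beta(beta_path)
    rows = []
    with gzip.open(cpg_bed_gz, "rt") as handle:
        for line in handle:
            chrom, pos, idx = line.split()[:3]
            # Assumption: CpG indexes are 1-based, so index i is pair i - 1.
            n_meth, n_cov = pairs[int(idx) - 1]
            rows.append((chrom, int(pos), int(idx), n_meth, n_cov))
    return rows
```

The resulting rows can be written to a tab-separated file and loaded into R alongside the beta values. Gene annotation is not covered here and would need a separate gene table.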

visualization for beta files

I'm applying this tool to heart beta files. I used the command below, but it doesn't show any output. What is the best way to visualize these files?
./wgbstools vis ../methylation_atlas/*.beta -r chr3:119528843-119529245 --heatmap
chr3:119528843-119529245 - 403bp, 6CpGs: 5560971-5560977
GSM5652212_Heart-Cardiomyocyte-Z0000044G.hg38 : ??????
GSM5652213_Heart-Cardiomyocyte-Z0000044K.hg38 : ??????
GSM5652214_Heart-Cardiomyocyte-Z0000044N.hg38 : ??????
GSM5652215_Heart-Cardiomyocyte-Z0000044P.hg38 : ??????
GSM5652216_Heart-Cardiomyocyte-Z0000044Q.hg38 : ??????
GSM5652217_Heart-Cardiomyocyte-Z0000044R.hg38 : ??????

Thank you

Missing last CPG site in chromosome

bsize = loci.at(loci.size() - 1);    

conv = new bool[bsize]();   -->  conv = new bool[(bsize + 1)]();
for (int locus: loci) {
    conv[locus] = true;
}

In line 254, we also need to add 1 so that the code can access the last locus of the chromosome:

    if ((start_locus + i) > (bsize - 1)) {           --->  if ((start_locus + i) > (bsize - 1 + 1)) {
        continue;  
    }
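
The off-by-one can be illustrated outside the C++ source. The Python sketch below mirrors the variable names from the snippet above but is not the project's code: a lookup array indexed directly by locus needs last_locus + 1 entries, otherwise the last locus falls outside the array and the bounds check skips it.

```python
def build_lookup(loci):
    """Build a boolean lookup indexed by locus; loci must be sorted ascending."""
    if not loci:
        return []
    bsize = loci[-1]
    # bsize + 1 entries so that index bsize (the last locus) is valid.
    conv = [False] * (bsize + 1)
    for locus in loci:
        conv[locus] = True
    return conv


def in_range(conv, start_locus, i):
    """True when start_locus + i is a valid index into conv."""
    return (start_locus + i) <= len(conv) - 1


if __name__ == "__main__":
    conv = build_lookup([2, 5, 9])
    print(len(conv), conv[9], in_range(conv, 7, 2), in_range(conv, 7, 3))
```

With an array of only bsize entries, conv[9] in this example would be out of bounds, which corresponds to the missing last CpG described in this issue.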

Non-standard 1-based start position in bed files

Hi,

I believe the bed files emitted from this tool use a non-standard start and end position. The bed files from wgbs_tools appear to be 1-based for the start when they should be zero-based. I imagine it would be a lot of work to correct this, but it would be helpful to users to note this in the README to avoid potential downstream issues.
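
Until this is documented, a downstream workaround is to shift the start column before passing the file to tools that expect standard BED coordinates. The sketch below assumes that only the start column (second field) is 1-based and that the end column should stay unchanged; please verify this against your own files, since the issue does not settle how the end coordinate is defined. The function name to_zero_based_start is hypothetical.

```python
def to_zero_based_start(line):
    """Shift the start field of a tab-separated BED line from 1-based to 0-based."""
    fields = line.rstrip("\n").split("\t")
    start = int(fields[1])
    if start < 1:
        raise ValueError(f"start {start} is not a valid 1-based coordinate")
    fields[1] = str(start - 1)
    # All other columns, including end, are passed through untouched.
    return "\t".join(fields)


if __name__ == "__main__":
    print(to_zero_based_start("chr1\t10469\t10471\t1\t3\n"))
```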

Find markers error

I'm running into an error with find_markers which looks like an issue when trying to add an integer to a string?

Number of markers found: 15
Traceback (most recent call last):
File "./wgbstools", line 97, in <module>
main()
File "./wgbstools", line 64, in main
importlib.import_module(args.command).main()
File "./wgbs_tools/src/python/find_markers.py", line 434, in main
MarkerFinder(params).run()
File "./wgbs_tools/src/python/find_markers.py", line 114, in run
self.dump_results(self.res[target].reset_index(drop=True))
File "./wgbs_tools/src/python/find_markers.py", line 378, in dump_results
tf['region'] = bed2reg(tf)
File "./wgbs_tools/src/python/utils_wgbs.py", line 456, in bed2reg
return df['chr'] + ':' + df['start'].astype(str) + '-' + df['end'].astype(str)
~~~~~~~~~~^~~~~
File "~/myconda/envs/meth/lib/python3.11/site-packages/pandas/core/ops/common.py", line 72, in new_method
return method(self, other)
^^^^^^^^^^^^^^^^^^^
File "~/myconda/envs/meth/lib/python3.11/site-packages/pandas/core/arraylike.py", line 102, in __add__
return self._arith_method(other, operator.add)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/myconda/envs/meth/lib/python3.11/site-packages/pandas/core/series.py", line 6259, in _arith_method
return base.IndexOpsMixin._arith_method(self, other, op)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/myconda/envs/meth/lib/python3.11/site-packages/pandas/core/base.py", line 1325, in _arith_method
result = ops.arithmetic_op(lvalues, rvalues, op)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/myconda/envs/meth/lib/python3.11/site-packages/pandas/core/ops/array_ops.py", line 226, in arithmetic_op
res_values = _na_arithmetic_op(left, right, op) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/myconda/envs/meth/lib/python3.11/site-packages/pandas/core/ops/array_ops.py", line 165, in _na_arithmetic_op
result = func(left, right)
^^^^^^^^^^^^^^^^^
numpy.core._exceptions._UFuncNoLoopError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U1')) -> None

Any thoughts?
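
The last project frame in the traceback is df['chr'] + ':' inside bed2reg, and the error reports int64 plus a string. That pattern suggests the chr column was parsed as integers, for example when chromosome names have no "chr" prefix. A minimal sketch of a possible workaround, assuming that is the cause, casts the chromosome column to string as well. The name bed2reg_safe is hypothetical and this is not the project's implementation.

```python
import pandas as pd


def bed2reg_safe(df):
    """Format chr:start-end strings, casting every column to str first."""
    return (
        df["chr"].astype(str)
        + ":"
        + df["start"].astype(str)
        + "-"
        + df["end"].astype(str)
    )


if __name__ == "__main__":
    # Numeric chromosome names are inferred as int64, as in the reported error.
    frame = pd.DataFrame({"chr": [1, 2], "start": [10, 20], "end": [15, 25]})
    print(bed2reg_safe(frame).tolist())
```

Another option is to make sure the input tables use names such as chr1 rather than 1, so that pandas reads the column as strings in the first place.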

beta_to_450k update

Dear Author,
Could you please update the reference used in the beta_to_450k script so that it works with Illumina EPIC data?
