Python implementation of HiCRep stratum-adjusted correlation coefficient of Hi-C data with Cooler sparse contact matrix support

License: GNU General Public License v3.0

hicrep's Introduction

hicrep

Python implementation of HiCRep, a stratum-adjusted correlation coefficient (SCC) for Hi-C data, with support for Cooler sparse contact matrices.

The algorithm is published in:

HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Tao Yang, Feipeng Zhang, Galip Gürkan Yardımcı, Fan Song, Ross C. Hardison, William Stafford Noble, Feng Yue, and Qunhua Li. Genome Res. 2017 Nov;27(11):1939-1949. doi: 10.1101/gr.220640.117.

This implementation takes a pair of Hi-C data sets in Cooler format (.cool for a single bin size or .mcool for multiple bin sizes) and computes one HiCRep SCC score per chromosome, comparing that chromosome's contact matrix between the two data sets. A guide for converting a file of read pairs into the appropriate .cool or .mcool format is available in the Cooler documentation.

The HiCRep SCC computed from this implementation is consistent with the original R implementation (https://github.com/MonkeyLB/hicrep/) and is more than 10x faster than the R version.

Usage

To use as a python module, install the package

pip install hicrep

and then use the util function readMcool

from hicrep.utils import readMcool

to read a pair of mcool files and specify the bin size to compute SCC with:

fmcool1 = "mydata1.mcool"
fmcool2 = "mydata2.mcool"
binSize = 100000
cool1, binSize1 = readMcool(fmcool1, binSize)
cool2, binSize2 = readMcool(fmcool2, binSize)

or a pair of .cool files with built-in bin size:

fcool1 = "mydata1.cool"
fcool2 = "mydata2.cool"
cool1, binSize1 = readMcool(fcool1, -1)
cool2, binSize2 = readMcool(fcool2, -1)
# binSize1 and binSize2 will be set to the bin size built in the cool file
binSize = binSize1

then define the parameters for computing HiCRep SCC:

from hicrep import hicrepSCC

# smoothing window half-size
h = 1

# maximal genomic distance to include in the calculation
dBPMax = 500000

# whether to perform down-sampling or not
# if set to True, it will bootstrap the data set with the larger contact count down to
# the same number of contacts as the other data set; otherwise, the contact
# matrices will be normalized by their respective total number of contacts
bDownSample = False

# compute the SCC score
# this will result in a SCC score for each chromosome available in the data set
# listed in the same order as the chromosomes are listed in the input Cooler files
scc = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample)

# Optionally, you can get SCC scores for a subset of chromosomes
import numpy as np
sccSub = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample, np.array(['myChr1', 'myOtherChr'], dtype=str))

To use as a command line tool, install this package by

pip install hicrep

then run

hicrep mydata1.mcool mydata2.mcool outputSCC.txt --binSize 100000 --h 1 --dBPMax 500000

when passing in an .mcool file with multiple binsizes or

hicrep mydata1.cool mydata2.cool outputSCC.txt --h 1 --dBPMax 500000

when passing in a .cool file with a single built-in bin size. The output file outputSCC.txt contains one SCC score per chromosome in the input. The scores are listed in the same order as the chromosomes are listed in the input Cooler files. To see the list of command line options:

hicrep -h

You can optionally compute SCC scores for a subset of chromosomes using

hicrep mydata1.cool mydata2.cool outputSCC_Subset.txt --h 1 --dBPMax 500000 --chrNames 'myChr1' 'myOtherChr'
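Because the output lists scores only by position, it can help to pair them back up with chromosome names. The following is a minimal sketch, not part of hicrep: it assumes the output has one score per line and that any line starting with # is a comment, and the helper name pair_scores_with_chromosomes is made up for illustration.

```python
from typing import Dict, List


def pair_scores_with_chromosomes(lines: List[str], chrom_names: List[str]) -> Dict[str, float]:
    """Map chromosome names to SCC scores.

    Assumes one score per non-empty line and treats lines starting
    with '#' as comments (an assumption about the output layout).
    """
    scores = [
        float(line.strip())
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if len(scores) != len(chrom_names):
        raise ValueError(
            f"got {len(scores)} scores for {len(chrom_names)} chromosome names"
        )
    return dict(zip(chrom_names, scores))


# Made-up file content, for illustration only
example_lines = ["# header comment\n", "0.95\n", "0.91\n"]
scc_by_chrom = pair_scores_with_chromosomes(example_lines, ["myChr1", "myOtherChr"])
```

In practice the lines would come from open("outputSCC.txt").readlines(), and the names must be the chromosomes that were actually scored, in the order of the input Cooler files. If hicrep skipped a chromosome, the length check above will flag the mismatch rather than silently mislabel scores.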

Related Projects

hicrepcm generates a clustermap of multiple Hi-C datasets based on their pairwise hicrep scores.
hic2cool converts Hi-C datasets from the *.hic format to the *.cool or *.mcool format.

hicrep's Issues

hicrepcm to generate clustermap of hicrep SCC scores

We have created a new package, hicrepcm, which computes hicrep SCC scores for an arbitrary number of input contact matrices and outputs them as a clustermap. You may be interested in linking to this project in the readme for hicrep itself. We find this a convenient way to generate and share results from hicrep. Here is the github repo.

Thanks for your great work on this project.

KeyError: "Unable to open object (object 'chroms' doesn't exist)"

Hi,

Thank you for this cool tool.

I installed hicrep through pip, and got the following error when I used it.

$ hicrep M0_1.mcool M0_2.mcool outputSCC.txt --h 1 --dBPMax 10000000
Traceback (most recent call last):
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/cooler/api.py", line 95, in _refresh
    _ct = chroms(grp)
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/cooler/api.py", line 448, in chroms
    .append(pd.Index(h5["chroms"].keys()))
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/h5py/_hl/group.py", line 288, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'chroms' doesn't exist)"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/bin/hicrep", line 8, in <module>
    sys.exit(main())
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/hicrep/__init__.py", line 80, in main
    cool1, binSize1 = readMcool(fmcool1, binSize)
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/hicrep/utils.py", line 34, in readMcool
    cool = cooler.Cooler(mcool)
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/cooler/api.py", line 89, in __init__
    self._refresh()
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/cooler/api.py", line 104, in _refresh
    listing = list_coolers(self.store)
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/cooler/fileops.py", line 188, in list_coolers
    if not h5py.is_hdf5(filepath):
  File "/home/niuyw/software/anaconda3.7/envs/hicrep/lib/python3.9/site-packages/h5py/_hl/base.py", line 34, in is_hdf5
    fname = os.path.abspath(fspath(fname))
TypeError: expected str, bytes or os.PathLike object, not File

The mcool file looks like this.

$ cooler attrs M0_1.mcool 
'@attrs':
  format: HDF5::MCOOL
  format-version: 2
resolutions:
  '1000000':
    '@attrs':
      bin-size: 1000000
      bin-type: fixed
      creation-date: 2021-05-18 16:43:44.013710
      format: HDF5::Cooler
      format-url: https://github.com/open2c/cooler
      format-version: 3
      generated-by: cooler-0.8.11
      genome-assembly: unknown
      metadata: {}
      nbins: 2738
      nchroms: 22
      nnz: 1591928
      storage-mode: symmetric-upper
      sum: 6885169
    bins:
      '@attrs': {}
      chrom:
        '@attrs': {}
      end:
        '@attrs': {}
      start:
        '@attrs': {}
      weight:
        '@attrs':
          cis_only: false
          converged: true
          ignore_diags: 2
          mad_max: 5
          min_count: 0
          min_nnz: 10
          scale: 3417.6785530450925
          tol: 1.0e-05
          var: 6.7006041416248e-06
    chroms:
      '@attrs': {}
      length:
        '@attrs': {}
      name:
        '@attrs': {}
    indexes:
      '@attrs': {}
      bin1_offset:
        '@attrs': {}
      chrom_offset:
        '@attrs': {}
    pixels:
      '@attrs': {}
      bin1_id:
        '@attrs': {}
      bin2_id:
        '@attrs': {}
      count:
        '@attrs': {}
  '2000000':
    '@attrs':
      bin-size: 2000000
      bin-type: fixed
      creation-date: 2021-05-18 16:43:47.734047
      format: HDF5::Cooler
      format-url: https://github.com/open2c/cooler
      format-version: 3
      generated-by: cooler-0.8.11
      genome-assembly: unknown
      metadata: {}
      nbins: 1376
      nchroms: 22
      nnz: 739487
      storage-mode: symmetric-upper
      sum: 6885169
    bins:
      '@attrs': {}
      chrom:
        '@attrs': {}
      end:
        '@attrs': {}
      start:
        '@attrs': {}
      weight:
        '@attrs':
          cis_only: false
          converged: true
          ignore_diags: 2
          mad_max: 5
          min_count: 0
          min_nnz: 10
          scale: 6346.484655522293
          tol: 1.0e-05
          var: 4.2289736578993755e-06
    chroms:
      '@attrs': {}
      length:
        '@attrs': {}
      name:
        '@attrs': {}
    indexes:
      '@attrs': {}
      bin1_offset:
        '@attrs': {}
      chrom_offset:
        '@attrs': {}
    pixels:
      '@attrs': {}
      bin1_id:
        '@attrs': {}
      bin2_id:
        '@attrs': {}
      count:
        '@attrs': {}
  '4000000':
    '@attrs':
      bin-size: 4000000
      bin-type: fixed
      creation-date: 2021-05-18 16:43:48.440148
      format: HDF5::Cooler
      format-url: https://github.com/open2c/cooler
      format-version: 3
      generated-by: cooler-0.8.11
      genome-assembly: unknown
      metadata: {}
      nbins: 694
      nchroms: 22
      nnz: 220042
      storage-mode: symmetric-upper
      sum: 6885169
    bins:
      '@attrs': {}
      chrom:
        '@attrs': {}
      end:
        '@attrs': {}
      start:
        '@attrs': {}
      weight:
        '@attrs':
          cis_only: false
          converged: true
          ignore_diags: 2
          mad_max: 5
          min_count: 0
          min_nnz: 10
          scale: 11690.429853353084
          tol: 1.0e-05
          var: 6.810218703424982e-06
    chroms:
      '@attrs': {}
      length:
        '@attrs': {}
      name:
        '@attrs': {}
    indexes:
      '@attrs': {}
      bin1_offset:
        '@attrs': {}
      chrom_offset:
        '@attrs': {}
    pixels:
      '@attrs': {}
      bin1_id:
        '@attrs': {}
      bin2_id:
        '@attrs': {}
      count:
        '@attrs': {}
  '8000000':
    '@attrs':
      bin-size: 8000000
      bin-type: fixed
      creation-date: 2021-05-18 16:43:48.758667
      format: HDF5::Cooler
      format-url: https://github.com/open2c/cooler
      format-version: 3
      generated-by: cooler-0.8.11
      genome-assembly: unknown
      metadata: {}
      nbins: 354
      nchroms: 22
      nnz: 57568
      storage-mode: symmetric-upper
      sum: 6885169
    bins:
      '@attrs': {}
      chrom:
        '@attrs': {}
      end:
        '@attrs': {}
      start:
        '@attrs': {}
      weight:
        '@attrs':
          cis_only: false
          converged: true
          ignore_diags: 2
          mad_max: 5
          min_count: 0
          min_nnz: 10
          scale: 20811.188587116743
          tol: 1.0e-05
          var: 3.814639544957634e-06
    chroms:
      '@attrs': {}
      length:
        '@attrs': {}
      name:
        '@attrs': {}
    indexes:
      '@attrs': {}
      bin1_offset:
        '@attrs': {}
      chrom_offset:
        '@attrs': {}
    pixels:
      '@attrs': {}
      bin1_id:
        '@attrs': {}
      bin2_id:
        '@attrs': {}
      count:
        '@attrs': {}
  '@attrs': {}

Do you have any ideas about this?

Thank you in advance.

`AssertionError` when running hicrep

Hi,

Thank you so much for the great tool for python implementation. I would like to ask several questions if possible.

I get an error AssertionError: Contact matrix 1 of chromosome chrY is empty when I run hicrep on my dataset of mouse HiC mcool/cool files with mm10. I checked the issues and found a similar issue in #15.
Is the lack of data on chrY causing the AssertionError above? Is there a way to find the chromosome of missing data before running the package?
I am interested in generating a heatmap to assess the reproducibility of Hi-C replicates. Is there a way to achieve the goal with the hicrep.py?

Thank you so much for your help!

Best,
Ziwei
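One way to look for chromosomes with no data before running hicrep is to count the stored contacts per chromosome with the cooler API. This is a sketch under assumptions: it uses the number of stored intra-chromosomal pixels as a stand-in for the condition hicrep asserts on, which may not match hicrep's internal check exactly, and the helper name find_empty_chromosomes is made up.

```python
from typing import Any, List


def find_empty_chromosomes(clr: Any) -> List[str]:
    """Return chromosomes whose raw intra-chromosomal matrix stores no contacts.

    `clr` is expected to behave like a cooler.Cooler object: it exposes
    `chromnames` and `matrix(balance=..., sparse=...)`, whose `fetch(name)`
    returns a scipy sparse matrix with an `nnz` attribute.
    """
    selector = clr.matrix(balance=False, sparse=True)
    return [name for name in clr.chromnames if selector.fetch(name).nnz == 0]


# Usage with real data (requires the cooler package; paths are placeholders):
#   import cooler
#   clr = cooler.Cooler("mydata1.mcool::resolutions/100000")
#   print(find_empty_chromosomes(clr))
```

Any names returned this way can be left out by passing only the remaining chromosomes through --chrNames on the command line or the chromosome-name argument of hicrepSCC.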

About Interpretation of results

Hi! Thanks for this work.

I have one problem.

When I get the output file outputSCC.txt, which has a list of SCC scores for each chromosome, how should I interpret this result?
The final result I want is a single reproducibility score for the two samples, not chromosome-level scores.
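hicrep itself reports per-chromosome scores, so any single number has to be derived afterwards. One possible summary, sketched here as an assumption rather than an official recommendation, is the mean of the per-chromosome SCC scores weighted by chromosome length. The function name weighted_mean_scc and the numbers are made up for illustration.

```python
from typing import Sequence


def weighted_mean_scc(scores: Sequence[float], lengths: Sequence[int]) -> float:
    """Average per-chromosome SCC scores, weighting each by chromosome length."""
    if len(scores) != len(lengths) or len(scores) == 0:
        raise ValueError("scores and lengths must be non-empty and the same size")
    total_length = sum(lengths)
    return sum(s * n for s, n in zip(scores, lengths)) / total_length


# Made-up scores and chromosome lengths, for illustration only
summary = weighted_mean_scc([0.9, 0.8], [300, 100])
```

Weighting by length keeps small chromosomes from dominating the summary; an unweighted mean or the median are equally simple alternatives, and the choice should be stated whenever the summary is reported.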

AssertionError: Input cool files have different number of bins

Hi,

Thanks for this wonderful tool.

I want to compare my .hic data with another downloaded .hic file.
I used hic2cool to obtain .cool input, but it seems that the two files differ in nbins.
HiC1.cool has "nbins": 3113 (1MB resolution)
HiC2.cool has "nbins": 3114 (1MB resolution)

Is there any way to fix this?
Any help is appreciated.

Best,
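A one-bin difference like this usually means the two files do not describe exactly the same set of chromosomes or chromosome lengths. The sketch below only locates where the bin counts diverge; it does not fix the files. It assumes fixed-size bins that restart at each chromosome, and the helper name and the sizes are hypothetical.

```python
from typing import Dict, List, Tuple


def bin_count_differences(
    sizes1: Dict[str, int], sizes2: Dict[str, int], bin_size: int
) -> List[Tuple[str, int, int]]:
    """List (chromosome, bins_in_file1, bins_in_file2) wherever the counts differ.

    A chromosome missing from one file is reported with 0 bins for that file.
    """
    def n_bins(length: int) -> int:
        # Ceiling division: a partial final bin still counts as one bin
        return -(-length // bin_size)

    differences = []
    for name in sorted(set(sizes1) | set(sizes2)):
        n1 = n_bins(sizes1[name]) if name in sizes1 else 0
        n2 = n_bins(sizes2[name]) if name in sizes2 else 0
        if n1 != n2:
            differences.append((name, n1, n2))
    return differences


# Made-up chromosome sizes, for illustration only
diffs = bin_count_differences(
    {"chr1": 2_500_000},
    {"chr1": 2_500_000, "chrM": 16_000},
    bin_size=1_000_000,
)
```

With real files, the two mappings could be built from dict(cooler.Cooler(path).chromsizes) for each input. Once the offending chromosome is known, one option is to restrict the comparison to the chromosomes both files agree on.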

The chromosome order in output.SCC.txt

Hi,

I was wondering if the chromosome order in output.SCC.txt is chr1, chr2,..,chr22, chrX, chrY (from top to bottom)? Thanks!

Best,
Kun

Is it possible to output a data1_vs_data2.cool for downstream analysis?

Hi dejunlin,
Thanks for your great tool. I have two Hi-C data sets, tissue1 and tissue2, and I want to compare them and get a tissue1_vs_tissue2.cool for further analysis. Could you help me?

AssertionError: Contact matrix 2 of chromosome Y is empty

When I run the code using 2 .cool files rather than 2 .mcool files, there is an Assertion Error (AssertionError: Contact matrix 2 of chromosome Y is empty). However, when checking the two .cool files, there is data for chromosome Y. Why is there an error? Would just like to add that the same .mcool files show no error and works perfectly. Need an answer as soon as possible :) :)

Checking to see if there is missing data
chromosomes: ['M', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'], binsize: 10000
M : (0, 2)
1 : (2, 24898)
2 : (24898, 49118)
3 : (49118, 68948)
4 : (68948, 87970)
5 : (87970, 106124)
6 : (106124, 123205)
7 : (123205, 139140)
8 : (139140, 153654)
9 : (153654, 167494)
10 : (167494, 180874)
11 : (180874, 194383)
12 : (194383, 207711)
13 : (207711, 219148)
14 : (219148, 229853)
15 : (229853, 240053)
16 : (240053, 249087)
17 : (249087, 257413)
18 : (257413, 265451)
19 : (265451, 271313)
20 : (271313, 277758)
21 : (277758, 282429)
22 : (282429, 287511)
X : (287511, 303116)
Y : (303116, 308839)

Code
from hicrep import hicrepSCC

h = 1

dBPMax = 500000

bDownSample = False

scc = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample)

Error Message

AssertionError Traceback (most recent call last)
Cell In[89], line 18
13 bDownSample = False
15 # compute the SCC score
16 # this will result in a SCC score for each chromosome available in the data set
17 # listed in the same order as the chromosomes are listed in the input Cooler files
---> 18 scc = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample)

File ~/miniconda3/envs/jupyter/lib/python3.11/site-packages/hicrep/hicrep.py:177, in hicrepSCC(cool1, cool2, h, dBPMax, bDownSample, chrNames, excludeChr)
174 assert mS1.shape[0] == mS1.shape[1],
175 "Contact matrix 1 of chromosome %s is not square" % (chrName)
176 mS2 = getSubCoo(p2, bins2, chrName)
--> 177 assert mS2.size > 0, "Contact matrix 2 of chromosome %s is empty" % (chrName)
178 assert mS2.shape[0] == mS2.shape[1],
179 "Contact matrix 2 of chromosome %s is not square" % (chrName)
180 assert mS1.shape == mS2.shape,
181 "Contact matrices of chromosome %s have different input shape" % (chrName)

AssertionError: Contact matrix 2 of chromosome Y is empty

Compare similarity using ICE-normalized cool files

Great work! However, I have a specific requirement where I need to compare the similarity between Hi-C matrices from different batches. I have ICE-normalized cool files. I would like to know how to handle this type of data for my analysis.

Any suggestions or examples would be greatly appreciated.

Thank you in advance for your help!

"chrM" not being excluded

Hi,

I get an error when I run hicrep on my mouse Hi-C mcool files (mm10 genome). Basically, the lack of data for chrM is causing hicrep to fail. I see that you have a rule in the code to exclude chromosome "M", but not "chrM" (line 161 of hicrep.py). Would it be possible to allow for more general nomenclature of the mitochondrial chromosome?

Right now, I get around the issue by explicitly listing all chromosomes as an argument but would rather not need to do that.
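The workaround of listing chromosomes explicitly can be automated by filtering out mitochondrial names before calling hicrep. This is a sketch only: the set of aliases below is an assumption about common naming schemes, not something defined by hicrep, and drop_mitochondrial is a made-up helper.

```python
from typing import Iterable, List

# Common spellings of the mitochondrial chromosome (assumed, not exhaustive)
MITO_ALIASES = {"m", "mt", "chrm", "chrmt"}


def drop_mitochondrial(chrom_names: Iterable[str]) -> List[str]:
    """Return the chromosome names with mitochondrial aliases removed."""
    return [name for name in chrom_names if name.lower() not in MITO_ALIASES]


# Made-up chromosome list, for illustration only
kept = drop_mitochondrial(["chr1", "chr2", "chrX", "chrM"])
```

The filtered list can then be passed as the chromosome-name argument, for example np.array(kept, dtype=str) as the sixth argument of hicrepSCC, or through --chrNames on the command line.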

Error is shown below: Python script uses the toy example from github README with my own mcool files

>python hicrep_test.py 
Traceback (most recent call last):
  File "hicrep_test.py", line 30, in <module>
    scc = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample)
  File "/home/kevbrick/data/get_sequencing_data/hicrep/lib/python3.8/site-packages/hicrep/hicrep.py", line 161, in hicrepSCC
    assert mS1.size > 0, "Contact matrix 1 of chromosome %s is empty" % (chrName)
AssertionError: Contact matrix 1 of chromosome chrM is empty

Thanks for the great tool !
Kevin

KeyError: 'sum'

Hi. I've installed hicrep into Python 3.8 venv. The command line version breaks with the KeyError: 'sum' error:

hicrep --binSize 100000 --h 10 --dBPMax 5000000 03.mcool 04.mcool 03_04.txt
Traceback (most recent call last):
  File "/home/mdozmorov/miniconda3/bin/hicrep", line 8, in <module>
    sys.exit(main())
  File "/home/mdozmorov/miniconda3/lib/python3.8/site-packages/hicrep/__init__.py", line 83, in main
    scc = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample)
  File "/home/mdozmorov/miniconda3/lib/python3.8/site-packages/hicrep/hicrep.py", line 146, in hicrepSCC
    n1 = cool1.info['sum']
KeyError: 'sum'

Trying from Python results in a different error:

from hicrep.utils import readMcool
fmcool1 = "/home/sequencing/j_03.mcool"
fmcool2 = "/home/sequencing/04.mcool"
binSize = 100000
cool1, binSize1 = readMcool(fmcool1, binSize)
cool2, binSize2 = readMcool(fmcool2, binSize)
h = 10
dBPMax = 500000
bDownSample = False
scc = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'hicrepSCC' is not defined

Any advice?
Thanks,
Mikhail

[Request] Functionality to use hicrep with .hic files

I have some .hic files I'd like to analyse with hicrep, but the current implementation is hard-coded to use files generated by Cooler.
Are there any plans to extend hicrep's functionality to allow for .hic input files?

Existing python implementation

Hello,

It's great you're doing this, I just wanted to let you know I already reimplemented it in python last year with support for cool files / sparse matrices: https://github.com/cmdoret/hicreppy

I saw that you are writing a paper where you compare your python implementation to the original hicrep package. Perhaps you could consider including my implementation in your comparisons?

If some parts of my implementation happen to be more efficient, you are welcome to use them in your code.

I suspect results from my implementation differ more from the original package as I took a few liberties. For example, it returns the average SCC of all chromosomes weighted by their lengths, instead of each SCC independently.

Best,
Cyril

Should the input matrix be normalized for hicrep?

Hi,

A quick question: should I feed the raw matrix or normalized matrix to hicrep? If the normalized matrix should be used, what method should be used for normalization, explicit or implicit (matrix-balancing)?

Sorry if the question is too naive. I am new to Hi-C data.

Thank you in advance.

FutureWarning: treating keys as positions is deprecated in pandas

Describe the bug
Running HicRep gives the FutureWarning:

FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use ser.iloc[pos]
assert (cool1.chroms()[:] == cool2.chroms()[:]).all()[0],\

To Reproduce
Simply run HicRep, it happens every time.

Desktop (please complete the following information):
Most recent version running on Ubuntu Jammy Jellyfish

It would be great to have this cleared up so my users don't have to worry about the package breaking at some point in the future. Thank you!!!
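The warning comes from indexing a label-indexed Series with the integer 0. The sketch below reproduces the pattern on toy data and shows two ways to avoid it; it is an illustration of the pandas behavior, not a patch taken from the hicrep source. The DataFrames are made-up stand-ins for what cool1.chroms()[:] and cool2.chroms()[:] return.

```python
import pandas as pd

# Toy stand-ins for the chromosome tables of two Cooler files (made-up values)
chroms1 = pd.DataFrame({"name": ["chr1", "chr2"], "length": [1000, 800]})
chroms2 = pd.DataFrame({"name": ["chr1", "chr2"], "length": [1000, 800]})

# Column-wise comparison: a Series indexed by the labels 'name' and 'length'
same_per_column = (chroms1 == chroms2).all()

# Positional access through .iloc avoids the deprecated integer-key lookup
first_column_matches = bool(same_per_column.iloc[0])

# Alternatively, require every column to match
all_columns_match = bool(same_per_column.all())
```

Using .iloc[0] keeps the original meaning of the check (first column only), while the second form is stricter because it also compares chromosome lengths.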