
Comments (16)

ruochiz commented on August 16, 2024

Hey, I think this is likely caused by choosing too coarse a resolution (6,000,000). To make things easier for the user, Higashi implements some heuristics that decide the feature dimension, model size, etc. based on the genome reference size and the resolutions typically used for analysis. At a resolution of 6,000,000, one of the dimensions suggested by the model likely becomes 1, which leads to this error. I would suggest changing it to 1 Mb as a starting point. If the problem persists, I'll take another look. If for some reason the 6 Mb resolution is necessary, I can add a fix to the code to avoid this issue.
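Changing the resolution only requires editing one key in the JSON config. A minimal sketch, assuming the `"resolution"` key (in bp) used by Higashi configs; the config shown here is a hypothetical fragment, not a complete working file:

```python
import json

# Hypothetical minimal config for illustration; check your own file for
# the full set of fields Higashi requires.
config = {
    "chrom_list": ["chr11"],
    "resolution": 6000000,  # too coarse: a suggested model dim can collapse to 1
}

# Switch to 1 Mb as a starting point.
config["resolution"] = 1000000

with open("config_souris.JSON", "w") as f:
    json.dump(config, f, indent=2)
```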

from higashi.

Samfouss commented on August 16, 2024

Thank you for responding to my concern. Your response was very helpful. I switched to 1 Mb as suggested.
But now I am getting a message like: "The 0 th chrom in your chrom_list has no sample in this generator". Can you help me understand what it means? I am working on chr11, as specified in the configuration file.
Do you think it is caused by the sparsity of my data (total_sparsity_cell 0.00040311744154797097)? I noticed that in your tutorial (Higashi/tutorials/4DN_sci-Hi-C_Kim et al.ipynb), you get something like total_sparsity_cell 0.012761184803150997.
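For reference, a number like total_sparsity_cell can be read as the fraction of non-zero entries in a cell's contact map. The sketch below illustrates that reading only; the exact definition Higashi uses internally may differ:

```python
import numpy as np

# Sketch: sparsity as the fraction of non-zero entries in a contact map.
rng = np.random.default_rng(0)
contact_map = rng.random((250, 250))
contact_map[contact_map > 0.0004] = 0.0  # keep roughly 0.04% of entries

sparsity = np.count_nonzero(contact_map) / contact_map.size
print(sparsity)  # on the order of 4e-4, i.e. a very sparse cell
```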

```
>>> from higashi.Higashi_wrapper import *

# config = "config_mousse.JSON"
config = "config_souris.JSON"
print("1. Config finished")
1. Config finished

# Initialize the Higashi instance
higashi_model = Higashi(config)

# Data processing (only needs to be run once)
higashi_model.process_data()
generating start/end dict for chromosome
extracting from data.txt
100%|██████████| 39410250/39410250 [01:56<00:00, 337588.28it/s]
generating contact maps for baseline
data loaded
750 False
creating matrices tasks: 100%|██████████| 1/1 [00:00<00:00, 1.07it/s]
total_feats_size 200
0%| | 0/1 [00:00<?, ?it/s]Done here 1
1
Done here 2
100%|██████████| 1/1 [00:00<00:00, 67.57it/s]

higashi_model.prep_model()
cpu_num 32
training on data from: ['chr11']
total_sparsity_cell 0.00040311744154797097
no contractive loss
batch_size 256
Node type num [250 122] [250 372]
start making attribute
0.994: 32%|███▏| 96/300 [00:00<00:00, 433.01it/s]
loss 0.9697239995002747 loss best 0.9167578220367432 epochs 96

initializing data generator
0%| | 0/1 [00:00<?, ?it/s]
The 0 th chrom in your chrom_list has no sample in this generator
100%|██████████| 1/1 [00:00<00:00, 7194.35it/s]
initializing data generator
0%| | 0/1 [00:00<?, ?it/s]
The 0 th chrom in your chrom_list has no sample in this generator
100%|██████████| 1/1 [00:00<00:00, 7752.87it/s]

print("2. Process finished")
2. Process finished
```

ruochiz commented on August 16, 2024

Hmm. This error is raised when there are no hyperedges to train the model on. For debugging purposes, could you run the following script?

```python
import os

import h5py
import numpy as np

temp_dir = ...  # set this to the temp_dir from your config
with h5py.File(os.path.join(temp_dir, "node_feats.hdf5"), "r") as input_f:
    print(len(np.array(input_f['train_data_%s' % "chr11"]).astype('int')))
```

Also, what are your minimum and maximum distances in the config file?

Thanks.

Samfouss commented on August 16, 2024

In my config file, I have:
"minimum_impute_distance": 0,
"maximum_impute_distance": -1

The fact that there are no hyperedges to train the model on is probably related to my cell data; maybe it is too sparse.

Samfouss commented on August 16, 2024

So maybe I have to increase these two distances.

ruochiz commented on August 16, 2024

That seems to be using all the edges, so I think these two parameters are probably fine. Also, I meant minimum_distance, not minimum_impute_distance.

Could you try running the code I provided above to see whether there are any edges before the filtering step? That would help narrow down where the problem is (too few reads, reads that are mostly short-range interactions, etc.).
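For intuition, a distance filter of this kind keeps only contacts whose genomic separation falls inside a configured window. The parameter names below come from the config discussed here; the filtering logic itself is a sketch of the general technique, not Higashi's actual code:

```python
import numpy as np

# Contacts as (bin1, bin2) pairs at a given resolution; a contact is
# kept only if its genomic distance lies inside the configured window.
resolution = 1000000
minimum_distance = 1000000   # drop sub-1 Mb (short-range) contacts
maximum_distance = -1        # -1 means no upper bound

bin1 = np.array([0, 10, 20, 30])
bin2 = np.array([0, 15, 40, 90])

dist = np.abs(bin2 - bin1) * resolution
mask = dist >= minimum_distance
if maximum_distance > 0:
    mask &= dist <= maximum_distance
print(mask.sum())  # contacts surviving the filter
```

If most reads are short-range, nearly everything is filtered out and no hyperedges remain, which matches the error above.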

Thanks!

Samfouss commented on August 16, 2024

When I run the code above, this is what I get:

```
>>> temp_dir = ...
>>> with h5py.File(os.path.join(temp_dir, "node_feats.hdf5"), "r") as input_f:
...     print(len(np.array(input_f['train_data_%s' % "chr11"]).astype('int')))
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cvmfs/samfouss/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not ellipsis
```
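The TypeError here comes from leaving `temp_dir = ...` as the literal `...` (Python's Ellipsis object, which `os.path.join` cannot convert to a path). It needs to be replaced with the actual temp_dir from the config; the path below is only a placeholder:

```python
import os

temp_dir = "/path/to/temp_dir"  # placeholder: use the temp_dir from your config
print(os.path.join(temp_dir, "node_feats.hdf5"))
```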

Samfouss commented on August 16, 2024

Also, I do not know what happened, but now I can run `higashi_model.prep_model()` without any problem.
```
>>> from higashi.Higashi_wrapper import *

# Set the path to the configuration file, change it accordingly
# config = "config_mousse.JSON"
config = "config_souris.JSON"
print("1. Config finished")
1. Config finished

# Initialize the Higashi instance
higashi_model = Higashi(config)

# Data processing (only needs to be run once)
higashi_model.process_data()
generating start/end dict for chromosome
extracting from data.txt
100%|██████████| 39410250/39410250 [01:58<00:00, 333620.38it/s]
generating contact maps for baseline
data loaded
2831250 False
creating matrices tasks: 100%|██████████| 1/1 [15:21<00:00, 921.79s/it]

total_feats_size 200
0%| | 0/1 [00:00<?, ?it/s]Done here 1
149
Done here 2
100%|██████████| 1/1 [00:00<00:00, 1.30it/s]

higashi_model.prep_model()
cpu_num 32
training on data from: ['chr11']
total_sparsity_cell 0.00040311744154797097
no contractive loss
batch_size 1280
Node type num [ 250 12185] [ 250 12435]
start making attribute
0.636: 100%|██████████| 300/300 [00:01<00:00, 213.18it/s]
loss 0.6364461779594421 loss best 0.6372790932655334 epochs 299

initializing data generator
100%|██████████| 1/1 [00:00<00:00, 25731.93it/s]
initializing data generator
100%|██████████| 1/1 [00:00<00:00, 27962.03it/s]

print("2. Process finished")
2. Process finished
```

But `higashi_model.train_for_embeddings()` takes a lot of time to execute.

GMFranceschini commented on August 16, 2024

I am getting the same error; I tried 1 Mb and 100 kb to no avail.
My sparsity is 0.22241997971104638, which should be quite good. Do you have any suggestions on how to debug this? In the temp_dir I can't find a node_feats.hdf5 file to dump as you suggested.

ruochiz commented on August 16, 2024

Hey, what version of Higashi are you using? Is it the one from conda, or from GitHub + pip install?

GMFranceschini commented on August 16, 2024

I installed it by downloading the repo and using setup.py.
From my conda env export I see:

name: fasthigashi
  - fasthigashi=0.1.1=py_0

ruochiz commented on August 16, 2024

I see. And to confirm, the error is: `ValueError: Found array with 1 feature(s) (shape=(250, 1)) while a minimum of 2 is required by TruncatedSVD`?

GMFranceschini commented on August 16, 2024

Yes, though with a different n for my data:

'ValueError: Found array with 1 feature(s) (shape=(69, 1)) while a minimum of 2 is required by TruncatedSVD.'

ruochiz commented on August 16, 2024

To help with debugging:

  1. Are you working with a custom genome or a standard one (like hg38 / mm10)?
  2. Are there any chromosomes shorter than 1 Mb in the dataset?
  3. Under the temp_dir there should be some files named "cell_adj_%s.npy"; could you load one of them and print out the shape?
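Step 3 could look like the sketch below. The dummy file is created only so the snippet runs end-to-end (point `temp_dir` at the real directory instead), and the cells-by-cells shape is an assumption about the file's contents, not something confirmed by the thread:

```python
import os
import tempfile

import numpy as np

# Stand-in for Higashi's temp_dir: create a dummy cell_adj_chr11.npy so
# the snippet is runnable; replace temp_dir with your real directory.
temp_dir = tempfile.mkdtemp()
np.save(os.path.join(temp_dir, "cell_adj_chr11.npy"), np.zeros((250, 250)))

adj = np.load(os.path.join(temp_dir, "cell_adj_chr11.npy"))
print(adj.shape)
```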

Thanks!

GMFranceschini commented on August 16, 2024

Thanks to you! You may already have found the problem.

I'm on hg19, but I had filtered out the uncharacterized chromosomes. However, chrM was still there.
I removed it, along with all entries referencing it in the pairs, and now the SVD step works. Maybe a small addition to the documentation could note that only chr1 through chrX/Y should be included; that's what I should have done in the first place!
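For anyone hitting the same wall: a contig shorter than the binning resolution collapses to a single bin, which is exactly the one-column matrix TruncatedSVD rejects. A quick back-of-the-envelope check (the chrM length is the hg19 value; the exact bin-count formula is an assumption about how chromosomes are binned):

```python
# chrM in hg19 is ~16.6 kb, far below a 1 Mb resolution, so it yields a
# single bin and a per-cell feature matrix with only one column.
chrom_length = 16571      # hg19 chrM length in bp
resolution = 1000000

n_bins = chrom_length // resolution + 1  # ceiling-style binning (assumed)
print(n_bins)  # 1 -> shape (n_cells, 1), too few features for TruncatedSVD
```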

Thank you for the help, and feel free to close this!

ruochiz commented on August 16, 2024

I see, sounds good. Will update the documentation.
