
musical's People

Contributors

begeiger, dgulhan-bio, hu-jin, viklju


musical's Issues

Use rank deficiency to select n_components for mvNMF

When n_components is over-specified, mvNMF tends to produce rank-deficient solutions, i.e., solutions in which two or more signatures are essentially identical. We can use this behavior to select n_components. We need to run tests comparing this method with the current method based on samplewise reconstruction errors.

For an example, see #19
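A minimal sketch of what the rank-deficiency check could look like; the function name and threshold are made up, not existing API:

import numpy as np

def is_rank_deficient(W, sim_threshold=0.99):
    # Flag a solution whose signature matrix W (features x components,
    # one signature per column) contains near-identical columns.
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    sim = Wn.T @ Wn               # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)    # ignore self-similarity
    return bool(np.any(sim > sim_threshold))

We could then scan candidate n_components and keep the largest value whose solution is not flagged.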

Small bug in remove_samples_based_on_gini?

Two small, related issues with multiple potential solutions. I tried using the remove_samples_based_on_gini method after performing an extraction, with H from the DenovoSig class and my mutation count matrix X (loaded as a pandas DataFrame per example_full_pipeline.ipynb) as my inputs. This resulted in a NotImplementedError, which I traced back to the mad = np.abs(np.subtract.outer(x, x)).mean() line in the gini method of preprocessing.py. The problem seems to be that pandas no longer supports the outer method on DataFrames or Series. Converting the DataFrame to a numpy array fixed the problem, so I think clearer documentation about the expected input data types and/or a conversion of DataFrames to arrays within the method would help. Similarly, if you pass H as a DataFrame rather than an array, you get a TypeError on the h_norm = h/np.sum(X, axis = 0) line in remove_samples_based_on_gini, since that operation isn't supported on DataFrames. The same fixes would work there.
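A minimal sketch of the suggested fix, assuming the gini routine follows the usual mean-absolute-difference formula around the line quoted above; the np.asarray conversion is the actual point:

import numpy as np

def gini(x):
    # Accept pandas Series/DataFrame columns as well as arrays:
    # np.subtract.outer is not supported on pandas objects, so convert
    # to a plain float array first.
    x = np.asarray(x, dtype=float).ravel()
    mad = np.abs(np.subtract.outer(x, x)).mean()  # mean absolute difference
    rmad = mad / np.mean(x)                       # relative MAD
    return 0.5 * rmad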

Improve the automatic selection of n_components

Currently we first filter out values of n_components that do not have stable solutions, then calculate a p-value for sample-wise reconstruction errors, and finally choose n_components based on those p-values.

Some other pieces of information could be used for automatically selecting the best n_components:

  1. Set a threshold for the minimum stability. For example, we scan from small to large n_components and stop whenever the min stability drops below, say, 0.5 (see the sketch after this list).
  2. Use COSMIC signatures. When we go from small to large n_components, if the additional signature is not that stable but can be matched to (potentially multiple) COSMIC signatures, then we keep it. If the additional signature cannot be matched to COSMIC signatures but is very stable, then we also keep it, since it might be a new signature.
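A minimal sketch of heuristic 1 (the function name and input format are assumptions):

def select_n_components_by_stability(min_stabilities, threshold=0.5):
    # min_stabilities: dict mapping n_components -> minimum signature
    # stability of the clustered solution at that rank (hypothetical).
    selected = None
    for n in sorted(min_stabilities):
        if min_stabilities[n] < threshold:
            break  # stop scanning once stability drops below the threshold
        selected = n
    return selected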

R version request

Hi.
MuSiCal is a better tool than SigProfiler, but it would be even better to develop an R version of MuSiCal.
If you develop one, please let me know. Thank you very much.

Corner cases for sil_score calculation in _gather_results()

There are cases where a cluster contains only 1 sample. In those cases, the sil_score for that signature is currently 0. It might be better to set the sil_score to 1 in those cases.

Example: I ran DenovoSig with mvNMF on simulated PCAWG Lung-AdenoCA data, with init=random and n_replicates=20. SBS3 is difficult to discover. It turns out that in 19 of the 20 replicates, two SBS4's are extracted, while in the remaining replicate an SBS3 is discovered. So in the final clustering result, there is an SBS3 cluster with 1 sample and an SBS4 cluster with 39 samples.

This example also illustrates why it may not be good to force each cluster to contain one and only one sample from each replicate, as done in SigProfiler.
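A minimal sketch of the proposed special-casing, assuming scikit-learn's silhouette_samples (which itself assigns 0 to samples in singleton clusters); the helper name is made up:

import numpy as np
from sklearn.metrics import silhouette_samples

def per_cluster_silhouette(X, labels):
    labels = np.asarray(labels)
    s = silhouette_samples(X, labels)
    scores = {}
    for lab in np.unique(labels):
        mask = labels == lab
        # A one-sample cluster has no within-cluster distances to compare
        # against; treat it as perfectly separated (score 1) instead of 0.
        scores[lab] = 1.0 if mask.sum() == 1 else float(s[mask].mean())
    return scores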

Cannot load catalogs

Hello!

I followed the instructions for "Install from Source" in your README (I also checked to see if there was a recipe to install the package from conda using conda search *musical* and via the bioconda site, and didn't find anything, which is why I went with install from source).

I am able to import MuSiCal in Python, but when I run musical.load_catalog() I get the following error message:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/miniconda3/envs/{MYENVNAME}/lib/python3.10/site-packages/musical/data/COSMIC-MuSiCal_v3p2_SBS_WGS.csv'

The same error persists both in Jupyter and when invoking Python in the console, so it shouldn't be a kernel issue.

Inside the .../site-packages/musical/data/ directory there were only a __pycache__ directory and one .py script, no further files.

Can you please advise on how to fix this issue? Possibly there was a problem with my installation, though I used pip inside my conda environment and no issues were reported. All dependencies were previously installed via conda.
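Symptoms like this (only __pycache__ and a .py file in the installed data directory) typically mean the CSVs were not declared as package data. A hypothetical sketch of the packaging-side fix; the globs are assumptions, not the project's actual setup.py:

# setup.py (relevant fragment only)
from setuptools import setup, find_packages

setup(
    name="musical",
    packages=find_packages(),
    include_package_data=True,
    # Ensure the catalog CSVs are copied into site-packages/musical/data/
    package_data={"musical": ["data/*.csv"]},
)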

Improve parallelization of DenovoSig

The current parallelization of DenovoSig using multiprocessing works fine. Note that it is important to set sbatch -n to ncpu in DenovoSig, instead of sbatch -c.

When sbatch -n is set to ncpu in DenovoSig (with -c set to 1), I checked that the running time of each job performed by each worker is almost the same as that of a serial job, so we get almost an ncpu-fold speedup.

When sbatch -c is set to ncpu in DenovoSig (with -n set to 1), the running time of each job performed by each worker is about ncpu times slower than that of a serial job. As a result, we don't get any speedup.

So it is important to use sbatch -n instead of sbatch -c. That contradicts my previous understanding of sbatch, though, and we need to understand this behavior better.

Note that the results stated above are the same for NMF and mvNMF.

There could be potential improvements to our parallelization scheme. It is possible that some time is spent on pickling objects; see https://thelaziestprogrammer.com/python/a-multiprocessing-pool-pickle. Although, judging from the results above, there does not seem to be much overhead in our current code.
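For reference, a minimal sketch of the Pool pattern presumably in use (the worker and payloads are placeholders, not the actual code); each task's arguments are pickled when it is dispatched to a worker process, which is where any pickling overhead would enter:

import multiprocessing as mp

def _run_one_replicate(args):
    # Hypothetical worker standing in for one NMF/mvNMF replicate.
    X, n_components, seed = args
    # ... bootstrap X, run the replicate, return its solution ...
    return seed  # placeholder result

if __name__ == "__main__":
    tasks = [(None, 3, seed) for seed in range(20)]  # placeholder payloads
    with mp.Pool(processes=20) as pool:
        results = pool.map(_run_one_replicate, tasks)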

Add a comment on n_replicates

n_replicates needs to be either 1, which means no replicates will be done, or >= 10. We need to document this somewhere to let users know.

Of course for other n_replicates values, the code will still run and generally there won't be an issue. Setting n_replicates >= 10 just makes more sense with the default thresholds in _select_n_components(). See caf9254
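A minimal sketch of a check we could add (the helper name is made up):

import warnings

def check_n_replicates(n_replicates):
    # n_replicates should be 1 (no replicates) or >= 10; anything in
    # between still runs, but interacts poorly with the default
    # thresholds in _select_n_components().
    if 1 < n_replicates < 10:
        warnings.warn(
            "n_replicates should be either 1 or >= 10; values in "
            "between may not work well with the default selection "
            "thresholds."
        )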

Refitting example

Hello,

I am trying to use the refitting example, however, I cannot find "musical.refit" in the package. Should I use "musical.assign" or "musical.assign_grid"?

Thank you!

Cohort Stratification Error

I am trying to stratify my cohort and get an error during the following step:
k, clusters, Xs, model = musical.preprocessing.stratify_samples(X.values, H.values)

This is the error I receive:
MuSiCal/MuSiCal-main/musical/cluster.py:181: RuntimeWarning: divide by zero encountered in log
self.Wk_log = np.log(self.Wk)
MuSiCal/MuSiCal-main/musical/cluster.py:195: RuntimeWarning: divide by zero encountered in log
self.Wk_log_ref_all = np.log(self.Wk_ref_all)
python37_musical/lib/python3.11/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
MuSiCal/MuSiCal-main/musical/cluster.py:206: RuntimeWarning: invalid value encountered in subtract
self.gap_statistic_log = self.Wk_log_ref - self.Wk_log

Here is the beginning of my H.values:
array([[ 13.02370177, 22.36663251, 4.58321832, 2.30586172,
6.80032626, 7.65084841, 20.57174326, 13.90007932,
29.7988945 , 0.87010952, 11.40108359, 37.65595118,
3.78449284, 9.85849575, 8.67073113, 4.58633285],

And the beginning of X.values:
array([[8, 3, 3, ..., 0, 0, 1],

This is WXS data by the way. Thanks for your help!
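The warnings suggest that some of the within-cluster dispersions Wk are exactly zero, and np.log(0) is -inf, which then produces NaNs in the gap-statistic subtraction. A minimal reproduction plus one possible guard (the clipping is an assumption, not a fix that exists in the code):

import numpy as np

Wk = np.array([4.2, 0.0, 1.7])  # hypothetical dispersions; the zero could
                                # come from a degenerate (e.g., tiny) cluster
np.log(Wk)  # emits "RuntimeWarning: divide by zero encountered in log"

# One possible guard: clip exact zeros before taking the log, so the
# downstream subtraction does not produce NaNs from -inf - (-inf).
eps = np.finfo(float).tiny
Wk_log = np.log(np.clip(Wk, eps, None))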

De novo signature discovery - very slow

Hi!

Thank you for developing this tool! I am currently using it for signature refitting and decomposition, but I'd also like to take advantage of the signature discovery module. However, I've been having some issues with runtime. When trying to extract signatures from a matrix with ~6000 samples, a 24-hour job only gets through 1 extracted signature, which is very slow compared to other tools, and I'm wondering if I'm doing something wrong when assigning resources to the job.

These are the slurm options used (from the documentation I understand that the ncpu value should match -n, not -c):

#SBATCH -J musical
#SBATCH -n 20
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem 20G

conda activate python37_musical
python3 Musical.py

And this is the MuSiCal script:

model = musical.DenovoSig(X, 
                          min_n_components=1, # Minimum number of signatures to test
                          max_n_components=10, # Maximum number of signatures to test
                          init='random', # Initialization method
                          method='mvnmf', # mvnmf or nmf
                          n_replicates=20, # Number of mvnmf/nmf replicates to run per n_components
                          ncpu=20, # Number of CPUs to use
                          max_iter=100000, # Maximum number of iterations for each mvnmf/nmf run
                          bootstrap=True, # Whether or not to bootstrap X for each run
                          tol=1e-8, # Tolerance for claiming convergence of mvnmf/nmf
                          verbose=1, # Verbosity of output
                          normalize_X=False # Whether or not to L1 normalize each sample in X before mvnmf/nmf
                         )
model.fit()

And the musical logs:

Extracting signatures for n_components = 1..................
Selected lambda_tilde = 2. This lambda_tilde will be used for all subsequent mvNMF runs.
/home/mmunteanu/.conda/envs/python37_musical/lib/python3.7/site-packages/musical/mvnmf.py:509: UserWarning: No p-value is smaller than or equal to 0.05. The largest lambda_tilde is selected. Enlarge the search grid of lambda_tilde.
  UserWarning)
n_components = 1, replicate 14 finished.
n_components = 1, replicate 10 finished.
n_components = 1, replicate 9 finished.
n_components = 1, replicate 18 finished.
n_components = 1, replicate 2 finished.
n_components = 1, replicate 1 finished.
n_components = 1, replicate 19 finished.
n_components = 1, replicate 7 finished.
n_components = 1, replicate 12 finished.
n_components = 1, replicate 15 finished.
n_components = 1, replicate 8 finished.
n_components = 1, replicate 6 finished.
n_components = 1, replicate 13 finished.
n_components = 1, replicate 11 finished.
n_components = 1, replicate 17 finished.
n_components = 1, replicate 3 finished.
n_components = 1, replicate 16 finished.
n_components = 1, replicate 4 finished.
n_components = 1, replicate 5 finished.
Time elapsed: 2.79e+04 seconds.
Extracting signatures for n_components = 2..................
Selected lambda_tilde = 0.1. This lambda_tilde will be used for all subsequent mvNMF runs.

Many thanks,
Maia

Normalize W at the end of NMF

Currently the output W of NMF.fit() is not normalized. It would be better to normalize at the end so that each column of W (each signature) sums to 1, rescaling H accordingly. I think the output W of MVNMF.fit() is always normalized, but we need to double-check.
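A minimal sketch of the normalization, assuming W holds signatures as columns and H holds the corresponding exposures as rows (the helper name is made up):

import numpy as np

def normalize_WH(W, H):
    scale = W.sum(axis=0)        # column sums of W
    W_norm = W / scale           # each signature now sums to 1
    H_norm = H * scale[:, None]  # absorb the scale into the exposures,
                                 # so that W_norm @ H_norm == W @ H
    return W_norm, H_norm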

Inconsistency of using KL divergence with NNLS

We perform NNLS frequently in the code, e.g., at the end of NMF and mvNMF, and after _gather_results(). But we still use KL divergence (e.g., beta divergence with beta = 1) for selection of lambda_tilde in mvNMF, selection of n_components, and filtering of bad solutions in _gather_results(). That is a bit inconsistent.

We can consider switching to beta divergence with beta = 2 once we have done NNLS.

In fact, it is never consistent because we do NMF/mvNMF with KL divergence and NNLS for inferring exposures. So no matter which norm we choose in the downstream selections, there will be inconsistencies.

I ran into this issue when I was looking at samplewise errors. Sometimes the reconstructed spectrum looks very similar to the sample spectrum by eye, yet the KL divergence for that sample is relatively large. I think this is because NNLS optimizes the Frobenius norm. If we use a KL-divergence version of NNLS, that problem should disappear.

So, maybe instead of doing NNLS, we can do a KL-divergence version of NNLS.
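A minimal sketch of a KL-divergence refit for a single sample, using the standard multiplicative updates with W held fixed (the function name, iteration count, and initialization are made up):

import numpy as np

def nnls_kl(W, x, n_iter=1000, eps=1e-12):
    # Minimize the generalized KL divergence D(x || W h) over h >= 0,
    # instead of the Frobenius objective that ordinary NNLS minimizes.
    n_features, n_components = W.shape
    h = np.full(n_components, x.sum() / n_components)  # uniform start
    denom = W.sum(axis=0)                              # W^T 1
    for _ in range(n_iter):
        ratio = x / np.maximum(W @ h, eps)
        h *= (W.T @ ratio) / np.maximum(denom, eps)
    return h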

Challenges with Model Validation

(Screenshot attached: Screen Shot 2022-06-22 at 12.18.15 PM)

Hello! Thank you for sharing MuSiCal. I have been able to extract de novo signatures which was really exciting, but I am now having trouble with the part of the example full pipeline where Parameter optimization with in silico validation is meant to take place. I would love your help so that I can match to the right number of signatures. Thank you for your time.

Improve matching and refitting for samples with large errors

After matching and refitting, we can look at the reconstruction error per sample. For samples with large reconstruction errors, we can look at their error spectrum, which indicates which signatures were not included in the refitting but should have been. We can then iteratively add those signatures. This mechanism basically allows more signatures to be considered in the refitting process (e.g., signatures that were not discovered in this tumor type). (Suggestion from Doga)
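A hypothetical helper following this suggestion (all names are made up): compute the residual spectrum left after refitting one sample, and rank catalog signatures by cosine similarity to its positive (under-explained) part.

import numpy as np

def error_spectrum_matches(x, W, h, catalog, top_k=3):
    # x: observed spectrum; W: refitted signatures (features x k);
    # h: exposures; catalog: candidate signatures (features x n_catalog).
    residual = x - W @ h
    pos = np.clip(residual, 0, None)  # the under-explained part of x
    pos_norm = pos / (np.linalg.norm(pos) + 1e-12)
    cat_norm = catalog / np.linalg.norm(catalog, axis=0, keepdims=True)
    sims = cat_norm.T @ pos_norm      # cosine similarity per catalog signature
    order = np.argsort(sims)[::-1][:top_k]
    return order, sims[order]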
