parklab / musical
A comprehensive toolkit for mutational signature analysis
License: Other
See 6f5dfe0
When n_components is over-specified, mvNMF tends to produce rank-deficient solutions, i.e., solutions with identical signatures. We can use this behavior to select n_components. We need to run tests comparing this method with the current method based on sample-wise reconstruction errors.
For an example, see #19
The current code is too slow.
Two small related issues with multiple potential solutions: I tried using the remove_samples_based_on_gini method after performing an extraction, with H from the DenovoSig class and my mutation count matrix X (loaded as a pandas DataFrame per the example_full_pipeline.ipynb) as inputs. This resulted in a NotImplementedError, which I traced back to the
mad = np.abs(np.subtract.outer(x, x)).mean()
line in the gini method of preprocessing.py. The problem seems to be that pandas no longer supports the outer method on DataFrames or Series. Converting the DataFrame to a numpy array fixed the problem, so I think clearer documentation about the expected input data types and/or a conversion of DataFrames to arrays within the method would help. Similarly, if you pass H as a DataFrame rather than an array, you get a TypeError for the
h_norm = h/np.sum(X, axis = 0)
line in the remove_samples_based_on_gini method, since that operation isn't supported on DataFrames. The same fixes would work there!
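A minimal sketch of the fix proposed above: coerce pandas inputs to numpy arrays at the top of the gini computation so np.subtract.outer works. This is an illustrative reimplementation of the mean-absolute-difference Gini formula quoted from preprocessing.py, not MuSiCal's actual code.

```python
import numpy as np
import pandas as pd

def gini(x):
    """Gini coefficient of a 1-D array via the mean absolute difference.

    Coercing to a numpy array up front lets pandas Series pass through
    without hitting NotImplementedError on np.subtract.outer.
    """
    x = np.asarray(x, dtype=float)  # DataFrame column / Series -> ndarray
    mad = np.abs(np.subtract.outer(x, x)).mean()  # mean absolute difference
    rmad = mad / np.mean(x)  # relative mean absolute difference
    return 0.5 * rmad

# Works with a pandas Series input now:
gini(pd.Series([1.0, 2.0, 3.0, 4.0]))
```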
We should follow more closely the design of sklearn estimators: https://scikit-learn.org/stable/developers/develop.html.
For example, __init__ should only store parameters, and all computation should be moved from __init__ to fit(). Ideally, we should also do this for other classes, e.g., MVNMF, NMF, etc.
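A short sketch of the sklearn estimator convention referenced above: __init__ only stores hyperparameters, all work happens in fit(), and fitted attributes get a trailing underscore. The class and parameter names here (MyNMF, n_components, max_iter) are illustrative, not MuSiCal's actual API.

```python
import numpy as np

class MyNMF:
    def __init__(self, n_components=2, max_iter=100):
        # No validation or computation here -- just store parameters,
        # per the sklearn developer guidelines.
        self.n_components = n_components
        self.max_iter = max_iter

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(0)
        n, m = X.shape
        # Fitted attributes carry a trailing underscore.
        self.W_ = rng.random((n, self.n_components))
        self.H_ = rng.random((self.n_components, m))
        for _ in range(self.max_iter):
            # Standard multiplicative updates for the Frobenius objective.
            self.H_ *= (self.W_.T @ X) / (self.W_.T @ self.W_ @ self.H_ + 1e-12)
            self.W_ *= (X @ self.H_.T) / (self.W_ @ self.H_ @ self.H_.T + 1e-12)
        return self
```

With this layout, clone-and-refit patterns like grid search work out of the box, because constructing the object is cheap and side-effect free.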
Currently, we first filter out n_components values that do not have stable solutions, then calculate a p-value for sample-wise reconstruction errors, and finally choose n_components based on those p-values.
Some other pieces of information could be used for automatically selecting the best n_components.
Hi.
MuSiCal is a better tool than SigProfiler, but it would be even better to develop an R version of MuSiCal.
If you develop one, please let me know. Thank you very much.
Currently this test is only used in filtering out bad solutions. We should also use it in selecting n_components, in combination with a simple test for distribution means, which we currently use.
There are cases where a cluster contains only 1 sample. In those cases, the sil_score for that signature is currently 0. It might be better to set the sil_score to 1 in those cases.
Example: I ran DenovoSig with mvNMF on simulated PCAWG Lung-AdenoCA data, with init=random and n_replicates=20. SBS3 is difficult to discover. It turns out that in 19 of the 20 replicates, two SBS4's are extracted, while in the remaining one, an SBS3 is discovered. So in the final clustering result, there is an SBS3 cluster with 1 sample and an SBS4 cluster with 39 samples.
This example also illustrates why it may not be good to force each cluster to contain one and only one sample from each replicate, as done in SigProfiler.
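The proposed change (singleton clusters scoring 1 instead of 0) could be sketched as a small wrapper around scikit-learn's per-sample silhouette scores. This is an illustrative helper, not MuSiCal's implementation; the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def per_cluster_silhouette(X, labels):
    """Mean silhouette score per cluster.

    sklearn assigns a silhouette of 0 to samples in singleton clusters;
    following the suggestion above, we overwrite singleton clusters
    with a score of 1 instead.
    """
    labels = np.asarray(labels)
    scores = silhouette_samples(X, labels)
    out = {}
    for lab in np.unique(labels):
        mask = labels == lab
        out[lab] = 1.0 if mask.sum() == 1 else float(scores[mask].mean())
    return out
```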
Hello!
I followed the instructions for "Install from Source" in your README (I also checked to see if there was a recipe to install the package from conda using conda search *musical*
and via the bioconda site, and didn't find anything, which is why I went with install from source).
I am able to import MuSiCal in Python but when I run musical.load_catalog()
I get the following error message
FileNotFoundError: [Errno 2] No such file or directory: '/opt/miniconda3/envs/{MYENVNAME}/lib/python3.10/site-packages/musical/data/COSMIC-MuSiCal_v3p2_SBS_WGS.csv'
The same error persists both in Jupyter and when invoking Python in the console, so it shouldn't be a kernel issue.
Inside the .../site-packages/musical/data/
directory there was only a __pycache__
directory and one py script, no further files.
Can you please advise on how to fix this issue? Possibly there was a problem with my installation? Though I used pip inside my conda environment and there were no issues reported. All dependencies were previously installed via conda.
The current parallelization of DenovoSig using multiprocessing works fine. Note that it is important to set sbatch -n to be ncpu in DenovoSig, instead of sbatch -c.
When sbatch -n is set to be ncpu in DenovoSig (with -c set to 1), I checked that the running time of each job performed by each worker is almost the same as that of a serial job. So we can almost get ncpu times speedup.
When sbatch -c is set to be ncpu in DenovoSig (with -n set to 1), the running time of each job performed by each worker is about ncpu times slower than that of a serial job. As a result, we don't get any speedup.
So it is important to use sbatch -n instead of sbatch -c. That contradicts what I previously understood about sbatch, though. We need to understand this behavior better.
Note that the results stated above are the same for NMF and mvNMF.
There could be potential improvements of our parallelization scheme. It is possible that some time is spent on pickling objects. See https://thelaziestprogrammer.com/python/a-multiprocessing-pool-pickle. Although, from the result above, there does not seem to be much overhead in our current code.
Write a Cython version of the NMF and mvNMF code. That should further reduce running time.
When the reconstruction is precise (e.g., in simulations), the output of beta_divergence(beta=1) can be negative sometimes (very small values, e.g., -1e-14), due to numerical errors. It does not really matter, but is worth pointing out.
This rarely happens. But in some cases, say we have x = [0, 0, 0], and y = [0, 0, 0], stats.mannwhitneyu(x, y) will throw an error. Let's write a wrapper to take care of that. This is similar to what we've already done in differential_tail_test().
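A sketch of such a wrapper, assuming the degenerate case we care about is both samples being entirely constant and equal (as in the x = [0, 0, 0], y = [0, 0, 0] example). The function name is hypothetical; this is not MuSiCal's actual helper.

```python
import numpy as np
from scipy import stats

def mannwhitneyu_safe(x, y, **kwargs):
    """Wrapper around stats.mannwhitneyu that tolerates degenerate input.

    If every value in x and y is identical, the test statistic is
    undefined (SciPy may raise ValueError); report p = 1.0 instead,
    since there is no evidence of a difference between the samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    if np.all(pooled == pooled[0]):
        # Degenerate case: all values identical across both samples.
        return np.nan, 1.0
    try:
        res = stats.mannwhitneyu(x, y, **kwargs)
        return res[0], res[1]
    except ValueError:
        # Older SciPy versions raise here for all-identical values.
        return np.nan, 1.0
```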
n_replicates needs to be either 1, which means no replicates will be done, or >= 10. We need to specify this somewhere to let the users know.
Of course for other n_replicates values, the code will still run and generally there won't be an issue. Setting n_replicates >= 10 just makes more sense with the default thresholds in _select_n_components(). See caf9254
Hello,
I am trying to use the refitting example, however, I cannot find "musical.refit" in the package. Should I use "musical.assign" or "musical.assign_grid"?
Thank you!
Check what the definition should be when there is 1 sample in the cluster, or just 1 cluster. Modify the code in _gather_results() accordingly.
I am trying to stratify my cohort and get an error during the following step:
k, clusters, Xs, model = musical.preprocessing.stratify_samples(X.values, H.values)
This is the error I receive:
MuSiCal/MuSiCal-main/musical/cluster.py:181: RuntimeWarning: divide by zero encountered in log
self.Wk_log = np.log(self.Wk)
MuSiCal/MuSiCal-main/musical/cluster.py:195: RuntimeWarning: divide by zero encountered in log
self.Wk_log_ref_all = np.log(self.Wk_ref_all)
python37_musical/lib/python3.11/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
MuSiCal/MuSiCal-main/musical/cluster.py:206: RuntimeWarning: invalid value encountered in subtract
self.gap_statistic_log = self.Wk_log_ref - self.Wk_log
Here is the beginning of my H.values:
array([[ 13.02370177, 22.36663251, 4.58321832, 2.30586172,
6.80032626, 7.65084841, 20.57174326, 13.90007932,
29.7988945 , 0.87010952, 11.40108359, 37.65595118,
3.78449284, 9.85849575, 8.67073113, 4.58633285],
And the beginning of X.values:
array([[8, 3, 3, ..., 0, 0, 1],
This is WXS data by the way. Thanks for your help!
Hi!
Thank you for developing this tool! I am currently using it for signature refitting and decomposition, but I'd like to also take advantage of the signature discovery module. However, I've been having some issues with runtime. When trying to extract signatures from a matrix with ~6000 samples, a 24h job only gets through 1 signature, which is very slow compared to other tools, and I'm wondering if I'm doing something wrong when assigning resources to the job.
These are the slurm options used (from the documentation I understand that the ncpu value should match -n, not -c):
#SBATCH -J musical
#SBATCH -n 20
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem 20G
conda activate python37_musical
python3 Musical.py
And this is the MuSiCal script:
model = musical.DenovoSig(X,
min_n_components=1, # Minimum number of signatures to test
max_n_components=10, # Maximum number of signatures to test
init='random', # Initialization method
method='mvnmf', # mvnmf or nmf
n_replicates=20, # Number of mvnmf/nmf replicates to run per n_components
ncpu=20, # Number of CPUs to use
max_iter=100000, # Maximum number of iterations for each mvnmf/nmf run
bootstrap=True, # Whether or not to bootstrap X for each run
tol=1e-8, # Tolerance for claiming convergence of mvnmf/nmf
verbose=1, # Verbosity of output
normalize_X=False # Whether or not to L1 normalize each sample in X before mvnmf/nmf
)
model.fit()
And the musical logs:
Extracting signatures for n_components = 1..................
Selected lambda_tilde = 2. This lambda_tilde will be used for all subsequent mvNMF runs.
/home/mmunteanu/.conda/envs/python37_musical/lib/python3.7/site-packages/musical/mvnmf.py:509: UserWarning: No p-value is smaller than or equal to 0.05. The largest lambda_tilde is selected. Enlarge the search grid of lambda_tilde.
UserWarning)
n_components = 1, replicate 14 finished.
n_components = 1, replicate 10 finished.
n_components = 1, replicate 9 finished.
n_components = 1, replicate 18 finished.
n_components = 1, replicate 2 finished.
n_components = 1, replicate 1 finished.
n_components = 1, replicate 19 finished.
n_components = 1, replicate 7 finished.
n_components = 1, replicate 12 finished.
n_components = 1, replicate 15 finished.
n_components = 1, replicate 8 finished.
n_components = 1, replicate 6 finished.
n_components = 1, replicate 13 finished.
n_components = 1, replicate 11 finished.
n_components = 1, replicate 17 finished.
n_components = 1, replicate 3 finished.
n_components = 1, replicate 16 finished.
n_components = 1, replicate 4 finished.
n_components = 1, replicate 5 finished.
Time elapsed: 2.79e+04 seconds.
Extracting signatures for n_components = 2..................
Selected lambda_tilde = 0.1. This lambda_tilde will be used for all subsequent mvNMF runs.
Many thanks,
Maia
When the selected lambda_tilde in MVNMF is at the edge of the grid, we should give a warning suggesting that the grid be enlarged.
Currently, the output W of NMF.fit() is not normalized. It's better to normalize W at the end of fitting. I think the output W of MVNMF.fit() is always normalized, but we need to double check.
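The normalization itself is straightforward: rescale each signature (column of W) to sum to 1 and fold the scale factors into H so that the product W @ H is unchanged. A minimal sketch, with illustrative names:

```python
import numpy as np

def normalize_WH(W, H):
    """L1-normalize each column of W, compensating in H.

    Each column of the returned W sums to 1, and W @ H is preserved
    exactly, so exposures in H absorb the signature scales.
    """
    scale = W.sum(axis=0)          # per-signature column sums
    return W / scale, H * scale[:, None]
```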
We perform NNLS frequently in the code, e.g., at the end of NMF and mvNMF, and after _gather_results(). But we still use KL divergence (i.e., beta divergence with beta = 1) for selection of lambda_tilde in mvNMF, selection of n_components, and filtering of bad solutions in _gather_results(). That is a bit inconsistent.
We can consider switching to beta divergence with beta = 2 once we have done NNLS.
In fact, it is never consistent because we do NMF/mvNMF with KL divergence and NNLS for inferring exposures. So no matter which norm we choose in the downstream selections, there will be inconsistencies.
I ran into this issue when I was looking at sample-wise errors. Sometimes the reconstructed spectrum looks very similar to the sample spectrum by eye, but the KL divergence for that sample is relatively large. I think this is because NNLS optimizes the Frobenius norm. If we used a KL-divergence version of NNLS, that problem should disappear.
So, maybe instead of doing NNLS, we can do a KL-divergence version of NNLS.
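A KL-divergence version of NNLS can be obtained by fixing W and iterating only the H-update of the standard KL-NMF multiplicative algorithm (Lee & Seung). The sketch below is one possible implementation under that assumption, not MuSiCal's code; the function name is illustrative.

```python
import numpy as np

def nnls_kl(W, x, n_iter=500, eps=1e-12):
    """Solve min_h KL(x || W h) subject to h >= 0.

    Uses the fixed-W multiplicative update from KL-NMF: each iteration
    monotonically decreases the KL divergence, and non-negativity is
    preserved automatically because updates are multiplicative.
    """
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    h = np.full(W.shape[1], x.sum() / W.shape[1])  # uniform initialization
    w_colsum = W.sum(axis=0)
    for _ in range(n_iter):
        recon = W @ h + eps                 # current reconstruction
        h *= (W.T @ (x / recon)) / (w_colsum + eps)
    return h
```

For exposure inference this would replace scipy.optimize.nnls, trading the exact Frobenius solution for a KL-consistent one, which matches the objective used by the NMF/mvNMF extraction itself.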
Keep track of sample names during calculations.
See title.
Hello! Thank you for sharing MuSiCal. I have been able to extract de novo signatures which was really exciting, but I am now having trouble with the part of the example full pipeline where Parameter optimization with in silico validation is meant to take place. I would love your help so that I can match to the right number of signatures. Thank you for your time.
After matching and refitting, we can look at the reconstruction errors per sample. And for those samples with large reconstruction errors, we can look at their error spectrum. That spectrum informs which signatures are not included in the refitting and should have been. We can go on adding those signatures. This mechanism basically allows more signatures to be considered in the refitting process (e.g., signatures that are not discovered in this tumor type). (Suggestion from Doga)
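The per-sample diagnostic described above can be sketched as follows: compute the residual between each sample's observed and reconstructed spectrum, then inspect the residuals of the worst-fit samples for the shape of a missing signature. Names here are illustrative.

```python
import numpy as np

def error_spectra(X, W, H):
    """Per-sample reconstruction errors and residual (error) spectra.

    Columns of `residual` are the error spectra; a structured residual
    in a poorly fit sample suggests a signature missing from the refit.
    """
    X = np.asarray(X, dtype=float)
    recon = W @ H
    residual = X - recon                        # error spectrum per sample
    errors = np.linalg.norm(residual, axis=0)   # per-sample error magnitude
    return errors, residual
```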
Update to COSMIC v3.2.