
alexandrovlab / sigprofilerextractor


SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

License: BSD 2-Clause "Simplified" License

Python 100.00%
bioinformatics somatic-variants cancer-genomics mutational-signatures mutation-analysis

sigprofilerextractor's Introduction


SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting. Detailed documentation can be found at: https://osf.io/t6j7u/wiki/home/


Installation

To install the current version of this GitHub repo, git clone the repo or download the zip file. Unzip the contents of SigProfilerExtractor-master.zip or the zip file of the corresponding branch.

In the command line, please run the following:

$ cd SigProfilerExtractor-master
$ pip install .

For the most recent stable PyPI version of this tool, run the following in the command line:

$ pip install SigProfilerExtractor

Install your desired reference genome from the command line/terminal as follows (available reference genomes are: GRCh37, GRCh38, mm9, and mm10):

$ python
from SigProfilerMatrixGenerator import install as genInstall
genInstall.install('GRCh37')

This will install the human 37 assembly as a reference genome. You may install as many genomes as you wish.

Next, open a python interpreter and import the SigProfilerExtractor module. Please see the examples of the functions.

Functions

The available functions are:

  • importdata
  • sigProfilerExtractor
  • estimate_solution
  • decompose

And an additional script:

  • plotActivity.py

importdata

Returns the path of the example data.

importdata(datatype="matrix")

importdata Example

from SigProfilerExtractor import sigpro as sig
path_to_example_table = sig.importdata("matrix")
data = path_to_example_table 
# This "data" variable can be used as a parameter of the "project" argument of the sigProfilerExtractor function.

# To get help on the parameters and outputs of the "importdata" function, please use the following:
help(sig.importdata)

sigProfilerExtractor

Extracts mutational signatures from an array of samples.

sigProfilerExtractor(input_type, output, input_data, reference_genome="GRCh37", opportunity_genome = "GRCh37", context_type = "default", exome = False, 
                         minimum_signatures=1, maximum_signatures=10, nmf_replicates=100, resample = True, batch_size=1, cpu=-1, gpu=False, 
                         nmf_init="random", precision= "single", matrix_normalization= "gmm", seeds= "random", 
                         min_nmf_iterations= 10000, max_nmf_iterations=1000000, nmf_test_conv= 10000, nmf_tolerance= 1e-15, get_all_signature_matrices= False)
Category Parameter Variable Type Parameter Description
Input Data
input_type String The type of input: "vcf", "bedpe", "matrix", or "seg:TYPE".
output String The name of the output folder. The output folder will be generated in the current working directory.
input_data String Path to the input folder for input_type "vcf" or "bedpe"; path to the input file for input_type "matrix" or "seg:TYPE".
reference_genome String The name of the reference genome. The default reference genome is "GRCh37". This parameter is applicable only if the input_type is "vcf".
opportunity_genome String The build or version of the reference genome for the reference signatures. The default opportunity genome is GRCh37. If the input_type is "vcf", the opportunity_genome automatically matches the input reference genome value. Only the genomes available in COSMIC are supported (GRCh37, GRCh38, mm9, mm10 and rn6). If a different opportunity genome is selected, the default genome GRCh37 will be used.
context_type String A string of mutation context names separated by commas (","). The items in the list define the mutational contexts to be considered when extracting the signatures. The default value is "96,DINUC,ID", where "96" is the SBS96 context, "DINUC" is the DINUCLEOTIDE context and "ID" is the INDEL context.
exome Boolean Defines if the exomes will be extracted. The default value is "False".
NMF Replicates
minimum_signatures Positive Integer The minimum number of signatures to be extracted. The default value is 1.
maximum_signatures Positive Integer The maximum number of signatures to be extracted. The default value is 25.
nmf_replicates Positive Integer The number of iterations to be performed to extract each number of signatures. The default value is 100.
resample Boolean Default is True. If True, add Poisson noise to samples by resampling.
seeds String Can be used to get reproducible resamples for the NMF replicates. The path of a tab-separated .txt file containing the replicate IDs and preset seeds in a two-column dataframe can be passed through this parameter. The Seeds.txt file in the results folder from a previous analysis can be used for the seeds parameter in a new analysis. The default value for this parameter is "random". When "random", the seeds for resampling will be random for different analyses.
NMF Engines
matrix_normalization String Method of normalizing the genome matrix before it is analyzed by NMF. Default value is "gmm". Other options are "log2", "custom" or "none".
nmf_init String The initialization algorithm for W and H matrix of NMF. Options are 'random', 'nndsvd', 'nndsvda', 'nndsvdar' and 'nndsvd_min'. Default is 'random'.
precision String Values should be single or double. Default is single.
min_nmf_iterations Integer Value defines the minimum number of iterations to be completed before NMF converges. Default is 10000.
max_nmf_iterations Integer Value defines the maximum number of iterations to be completed before NMF converges. Default is 1000000.
nmf_test_conv Integer Value defines the number of iterations to be done between convergence checks. Default is 10000.
nmf_tolerance Float Value defines the tolerance to achieve to converge. Default is 1e-15.
Execution
cpu Integer The number of processors to be used to extract the signatures. The default value is -1 which will use all available processors.
gpu Boolean Defines if the GPU resource will be used if available. Default is False. If True, the GPU resources will be used in the computation. Note: All available CPU processors are used by default, which may cause a memory error. This error can be resolved by reducing the number of CPU processes through the cpu parameter.
batch_size Integer Will be effective only if the GPU is used. Defines the number of NMF replicates to be performed by each CPU during the parallel processing. Default is 1.
Solution Estimation Thresholds
stability Float Default is 0.8. The cutoff threshold of the average stability. Solutions with average stabilities below this threshold will not be considered.
min_stability Float Default is 0.2. The cutoff threshold of the minimum stability. Solutions with minimum stabilities below this threshold will not be considered.
combined_stability Float Default is 1.0. The cutoff threshold of the combined stability (sum of average and minimum stability). Solutions with combined stabilities below this threshold will not be considered.
allow_stability_drop Boolean Default is False. Defines if solutions with a drop in stability with respect to the highest stable number of signatures will be considered.
Decomposition
cosmic_version Float Takes a positive float among 1, 2, 3, 3.1, 3.2, 3.3, and 3.4. Default is 3.4. Defines the version of the COSMIC reference signatures.
make_decomposition_plots Boolean Default is True. If True, de novo to COSMIC signature decomposition plots will be created as a part of the results.
collapse_to_SBS96 Boolean Default is True. If True, SBS288 and SBS1536 de novo signatures will be mapped to SBS96 reference signatures. If False, those will be mapped to reference signatures of the same context.
Others
get_all_signature_matrices Boolean If True, the Ws and Hs from all the NMF iterations are generated in the output.
export_probabilities Boolean Default is True. If False, the probability matrix is not created.
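
The four NMF engine parameters (min_nmf_iterations, max_nmf_iterations, nmf_test_conv and nmf_tolerance) interact in a way that a small loop can illustrate. The following is only a sketch of one plausible reading of the descriptions above, not the SigProfilerExtractor implementation: run_until_converged, step and loss are hypothetical names, the change in loss is compared against an absolute tolerance by assumption, and a toy Newton iteration stands in for the NMF update.

```python
import math


def run_until_converged(step, loss, state, min_iterations, max_iterations,
                        test_conv, tolerance):
    """Iterate `step`, testing convergence every `test_conv` iterations.

    Convergence is only accepted once `min_iterations` have been completed,
    and the loop never runs longer than `max_iterations`.
    """
    previous = loss(state)
    for iteration in range(1, max_iterations + 1):
        state = step(state)
        if iteration % test_conv == 0:
            current = loss(state)
            # Assumption: converged when the loss changed by at most `tolerance`.
            converged = abs(previous - current) <= tolerance
            previous = current
            if converged and iteration >= min_iterations:
                return state, iteration
    return state, max_iterations


# Toy stand-in for an NMF update: Newton's iteration for sqrt(2).
root, iterations_used = run_until_converged(
    step=lambda x: 0.5 * (x + 2.0 / x),
    loss=lambda x: (x * x - 2.0) ** 2,
    state=1.0,
    min_iterations=10,
    max_iterations=1000,
    test_conv=5,
    tolerance=1e-15,
)
print(root, iterations_used)
```

In this toy run the check at iteration 5 still sees a large change in loss, so the loop stops at the next check, which is also the first one allowed by min_iterations.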

sigProfilerExtractor Example

VCF Files as Input

from SigProfilerExtractor import sigpro as sig
def main_function():
    # to get input from vcf files
    path_to_example_folder_containing_vcf_files = sig.importdata("vcf")
    # you can put the path to your folder containing the vcf samples
    data = path_to_example_folder_containing_vcf_files
    sig.sigProfilerExtractor("vcf", "example_output", data, minimum_signatures=1, maximum_signatures=3)
if __name__ == "__main__":
    main_function()
# Wait until the execution is finished. The process may take a couple of hours depending on the size of the data.
# Check the current working directory for the "example_output" folder.

Matrix File as Input

from SigProfilerExtractor import sigpro as sig
def main_function():    
   # to get input from table format (mutation catalog matrix)
   path_to_example_table = sig.importdata("matrix")
   data = path_to_example_table # you can put the path to your tab delimited file containing the mutational catalog matrix/table
   sig.sigProfilerExtractor("matrix", "example_output", data, opportunity_genome="GRCh38", minimum_signatures=1, maximum_signatures=3)
if __name__=="__main__":
   main_function()
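
Before passing your own tab-delimited catalogue to the extractor, it can help to check that it parses as expected. The sketch below uses only the Python standard library and does not call SigProfilerExtractor. The read_catalogue helper, the "Mutation Types" header, the sample names and the counts are all assumptions made for illustration; the layout (mutation types as rows, samples as columns) is an assumption as well.

```python
import csv
import io

# Illustrative catalogue: rows are mutation types, columns are samples.
# Header text, sample names and counts are made up for this sketch.
CATALOGUE_TEXT = (
    "Mutation Types\tSample_1\tSample_2\n"
    "A[C>A]A\t409\t387\n"
    "A[C>A]C\t204\t191\n"
    "A[C>A]G\t31\t26\n"
)


def read_catalogue(handle):
    """Return (sample_names, {mutation_type: [count per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"row {row[0]!r} has {len(row) - 1} counts, "
                f"expected {len(samples)}"
            )
        counts[row[0]] = [int(value) for value in row[1:]]
    return samples, counts


samples, counts = read_catalogue(io.StringIO(CATALOGUE_TEXT))
# Total number of mutations per sample (column sums).
totals = [sum(column) for column in zip(*counts.values())]
print(dict(zip(samples, totals)))
```

For a file on disk, the same helper can be given an open file handle, for example `open(path, newline="")`.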

sigProfilerExtractor Output

To learn about the output, please visit https://osf.io/t6j7u/wiki/home/

Estimation of the Optimum Solution

Estimates the optimum solution (rank) among solutions with different numbers of signatures (ranks).

estimate_solution(base_csvfile="All_solutions_stat.csv", 
          All_solution="All_Solutions", 
          genomes="Samples.txt", 
          output="results", 
          title="Selection_Plot",
          stability=0.8, 
          min_stability=0.2, 
          combined_stability=1.0,
          allow_stability_drop=False,
          exome=False)
Parameter Variable Type Parameter Description
base_csvfile String Default is "All_solutions_stat.csv". Path to a csv file that contains the statistics of all solutions.
All_solution String Default is "All_Solutions". Path to a folder that contains the results of all solutions.
genomes String Default is "Samples.txt". Path to a tab-delimited file that contains the mutation counts for all genomes given to different mutation types.
output String Default is "results". Path to the output folder.
title String Default is "Selection_Plot". This sets the title of the selection_plot.pdf
stability Float Default is 0.8. The cutoff threshold of the average stability. Solutions with average stabilities below this threshold will not be considered.
min_stability Float Default is 0.2. The cutoff threshold of the minimum stability. Solutions with minimum stabilities below this threshold will not be considered.
combined_stability Float Default is 1.0. The cutoff threshold of the combined stability (sum of average and minimum stability). Solutions with combined stabilities below this threshold will not be considered.
allow_stability_drop Boolean Default is False. Defines if solutions with a drop in stability with respect to the highest stable number of signatures will be considered.
exome Boolean Default is "False". Defines if exome samples are used.
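
The three stability cutoffs can be read as a simple filter over candidate solutions. The sketch below shows only that filter, under the assumption that a solution passes when each value is at or above its cutoff. It is not the full selection procedure of estimate_solution, which the table describes only in part; passes_stability_thresholds and the candidate values are hypothetical.

```python
def passes_stability_thresholds(avg_stability, lowest_stability,
                                stability=0.8, min_stability=0.2,
                                combined_stability=1.0):
    """Return True if a solution clears all three stability cutoffs."""
    combined = avg_stability + lowest_stability
    return (avg_stability >= stability
            and lowest_stability >= min_stability
            and combined >= combined_stability)


# Hypothetical (average stability, minimum stability) per number of signatures.
candidates = {
    1: (1.00, 1.00),
    2: (0.95, 0.90),
    3: (0.85, 0.30),
    4: (0.81, 0.15),   # fails the minimum stability cutoff
    5: (0.15, -0.05),  # fails every cutoff
}

considered = [rank for rank, (avg, lowest) in candidates.items()
              if passes_stability_thresholds(avg, lowest)]
print(considered)
```

Only the ranks left in `considered` would remain candidates for the optimum solution under these cutoffs.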

Estimation of the Optimum Solution Example

from SigProfilerExtractor import estimate_best_solution as ebs
ebs.estimate_solution(base_csvfile="All_solutions_stat.csv", 
          All_solution="All_Solutions", 
          genomes="Samples.txt", 
          output="results", 
          title="Selection_Plot",
          stability=0.8, 
          min_stability=0.2, 
          combined_stability=1.0,
          allow_stability_drop=False,
          exome=False)

Estimation of the Optimum Solution Output

The files below will be generated in the output folder:

File Name Description
All_solutions_stat.csv A csv file that contains the statistics of all solutions.
selection_plot.pdf A plot that depicts the Stability and Mean Sample Cosine Distance for different solutions.

Decompose

For decomposition of de novo signatures, please use SigProfilerAssignment.

Activity Stacked Bar Plot

Generates a stacked bar plot showing signature activities in individual samples.

plotActivity(activity_file, output_file = "Activity_in_samples.pdf", bin_size = 50, log = False)
Parameter Variable Type Parameter Description
activity_file String The standard output activity file showing the number of, or percentage of mutations attributed to each sample. The row names should be samples while the column names should be signatures.
output_file String The path and full name of the output pdf file, including ".pdf"
bin_size Integer Number of samples plotted per page, recommended: 50

Activity Stacked Bar Plot Example

$ python plotActivity.py 50 sig_attribution_sample.txt test_out.pdf
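
The bin_size parameter controls how many samples appear on each page of the PDF. How plotActivity.py splits the samples internally is not documented here, so the following is only a sketch of the idea, with a hypothetical paginate_samples helper and made-up sample names.

```python
def paginate_samples(sample_names, bin_size=50):
    """Split sample names into consecutive pages of at most bin_size samples."""
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    return [sample_names[start:start + bin_size]
            for start in range(0, len(sample_names), bin_size)]


sample_names = [f"Sample_{index}" for index in range(1, 121)]
pages = paginate_samples(sample_names, bin_size=50)
print([len(page) for page in pages])
```

With 120 samples and the recommended bin_size of 50, this gives two full pages and a third page with the remaining 20 samples.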

Video Tutorials

Take a look at our video tutorials for step-by-step instructions on how to install and run SigProfilerExtractor on Amazon Web Services.

Tutorial #1: Installing SigProfilerExtractor on Amazon Web Services


Tutorial #2: Running the Quick Start Example Program


Tutorial #3: Reviewing the output from SigProfilerExtractor


GPU support

If CUDA out of memory exceptions occur, it will be necessary to reduce the number of CPU processes used (the cpu parameter).
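
One way to choose a value for the cpu parameter is to start from the number of processors the machine reports and cap it. The helper below is a hypothetical sketch, not part of SigProfilerExtractor: resolve_cpu_count and its `available` argument are assumptions, and treating -1 as "all available processors" follows the parameter description above.

```python
import os


def resolve_cpu_count(cpu=-1, available=None):
    """Translate a requested cpu value into a concrete process count."""
    if available is None:
        available = os.cpu_count() or 1
    if cpu == -1:
        return available  # -1 means "use everything"
    if cpu < 1:
        raise ValueError("cpu must be -1 or a positive integer")
    # Assumption: never request more processes than the machine has.
    return min(cpu, available)


print(resolve_cpu_count(-1, available=8))
print(resolve_cpu_count(2, available=8))
```

If memory errors appear with the default, passing a smaller explicit value (for example half of the available count) as cpu is the adjustment the note above describes.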

For more information, help, and examples, please visit: https://osf.io/t6j7u/wiki/home/

Citation

Islam SMA, Díaz-Gay M, Wu Y, Barnes M, Vangara R, Bergstrom EN, He Y, Vella M, Wang J, Teague JW, Clapham P, Moody S, Senkin S, Li YR, Riva L, Zhang T, Gruber AJ, Steele CD, Otlu B, Khandekar A, Abbasi A, Humphreys L, Syulyukina N, Brady SW, Alexandrov BS, Pillay N, Zhang J, Adams DJ, Martincorena I, Wedge DC, Landi MT, Brennan P, Stratton MR, Rozen SG, and Alexandrov LB (2022) Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genomics. doi: 10.1016/j.xgen.2022.100179.

Copyright

This software and its documentation are copyright 2018 as a part of the sigProfiler project. The SigProfilerExtractor framework is free software and is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Contact Information

Please address any queries or bug reports to Mark Barnes at [email protected]

sigprofilerextractor's People

Contributors

david-a-parry, heyudou, lalexandrov1018, marcos-diazg, mdbarnesucsd, mishugeb, rvangara, vellamike


sigprofilerextractor's Issues

cosine similarity cut-off in decomposition function

Hi Mishu,

Ludmil suggested using the decomposition function with a cosine similarity cut-off of 0.9 to determine whether a signature is novel or matches one of the COSMIC signatures. However, I cannot find an argument for specifying the cosine similarity cut-off in the decomposition function.

If I missed it, could you please point out which argument is the one I am looking for?

Thank you & stay safe!
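
For reference, the cosine similarity discussed in this issue can be computed directly from two signature vectors. The sketch below is not part of the SigProfilerExtractor API and does not answer which argument, if any, exposes the cut-off. The cosine_similarity and is_novel helpers and the toy three-channel vectors are assumptions made for illustration; the 0.9 cut-off is the one mentioned in the question.

```python
import math


def cosine_similarity(first, second):
    """Cosine similarity between two equally long, non-zero vectors."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(first, second))
    norm = (math.sqrt(sum(a * a for a in first))
            * math.sqrt(sum(b * b for b in second)))
    if norm == 0:
        raise ValueError("vectors must not be all zeros")
    return dot / norm


def is_novel(similarity, cutoff=0.9):
    """Treat a signature as novel if its best match is below the cut-off."""
    return similarity < cutoff


# Toy three-channel profiles; real SBS96 signatures have 96 channels.
de_novo = [0.50, 0.30, 0.20]
close_reference = [0.45, 0.35, 0.20]
distant_reference = [0.00, 0.10, 0.90]

for reference in (close_reference, distant_reference):
    similarity = cosine_similarity(de_novo, reference)
    print(round(similarity, 3), is_novel(similarity))
```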

python3: symbol lookup error: python3: undefined symbol: XML_SetHashSalt

Hi,

I ran sigProfilerExtractor on my workstation running Ubuntu 20.04 and keep getting the "python3: symbol lookup error: python3: undefined symbol: XML_SetHashSalt" error.

My python3 is 3.8.5, and I installed sigProfilerExtractor using the command "python3 -m pip install sigProfilerExtractor". The package versions are:

SigProfilerExtractor 1.1.0
SigProfilerMatrixGenerator 1.1.23
sigProfilerPlotting 1.1.9

The following is an example:

from SigProfilerExtractor import sigpro as sig
sig.sigProfilerExtractor("matrix", "/MS_S1/SigProfilerExtractorMatrixSingleThread/", "/MS_S1/output/SBS/MS_S1.SBS96.all", reference_genome="GRCh38", minimum_signatures=1, maximum_signatures=10, nmf_replicates=100, cpu=1)
************** Reported Current Memory Use: 0.25 GB *****************

Extracting signature 1 for mutation type 96
The matrix normalizig cutoff is 10089

process 1 continues please wait...
execution time: 3 seconds
...
...

process 5 continues please wait...
execution time: 8 seconds

Time taken to collect 100 iterations for 5 signatures is 539.92 seconds
Optimization time is 1.876739740371704 seconds
The reconstruction error is 0.0133, average process stability is 0.15 and
the minimum process stability is -0.05 for 5 signatures

python3: symbol lookup error: python3: undefined symbol: XML_SetHashSalt
yc790@T7920:~$ /usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
^C

I also tried using different numbers of CPUs and got the same error.

Any suggestions?

Thanks,

Ying

dendrogram missing

To whom it may concern,

I installed the latest SigProfilerExtractor using pip and I was able to successfully run the python script. I, however, don't find the dendrogram.pdf in the De_Novo_Solution directory and I was wondering what parameters have to be provided to generate the dendrogram.pdf.

Many thanks,
Sangjin

the output is not generated in "estimate_solution" function

Hi,
I'm using SigProfilerExtractor and I want to use the "estimate_solution" function to find the optimum rank. The first parameter of this function, "base_csvfile", has the default string value "All_solutions_stat.csv", which is a CSV file that should be generated in the output folder. But when running the example (below) with all the default values, I get a "no such file or directory" error for this parameter.

Example:
>>>from SigProfilerExtractor import estimate_best_solution as ebs
>>>ebs.estimate_solution(base_csvfile="All_solutions_stat.csv",
All_solution="All_Solutions",
genomes="Samples.txt",
output="results",
title="Selection_Plot",
stability=0.8,
min_stability=0.2,
combined_stability=1.25)

What is wrong here? Does this file need to already exist at that path, as an input?
As far as I understand, this file is supposed to be generated when the function runs, isn't it?
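
The parameter table for estimate_solution describes base_csvfile, All_solution and genomes as paths to existing results, which suggests they are read as inputs. Under that assumption, a quick check of which paths are missing can narrow down a "no such file or directory" error before the function is called. The missing_inputs helper and its root argument are hypothetical and not part of SigProfilerExtractor.

```python
import tempfile
from pathlib import Path


def missing_inputs(base_csvfile="All_solutions_stat.csv",
                   All_solution="All_Solutions",
                   genomes="Samples.txt",
                   root="."):
    """Return the argument names whose paths do not exist under root."""
    candidates = {
        "base_csvfile": base_csvfile,
        "All_solution": All_solution,
        "genomes": genomes,
    }
    return [name for name, path in candidates.items()
            if not (Path(root) / path).exists()]


# Demonstration in a throwaway directory that holds only the CSV file.
with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "All_solutions_stat.csv").write_text("placeholder\n")
    missing = missing_inputs(root=workdir)

print(missing)
```

Running the check from the directory where estimate_solution will be called (root=".") shows whether the relative default paths resolve there.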

Two issues!

Hello,
Thank you for providing such a good tool! I now have two issues and need your help. Any input would be very much appreciated!

  1. The first issue is the error message:

sig.sigProfilerExtractor("vcf", "output_SNV_VCF", "SNV_vcf", reference_genome="mm10", opportunity_genome="mm10", minimum_signatures=1, maximum_signatures=10)

************** Reported Current Memory Use: 0.29 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 3.76 seconds.
Matrices generated for 16 samples with 0 errors. Total of 30452 SNVs, 0 DINUCs, and 0 INDELs were successfully analyzed.
Traceback (most recent call last):
File "", line 1, in
File "/home/xrao/.local/lib/python3.6/site-packages/SigProfilerExtractor/sigpro.py", line 647, in sigProfilerExtractor
excecution_parameters["context_type"]=",".join(mtypes)
UnboundLocalError: local variable 'mtypes' referenced before assignment


Then I tried this without success:
def sigProfilerExtractor():
global mtypes

  2. The following files are not contained in the output folders:
    Cluster_of_Samples.txt
    dendogram.pdf

I am using Python Version: 3.6.4 and Sigproextractor Version: 1.0.17

Many thanks in advance!
XR

Sigprofiler reference signatures: difference between genome and exome

Hello Developers,

I have been using SigProfiler to extract signatures from my WES data and to decompose each sample with COSMIC v3 signatures. At the same time, I am also using MutationalPatterns for fitting. Finally, I'd like to compare the results from these two tools.

I see both genomes and exomes in the signature reference folder: https://www.synapse.org/#!Synapse:syn12009743. Currently, I am using the exome signatures for MutationalPatterns, but which reference is actually being used by SigProfiler decomposition?

Kind regards,
Cedrick

SigProfilerExtractor with BED-File

Hi,

how do I use the SigProfilerExtractor with a BED-file for targeted sequencing?
I used SigProfilerMatrixGenerator with a BED-File for generating the matrix, but I am not able to provide the BED file to SigProfilerExtractor.

Many thanks!

Best,
Raphael

Using mutational catalogues as input

Hi

I am using a mutational catalog like this


Substitution	1_Sample	2_Sample	3_Sample	4_Sample
A[C>A]A	409	387	432	438
A[C>A]C	204	191	233	214
A[C>A]G	31	26	30	34

I typed

>>> os.getcwd()
'/Users/fi1d18/Downloads/test'
>>> data=os.getcwd()
>>> sig.sigProfilerExtractor("table", "example_output", data)

************** Reported Current Memory Use: 0.09 GB *****************
>>> 

But the process never finishes.
Can you please tell me what I am doing wrong here?
Thanks a lot
data.txt

Install issue

When I run "pip install SigProfilerExtractor", it always reports "ERROR: No matching distribution found for torch>=1.1.0 (from SigProfilerExtractor)". I do not know how to solve this. Can anyone tell me?

How to extract?

When I ran this, I got an error and I do not know why. Can anyone help me?

In [21]: a = sig.importdata("vcf")

In [22]: sig.sigProfilerExtractor("vcf", "example_output", "PD3851a", minimum_signatures=1, maximum_signatures=3)

************** Reported Current Memory Use: 0.14 GB *****************


IndexError Traceback (most recent call last)
in
----> 1 sig.sigProfilerExtractor("vcf", "example_output", "PD3851a", minimum_signatures=1, maximum_signatures=3)

C:\Anaconda3\lib\site-packages\SigProfilerExtractor\sigpro.py in sigProfilerExtractor(input_type, output, input_data, reference_genome, opportunity_genome, context_type, exome, minimum_signatures, maximum_signatures, nmf_replicates, resample, batch_size, cpu, gpu, nmf_init, precision, matrix_normalization, seeds, min_nmf_iterations, max_nmf_iterations, nmf_test_conv, nmf_tolerance, nnls_penalty, get_all_signature_matrices)
513
514 #project_name = project.split("/")[-1]
--> 515 data = datadump.SigProfilerMatrixGeneratorFunc(project_name, refgen, project, exome=exome, bed_file=None, chrom_based=False, plot=False, gs=False)
516
517

C:\Anaconda3\lib\site-packages\SigProfilerMatrixGenerator\scripts\SigProfilerMatrixGeneratorFunc.py in SigProfilerMatrixGeneratorFunc(project, genome, vcfFiles, exome, bed_file, chrom_based, plot, tsb_stat, seqInfo, cushion, gs)
340
341 # Creates a temporary folder for sorting and generating the matrices
--> 342 file_name = vcf_files[0].split(".")
343 file_extension = file_name[-1]
344 unique_folder = project + "_"+ str(uuid.uuid4())

IndexError: list index out of range
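
The traceback ends at vcf_files[0], which suggests that the list of VCF files found in the given folder was empty. A small check before calling the extractor can surface that with a clearer message. This is a sketch under assumptions: list_vcf_files is a hypothetical helper, not part of SigProfilerExtractor or SigProfilerMatrixGenerator, and matching only on the ".vcf" suffix is a simplification.

```python
import tempfile
from pathlib import Path


def list_vcf_files(folder):
    """Return sorted .vcf file names in folder, or raise a descriptive error."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not an existing folder")
    vcf_names = sorted(entry.name for entry in folder.iterdir()
                       if entry.is_file() and entry.suffix.lower() == ".vcf")
    if not vcf_names:
        raise FileNotFoundError(f"no .vcf files found in {folder}")
    return vcf_names


# Demonstration with a throwaway folder containing one VCF and one other file.
with tempfile.TemporaryDirectory() as workdir:
    folder = Path(workdir)
    (folder / "sample_a.vcf").write_text("##fileformat=VCFv4.2\n")
    (folder / "notes.txt").write_text("not a variant file\n")
    found = list_vcf_files(folder)

print(found)
```

Passing the folder returned by sig.importdata("vcf"), or the full path to your own folder of VCF files, as input_data would be the natural thing to verify with such a check.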

How to install reference genome

I am new to bioinformatics and I have run into a problem. When I install the reference genome, I get the error below. I do not know how to deal with it. Can anyone tell me? Thanks very much!
In [3]: genInstall.install('GRCh37')
Beginning installation. This may take up to 40 minutes to complete.
tar: Error opening archive: Failed to open 'C:\Anaconda3\lib\site-packages\SigProfilerMatrixGenerator/references/chromosomes/tsb/GRCh37.tar.gz'
The ensembl ftp site is not currently responding.
An exception has occurred, use %tb to see the full traceback.
SystemExit

extractor error with gpu=True

Hi,

I was trying to extract signatures with gpu=True. However, I get the following error.

/usr/local/lib/python3.6/dist-packages/sigproextractor/sigpro.py in sigProfilerExtractor(input_type, out_put, input_data, refgen, genome_build, startProcess, endProcess, totalIterations, init, cpu, hierarchy, mtype, exome, par_h, penalty, resample, wall, gpu)
    569                                                 init = init,
    570                                                 normalization_cutoff=normalization_cutoff,
--> 571                                                 gpu=gpu,)
    572 
    573 

/usr/local/lib/python3.6/dist-packages/sigproextractor/subroutines.py in decipher_signatures(genomes, i, totalIterations, cpu, mut_context, resample, seeds, init, normalization_cutoff, gpu)
    982     ############################################################# The parallel processing takes place here #######################################################################
    983     ##############################################################################################################################################################################
--> 984     results = parallel_runs(genomes=genomes, totalProcesses=totalProcesses, iterations=totalIterations,  n_cpu=cpu, verbose = False, resample=resample, seeds = seeds, init=init, normalization_cutoff=normalization_cutoff, gpu=gpu)
    985 
    986     toc = time.time()

/usr/local/lib/python3.6/dist-packages/sigproextractor/subroutines.py in parallel_runs(genomes, totalProcesses, iterations, n_cpu, verbose, resample, seeds, init, normalization_cutoff, gpu)
    459     #print(seeds)
    460     pool_nmf=partial(pnmf, genomes=genomes, totalProcesses=totalProcesses, resample=resample, init=init, normalization_cutoff=normalization_cutoff, gpu=gpu)
--> 461     result_list = pool.map(pool_nmf, seeds)
    462     pool.close()
    463     pool.join()

/usr/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267 
    268     def starmap(self, func, iterable, chunksize=None):

/usr/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

TypeError: nnmf_gpu() got an unexpected keyword argument 'init'

Any help in solving the errors would be highly appreciated.

Thank you.

sigProfilerExtractor gpu solution can't work

Hello,

I followed the example to execute sigProfilerExtractor with gpu=True, but the program was interrupted after matrix generation, and the output folder contains only job_metadata.txt and SBS96/All_solutions_stat.csv.

Below are the command and the result.
sig.sigProfilerExtractor("vcf", "testgpu_output", data, startProcess=1, endProcess=3, totalIterations=1000, gpu=True)

************** Reported Current Memory Use: 0.15 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 7.81 seconds.
Matrices generated for 15 samples with 0 errors. Total of 93001 SNVs, 505 DINUCs, and 0 INDELs were successfully analyzed.
Extracting signature 1 for mutation type 96


What should I do to resolve this issue?

System information is below.

OS: Centos 7
PyTorch Version: 1.2
Python Version: 3.6.8
CUDA/cuDNN Version: 9.2/7.6
GPU Model: NVIDIA Tesla K80

Thanks,
Bruce

Can not have some of output from sigProfilerExtractor

Dear developer of SigProfilerExtractor

Thank you for developing SigProfilerExtractor; it makes mutational signature analysis really convenient.

However, I could not get some of the output files that should be contained in the output directory according to the wiki (https://osf.io/t6j7u/wiki/4.%20Output%20-%20Suggested%20Solution/).

Instead of these files (Cluster_of_Samples.txt, comparison_with_global_ID_signatures.csv, Decomposed_Solution_Activities.txt, Decomposed_Solution_Samples_stats.txt, Decomposed_Solution_Signatures.txt, decomposition_logfile.txt, dendogram.pdf, Mutation_Probabilities.txt, ignature_plot[MutatutionContext]_plots_Decomposed_Solution.pdf) being produced in Decomposed_Solution, there are files (named De_Novo_map_to_COSMIC_SBS96.csv, SBS96_Decomposition_Plots.pdf) and directories (named Activities, Signature, Solution_Stats).

Here is the script I used.

from SigProfilerMatrixGenerator import install as genInstall
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
from SigProfilerExtractor import sigpro as sig

matrices = matGen.SigProfilerMatrixGeneratorFunc("[Output_File_Name]", "GRCh37", "[InputDirectory]", exome=False, bed_file=None, chrom_based=False, plot=True, tsb_stat=False, seqInfo=False)

sig.sigProfilerExtractor("text", "[OutputName]", "[InputDirectory]/output/SBS/[Output_File_Name].SBS96.all", reference_genome="GRCh37")

then I got

'''
process 14 continues please wait...
execution time: 14 seconds

process 14 continues please wait...
execution time: 13 seconds

process 14 continues please wait...
execution time: 24 seconds

Time taken to collect 100 iterations for 14 signatures is 85.88 seconds
Optimization time is 12.723468780517578 seconds
The reconstruction error is 0.0776, average process stability is 0.38 and
the minimum process stability is 0.28 for 14 signatures

Decompositon Plot made for SBS96A <----I think 'Deompositon' is typo

Your Job Is Successfully Completed! Thank You For Using SigProfiler Extractor.

''''

Can you help me to handle this issue?

Thank you.

Problem in getting the main point

Hello

After running your tool, I see the De_Novo_Solution and Decomposed_Solution folders in the Suggested_Solution subfolder of the SBS96 folder.

De_Novo_Solution says signatures A, B, C, D and E have been extracted from my data.

Decomposed_Solution shows the results in this screenshot:

Screenshot 2020-04-28 at 06 16 54

Does this mean that only signature C was matched to COSMIC signatures (COSMIC 1, 3 and 5) and that your software found no match for signatures B, D and E among the reference signatures?

How can I find out what these unannotated signatures are?

Please help me understand this.

Understanding the output

Hello Developers,

I am trying to understand the output of this tool.
Only SBS96 were extracted from my data.
Within the All_solutions directory are 8 subfolders (SBS96_[1-8]_signatures).

In SBS96_1, I see signature A.
In SBS96_2, I see signature A and B
...
In SBS_8, I see signatures A-H

What is the purpose of having folders 1-8 and signatures A-H?
Furthermore, why do the stability and the total number of mutations change?
Which signatures should I use for the decompose function?

Thank you in advance.

decomposed activities not generated (version 1.0.17)

Hi,

I ran SigProfilerExtractor 1.0.17 on my data. After the process ended, I didn't see any activities table under COSMIC_ID83_Decomposed_Solution. I guess it should generate Activities and Signatures folder as in ID83_De-Novo_Solution.

Decomposition folder only has

  • Decomposition_Plots
  • De_Novo_map_to_COSMIC_ID83.csv
  • ID83_Decomposition_Plots.pdf
  • Solution_Stats

ID83_De-Novo_Solution folder has Activities, Signatures and Solution_stats folder.

I was wondering if there is anything I am missing here?

Thank you.

COSMIC Decomposed solutions is empty

Hello again,

I am trying to understand if I am getting a result as intended or if there is an issue with the decomposition step.

After what appears to be a successful run of SigProfilerExtractor (no errors thrown, and the .err file is empty), I checked the SBS Suggested_Solutions/COSMIC_SBS96_Decomposed_Solution directory to see if any of the De Novo signatures matched the established COSMIC signatures. In the folder, I see the De_Novo_map_to_COSMIC_SBS96.csv file, but it is empty except for the table header:

De novo extracted, Global NMF Signatures, L1 Error %, L2 Error %, KL Divergence, Cosine Similarity, Correlation

Initially, I thought this must mean that NONE of the De Novo signatures appear to contain any of the COSMIC signatures. However, when I look at the Solution_Stats/Cosmic_SBS96_Decomposition_Log.txt file, I have:

############################ Signature Decomposition Details ################################

Context Type: 96
Genome Build: GRCh38

######################## Decomposing SBS96A ########################


!!!!!!!!!!!!!!!!!!!!!!!!! LAYER: 0 !!!!!!!!!!!!!!!!!!!!!!!!!
Best Signature Composition ['SBS1', 'SBS5', 'SBS12']
L2 Error % 0.27
Cosine Similarity 0.96


!!!!!!!!!!!!!!!!!!!!!!!!! LAYER: 1 !!!!!!!!!!!!!!!!!!!!!!!!!
Best Signature Composition ['SBS1', 'SBS5', 'SBS12']
L2 Error % 0.27
Cosine Similarity 0.96

#################### Final Composition #################################
['SBS1', 'SBS5', 'SBS12']
L2 Error % 0.27
Cosine Similarity 0.96

Which makes me think there is potentially an error in writing out the De_Novo_map_to_COSMIC_SBS96.csv file.
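As a stopgap while the CSV is empty, the final compositions can be read straight from the log. This is a minimal sketch that assumes the log keeps the layout pasted above (a "Decomposing <name>" banner, then a "Final Composition" banner followed by a Python-style list); `parse_decomposition_log` is a hypothetical helper, not part of SigProfilerExtractor.

```python
import ast
import re


def parse_decomposition_log(text):
    """Map each de novo signature to its final list of COSMIC signatures."""
    compositions = {}
    current = None          # name from the most recent "Decomposing" banner
    expect_final = False    # True right after a "Final Composition" banner
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"#+\s*Decomposing\s+(\S+)\s*#+", line)
        if header:
            current = header.group(1)
            expect_final = False
        elif "Final Composition" in line:
            expect_final = True
        elif expect_final and line.startswith("["):
            # The composition is printed as a Python list literal.
            compositions[current] = ast.literal_eval(line)
            expect_final = False
    return compositions
```

Calling it on the contents of Cosmic_SBS96_Decomposition_Log.txt would give a dictionary keyed by de novo signature name, which can then be compared against whatever the CSV should have contained.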

For reference, here is the JOB_METADATA.txt file for the run:

THIS FILE CONTAINS THE METADATA ABOUT SYSTEM AND RUNTIME


-------System Info-------
Operating System Name: Linux
Nodename: NODE1
Release: 3.10.0-957.10.1.el7.x86_64
Version: #1 SMP Mon Mar 18 15:06:45 UTC 2019

-------Python and Package Versions-------
Python Version: 3.7.3
Sigproextractor Version: 1.0.13
SigprofilerPlotting Version: 1.1.6
SigprofilerMatrixGenerator Version: 1.1.17
Pandas version: 1.0.4
Numpy version: 1.18.1
Scipy version: 1.4.1
Scikit-learn version: 0.23.1

--------------EXECUTION PARAMETERS--------------
INPUT DATA
	input_type: vcf
	output: PROJECT1
	input_data: VCFS
	reference_genome: GRCh38
	context_types: SBS96,DBS78,ID83
	exome: False
NMF REPLICATES
	minimum_signatures: 1
	maximum_signatures: 10
	NMF_replicates: 100
NMF ENGINE
	NMF_init: nndsvd_min
	precision: single
	matrix_normalization: gmm
	resample: True
	seeds: random
	min_NMF_iterations: 10,000
	max_NMF_iterations: 1,000,000
	NMF_test_conv: 10,000
	NMF_tolerance: 1e-15
CLUSTERING
	clustering_distance: cosine
EXECUTION
	cpu: 48; Maximum number of CPU is 48
	gpu: False
COSMIC MATCH
	opportunity_genome: GRCh38
	nnls_add_penalty: 0.05
	nnls_remove_penalty: 0.01
	initial_remove_penalty: 0.05
	refit_denovo_signatures: True

What do these subfolders mean?

Hi

I have run your software on a couple of my samples (.VCF).

In the output I see some subfolders like

SBS96_1_Signatures
SBS96_2_Signatures
SBS96_3_Signatures
SBS96_4_Signatures
SBS96_5_Signatures

Does this mean that 5 signatures have been found in my data? Why are the signatures divided cumulatively across 5 subfolders?
If I want to select the signatures that best describe my samples overall, which subfolder should I look at?

Please help me understand this.

I really could not figure out the meaning of the output, even after reading the wiki pages several times.

Thanks in advance for any help.

Am I right in getting the point?

Hello

Please take a look at this screenshot:

Screenshot 2020-04-27 at 19 36 28

Am I right that 5 is the optimal number of extracted signatures, i.e. the best solution, because it has the lowest Frobenius error?

Thanks for any help
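For anyone reading along: reconstruction error tends to keep falling as more signatures are added, so the lowest Frobenius error on its own usually points at the largest rank tried. The run logs also report an average and a minimum process stability for each rank. The sketch below shows one way to combine the two; it is not the tool's actual selection rule, and the thresholds and numbers are made up for illustration.

```python
def pick_rank(stats, min_avg_stability=0.8, min_min_stability=0.2):
    """Return the rank with the lowest error among ranks that look stable.

    stats: list of dicts with keys "k", "error", "avg_stability",
    "min_stability". Thresholds are hypothetical defaults.
    """
    candidates = [
        s for s in stats
        if s["avg_stability"] >= min_avg_stability
        and s["min_stability"] >= min_min_stability
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["error"])["k"]


# Made-up numbers: rank 5 has the lowest error but poor stability.
example = [
    {"k": 3, "error": 0.12, "avg_stability": 0.98, "min_stability": 0.95},
    {"k": 4, "error": 0.09, "avg_stability": 0.93, "min_stability": 0.81},
    {"k": 5, "error": 0.07, "avg_stability": 0.55, "min_stability": 0.10},
]
```

With these example values the stability filter excludes rank 5 even though its error is the smallest, which is the situation the question above is really about.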

UnboundLocalError: local variable 'W' referenced before assignment

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/subroutines.py", line 601, in pnmf
    W, H, kl = nmf_fn(bootstrapGenomes,totalProcesses, init=init, excecution_parameters=excecution_parameters)  #uses custom function nnmf
  File "/home/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/subroutines.py", line 415, in nnmf_cpu
    net = nmf_cpu.NMF(genomes,rank=nfactors, min_iterations=min_iterations, max_iterations=max_iterations, tolerance=tolerance,test_conv=test_conv, init_method=init,seed=None)
  File "/home/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/nmf_cpu.py", line 78, in __init__
    self._W, self._H = self._initialise_wh(init_method)
  File "/home/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/nmf_cpu.py", line 127, in _initialise_wh
    W = torch.from_numpy(W).type(self._tensor_type)
UnboundLocalError: local variable 'W' referenced before assignment
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/sigpro.py", line 746, in sigProfilerExtractor
    i = i)
  File "/home/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/subroutines.py", line 809, in decipher_signatures
    results = parallel_runs(excecution_parameters, genomes=genomes, totalProcesses=totalProcesses, verbose = False)
  File "/home/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/subroutines.py", line 731, in parallel_runs
    result_list = pool.map(pool_nmf, batch_seed_pair) 
  File "/home/anaconda3/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/anaconda3/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
UnboundLocalError: local variable 'W' referenced before assignment
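The traceback shows W being read at line 127 of nmf_cpu.py before anything was assigned to it, which is what happens when none of the branches that set it is taken. The sketch below reproduces that pattern with a toy function; the function names and the list of accepted methods are hypothetical, and whether an unrecognised init method is really the trigger here is an assumption.

```python
def initialise_w_buggy(init_method):
    """Toy version of an initialiser with no fallback branch."""
    if init_method == "random":
        W = [[0.5]]
    elif init_method == "nndsvd":
        W = [[1.0]]
    # No else branch: any other value leaves W unassigned.
    return W


def initialise_w_checked(init_method):
    """Same logic, but an unknown method fails with a clear message."""
    if init_method == "random":
        W = [[0.5]]
    elif init_method == "nndsvd":
        W = [[1.0]]
    else:
        raise ValueError(f"unknown init method: {init_method!r}")
    return W
```

If this is the cause, checking the value passed as the init method (and that the installed version supports it) would be the first thing to try.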

Using SigProfiler with a heterogeneous dataset

I am currently working on signature extraction for a neuroblastoma dataset. The distinguishing feature of our dataset is its heterogeneity: our samples (WGS) come from both untreated and treated patients. The profiles I am working with are therefore very different, which raises some theoretical issues.

1. The total number of mutations varies between the samples

Because I am working on untreated and treated samples, the mutational burden varies a lot between the samples. The result is that SigProfiler primarily minimizes the error of the samples with a high number of mutations. When I plot error_by_sample=f(total_number_of_mutations_by_sample), I get a strong correlation, suggesting that the mutational burden biases the results. This is not surprising given both the cost function and the update rules.

Potential solution and question: Since this is work on mutational profiles, I have normalized all the samples to remove the mutational burden bias (giving each sample the same number of mutations). Do you think this is a good idea? Am I missing a biological aspect by doing this?
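To make the normalization described above concrete, here is a minimal sketch that rescales every sample (column) to the same total while keeping its profile unchanged. The helper name and the target total are hypothetical; it is plain Python so it does not depend on how the matrix was loaded.

```python
def normalize_to_common_burden(matrix, target_total=10000):
    """Rescale each column of a counts matrix to sum to target_total.

    matrix: list of rows (one per mutation type), each a list of
    per-sample counts. Columns that sum to zero are left as zeros.
    """
    n_samples = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_samples)]
    return [
        [
            row[j] * target_total / totals[j] if totals[j] > 0 else 0.0
            for j in range(n_samples)
        ]
        for row in matrix
    ]
```

One caveat worth keeping in mind: a sample with very few mutations has a noisy profile, and scaling it up gives that noise the same weight as a well-sampled profile.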

2. The dataset is slightly unbalanced

The dataset I am working on is slightly unbalanced. As a result, the algorithm seems to focus hierarchically on the parts of the dataset that are better represented. In particular, when looking at the signatures obtained for different values of the total number of mutations, it seems that the distribution of the dataset biases the extractions. It is therefore possible that, for a given value of K, I overfit one part of the dataset while underfitting the other part.

Question: Do you think that in such a case I can trust the optimal value of K?

3. Is there a parameter that adds sparsity during the extraction?

Because the dataset is heterogeneous, some signatures are present in only part of the dataset (the signatures associated with a type of chemotherapy are only present in the treated samples). The problem is that these signatures modify the signatures that are present in the entire dataset. I observe a clear influence of these signatures when I compare an extraction on the entire dataset with an extraction on the untreated samples only.

Potential solution and question: One solution could be to add sparsity during the extraction (on the exposures).
I have seen that you have created a parameter "penalty" to set the degree of sparsity in the suggested solution.
If I have understood your code correctly, this penalty is only applied during the fitting part of the algorithm.
During the extraction, the cost function minimized is an L2-norm or a KL-divergence without any sparsity penalty.
Question: Am I right in saying that there is no parameter to add sparsity during the extraction? Is there a way, with this implementation, to counter the bleeding of signatures?

RESULTS FILES....

Hello.

Do you have a results file for me to compare my results
with yours? Because I ran your program, but I would like
to cross reference my results files with yours...

-CHRIS

sigProfilerExtractor end with error: "OSError: [Errno 24] Too many open files"

I was running sigProfilerExtractor on 92 WGS VCF files from cell lines.
It ends with "OSError: [Errno 24] Too many open files" after the first signature is extracted (STDOUT below).
Any idea?

Here is the command I was running:

sig.sigProfilerExtractor("vcf", "sigProfilerExtractor", "vcf_test",reference_genome="GRCh37", minimum_signatures=1, maximum_signatures=10, nmf_replicates=100,cpu=8)

STDOUT:

************** Reported Current Memory Use: 0.2 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 7291.65 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 2970.29 seconds.
Matrices generated for 92 samples with 429 errors. Total of 349047399 SNVs, 4041556 DINUCs, and 32987372 INDELs were successfully analyzed.
Extracting signature 1 for mutation type 96
The matrix normalizig cutoff is 4115174

process 1 continues please wait...
execution time: 4 seconds

process 1 continues please wait...
.
.
.
process 1 continues please wait...
execution time: 4 seconds

Time taken to collect 100 iterations for 1 signatures is 71.36 seconds
Traceback (most recent call last):
File "", line 1, in
File "/anaconda3/envs/SigProfiler/lib/python3.9/site-packages/SigProfilerExtractor/sigpro.py", line 779, in sigProfilerExtractor
processes = sub.decipher_signatures(excecution_parameters,
File "/anaconda3/envs/SigProfiler/lib/python3.9/site-packages/SigProfilerExtractor/subroutines.py", line 843, in decipher_signatures
processAvg, exposureAvg, processSTE, exposureSTE, avgSilhouetteCoefficients, clusterSilhouetteCoefficients = cluster_converge_outerloop(Wall, Hall, processes, dist=dist, gpu=gpu)
File "/anaconda3/envs/SigProfiler/lib/python3.9/site-packages/SigProfilerExtractor/subroutines.py", line 1104, in cluster_converge_outerloop
result_list = parallel_clustering(Wall, Hall, totalprocess, iterations=50, n_cpu=-1, dist=dist, gpu=gpu)
File "/anaconda3/envs/SigProfiler/lib/python3.9/site-packages/SigProfilerExtractor/subroutines.py", line 1088, in parallel_clustering
pool = multiprocessing.Pool()
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 58, in _launch
self.pid = util.spawnv_passfds(spawn.get_executable(),
File "/anaconda3/envs/SigProfiler/lib/python3.9/multiprocessing/util.py", line 450, in spawnv_passfds
errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
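Errno 24 means the process hit its limit on open file descriptors, and the traceback shows it happening while a new multiprocessing pool was creating pipes. A possible workaround on Linux or macOS is to raise the soft limit before starting the run. The sketch below uses the standard-library `resource` module (Unix only); `raise_open_file_limit` and the target value are hypothetical, and if descriptors are leaking this may only postpone the error.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit towards target; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to do
    # The soft limit can never exceed the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft  # the OS refused; keep the existing limit
    return new_soft
```

The shell equivalent is `ulimit -n`, run in the same terminal session before launching Python.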

run estimate_solution without finishing denovo extraction for all ranks

Hi,

I started a de novo extraction on my data for up to 35 signatures. This is taking much longer than expected. However, I am sure there will not be 35 signatures in my data (based on previous evidence, there might be about 10). So far, the extraction has completed for 17 ranks. Now I would like to estimate the best solution among these 17 ranks.

When I ran the estimate_solution function, it gave the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/thatikon/anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/estimate_best_solution.py", line 50, in estimate_solution
    signatures[i]=signatures[i].rstrip("*")
AttributeError: 'int' object has no attribute 'rstrip'
>>>

There is some information missing from All_solutions_stat.csv file (I compared this with another extraction which is completed).

I was wondering if there is a way here for me to estimate the best solution without actually finishing the extraction for all provided ranks?

Thank you.
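The traceback shows `.rstrip("*")` being called on an integer. One possible explanation, stated here as an assumption, is that in a finished run at least one entry in that column carries a "*" suffix, so the whole column is read as text, while in the unfinished run every entry is a bare number. The sketch below shows a tolerant version of the stripping step; `strip_marker` is a hypothetical helper, not the package's code.

```python
def strip_marker(value, marker="*"):
    """Return value as text with any trailing marker removed.

    Works whether the value was read as a string ("10*") or an int (10).
    """
    return str(value).rstrip(marker)


# Mixed input, as it might look when some rows were read as integers.
signatures = [1, 2, "3*", 4]
cleaned = [strip_marker(s) for s in signatures]
```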

ZeroDivisionError with simulated data

Hello,
I just attended a short workshop given by Steve Rozen Lab and collaborators and we used a simulated dataset.
We analyzed results from SigProfilerExtractor that were generated some time ago, but I don't know which version was used to generate them. My problem is that when I try to extract profiles using the latest version of the library with the same input, I get the following error:

SigProfilerExtractor.__version__
'1.0.13.9'

sig.sigProfilerExtractor(input_type = "table", 
                         output = "sig.pro.output_test", 
                         input_data = "../3.types.ground.truth.spectra.1000.size.csv", 
                         minimum_signatures=  9,  
                         maximum_signatures= 17,
                         cpu = 4)
lib/python3.8/site-packages/SigProfilerExtractor/sigpro.py in sigProfilerExtractor(input_type, output, input_data, reference_genome, opportunity_genome, context_type, exome, minimum_signatures, maximum_signatures, nmf_replicates, resample, batch_size, cpu, gpu, nmf_init, precision, matrix_normalization, seeds, min_nmf_iterations, max_nmf_iterations, nmf_test_conv, nmf_tolerance, nnls_add_penalty, nnls_remove_penalty, initial_remove_penalty, refit_denovo_signatures, clustering_distance, export_probabilities, make_decomposition_plots, get_all_signature_matrices)
    688         # get the cutoff for normatization to handle the hypermutators
    689 
--> 690         normalization_cutoff = sub.get_normalization_cutoff(genomes, manual_cutoff=100*genomes.shape[0])
    691         #print("Normalization Cutoff is :", normalization_cutoff)
    692         excecution_parameters["normalization_cutoff"]= normalization_cutoff

~/miniconda3/envs/torch/lib/python3.8/site-packages/SigProfilerExtractor/subroutines.py in get_normalization_cutoff(data, manual_cutoff)
    251 
    252 
--> 253     mean = np.mean(col_sums)
    254     std = np.std(col_sums)
    255     cutoff = (mean + 2*(std)).astype(int)

<__array_function__ internals> in mean(*args, **kwargs)

~/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims)
   3370             return mean(axis=axis, dtype=dtype, out=out, **kwargs)
   3371 
-> 3372     return _methods._mean(a, axis=axis, dtype=dtype,
   3373                           out=out, **kwargs)
   3374 

~/.local/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
    170             ret = ret.dtype.type(ret / rcount)
    171     else:
--> 172         ret = ret / rcount
    173 
    174     return ret

ZeroDivisionError: division by zero

I tried with the example data and it worked as expected (with GPU activated too).

Here is the input data. It is a CSV file; I changed the extension because GitHub doesn't allow that extension. It was created using SynSigGen, but I don't have the code that was used to create it.

3.types.ground.truth.spectra.1000.size.txt
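One thing worth ruling out is the delimiter: the attached file is comma-separated, and as far as I understand the "table" input type is meant for tab-separated text (treat that as an assumption to check against the documentation). The sketch below converts a comma-separated table to a tab-separated one using only the standard library; `csv_to_tsv` is a hypothetical helper, and whether the delimiter is really the cause of the ZeroDivisionError is not confirmed.

```python
import csv


def csv_to_tsv(src, dst):
    """Copy a comma-separated table from src to dst as tab-separated text.

    src and dst are open text file objects (or anything file-like).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
```

A typical call would open the original file for reading and a new .txt file for writing (both with `newline=""`), then pass the new file as input_data.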

Decomposed SBS96 signatures include Signature H

I ran SigProfilerExtractor as I wanted the COSMIC signatures for each of my samples. I'm looking at the decomposed signatures, and I'm seeing a couple SBS signatures, but I'm also seeing Signature H. What is signature H and why is it not incorporated into the other SBS signatures? I'm only seeing about 15 COSMIC signatures in Decomposed_Solution_Activities_SBS96.txt. Thanks in advance!

Executed correctly, but no signatures generated

Hello,

I am attempting to run sigProfilerExtractor on a set of input VCFs, and I believe everything is working as expected (no errors thrown that halt the process). However, upon inspection of the output directory, only the SBS96 folder appears, and the contained All_solutions_stat.csv is empty.

Here are my input commands and the returned output:

>>> from SigProfilerExtractor import sigpro as sig
>>> data = "/Users/Adrian/Desktop/TestSigProNewVer_vcfs"   # VCF files directory, 144 samples
>>> sig.sigProfilerExtractor("vcf", "TestSigProNewVer_out", data, minimum_signatures=1, maximum_signatures=10)

************** Reported Current Memory Use: 0.16 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 22.62 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 6.39 seconds.
Matrices generated for 144 samples with 712919 errors. Total of 194855 SNVs, 14411 DINUCs, and 28350 INDELs were successfully analyzed.
Extracting signature 1 for mutation type 96
The matrix normalizig cutoff is 9600

Here is the JOB_METADATA.txt info:

THIS FILE CONTAINS THE METADATA ABOUT SYSTEM AND RUNTIME


-------System Info-------
Operating System Name: Darwin
Nodename: Adrians-MacBook-Pro-4.local
Release: 17.7.0
Version: Darwin Kernel Version 17.7.0: Sun Dec  1 19:19:56 PST 2019; root:xnu-4570.71.63~1/RELEASE_X86_64

-------Python and Package Versions-------
Python Version: 3.6.1
Sigproextractor Version: 1.0.12
SigprofilerPlotting Version: 1.1.6
SigprofilerMatrixGenerator Version: 1.1.16
Pandas version: 1.0.5
Numpy version: 1.19.0
Scipy version: 1.5.0
Scikit-learn version: 0.23.1

--------------EXECUTION PARAMETERS--------------
INPUT DATA
	input_type: vcf
	output: TestSigProNewVer_out
	input_data: /Users/Adrian/Desktop/TestSigProNewVer_vcfs
	reference_genome: GRCh37
	context_types: SBS96,DBS78,ID83
	exome: False
NMF REPLICATES
	minimum_signatures: 1
	maximum_signatures: 10
	NMF_replicates: 100
NMF ENGINE
	NMF_init: nndsvd_min
	precision: single
	matrix_normalization: gmm
	resample: True
	seeds: random
	min_NMF_iterations: 10,000
	max_NMF_iterations: 1,000,000
	NMF_test_conv: 10,000
	NMF_tolerance: 1e-15
CLUSTERING
	clustering_distance: cosine
EXECUTION
	cpu: 8; Maximum number of CPU is 8
	gpu: False
COSMIC MATCH
	opportunity_genome: GRCh37
	nnls_add_penalty: 0.05
	nnls_remove_penalty: 0.01
	initial_remove_penalty: 0.05
	refit_denovo_signatures: True

-------Analysis Progress-------
[2020-06-29 15:31:33] Analysis started:

##################################

[2020-06-29 15:32:11] Analysis started for SBS96. Matrix size [96 rows x 143 columns]

[2020-06-29 15:32:11] Normalization GMM with cutoff value set at 9600

Am I to interpret this as suggesting that no signatures were found?

Thanks,
Adrian

Extract all SBS Cosmic signatures

Hi,

I have an issue regarding the extraction of all SBS COSMIC signatures from TCGA MuTect data.
I have correctly installed all Python tools of the SigProfiler suite and correctly generated the matrix of substitutions using the Matrix Generator.

When I tried to extract all SBS COSMIC signatures using the code below, the tool extracted only 9 SBS signatures in the decomposed solution (I also tried increasing the number of signatures to be extracted with endProcess=100, but I again obtained only 9 SBS signatures):
from sigproextractor import sigpro as sig
data = "/path_to_folder/output/SBS/LUAD_mutect.SBS96.all.txt"
sig.sigProfilerExtractor("table", "example_output_new", data, "GRCh38", "GRCh38", startProcess=1, endProcess=50, totalIterations = 100)

Is there any step/parameter that I missed to extract all SBS?

Thank you,
Valentina

spawn issues

Hi,
I am using Python 3.7 and ran into these issues:
Traceback (most recent call last):
File "", line 1, in
File "anaconda3/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "anaconda3/lib/python3.7/multiprocessing/spawn.py", line 114, in _main
prepare(preparation_data)
File "anaconda3/lib/python3.7/multiprocessing/spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "anaconda3/lib/python3.7/multiprocessing/spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "anaconda3/lib/python3.7/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "anaconda3/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
sig.sigProfilerExtractor("vcf", "output_may2", data, minimum_signatures=1, maximum_signatures=10)
File "anaconda3/lib/python3.7/site-packages/SigProfilerExtractor/sigpro.py", line 521, in sigProfilerExtractor
data = datadump.SigProfilerMatrixGeneratorFunc(project_name, refgen, project, exome=exome, bed_file=None, chrom_based=False, plot=False, gs=False)
File "anaconda3/lib/python3.7/site-packages/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGeneratorFunc.py", line 303, in SigProfilerMatrixGeneratorFunc
os.remove(log_file)
FileNotFoundError: [Errno 2] No such file or directory: 'INDEL/logs/SigProfilerMatrixGenerator_INDEL_GRCh372020-05-29.out'

Am I missing something here? Thanks for your input.

Originally posted by @grigri2020 in #27 (comment)

Failing to run SigProfilerExtractor

Hello SigProfilerExtractor developers,

I used SigProfilerExtractor for de novo extraction of mutational signatures on whole genome sequencing .vcf files. I ran your Python package and got the error messages attached below, which I think are related to the fonts of the output plots. Could you please advise me on how to fix this?

Many thanks,
Katerina

line 602, in open_for_read_by_name
return open(name,mode)
FileNotFoundError: [Errno 2] No such file or directory: 'Arial Bold.ttf'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/lib/utils.py", line 661, in open_for_read
return getBytesIO(datareader(name) if name[:5].lower()=='data:' else urlopen(name).read())
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/urllib/request.py", line 510, in open
req = Request(fullurl, data)
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/urllib/request.py", line 328, in __init__
self.full_url = url
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/urllib/request.py", line 354, in full_url
self._parse()
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/urllib/request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'Arial Bold.ttf'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/pdfbase/ttfonts.py", line 155, in TTFOpenFile
f = open_for_read(fn,'rb')
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/lib/utils.py", line 663, in open_for_read
raise IOError('Cannot open resource "%s"' % name)
OSError: Cannot open resource "Arial Bold.ttf"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_subs_signatures.py", line 4, in
from SigProfilerExtractor import sigpro as sig
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/SigProfilerExtractor/sigpro.py", line 33, in
from SigProfilerExtractor import subroutines as sub
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/SigProfilerExtractor/subroutines.py", line 22, in
from SigProfilerExtractor import PlotDecomposition as sp
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/SigProfilerExtractor/PlotDecomposition.py", line 25, in
from SigProfilerExtractor import PlotDecomposition_SBS96 as spd_96
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/SigProfilerExtractor/PlotDecomposition_SBS96.py", line 40, in
pdfmetrics.registerFont(TTFont('Arial-Bold', 'Arial Bold.ttf'))
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/pdfbase/ttfonts.py", line 1196, in __init__
self.face = TTFontFace(filename, validate=validate, subfontIndex=subfontIndex)
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/pdfbase/ttfonts.py", line 1090, in __init__
TTFontFile.__init__(self, filename, validate=validate, subfontIndex=subfontIndex)
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/pdfbase/ttfonts.py", line 457, in __init__
TTFontParser.__init__(self, file, validate=validate,subfontIndex=subfontIndex)
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/pdfbase/ttfonts.py", line 179, in __init__
self.readFile(file)
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/pdfbase/ttfonts.py", line 255, in readFile
self.filename, f = TTFOpenFile(f)
File "/hps/anaconda3/envs/SigProfilerExtractor/lib/python3.7/site-packages/reportlab/pdfbase/ttfonts.py", line 165, in TTFOpenFile
raise TTFError('Can't open file "%s"' % fn)
reportlab.pdfbase.ttfonts.TTFError: Can't open file "Arial Bold.ttf"
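The last frames show the import failing because reportlab could not open 'Arial Bold.ttf', so the first question is whether that font file exists anywhere on the machine. The sketch below searches a few directories for it; `find_font` is a hypothetical helper and the directories listed are only typical locations, which may differ on your cluster.

```python
import os


def find_font(filename, search_dirs):
    """Return the full path of the first file named filename, or None."""
    for base in search_dirs:
        # os.walk yields nothing for directories that do not exist.
        for root, _dirs, files in os.walk(os.path.expanduser(base)):
            if filename in files:
                return os.path.join(root, filename)
    return None


# Typical font locations on Linux and macOS (adjust for your system).
COMMON_FONT_DIRS = ["/usr/share/fonts", "~/.fonts", "/Library/Fonts"]
```

If `find_font("Arial Bold.ttf", COMMON_FONT_DIRS)` returns None, the font is probably not installed on that machine, which is common on headless Linux servers.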

sigproSS python package

I want to ask whether the 'sigproSS' package can only perform SBS decompositions. Can it also be applied to DBS and indel decompositions?
Please help me, thanks a lot!

Problems with large samples

Hi,
I have 3500 genomes to analyze, so the matrix I made has 3500 columns. I did not pass the parameter "maximum_signatures" to sigProfilerExtractor. I found that the default value of maximum_signatures is 10, but it has generated a subfolder named SBS96_11_Signatures. When will it stop? Thanks

Multiprocessing issue

Dear Developers,

I am experiencing an issue with running SigProfilerExtractor. The program crashes with the following error:

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

Please find attached the complete log file:
log.txt

Strangely, if I run it from an IDE (Visual Studio), it runs smoothly; however, if I run it as a script from the command line, it crashes with the above error.

I would appreciate any advice on how to resolve the issue.

Best regards,
Jurica
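The error message itself points at the usual remedy: when child processes are started by spawning rather than forking, each worker re-imports the main script, so any call that starts the run has to sit behind an `if __name__ == "__main__":` guard. Below is a minimal sketch of a command-line script written that way. The file name, output folder and argument handling are hypothetical; the `sigProfilerExtractor` call mirrors the ones quoted elsewhere on this page.

```python
import sys


def main(argv):
    """Run one extraction; argv[1] is the directory holding the VCF files."""
    if len(argv) < 2:
        print("usage: python run_extraction.py <vcf_directory>")
        return 2
    # Imported inside main() so nothing heavy runs when a worker
    # process re-imports this file.
    from SigProfilerExtractor import sigpro as sig

    sig.sigProfilerExtractor(
        "vcf", "example_output", argv[1],
        minimum_signatures=1, maximum_signatures=10,
    )
    return 0


if __name__ == "__main__":
    # Only the process that was launched directly enters this branch;
    # workers import the module under a different name and skip it.
    main(sys.argv)
```

It would be run as `python run_extraction.py /path/to/vcfs`. Whether this explains the difference between the IDE and the command line in this particular case is not something the log alone confirms.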

sigProfilerExtractor TypeError: unsupported operand type(s) for /: 'str' and 'str'

When running sigProfilerExtractor, I get a TypeError and no output is produced. I am pasting the first 10 lines of the input file. Do you know of anything I can do to fix this error?

Thanks,
Teja

juno:SBS yellapav$ head /home/yellapav/local/downloads/sigplofiler/cave_fm6_chap_alex.txt
MM PD26405a CHAPMAN GRCh37 SNP 1 2301677 2301677 A T SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 3622561 3622561 C T SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 4531849 4531849 C A SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 4538255 4538255 G A SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 4539928 4539928 G C SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 4980298 4980298 T A SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 5113557 5113557 G T SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 5493418 5493418 T G SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 6416065 6416065 C A SOMATIC
MM PD26405a CHAPMAN GRCh37 SNP 1 6551142 6551142 C T SOMATIC

sig.sigProfilerExtractor("text", "/home/yellapav/local/downloads/sigplofiler/output/SBS/out","/home/yellapav/local/downloads/sigplofiler/cave_fm6_chap_alex.txt", "chap")

************** Reported Current Memory Use: 0.08 GB *****************

Extracting signature 1 for mutation type 501411

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/work/isabl/opt/python/.virtualenvs/users/yellapav/default_python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1497, in na_op
result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
File "/work/isabl/opt/python/.virtualenvs/users/yellapav/default_python3/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 205, in evaluate
return _evaluate(op, op_str, a, b, **eval_kwargs)
File "/work/isabl/opt/python/.virtualenvs/users/yellapav/default_python3/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 65, in _evaluate_standard
return op(a, b)
TypeError: unsupported operand type(s) for /: 'str' and 'str'
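The log line "Extracting signature 1 for mutation type 501411" suggests the per-mutation list was read as if it were a counts matrix with 501411 rows, and dividing its text cells would produce exactly this TypeError. That reading is an inference from the log, not a confirmed diagnosis. A quick pre-flight check like the sketch below can tell the two file types apart; `find_non_numeric_cells` is a hypothetical helper, and it assumes a tab-separated table with a header row and mutation types in the first column.

```python
def find_non_numeric_cells(lines, delimiter="\t", limit=5):
    """Return up to `limit` (row, column, value) cells that are not numbers.

    lines: iterable of text lines. The first line is treated as the header
    and the first column as row labels, so both are skipped.
    """
    problems = []
    for row_no, line in enumerate(lines):
        if row_no == 0:
            continue  # header row with sample names
        cells = line.rstrip("\n").split(delimiter)
        for col_no, cell in enumerate(cells[1:], start=1):
            try:
                float(cell)
            except ValueError:
                problems.append((row_no, col_no, cell))
                if len(problems) >= limit:
                    return problems
    return problems
```

An empty result means every data cell parses as a number; anything else means the file is not a counts matrix and needs to go through matrix generation first.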

meaning of opportunity genome

Hi there,

thanks for sharing the software.

One question;
for the command "sigProfilerExtractor" it is stated that the opportunity_genome argument "automatically matches the input reference genome value".

I launched the program using the command
sig.sigProfilerExtractor("vcf", output_dir, input_dir,reference_genome="GRCh38",cpu=6)

however I noticed that in the JOB_METADATA file it states that
"COSMIC MATCH
opportunity_genome: GRCh37"

Is this an error? Should I rerun with the additional option 'opportunity_genome="GRCh38"'

thanks

jamie

Nothing happens when running the program

Hi

I have installed your software

I have put one of my own .vcf files in the path and typed

path_to_example_vcf = sig.importdata("vcf")

data = path_to_example_vcf

sig.sigProfilerExtractor("vcf", "example_output", data)

After some time everything seems to have run correctly, but the output does not contain what I see with your example data, such as the suggested solution.

Here are a few lines from my terminal:

>>> sig.sigProfilerExtractor("vcf", "output", data)

************** Reported Current Memory Use: 0.1 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 5.71 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 4.63 seconds.
Matrices generated for 1 samples with 0 errors. Total of 28098 SNVs, 86 DINUCs, and 2515 INDELs were successfully analyzed.
Normalization Cutoff is : 28098
Extracting signature 1 for mutation type 96
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:163: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(2 * res)
process 1 continues please wait... 
execution time: 0 seconds 

process 1 continues please wait... 
execution time: 0 seconds 

process 1 continues please wait... 
execution time: 0 seconds 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:163: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(2 * res)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:163: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(2 * res)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:163: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(2 * res)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:1075: ConvergenceWarning: Maximum number of iteration 1000000 reached. Increase it to improve convergence.
  warnings.warn("Maximum number of iteration %d reached. Increase it to"
process 1 continues please wait... 
execution time: 265 seconds 

process 1 continues please wait... 
execution time: 0 seconds 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:1075: ConvergenceWarning: Maximum number of iteration 1000000 reached. Increase it to improve convergence.
  warnings.warn("Maximum number of iteration %d reached. Increase it to"
process 1 continues please wait... 
execution time: 271 seconds 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:1075: ConvergenceWarning: Maximum number of iteration 1000000 reached. Increase it to improve convergence.
  warnings.warn("Maximum number of iteration %d reached. Increase it to"
process 1 continues please wait... 
execution time: 271 seconds 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py:1075: ConvergenceWarning: Maximum number of iteration 1000000 reached. Increase it to improve convergence.
  warnings.warn("Maximum number of iteration %d reached. Increase it to"
process 1 continues please wait... 
execution time: 272 seconds 

Time taken to collect 8 iterations for 1 signatures is 275.87 seconds
Optimization time is 1.366070032119751 seconds
The reconstruction error is 0.012, average process stability is 1.0 and 
the minimum process stability is 1.0 for 1 signatures


Extracting signature 2 for mutation type 96
>>> 

I need your help please

Thanks

H01_334.txt
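
The log above reports that matrices were generated for a single sample. One thing that may be worth checking before extraction is how many samples the generated matrix holds compared with the number of signatures requested. The sketch below assumes a tab-separated matrix whose first column holds the mutation types and whose remaining columns are samples; the helper names are hypothetical, and the comparison is a rough heuristic rather than a rule taken from the SigProfilerExtractor documentation.

```python
import csv


def sample_names(matrix_path):
    """Return the sample column names from a tab-separated mutation matrix."""
    with open(matrix_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]  # assumption: the first column holds the mutation types


def enough_samples(matrix_path, max_signatures):
    """True when the matrix has at least as many samples as requested signatures."""
    return len(sample_names(matrix_path)) >= max_signatures
```

For a matrix with one sample column, `enough_samples(path, 2)` is false, which suggests restricting the requested range of signatures or adding more samples.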

No signature extraction when "exome = True"

Hi,

I am trying to extract de novo signatures from my exome data, and I get no output and no error! JOB_METADATA.txt is generated, but no solutions, signatures, activities, plots, etc.

Works perfectly for my genome data!

Command:
data = "PATH_TO_MY_VCF_FILES_FOLDER"
sig.sigProfilerExtractor("vcf", "test2", data, refgen="GRCh37", genome_build = "GRCh37", startProcess=1, endProcess=10, totalIterations=8, cpu=-1, hierarchy = False, mtype = ["default"],exome = True)

Thanks,
Monica.

subroutines.py line 1410 -- shutil.rmtree OSError: [Errno 16] Device or resource busy

....
Decompositon Plot made for SBS96B

Traceback (most recent call last):

File "/hpf/largeprojects/adam/mehdi/sp_versions/sigproex_v1.0.17/SigProfilerExtractor/sigpro.py", line 1022, in sigProfilerExtractor
final_signatures = sub.signature_decomposition(processAvg, m, layer_directory2, genome_build=genome_build, add_penalty=add_penalty, remove_penalty=remove_penalty, mutation_context=mutation_context, make_decomposition_plots=make_decomposition_plots, originalProcessAvg=originalProcessAvg)
File "/hpf/largeprojects/adam/mehdi/sp_versions/sigproex_v1.0.17/SigProfilerExtractor/subroutines.py", line 1410, in signature_decomposition
shutil.rmtree(directory+"/Decomposition_Plots")
File "/home/mlayeghi/.conda/envs/SigProfiler/lib/python3.6/shutil.py", line 480, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/home/mlayeghi/.conda/envs/SigProfiler/lib/python3.6/shutil.py", line 438, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/home/mlayeghi/.conda/envs/SigProfiler/lib/python3.6/shutil.py", line 436, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000003906e18580003a6aa'
EXIT STATUS 0
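
The `.nfs...` name in the traceback looks like the placeholder file that NFS keeps while a deleted file is still open in some process, so the directory cannot be emptied until that handle is closed. A minimal sketch of a retrying removal is shown below. It is a workaround written for illustration, not code from SigProfilerExtractor, and the function name and default values are assumptions.

```python
import errno
import shutil
import time


def rmtree_with_retry(path, attempts=5, delay=1.0):
    """Remove a directory tree, retrying when the filesystem reports EBUSY."""
    for attempt in range(1, attempts + 1):
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            return True  # nothing left to remove
        except OSError as exc:
            # Re-raise anything that is not "device or resource busy",
            # and give up once the last attempt has failed.
            if exc.errno != errno.EBUSY or attempt == attempts:
                raise
            time.sleep(delay)
    return False
```

If the file handle is never released, the final attempt still raises the original `OSError`, so the failure is not hidden.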

Getting error from decomp function

Hello

I have tried the decomp function on my data, hoping to get some results from my de novo output files, but I am getting an error:

decomp.decompose(signatures, activities, samples, output, genome_build="GRCh37", verbose=False)
Traceback (most recent call last):
File "", line 1, in
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/SigProfilerExtractor/decomposition.py", line 57, in decompose
exposureAvg = pd.read_csv(activities, sep = "\t", index_col = 0)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 880, in init
self._make_engine(self.engine)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 1891, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 673, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File De_Novo_Solution_Activities.txt does not exist: 'De_Novo_Solution_Activities.txt'
from SigProfilerExtractor import sigpro as sig
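
The traceback shows pandas looking for 'De_Novo_Solution_Activities.txt' as a bare file name, which is resolved against the current working directory. A small check like the one below can confirm where each path points before calling decompose. This is a sketch with a hypothetical helper name, independent of SigProfilerExtractor itself.

```python
from pathlib import Path


def missing_inputs(**paths):
    """Return {argument name: absolute path} for every path that does not exist."""
    return {
        name: str(Path(value).resolve())
        for name, value in paths.items()
        if not Path(value).exists()
    }
```

For example, `missing_inputs(signatures=signatures, activities=activities, samples=samples)` returns an empty dictionary when all three files are found; otherwise it lists the absolute location that was actually searched for each missing file.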

Missing activity table (samples) when using decompose

Hello developers,

I used the decompose function to align the de novo signatures to the COSMIC signatures.
First I extracted the De Novo signatures:

from sigproextractor import sigpro as sig
data = "Project-339/01_sigProfilerExtractor/vcfs"
sig.sigProfilerExtractor(input_type="vcf", out_put="output", input_data=data, refgen="GRCh38",genome_build="GRCh38")

I only extracted SBS96 and got 1-8 signatures (why 8 signatures?) after running the tool.
Then I wanted to run the decomposition function to get the COSMIC signatures. However, I was missing the activity table, in which the rows are mutation types and the columns are samples. It is not in the output folder after de novo extraction.

Instead, I used the matrix generator.

from sigproextractor import sigpro as sig
data = "Project-339/01_sigProfilerExtractor/vcfs"
sig.sigProfilerExtractor(input_type="vcf", out_put="output", input_data=data, refgen="GRCh38",genome_build="GRCh38")

Then I used the matrixes.SBS96.exome as input for the decomposition function:

from sigproextractor import decomposition as decomp

signatures = "output/SBS96/All_solutions/SBS96_8_Signatures/SBS96_S8_Signatures.txt"
activities = "output/SBS96/All_solutions/SBS96_8_Signatures/SBS96_S8_Activities.txt"
samples = "vcfs/output/SBS/matrixes.SBS96.exome"
output = "decomposed"
decomp.decompose(signatures, activities, samples,output=output,mutation_type="96",genome_build='GRCh38')

Is this the right way?

[Bug] SigProfilerExtractor: ProfileGenerator works but then stops at the extractor step

Hello,

I'm a new user of the library.
I tried to extract the signatures using the extractor module, and it worked pretty well on my computer.
But when I use it on a cluster, I run into a bug.

The matrix generation step works well:

**************** Reported Current Memory Use: 0.16 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 5.09 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 4.46 seconds.
Matrices generated for 1 samples with 1618 errors. Total of 28370 SNVs, 541 DINUCs, and 2012 INDELs were successfully analyzed.

But then it stops and does not go on to the extraction step.

Here is the command used:
sig.sigProfilerExtractor("vcf", dossier_output, dossier_input,"GRCh38",1,2)

Do you see where I went wrong?

Best,

Mario

problem with decomposition result

Hello,
I've followed the example and run the following code:

>>> path_to_example_folder_containing_vcf_files = sig.importdata("vcf")
>>> data = path_to_example_folder_containing_vcf_files # you can put the path to your folder containing the vcf samples
>>> sig.sigProfilerExtractor("vcf", "example_output", data, startProcess=1, endProcess=3)

However, there was no decomposition result at the end; the following warning was shown instead:

"WARNING!!! We apolozize we don't have a global signature database for the mutational context you provided. We have a database only for SBS96, DINUC and INDELS.
Therefore no result for signature Decomposition is generated."

What could I do to solve the warning and get the decomposition result?

Thank you,
Charlene

Pentanucleotide context mutational signature reference matrix

Hello,

I am writing to ask where I could find the pentanucleotide context mutational signature matrix, that is, the matrix describing the probability of each mutation in pentanucleotide context for all the mutational signatures.

The following is described in the paper "The Repertoire of Mutational Signatures in Human Cancer":

"Employing the single base substitution classification of 1536 mutation types, which uses the pentanucleotide sequence context two bases 5’ and two bases 3’ to each mutated base, yielded a set of signatures largely consistent with that based on substitutions in trinucleotide context alone. Notably, however, the pentanucleotide context enabled the extraction of two forms of both SBS2 and SBS13, one with mainly a pyrimidine (C or T) and the other with a purine (A or G) at the -2 base (the second base 5’ to the mutated cytosine). These may represent the activities of the cytidine deaminases APOBEC3A and APOBEC3B, respectively [44]."
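
For reference, the 1536 figure follows from 6 substitution classes multiplied by 4 possible bases at each of the 4 flanking positions (6 x 4^4), just as the 96 trinucleotide types come from 6 x 4^2. The sketch below enumerates both sets. The label layout (for example A[C>T]G) is chosen here for illustration and is not taken from the paper.

```python
from itertools import product

BASES = "ACGT"
# Substitutions are conventionally reported from the pyrimidine (C or T) of the mutated base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def contexts(flank):
    """List mutation types with `flank` bases on each side, e.g. A[C>T]G for flank=1."""
    labels = []
    for sub in SUBSTITUTIONS:
        for left in product(BASES, repeat=flank):
            for right in product(BASES, repeat=flank):
                labels.append("".join(left) + "[" + sub + "]" + "".join(right))
    return labels


print(len(contexts(1)))  # 96 trinucleotide types
print(len(contexts(2)))  # 1536 pentanucleotide types
```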

I wasn't able to find the matrix on the website https://www.synapse.org/#!Synapse:syn11726601/files or in the paper https://www.nature.com/articles/s41586-020-1943-3#Sec18

If the matrix is confidential, I was wondering if it could be shared through email at [email protected].

Many thanks for the help!

Regards,
Sangjin
