statisticalbiotechnology / diffacto
License: Other
I would like to create a bioconda recipe for diffacto using the latest code.
Could you make a release for that?
Thank you.
First of all, thank you for developing such great software!
I've noticed that the results change slightly from run to run. For example, in my workflow applied to the iPRG2015 data, half of the Diffacto runs give only 6 true-positive protein identifications passing the p-value threshold with Bonferroni correction, while the other half give the same 6 true positives plus 1 false-positive protein. I do not change the input file or the parameters; I only start the script again, and the MC simulation is turned off. Would it be possible (and would it be correct?) to add a fixed random seed somewhere in the code, or as an optional parameter?
I can share the input files and parameters, but the behaviour seems reproducible on any dataset.
P.S. Should proteins with a negative S/N ratio be excluded from the results? On the standard with known protein concentrations, the true positives always seem to have a positive S/N ratio, while the false positives have a negative S/N.
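For illustration, here is a minimal sketch of what such an optional fixed seed could look like. The `set_global_seed` helper is hypothetical, and it assumes the nondeterminism comes from Python's and NumPy's global RNGs, which I have not verified in the Diffacto code:

```python
import random

import numpy as np


def set_global_seed(seed: int) -> None:
    # Hypothetical helper: pin the global RNGs a script like diffacto.py might use.
    random.seed(seed)
    np.random.seed(seed)


set_global_seed(42)
first = np.random.rand(5)
set_global_seed(42)
second = np.random.rand(5)
print(np.allclose(first, second))  # identical draws across "runs"
```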
Regards,
Mark
When I click on Anaconda, I get an error. When I click on pyteomics, I also get an error. They need updating to valid URLs.
I've been looking at the GMM normalization method in diffacto, and think there might be a bug. The relevant code is:
```python
''' two-component Gaussian mixture model '''
from sklearn.mixture import GMM
gmm = GMM(2)
norm_scale = []
for sp in samples:
    v = df[sp].values
    v = v[np.logical_not(np.isnan(v))]
    v = v[np.logical_not(np.isinf(v))]
    try:
        gmm.fit(np.matrix(v.values).T)  # no property .values, so bug?
        vmean = gmm.means_[np.argmin(gmm.covars_)]
        norm_scale.append(vmean)
    except:
        norm_scale.append(np.nanmean(v))
norm_scale = np.array(norm_scale)
```
Notice that v is set to a numpy array, but then in the try block "v.values" is used. In my debugging so far, v has never had a values property, resulting in an exception, so the code calls np.nanmean instead. Is this a bug?
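For reference, here is a sketch of what the fixed call might look like. Note this is a modernized approximation rather than a drop-in patch: in current scikit-learn, `GMM` and `covars_` have been replaced by `GaussianMixture` and `covariances_`, and the toy data is mine:

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # GMM was removed from sklearn

v = np.array([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1])  # toy intensity values
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(v.reshape(-1, 1))  # v is already an ndarray, so no .values here
vmean = gmm.means_[np.argmin(gmm.covariances_)]  # mean of the tighter component
```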
Would it make sense to scale the quantities in the output to match the original TIC?
I am using virtualenv and pip install. The listed packages installed fine, but it's unclear whether diffacto itself was installed.
```
Installing collected packages: threadpoolctl, joblib, decorator, scikit-learn, pyteomics, networkx
Successfully installed decorator-4.4.2 joblib-1.0.1 networkx-2.5.1 pyteomics-4.4.2 scikit-learn-0.24.1 threadpoolctl-2.1.0
(Python3) /verona/nobackup$ run_diffacto.py
-bash: run_diffacto.py: command not found
```
Meanwhile `pip install cutadapt`, another Python package, works straight away:
```
(Python3) /verona/nobackup$ cutadapt
This is cutadapt 3.4 with Python 3.7.3
Run "cutadapt --help" to see command-line options
```
Is there something important missing from the pip installation instructions for diffacto?
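As a quick diagnostic, one can check from Python which console scripts a pip-installed distribution actually registers (stdlib only, Python 3.8+). If the list for diffacto is empty or `None`, no command-line wrapper was installed, which would explain the "command not found":

```python
from importlib import metadata


def console_scripts(dist_name: str):
    """List the console-script entry points a pip-installed distribution registers."""
    try:
        eps = metadata.distribution(dist_name).entry_points
    except metadata.PackageNotFoundError:
        return None  # distribution not installed at all
    return sorted(ep.name for ep in eps if ep.group == "console_scripts")


print(console_scripts("pip"))       # e.g. a list containing 'pip'
print(console_scripts("diffacto"))  # None or [] would explain "command not found"
```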
Neither the journal article nor this website gives a formulaic definition of Diffacto's imputation process. When I set `-impute_threshold 1.1` to effectively turn off imputation, it somehow makes a large difference to the result matrix: with the default I get a 100% complete matrix, but with imputation turned off I get an average of 17% missing values per protein. I wouldn't have thought there was much range between 99% and 100% complete, but apparently there is. A formula would help users understand it.
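To make the completeness numbers above concrete, here is a sketch of how completeness can be measured on a peptide-by-sample matrix (the toy matrix is mine, not Diffacto output):

```python
import numpy as np


def completeness(mat: np.ndarray) -> float:
    """Fraction of non-NaN entries in a quantification matrix."""
    return 1.0 - float(np.isnan(mat).mean())


mat = np.array([[1.0, np.nan, 2.0],
                [3.0, 4.0, np.nan]])
print(completeness(mat))  # 4 of 6 entries present -> 0.666...
```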
What aggregation does specifying `-samples` cause to be done? I can't find a formula specifying it explicitly in the journal article.
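To clarify what I mean by aggregation: my guess (purely an assumption, not something stated in the article) is something like averaging the runs that map to the same sample label, e.g.:

```python
import numpy as np

# Hypothetical: three runs mapped to two sample labels by the -samples file
values = {"run1": 1.0, "run2": 3.0, "run3": 5.0}
sample_of = {"run1": "A", "run2": "A", "run3": "B"}

grouped = {}
for run, value in values.items():
    grouped.setdefault(sample_of[run], []).append(value)
means = {s: float(np.mean(v)) for s, v in grouped.items()}
print(means)  # {'A': 2.0, 'B': 5.0}
```

A formula would settle whether it is a mean, a median, or something weighted.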
Without knowing which peptides are above or below the threshold, as described in Issue #15, it is hard to transform the ratio output into linear scale. Could the software get an option so the user could set it and have the output restored to linear scale?
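Assuming the reported ratios are log2-scaled (my assumption; the base is exactly the ambiguity at issue here), restoring linear scale would just be an exponentiation:

```python
import numpy as np

log_ratios = np.array([-1.0, 0.0, 2.0])  # hypothetical Diffacto ratio output
linear = np.exp2(log_ratios)             # assumes base-2 logs
print(linear)  # [0.5 1.  4. ]
```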
Not an issue, just a thank you and notice to the developers.
https://github.com/bioconda/bioconda-recipes/tree/master/recipes/diffacto
https://toolshed.g2.bx.psu.edu/view/galaxyp/diffacto/3cc7ce0822a1
https://github.com/galaxyproteomics/tools-galaxyp/tree/master/tools/diffacto
The software uses a threshold of 0.5, and peptides below that value are discarded for not covarying enough with the other peptides of a particular protein. How well does such an approach work for stably-expressed proteins (sometimes called housekeeping proteins), which are expected to be fairly constant in abundance across all samples, with the main contributions to variation in the measurements being Poisson sampling variability or technical batch effects? For example, GAPDH (G3P_HUMAN) in my data set has 58 transitions, of which only 25 are used for quantitation. That means 33, more than half, are discarded as incoherent. Is Diffacto biased against housekeeping proteins?
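As a toy illustration of the concern (this is my own simplified covariation filter, not Diffacto's actual weighting scheme): when the true profile is flat, the correlation with the consensus is dominated by noise, so peptides can fall under a 0.5 threshold even though their abundances agree well:

```python
import numpy as np

rng = np.random.default_rng(0)


def covariation_weights(peps: np.ndarray) -> np.ndarray:
    """Correlation of each peptide profile with the protein's median profile."""
    consensus = np.median(peps, axis=0)
    return np.array([np.corrcoef(p, consensus)[0, 1] for p in peps])


# Regulated protein: 4 peptides tracking a shared trend across 5 samples
regulated = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.normal(0, 0.05, (4, 5))
# Housekeeping protein: flat profile, the variation is pure measurement noise
flat = 3.0 + rng.normal(0, 0.05, (4, 5))

print((covariation_weights(regulated) >= 0.5).sum())  # all 4 peptides kept
print((covariation_weights(flat) >= 0.5).sum())       # often fewer survive
```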
Hi all,
I've noticed that peptide-to-protein mapping (the _map_seq function) is done in a simple way, without taking any cleavage rules into account.
I made theoretical calculations for such mapping against the SwissProt database and found that ~10% of peptide matches (for peptides of length >= 6) fall within a protein sequence but are not in the list of tryptic peptides of that protein. This means some "false" peptides can be used for quantitation.
Also, such simple mapping makes the calculation time grow rapidly with the number of peptides in the analysis.
So, I'm not sure, but it seems that Diffacto's accuracy and performance could be improved by adding cleavage rules to _map_seq. Of course, all of this only affects cases where the user does not have a "protein ID(s)" column in the input file.
I have Python code for enzyme mapping in my own projects, so it would be easy to implement here if you think it would be useful.
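To sketch the idea, here is a minimal trypsin rule of my own (cleave after K/R except before P, no missed cleavages); `pyteomics.parser.cleave` offers a fuller implementation, and the example sequence is arbitrary:

```python
import re


def tryptic_peptides(seq: str, min_len: int = 6) -> set:
    """In-silico digest: cut after K or R unless the next residue is P."""
    pieces = re.split(r"(?<=[KR])(?!P)", seq)
    return {p for p in pieces if len(p) >= min_len}


protein = "MKWVTFISLLFLFSSAYSRGVFRRDTHK"
print(tryptic_peptides(protein))
```

A substring of `protein` that is absent from this set (e.g. one straddling a cleavage site) would be one of the "false" matches described above.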
Regards,
Mark
The journal article refers to conditions a lot. For example, if there are I conditions, J experiments, and K peptides, then the residual sum of squares has IJK - I - m degrees of freedom. But, looking at the Usage section of the README file, there is no input parameter to define each sample's condition (my guess is values like "Healthy" and "Cancer"). Where have conditions gone, or am I misunderstanding?
[email protected] suggests that we should make it possible to export peptide weights, or at least which peptides go above and below the weight threshold.
I see thousands of warnings about dividing by zero. Should I, as an end-user, be concerned and do anything differently?
```
diffacto.py:171: RuntimeWarning: divide by zero encountered in double_scalars
diffacto.py:208: RuntimeWarning: invalid value encountered in true_divide
```
It also happens for the HBY20 Mix example data set. Sometimes `ss_resid` is 0. Please document this phenomenon so users know more about it. I also frequently see `-inf` in the S/N column. Should the columns of the results table also be documented?