statisticalbiotechnology / diffacto Goto Github PK

License: Other

Python 98.80% Shell 1.20%

diffacto's Issues

Release 1.0.5

I would like to create a bioconda recipe for diffacto using the latest code.
Could you make a release for that?
Thank you.

Run-to-run reproducibility issue

First of all, thank you for developing such great software!

I've noticed that results change slightly from run to run. For example, when I apply my workflow to the iPRG2015 data, about half of the diffacto runs yield only 6 true-positive proteins that pass the p-value threshold with Bonferroni correction. The other half yield 6 true positives plus 1 false-positive protein. I do not change the input file or the diffacto parameters; I simply start the script again. The MC simulation is turned off. Is it possible (and would it be correct) to set a fixed random seed somewhere in the code, or to expose one as an optional parameter?

I can share the input files and parameters, but the behavior seems to reproduce on any dataset.
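One way to test whether random number generation explains the run-to-run differences is to fix the seeds before the analysis starts. The sketch below assumes the randomness comes from Python's `random` module or NumPy's global generator, which has not been confirmed for diffacto; `seed_everything` is a hypothetical helper, not part of the package.

```python
import random

import numpy as np


def seed_everything(seed=0):
    """Seed Python's and NumPy's global random number generators.

    Hypothetical helper for illustration; not a diffacto function.
    """
    random.seed(seed)
    np.random.seed(seed)


# Two draws made after the same seed should be identical.
seed_everything(42)
first = np.random.rand(3)
seed_everything(42)
second = np.random.rand(3)
print(np.array_equal(first, second))
```

If results still differ with fixed seeds, the variation may come from elsewhere, for example from iterating over sets of strings, whose order can change between processes unless PYTHONHASHSEED is fixed.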

P.S. Should proteins with a negative S/N ratio be excluded from the results? On a standard with known protein concentrations, the true positives always seem to have a positive S/N ratio and the false positives a negative one.
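The filtering described here can be sketched as a Bonferroni cut followed by an optional positive-S/N requirement. The tuple layout, the function name `bonferroni_hits`, and the numbers are made up for illustration; they are not diffacto's output format or real results.

```python
def bonferroni_hits(results, alpha=0.05, require_positive_sn=False):
    """Return protein names passing a Bonferroni-corrected p-value cut.

    results: list of (protein, p_value, sn) tuples (hypothetical layout).
    """
    if not results:
        return []
    # Bonferroni: divide the significance level by the number of tests.
    threshold = alpha / len(results)
    hits = []
    for protein, p_value, sn in results:
        if p_value > threshold:
            continue
        if require_positive_sn and sn <= 0:
            continue
        hits.append(protein)
    return hits


# Made-up numbers for illustration only.
results = [
    ("P1", 0.0001, 3.2),
    ("P2", 0.004, -1.5),
    ("P3", 0.2, 0.7),
    ("P4", 0.03, 2.0),
]
print(bonferroni_hits(results))
print(bonferroni_hits(results, require_positive_sn=True))
```

Whether a negative S/N should disqualify a protein is the open question; the flag only makes the two choices easy to compare on a standard.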

Regards,
Mark

Between-sample Variation Appears Increased After Diffacto

I drew boxplots of the 140 samples (two replicates per patient) before running Diffacto, and again of the 70 patients after running it. After aggregation to proteins, the boxes differ considerably from sample to sample, and some do not overlap any other box at all. Is this concerning?

Broken Links in Installation Instructions

When I click on Anaconda, I get an error. When I click on pyteomics, I also get an error. They need updating to valid URLs.

Possible bug in zero_center_normalize with method = "GMM"?

I've been looking at the GMM normalization method in diffacto, and think there might be a bug. The relevant code is:

        ''' two-component Gaussian mixture model '''
        from sklearn.mixture import GMM
        gmm = GMM(2)
        norm_scale = []
        for sp in samples:
            v = df[sp].values
            v = v[np.logical_not(np.isnan(v))]
            v = v[np.logical_not(np.isinf(v))]
            try:
                gmm.fit(np.matrix(v.values).T) #no property .values, so bug?
                vmean = gmm.means_[np.argmin(gmm.covars_)]
                norm_scale.append(vmean)
            except:
                norm_scale.append(np.nanmean(v))
        norm_scale = np.array(norm_scale)

Notice that v is set to a numpy array, but then in the try block "v.values" is used. In my debugging so far, v has never had a values property, resulting in an exception, so the code calls np.nanmean instead. Is this a bug?
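The behaviour described above can be reproduced with NumPy alone. The sketch below mirrors the filtering steps from the quoted code on a made-up array and shows that the `.values` access raises `AttributeError`, which a bare `except:` would silently turn into the `np.nanmean` fallback. The last line shows one possible way to build the column vector that a fit call would need; it is a suggestion, not the maintainers' fix.

```python
import numpy as np

# Made-up sample column with a missing and an infinite value.
v = np.array([1.0, 2.0, np.nan, 4.0, np.inf])
v = v[np.logical_not(np.isnan(v))]
v = v[np.logical_not(np.isinf(v))]

try:
    # Same expression as in the quoted code: v is an ndarray here,
    # so it has no .values attribute.
    column = np.matrix(v.values).T
    outcome = "fit path reached"
except AttributeError as exc:
    outcome = "fallback: " + type(exc).__name__
    center = np.nanmean(v)

print(outcome)

# Possible replacement for the failing expression: an (n, 1) column.
fixed_column = v.reshape(-1, 1)
print(fixed_column.shape)
```

Separately, `sklearn.mixture.GMM` was removed in scikit-learn 0.20 in favour of `GaussianMixture`, so on recent versions the import itself, which sits outside the `try` block, would fail before this line is reached.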

feature suggestion: rescale intensities to preserve TIC

Would it make sense to scale the quantities in the output to match the original TIC?
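A minimal sketch of what such a rescaling could look like for one sample, assuming the output quantities are on a linear scale and that TIC means the summed intensity of the original input for that sample. `rescale_to_tic` is a hypothetical helper and the numbers are made up.

```python
def rescale_to_tic(protein_quantities, original_tic):
    """Scale one sample's protein quantities so they sum to original_tic.

    protein_quantities: dict mapping protein name to a linear-scale quantity.
    """
    total = sum(protein_quantities.values())
    if total == 0:
        raise ValueError("cannot rescale a sample whose quantities sum to zero")
    factor = original_tic / total
    return {name: q * factor for name, q in protein_quantities.items()}


# Made-up example: quantities sum to 4.0, original TIC is 8.0.
scaled = rescale_to_tic({"A": 1.0, "B": 3.0}, original_tic=8.0)
print(scaled)
```

A single factor per sample preserves the ratios between proteins within that sample; it would not be appropriate for log-scale output without converting first.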

Installation Instructions Missing Something

I am using virtualenv and pip install. The listed packages installed fine, but it's unclear whether that also installs diffacto.

Installing collected packages: threadpoolctl, joblib, decorator, scikit-learn, pyteomics, networkx
Successfully installed decorator-4.4.2 joblib-1.0.1 networkx-2.5.1 pyteomics-4.4.2 scikit-learn-0.24.1 threadpoolctl-2.1.0
(Python3) /verona/nobackup$ run_diffacto.py
-bash: run_diffacto.py: command not found

Meanwhile, pip install cutadapt, for another Python package, works straight away.

(Python3) /verona/nobackup$ cutadapt
This is cutadapt 3.4 with Python 3.7.3
Run "cutadapt --help" to see command-line options

Is there something important missing from the installation instructions for pip install for diffacto?

Imputation Algorithm Not Documented

Neither the journal article nor this website gives a formal definition of Diffacto's imputation process. When I set -impute_threshold 1.1 to effectively turn off imputation, it makes a large difference to the result matrix. By default, I get a 100% complete matrix, but with imputation turned off, I get an average of 17% missing values per protein. I wouldn't have thought there was much room between 99% and 100%, but it seems there is a lot. A formula would help users understand it.
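Until the rule is documented, one reading that would be consistent with these observations is sketched below. It assumes that a sample group is imputed when its fraction of missing values is larger than the threshold, and that the default threshold is 0.99; both are assumptions, and `should_impute` is a hypothetical function, not diffacto code. The sketch covers only the decision, not how an imputed value would be chosen.

```python
import math


def should_impute(group_values, impute_threshold=0.99):
    """Decide whether a sample group would be imputed (assumed rule).

    Assumption: impute when the missing fraction exceeds the threshold.
    """
    missing = sum(1 for v in group_values if v is None or math.isnan(v))
    return missing / len(group_values) > impute_threshold


nan = float("nan")
print(should_impute([nan, nan, nan]))        # missing fraction 1.0
print(should_impute([nan, nan, 5.0]))        # missing fraction about 0.67
print(should_impute([nan, nan, nan], 1.1))   # threshold can never be exceeded
```

Under this reading, the default would fill in only groups that are missing entirely, and a threshold of 1.1 would leave those groups empty, which could account for the gap between a complete matrix and one with missing values.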

Document -samples In More Detail

What aggregation does specifying -samples cause to be done? I can't find a formula specifying it explicitly in the journal article.

Output File Linear Scale Feature Request

Without knowing which peptides are above or below the threshold, as described in Issue #15, it is hard to transform the ratio output into linear scale. Could the software get an option that the user could set to have the output restored to linear scale?

diffacto is now in bioconda and Galaxy

Not an issue, just a thank you and notice to the developers.

https://github.com/bioconda/bioconda-recipes/tree/master/recipes/diffacto
https://toolshed.g2.bx.psu.edu/view/galaxyp/diffacto/3cc7ce0822a1
https://github.com/galaxyproteomics/tools-galaxyp/tree/master/tools/diffacto

Housekeeping Proteins and Incoherent Peptides

The software uses a threshold of 0.5, and peptides below that value are discarded for not covarying enough with the other peptides of a particular protein. How well does such an approach work for stably expressed proteins (sometimes called housekeeping proteins) that are expected to be fairly constant in abundance across all samples, with the main contributions to variation in measurements being Poisson sampling variability or technical batch effects? For example, GAPDH (G3P_HUMAN) in my data set has 58 transitions, of which only 25 are used for quantitation. That means 33, or more than half, are discarded for being incoherent. Is Diffacto biased against housekeeping proteins?
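The arithmetic behind the GAPDH example can be sketched as a simple weight filter. The weights below are synthetic, chosen only to reproduce the 25 kept and 33 discarded counts; `split_by_weight` is a hypothetical helper, and the comparison (keep when the weight is at least the threshold) is an assumption about how the cut is applied.

```python
def split_by_weight(weights, threshold=0.5):
    """Split features into kept and discarded by weight.

    Assumption: features with weight below the threshold are discarded.
    """
    kept = {name: w for name, w in weights.items() if w >= threshold}
    discarded = {name: w for name, w in weights.items() if w < threshold}
    return kept, discarded


# Synthetic weights: 25 coherent transitions, 33 incoherent ones.
weights = {f"t{i}": 0.8 for i in range(25)}
weights.update({f"t{i}": 0.3 for i in range(25, 58)})

kept, discarded = split_by_weight(weights)
fraction_discarded = round(len(discarded) / len(weights), 3)
print(len(kept), len(discarded), fraction_discarded)
```

With 33 of 58 transitions below the cut, about 57% of the measurements for this protein would not contribute to quantitation.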

Enzyme cleavage rules for peptide-to-protein mapping

Hi all,

I've noticed that peptides are mapped to protein sequences (the _map_seq function) by simple matching, without taking any cleavage rules into account.
I ran a theoretical calculation of such mapping on the Swiss-Prot database and found that ~10% of matched peptides (with length >= 6) occur in a protein sequence but do not belong to the list of tryptic peptides of that protein. This means that some "false" peptides may be used for quantitation.

Also, such simple mapping increases calculation time exponentially as the number of peptides in the analysis grows.

So, I'm not sure, but it seems that diffacto's accuracy and performance could be improved by adding cleavage rules to _map_seq. Of course, all of this affects only the cases where the user does not have a "protein ID(s)" column in the input file.

I have Python code for enzyme mapping in my own projects, so it would be easy to implement here if you think it would be useful.
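To make the proposal concrete, here is a minimal sketch of cleavage-aware mapping using only the standard library (Python 3.7 or later, which allows splitting on empty matches). It uses the common trypsin rule, cleave after K or R unless the next residue is P; `tryptic_peptides` is an illustrative function, not the code offered above and not diffacto's `_map_seq`. The example sequence is used purely for illustration.

```python
import re


def tryptic_peptides(protein, missed_cleavages=0, min_length=6):
    """Return the set of tryptic peptides of a protein sequence.

    Rule used here: cleave after K or R, except when the next residue is P.
    """
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = set()
    for start in range(len(pieces)):
        # Join up to `missed_cleavages` additional neighbouring pieces.
        last = min(start + missed_cleavages, len(pieces) - 1)
        for end in range(start, last + 1):
            peptide = "".join(pieces[start:end + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFK"
peptides = tryptic_peptides(protein)

# Compare plain substring matching with cleavage-aware matching.
for candidate in ("SEIAHR", "FISLLLL"):
    print(candidate, candidate in protein, candidate in peptides)
```

Both candidates are substrings of the protein, but only the first is a tryptic peptide. Looking peptides up in a precomputed set is also a constant-time operation on average, in contrast to scanning every protein sequence for every peptide.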

Regards,
Mark

No User Specification of Samples' Conditions

The journal article refers to conditions a lot. For example, if there are I conditions, J experiments and K peptides, then the residual sum of squares has IJK - I - m degrees of freedom. But looking at the Usage section of the README file, there is no input parameter to define each sample's condition (my guess is values like "Healthy" and "Cancer"). Where have the conditions gone, or am I misunderstanding?
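If conditions are supplied through the -samples option as a tab-separated list with one run and its sample group per line (an assumption about the file format, which should be checked against the README), the grouping could be read as sketched below. `read_sample_groups` and the run names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_sample_groups(handle):
    """Map each sample group (condition) to its list of runs.

    Assumed format: one "run<TAB>group" pair per line.
    """
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        run, group = row[0].strip(), row[1].strip()
        groups[group].append(run)
    return dict(groups)


# Made-up sample list standing in for a file on disk.
example = "run1\tHealthy\nrun2\tHealthy\nrun3\tCancer\n"
groups = read_sample_groups(io.StringIO(example))
print(groups)
```

In this reading, the number of distinct groups would play the role of I, the number of conditions, in the degrees-of-freedom expression.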

Output of peptide weights

A user suggests that we should make it possible to export peptide weights, or at least to report which peptides fall above and below the weight threshold.

Numerous Divide By Zero Warnings

I see thousands of warnings about dividing by zero. Should I, as an end user, be concerned and do anything differently?

diffacto.py:171: RuntimeWarning: divide by zero encountered in double_scalars
diffacto.py:208: RuntimeWarning: invalid value encountered in true_divide

It also happens for the HBY20 Mix example data set. Sometimes, ss_resid is 0. Please document this phenomenon for users to know more about it. I also frequently see -inf in the S/N column. Should the columns of the results table also be documented?
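How a zero ss_resid can lead to these warnings and to infinite S/N values can be illustrated with NumPy. The sketch assumes the S/N is a log-scaled ratio of a signal sum of squares to a residual sum of squares; the exact formula in diffacto.py may differ, and `signal_to_noise_db` is a hypothetical function. The warnings are suppressed here with `np.errstate` only to keep the output clean.

```python
import numpy as np


def signal_to_noise_db(ss_signal, ss_resid):
    """Log-scaled ratio of two sums of squares (assumed S/N form)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.float64(ss_signal) / np.float64(ss_resid)
        return 10 * np.log10(ratio)


print(signal_to_noise_db(4.0, 1.0))  # finite value
print(signal_to_noise_db(4.0, 0.0))  # zero residual: division by zero, +inf
print(signal_to_noise_db(0.0, 2.0))  # zero signal: log10(0), -inf
print(signal_to_noise_db(0.0, 0.0))  # 0/0: invalid value, nan
```

Under this assumption, -inf in the S/N column would correspond to a zero signal term rather than a zero residual, which is one reason a description of the result columns would help.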
