diffacto's People

Contributors

caetera, levitsky, markmipt, percolator, userbz, vnaum

diffacto's Issues

Release 1.0.5

I would like to create a bioconda recipe for diffacto using the latest code.
Could you make a release for that?
Thank you.

Run-to-run reproducibility issue

First of all, thank you for developing such great software!

I've noticed that results change slightly from run to run. For example, in my workflow applied to the iPRG2015 data, in half of the diffacto runs I get only 6 true positive proteins passing the p-value threshold with Bonferroni correction. In the other half of the runs, I get 6 true positives + 1 false positive protein. I do not change the input file or the diffacto parameters; I just start the script again. The MC simulation is turned off. Is it possible (and is it correct?) to set a fixed random seed somewhere in the code (or via an optional parameter)?

I can share the input files and parameters, but the behaviour seems to occur with any dataset.

P.S. Should proteins with a negative S/N ratio be excluded from the results? On a standard with known protein concentrations, true positives always seem to have a positive S/N ratio, while false positives have a negative S/N.

Regards,
Mark
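
The request above for a fixed seed can be illustrated with a minimal sketch. It assumes the run-to-run differences come from a pseudo-random number generator somewhere in the pipeline, which the issue does not confirm. `noisy_estimate` is a hypothetical stand-in, not a diffacto function. The standard-library `random` module is used here; NumPy-based code would typically seed with `numpy.random.seed` or `numpy.random.default_rng(seed)` instead.

```python
import random


def noisy_estimate(values, seed=None):
    """Hypothetical stand-in for a step with a random component."""
    rng = random.Random(seed)  # seed=None gives a different stream on each run
    jitter = [rng.gauss(0.0, 1e-3) for _ in values]
    return sum(v + j for v, j in zip(values, jitter)) / len(values)


data = [1.0, 2.0, 3.0, 4.0]

# With a fixed seed, repeated runs give identical results.
run_a = noisy_estimate(data, seed=42)
run_b = noisy_estimate(data, seed=42)

# Without a seed, the estimate differs slightly between runs, which is
# enough to flip a protein sitting right at a p-value threshold.
run_c = noisy_estimate(data)
run_d = noisy_estimate(data)

print(run_a == run_b)
print(run_c, run_d)
```

If a seed option were added, exposing it as an optional parameter (with the unseeded behaviour as default) would keep existing workflows unchanged.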

Between-sample Variation Appears Increased After Diffacto

I drew boxplots of the 140 samples (two replicates per patient) before running Diffacto, and again of the 70 patients after running it. The boxes become quite different from sample to sample once the aggregation to proteins is done. Some of the boxes don't overlap other boxes at all. Is this concerning?

[Image: DifferentAfterDiffacto]

Possible bug in zero_center_normalize with method = "GMM"?

I've been looking at the GMM normalization method in diffacto, and think there might be a bug. The relevant code is:

        ''' two-component Gaussian mixture model '''
        from sklearn.mixture import GMM
        gmm = GMM(2)
        norm_scale = []
        for sp in samples:
            v = df[sp].values
            v = v[np.logical_not(np.isnan(v))]
            v = v[np.logical_not(np.isinf(v))]
            try:
                gmm.fit(np.matrix(v.values).T) #no property .values, so bug?
                vmean = gmm.means_[np.argmin(gmm.covars_)]
                norm_scale.append(vmean)
            except:
                norm_scale.append(np.nanmean(v))
        norm_scale = np.array(norm_scale)

Notice that v is set to a numpy array, but then in the try block "v.values" is used. In my debugging so far, v has never had a values property, resulting in an exception, so the code calls np.nanmean instead. Is this a bug?
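
The behaviour described above can be reproduced in isolation. This is a minimal sketch, not diffacto's code: `scale_as_quoted` and `scale_fixed` are hypothetical helpers, and `fit` is a stand-in for `gmm.fit` so the example runs without scikit-learn. It shows that a NumPy array has no `.values` attribute, that the bare `except` silently turns the resulting `AttributeError` into the `nanmean` fallback, and that passing a reshaped array reaches the fit call.

```python
import numpy as np

v = np.array([0.1, np.nan, 0.3, np.inf, 0.2])
v = v[np.logical_not(np.isnan(v))]
v = v[np.logical_not(np.isinf(v))]

print(hasattr(v, "values"))  # an ndarray has no .values attribute


def scale_as_quoted(v, fit):
    """Same control flow as the quoted snippet; `fit` stands in for gmm.fit."""
    try:
        fit(v.values)  # raises AttributeError before fit is ever called
        return "gmm"
    except Exception:  # the original uses a bare except, which also hides this
        return "nanmean"


def scale_fixed(v, fit):
    """Possible fix (an assumption): pass a column vector built from v itself."""
    try:
        fit(v.reshape(-1, 1))
        return "gmm"
    except Exception:
        return "nanmean"


calls = []  # records what the stand-in fit function receives
path_quoted = scale_as_quoted(v, calls.append)
path_fixed = scale_fixed(v, calls.append)
print(path_quoted, path_fixed)
```

Separately, `sklearn.mixture.GMM` was removed in scikit-learn 0.20 in favour of `GaussianMixture` (which exposes `covariances_` rather than `covars_`), so this branch would likely need updating for recent scikit-learn versions regardless.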

Installation Instructions Missing Something

I am using virtualenv and pip install. The listed packages installed fine, but it's unclear whether that also installs diffacto.

Installing collected packages: threadpoolctl, joblib, decorator, scikit-learn, pyteomics, networkx
Successfully installed decorator-4.4.2 joblib-1.0.1 networkx-2.5.1 pyteomics-4.4.2 scikit-learn-0.24.1 threadpoolctl-2.1.0
(Python3) /verona/nobackup$ run_diffacto.py
-bash: run_diffacto.py: command not found

Meanwhile pip install cutadapt, another Python software, works straight away.

(Python3) /verona/nobackup$ cutadapt
This is cutadapt 3.4 with Python 3.7.3
Run "cutadapt --help" to see command-line options

Is there something important missing from the pip installation instructions for diffacto?
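
A quick way to check what an installation actually provided is to ask Python directly. The sketch below is only a diagnostic, under the assumption that the tool would be exposed either as a `run_diffacto.py` script on `PATH` or as an importable module named `diffacto`; neither name is confirmed by the installation instructions, and `installation_report` is a hypothetical helper.

```python
import importlib.util
import shutil


def installation_report(command, module):
    """Report whether a command is on PATH and whether a module is importable."""
    return {
        "command_on_path": shutil.which(command) is not None,
        "module_importable": importlib.util.find_spec(module) is not None,
    }


# Both names below are assumptions taken from the issue text.
report = installation_report("run_diffacto.py", "diffacto")
print(report)
```

If both values are `False` inside the activated virtualenv, pip installed only the dependencies and the script itself still has to be obtained separately.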

Imputation Algorithm Not Documented

Neither the journal article nor this website gives a formula for Diffacto's imputation process. When I set -impute_threshold 1.1 to effectively turn off imputation, it makes a large difference to the result matrix. With the default, I get a 100% complete matrix, but with imputation turned off, I get an average of 17% missing values per protein. I wouldn't have thought there was much room between 99% and 100%, but it seems there is. A formula would help users understand it.
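
One reading of the threshold that would explain the observation above is sketched here. It assumes the rule is "impute a sample group when its fraction of missing values exceeds the threshold" and that the default is 0.99 (inferred from the 99% figure in this issue). Both are assumptions, and `would_impute` is a hypothetical helper rather than diffacto's implementation.

```python
import math


def missing_fraction(values):
    """Fraction of NaN entries in one sample group."""
    return sum(1 for v in values if math.isnan(v)) / len(values)


def would_impute(values, threshold=0.99):
    """Assumed rule: impute only when the missing fraction exceeds the threshold."""
    return missing_fraction(values) > threshold


nan = math.nan
group_all_missing = [nan, nan, nan]
group_partly_missing = [nan, 1.2, nan]

default_all = would_impute(group_all_missing)       # 3/3 = 1.0 > 0.99
default_part = would_impute(group_partly_missing)   # 2/3 is not > 0.99
off_all = would_impute(group_all_missing, 1.1)      # 1.0 can never exceed 1.1

print(default_all, default_part, off_all)
```

Under this assumed rule, a group of three samples can only have missing fractions of 0, 1/3, 2/3 or 1, so a 0.99 threshold effectively means "every value in the group is missing". The gap between 0.99 and 1.1 then separates "fill completely empty groups" from "never fill anything", which could account for a large change in completeness.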

Document -samples In More Detail

What aggregation is performed when -samples is specified? I can't find a formula specifying it explicitly in the journal article.

Output File Linear Scale Feature Request

Without knowing which peptides are above or below the threshold, as described in Issue #15, it is hard to transform the ratio output into linear scale. Could the software get an option so the user could set it and have the output restored to linear scale?

Housekeeping Proteins and Incoherent Peptides

The software uses a threshold of 0.5, and peptides below that value are discarded for not covarying enough with the other peptides of a particular protein. How well does such an approach work for stably expressed proteins (sometimes called housekeeping proteins) that are expected to be fairly constant in abundance across all samples, with the main contribution to variation in measurements being Poisson sampling variability or technical batch effects? For example, GAPDH (or G3P_HUMAN) in my data set has 58 transitions, of which only 25 are used for quantitation. That means 33, or more than half, are discarded for being incoherent. Is Diffacto biased against housekeeping proteins?
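
The concern above can be made concrete with a small simulation. This is a sketch under stated assumptions: plain Pearson correlation between two peptides is used as a stand-in for Diffacto's coherence weights (which are not computed this way), and the abundance and noise levels are made-up values chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(0)
n_samples = 200
noise_sd = 0.1  # assumed technical noise on the log scale


def simulate_pair(abundance_sd):
    """Two peptides of one protein: shared log-abundance plus independent noise."""
    abundance = [rng.gauss(0.0, abundance_sd) for _ in range(n_samples)]
    pep1 = [a + rng.gauss(0.0, noise_sd) for a in abundance]
    pep2 = [a + rng.gauss(0.0, noise_sd) for a in abundance]
    return pearson(pep1, pep2)


r_regulated = simulate_pair(abundance_sd=1.0)     # protein varies across samples
r_housekeeping = simulate_pair(abundance_sd=0.0)  # protein is constant

print(round(r_regulated, 3), round(r_housekeeping, 3))
```

When the protein itself does not vary, the only variation left in each peptide is independent noise, so peptides of a perfectly well-measured housekeeping protein show almost no covariation. Any covariation-based filter would then be expected to reject many of them, whatever the exact weighting scheme.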

Enzyme cleavage rules for peptide-to-protein mapping

Hi all,

I've noticed that peptide mapping to protein sequences (the _map_seq function) is done in a simple way, without taking any cleavage rules into account.
I did a theoretical calculation of such mapping on the SwissProt database and found that ~10% of the peptide matches (with length >= 6) belong to protein sequences but not to the list of tryptic peptides of those proteins. This means that some "false" peptides may be used for quantitation.

Also, such simple mapping increases the calculation time exponentially as the number of peptides in the analysis grows.

So, I'm not sure, but it seems that diffacto's efficiency and performance could be improved by adding cleavage rules to _map_seq. Of course, all of this affects only the cases where the user does not have a "protein ID(s)" column in the input file.

I have Python code for enzyme mapping in my own projects, so it would be easy to implement here if you think it would be useful.

Regards,
Mark
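
The proposal above can be sketched as follows. This is not the author's code and not diffacto's `_map_seq`; `tryptic_peptides` and `build_index` are hypothetical helpers, the toy sequences are invented, and the commonly used trypsin rule (cleave after K or R unless followed by P) is assumed. Digesting each protein once and looking peptides up in a dictionary replaces a substring scan over every protein for every peptide.

```python
import re
from collections import defaultdict


def tryptic_peptides(sequence, missed_cleavages=0, min_length=6):
    """Peptides from cleaving after K or R, except when followed by P."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = set()
    for start in range(len(pieces)):
        stop = min(start + missed_cleavages + 1, len(pieces))
        for end in range(start + 1, stop + 1):
            peptide = "".join(pieces[start:end])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def build_index(proteins, **digest_options):
    """Map each tryptic peptide to the set of protein IDs that contain it."""
    index = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, **digest_options):
            index[peptide].add(protein_id)
    return index


# Toy sequences, invented for illustration.
proteins = {"P1": "AAAAAKPGGGGGRCCCCCCKDDDDDD", "P2": "EEEEEEKCCCCCCK"}
index = build_index(proteins)

shared = index.get("CCCCCCK", set())          # tryptic in both proteins
substring_only = "GGGGGRCC"                   # in P1's sequence, but not tryptic
print(sorted(shared))
print(substring_only in proteins["P1"], substring_only in index)
```

The second print illustrates the "false" matches mentioned in the post: a plain substring search accepts `GGGGGRCC`, while the digest-based index does not. Allowing one or two missed cleavages through `missed_cleavages` would be a natural option to expose.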

No User Specification of Samples' Conditions

The journal article refers to conditions a lot. For example, if there are I conditions, J experiments and K peptides, then the Residual Sum of Squares has IJK - I - m degrees of freedom. But, looking at the Usage section of the ReadMe file, there is no input parameter to define each sample's condition (my guess is values like "Healthy" and "Cancer"). Where have the conditions gone, or am I misunderstanding?

Numerous Divide By Zero Warnings

I see thousands of warnings about dividing by zero. Should I, as an end-user, be concerned and do anything differently?

diffacto.py:171: RuntimeWarning: divide by zero encountered in double_scalars
diffacto.py:208: RuntimeWarning: invalid value encountered in true_divide

It also happens for the HBY20 Mix example data set. Sometimes, ss_resid is 0. Please document this phenomenon for users to know more about it. I also frequently see -inf in the S/N column. Should the columns of the results table also be documented?
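
How a zero residual can produce both the warnings and the infinite S/N values can be shown with a minimal sketch. The formula below is an assumption chosen for illustration (a signal-to-residual ratio in decibels), not necessarily the expression on the reported lines of diffacto.py, and `snr_db` is a hypothetical helper.

```python
import numpy as np


def snr_db(ss_total, ss_resid):
    """Hypothetical S/N in decibels: explained vs. residual sum of squares."""
    ss_total = np.float64(ss_total)
    ss_resid = np.float64(ss_resid)
    # Without errstate, NumPy emits RuntimeWarnings for x/0, 0/0 and log10(0)
    # but still returns inf, nan or -inf instead of raising.
    with np.errstate(divide="ignore", invalid="ignore"):
        return 10 * np.log10((ss_total - ss_resid) / ss_resid)


perfect_fit = snr_db(4.0, 0.0)    # residual is zero: +inf
no_signal = snr_db(4.0, 4.0)      # nothing explained: -inf
empty = snr_db(0.0, 0.0)          # 0/0: nan
mostly_noise = snr_db(4.0, 3.0)   # residual exceeds signal: negative dB

print(perfect_fit, no_signal, empty, mostly_noise)
```

Under this assumed definition the warnings are harmless in themselves, but rows with infinite or NaN S/N correspond to degenerate fits and may deserve separate handling downstream.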
