
jpquast / protti


Picotti lab data analysis package.

Home Page: https://jpquast.github.io/protti/

License: Other

R 100.00%
r proteomics mass-spectrometry data-analysis systems-biology lip-ms omics protein

protti's Introduction

Hi there

  • I am currently doing my PhD in the Picotti Lab at ETH in Zurich, Switzerland ⛰️
  • I am a biochemist who works on protein-metal interactions using mass spectrometry
  • I am interested in unstudied organisms and automation solutions using robots

Check out our R package protti, developed for easy data analysis of bottom-up proteomics and LiP-MS data.

DOI: 10.1093/bioadv/vbab041

Also check out my R package ggplate, which can create simple plots of biological culture plates as well as microplates.


You can also find me on Twitter and LinkedIn.



protti's Issues

GO enrichment, enrichment type

Add an argument to calculate_go_enrichment() that allows the user to specify the enrichment type that should be displayed.
Options could be: "all", "enriched", "deenriched".
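
A minimal sketch of how such an argument could behave, assuming the enrichment result has a column that marks each term as enriched or deenriched. The helper filter_enrichment_type(), the column name enrichment_type and the toy data are all hypothetical and not part of protti.

```r
# Hypothetical helper (not part of protti): filter an enrichment result
# by the type of enrichment the user wants to see.
filter_enrichment_type <- function(result, type = c("all", "enriched", "deenriched")) {
  # match.arg() rejects anything other than the three allowed options
  type <- match.arg(type)
  if (type == "all") {
    return(result)
  }
  result[result$enrichment_type == type, , drop = FALSE]
}

# Toy result table; the column name `enrichment_type` is an assumption
go_result <- data.frame(
  term = c("term_a", "term_b", "term_c"),
  enrichment_type = c("enriched", "deenriched", "enriched"),
  stringsAsFactors = FALSE
)

enriched_only <- filter_enrichment_type(go_result, "enriched")
```

Using match.arg() keeps "all" as the default and turns a misspelled option into an immediate error.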

confusions about the qc_sample_correlation function

I have tried many times to use the "qc_sample_correlation" function to calculate the correlation level, but an error always occurs.
I have compared my data with the example data and found nothing abnormal in the format. I don't know what causes this problem. Could you please help me check this issue?
I have uploaded the top rows of my file here.
image

myfile.csv

NOTICE: plan(multiprocess) of future is deprecated

Hi.

This is a friendly reminder that plan(multiprocess) of the future package is deprecated since future 1.20.0 (2020-11-03). It will eventually become defunct and removed. The background for this can be found in HenrikBengtsson/future#420.

Your protti package relies on multiprocess, cf. https://github.com/jpquast/protti/search?q=multiprocess.

Please migrate your code to the platform-independent plan(multisession) or the Linux/macOS-specific plan(multicore). If you want to emulate what multiprocess does, you can do something like:

  if (parallelly::supportsMulticore()) {
    oplan <- plan(multicore)
  } else {
    oplan <- plan(multisession)
  }
  on.exit(plan(oplan))

BTW, if you don't already do so, please make sure to undo any plan() you set in your code, as illustrated by the above example. This is needed to guarantee that calling your code won't override settings that the user has set previously. You can read about this in https://future.futureverse.org/reference/plan.html#for-package-developers.

Thank you,

Henrik
(maintainer of the future package)

calculate_diff_abundance output can't add intensity_log2 column

Hi,
the calculate_diff_abundance() function doesn't retain the normalised_intensity_log2 column (passed as intensity_log2 = normalised_intensity_log2) in its output, even if I add normalised_intensity_log2 to retain_columns = c(genes, sampleID, normalised_intensity_log2). Is there a way to do this?

Thanks!

Partial output of fetch_uniprot.R

It seems that "fetch_uniprot.R" provides a partial output even if some of the batches fail to download data from uniprot (due to high traffic: "HTTP ERROR 429"). If using markdowns, the console warnings are easily overlooked. I would suggest that the function should only provide an output if all batches are successfully downloaded.
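
The suggested all-or-nothing behaviour could look roughly like the sketch below. It does not use protti's actual code: fetch_batch() is a hypothetical stand-in for a single batch request, and fetch_all_or_nothing() and its batchsize argument are illustrative names.

```r
# Hypothetical stand-in for one batch request; it fails for any batch
# containing "B2" to imitate an HTTP 429 response.
fetch_batch <- function(ids) {
  if ("B2" %in% ids) stop("HTTP ERROR 429")
  data.frame(id = ids, ok = TRUE, stringsAsFactors = FALSE)
}

# Illustrative wrapper: return a result only if every batch succeeded.
fetch_all_or_nothing <- function(ids, batchsize = 2, fetch = fetch_batch) {
  batches <- split(ids, ceiling(seq_along(ids) / batchsize))
  # Capture errors per batch instead of letting them pass as warnings
  results <- lapply(batches, function(batch) {
    tryCatch(fetch(batch), error = function(e) e)
  })
  failed <- vapply(results, inherits, logical(1), what = "error")
  if (any(failed)) {
    stop(sum(failed), " of ", length(batches), " batches failed; no output returned.")
  }
  do.call(rbind, results)
}

# One of the two batches fails, so the call stops instead of returning half the data
outcome <- tryCatch(
  fetch_all_or_nothing(c("A1", "A2", "B1", "B2")),
  error = function(e) conditionMessage(e)
)
complete <- fetch_all_or_nothing(c("A1", "A2", "A3"))
```

Raising an error rather than a warning makes the failure hard to miss when the code runs inside a markdown document.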

assign_missingness() variables are hard coded

The variables completeness_MAR and completeness_MNAR defined in lines 93 and 94 are not used in the function. Instead, the values are hard coded in the function in lines 200-205.

qc_peptide_type

The function is very slow when using method = "intensity".
Reminder to work on this.

Query regarding the drc_4p_plot

First, thank you for this package. I have been using it to look at proteomics in some experiments, in particular using the drc_4p_plot function after running the fit_drc_4p function (replicate_completeness = 0.7, condition_completeness = 0.5, correlation_cutoff = 0.7).

When I plot the data, I noticed that the confidence interval does not run over all the samples.

Exploring this further by looking at the plot_curve and plot_points output, I noticed that only 13 points were present in the plot_points output, but the graph shows 17 points (attached plot). I had assumed that plot_points represented the plotted points, while plot_curve plotted the confidence intervals, so I'm confused about where the extra points came from. Is this a bug, or are some of the plotted points predictions/extrapolations?

I am running version 0.6.0.
image

`qc_sample_correlation()` always returns a plot

The qc_sample_correlation() function always returns a plot even if the result is saved to a variable. This should be possible to fix by including the silent = TRUE argument in the pheatmap() function.
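
A small sketch of the proposed fix, assuming the pheatmap package is installed (the block is skipped otherwise). The correlation matrix is random toy data, not protti output.

```r
# Toy correlation matrix standing in for the sample correlations
set.seed(1)
intensities <- matrix(rnorm(40), ncol = 4, dimnames = list(NULL, paste0("sample_", 1:4)))
correlation <- cor(intensities)
heatmap_drawn <- FALSE

if (requireNamespace("pheatmap", quietly = TRUE)) {
  # silent = TRUE builds the heatmap object without drawing it
  hm <- pheatmap::pheatmap(correlation, silent = TRUE)
  # The stored gtable can still be drawn explicitly when a plot is wanted
  grid::grid.newpage()
  grid::grid.draw(hm$gtable)
  heatmap_drawn <- TRUE
}
```

With this pattern the caller decides when the plot appears, so assigning the result to a variable no longer has a plotting side effect.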

fetch_uniprot

If ID's are outdated they appear as "Merged into xy" in the proteins.names column.
An example is: P01892 | Merged into P04439.
This can be fixed by using the information from P04439 for P01892.
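
One way this replacement could work is sketched below on a toy table. The column names id and protein_name and the placeholder annotation are assumptions for illustration. The sketch also assumes the current entry (P04439) is present in the same table.

```r
# Toy table; column names and the placeholder annotation are assumptions
uniprot <- data.frame(
  id = c("P01892", "P04439"),
  protein_name = c("Merged into P04439.", "example protein name"),
  stringsAsFactors = FALSE
)

# Detect outdated entries and extract the accession they were merged into
pattern <- "^Merged into ([A-Za-z0-9]+)\\.?$"
is_merged <- grepl(pattern, uniprot$protein_name)
target <- sub(pattern, "\\1", uniprot$protein_name[is_merged])

# Copy the annotation of the current entry onto the outdated ID
# (yields NA if the current entry is not part of the table)
uniprot$protein_name[is_merged] <- uniprot$protein_name[match(target, uniprot$id)]
```

In practice the current entries would probably need to be fetched in a second request if they were not part of the original query.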

qc_pca typo

qc_PCA(
data,
sample = r_file_name,
grouping = eg_precursor_id,
intensity = normalised_intensity_log2,
condition = r_condition
)

qc_sample_correlation typo

qc_sample_correlation(
data,
sample = r_file_name
grouping = eg_precursor_id,
intensity = intensity_log2,
condition = r_condition
)

comma missing

The 'is_significant' variable missing reference

Hello!

I am continually very pleased with the output and readability of this vignette, but I noticed something confusing. The bottom of the single dose treatment workflow (which corresponds to the data I am working on) contains a section called 'Additional helpful functions'. Its third paragraph tells the reader to use 'is_significant', and the column is indeed used in the corresponding diff_abundance_significant code, but it is never created in this workflow. The 'is_significant' column is only created in the dose response data analysis workflow. There, however, 'is_significant' depends on a 'passed_filter' variable that is also not present in the single dose workflow. In fact, I can't seem to find the 'passed_filter' variable at all.

Kind regards,
Bryan

qc_sequence_coverage()

qc_sequence_coverage() should display not the overall median as a dotted line but rather the sample wise median.

The code should look something like the following:

ggplot2::geom_vline(data = input_data, mapping = aes(xintercept = median_coverage), linewidth = 1, col = "red", linetype = "dashed")

Isobaric labels support

Hi. Firstly, I find your project very interesting. My compliments to the developers for keeping the source extremely organized, and for following the tidyverse "philosophy".
Here's my question. Can I process data labelled with TMT or iTRAQ? If so, could you give me some pointers?

Thanks, and well done!

peptide_profile_plot() should plot all samples

peptide_profile_plot() should plot all samples even if the sample does not contain any precursor.

This plot has all samples:
image

This one is missing all but two samples:
image

Thanks 🦦

Anna

go_enrichment does not work

When using go_enrichment(data, uniprot_id, is_significant = significant, ontology_type = "CC", organism_id = "9606", algorithm = "elim", statistic = "fisher")
error:
Error in get(paste("GO", whichOnto, "Term", sep = "")) : object 'GOCCTerm' not found

similar error for other GO terms;

typo in roxygen header (arguments, ontology type - biological process (BP))
comma missing in example (after is_significant = significant)

Proteome coverage plot - label switch

It looks like the figure labels are switched!

Here is the dataframe that qc_proteome_coverage() outputs where sample 1A has ~20% proteins detected
image

And here is the figure where there are 20% proteins not detected
image

Dose response benchmarking data

Hi there,

Thanks a lot for developing the protti package and for putting together the R workflow in this document: https://jpquast.github.io/protti/articles/data_analysis_dose_response_workflow.html

I'm interested in the dose-response work and I am running your package (and therefore the drc package and estimates) on different operating systems in production on our web platform.

One thing that I noticed is that, because of the optimisation with optim, the model estimates from drc can differ slightly, which makes it hard for me to benchmark results. I'm therefore looking for a ground truth dataset that I can compare results against to make sure I always get what I need. Are you able to share a csv with the results that you show in the all hits table (top rows below), so that I can compare the actual EC50 values in your table with what I obtain without the approximation? I noticed that the estimate of the EC50 can vary quite a bit between OSs.

rank  score  eg_precursor_id                           pg_protein_accessions  anova_adj_pval  correlation  ec_50
1     0.919  VFDVELLKLE.2                              P62942                 3.98e-13        0.967        3.6e+06
2     0.914  RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.3  P62942                 4.73e-14        0.947        3.0e+05
3     0.888  GWEEGVAQMSVGQR.2                          P62942                 1.08e-13        0.947        4.7e+05

Also, do you have other benchmarking data with known dose-response curves and effects that users could use for benchmarking?

Thanks a lot for this!

Anna
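
A comparison against such a reference table could be sketched as follows. The reference EC50 values are the rounded ones from the table above. The observed values and the 5% tolerance are hypothetical and only illustrate the check.

```r
# Reference EC50 values as shown (rounded) in the all hits table
reference <- c(
  "VFDVELLKLE.2" = 3.6e6,
  "RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.3" = 3.0e5,
  "GWEEGVAQMSVGQR.2" = 4.7e5
)

# Hypothetical estimates obtained on another operating system
observed <- c(
  "VFDVELLKLE.2" = 3.58e6,
  "RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.3" = 3.1e5,
  "GWEEGVAQMSVGQR.2" = 4.7e5
)

# Relative deviation per precursor, matched by name rather than by position
relative_diff <- abs(observed[names(reference)] - reference) / reference
tolerance <- 0.05 # assumed acceptance threshold
within_tolerance <- relative_diff <= tolerance
```

Because the table values are rounded to two significant figures, a tolerance much tighter than a few percent would not be meaningful without the unrounded csv.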

qc_proteome_coverage: labels linked to wrong colors

The labels in the proteome coverage plot (qc_proteome_coverage) are linked to the wrong color in the plot legend.

image

image

labels = c("proteins_detected" = "detected", "proteins_undetected" = "not detected") fixes the issue
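
The base R sketch below illustrates why naming the labels helps. It only mimics how labels get attached to levels and does not use ggplot2. The level order in fill_levels is an assumption for illustration.

```r
# Level order as it might appear in the plotted data (assumed)
fill_levels <- c("proteins_undetected", "proteins_detected")

# Unnamed labels are assigned by position and can land on the wrong level
positional <- c("detected", "not detected")
by_position <- setNames(positional, fill_levels)

# Named labels are matched by level name, independent of the order
named <- c("proteins_detected" = "detected", "proteins_undetected" = "not detected")
by_name <- named[fill_levels]
```

Here by_position labels the undetected proteins as "detected", while by_name stays correct regardless of the level order.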

documentation calculate_protein_abundance()

Dear fellow colleagues,
I found a mistake in the documentation of the function calculate_protein_abundance():

If for_plot = FALSE, protein abundances are returned, if for_plot = TRUE also precursor intensities are returned in a data frame. The later output is ideal for plotting with qc_protein_abundance and can be filtered to only include protein abundances.

that should be:

If for_plot = FALSE, protein abundances are returned, if for_plot = TRUE also precursor intensities are returned in a data frame. The later output is ideal for plotting with peptide_profile_plot() and can be filtered to only include protein abundances.

Thank you for your great work

Anna

qc_ids typo

qc_ids(
data,
sample = r_file_name
grouping = eg_precursor_id,
condition = r_condition,
title = "Number of peptide IDs per sample"
)

comma missing

qc_log2_intensity_distribution

qc_log2_distribution(
data,
sample = r_file_name,
grouping = pep_stripped_sequence,
log2_intensity = normalised_intensity_log2)

wrong function call

plot labels order

Dear fellow colleagues,

I would love to have the labels in the qc plots ordered by x-axis sample order instead of alphabetical.
Would it be possible for you to fulfill my wish??

Thank you for your great work.
Best,
Anna
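
A minimal sketch of the requested behaviour, assuming the sample names live in a character vector and the desired order is their order of appearance in the data. The sample names are made up.

```r
# Made-up sample names in the order they appear in the data
samples <- c("sample_10", "sample_2", "sample_1", "sample_2", "sample_10")

# Default: factor levels are sorted alphabetically
alphabetical <- levels(factor(samples))

# Requested: keep the order in which the samples first appear
in_data_order <- levels(factor(samples, levels = unique(samples)))
```

Plotting functions that respect factor levels would then keep the labels in that order instead of sorting them.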

Improvement to `assign_peptide_type()`

Make assign_peptide_type() recognise if the initial methionine of a protein is missing in all peptides and assign "fully-tryptic" instead of "semi-tryptic".
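
The rule could be sketched per peptide as below. This is a simplified illustration and not protti's implementation: classify_peptide() is a hypothetical helper, the protein sequence is made up, and the sketch treats any peptide starting at position 2 of a protein beginning with "M" as having a tryptic N-terminus, without checking the "in all peptides" condition.

```r
# Hypothetical helper (not protti's implementation)
classify_peptide <- function(start, end, protein_sequence) {
  first_aa <- substr(protein_sequence, 1, 1)
  aa_before <- if (start > 1) substr(protein_sequence, start - 1, start - 1) else ""
  last_aa <- substr(protein_sequence, end, end)

  # N-terminus is tryptic at the protein start, after K/R, or directly
  # after a removed initial methionine
  n_term_ok <- start == 1 ||
    aa_before %in% c("K", "R") ||
    (start == 2 && first_aa == "M")
  # C-terminus is tryptic after K/R or at the protein end
  c_term_ok <- last_aa %in% c("K", "R") || end == nchar(protein_sequence)

  if (n_term_ok && c_term_ok) {
    "fully-tryptic"
  } else if (n_term_ok || c_term_ok) {
    "semi-tryptic"
  } else {
    "non-tryptic"
  }
}

protein <- "MASKLLRGGA" # made-up sequence
after_met <- classify_peptide(2, 4, protein) # peptide "ASK"
semi <- classify_peptide(6, 7, protein)      # peptide "LR"
non <- classify_peptide(3, 3, protein)       # peptide "S"
```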

'Timeout is reached' in fetch_uniprot

Hi,

In my dataset there are more than 4000 protein IDs to retrieve from UniProt. When I use the fetch_uniprot() function, I get:

Please note that some column names have changed due to UniProt updating its API! This might cause errors in your code. You can fix it by replacing the old column names with new ones.
The following IDs have not been retrieved correctly.
# A tibble: 10 × 2
   id                error
   <chr>             <chr>
 1 IDs: 1 to 200     Timeout was reached: [rest.uniprot.org] Operation timed out after 30001 milliseconds with 0 bytes received

Could you give me some advice about this issue please?
Thanks in advance!

Best regards,
Shel
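
Transient timeouts like this are often handled by retrying the failed request. The sketch below shows the general pattern in base R. with_retries() and flaky_request() are hypothetical names and no actual UniProt request is made.

```r
# Hypothetical retry wrapper; `request` is any function that may fail transiently
with_retries <- function(request, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(request(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    Sys.sleep(wait * attempt) # simple linear back-off between attempts
  }
  stop("All ", max_tries, " attempts failed.")
}

# Stand-in request that times out twice before it succeeds
calls <- 0
flaky_request <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "batch retrieved"
}

outcome <- with_retries(flaky_request)
```

Applied per batch, this would re-request only the batches that timed out rather than the whole set of IDs.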

Facetting plots in qc_sequence_coverage()

Hey,

I found a potentially minor problem when using the "sample" parameter in qc_sequence_coverage(). It will properly facet the plot but always indicate the same median coverage in each subplot.

Example:

library(tidyverse)
library(protti)

minex <- data.frame(IDs = rep(c("A", "B"), 6), 
                    cov = c(1:5, 25, 1, 21:25),
                    condition = rep(c("x", "y"), each = 6))

qc_sequence_coverage(data = minex,
                     protein_identifier = IDs,
                     coverage = cov,
                     sample = condition) # indicated median at 13

minex %>% group_by(condition) %>% summarise(median = median(cov)) #median(s) should be this

Best :)
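
For reference, the expected per-sample medians for this example can be computed in base R as sketched below. The column name median_coverage is chosen to match the geom_vline() suggestion in the qc_sequence_coverage() issue above and is otherwise an assumption.

```r
minex <- data.frame(
  IDs = rep(c("A", "B"), 6),
  cov = c(1:5, 25, 1, 21:25),
  condition = rep(c("x", "y"), each = 6)
)

# The value currently drawn in every facet
overall_median <- median(minex$cov)

# One median per sample, which is what each facet should show
sample_medians <- aggregate(cov ~ condition, data = minex, FUN = median)
names(sample_medians)[2] <- "median_coverage"
```

The overall median is 13, while the medians for conditions x and y are 3.5 and 22.5.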

Fetch uniprot functions need to be updated

UniProt recently updated their website and changed the way it is accessed programmatically. In order to ensure that everything works as it used to, we need to update our fetch_uniprot and fetch_uniprot_proteome functions. As a quick fix we have updated these functions to retrieve information from the UniProt legacy website. These changes are not implemented on CRAN yet.

impute() throws error, even with the example given with create_synthetic_data()

Hi, I really like your package! As the title says, I've encountered an issue - the example I'm referring to is the one given in:
https://rdrr.io/cran/protti/man/impute.html

I have my own data that I wanted to apply impute() on, it threw back an error so I tried to test if the function works with the example you provide. The error is the same in both cases, minding column name differences:
image
(this is the one received when trying to imitate the example with synthetic data)

Is there a bug or am I using the function in a wrong way? The code I ran was:

  set.seed(123) # Makes example reproducible

  data <- create_synthetic_data(
    n_proteins = 10,
    frac_change = 0.5,
    n_replicates = 4,
    n_conditions = 2,
    method = "effect_random",
    additional_metadata = FALSE
  )

  head(data, n = 24)

  data_missing <- assign_missingness(
    data,
    sample = sample,
    condition = condition,
    grouping = peptide,
    intensity = peptide_intensity_missing,
    ref_condition = "all",
    retain_columns = c(protein, peptide_intensity)
  )

  head(data_missing, n = 24)

  data_imputed <- impute(
    data_missing,
    sample = sample,
    grouping = peptide,
    intensity_log2 = peptide_intensity_missing,
    condition = condition,
    comparison = comparison,
    missingness = missingness,
    method = "ludovic",
    retain_columns = c(protein, peptide_intensity)
  )

  head(data_imputed, n = 24)

Thanks for your help and keep up the good work:)

Question - adding condition to bottom-up data results in NA column

Hello,

Your vignette is rather easy to follow. I have a binary comparison set of PD data with 4 replicates of treatment and 3 replicates of a DMSO control exported as a CSV. In the input preparation workflow I can follow each step except the last one:

pd_prot_long_annotated <- pd_prot_long %>%
left_join(y = annotation, by = "file_name")

A glimpse of the pd_prot_long_annotated shows the condition column as 'NA'. This seems to be unintended.
image
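
An all-NA column after a left join usually means that the key values in the two tables do not match exactly. The sketch below shows one possible cause and check, using base R and made-up data. The actual cause in the reported data is not known, and the file names and the ".raw" extension mismatch are assumptions.

```r
# Made-up tables; here the annotation lacks the file extension
pd_prot_long <- data.frame(
  file_name = c("run_01.raw", "run_02.raw"),
  intensity = c(10.2, 11.5),
  stringsAsFactors = FALSE
)
annotation <- data.frame(
  file_name = c("run_01", "run_02"),
  condition = c("treated", "control"),
  stringsAsFactors = FALSE
)

# Keys present in the data but missing from the annotation
unmatched <- setdiff(unique(pd_prot_long$file_name), annotation$file_name)

# Harmonise the key (here: drop the extension) before joining
pd_prot_long$file_name <- sub("\\.raw$", "", pd_prot_long$file_name)
joined <- merge(pd_prot_long, annotation, by = "file_name", all.x = TRUE)
```

Checking unmatched before the join shows directly which file names need to be harmonised.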

fetch_uniprot typo

fetch_uniprot(c("P36578", "O43324", "Q00796")) commas are missing in the example
