jpquast / protti Goto Github PK



Picotti lab data analysis package.

Home Page: https://jpquast.github.io/protti/

License: Other

R 100.00%

r proteomics mass-spectrometry data-analysis systems-biology lip-ms omics protein

protti's Introduction

Hi there 👋👨🏻‍💻

I am currently doing my PhD in the Picotti Lab at ETH in Zurich Switzerland ⛰️🥼🥽🔬
I am a biochemist who works on protein metal interactions using mass spectrometry 🧪⚙️
I am interested in unstudied organisms 🧬🦠🪲🌱 and automation solutions using robots 🤖

Check out our R package protti developed for easy data analysis of bottom-up proteomics and LiP-MS data 🧪🔬💻

Also check out my R package ggplate which can create simple plots of biological culture plates as well as microplates. 🧫🧬📊

You can also find me on Twitter and LinkedIn 🐦


protti's Issues

GO enrichment, enrichment type

Add an argument to calculate_go_enrichment() that allows the user to specify the enrichment type that should be displayed.
Options could be: "all", "enriched", "deenriched".
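
The proposed options could be handled roughly as in the sketch below. It is only an illustration in base R: filter_enrichment() and the columns n_observed and n_expected are hypothetical names, not part of protti, and ties are counted as "deenriched" here purely for simplicity.

```r
# Hypothetical helper (not protti's API): keep only rows that match the
# requested enrichment direction.
filter_enrichment <- function(result,
                              enrichment_type = c("all", "enriched", "deenriched")) {
  enrichment_type <- match.arg(enrichment_type)
  if (enrichment_type == "all") {
    return(result)
  }
  # Assumed columns: n_observed (significant proteins in the term) and
  # n_expected (number expected by chance).
  direction <- ifelse(result$n_observed > result$n_expected, "enriched", "deenriched")
  result[direction == enrichment_type, , drop = FALSE]
}

# Made-up example values
go_result <- data.frame(
  term = c("GO:A", "GO:B", "GO:C"),
  n_observed = c(12, 3, 7),
  n_expected = c(5.2, 6.1, 7.5)
)

filter_enrichment(go_result, "enriched")
```

Using match.arg() means an unsupported value fails early with a clear error, and the default resolves to "all".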

confusions about the qc_sample_correlation function

I have tried the "qc_sample_correlation" function many times to calculate the correlation level, but an error always occurs.
I have compared my data with the example data and found nothing abnormal in the format, so I don't know what causes this problem. Could you please help me check this issue?
I have uploaded top rows of my file here.

myfile.csv

NOTICE: plan(multiprocess) of future is deprecated

Hi.

This is a friendly reminder that plan(multiprocess) of the future package is deprecated since future 1.20.0 (2020-11-03). It will eventually become defunct and removed. The background for this can be found in HenrikBengtsson/future#420.

Your protti package relies on multiprocess, cf. https://github.com/jpquast/protti/search?q=multiprocess.

Please migrate your code to the platform-independent plan(multisession) or the Linux/macOS-specific plan(multicore). If you want to emulate what multiprocess does, you can do something like:

  if (parallelly::supportsMulticore()) {
    oplan <- plan(multicore)
  } else {
    oplan <- plan(multisession)
  }
  on.exit(plan(oplan))

BTW, if you don't already do so, please make sure to undo any plan() you set in your code, as illustrated by the above example. This is needed to guarantee that calling your code won't override settings that the user has set previously. You can read about this in https://future.futureverse.org/reference/plan.html#for-package-developers.

Thank you,

Henrik
(maintainer of the future package)

calculate_diff_abundance output can't add intensity_log2 column

Hi,
the calculate_diff_abundance() function can't retain the intensity_log2 = normalised_intensity_log2 column in its output, even if I add normalised_intensity_log2 to retain_columns = c(genes, sampleID, normalised_intensity_log2). Is there a way to do this?

Thanks!

Partial output of fetch_uniprot.R

It seems that "fetch_uniprot.R" provides a partial output even if some of the batches fail to download data from uniprot (due to high traffic: "HTTP ERROR 429"). If using markdowns, the console warnings are easily overlooked. I would suggest that the function should only provide an output if all batches are successfully downloaded.
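
One way to get the suggested all-or-nothing behaviour is sketched below in base R. It is not how fetch_uniprot is implemented: fetch_all_or_nothing() and mock_fetch() are hypothetical names, and mock_fetch() merely simulates a failing batch.

```r
# Hypothetical sketch: return a result only if every batch succeeded.
# fetch_batch is a stand-in for whatever function downloads one batch.
fetch_all_or_nothing <- function(ids, fetch_batch, batch_size = 200) {
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  # Capture errors per batch instead of letting them pass as warnings
  results <- lapply(batches, function(batch) {
    tryCatch(fetch_batch(batch), error = function(e) e)
  })
  failed <- vapply(results, inherits, logical(1), what = "error")
  if (any(failed)) {
    stop(sum(failed), " of ", length(batches),
         " batch(es) failed; no partial output is returned.")
  }
  out <- do.call(rbind, results)
  row.names(out) <- NULL
  out
}

# Mock downloader that fails for one made-up ID
mock_fetch <- function(batch) {
  if ("BAD_ID" %in% batch) stop("HTTP ERROR 429")
  data.frame(accession = batch, stringsAsFactors = FALSE)
}

ok <- fetch_all_or_nothing(c("P36578", "O43324", "Q00796"), mock_fetch, batch_size = 2)
```

Raising an error rather than a warning makes the failure hard to overlook in a rendered markdown document.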

assign_missingness() variables are hard coded

The variables completeness_MAR and completeness_MNAR defined in lines 93 and 94 are not used in the function. Instead, the values are hard coded in the function in lines 200-205.

qc_peptide_type

Function is super slow when using method = "intensity" .
Reminder to work on this.

parallel_fit_drc_4p does not initiate workers correctly

parallel_fit_drc_4p is not initiating workers correctly with future::plan(multiprocess). It throws an error.

Query regarding the drc_4p_plot

First, thank you for this package. I have been using it to look at proteomics in some experiments, in particular using the drc_4p_plot function after running the fit_drc_4p function ( replicate_completeness = 0.7, condition_completeness = 0.5, correlation_cutoff = 0.7).

When I plot the data, I noticed that the confidence interval does not run over all the samples.

Exploring this further by looking at the plot_curve and plot_points output, I noticed that plot_points contains only 13 points, but the graph shows 17 points (attached plot). I had assumed that plot_points represented the plotted points, while plot_curve held the confidence intervals, so I'm confused about where the extra points came from. Is this a bug, or are some of the plotted points predictions/extrapolations?

I am running version 0.6.0.

`qc_sample_correlation()` always returns a plot

The qc_sample_correlation() function always returns a plot even if the result is saved to a variable. This should be possible to fix by including the silent = TRUE argument in the pheatmap() function.

fetch_uniprot

If IDs are outdated they appear as "Merged into xy" in the proteins.names column.
An example is: P01892 | Merged into P04439.
This can be fixed by using the information from P04439 for P01892.
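
A rough base R sketch of the first step, detecting the "Merged into" text and extracting the current accession, is shown below. resolve_merged_id() is a hypothetical helper, and the second protein name is a placeholder. The resolved ID would then have to be fetched again to obtain the actual annotation.

```r
# Hypothetical helper: map an outdated accession to the ID it was merged into.
resolve_merged_id <- function(accession, protein_name) {
  # Matches e.g. "Merged into P04439" with or without a trailing period
  pattern <- "^Merged into ([A-Z0-9]+)\\.?$"
  is_merged <- grepl(pattern, protein_name)
  replacement <- sub(pattern, "\\1", protein_name)
  ifelse(is_merged, replacement, accession)
}

resolved <- resolve_merged_id(
  accession = c("P01892", "P04439"),
  protein_name = c("Merged into P04439.", "Example protein name")
)
resolved
```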

qc_pca typo

qc_PCA(
data,
sample = r_file_name,
grouping = eg_precursor_id,
intensity = normalised_intensity_log2,
condition = r_condition
)

qc_sample_correlation typo

qc_sample_correlation(
data,
sample = r_file_name
grouping = eg_precursor_id,
intensity = intensity_log2,
condition = r_condition
)

comma missing

fetch_kegg.R doesn't join dataframe correctly

The fetch_kegg function is buggy: it returns NA in the pathway column, causing our GitHub Actions to fail.

peptide_profile_plot() missing plots

peptide_profile_plot(): if the content of the grouping column contains spaces no plot is generated.

Thanks 🦭

Anna

The 'is_significant' variable missing reference

Hello!

I am continually very pleased with the output and readability of this vignette, but I noticed something confusing: the bottom of the single dose treatment workflow (which corresponds to the data I am working on) contains a section called 'Additional helpful functions'. Its third paragraph tells the reader to use 'is_significant', which is indeed used in the corresponding diff_abundance_significant code but is never created in this workflow. The 'is_significant' column is only created in the dose response data analysis workflow. However, creating 'is_significant' requires a 'passed_filter' variable that is also not present in the single dose workflow. In fact, I can't seem to find the 'passed_filter' variable at all.

Kind regards,
Bryan

qc_sequence_coverage()

qc_sequence_coverage() should display not the overall median as a dotted line but rather the sample wise median.

The code should look something like the following:

ggplot2::geom_vline(data = input_data, mapping = aes(xintercept = median_coverage), linewidth = 1, col = "red", linetype = "dashed")

Isobaric labels support

Hi. Firstly, I find your project very interesting. My compliments to the developers for keeping the source extremely organized, and for following the tidyverse "philosophy".
Here's my question. Can I process data labelled with TMT or iTRAQ? If so, could you give me some pointers?

Thanks, and well done!

Add grouping argument to `calculate_sequence_coverage()`

One could add an additional argument to calculate_sequence_coverage() that allows the user to group by a specific column in order to calculate subsets of sequence coverages for the same protein.
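
The idea can be illustrated in base R as follows. This is a simplified sketch and not protti's implementation: sequence_coverage() is a hypothetical helper, the peptide positions are made up, and coverage is taken as the percentage of protein positions covered by at least one peptide.

```r
# Hypothetical helper: percentage of positions covered by at least one peptide.
sequence_coverage <- function(protein_length, start, end) {
  covered <- rep(FALSE, protein_length)
  for (i in seq_along(start)) {
    covered[start[i]:end[i]] <- TRUE
  }
  100 * sum(covered) / protein_length
}

# Made-up peptides of one protein (length 50), split by a grouping column
peptides <- data.frame(
  group = c("a", "a", "b"),
  start = c(1, 5, 11),
  end = c(10, 14, 20)
)

# Coverage calculated separately for each group
coverage_by_group <- sapply(
  split(peptides, peptides$group),
  function(d) sequence_coverage(50, d$start, d$end)
)
coverage_by_group
# group a covers positions 1-14 (28 %), group b covers positions 11-20 (20 %)
```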

median_normalisation contains grouping but does not need it

median_normalisation function contains the argument grouping which is not required by the function.
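
To illustrate why only the sample and intensity columns are needed, here is a base R sketch of one common form of median normalisation on log2 intensities. It is an assumption that this matches protti's exact formula, and median_normalise() is a hypothetical name.

```r
# Hypothetical sketch: shift every sample so that its median equals the
# median of all sample medians. No grouping (peptide/precursor) column is used.
median_normalise <- function(sample, intensity_log2) {
  sample_medians <- tapply(intensity_log2, sample, median, na.rm = TRUE)
  global_median <- median(as.numeric(sample_medians), na.rm = TRUE)
  run_median <- as.numeric(sample_medians[as.character(sample)])
  intensity_log2 - run_median + global_median
}

# Made-up example values
sample <- c("s1", "s1", "s1", "s2", "s2", "s2")
intensity_log2 <- c(10, 11, 12, 12, 13, 14)

normalised <- median_normalise(sample, intensity_log2)
tapply(normalised, sample, median)
# both samples now have a median of 12
```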

peptide_profile_plot() should plot all samples

peptide_profile_plot() should plot all samples even if the sample does not contain any precursor.

This plot has all samples:

This one is missing all but two samples:

Thanks 🦦

Anna

NA in gene name column when using the impute function

When using the impute function, imputed rows have missing values in some columns like gene names.

assign_missingness NA if one condition no results and other condition 3/4 samples

If a protein was not found in one condition and in the other condition it was found in 3 out of 4 replicates the missingness assigned will be NA. This will not be imputed. However, for certain methods it would be nice to have the option to assign a missingness-tag that can be imputed.

Styler package for GitHub Actions

volcano_plot() function fails when there is no significant peptide after p-value adjustment

Hi guys,
I noticed that if there is no significant peptide after p-value adjustment, the volcano_plot() function fails, BUT ONLY WHEN THE interactive = ARGUMENT IS GIVEN.
Could you please give it a look?
Thankssss :)

go_enrichment does not work

When using go_enrichment(data, uniprot_id, is_significant = significant, ontology_type = "CC", organism_id = "9606", algorithm = "elim", statistic = "fisher")
error:
Fehler in get(paste("GO", whichOnto, "Term", sep = "")) : Objekt 'GOCCTerm' nicht gefunden

similar error for other GO terms;

typo in roxygen header (arguments, ontology type - biological process (BP))
comma missing in example (after is_significant = significant)

Proteome coverage plot - label switch

It looks like the figure labels are switched!

Here is the dataframe that qc_proteome_coverage() outputs where sample 1A has ~20% proteins detected

And here is the figure where there are 20% proteins not detected

Dose response benchmarking data

Hi there,

Thanks a lot for developing the protti package and for putting together the R workflow in this document: https://jpquast.github.io/protti/articles/data_analysis_dose_response_workflow.html

I'm interested in the dose-response work and I am running your package (and therefore the drc package and estimates) on different operating systems in production on our web platform.

One thing I noticed is that, because of the optimisation with optim, the model estimates from drc can differ slightly, which makes it hard for me to benchmark results. I'm therefore looking for a ground truth dataset against which I can compare results and make sure I always get what I expect. Are you able to share a csv with the results that you show in the all hits table (top rows below) so that I can compare the actual EC50 values in your table with what I obtain, without the approximation? I noticed that the estimate of the EC50 can vary quite a bit between OSs.

rank	score	eg_precursor_id	pg_protein_accessions	anova_adj_pval	correlation	ec_50
1	0.919	VFDVELLKLE.2	P62942	3.98e-13	0.967	3.6e+06
2	0.914	RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.3	P62942	4.73e-14	0.947	3.0e+05
3	0.888	GWEEGVAQMSVGQR.2	P62942	1.08e-13	0.947	4.7e+05

Also, do you have other benchmarking data for which the dose-response curves and effects are known and that users could use for benchmarking?

Thanks a lot for this!

Anna
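
A comparison of this kind could be set up with a relative tolerance, as in the base R sketch below. The reference values are the rounded EC50 values from the table above. The observed values and the 5 % tolerance are made up purely for illustration.

```r
# Reference EC50 values as displayed (rounded) in the table above
reference_ec50 <- c(
  "VFDVELLKLE.2" = 3.6e6,
  "RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.3" = 3.0e5,
  "GWEEGVAQMSVGQR.2" = 4.7e5
)

# Hypothetical values obtained on another operating system
observed_ec50 <- c(3.61e6, 2.9e5, 6.0e5)

# Relative deviation from the reference, and a check against an assumed tolerance
relative_diff <- abs(observed_ec50 - reference_ec50) / reference_ec50
within_tolerance <- relative_diff <= 0.05

data.frame(
  eg_precursor_id = names(reference_ec50),
  relative_diff = round(unname(relative_diff), 3),
  within_tolerance = unname(within_tolerance)
)
```

A relative rather than absolute tolerance is used because the EC50 values span more than an order of magnitude.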

qc_proteome_coverage: labels linked to wrong colors

the labels in the plot for the proteome coverage (qc_proteome_coverage) are linked to the wrong color in the plot legend.

labels = c("proteins_detected" = "detected", "proteins_undetected" = "not detected") fixes the issue

documentation calculate_protein_abundance()

Dear fellow colleagues,
I found a mistake in the documentation of the function calculate_protein_abundance():

If for_plot = FALSE, protein abundances are returned, if for_plot = TRUE also precursor intensities are returned in a data frame. The later output is ideal for plotting with qc_protein_abundance and can be filtered to only include protein abundances.

that should be:

If for_plot = FALSE, protein abundances are returned, if for_plot = TRUE also precursor intensities are returned in a data frame. The later output is ideal for plotting with peptide_profile_plot() and can be filtered to only include protein abundances.

Thank you for your great work 🥇

Anna

qc_ids typo

qc_ids(
data,
sample = r_file_name
grouping = eg_precursor_id,
condition = r_condition,
title = "Number of peptide IDs per sample"
)

comma missing

qc_log2_intensity_distribution

qc_log2_distribution(
data,
sample = r_file_name,
grouping = pep_stripped_sequence,
log2_intensity = normalised_intensity_log2)

wrong function call

Add ungroup to normalise function

As the title says. Makes the function safer to use.

plot_drc_4p

Add option to customize x-axis label

plot labels order

Dear fellow colleagues ,

I would love to have the labels in the qc plots ordered by x-axis sample order instead of alphabetical.
Would it be possible for you to fulfill my wish??

Thank you for your great work.
Best,
Anna
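
In general, plotting functions order discrete labels by factor levels, and a character column converted with factor() gets alphabetical levels by default. The base R sketch below shows how levels can follow the order of appearance instead. The sample names are made up, and whether the qc functions pick this up depends on their implementation.

```r
# Made-up sample names in the desired order
samples <- c("treated_1", "treated_2", "control_1", "control_2")

# Default: levels are sorted alphabetically
alphabetical <- factor(samples)
levels(alphabetical)
# "control_1" "control_2" "treated_1" "treated_2"

# Levels follow the order in which the samples first appear
in_sample_order <- factor(samples, levels = unique(samples))
levels(in_sample_order)
# "treated_1" "treated_2" "control_1" "control_2"
```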

calculate_protein_abundance() does not retain peptide column

The function calculate_protein_abundance() does not retain the peptide column. If it is added to retain_columns, the wrong precursors are matched to the peptide.

Thankssss 🐼
Anna

Improvement to `assign_peptide_type()`

Make assign_peptide_type() recognise if the initial methionine of a protein is missing in all peptides and assign "fully-tryptic" instead of "semi-tryptic".
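
A simplified sketch of this rule for the peptides of a single protein is shown below. It is not protti's assign_peptide_type(): classify_peptides() and its arguments are hypothetical, and an empty aa_after is assumed to mark the protein C-terminus.

```r
# Hypothetical simplified classifier for the peptides of ONE protein.
# start:     start position of each peptide in the protein
# aa_before: amino acid preceding the peptide ("" at the protein start)
# last_aa:   last amino acid of the peptide
# aa_after:  amino acid following the peptide ("" at the protein end)
classify_peptides <- function(start, aa_before, last_aa, aa_after) {
  # If no peptide starts at position 1, assume the initial methionine was
  # removed and treat position 2 as the protein N-terminus.
  met_removed <- !any(start == 1) & start == 2 & aa_before == "M"
  n_ok <- start == 1 | aa_before %in% c("K", "R") | met_removed
  c_ok <- last_aa %in% c("K", "R") | aa_after == ""
  ifelse(n_ok & c_ok, "fully-tryptic",
         ifelse(n_ok | c_ok, "semi-tryptic", "non-tryptic"))
}

# No peptide starts at position 1: the peptide at position 2 counts as tryptic
missing_met <- classify_peptides(
  start = c(2, 15), aa_before = c("M", "K"),
  last_aa = c("K", "R"), aa_after = c("A", "G")
)

# A peptide starting at position 1 exists: position 2 stays semi-tryptic
with_met <- classify_peptides(
  start = c(1, 2), aa_before = c("", "M"),
  last_aa = c("K", "K"), aa_after = c("A", "A")
)
```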

'Timeout is reached' in fetch_uniprot

Hi,

My dataset contains more than 4000 protein IDs to retrieve from UniProt. When I use the fetch_uniprot() function, it returns:

Please note that some column names have changed due to UniProt updating its API! This might cause errors in your code. You can fix it by replacing the old column names with new ones.
The following IDs have not been retrieved correctly.                                                                                   
# A tibble: 10 × 2
   id                error                                                                                                     
   <chr>             <chr>                                                                                                     
 1 IDs: 1 to 200     Timeout was reached: [rest.uniprot.org] Operation timed out after 30001 milliseconds with 0 bytes received

Could you give me some advice about this issue please?
Thanks in advance!

Best regards,
Shel
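
A generic workaround for transient timeouts is to retry a failed request a few times. The base R sketch below shows the pattern only: with_retries() and flaky() are hypothetical, flaky() merely simulates a request that times out twice, and whether fetch_uniprot() offers built-in retry or timeout settings should be checked in its documentation.

```r
# Hypothetical helper: call fun() until it succeeds or max_tries is reached.
with_retries <- function(fun, max_tries = 3, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    Sys.sleep(wait_seconds) # optional pause before the next attempt
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Simulated request that fails twice and then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "ok"
}

res <- with_retries(flaky, max_tries = 5)
```

Splitting the IDs into smaller batches before retrying may also help if individual requests are too large to finish within the timeout.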

Facetting plots in qc_sequence_coverage()

Hey,

I found a potentially minor problem when using the "sample" parameter in qc_sequence_coverage(). It will properly facet the plot but always indicate the same median coverage in each subplot.

Example:

library(tidyverse)
library(protti)

minex <- data.frame(IDs = rep(c("A", "B"), 6), 
                    cov = c(1:5, 25, 1, 21:25),
                    condition = rep(c("x", "y"), each = 6))

qc_sequence_coverage(data = minex,
                     protein_identifier = IDs,
                     coverage = cov,
                     sample = condition) # indicated median at 13

minex %>% group_by(condition) %>% summarise(median = median(cov)) #median(s) should be this

Best :)
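
For reference, the same comparison in base R, using the example data from above: the overall median is 13, whereas the per-sample medians that each facet should show are 3.5 and 22.5.

```r
# Same example data as in the report above
minex <- data.frame(
  IDs = rep(c("A", "B"), 6),
  cov = c(1:5, 25, 1, 21:25),
  condition = rep(c("x", "y"), each = 6)
)

# Median over all rows: this is what every facet currently shows
overall_median <- median(minex$cov)
overall_median
# 13

# One median per facet (condition)
sample_medians <- aggregate(cov ~ condition, data = minex, FUN = median)
sample_medians
# condition x: 3.5, condition y: 22.5
```

A data frame like sample_medians, with one row per facet, is the kind of input a per-facet vertical line layer would need.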

Fetch uniprot functions need to be updated

UniProt recently updated their website and changed the way it is accessed programmatically. In order to ensure that everything works as it used to, we need to update our fetch_uniprot and fetch_uniprot_proteome functions. As a quick fix we have updated these functions to retrieve information from the UniProt legacy website. These changes are not implemented on CRAN yet.

impute() throws error, even with the example given with create_synthetic_data()

Hi, I really like your package! As the title says, I've encountered an issue - the example I'm referring to is the one given in:
https://rdrr.io/cran/protti/man/impute.html

I have my own data that I wanted to apply impute() on, it threw back an error so I tried to test if the function works with the example you provide. The error is the same in both cases, minding column name differences:

(this is the one received when trying to imitate the example with synthetic data)

Is there a bug or am I using the function in a wrong way? The code I ran was:

library(protti)

set.seed(123) # Makes example reproducible

# Create synthetic example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

head(data, n = 24)

# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, peptide_intensity)
)

head(data_missing, n = 24)

# Impute missing values (this is the call that throws the error)
data_imputed <- impute(
  data_missing,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  condition = condition,
  comparison = comparison,
  missingness = missingness,
  method = "ludovic",
  retain_columns = c(protein, peptide_intensity)
)

head(data_imputed, n = 24)

Thanks for your help and keep up the good work:)

Question - adding condition to bottom-up data results in NA column

Hello,

Your vignette is rather easy to follow. I have a binary comparison set of PD data with 4 replicates of treatment and 3 replicates of a DMSO control exported as a CSV. In the input preparation workflow I can follow each step except the last one :

pd_prot_long_annotated <- pd_prot_long %>%
left_join(y = annotation, by = "file_name")

A glimpse of the pd_prot_long_annotated shows the condition column as 'NA'. This seems to be unintended.
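
A frequent cause of an all-NA column after a left join is that the key values do not match exactly between the two tables. The base R sketch below shows one way to check this. The data are made up, and the ".raw" suffix is only an assumed example of such a mismatch.

```r
# Made-up data: the file names differ by a ".raw" suffix
pd_prot_long <- data.frame(
  file_name = c("run_01.raw", "run_02.raw"),
  intensity = c(10.2, 11.5)
)
annotation <- data.frame(
  file_name = c("run_01", "run_02"),
  condition = c("treated", "control")
)

# Keys present in the data but absent from the annotation
unmatched <- setdiff(unique(pd_prot_long$file_name), annotation$file_name)
unmatched

# Harmonise the key, then join (all.x = TRUE behaves like a left join)
pd_prot_long$file_name <- sub("\\.raw$", "", pd_prot_long$file_name)
joined <- merge(pd_prot_long, annotation, by = "file_name", all.x = TRUE)
joined
```

If unmatched is empty and the column is still NA, the annotation table itself may contain missing conditions.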

fetch_uniprot typo

fetch_uniprot(c("P36578", "O43324", "Q00796")) commas are missing in the example
