jpquast / protti
Picotti lab data analysis package.
Home Page: https://jpquast.github.io/protti/
License: Other
One could add an additional argument to `calculate_sequence_coverage()` that allows the user to group by a specific column in order to calculate subsets of sequence coverages for the same protein.
The function `calculate_protein_abundance()` does not retain the peptide column. If it is added to `retain_columns`, the wrong precursors are matched to the peptide.
Thankssss
Anna
fetch_uniprot(c("P36578", "O43324", "Q00796"))
commas are missing in the example
Hi,
My dataset contains more than 4000 protein IDs that I want to retrieve from UniProt. When I use the `fetch_uniprot()` function, I get the following message:
Please note that some column names have changed due to UniProt updating its API! This might cause errors in your code. You can fix it by replacing the old column names with new ones.
The following IDs have not been retrieved correctly.
# A tibble: 10 × 2
id error
<chr> <chr>
1 IDs: 1 to 200 Timeout was reached: [rest.uniprot.org] Operation timed out after 30001 milliseconds with 0 bytes received
Could you give me some advice about this issue please?
Thanks in advance!
Best regards,
Shel
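One way to work around timeouts with long ID lists is to split the IDs into smaller batches and retry a batch when it fails. The following is a minimal sketch in base R, not protti's implementation: `fetch_in_batches()` and `fake_fetch()` are hypothetical helpers, and `fake_fetch()` only stands in for the real download call (for example a wrapper around `fetch_uniprot()`).

```r
# Hypothetical helper (not part of protti): download IDs in batches and
# retry each batch a few times before recording it as failed.
fetch_in_batches <- function(ids, fetch_fun, batch_size = 200,
                             max_tries = 3, wait_seconds = 0) {
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  results <- list()
  failed <- character(0)
  for (batch in batches) {
    res <- NULL
    for (attempt in seq_len(max_tries)) {
      # Turn an error (e.g. a timeout) into NULL so that we can retry
      res <- tryCatch(fetch_fun(batch), error = function(e) NULL)
      if (!is.null(res)) break
      Sys.sleep(wait_seconds)
    }
    if (is.null(res)) {
      failed <- c(failed, batch)
    } else {
      results[[length(results) + 1]] <- res
    }
  }
  list(data = do.call(rbind, results), failed_ids = failed)
}

# Stand-in for the real download: fails on its first call, then succeeds.
calls <- 0
fake_fetch <- function(batch) {
  calls <<- calls + 1
  if (calls == 1) stop("Timeout was reached")
  data.frame(accession = batch, stringsAsFactors = FALSE)
}

ids <- sprintf("ID%04d", 1:450) # made-up IDs for illustration
out <- fetch_in_batches(ids, fake_fetch, batch_size = 200)
nrow(out$data)
out$failed_ids
```

Smaller batches and a short pause between retries (`wait_seconds`) may help when the server is busy; the values used here are arbitrary.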
qc_ids(
data,
sample = r_file_name
grouping = eg_precursor_id,
condition = r_condition,
title = "Number of peptide IDs per sample"
)
A comma is missing after `sample = r_file_name`.
Hey,
I found a potentially minor problem when using the `sample` parameter in `qc_sequence_coverage()`. It will properly facet the plot but always indicate the same median coverage in each subplot.
Example:
library(tidyverse)
library(protti)
minex <- data.frame(IDs = rep(c("A", "B"), 6),
                    cov = c(1:5, 25, 1, 21:25),
                    condition = rep(c("x", "y"), each = 6))
qc_sequence_coverage(data = minex,
                     protein_identifier = IDs,
                     coverage = cov,
                     sample = condition) # indicated median at 13
minex %>% group_by(condition) %>% summarise(median = median(cov)) # median(s) should be this
Best :)
UniProt recently updated their website and changed the way it is accessed programmatically. In order to ensure that everything works as it used to, we need to update our `fetch_uniprot` and `fetch_uniprot_proteome` functions. As a quick fix we have updated these functions to retrieve information from the UniProt legacy website. These changes are not implemented on CRAN yet.
It seems that "fetch_uniprot.R" provides a partial output even if some of the batches fail to download data from UniProt (due to high traffic: "HTTP ERROR 429"). When working in R Markdown documents, the console warnings are easily overlooked. I would suggest that the function should only provide an output if all batches are successfully downloaded.
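The suggested all-or-nothing behaviour could look roughly like the sketch below. It is only an illustration in base R: `combine_batches()` is a hypothetical helper, and the assumption that each batch result is either a data frame, `NULL`, or an error object is mine, not something taken from the protti source.

```r
# Hypothetical helper: combine batch results only if every batch succeeded.
combine_batches <- function(batch_results) {
  failed <- vapply(
    batch_results,
    function(x) is.null(x) || inherits(x, "error"),
    logical(1)
  )
  if (any(failed)) {
    # Stop with an error instead of returning a partial result
    stop("Download incomplete, failed batches: ",
         paste(which(failed), collapse = ", "),
         call. = FALSE)
  }
  do.call(rbind, batch_results)
}

ok <- list(data.frame(accession = "P36578"), data.frame(accession = "O43324"))
partial <- list(data.frame(accession = "P36578"), simpleError("HTTP ERROR 429"))

complete <- combine_batches(ok)
outcome <- tryCatch(combine_batches(partial), error = function(e) conditionMessage(e))
outcome
```

An error stops the rendering of an R Markdown document by default, so a failed download cannot be overlooked the way a console warning can.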
qc_PCA(
data,
sample = r_file_name,
grouping = eg_precursor_id,
intensity = normalised_intensity_log2,
condition = r_condition
)
`parallel_fit_drc_4p` is not initiating workers correctly with `future::plan(multiprocess)`. It throws an error.
Dear colleagues,
it would be nice if you could update the STRINGdb default version from 11.5 to 12.0.
Thanks a lot,
Anna
Hello!
I am continually very pleased with the output and readability of this vignette, but I noticed something confusing: the bottom of the single dose treatment workflow (this corresponds to data that I am working on) contains a section called 'Additional helpful functions'. The third paragraph refers to using 'is_significant', which is indeed used in the corresponding diff_abundance_significant code but is not created in this workflow. The 'is_significant' column is only created in the dose response data analysis workflow. However, creating 'is_significant' there requires a 'passed_filter' variable that is also not present in the single dose workflow. In fact, I can't seem to find the 'passed_filter' variable at all.
Kind regards,
Bryan
I have tried many times to use the `qc_sample_correlation` function to calculate the correlation between samples, but an error always occurs.
I have compared my data with the example data and found nothing abnormal in the format, so I don't know what causes this problem. Could you please help me check this issue?
I have uploaded the top rows of my file here.
Hi there,
Thanks a lot for developing the `protti` package and for putting together the R workflow in this document: https://jpquast.github.io/protti/articles/data_analysis_dose_response_workflow.html
I'm interested in the dose-response work and I am running your package (and therefore the `drc` package and estimates) on different operating systems in production on our web platform.
One thing that I noticed is that because of the optimisation with `optim`, the model estimates with `drc` can be slightly different, and it's hard for me to benchmark results, so I'm looking for a ground truth dataset where I can compare results and make sure I always get what I need. Are you able to share a CSV with the results that you show in the `all hits` table (top rows below), so that I can compare the actual EC50 values in your table with what I obtain, without the approximation? I noticed that the estimate of the EC50 could vary quite a bit between OSs.
| rank | score | eg_precursor_id | pg_protein_accessions | anova_adj_pval | correlation | ec_50 |
|---|---|---|---|---|---|---|
| 1 | 0.919 | VFDVELLKLE.2 | P62942 | 3.98e-13 | 0.967 | 3.6e+06 |
| 2 | 0.914 | RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.3 | P62942 | 4.73e-14 | 0.947 | 3.0e+05 |
| 3 | 0.888 | GWEEGVAQMSVGQR.2 | P62942 | 1.08e-13 | 0.947 | 4.7e+05 |
Also, do you have other benchmarking data for which you know the dose-response curves and effects, and that users could use for benchmarking?
Thanks a lot for this!
Anna
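For this kind of cross-platform check, comparing estimates with a relative tolerance is usually more robust than testing for exact equality. The sketch below uses the three EC50 values from the table above as reference; the `observed_ec50` values and the 5% tolerance are made up purely for illustration.

```r
# Reference EC50 values as shown in the table (rounded to two significant digits)
reference_ec50 <- c(3.6e6, 3.0e5, 4.7e5)

# Made-up values standing in for estimates obtained on another operating system
observed_ec50 <- c(3.61e6, 2.9e5, 6.0e5)

# Relative difference to the reference and comparison against a tolerance
rel_diff <- abs(observed_ec50 - reference_ec50) / reference_ec50
within_tol <- rel_diff <= 0.05

data.frame(reference_ec50, observed_ec50, rel_diff, within_tol)
```

Because the table values are rounded, the tolerance cannot meaningfully be tighter than the rounding error; unrounded reference values would be needed for a stricter comparison.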
Hi.
This is a friendly reminder that `plan(multiprocess)` of the future package is deprecated since future 1.20.0 (2020-11-03). It will eventually become defunct and removed. The background for this can be found in HenrikBengtsson/future#420.
Your protti package relies on `multiprocess`, cf. https://github.com/jpquast/protti/search?q=multiprocess.
Please migrate your code to the platform-independent `plan(multisession)` or the Linux/macOS-specific `plan(multicore)`. If you want to emulate what `multiprocess` does, you can do something like:
if (parallelly::supportsMulticore()) {
  oplan <- plan(multicore)
} else {
  oplan <- plan(multisession)
}
on.exit(plan(oplan))
BTW, if you don't already do so, please make sure to undo any `plan()` you set in your code, as illustrated by the above example. This is needed to guarantee that calling your code won't override settings that the user has set previously. You can read about this in https://future.futureverse.org/reference/plan.html#for-package-developers.
Thank you,
Henrik
(maintainer of the future package)
The `qc_sample_correlation()` function always returns a plot even if the result is saved to a variable. This should be possible to fix by including the `silent = TRUE` argument in the `pheatmap()` function.
As the title says. Makes the function safer to use.
`fetch_uniprot()` finds a valid UniProt accession in, for example, "CON_ENSEMBL:ENSBTAP00000037665", which is wrong.
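A likely explanation is that the accession pattern is matched anywhere in the string instead of against the whole string. The sketch below uses the accession format published by UniProt; whether `fetch_uniprot()` uses exactly this pattern internally is an assumption.

```r
# Accession format as documented by UniProt
uniprot_pattern <- "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"

# Anchoring the pattern forces the whole string to be an accession
anchored_pattern <- paste0("^(", uniprot_pattern, ")$")

ids <- c("P36578", "O43324", "CON_ENSEMBL:ENSBTAP00000037665")

# Unanchored: the substring "P00000" inside the contaminant ID matches
unanchored_hits <- grepl(uniprot_pattern, ids)

# Anchored: only the two real accessions match
anchored_hits <- grepl(anchored_pattern, ids)

unanchored_hits
anchored_hits
```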
Hi. Firstly, I find your project very interesting. My compliments to the developers for keeping the source extremely organized, and for following the tidyverse "philosophy".
Here's my question. Can I process data labelled with TMT or iTRAQ? If so, could you give me some pointers?
Thanks, and well done!
Add option to customize x-axis label
Hi,
the `calculate_diff_abundance()` function does not retain the `intensity_log2 = normalised_intensity_log2` column in its output, even if I add `normalised_intensity_log2` to `retain_columns = c(genes, sampleID, normalised_intensity_log2)`. Is there a way to do this?
Thanks!
qc_sample_correlation(
data,
sample = r_file_name
grouping = eg_precursor_id,
intensity = intensity_log2,
condition = r_condition
)
A comma is missing after `sample = r_file_name`.
Function is super slow when using `method = "intensity"`.
Reminder to work on this.
Add an argument to `calculate_go_enrichment()` that allows the user to specify the enrichment type that should be displayed.
Options could be: "all", "enriched", "deenriched".
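Such an argument could be validated with `match.arg()`, as in the sketch below. This is not the protti implementation: `filter_enrichment()` and the columns `term`, `observed` and `expected` are hypothetical and only illustrate the idea on toy data.

```r
# Hypothetical helper: keep only the requested type of enrichment result.
filter_enrichment <- function(results,
                              enrichment_type = c("all", "enriched", "deenriched")) {
  # match.arg() rejects anything that is not one of the three options
  enrichment_type <- match.arg(enrichment_type)
  switch(enrichment_type,
    all = results,
    enriched = results[results$observed > results$expected, , drop = FALSE],
    deenriched = results[results$observed < results$expected, , drop = FALSE]
  )
}

# Toy data with made-up counts
go_results <- data.frame(
  term = c("term_a", "term_b", "term_c"),
  observed = c(12, 3, 5),
  expected = c(4, 9, 5),
  stringsAsFactors = FALSE
)

filter_enrichment(go_results, "enriched")$term
filter_enrichment(go_results, "deenriched")$term
```

With the default argument, `match.arg()` picks the first option, so calling the function without `enrichment_type` would keep the current behaviour of showing all terms.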
`peptide_profile_plot()`: if the content of the grouping column contains spaces no plot is generated.
Thanks
Anna
When using `go_enrichment(data, uniprot_id, is_significant = significant, ontology_type = "CC", organism_id = "9606", algorithm = "elim", statistic = "fisher")`
error:
Error in get(paste("GO", whichOnto, "Term", sep = "")) : object 'GOCCTerm' not found
similar error for other GO terms;
typo in roxygen header (arguments, ontology type - biological process (BP))
comma missing in example (after `is_significant = significant`)
Hi, I really like your package! As the title says, I've encountered an issue - the example I'm referring to is the one given in:
https://rdrr.io/cran/protti/man/impute.html
I have my own data that I wanted to apply impute() on, it threw back an error so I tried to test if the function works with the example you provide. The error is the same in both cases, minding column name differences:
(this is the one received when trying to imitate the example with synthetic data)
Is there a bug or am I using the function in a wrong way? The code I ran was:
`
set.seed(123) # Makes example reproducible
data <- create_synthetic_data(
n_proteins = 10,
frac_change = 0.5,
n_replicates = 4,
n_conditions = 2,
method = "effect_random",
additional_metadata = FALSE
)
head(data, n = 24)
data_missing <- assign_missingness(
data,
sample = sample,
condition = condition,
grouping = peptide,
intensity = peptide_intensity_missing,
ref_condition = "all",
retain_columns = c(protein, peptide_intensity)
)
head(data_missing, n = 24)
data_imputed <- impute(
data_missing,
sample = sample,
grouping = peptide,
intensity_log2 = peptide_intensity_missing,
condition = condition,
comparison = comparison,
missingness = missingness,
method = "ludovic",
retain_columns = c(protein, peptide_intensity)
)
head(data_imputed, n = 24)
`
Thanks for your help and keep up the good work:)
The variables completeness_MAR and completeness_MNAR defined in lines 93 and 94 are not used in the function. Instead, the values are hard coded in the function in lines 200-205.
Dear fellow colleagues,
I found a mistake in the documentation of the function `calculate_protein_abundance()`:
If for_plot = FALSE, protein abundances are returned, if for_plot = TRUE also precursor intensities are returned in a data frame. The later output is ideal for plotting with qc_protein_abundance and can be filtered to only include protein abundances.
that should be:
If `for_plot = FALSE`, protein abundances are returned, if `for_plot = TRUE` also precursor intensities are returned in a data frame. The later output is ideal for plotting with `peptide_profile_plot()` and can be filtered to only include protein abundances.
Thank you for your great work
Anna
Hello,
Your vignette is rather easy to follow. I have a binary comparison set of PD data with 4 replicates of treatment and 3 replicates of a DMSO control exported as a CSV. In the input preparation workflow I can follow each step except the last one :
pd_prot_long_annotated <- pd_prot_long %>%
left_join(y = annotation, by = "file_name")
A glimpse of the pd_prot_long_annotated shows the condition column as 'NA'. This seems to be unintended.
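A `condition` column full of `NA` after a left join usually means that the key values do not match exactly between the two tables. The sketch below shows one way to check this in base R. The toy data and the assumed cause (a ".raw" extension that is present in only one table) are made up for illustration and are not confirmed by the issue.

```r
# Toy data: file names in the long table carry an extension, the annotation does not
pd_prot_long <- data.frame(
  file_name = c("F1.raw", "F1.raw", "F2.raw"),
  intensity = c(10, 11, 12),
  stringsAsFactors = FALSE
)
annotation <- data.frame(
  file_name = c("F1", "F2"),
  condition = c("treated", "control"),
  stringsAsFactors = FALSE
)

# Keys that have no partner in the annotation table
unmatched <- setdiff(unique(pd_prot_long$file_name), annotation$file_name)
unmatched

# Harmonise the keys, then join (merge with all.x = TRUE behaves like a left join)
pd_prot_long$file_name <- sub("\\.raw$", "", pd_prot_long$file_name)
annotated <- merge(pd_prot_long, annotation, by = "file_name", all.x = TRUE)
annotated
```

If `unmatched` is empty and the column is still `NA`, the cause lies elsewhere, for example in differing column types or stray whitespace in the file names.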
Make `assign_peptide_type()` recognise if the initial methionine of a protein is missing in all peptides and assign "fully-tryptic" instead of "semi-tryptic".
First, thank you for this package. I have been using it to look at proteomics in some experiments, in particular using the drc_4p_plot function after running the fit_drc_4p function (replicate_completeness = 0.7, condition_completeness = 0.5, correlation_cutoff = 0.7).
When I plot the data, I noticed that the confidence interval does not run over all the samples.
Exploring this further by looking at the plot_curve and plot_points output, I noticed that only 13 points were output in plot_points, but the graph shows 17 points (attached plot). I had assumed that plot_points represented the plotted points, while plot_curve contained the confidence intervals, so I'm confused about where the extra points came from. Is this a bug, or are some of the plotted points predictions/extrapolations?
If a protein was not found in one condition and in the other condition it was found in 3 out of 4 replicates the missingness assigned will be NA. This will not be imputed. However, for certain methods it would be nice to have the option to assign a missingness-tag that can be imputed.
qc_log2_distribution(
data,
sample = r_file_name,
grouping = pep_stripped_sequence,
log2_intensity = normalised_intensity_log2)
wrong function call
Dear fellow colleagues,
I would love to have the labels in the qc plots ordered by x-axis sample order instead of alphabetical.
Would it be possible for you to fulfill my wish??
Thank you for your great work.
Best,
Anna
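Until such an option exists, a common workaround is to turn the sample column into a factor whose levels follow the desired order, since ggplot2 orders discrete axes by factor levels. Whether every protti QC function respects the factor levels of its input is an assumption. The sample names below are made up.

```r
# Sample names in the order in which they appear in the data
samples <- c("sample_10", "sample_2", "sample_1", "sample_10")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(samples))

# Workaround: fix the levels to the order of first appearance
ordered_samples <- factor(samples, levels = unique(samples))
levels(ordered_samples)
```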
`qc_sequence_coverage()` should display not the overall median as a dotted line but rather the sample wise median.
The code should look something like the following:
ggplot2::geom_vline(data = input_data, mapping = aes(xintercept = median_coverage), linewidth = 1, col = "red", linetype = "dashed")
Hi guys,
I noticed that if there is no significant peptide after p-value adjustment, the `volcano_plot()` function fails, BUT ONLY WHEN THE `interactive =` ARGUMENT IS GIVEN.
Could you please give it a look?
Thankssss :)
The `median_normalisation` function contains the argument `grouping`, which is not required by the function.
If IDs are outdated they appear as "Merged into xy" in the proteins.names column.
An example is: P01892 | Merged into P04439.
This can be fixed by using the information from P04439 for P01892.
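The replacement accession could be extracted from the name with a regular expression, as in the sketch below. It is only an illustration in base R: the named vector is toy data, "Example protein name" is a placeholder, and the exact format of the "Merged into" text (with or without a trailing period) is an assumption based on the example above.

```r
# Toy data: accession as name, retrieved protein name as value
protein_names <- c(
  P01892 = "Merged into P04439.",
  P36578 = "Example protein name"
)

# Entries whose name signals that the accession was merged into another one
is_merged <- grepl("^Merged into ", protein_names)

# Extract the new accession, tolerating an optional trailing period
new_ids <- sub("^Merged into ([A-Z0-9]+)\\.?$", "\\1", protein_names[is_merged])

# Accessions to use for a second retrieval
resolved_ids <- names(protein_names)
resolved_ids[is_merged] <- unname(new_ids)
resolved_ids
```

The resolved accessions could then be fetched again, and the retrieved information assigned back to the original, outdated IDs.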