kechrislab / msprep

A processing pipeline for the summarization, normalization and diagnostics of mass spectrometry–based metabolomics data.

R 100.00%

msprep's People

tuh8888 danielmedic zahrapahlavanyali

msprep's Issues

add self as author to DESCRIPTION

Easiest task ever. :)

Clean up repo

Remove old code files and check locations and gen scripts of datasets used in package and testing.

Remove hard-coding of replicate count (curr = 3)

Should allow for none or some (any) user-defined number of replicates.
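
A minimal sketch of what replicate-agnostic summarization could look like, assuming long-format data. The column names (`subject_id`, `replicate`, `compound`, `abundance`) and the helper `summarize_replicates()` are hypothetical and not part of the package.

```r
# Hypothetical long-format data: one row per subject, replicate and compound.
# Subjects have 3, 2 and 1 replicates respectively.
dat <- data.frame(
  subject_id = c("s1", "s1", "s1", "s2", "s2", "s3"),
  replicate  = c(1, 2, 3, 1, 2, 1),
  compound   = "c1",
  abundance  = c(10, 12, 14, 20, 22, 30)
)

# Summarize over however many replicates each subject happens to have,
# instead of assuming exactly three.
summarize_replicates <- function(data, fun = mean) {
  aggregate(abundance ~ subject_id + compound, data = data, FUN = fun)
}

summarized <- summarize_replicates(dat)
```

Because the grouping is done on the identifier columns rather than on a fixed replicate index, a subject with a single measurement passes through unchanged.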

Metabolomics and mass spec resources for Bioconductor incorporation

I checked Bioconductor and there were only 31 packages matching the "metabolomics" search term.
I removed some that were not obviously related. Attached is a list, but some of those may not be as relevant either - I don't have time now to look at each of them.

But if it helps, we are looking at data from LC/MS - which is related to GC/MS or other mass spectrometry (MS) based methods - but not related to NMR methods.

The package I know best is 'xcms', which is widely used. The authors of 'metabomxtr' visited last year and we're in touch with them; they could be a good resource.

Other relevant packages from the Emory group that we work with:
http://web1.sph.emory.edu/apLCMS/
https://sourceforge.net/projects/xmsanalyzer/

Package Name	Authors	Description
biosigner	Philippe Rinaudo Etienne Thevenot	Signature discovery from omics data
CAMERA	Steffen Neumann	Collection of annotation related methods for mass spectrometry data
cosmiq	David Fischer Christian Panse	cosmiq - COmbining Single Masses Into Quantities
IPO	Thomas Riebenbauer	Automated Optimization of XCMS Data Processing parameters
MAIT	Francesc Fernandez-Albert	Statistical Analysis of Metabolomic Data
Metab	Raphael Aggio	Metab: An R Package for a High-Throughput Analysis of Metabolomics Data Generated by GC-MS.
metabomxtr	Michael Nodzenski	A package to run mixture models for truncated metabolomics data with normal or lognormal distributions
metaMS	Ron Wehrens	MS-based metabolomics annotation pipeline
MetCirc	Thomas Naake	Navigating mass spectral similarity in high-resolution MS/MS metabolomics data
MWASTools	Andrea Rodriguez-Martinez Rafael Ayala	MWASTools: an integrated pipeline to perform metabolome-wide association studies
mzR	Bernd Fischer Steffen Neumann Laurent Gatto Qiang Kou	parser for netCDF mzXML mzData and mzML and mzIdentML files (mass spectrometry data)
OmicsMarkeR	Charles E. Determan Jr.	Classification and Feature Selection for 'Omics' Datasets
PAPi	Raphael Aggio	Predict metabolic pathway activity based on metabolomics data
pathview	Weijun Luo	a tool set for pathway based data integration and visualization
Rdisop	Steffen Neumann	Decomposition of Isotopic Patterns
RMassBank	RMassBank at Eawag	Workflow to process tandem MS files and build MassBank records
ropls	Etienne A. Thevenot	PCA PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data
SIMAT	Mo R. Nezami Ranjbar	GC-SIM-MS data processing and alaysis tool
statTarget	Hemi Luan	Statistical Analysis of Metabolite Profile
xcms	Steffen Neumann	LC/MS and GC/MS Data Analysis
yamss	Leslie Myint	Tools for high-throughput metabolomics

Consider changing groupingvars to sample_info and check handling of continuous variables.

change name to 'sample_info'
ensure that it accepts continuous variables (e.g. blood pressure could be a phenotype)

Deprecate groupingvars arguments and functions if so.

Fix print method

Doesn't seem to be working across the entire pipeline.

What if a user does not have replicates, spike-ins or batches?

All of these should have reasonable default values so that, for example, an experiment without batches treats all samples as belonging to a single batch. The default value should be NULL, with dummy variables filled in internally. Per @mmulvahill's suggestion, this needs to be done on a separate branch and the downstream implications thoroughly tested before merging.
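
One way the NULL defaults could be handled is sketched below. The helper `fill_design_defaults()` and the placeholder column names `.batch` and `.replicate` are assumptions made for illustration; the package may choose different names or cover spike-ins as well.

```r
# Hypothetical helper: when the user passes NULL for a design column,
# add a constant placeholder column and return its name, so downstream
# code can always rely on a batch and a replicate column being present.
fill_design_defaults <- function(data, batch = NULL, replicate = NULL) {
  if (is.null(batch)) {
    data$.batch <- 1L        # every sample belongs to one batch
    batch <- ".batch"
  }
  if (is.null(replicate)) {
    data$.replicate <- 1L    # every sample is its own single replicate
    replicate <- ".replicate"
  }
  list(data = data, batch = batch, replicate = replicate)
}

samples <- data.frame(sample_id = c("a", "b", "c"),
                      abundance = c(1.5, 2.0, 3.2))
prepared <- fill_design_defaults(samples)
```

Returning the resolved column names alongside the data keeps the rest of the pipeline unaware of whether the user supplied the columns or not.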

Improve user interface/reading in data - streamline/simplify

Definitely remove the file read -- should just use read.csv, read_csv, etc.

Meeting 1/9/18

Modularize functions

Want to be able to quickly provide interface for incorporating new methods with simple wrappers. Definitely for imputation methods, but probably other functions too.

MSPrep

When running msFilter on a SummarizedExperiment, the column names are changed from just the sample_id to a string of all the colData variables pasted together with "_" as the separator. How do I preserve the sample_ID of the original colnames without having to delete all of the colData?

Clarify RT variable type

Some values in the quant/spike dataset have colons: 3.2672226:1 and 3.2817035:2. Is this a ratio that needs to be calculated before converting to numeric? Currently this column is a factor and is inconsistently implemented. Resolve in wide_matrix_to_data function and ms_prepare.
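
Until the meaning of the colon is settled, a lossless option is to split the value into its two parts and keep both. The sketch below assumes the part before the colon is the numeric retention time and the part after it is an integer index; that interpretation is a guess, and `parse_rt()` is a hypothetical helper.

```r
# Hypothetical parser: split "value:suffix" strings without deciding
# what the suffix means. Works on factor or character input.
parse_rt <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  value <- vapply(parts, function(p) p[1], character(1))
  suffix <- vapply(
    parts,
    function(p) if (length(p) > 1) p[2] else NA_character_,
    character(1)
  )
  data.frame(value = as.numeric(value), suffix = as.integer(suffix))
}

rt_raw <- factor(c("3.2672226:1", "3.2817035:2", "4.1"))
parsed <- parse_rt(rt_raw)
```

Entries without a colon get `NA` in the `suffix` column, so the numeric column can be used directly while the open question is resolved.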

Imputation questions

@kechrisk

One imputation method is the half-minimum approach, where we impute 0's with half the minimum value for that compound. Do we actually want to use the minimum across all patients & replicates, or should we do this using the minimum value within each patient?
In the old code, when BPCA imputes a number < 0, we replace that negative value with the half-minimum imputation -- resulting in a combined BPCA/half-min imputation method. Is this what we want to do? Or should these negative values be considered 'true missing'? Or, should these be two separate options? (1. BPCA, assumed 0 if neg. and 2. BPCA, assumed below threshold)
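
To make the first question concrete, here is a small sketch of half-minimum imputation for a single compound, computed either across all observations or within each patient. The function `half_min_impute()` and the example data are hypothetical; zeros are assumed to mark missing values and the input is assumed to contain no `NA`s.

```r
# Hypothetical half-minimum imputation for one compound.
# x: numeric abundances, with 0 meaning "missing".
# group: optional patient identifier; if given, the minimum is taken
#        within each patient instead of across everyone.
half_min_impute <- function(x, group = NULL) {
  impute_one <- function(v) {
    observed <- v[v > 0]
    if (length(observed) == 0) return(v)  # nothing observed: leave as is
    v[v == 0] <- min(observed) / 2
    v
  }
  if (is.null(group)) impute_one(x) else ave(x, group, FUN = impute_one)
}

abundance <- c(0, 4, 6, 10, 0, 20)
patient   <- c("p1", "p1", "p1", "p2", "p2", "p2")

global_fill  <- half_min_impute(abundance)           # 2 4 6 10 2 20
patient_fill <- half_min_impute(abundance, patient)  # 2 4 6 10 5 20
```

The two options differ only for patient p2, whose missing value becomes 2 under the global minimum and 5 under the within-patient minimum.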

For reference, from the manuscript:

Missing Data: There are three primary modes of missing data in metabolomics datasets and each mode has different implications for subsequent analysis; therefore, different imputation routines and statistical methods are required and three are offered in the MSPrep package. The three modes are truly not present, present below the detectable limit of the instrument and absent owing to error in pre-processing algorithms. The MSPrep package implements three methods of managing missing data: (i) No imputation assumes the mode of missing is true zeros and therefore assigns the missing values as zeros. This dataset could be useful for PCA analysis, cluster analysis and methods that account for clustering at zero. Unless a stringent filter is applied, normalization routines may have poor performance, as most have assumptions about underlying distributions that are not valid with zero clustered data. (ii) The second option assumes missing compounds were below the detectable limit and imputes a value of one half of the minimum observed value for that compound (Xia et al., 2009). (iii) The final method is a call to the Bayesian PCA (BPCA) imputation algorithm (Oba et al., 2003) from the PCAMethods R package (Stacklies et al., 2007) and assumes that the compound is present but failed to be accurately detected. This algorithm estimates the missing value by a linear combination of principal axis vectors, where the parameters of the model are identified by a Bayesian estimation method and is not sensitive to the quantity of missing data.

Add documentation to functions, datasets, and add vignette

Track already-completed processes & valid combinations in MSPrep object

Add a data frame to the object with a column for the pipeline step (prepare, filter, impute, etc.) and a logical column for whether the step was conducted (y/n).
Then create a function that defines the acceptable paths through the pipeline and gives an error when an invalid path is taken.
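
A rough sketch of such a tracker follows. The step names, their order and the rule that each step requires all earlier steps are assumptions for illustration only (the intended ordering is still an open question); `new_pipeline_log()` and `record_step()` are hypothetical names.

```r
# Hypothetical step log: one row per pipeline step, in the assumed order.
new_pipeline_log <- function() {
  data.frame(
    step = c("prepare", "filter", "impute", "normalize"),
    conducted = FALSE,
    stringsAsFactors = FALSE
  )
}

# Mark a step as conducted, erroring if the path is not valid.
# Assumed rule: every earlier step must already be done, and a step
# cannot be run twice.
record_step <- function(log, step) {
  idx <- match(step, log$step)
  if (is.na(idx)) stop("Unknown pipeline step: ", step)
  if (log$conducted[idx]) stop("Step already conducted: ", step)
  if (!all(log$conducted[seq_len(idx - 1)])) {
    stop("Step '", step, "' requires earlier steps to be run first")
  }
  log$conducted[idx] <- TRUE
  log
}

log <- record_step(new_pipeline_log(), "prepare")
log <- record_step(log, "filter")
```

Calling `record_step(new_pipeline_log(), "impute")` would stop with an error because `prepare` and `filter` have not been recorded yet.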

Add new modular methods

e.g., imputation steps

Consolidate different versions of code

There are 3 or 4 versions -- get down to one set.

Review this -- programming w/ dplyr

https://edwinth.github.io/blog/dplyr-recipes/

Also, consider using .data$varname in dplyr funs to avoid env scoping issues.

BPCA parameters in ms_impute

Currently ms_impute uses the default parameters for Bayesian PCA as implemented in impute_bpca. All parameters passed through impute_bpca to pca need to be available to the user in the arguments to ms_impute.
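
The usual R idiom for exposing a backend's parameters is to forward `...`. The sketch below shows only that idiom: `impute_bpca_stub()` is a stand-in so the example runs without pcaMethods, its parameters `n_pcs` and `max_steps` are invented, and `ms_impute_sketch()` is not the package's actual function.

```r
# Stand-in for the real BPCA backend. It only records the arguments it
# received so the forwarding can be inspected.
impute_bpca_stub <- function(data, n_pcs = 2, max_steps = 100) {
  list(data = data, n_pcs = n_pcs, max_steps = max_steps)
}

# Hypothetical front end: anything passed in `...` is handed on to the
# chosen backend, so users can override its defaults.
ms_impute_sketch <- function(data, method = c("bpca", "none"), ...) {
  method <- match.arg(method)
  switch(method,
    bpca = impute_bpca_stub(data, ...),
    none = list(data = data)
  )
}

res <- ms_impute_sketch(matrix(1:4, 2), method = "bpca", n_pcs = 3)
```

Here `n_pcs` is overridden by the caller while `max_steps` keeps the backend's default; the trade-off of `...` is that misspelled argument names surface as errors from the backend rather than from the front end.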

Ordering should be...

filter --> impute --> normalize???? Forgot to grab sticky note from Katerina's office...

ms_normalize parameter documentation

The n_comp and n_control parameters for ms_normalize are not well documented. The first is only for CRMN, while the second is passed to several functions. I will track down what both of these do (especially n_control) and document thoroughly, including arguments passed to functions in other packages.

choice of making metabolites or samples neighbors

This should be a matter of having a parameter (e.g. neighbor_type=c("metabolite", "sample")) in impute_knn. c5c0464 implies that this has been implemented but I don't see a user-controllable option. Need to check with @tuh8888 to find out current status.
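
As an illustration of how a `neighbor_type` argument could work, the sketch below implements a very simple nearest-neighbor imputation in base R and switches between the two modes by transposing the matrix. It is not the package's `impute_knn`; `impute_knn_sketch()` assumes rows are samples, columns are metabolites, and missing values are `NA`.

```r
# Hypothetical kNN imputation with a user-controllable neighbor type.
impute_knn_sketch <- function(x, k = 2,
                              neighbor_type = c("metabolite", "sample")) {
  neighbor_type <- match.arg(neighbor_type)
  # Arrange the matrix so that rows are the units to search among.
  m <- if (neighbor_type == "metabolite") t(x) else x
  d <- as.matrix(dist(m))  # dist() skips NA entries pairwise
  out <- m
  for (i in seq_len(nrow(m))) {
    miss <- which(is.na(m[i, ]))
    if (length(miss) == 0) next
    nb <- order(d[i, ])    # nearest first
    nb <- nb[nb != i]      # drop the row itself
    for (j in miss) {
      cand <- head(nb[!is.na(m[nb, j])], k)  # k nearest with a value
      if (length(cand) > 0) out[i, j] <- mean(m[cand, j])
    }
  }
  if (neighbor_type == "metabolite") t(out) else out
}

x <- matrix(c(1, 2, NA, 4,  2, 3, 4, 5,  10, 20, 25, 40), nrow = 4,
            dimnames = list(paste0("s", 1:4), paste0("m", 1:3)))
by_sample     <- impute_knn_sketch(x, k = 1, neighbor_type = "sample")
by_metabolite <- impute_knn_sketch(x, k = 1, neighbor_type = "metabolite")
```

In this toy matrix the missing value for sample `s3`, metabolite `m1` is filled from the nearest sample in one mode and from the nearest metabolite in the other, so the two modes give different results.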

Incorporate Dominik's notes into existing issues

MSPrep R package

Summary from meeting w/ Dominik:

Start with read_data
Sean's version (in sean branch, _sj()) has more code than the one in develop/master
Separating out tasks is the first key thing to do

Notes from meeting w/ Dominik:

Dominik uses:
- read_data_sj()
  - fn doesn't allow more than 3 technical replicates -- should allow Inf
  - fn summarizes, but this should be a separate fn -- probably summary()
  - option for whether data is already log-transformed or not
  - option to load/convert data existing in working environment
  - ID variable ("lcms...") is hardcoded, shouldn't be
batch correction --
- hasn't used
- should this be done before summaries, imputation, normalization
cvmax -
missing data typically represented by 0 or 1 (log(1)=0)

Notes

Grouping
- change name to 'sample_info'
- ensure that it accepts continuous variables (e.g. blood pressure could be a phenotype)
Check if sva and ruv have a 'covariate' option -- should be a user option whether sample_info should be passed to it.
Test users: Harrison, Sean Jacobson
Ask Sean for the Emory dataset (cc: Katerina)

Diagnostic and visualization functions

diagnosticgen()
graphimputations()

So far not adding these features, but consider doing so later.

Data transformed yes/no

Add option for data transformation -- should it be whether data is already transformed (if so do we need to know what that transformation is?) or should it be what transformation to apply to the data (transform = c(NULL, 'log', 'sqrt', etc.))?
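
If the second option is chosen (the argument names the transformation to apply), a sketch could look like the following. `ms_transform_sketch()` and the set of choices are assumptions. Note that `c(NULL, 'log', 'sqrt')` silently drops the `NULL`, so a string such as `"none"` is used here as the explicit no-op default.

```r
# Hypothetical transformation step with an explicit "none" default.
ms_transform_sketch <- function(x,
                                transform = c("none", "log", "log2", "sqrt")) {
  transform <- match.arg(transform)
  switch(transform,
    none = x,
    log  = log(x),
    log2 = log2(x),
    sqrt = sqrt(x)
  )
}

raw <- c(1, 4, 16)
untouched <- ms_transform_sketch(raw)
rooted    <- ms_transform_sketch(raw, "sqrt")
halved    <- ms_transform_sketch(raw, "log2")
```

Using `match.arg()` means an unsupported transformation name fails immediately with a message listing the valid choices.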
