kechrislab / msprep
A processing pipeline for the summarization, normalization and diagnostics of mass spectrometry–based metabolomics data.
Easiest task ever. :)
Remove old code files and check locations and gen scripts of datasets used in package and testing.
Should allow for none or some (any) user-defined number of replicates.
I checked Bioconductor and there were only 31 packages matching the "metabolomics" search term.
I removed some that were not obviously related. Attached is a list, but some of those may not be relevant either; I don't have time now to look at each of them. If it helps, we are looking at data from LC/MS, which is related to GC/MS and other mass spectrometry (MS) based methods but not to NMR methods.
The package I know better is 'xcms', which is widely used. The authors of 'metabomxtr' visited last year and we're in touch with them; they could be a good resource.
Other relevant packages from the Emory group that we work with:
http://web1.sph.emory.edu/apLCMS/
https://sourceforge.net/projects/xmsanalyzer/
Package Name | Authors | Description |
---|---|---|
biosigner | Philippe Rinaudo Etienne Thevenot | Signature discovery from omics data |
CAMERA | Steffen Neumann | Collection of annotation related methods for mass spectrometry data |
cosmiq | David Fischer Christian Panse | cosmiq - COmbining Single Masses Into Quantities |
IPO | Thomas Riebenbauer | Automated Optimization of XCMS Data Processing parameters |
MAIT | Francesc Fernandez-Albert | Statistical Analysis of Metabolomic Data |
Metab | Raphael Aggio | Metab: An R Package for a High-Throughput Analysis of Metabolomics Data Generated by GC-MS. |
metabomxtr | Michael Nodzenski | A package to run mixture models for truncated metabolomics data with normal or lognormal distributions |
metaMS | Ron Wehrens | MS-based metabolomics annotation pipeline |
MetCirc | Thomas Naake | Navigating mass spectral similarity in high-resolution MS/MS metabolomics data |
MWASTools | Andrea Rodriguez-Martinez Rafael Ayala | MWASTools: an integrated pipeline to perform metabolome-wide association studies |
mzR | Bernd Fischer Steffen Neumann Laurent Gatto Qiang Kou | parser for netCDF mzXML mzData and mzML and mzIdentML files (mass spectrometry data) |
OmicsMarkeR | Charles E. Determan Jr. | Classification and Feature Selection for 'Omics' Datasets |
PAPi | Raphael Aggio | Predict metabolic pathway activity based on metabolomics data |
pathview | Weijun Luo | a tool set for pathway based data integration and visualization |
Rdisop | Steffen Neumann | Decomposition of Isotopic Patterns |
RMassBank | RMassBank at Eawag | Workflow to process tandem MS files and build MassBank records |
ropls | Etienne A. Thevenot | PCA PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data |
SIMAT | Mo R. Nezami Ranjbar | GC-SIM-MS data processing and analysis tool |
statTarget | Hemi Luan | Statistical Analysis of Metabolite Profile |
xcms | Steffen Neumann | LC/MS and GC/MS Data Analysis |
yamss | Leslie Myint | Tools for high-throughput metabolomics |
Deprecate groupingvars arguments and functions if so.
Doesn't seem to be working across the entire pipeline.
All of these should have reasonable default values so that, for example, an experiment without batches treats all samples as belonging to a single batch. The default value should be NULL, with dummy variables filled in internally. Per @mmulvahill's suggestion, this needs to be done on a separate branch, and the downstream implications need to be thoroughly tested before merging.
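A minimal sketch of how NULL defaults could be filled in internally. The helper `fill_design_defaults`, its argument names, and the `.batch` / `.replicate` column names are hypothetical and not part of the package.

```r
# Hypothetical helper: replace NULL design variables with a single-level dummy
# column so downstream code can always assume batch and replicate columns exist.
fill_design_defaults <- function(data, batch = NULL, replicate = NULL) {
  if (is.null(batch)) {
    data$.batch <- "batch_1"       # every sample belongs to one batch
    batch <- ".batch"
  }
  if (is.null(replicate)) {
    data$.replicate <- "rep_1"     # no technical replicates
    replicate <- ".replicate"
  }
  list(data = data, batch = batch, replicate = replicate)
}

samples <- data.frame(sample_id = c("s1", "s2", "s3"),
                      abundance = c(10, 12, 9))
filled <- fill_design_defaults(samples)
```

Returning the column names alongside the data lets later steps treat user-supplied and dummy variables the same way.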
Definitely remove the file read -- should just use read.csv, read_csv, etc.
Want to be able to quickly provide interface for incorporating new methods with simple wrappers. Definitely for imputation methods, but probably other functions too.
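One possible shape for such an interface, as a sketch only: a small registry where each method is a plain function. The names `imputation_methods`, `register_imputation`, and `impute_with` are hypothetical.

```r
# Hypothetical registry: each imputation method is a function that takes a
# numeric matrix and returns a matrix of the same shape.
imputation_methods <- new.env()

register_imputation <- function(name, fun) {
  stopifnot(is.function(fun))
  assign(name, fun, envir = imputation_methods)
  invisible(name)
}

impute_with <- function(x, method, ...) {
  if (!exists(method, envir = imputation_methods, inherits = FALSE)) {
    stop("Unknown imputation method: ", method)
  }
  get(method, envir = imputation_methods)(x, ...)
}

# Wrapping a new method is a single call.
register_imputation("zero", function(x) {
  x[is.na(x)] <- 0
  x
})

m <- matrix(c(1, NA, 3, 4), nrow = 2)
zero_filled <- impute_with(m, "zero")
```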
When running msFilter on a SummarizedExperiment, the column names change from just sample_id to a string made of all the colData variables pasted together with "_" as the separator. How do I preserve the sample_id values of the original colnames without having to delete all of the colData?
Some values in the quant/spike dataset have colons: `3.2672226:1` and `3.2817035:2`. Is this a ratio that needs to be calculated before converting to numeric? Currently this column is a factor and is inconsistently implemented. Resolve in the `wide_matrix_to_data` function and `ms_prepare`.
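Until the meaning of the suffix is settled, the affected values can at least be flagged and split without losing information. This is a sketch using the two values quoted above plus one made-up ordinary value; it does not assume the colon denotes a ratio.

```r
# The first two values are quoted from the issue; "4.1" is an invented control.
raw <- factor(c("3.2672226:1", "3.2817035:2", "4.1"))

# Convert the factor to character first; as.numeric() on a factor
# returns the internal level codes, not the printed values.
values <- as.character(raw)
has_colon <- grepl(":", values, fixed = TRUE)

# Keep both parts; the suffix is stored, not interpreted.
parts <- strsplit(values, ":", fixed = TRUE)
main <- as.numeric(vapply(parts, `[`, character(1), 1))
suffix <- vapply(parts,
                 function(p) if (length(p) > 1) p[2] else NA_character_,
                 character(1))
```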
For reference, from the manuscript:
Missing Data: There are three primary modes of missing data in metabolomics datasets and each mode has different implications for subsequent analysis; therefore, different imputation routines and statistical methods are required and three are offered in the MSPrep package. The three modes are truly not present, present below the detectable limit of the instrument and absent owing to error in pre-processing algorithms. The MSPrep package implements three methods of managing missing data: (i) No imputation assumes the mode of missing is true zeros and therefore assigns the missing values as zeros. This dataset could be useful for PCA analysis, cluster analysis and methods that account for clustering at zero. Unless a stringent filter is applied, normalization routines may have poor performance, as most have assumptions about underlying distributions that are not valid with zero clustered data. (ii) The second option assumes missing compounds were below the detectable limit and imputes a value of one half of the minimum observed value for that compound (Xia et al., 2009). (iii) The final method is a call to the Bayesian PCA (BPCA) imputation algorithm (Oba et al., 2003) from the PCAMethods R package (Stacklies et al., 2007) and assumes that the compound is present but failed to be accurately detected. This algorithm estimates the missing value by a linear combination of principal axis vectors, where the parameters of the model are identified by a Bayesian estimation method and is not sensitive to the quantity of missing data.
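Option (ii) above can be sketched in a few lines of base R. This assumes missing values are coded as NA and that compounds are in columns; the function name `impute_half_min` is hypothetical and this is not the package's own implementation.

```r
# Half-minimum imputation: for each compound (column), replace missing values
# with one half of the smallest observed value for that compound.
impute_half_min <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  for (j in seq_len(ncol(x))) {
    missing <- is.na(x[, j])
    if (any(!missing)) {
      x[missing, j] <- min(x[!missing, j]) / 2
    }
  }
  x
}

# matrix() fills by column: cmpd_a = 8, NA, 4 and cmpd_b = 10, 6, NA.
abund <- matrix(c(8, NA, 4, 10, 6, NA),
                nrow = 3,
                dimnames = list(NULL, c("cmpd_a", "cmpd_b")))
imputed <- impute_half_min(abund)
```

A compound with no observed values is left as NA, since there is no minimum to halve.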
e.g., imputation steps
There are 3 or 4 versions -- get down to one set.
https://edwinth.github.io/blog/dplyr-recipes/
Also, consider using .data$varname in dplyr funs to avoid env scoping issues.
Currently `ms_impute` uses the default parameters for Bayesian PCA as implemented in `impute_bpca`. All parameters passed through `impute_bpca` to `pca` need to be available to the user in the arguments to `ms_impute`.
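One common way to expose such parameters is to forward `...` through each layer. The sketch below uses stand-in functions with a `_sketch` suffix; their argument names (`nPcs`, `maxSteps`) are illustrative assumptions, and nothing here calls the real package code.

```r
# Stand-in for the low-level fitting function; it only echoes its options.
pca_sketch <- function(x, nPcs = 2, maxSteps = 100) {
  list(nPcs = nPcs, maxSteps = maxSteps)
}

# Middle layer: forwards anything it does not use itself.
impute_bpca_sketch <- function(x, ...) {
  pca_sketch(x, ...)
}

# User-facing function: extra arguments travel down untouched.
ms_impute_sketch <- function(x, method = "bpca", ...) {
  method <- match.arg(method)
  impute_bpca_sketch(x, ...)
}

fit <- ms_impute_sketch(matrix(1:4, nrow = 2), nPcs = 3)
```

With this pattern the defaults live in one place, and any argument the user leaves out keeps the low-level default.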
filter --> impute --> normalize???? Forgot to grab sticky note from Katerina's office...
The `n_comp` and `n_control` parameters for `ms_normalize` are not well documented. The first is only for CRMN, while the second is passed to several functions. I will track down what both of these do (especially `n_control`) and document thoroughly, including arguments passed to functions in other packages.
MSPrep R package
Summary from meeting w/ Dominik:
Notes from meeting w/ Dominik:
`sample_info` should be passed to it. So far not adding these features, but consider doing so later.
Add option for data transformation -- should it be whether data is already transformed (if so do we need to know what that transformation is?) or should it be what transformation to apply to the data (transform = c(NULL, 'log', 'sqrt', etc.))?
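If the second reading is chosen (the argument names the transformation to apply), a sketch could look like the following. The function name `apply_transform` and the choice set are assumptions. Note that `c(NULL, 'log', 'sqrt')` silently drops the `NULL` in R, so an explicit `"none"` is used as the default here.

```r
# Hypothetical helper: apply a named transformation to abundance values.
apply_transform <- function(x, transform = c("none", "log", "log2", "sqrt")) {
  transform <- match.arg(transform)   # errors on anything outside the choices
  switch(transform,
    none = x,
    log  = log(x),
    log2 = log2(x),
    sqrt = sqrt(x)
  )
}

untouched <- apply_transform(c(1, 4, 9))
rooted <- apply_transform(c(1, 4, 9), "sqrt")
```

Recording the chosen value alongside the output would also answer the first reading, since later steps could then tell what transformation the data already carries.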