kechrislab / msprep

A processing pipeline for the summarization, normalization and diagnostics of mass spectrometry–based metabolomics data.

R 100.00%

msprep's People

Contributors

kechrisk, max-mcgrath, mmulvahill, nturaga, tuh8888

msprep's Issues

Clean up repo

Remove old code files and check locations and gen scripts of datasets used in package and testing.

Metabolomics and mass spec resources for Bioconductor incorporation

I checked Bioconductor and there were only 31 packages matching the "metabolomics" search term.
I removed some that were not obviously related. Attached is a list, but some of those may not be as relevant either - I don't have time now to look at each of them.

But if it helps, we are looking at data from LC/MS - which is related to GC/MS or other mass spectrometry (MS) based methods - but not related to NMR methods.

The package I know best is 'xcms', which is widely used. The authors of 'metabomxtr' visited last year and we're in touch with them; they could be a good resource.

Other relevant packages from the Emory group that we work with:
http://web1.sph.emory.edu/apLCMS/
https://sourceforge.net/projects/xmsanalyzer/

| Package Name | Authors | Description |
| --- | --- | --- |
| biosigner | Philippe Rinaudo, Etienne Thevenot | Signature discovery from omics data |
| CAMERA | Steffen Neumann | Collection of annotation related methods for mass spectrometry data |
| cosmiq | David Fischer, Christian Panse | cosmiq - COmbining Single Masses Into Quantities |
| IPO | Thomas Riebenbauer | Automated Optimization of XCMS Data Processing parameters |
| MAIT | Francesc Fernandez-Albert | Statistical Analysis of Metabolomic Data |
| Metab | Raphael Aggio | Metab: An R Package for a High-Throughput Analysis of Metabolomics Data Generated by GC-MS. |
| metabomxtr | Michael Nodzenski | A package to run mixture models for truncated metabolomics data with normal or lognormal distributions |
| metaMS | Ron Wehrens | MS-based metabolomics annotation pipeline |
| MetCirc | Thomas Naake | Navigating mass spectral similarity in high-resolution MS/MS metabolomics data |
| MWASTools | Andrea Rodriguez-Martinez, Rafael Ayala | MWASTools: an integrated pipeline to perform metabolome-wide association studies |
| mzR | Bernd Fischer, Steffen Neumann, Laurent Gatto, Qiang Kou | parser for netCDF, mzXML, mzData and mzML and mzIdentML files (mass spectrometry data) |
| OmicsMarkeR | Charles E. Determan Jr. | Classification and Feature Selection for 'Omics' Datasets |
| PAPi | Raphael Aggio | Predict metabolic pathway activity based on metabolomics data |
| pathview | Weijun Luo | a tool set for pathway based data integration and visualization |
| Rdisop | Steffen Neumann | Decomposition of Isotopic Patterns |
| RMassBank | RMassBank at Eawag | Workflow to process tandem MS files and build MassBank records |
| ropls | Etienne A. Thevenot | PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data |
| SIMAT | Mo R. Nezami Ranjbar | GC-SIM-MS data processing and analysis tool |
| statTarget | Hemi Luan | Statistical Analysis of Metabolite Profile |
| xcms | Steffen Neumann | LC/MS and GC/MS Data Analysis |
| yamss | Leslie Myint | Tools for high-throughput metabolomics |

What if a user does not have replicates, spike-ins or batches?

All of these should have reasonable default values so that, for example, an experiment without batches treats all samples as belonging to a single batch. The default value of each argument should be NULL, with dummy variables filled in internally. Per @mmulvahill's suggestion, this needs to be done on a separate branch and the downstream implications thoroughly tested before merging.
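A minimal sketch of the NULL-default idea, in base R only. The function name `fill_design_defaults` and the dummy column names `.batch` and `.replicate` are hypothetical and are not part of the package; spike-ins are left out to keep the example short.

```r
# Hypothetical helper: fill in dummy design variables when the user names none.
fill_design_defaults <- function(data, batch = NULL, replicate = NULL) {
  stopifnot(is.data.frame(data))
  if (is.null(batch)) {
    # No batch column named: treat every sample as belonging to one batch.
    data[[".batch"]] <- factor(rep("1", nrow(data)))
    batch <- ".batch"
  }
  if (is.null(replicate)) {
    # No replicate column named: every row is a single replicate.
    data[[".replicate"]] <- factor(rep("1", nrow(data)))
    replicate <- ".replicate"
  }
  # Columns named by the user must actually exist.
  missing_cols <- setdiff(c(batch, replicate), names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in data: ", paste(missing_cols, collapse = ", "))
  }
  list(data = data, batch = batch, replicate = replicate)
}

samples <- data.frame(subject_id = c("s1", "s2", "s3"),
                      abundance = c(10.2, 8.7, 11.5))
prepared <- fill_design_defaults(samples)
```

Downstream steps could then always refer to `prepared$batch` and `prepared$replicate`, whether or not the user supplied them.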

Modularize functions

We want to be able to quickly provide an interface for incorporating new methods through simple wrappers. This is definitely needed for imputation methods, and probably for other functions too.
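One way this could look is a small registry of wrapper functions, sketched below in base R. Everything here (`imputation_methods`, `register_imputation`, `impute`, and the two example methods) is hypothetical and only illustrates the wrapper pattern; it is not the package's current interface.

```r
# Hypothetical registry mapping a method name to an imputation function.
# Each function takes a numeric matrix (NA = missing) and returns a matrix
# of the same shape.
imputation_methods <- new.env()

register_imputation <- function(name, fn) {
  stopifnot(is.character(name), length(name) == 1L, is.function(fn))
  assign(name, fn, envir = imputation_methods)
  invisible(name)
}

impute <- function(x, method, ...) {
  if (!exists(method, envir = imputation_methods, inherits = FALSE)) {
    stop("Unknown imputation method: ", method)
  }
  fn <- get(method, envir = imputation_methods, inherits = FALSE)
  fn(x, ...)
}

# Wrapper for a trivial method: replace missing values with zero.
register_imputation("zero", function(x) {
  x[is.na(x)] <- 0
  x
})

# Wrapper for column-wise half-minimum imputation.
register_imputation("halfmin", function(x) {
  apply(x, 2, function(col) {
    col[is.na(col)] <- min(col, na.rm = TRUE) / 2
    col
  })
})

abundance <- matrix(c(4, NA, 6, 10, 8, NA), nrow = 3)
imputed <- impute(abundance, "halfmin")
```

Adding a new method is then a single `register_imputation()` call, and `impute()` itself does not change.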

MSPrep

When running msFilter on a SummarizedExperiment, the column names change from just sample_id to a name built from all of the colData variables pasted together with "_" as the separator. How do I preserve the original sample_id column names without having to delete all colData?

Clarify RT variable type

Some values in the quant/spike dataset have colons: 3.2672226:1 and 3.2817035:2. Is this a ratio that needs to be calculated before converting to numeric? Currently this column is a factor and is inconsistently implemented. Resolve in wide_matrix_to_data function and ms_prepare.
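Until the meaning of the colon is settled, the two parts can at least be separated without losing information. The sketch below uses a hypothetical helper, `split_rt`, and deliberately makes no assumption about whether the suffix is a ratio denominator or an index. It also converts the factor through character first, because `as.numeric()` applied directly to a factor returns the internal level codes rather than the printed values.

```r
# Hypothetical helper: split values such as "3.2672226:1" at the colon.
# It makes no claim about what the suffix means; it only keeps both parts.
split_rt <- function(rt) {
  rt <- as.character(rt)  # factors must go through character first
  parts <- strsplit(rt, ":", fixed = TRUE)
  value <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
  suffix <- vapply(parts, function(p) {
    if (length(p) > 1) as.integer(p[2]) else NA_integer_
  }, integer(1))
  data.frame(rt = value, suffix = suffix)
}

rt_raw <- factor(c("3.2672226:1", "3.2817035:2", "4.1"))
rt_parsed <- split_rt(rt_raw)
```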

Imputation questions

@kechrisk

  1. One imputation method is the half-minimum approach, where we impute 0's with half the minimum value for that compound. Do we actually want to use the minimum across all patients & replicates, or should we do this using the minimum value within each patient?
  2. In the old code, when BPCA imputes a number < 0, we replace that negative value with the half-minimum imputation -- resulting in a combined BPCA/half-min imputation method. Is this what we want to do? Or should these negative values be considered 'true missing'? Or, should these be two separate options? (1. BPCA, assumed 0 if neg. and 2. BPCA, assumed below threshold)
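To make the two readings of question 1 concrete, here is a small base R sketch for a single compound. The function `impute_half_min` and its `group` argument are hypothetical; zeros are treated as missing, as in the question.

```r
# Half-minimum imputation for one compound; zeros are treated as missing.
# `group = NULL` uses the minimum over all observations; otherwise the
# minimum is taken within each level of `group` (e.g. subject).
impute_half_min <- function(x, group = NULL) {
  half_min <- function(v) {
    observed <- v[!is.na(v) & v > 0]
    if (length(observed) == 0) return(v)  # nothing observed: leave untouched
    v[is.na(v) | v == 0] <- min(observed) / 2
    v
  }
  if (is.null(group)) return(half_min(x))
  ave(x, group, FUN = half_min)
}

abundance <- c(8, 0, 6, 20, 0, 30)
subject   <- c("a", "a", "a", "b", "b", "b")

overall     <- impute_half_min(abundance)           # both zeros become 3
per_subject <- impute_half_min(abundance, subject)  # 3 for subject a, 10 for b
```

The example shows how much the choice matters: subject b's missing value is imputed as 3 under the overall minimum but as 10 under the within-subject minimum.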

For reference, from the manuscript:

Missing Data: There are three primary modes of missing data in metabolomics datasets and each mode has different implications for subsequent analysis; therefore, different imputation routines and statistical methods are required and three are offered in the MSPrep package. The three modes are truly not present, present below the detectable limit of the instrument and absent owing to error in pre-processing algorithms. The MSPrep package implements three methods of managing missing data: (i) No imputation assumes the mode of missing is true zeros and therefore assigns the missing values as zeros. This dataset could be useful for PCA analysis, cluster analysis and methods that account for clustering at zero. Unless a stringent filter is applied, normalization routines may have poor performance, as most have assumptions about underlying distributions that are not valid with zero clustered data. (ii) The second option assumes missing compounds were below the detectable limit and imputes a value of one half of the minimum observed value for that compound (Xia et al., 2009). (iii) The final method is a call to the Bayesian PCA (BPCA) imputation algorithm (Oba et al., 2003) from the PCAMethods R package (Stacklies et al., 2007) and assumes that the compound is present but failed to be accurately detected. This algorithm estimates the missing value by a linear combination of principal axis vectors, where the parameters of the model are identified by a Bayesian estimation method and is not sensitive to the quantity of missing data.

BPCA parameters in ms_impute

Currently ms_impute uses the default parameters for Bayesian PCA as implemented in impute_bpca. All parameters passed through impute_bpca to pca need to be available to the user in the arguments to ms_impute.
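One common way to expose backend parameters is to forward `...` through each layer. The sketch below uses a stand-in backend, `backend_bpca`, with made-up argument names (`n_pcs`, `max_steps`); it is not the real BPCA interface and only shows how the forwarding would work.

```r
# Stand-in for the BPCA backend; NOT the real interface.
# It only records the arguments it receives so the forwarding can be inspected.
backend_bpca <- function(x, n_pcs = 2, max_steps = 100) {
  list(n_pcs = n_pcs, max_steps = max_steps, dim = dim(x))
}

# Hypothetical wrapper: anything in `...` is handed to the backend unchanged.
impute_with_bpca <- function(x, ...) {
  backend_bpca(x, ...)
}

x <- matrix(1:6, nrow = 3)
defaults <- impute_with_bpca(x)
custom   <- impute_with_bpca(x, n_pcs = 3, max_steps = 50)
```

With this pattern the defaults stay in one place (the backend), and the user-facing function does not need to be updated when the backend gains a parameter.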

Ordering should be...

filter --> impute --> normalize???? Forgot to grab sticky note from Katerina's office...

ms_normalize parameter documentation

The n_comp and n_control parameters for ms_normalize are not well documented. The first is only for CRMN, while the second is passed to several functions. I will track down what both of these do (especially n_control) and document thoroughly, including arguments passed to functions in other packages.

Incorporate Dominik's notes into existing issues

MSPrep R package

Summary from meeting w/ Dominik:

  • Start with read_data
  • Sean's version (in sean branch, _sj()) has more code than the one in
    develop/master
  • Separating out tasks is the first key thing to do

Notes from meeting w/ Dominik:

  • Dominik uses:
    • read_data_sj()
      • fn doesn't allow more than 3 technical replicates -- should allow Inf
      • fn summarizes, but this should be a separate fn -- probably summary()
      • option for whether data is already log-xfrmed or not
      • option to load/convert data existing in working environment
      • ID variable ("lcms...") is hardcoded, shouldn't be
  • batch correction --
    • hasn't used
    • should this be done before summaries, imputation, normalization?
  • cvmax -
  • missing data typically represented by 0 or 1 (log(1)=0)
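A rough sketch of what a stand-alone summarization step could look like for one compound, with no limit on the number of technical replicates and a configurable missing-data code. The function name `summarize_replicates` is hypothetical, and taking the plain mean of the observed replicates is an assumption made for illustration; it does not reproduce the existing cvmax logic.

```r
# Hypothetical stand-alone summarization step for one compound.
# `missing_value` is the code used for missing data (0 on the raw scale here).
summarize_replicates <- function(abundance, subject, missing_value = 0) {
  abundance[which(abundance == missing_value)] <- NA
  # tapply handles any number of replicates per subject.
  tapply(abundance, subject, function(v) {
    if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
  })
}

abundance <- c(10, 12, 0, 14, 5, 7)
subject   <- c("a", "a", "a", "a", "b", "b")
summary_by_subject <- summarize_replicates(abundance, subject)
```

Here subject a has four replicates, one of them coded as missing, and subject b has two; both are handled by the same call.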

Notes

  • Grouping
    • change name to 'sample_info'
    • ensure that it takes continuous (e.g. blood pressure could be a phenotype)
  • Check if sva and ruv have a 'covariate' option -- should be a user option whether sample_info should be passed to it.
  • Test users: Harrison, Sean Jacobson
  • Ask Sean for the Emory dataset (cc: Katerina)

Data transformed yes/no

Add option for data transformation -- should it be whether data is already transformed (if so do we need to know what that transformation is?) or should it be what transformation to apply to the data (transform = c(NULL, 'log', 'sqrt', etc.))?
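A sketch of the second option (naming the transformation to apply), using `match.arg()`. Note that `c(NULL, 'log', 'sqrt')` drops the `NULL` in R, so a character choice such as "none" is used here instead. The function name `apply_transform` and the set of choices are hypothetical.

```r
# Hypothetical interface: `transform` names the transformation to apply;
# "none" means the data are used as supplied (e.g. already transformed).
apply_transform <- function(x, transform = c("none", "log", "log2", "sqrt")) {
  transform <- match.arg(transform)
  switch(transform,
         none = x,
         log  = log(x),
         log2 = log2(x),
         sqrt = sqrt(x))
}

apply_transform(c(1, 4, 16), "sqrt")  # 1 2 4
apply_transform(c(1, 4, 16))          # unchanged: the first choice is the default
```

An unrecognized name is rejected by `match.arg()` with an error listing the valid choices, so typos fail early.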
