computational-metabolomics / structtoolbox

R/Bioconductor package - STRUCT (STatistics in R Using Class Templates) Toolbox

Home Page: https://computational-metabolomics.github.io/structToolbox/

License: GNU General Public License v3.0

R 100.00%

metabolomics statistics multivariate-analysis univariate lc-ms dims bioconductor-package r-package machine-learning

structtoolbox's Introduction

structToolbox

An extensive set of data (pre-)processing and analysis methods and tools for metabolomics and other omics, with a strong emphasis on statistics and machine learning.

This toolbox allows the user to build extensive and standardised workflows for data analysis. The methods and tools have been implemented using class-based templates provided by the struct (Statistics in R Using Class-based Templates) package. The toolbox includes pre-processing methods (e.g. signal drift and batch correction, normalisation, missing value imputation and scaling), univariate (e.g. ttest, various forms of ANOVA, Kruskal–Wallis test and more) and multivariate statistical methods (e.g. PCA and PLS, including cross-validation and permutation testing) as well as machine learning methods (e.g. Support Vector Machines). The STATistics Ontology (STATO) has been integrated and implemented to provide standardised definitions for the different methods, inputs and outputs.

Installation

To install this package:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("structToolbox")

To install the development version:

if (!require("remotes", quietly = TRUE))
    install.packages("remotes")

remotes::install_github("computational-metabolomics/structToolbox")

structtoolbox's Issues

RSD filter incorrectly states it removes features below the threshold

The filter actually removes features above the threshold
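
For reference, the behaviour described here can be sketched in base R. This is an illustrative sketch only, not the structToolbox implementation; the function name `rsd_keep`, the 20% threshold and the toy data are assumptions made for the example.

```r
# Relative standard deviation (RSD, %) per feature: 100 * sd / mean.
# Features whose RSD is ABOVE the threshold are removed; the rest are kept.
rsd_keep <- function(x, threshold = 20) {
  # x: numeric matrix, samples in rows, features in columns
  rsd <- 100 * apply(x, 2, sd, na.rm = TRUE) / colMeans(x, na.rm = TRUE)
  keep <- !is.na(rsd) & rsd <= threshold
  x[, keep, drop = FALSE]
}

# Toy QC data (hypothetical values)
qc <- cbind(
  stable = c(100, 102, 98, 101),  # low RSD, kept
  noisy  = c(100, 160, 40, 120)   # high RSD, removed
)
filtered <- rsd_keep(qc, threshold = 20)
colnames(filtered)
```

Documentation that says "removes features with RSD above the threshold" would match this logic.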

the alpha parameter for fold change isn't used

"Significance" is applied based on the threshold, and alpha is never used. Maybe it could be used to specify the confidence interval instead?

allow fixed value for glog_transform

Currently using model_train or model_apply tries to find an optimal value for lambda. In some cases a predefined value might be needed.
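
To illustrate what a fixed value would mean, here is a minimal base R sketch of one common form of the generalised log with a user-supplied lambda. The exact formula used by the toolbox may differ; the function name `glog` and the value of `fixed_lambda` are assumptions for the example.

```r
# One common form of the generalised log (glog) transform.
# lambda is supplied by the caller instead of being optimised.
glog <- function(x, lambda) {
  log(x + sqrt(x^2 + lambda))
}

x <- c(0, 1, 10, 1000)
fixed_lambda <- 1e4            # hypothetical predefined value
transformed <- glog(x, fixed_lambda)
transformed
```

Unlike a plain log, this form is defined at zero, where it equals `0.5 * log(lambda)`.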

Error occurs when plotting PLS feature importance

First of all, thanks for your excellent work. I have learned a lot of skills here.

I can reproduce all the results with the demo datasets. However, when I repeat the analysis with my own data, the following error occurs:

test.zip

library(structToolbox)
load("test.Rdata")

mD <- DatasetExperiment(name = "mD", 
                        data = mconc, 
                        sample_meta = mmeta,
                        variable_meta = data.frame(variable = colnames(mconc),row.names =  colnames(mconc) ) ,
                        description = "The test"
                       )
mD$sample_meta$FN <- as.factor(mD$sample_meta$FN)


P = PLSDA(number_components = 2, factor_name= "FN" )

# apply the model
P = model_apply(P,mD)

C = pls_scores_plot(components=c(1,2),factor_name = "FN" )
chart_plot(C,P)



# prepare chart
C = pls_vip_plot(ycol = 'HE')
g1 = chart_plot(C,P)

The error message is:

Error in `$<-.data.frame`(`*tmp*`, "feature", value = c("pos.M68T423", :
replacement has 1000 rows, data has 0
6. stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
   "replacement has %d rows, data has %d"), N, nrows), domain = NA)
5. `$<-.data.frame`(`*tmp*`, "feature", value = c("pos.M68T423",
   "pos.M69T424", "pos.M70T363", "pos.M70T442", "pos.M70T535", "pos.M70T258",
   "pos.M71T363", "pos.M72T349", "pos.M73T351", "pos.M74T449", "pos.M74T415",
   "pos.M74T57", "pos.M74T129", "pos.M74T403", "pos.M74T602", "pos.M75T415", ...
4. `$<-`(`*tmp*`, "feature", value = c("pos.M68T423", "pos.M69T424",
   "pos.M70T363", "pos.M70T442", "pos.M70T535", "pos.M70T258", "pos.M71T363",
   "pos.M72T349", "pos.M73T351", "pos.M74T449", "pos.M74T415", "pos.M74T57",
   "pos.M74T129", "pos.M74T403", "pos.M74T602", "pos.M75T415", "pos.M76T415", ...
3. .local(obj, dobj, ...)
2. chart_plot(C, P)
1. chart_plot(C, P)

My R environment:

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] remotes_2.4.2.1      xlsx_0.6.5           pmp_1.14.0           BiocFileCache_2.10.1 dbplyr_2.4.0         cowplot_1.1.1        structToolbox_1.14.0 struct_1.14.0        lubridate_1.9.3     
[10] forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.4          tidyr_1.3.0          tibble_3.2.1         ggplot2_3.4.4        tidyverse_2.0.0     
[19] omu_1.1.1           

loaded via a namespace (and not attached):
  [1] rstudioapi_0.15.0           magrittr_2.0.3              farver_2.1.1                zlibbioc_1.48.0             vctrs_0.6.4                 memoise_2.0.1              
  [7] RCurl_1.98-1.13             rstatix_0.7.2               S4Arrays_1.2.0              itertools_0.1-3             missForest_1.5              curl_5.1.0                 
 [13] broom_1.0.5                 SparseArray_1.2.2           pROC_1.18.5                 caret_6.0-94                FSA_0.9.5                   parallelly_1.36.0          
 [19] desc_1.4.2                  plyr_1.8.9                  impute_1.76.0               cachem_1.0.8                lifecycle_1.0.4             iterators_1.0.14           
 [25] pkgconfig_2.0.3             Matrix_1.6-3                R6_2.5.1                    fastmap_1.1.1               GenomeInfoDbData_1.2.11     MatrixGenerics_1.14.0      
 [31] future_1.33.0               digest_0.6.33               pcaMethods_1.94.0           colorspace_2.1-0            ps_1.7.5                    S4Vectors_0.40.1           
 [37] rprojroot_2.0.4             GenomicRanges_1.54.1        RSQLite_2.3.3               filelock_1.0.2              labeling_0.4.3              randomForest_4.7-1.1       
 [43] fansi_1.0.5                 timechange_0.2.0            httr_1.4.7                  abind_1.4-5                 compiler_4.3.2              rngtools_1.5.2             
 [49] bit64_4.0.5                 withr_2.5.2                 backports_1.4.1             carData_3.0-5               DBI_1.1.3                   pkgbuild_1.4.2             
 [55] MASS_7.3-60                 lava_1.7.3                  DelayedArray_0.28.0         ModelMetrics_1.2.2.2        tools_4.3.2                 future.apply_1.11.0        
 [61] nnet_7.3-19                 glue_1.6.2                  callr_3.7.3                 nlme_3.1-163                grid_4.3.2                  reshape2_1.4.4             
 [67] generics_0.1.3              recipes_1.0.8               gtable_0.3.4                tzdb_0.4.0                  class_7.3-22                data.table_1.14.8          
 [73] hms_1.1.3                   sp_2.1-1                    car_3.1-2                   utf8_1.2.4                  XVector_0.42.0              BiocGenerics_0.48.1        
 [79] foreach_1.5.2               pillar_1.9.0                rJava_1.0-6                 splines_4.3.2               lattice_0.22-5              survival_3.5-7             
 [85] bit_4.0.5                   tidyselect_1.2.0            knitr_1.45                  gridExtra_2.3               IRanges_2.36.0              SummarizedExperiment_1.32.0
 [91] stats4_4.3.2                xfun_0.41                   pls_2.8-3                   Biobase_2.62.0              hardhat_1.3.0               timeDate_4022.108          
 [97] matrixStats_1.1.0           stringi_1.8.1               xlsxjars_0.6.1              codetools_0.2-19            BiocManager_1.30.22         cli_3.6.1                  
[103] ontologyIndex_2.11          rpart_4.1.21                processx_3.8.2              munsell_0.5.0               Rcpp_1.0.11                 GenomeInfoDb_1.38.1        
[109] globals_0.16.2              parallel_4.3.2              ggfortify_0.4.16            gower_1.0.1                 blob_1.2.4                  prettyunits_1.2.0          
[115] doRNG_1.8.6                 bitops_1.0-7                listenv_0.9.0               ggthemes_4.2.4              viridisLite_0.4.2           ipred_0.9-14               
[121] scales_1.2.1                prodlim_2023.08.28          crayon_1.5.2                rlang_1.1.2

HCA dendrogram plotting group colours in wrong positions

filter_na_count documentation is confusing

For filter_na_count, the documentation says that threshold is "the maximum number of NA allowed per level of factor_name". I find this confusing: I think the value you put there is actually the minimum number of samples the group should be represented by, but the phrasing suggests it is the number of NA values you allow.

filter_by_name doesn't work with model_train

model_apply works but model_train and model_predict do not

HSD parameter "unbalanced" never gets used

Changing it has no effect because it isn't passed to the HSD function from the agricolae package.

pqn factor_name has no description

The default empty entity object is used for factor_name and not updated with name description etc.

pca_scores_plot: incorrect grey data ellipse when using two factors

The grey ellipse usually encompasses all points regardless of factor level. When using two factors the grey ellipse is plotted for each shape level instead of for all data points.

make d-statistic/hotelling T2 a model object so that it can be used for filtering

This is currently only available as a chart object for PCA objects, so it's not possible to use it for filtering.

add "mean" method for calculating fold change

Currently only "median" and "geometric" are supported.

t-test fails when the factor_name column is characters instead of a factor

Ideally it would automatically convert to a factor, like ANOVA does when using a formula.
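
Until the conversion is automatic, a workaround is to convert the column yourself before running the test. The sketch below uses a plain data frame as a stand-in for the sample metadata; the column name `group` and the values are assumptions for the example. On a DatasetExperiment the same assignment would target `D$sample_meta`.

```r
# Stand-in for the sample metadata of a dataset (hypothetical values)
sample_meta <- data.frame(
  group = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Convert a character column to a factor only if it is not one already
if (!is.factor(sample_meta$group)) {
  sample_meta$group <- factor(sample_meta$group)
}

levels(sample_meta$group)
```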

model_apply fails for a model sequence when seq_in is not equal to 'data'

For model sequences, seq_in must equal 'data' or model_apply fails. Example error when seq_in = 'names' for filter_by_name as the first step in a sequence:

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘model_apply’ for signature ‘"model_seq", "filter_by_name"’

provide documentation for struct object outputs

All input parameters are documented, but outputs are not

Possibilities to plot 3d PCA ?

Hi!
First of all, many thanks for this great piece of software!

I wondered how I could plot a 3D PCA using structToolbox?

I have a PCA model:

PCA(number_components = 10)

However, if I try to access more than two components, such as here:

C = pca_scores_plot(components = c(1,2,3), factor_name = 'brainregion')

I get the following error:

Error in validObject(obj) : 
  invalid class “entity” object: Components to plot: number of values must be less than "max_length"

How could I plot such an object?

Many thanks
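
The validity error suggests that the components parameter accepts at most two values. One possible workaround, sketched here outside structToolbox with base R only, is to view the first three components as a matrix of pairwise 2D scores plots. The built-in iris data stands in for the real dataset, and `prcomp` is used in place of the toolbox PCA object, so this is an illustration of the idea rather than a toolbox feature.

```r
# PCA on the four numeric iris columns (stand-in data)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Scores for the first three principal components
scores <- pca$x[, 1:3]

# Pairwise scatter plots of PC1, PC2 and PC3, coloured by group
pairs(scores, col = as.integer(iris$Species), pch = 19)
```

Within the toolbox itself, plotting components two at a time, for example c(1,2) and then c(1,3), should stay within the two-value limit.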

report reasons for NA in ANOVA / ttest

For example, ANOVA returns NA for a feature if there are fewer than 3 samples in a group. Instead of, or as well as, returning NA, provide the reason, for easier debugging and reporting.
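
A minimal base R sketch of the idea follows. It is not how structToolbox computes ANOVA; the function name `anova_with_reason`, the `reason` column and the toy data are assumptions, and the minimum group size of 3 is taken from the example above.

```r
# One-way ANOVA for a single feature that also reports WHY a result is NA.
anova_with_reason <- function(y, group, min_n = 3) {
  group <- factor(group)
  # number of non-missing values in each group
  n_per_group <- tapply(!is.na(y), group, sum)
  small <- names(n_per_group)[n_per_group < min_n]
  if (length(small) > 0) {
    reason <- sprintf("fewer than %d non-missing samples in group(s): %s",
                      min_n, paste(small, collapse = ", "))
    return(data.frame(p_value = NA_real_, reason = reason))
  }
  fit <- aov(y ~ group)
  p <- summary(fit)[[1]][["Pr(>F)"]][1]
  data.frame(p_value = p, reason = NA_character_)
}

group <- rep(c("A", "B", "C"), each = 4)
y_ok  <- c(1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2, 3.0, 3.1, 2.9, 3.2)
y_bad <- y_ok
y_bad[5:7] <- NA   # group B is left with a single value

res_ok  <- anova_with_reason(y_ok, group)
res_bad <- anova_with_reason(y_bad, group)
res_bad$reason
```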

log transform always uses base 10

L = log_transform(base=2)
L$base

10

report number of features used to compute the median for PQN

PQN currently only includes features with zero NA when computing the median. The number of features this applies to should be reported, as sometimes a very small number of features might be used and then result in a poor estimate of the coefficient.
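
The sketch below shows how such a count could be reported alongside the coefficients. It is a simplified base R illustration, not the toolbox code: the function name `pqn_with_count` is hypothetical, and the reference is assumed to be the per-feature median over all samples, which may differ from the reference the toolbox uses.

```r
# Probabilistic quotient normalisation that reports how many features
# (those with zero NA) were used to estimate the coefficients.
pqn_with_count <- function(x) {
  # x: numeric matrix, samples in rows, features in columns
  complete  <- colSums(is.na(x)) == 0
  xc        <- x[, complete, drop = FALSE]
  reference <- apply(xc, 2, median)            # reference value per feature
  quotients <- sweep(xc, 2, reference, "/")    # sample / reference
  coefficient <- apply(quotients, 1, median)   # one coefficient per sample
  list(
    normalised      = x / coefficient,         # each row divided by its coefficient
    coefficient     = coefficient,
    n_features_used = sum(complete)
  )
}

x <- rbind(
  s1 = c(10, 20, 30, NA),
  s2 = c(20, 40, 60, 8),
  s3 = c(5, 10, 15, 2)
)
colnames(x) <- paste0("f", 1:4)

out <- pqn_with_count(x)
out$n_features_used
```

A warning when `n_features_used` falls below some chosen minimum would flag the poor estimates mentioned above.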

add selectivity ratio as an output for PLS objects

They are already computed by the pls package but they are not included in the outputs of the structToolbox PLS model object.

mean of medians not working correctly

The difference applied is always 0.

the threshold for fold change using "median" isn't applied correctly

It seems to use log(threshold), which is correct for geometric but not for median.

ask user to install missing packages when object is created

Currently an error is thrown when trying to train or apply a model if a required package is missing. It would be better if the user was alerted earlier, e.g. when the object is created, and asked if they'd like to install it.
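
A minimal sketch of an early check, using only base R. The function name `check_required_packages` is hypothetical and this is not part of structToolbox; it only shows how missing packages could be detected at construction time with `requireNamespace`.

```r
# Return the required packages that are not installed, warning if any are missing.
check_required_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    warning("Missing required package(s): ",
            paste(missing_pkgs, collapse = ", "),
            ". Install them before training or applying this model.",
            call. = FALSE)
  }
  invisible(missing_pkgs)
}

# 'stats' ships with R, so nothing is reported as missing here
missing_now <- check_required_packages("stats")
length(missing_now)
```

In an interactive session the warning could be replaced by a prompt offering to install the missing packages.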

fold change uses control group in the numerator

In fold_change objects the control_group parameter can be provided. However, the calculation uses this group in the numerator instead of the denominator (as shown by the column names) so that the calculated fold changes are not 'relative to the control' group as might be expected.
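
For comparison, a fold change "relative to the control" places the control group in the denominator. The base R sketch below illustrates that convention only; the function name `fold_change_vs_control` and the data are assumptions, and group means are used for brevity rather than the "median" or "geometric" methods of the toolbox.

```r
# Fold change of each non-control group relative to the control group:
# mean(group) / mean(control), i.e. control in the denominator.
fold_change_vs_control <- function(x, group, control_group) {
  stopifnot(control_group %in% group)
  control <- mean(x[group == control_group])
  others  <- setdiff(unique(group), control_group)
  vapply(others, function(g) mean(x[group == g]) / control, numeric(1))
}

x     <- c(10, 10, 10, 20, 20, 20)
group <- rep(c("control", "treated"), each = 3)

fc <- fold_change_vs_control(x, group, control_group = "control")
fc
```

With this convention a value above 1 means the feature is higher in the treated group than in the control.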

M = tSNE()
predicted(M)

result:
"tnse"

This causes tSNE to fail when it is part of a sequence, because

predicted(M)

Error in slot(obj, name) : 
  no slot of name "tsne" for this object of class "tSNE"

To work around this the predicted slot can be set when creating the object:

M = tSNE(predicted = 'Y')

corr_coef fails when using with only one factor

minimal example:


# example data
D=iris_DatasetExperiment()

# add continuous factor for correlation
D$sample_meta$example1=rnorm(nrow(D))
D$sample_meta$example2=rnorm(nrow(D))

M = corr_coef(factor_names = 'example1')
M = model_apply(M,D)

Fails with

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  row names supplied are of the wrong length

However, it runs as expected with factor_names = c("example1", "example2")
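
One common cause of this kind of error in R, offered here as an unverified guess rather than a diagnosis of corr_coef, is that selecting a single column from a data frame drops it to a plain vector, so later code that expects a data frame with row names fails. The sketch below shows the difference using a stand-in data frame.

```r
# Stand-in for sample metadata with two continuous factors
meta <- data.frame(example1 = rnorm(5), example2 = rnorm(5))

# Selecting one column drops to a numeric vector by default
one_dropped <- meta[, "example1"]

# drop = FALSE keeps a 5 x 1 data frame, matching the two-column case
one_kept <- meta[, "example1", drop = FALSE]

class(one_dropped)
class(one_kept)
```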

HSD doesn't handle missing groups well

In a one way ANOVA design, if a feature has missing values for an entire group this is not picked up by HSD and then the p-values are recycled over the groups.
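
A check like the following could detect the situation before HSD is run. It is a base R sketch with a hypothetical function name, `groups_all_missing`, and toy data; it does not reflect how the toolbox or the agricolae package handles this case.

```r
# Return the names of groups in which every value of the feature is NA.
groups_all_missing <- function(y, group) {
  group <- factor(group)
  all_na <- tapply(is.na(y), group, all)
  names(all_na)[all_na]
}

group <- rep(c("A", "B", "C"), each = 3)
y <- c(1.1, 0.9, 1.0, NA, NA, NA, 2.0, 2.2, 1.9)

empty_groups <- groups_all_missing(y, group)
empty_groups
```

Comparisons involving any group returned here could then be reported as NA instead of receiving recycled p-values.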

add PLS scores plot for regression models

Currently only PLSDA models are supported.