GithubHelp home page GithubHelp logo

gtsummary-smds's Introduction

Adding SMDs to gtsummary tables with 3 or more groups

Zheer Kejlberg Al-Mashhadi 2023-11-04

Functions to add SMDs with explanations and examples of use

In this document, the necessary functions are defined and explained. First, wrapper functions are defined for the core functionality (e.g., core_smd_function and applied_smd_function). Finally, four separate “caller” functions are created, each of which call the core functions in distinct ways for distinct use cases.



Defining the functions


1) Load dependencies

library(gtsummary) # For creating a baseline characteristics table
library(tidyverse) # For data wrangling and misc.
library(smd) # for calculating the SMDs
library(purrr) # for vectorised functions

2) Create the core functionality, via core_smd_function(), for taking the data and outputting the SMD results

core_smd_function <- function(data, is_weighted, location, ref_group, ci, decimals) {
  # MAKE A TABLE OF EVERY POSSIBLE COMBO OF TWO DIFFERENT GROUPS
  groups <- factor(unique(data$by))
  pairs <- expand.grid(groups, groups) %>%
    arrange(as.integer(.data$Var1), as.integer(.data$Var2)) %>% 
    filter(Var1 != Var2) %>%
    filter(!duplicated(
      paste0(pmax(as.character(Var1), as.character(Var2)), 
             pmin(as.character(Var1), as.character(Var2)))))
  if (ref_group) { # IF ref_group, KEEP ONLY PAIRS CONTAINING THE REF GROUP
    pairs <- pairs %>% filter(Var1 == first(levels(groups)))
  }
  
  # CREATE COLUMN NAMES FOR EACH CALCULATED SMD
  create_colname <- function(pair) {
    filtered_data <- data %>% filter(by %in% pair) %>% mutate(by = factor(by))
    paste0("SMD: ", levels(filtered_data$by)[1], " vs. ", levels(filtered_data$by)[2])
  }
  comparisons <- apply(pairs, 1, create_colname)
  if (location == "level") { comparisons <- paste0(comparisons, " ") }
  
  # CREATE SUBSETS OF DATA
  subsetting <- function(pair, data) {
    as.data.frame(data) %>%
      filter(by %in% pair) %>%
      mutate(by = factor(by)) %>%
      droplevels()
  }
  data_subsets <- apply(X = pairs, MARGIN = 1, FUN = subsetting, data = data)
  
  # CALCULATE SMD BETWEEN GROUPS WITHIN EACH DATA SUBSET
  calc_SMD <- function(data_subset, is_weighted, ci, decimals) {
    res <- smd::smd(data_subset$variable, data_subset$by, std.error = T)
    if (is_weighted) {
      res <- smd::smd(data_subset$variable, data_subset$by, data_subset$weight_var, std.error = T) 
    }
    res_smd <- res[[2]] %>% round(decimals) %>% format(nsmall = decimals)
    ci_lower <- (res[[2]] - 1.96 * res[[3]]) %>% round(decimals) %>% format(nsmall = decimals)
    ci_upper <- (res[[2]] + 1.96 * res[[3]]) %>% round(decimals) %>% format(nsmall = decimals)
    
    if (ci == TRUE) {
      output <- paste(res_smd, " [", ci_lower, ";", ci_upper, "]", sep = "")
      return(output)
    } else {
      return(res_smd)
    }
  }
  calc_SMD <- purrr::possibly(.f = calc_SMD, otherwise = NA_character_)
  
  smd_estimates <- purrr::map_chr(data_subsets, ~ calc_SMD(., is_weighted, ci, decimals))
  
  # OUTPUT THE RESULTS
  tibble(comp = comparisons, smd = smd_estimates) %>%
    spread(comp, smd) %>%
    relocate(any_of(comparisons))
}

3) Create a function clean_smd_data() to prepare the input data for use by the core_smd_function()

clean_smd_data <- function(data, variable, by, tbl) {
  tbl_type <- first(class(tbl))
  if (tbl_type != "tbl_svysummary" & tbl_type != "tbl_summary") {
    stop("Inappropriate input to smd function")
  }
  is_weighted <- tbl_type == "tbl_svysummary"
  
  if (is_weighted) {
    data <- data$variables %>% mutate(weight_var = 1 / data$allprob[[1]])
  } else {
    data <- data %>% mutate(weight_var = 1)
  }
  
  data <- dplyr::select(data, all_of(c(variable, by, "weight_var"))) %>%
    rlang::set_names(c("variable", "by", "weight_var")) %>%
    dplyr::filter(complete.cases(.))
  if (is.character(data$variable)) {
    data <- data %>% mutate(variable = factor(variable))
  }
  if (is.factor(data$variable)) {
    levels <- levels(data$variable)
  } else {
    levels <- NULL
  }
  return(list(data, levels, is_weighted))
}

4) Create the add_SMD() function to be called by users.

This function first cleans the data and then applies the core_smd_function() but allows two distinct things to vary according to user preference:

  1. The location argument (“label”, “level”, or “both”):
  • Specifying “label”, you get one SMD per variable. For categorical variables, a Mahalanobis distance is calculated between groups.
  • Specifying “level”, you get an SMD for each level of all categorical variables. This option thus does not produce SMDs for continuous/numeric variables.
  • Specifying “both”, you combine the output of the level and the label options.
  1. The ref_group argument (TRUE or FALSE):
  • FALSE: There is no reference group, and SMDs will be calculated between every possible pair of groups (i.e., groups being defined by the “by” argument in tbl_summary() or tbl_svysummary()).
  • TRUE: The first group (the first level of the variable given in the “by” argument - which is also the leftmost group in the table) will be set as a reference group.
  1. The ci argument (logical) specifies whether to print confidence intervals for the SMDs.
  2. The decimals argument (integer) specifies the number of significant digits to print for SMDs (and CIs).
add_SMD <- function(tbl, location = "label", ref_group = FALSE, ci = FALSE, decimals = 2) {
  fun <- function(data, variable, by, tbl, ...) {
    clean_data <- clean_smd_data(data, variable, by, tbl)
    data <- clean_data[[1]]
    levels <- clean_data[[2]]
    is_weighted <- clean_data[[3]]
    
    if (location == "label") {
      output <- core_smd_function(data, is_weighted, 
                                  location = location, ref_group = ref_group, 
                                  ci = ci, decimals = decimals)
    } else { # location == "level"
      execute_by_level <- function(data, level, is_weighted) {
        data <- data %>% mutate(variable = variable == level)
        core_smd_function(data, is_weighted, 
                          location = location, ref_group = ref_group, 
                          ci = ci, decimals = decimals)
      }
      output <- map_dfr(levels, .f = ~ execute_by_level(data, .x, is_weighted))
    }
    return(output)
  }
  
  if (location == "both") {
    location <- "label"
    tbl <- tbl %>% add_stat(fns = everything() ~ fun, location = ~ "label")
    location <- "level"
    tbl <- tbl %>% add_stat(fns = everything() ~ fun, location = ~ "level")
    
    duplicates <- stringr::str_subset(tbl$table_styling$header$column, "^SMD(\r\n|\r|\n|.)* $")
    duplicates <- stringr::str_remove(duplicates, " $")
    
    for (i in 1:length(duplicates)) {
      # Temporarily change column names for use by gtsummary
      column_names <- colnames(tbl$table_body)
      indices <- which(column_names == duplicates[i] | column_names == paste0(duplicates[i], " "))
      column_names[indices] <- stringr::str_replace_all(column_names[indices], "[: .]", "_")
      colnames(tbl$table_body) <- column_names
      
      # Adjust the digits of the SMDs and turn into character (while hiding NAs)
      format_smd <- function(column) {
        #column <- round(column, 3)
        #column <- format(column, nsmall = 3)
        column[is.na(column)] <- ""
        return(column)
      }
      tbl$table_body[[column_names[indices][1]]] <- format_smd(tbl$table_body[[column_names[indices][1]]])
      tbl$table_body[[column_names[indices][2]]] <- format_smd(tbl$table_body[[column_names[indices][2]]])
      
      # Finally merge and reinstate the original column title
      merge_pattern <- paste0("{",column_names[indices][1],"}{",column_names[indices][2],"}")
      tbl <- tbl %>%
        modify_column_merge(pattern = merge_pattern) %>%
        modify_header(column_names[indices][1] ~ duplicates[i])
      
    }
    
  } else {
    tbl <- tbl %>% add_stat(fns = everything() ~ fun, location = ~ location)
  }
  return(tbl)
  
}



Using the functions with gtsummary.


For unweighted data (with a tbl_summary() object), see the following examples:

For one SMD per variable, the location argument does not need to be specified (as it defaults to “label”).

trial %>% 
  tbl_summary(by = grade, include = c(age, stage)) %>%
  add_SMD()
Characteristic I, N = 681 II, N = 681 III, N = 641 SMD: I vs. II SMD: I vs. III SMD: II vs. III
Age 47 (37, 56) 49 (37, 57) 47 (38, 58) -0.10 -0.13 -0.04
    Unknown 2 6 3


T Stage


0.29 0.19 0.31
    T1 17 (25%) 23 (34%) 13 (20%)


    T2 18 (26%) 17 (25%) 19 (30%)


    T3 18 (26%) 11 (16%) 14 (22%)


    T4 15 (22%) 17 (25%) 18 (28%)


1 Median (IQR); n (%)

For one SMD per level of every categorical variable, you must specify location = “level”

trial %>% 
  tbl_summary(by = grade, include = c(age, stage)) %>%
  add_SMD(location = "level")
Characteristic I, N = 681 II, N = 681 III, N = 641 SMD: I vs. II SMD: I vs. III SMD: II vs. III
Age 47 (37, 56) 49 (37, 57) 47 (38, 58)


    Unknown 2 6 3


T Stage





    T1 17 (25%) 23 (34%) 13 (20%) -0.19 0.11 0.31
    T2 18 (26%) 17 (25%) 19 (30%) 0.03 -0.07 -0.11
    T3 18 (26%) 11 (16%) 14 (22%) 0.25 0.11 -0.15
    T4 15 (22%) 17 (25%) 18 (28%) -0.07 -0.14 -0.07
1 Median (IQR); n (%)
*Notice, this only gives SMDs on levels of categorical variables.

There’s also the option to set location = “both” to get both kinds of SMDs simultaneously.

trial %>%
  tbl_summary(by = grade, include = c(age, stage)) %>%
  add_SMD(location = "both")
Characteristic I, N = 681 II, N = 681 III, N = 641 SMD: I vs. II SMD: I vs. III SMD: II vs. III
Age 47 (37, 56) 49 (37, 57) 47 (38, 58) -0.10 -0.13 -0.04
    Unknown 2 6 3
T Stage


0.29 0.19 0.31
    T1 17 (25%) 23 (34%) 13 (20%) -0.19 0.11 0.31
    T2 18 (26%) 17 (25%) 19 (30%) 0.03 -0.07 -0.11
    T3 18 (26%) 11 (16%) 14 (22%) 0.25 0.11 -0.15
    T4 15 (22%) 17 (25%) 18 (28%) -0.07 -0.14 -0.07
1 Median (IQR); n (%)

To get confidence intervals, add ci = TRUE. With the decimals argument, you can adjust the number of significant digits displayed.

trial %>% 
  tbl_summary(by = grade, include = c(age, stage)) %>%
  add_SMD(location = "level", ci = TRUE, decimals = 3)
Characteristic I, N = 681 II, N = 681 III, N = 641 SMD: I vs. II SMD: I vs. III SMD: II vs. III
Age 47 (37, 56) 49 (37, 57) 47 (38, 58)


    Unknown 2 6 3


T Stage





    T1 17 (25%) 23 (34%) 13 (20%) -0.195 [-0.531;0.142] 0.112 [-0.229;0.454] 0.308 [-0.036;0.651]
    T2 18 (26%) 17 (25%) 19 (30%) 0.034 [-0.303;0.370] -0.072 [-0.413;0.270] -0.105 [-0.447;0.236]
    T3 18 (26%) 11 (16%) 14 (22%) 0.253 [-0.084;0.591] 0.107 [-0.234;0.449] -0.146 [-0.487;0.196]
    T4 15 (22%) 17 (25%) 18 (28%) -0.069 [-0.406;0.267] -0.140 [-0.482;0.202] -0.071 [-0.412;0.271]
1 Median (IQR); n (%)


For weighted data, use tbl_svysummary():

In this examply we use weights from WeightIt package. The survey package delivers the necessary svydesign object.

library(WeightIt) # To calculate weights
library(survey) # To create a surveydesign object (a "weighted" dataset)

Application of the add_SMD() function is identical to the non-weighted case, but it is applied to a tbl_svysummary object instead of a tbl_summary object.

trial %>% mutate(
  w = weightit(grade ~ age + stage + trt, data = ., focal="I")$weights) %>% # create ATT weights
  survey::svydesign(~1, data = ., weights = ~w) %>% # create the svydesign object
  tbl_svysummary(by = grade, include = c(age, stage)) %>%
  add_SMD(ref_group = TRUE)
Characteristic I, N = 681 II, N = 681 III, N = 681 SMD: I vs. II SMD: I vs. III
Age 47 (37, 56) 46 (34, 56) 45 (38, 54) -0.01 -0.01
    Unknown 2 2 2

T Stage


0.08 0.02
    T1 17 (25%) 19 (27%) 17 (25%)

    T2 18 (26%) 19 (27%) 17 (26%)

    T3 18 (26%) 17 (26%) 18 (27%)

    T4 15 (22%) 13 (19%) 15 (23%)

1 Median (IQR); n (%)
*Notice, comparisons are only made here between group I and all other groups due to the use of ref_group = TRUE

gtsummary-smds's People

Contributors

zheer-kejlberg avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.