
rbpatt2019 / chooser


An R framework for choosing clustering parameters in scRNA-seq analysis pipelines

License: GNU General Public License v3.0

R 100.00%

chooser's Introduction


Ryan B Patterson-Cross is currently a computational biologist in the Bio2Core, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, where he uses his passion for transparent and reproducible data science to provide scientific computing support to a broad range of scientists. He also teaches supervisions in the University's Department of Computer Science and Technology. He moved to Cambridge after working as a research fellow in Ariel Levine's group at the National Institutes of Health, USA. There, he developed a framework in R for the quantitative selection of clustering hyperparameters and collaborated on a machine-learning pipeline to identify cell types in a harmonised scRNA-seq atlas.

Ryan holds an MSc(Res) in Medical Science from the University of Oxford, where he studied mitochondrial turnover in Parkinson's disease. During these studies, he took several courses in data science through the FAES at the National Institutes of Health in Bethesda. He also holds a BSc in Neuroscience and a BSc in Molecular/Cellular Biology with a minor in Spanish from Johns Hopkins University.

Outside of work, Ryan is an avid crafter with a penchant for cross stitch. He enjoys cooking and playing ukulele, tinkers in Vim, and develops projects ranging from Dash interfaces to Snakemake pipelines!

chooser's People

Contributors

rbpatt2019


chooser's Issues

Should PCA be recalculated for each sub-sample when assessing clustering robustness?

Hi,

More of a theoretical question than a code issue (so apologies if this isn't the appropriate place for it).

Looking at the function used for subsampling in chooseR, it appears that, when clusters are calculated on subsets of cells, the PCA reduction computed from the full set of cells is reused. As a result, information from cells outside the subsample still influences the clustering, since those cells help determine distances in PCA space.

I'm not sure whether this would make a large difference to inter-parameter comparisons, but my intuition is that it would make clustering look more stable across subsamples than it would be if PCA were recalculated on each subset.

Hope that makes sense, and that I haven't misunderstood the pipeline!

Cheers,
Matt
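
For reference, a minimal sketch of what recomputing PCA on each subsample might look like; this is a hypothetical variant of the subsampled clustering step, not how chooseR's find_clusters() behaves, and the argument values (npcs, assay, res) are assumptions.

# Hypothetical variant: rerun PCA on the subsample before building the neighbour graph
subsample_recluster <- function(obj, idx, npcs = 30, assay = "SCT", res = 0.8) {
  small <- obj[, idx]  # subset to the sampled cells
  small <- Seurat::RunPCA(small, assay = assay, npcs = npcs, verbose = FALSE)
  small <- Seurat::FindNeighbors(small, reduction = "pca", dims = 1:npcs, verbose = FALSE)
  small <- Seurat::FindClusters(small, resolution = res, verbose = FALSE)
  return(small)
}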

Cannot load shared object from chooseR's source

Hello,

Thanks for creating such a package; it works quite well!

I tested your tool on a few scRNA-seq datasets and everything was great! However, when I ran exactly the same chooseR clustering code, changed only so that the in-script arguments are supplied as command-line arguments (parsed with the argparse package so the script can be run from the terminal), I ran into a problem loading a shared object from the renv cache (see the log below). Do you have any idea how I could solve this issue?

Thank you in advance!

Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object '/nfs/home/rasmirnov/.cache/R/renv/cache/v5/R-4.0/x86_64-pc-linux-gnu/magrittr/2.0.2/cdc87ecd81934679d1557633d8e1fe51/magrittr/libs/magrittr.so':
  /nfs/home/rasmirnov/.cache/R/renv/cache/v5/R-4.0/x86_64-pc-linux-gnu/magrittr/2.0.2/cdc87ecd81934679d1557633d8e1fe51/magrittr/libs/magrittr.so: undefined symbol: R_removeVarFromFrame
Calls: source ... getNamespace -> loadNamespace -> library.dynam -> dyn.load
Execution halted
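
The undefined symbol suggests the cached magrittr binary was built against a different R than the one loading it. A possible workaround (an assumption, not a confirmed fix, and the exact renv arguments may differ between renv versions) is to rebuild the package from source inside the project:

# Rebuild magrittr from source so the cached binary matches the running R (hedged suggestion)
renv::install("magrittr", rebuild = TRUE)
renv::snapshot()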

Error: More dimensions specified in dims than have been computed

Hello!

I am trying to implement chooseR on my scRNA-seq dataset of about 10k cells. I pre-processed the cells using Seurat v3, performing normalisation, scaling, PCA, and UMAP (with default parameters).

After this, I saved an .rds file, changed the readRDS call in the first script of your package accordingly, and set npcs to 32 (my PCA has 50 dimensions, as evident from the elbow plot). When I try to run the subsequent for loop, which iterates over my custom-defined resolution vector, I get an error:

[screenshot of the error message omitted]

I tried various things, such as not performing UMAP (which should not affect the final results), changing default parameters, etc., but nothing works. How should I resolve this issue?
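
This error usually means that npcs exceeds the number of principal components actually stored in the object. A quick check, offered as a suggestion and assuming the Seurat object is called obj:

# How many PCs were actually computed? npcs passed to find_clusters() must not exceed this.
ncol(Seurat::Embeddings(obj, reduction = "pca"))
# If fewer than expected, recompute with more components:
# obj <- Seurat::RunPCA(obj, npcs = 50, verbose = FALSE)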

R crashes with large datasets

Hello,

Thank you for this package that seems very useful.

Nevertheless, I have an integrated dataset of 150,000 cells, and R crashes when calculating the co-clustering frequencies. My computer has 128 GB of RAM, and the integrated dataset takes up 15.4 GB. Do you have any idea how I could overcome this issue?

Thank you in advance
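
For scale, a rough back-of-the-envelope estimate (an illustration, not taken from this thread): the co-clustering match matrix is n_cells x n_cells and is held as complex values, so its size grows quadratically with the number of cells and can dwarf both the Seurat object and the available RAM.

# Approximate memory needed for a 150,000 x 150,000 complex match matrix (16 bytes per entry)
n_cells <- 150000
n_cells^2 * 16 / 1024^3  # roughly 335 GiB, far beyond 128 GB of RAM
# A common workaround (an assumption, not an official recommendation) is to assess robustness
# on a random subset of cells, e.g. obj[, sample(Seurat::Cells(obj), 30000)]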

Error in CheckDots(...) : object 'integrated' not found

Hi, I would like to use chooseR on a single-cell dataset, but when I try to use it on a Seurat object (with SCTransform and PCA already performed) I get the following error: "Error in CheckDots(...) : object 'integrated' not found". Is this an issue due to an updated version of Seurat? Is there a way to fix it?
Thanks in advance for any input you can provide.
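
One possible cause (a guess, not confirmed in this thread) is that the assay name is passed to Seurat unquoted somewhere in the call chain, so R looks for an object named integrated rather than the string "integrated":

# Quoting the assay name avoids R searching for an object called `integrated`
# (hypothetical call; the reduction and dims values are assumptions)
obj <- Seurat::FindNeighbors(obj, reduction = "pca", dims = 1:30, assay = "integrated")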

Re: recommended resolution and number of clusters

Hi,

I've been using chooseR to determine the optimal resolution for Seurat clustering. I'm running into an issue where the chooseR-recommended resolution does not generate the same number of clusters in Seurat as it does in chooseR. For example, if chooseR recommends a resolution of 1.0, the co-clustering map and the silhouette point plot show clusters 0-21 (22 clusters), but when I enter a resolution of 1.0 into Seurat, I get 24 clusters. Any thoughts on this would be greatly appreciated. I've pasted the code I used below.


# Run chooseR before FindClusters


# ChooseR
library(renv)
# Before starting chooseR, read in all of the helper functions below:

library(Seurat)
library(ggplot2)

`%>%` <- magrittr::`%>%`

# find_clusters
find_clusters <- function(
  obj,
  reduction = "pca",
  npcs = 100,
  assay = "integrated",
  features = NULL,
  resolution = 0.8,
  verbose = FALSE) {
  obj <- Seurat::FindNeighbors(
    obj,
    reduction = reduction,
    dims = 1:npcs,
    assay = assay,
    features = features,
    verbose = verbose,
    graph.name = paste(reduction, assay, sep = ".")
  )
  obj <- Seurat::FindClusters(
    obj,
    resolution = resolution,
    graph.name = paste(reduction, assay, sep = "."),
    verbose = verbose
  )
  return(obj)
}

# Generate n sub-samples
n_samples <- function(
  n,
  input,
  size = 0.8,
  replace = FALSE,
  simplify = FALSE) {
  splits <- replicate(
    n,
    sample(
      input,
      as.integer(length(input) * size),
      replace = replace
    ),
    simplify = simplify
  )
  return(splits)
}

# multiple_clusters
multiple_cluster <- function(
  obj,
  n = 100,
  size = 0.8,
  npcs = 100,
  res = 1.2,
  reduction = "pca",
  assay = "SCT") {
  
  # Initialise tibble for data
  clusters <- dplyr::as_tibble(Seurat::Cells(obj))
  clusters <- dplyr::rename(clusters, "cell" = value)
  
  # Get samples
  samples <- n_samples(n, Seurat::Cells(obj), size = size)
  
  # Repeated clusters
  j <- 1
  for (idx in samples) {
    message(paste0("\tClustering ", j, "..."))
    small_obj <- obj[, idx]
    small_obj <- find_clusters(
      small_obj,
      reduction = reduction,
      npcs = npcs,
      resolution = res,
      assay = assay
    )
    clusters <- dplyr::left_join(
      clusters,
      dplyr::as_tibble(Seurat::Idents(small_obj), rownames = "cell"),
      by = "cell"
    )
    j <- j + 1
  }
  return(clusters)
}

# Find matches for a given clustering resolution
find_matches <- function(col, df) {
  mtchs <- outer(df[[col]], df[[col]], "==")
  # Record dropped cells (NAs) as the imaginary unit i, matches as 1, non-matches as 0,
  # so Re() counts co-clustering events and Im() counts absences from subsamples
  mtchs[is.na(mtchs)] <- 1i
  return(mtchs)
}

# Score the number of matches
percent_match <- function(x, n = 100) {
  return(Re(x) / (n - Im(x)))
}

# Compute group average frequencies
group_scores <- function(tbl, clusters) {
  colnames(tbl) <- clusters
  data <- tbl %>%
    tibble::add_column("cell_1" = clusters) %>%
    tidyr::pivot_longer(-cell_1, names_to = "cell_2", values_to = "percent") %>%
    dplyr::group_by(cell_1, cell_2) %>%
    dplyr::summarise("avg_percent" = mean(percent)) %>%
    dplyr::ungroup()
  return(data)
}

# Compute group average silhouette scores
group_sil <- function(sil, res) {
  sil <- tibble::as_tibble(sil[, ]) %>%
    dplyr::group_by(cluster) %>%
    dplyr::summarise("avg_sil" = mean(sil_width)) %>%
    tibble::add_column("res" = res)
  return(sil)
}

# Compute confidence intervals on the median
boot_median <- function(x, interval = 0.95, R = 25000, type = "bca") {
  # Define median to take data and indices for use with boot::
  med <- function(data, indices) {
    resample <- data[indices]
    return(median(resample))
  }
  
  # Calculate intervals
  boot_data <- boot::boot(data = x, statistic = med, R = R)
  boot_ci <- boot::boot.ci(boot_data, conf = interval, type = type)
  
  # Extract desired statistics
  ci <- list(
    low_med = boot_ci$bca[4],
    med = boot_ci$t0,
    high_med = boot_ci$bca[5]
  )
  return(ci)
}


### Begin ChooseR Workflow
npcs <- 20
resolutions <- c(0.4, 0.6, 0.8, 1, 1.6, 2, 4, 6, 8)
assay <- "SCT"
reduction <- "pca"
results_path <- "chooseR-results/"
dir.create(results_path, showWarnings = FALSE)  # make sure the output directory exists

obj <- combined.sct # this is an integrated seurat object

# Run pipeline
for (res in resolutions) {
  message(paste0("Clustering ", res, "..."))
  message("\tFinding ground truth...")
  
  # "Truths" will be stored at glue::glue("{reduction}.{assay}_res.{res}")
  obj <- find_clusters(
    obj,
    reduction = reduction,
    assay = assay,
    npcs = npcs,   # change made from the original GitHub repository
    resolution = res
  )
  clusters <- obj[[glue::glue("{reduction}.{assay}_res.{res}")]]
  
  # Now perform iterative, sub-sampled clusters
  results <- multiple_cluster(
    obj,
    n = 100,
    size = 0.8,
    npcs = npcs,
    res = res,
    reduction = reduction,
    assay = assay
  )
  
  # Now calculate the co-clustering frequencies
  message(paste0("Tallying ", res, "..."))
  # This is the more time efficient vectorisation
  # However, it exhausts vector memory for (nearly) all datasets
  # matches <- purrr::map(columns, find_matches, df = results)
  # matches <- purrr::reduce(matches, `+`)
  columns <- colnames(dplyr::select(results, -cell))
  mtchs <- matrix(0, nrow = dim(results)[1], ncol = dim(results)[1])
  i <- 1 # Counter
  for (col in columns) {
    message(paste0("\tRound ", i, "..."))
    mtchs <- Reduce("+", list(
      mtchs,
      find_matches(col, df = results)
    ))
    i <- i + 1
  }
  
  message(paste0("Scoring ", res, "..."))
  mtchs <- dplyr::mutate_all(
    dplyr::as_tibble(mtchs),
    function(x) dplyr::if_else(Re(x) > 0, percent_match(x), 0)
  )
  
  # Now calculate silhouette scores
  message(paste0("Silhouette ", res, "..."))
  sil <- cluster::silhouette(
    x = as.numeric(as.character(unlist(clusters))),
    dmatrix = (1 - as.matrix(mtchs))
  )
  saveRDS(sil, paste0(results_path, "silhouette_", res, ".rds"))
  
  # Finally, calculate grouped metrics
  message(paste0("Grouping ", res, "..."))
  grp <- group_scores(mtchs, unlist(clusters))
  saveRDS(grp, paste0(results_path, "frequency_grouped_", res, ".rds"))
  sil <- group_sil(sil, res)
  saveRDS(sil, paste0(results_path, "silhouette_grouped_", res, ".rds"))
}

saveRDS(obj, paste0(results_path, "clustered_data_March_8.rds"))

# Create silhouette plot
# Read in scores and calculate CIs
scores <- purrr::map(
  paste0(results_path, "silhouette_grouped_", resolutions, ".rds"),
  readRDS
)
scores <- dplyr::bind_rows(scores) %>%
  dplyr::group_by(res) %>%
  dplyr::mutate("n_clusters" = dplyr::n()) %>%
  dplyr::ungroup()
meds <- scores %>%
  dplyr::group_by(res) %>%
  dplyr::summarise(
    "boot" = list(boot_median(avg_sil)),
    "n_clusters" = mean(n_clusters)
  ) %>%
  tidyr::unnest_wider(boot)

writexl::write_xlsx(meds, paste0(results_path, "median_ci.xlsx"))

# Find thresholds
threshold <- max(meds$low_med)
choice <- as.character(
  meds %>%
    dplyr::filter(med >= threshold) %>%
    dplyr::arrange(n_clusters) %>%
    tail(n = 1) %>%
    dplyr::pull(res)
)

#  And plot!
ggplot(meds, aes(factor(res), med)) +
  geom_crossbar(
    aes(ymin = low_med, ymax = high_med),
    fill = "grey",
    size = 0.25
  ) +
  geom_hline(aes(yintercept = threshold), colour = "blue") +
  geom_vline(aes(xintercept = choice), colour = "red") +
  geom_jitter(
    data = scores,
    aes(factor(res), avg_sil),
    size = 0.35,
    width = 0.15
  ) +
  scale_x_discrete("Resolution") +
  scale_y_continuous(
    "Silhouette Score",
    expand = c(0, 0),
    limits = c(-1, 1),
    breaks = seq(-1, 1, 0.25),
    oob = scales::squish
  ) +
  cowplot::theme_minimal_hgrid() +
  theme(
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 7),
    axis.line.x = element_line(colour = "black"),
    axis.line.y = element_line(colour = "black"),
    axis.ticks = element_line(colour = "black"),
  )

ggsave(
  filename = paste0(results_path, "silhouette_distribution_plot.png"),
  dpi = 300,
  height = 3.5,
  width = 3.5,
  units = "in"
) 

# Finally, a dot plot of silhouette scores to help identify less robust clusters
# The initial pipe is to order the clusters by silhouette score
scores %>%
  dplyr::filter(res == choice) %>%
  dplyr::arrange(dplyr::desc(avg_sil)) %>%
  dplyr::mutate_at("cluster", ordered, levels = .$cluster) %>%
  ggplot(aes(factor(cluster), avg_sil)) +
  geom_point() +
  scale_x_discrete("Cluster") +
  scale_y_continuous(
    "Silhouette Score",
    expand = c(0, 0),
    limits = c(-1, 1),
    breaks = seq(-1, 1, 0.25),
    oob = scales::squish
  ) +
  cowplot::theme_minimal_grid() +
  theme(
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 7),
    axis.line.x = element_line(colour = "black"),
    axis.line.y = element_line(colour = "black"),
    axis.ticks = element_line(colour = "black"),
  )

ggsave(
  filename = paste0(results_path, "silhouette_point_plot_", choice, ".png"),
  dpi = 300,
  height = 3.5,
  width = 3.5,
  units = "in"
)

#PART 2

# Define common variables
# choice is the res elected by the pipeline in examples/1_seurat_pipeline.R
# Be sure to change your path as necessary!
reduction <- "pca"
assay <- "SCT"
choice <- 1
results_path <- "chooseR-results/"

# Load in the object containing the clustered results (saved above as clustered_data_March_8.rds)
obj <- readRDS(paste0(results_path, "clustered_data_March_8.rds"))

# First is a cluster average co-clustering heatmap
# Read the data
grp <- readRDS(paste0(results_path, "frequency_grouped_", choice, ".rds"))

# I have commented out the block of code below because it kept distorting the generated image.
# I actually prefer the full square for visualisation purposes anyway.
# As the data is symmetrical, we do not need the upper triangle
# grp <- grp %>%
#   pivot_wider(names_from = "cell_2", values_from = "avg_percent") %>%
#   select(str_sort(colnames(.), numeric = T)) %>%
#   column_to_rownames("cell_1")
# grp[lower.tri(grp)] <- NA
# grp <- grp %>%
#   as_tibble(rownames = "cell_1") %>%
#   pivot_longer(-cell_1, names_to = "cell_2", values_to = "avg_percent") %>%
#   mutate_at("cell_2", ordered, levels = unique(.$cell_1)) %>%
#   mutate_at("cell_1", ordered, levels = unique(.$cell_1))

# And plot!
plot <- ggplot(grp, aes(factor(cell_1), cell_2, fill = avg_percent)) +
  geom_tile() +
  scale_x_discrete("Cluster", expand = c(0, 0)) +
  scale_y_discrete(
    "Cluster",
    limits = rev(levels(grp$cell_2)),
    expand = c(0, 0)
  ) +
  scale_fill_distiller(
    " ",
    limits = c(0, 1),
    breaks = c(0, 0.5, 1),
    palette = "RdYlBu",
    na.value = "white"
  ) +
  coord_fixed() +
  theme(
    axis.ticks = element_line(colour = "black"),
    axis.text = element_text(size = 6),
    axis.title = element_text(size = 8),
    legend.text = element_text(size = 7),
    legend.position = c(0.9, 0.9)
  ) +
  guides(fill = guide_colorbar(barheight = 3, barwidth = 1))

plot + NoLegend()

ggsave(
  plot = plot,
  filename = paste0(results_path, "coclustering_heatmap_", choice, ".png"),
  dpi = 300,
  height = 3.5,
  width = 3.5,
  units = "in"
)

# Let's add the silhouette scores to the Seurat object!
choice <- 1.0
sil_scores <- readRDS(paste0(results_path, "silhouette_", choice, ".rds"))
sil_scores <- as.data.frame(sil_scores[, 3], row.names = Seurat::Cells(combined.sct))
colnames(sil_scores) <- c("sil_score")
combined.sct <- AddMetaData(combined.sct, metadata = sil_scores)

# Seurat Clusters

# Clustering
combined.sct <- FindNeighbors(combined.sct, reduction = "pca", dims = 1:20)
combined.sct <- FindClusters(combined.sct, resolution = 1.0) # recommended chooseR resolution based on resolution silhouette score plot
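# Hedged note: chooseR built and clustered on a graph named "pca.SCT", whereas the two calls
# above use Seurat's default graph, which alone can yield a different number of clusters at
# the same resolution. One way (an assumption, not from the thread) to keep the counts
# identical is to reuse the identities chooseR already stored in the metadata:
# Idents(combined.sct) <- "pca.SCT_res.1"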
DimPlot(combined.sct, reduction = 'umap')
DimPlot(combined.sct, reduction = 'umap', split.by = "condition")

# Visualise the silhouette score
FeaturePlot(
  combined.sct,
  "sil_score",
  reduction = "umap",
  pt.size = 1,
  min.cutoff = -1,
  max.cutoff = 1
) +
  scale_colour_distiller(
    palette = "RdYlBu",
    labels = c(-1, 0, 1),
    breaks = c(-1, 0, 1),
    limits = c(-1, 1)
  )

Cannot find "umap" in Seurat object

Hi,
I'm currently testing your method on my scRNA-seq datasets with Seurat. I'm using the two scripts you included in the examples directory, and I have just one issue (see the title of this thread) when running the "2_seurat_further_visualisations.R" script: at one point (around line 88 of the script) you have the following code:
plot <- DimPlot(
  obj,
  reduction = "umap",
  group.by = glue::glue("{reduction}.{assay}_res.{choice}"),
  pt.size = 0.5
)
However, in the first script (1_seurat_pipeline.R) the UMAP dimensionality reduction is never computed (RunUMAP is never called). I checked both the "pipeline.R" and "helper_functions.R" scripts and it does not appear there either. Is there a specific point in the workflow where you intended it to go so the whole method works correctly?
Thanks in advance for your help :)
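
A hedged sketch of where RunUMAP could be added before running the visualisation script, assuming the same PCA reduction and npcs used in the pipeline (these names are assumptions, not part of the original scripts):

# Compute UMAP from the existing PCA so DimPlot(obj, reduction = "umap") has coordinates to draw
obj <- Seurat::RunUMAP(obj, reduction = "pca", dims = 1:npcs, verbose = FALSE)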
