stemangiola / tidysinglecellexperiment

Brings SingleCellExperiment objects to the tidyverse

Home Page: https://stemangiola.github.io/tidySingleCellExperiment/index.html

R 97.30% TeX 2.70%

ggplot2 singlecellexperiment tidyverse tibble bioconductor dplyr tidyr single-cell-rna-seq plotly single-cell-sequencing

tidysinglecellexperiment's Introduction

tidySingleCellExperiment - part of tidytranscriptomics

Brings SingleCellExperiment to the tidyverse!

Website: tidySingleCellExperiment

Please also have a look at

tidySummarizedExperiment for tidy manipulation of SummarizedExperiment objects
tidyseurat for tidy manipulation of Seurat objects
tidybulk for tidy bulk RNA-seq data analysis
tidygate for adding custom gate information to your tibble
tidyHeatmap for heatmaps produced with tidy principles

Introduction

tidySingleCellExperiment provides a bridge between Bioconductor single-cell packages [@amezquita2019orchestrating] and the tidyverse [@wickham2019welcome]. It enables viewing the Bioconductor SingleCellExperiment object as a tidyverse tibble, and provides SingleCellExperiment-compatible dplyr, tidyr, ggplot and plotly functions. This allows users to get the best of both Bioconductor and tidyverse worlds.

Functions/utilities available

SingleCellExperiment-compatible Functions	Description
`all`	After all `tidySingleCellExperiment` is a SingleCellExperiment object, just better

tidyverse Packages	Description
`dplyr`	All `dplyr` tibble functions (e.g. `select`)
`tidyr`	All `tidyr` tibble functions (e.g. `pivot_longer`)
`ggplot2`	`ggplot` (`ggplot`)
`plotly`	`plot_ly` (`plot_ly`)

Utilities	Description
`as_tibble`	Convert cell-wise information to a `tbl_df`
`join_features`	Add feature-wise information, returns a `tbl_df`
`aggregate_cells`	Aggregate cell gene-transcription abundance as pseudobulk tissue

Installation

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("tidySingleCellExperiment")

Load libraries used in this vignette.

# Bioconductor single-cell packages
library(scater)
library(scran)
library(SingleR)
library(SingleCellSignalR)

# Tidyverse-compatible packages
library(purrr)
library(magrittr)
library(tidyHeatmap)

# Both
library(tidySingleCellExperiment)

Data representation of `tidySingleCellExperiment`

This is a SingleCellExperiment object but it is evaluated as a tibble. So it is compatible both with SingleCellExperiment and tidyverse.

data(pbmc_small, package="tidySingleCellExperiment")

It looks like a tibble

pbmc_small

## # A SingleCellExperiment-tibble abstraction: 80 × 17
## # Features=230 | Cells=80 | Assays=counts, logcounts
##    .cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
##    <chr> <fct>           <dbl>        <int> <fct>           <fct>         <chr> 
##  1 ATGC… SeuratPro…         70           47 0               A             g2    
##  2 CATG… SeuratPro…         85           52 0               A             g1    
##  3 GAAC… SeuratPro…         87           50 1               B             g2    
##  4 TGAC… SeuratPro…        127           56 0               A             g2    
##  5 AGTC… SeuratPro…        173           53 0               A             g2    
##  6 TCTG… SeuratPro…         70           48 0               A             g1    
##  7 TGGT… SeuratPro…         64           36 0               A             g1    
##  8 GCAG… SeuratPro…         72           45 0               A             g1    
##  9 GATA… SeuratPro…         52           36 0               A             g1    
## 10 AATG… SeuratPro…        100           41 0               A             g1    
## # ℹ 70 more rows
## # ℹ 10 more variables: RNA_snn_res.1 <fct>, file <chr>, ident <fct>,
## #   PC_1 <dbl>, PC_2 <dbl>, PC_3 <dbl>, PC_4 <dbl>, PC_5 <dbl>, tSNE_1 <dbl>,
## #   tSNE_2 <dbl>

But it is a SingleCellExperiment object after all

assay(pbmc_small, "counts")[1:5, 1:5]

## 5 x 5 sparse Matrix of class "dgCMatrix"
##         ATGCCAGAACGACT CATGGCCTGTGCAT GAACCTGATGAACC TGACTGGATTCTCA
## MS4A1                .              .              .              .
## CD79B                1              .              .              .
## CD79A                .              .              .              .
## HLA-DRA              .              1              .              .
## TCL1A                .              .              .              .
##         AGTCAGACTGCACA
## MS4A1                .
## CD79B                .
## CD79A                .
## HLA-DRA              1
## TCL1A                .

The SingleCellExperiment object’s tibble visualisation can be turned off, or back on at any time.

# Turn off the tibble visualisation
options("restore_SingleCellExperiment_show" = TRUE)
pbmc_small

## class: SingleCellExperiment 
## dim: 230 80 
## metadata(0):
## assays(2): counts logcounts
## rownames(230): MS4A1 CD79B ... SPON2 S100B
## rowData names(5): vst.mean vst.variance vst.variance.expected
##   vst.variance.standardized vst.variable
## colnames(80): ATGCCAGAACGACT CATGGCCTGTGCAT ... GGAACACTTCAGAC
##   CTTGATTGATCTTC
## colData names(9): orig.ident nCount_RNA ... file ident

# Turn on the tibble visualisation
options("restore_SingleCellExperiment_show" = FALSE)
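If we only want the native display for a single call, the option can be set temporarily and restored afterwards. The helper below is a minimal sketch: `with_native_show` is a hypothetical name, not part of tidySingleCellExperiment, and it assumes the option is read at print time, as the example above suggests.

```r
# Hypothetical helper (not part of tidySingleCellExperiment):
# evaluate an expression with the native SingleCellExperiment display,
# then restore whatever value the option had before.
with_native_show <- function(expr) {
    old <- options(restore_SingleCellExperiment_show = TRUE)
    on.exit(options(old), add = TRUE)
    # 'expr' is a lazy promise; it is evaluated here, after the option is set
    force(expr)
}

# Inside the helper the option is TRUE
inside <- with_native_show(getOption("restore_SingleCellExperiment_show"))
inside
```

With the package loaded, a call such as `with_native_show(print(pbmc_small))` would then show the native summary once without changing the session-wide setting.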

Annotation polishing

We may have a column that contains the directory each run was taken from, such as the “file” column in pbmc_small.

pbmc_small$file[1:5]

## [1] "../data/sample2/outs/filtered_feature_bc_matrix/"
## [2] "../data/sample1/outs/filtered_feature_bc_matrix/"
## [3] "../data/sample2/outs/filtered_feature_bc_matrix/"
## [4] "../data/sample2/outs/filtered_feature_bc_matrix/"
## [5] "../data/sample2/outs/filtered_feature_bc_matrix/"

We may want to extract the run/sample name out of it into a separate column. Tidyverse extract can be used to convert a character column into multiple columns using regular expression groups.

# Create sample column
pbmc_small_polished <-
    pbmc_small |>
    extract(file, "sample", "../data/([a-z0-9]+)/outs.+", remove=FALSE)

# Reorder to have sample column up front
pbmc_small_polished |>
    select(sample, everything())

## # A SingleCellExperiment-tibble abstraction: 80 × 18
## # Features=230 | Cells=80 | Assays=counts, logcounts
##    .cell sample orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents
##    <chr> <chr>  <fct>           <dbl>        <int> <fct>           <fct>        
##  1 ATGC… sampl… SeuratPro…         70           47 0               A            
##  2 CATG… sampl… SeuratPro…         85           52 0               A            
##  3 GAAC… sampl… SeuratPro…         87           50 1               B            
##  4 TGAC… sampl… SeuratPro…        127           56 0               A            
##  5 AGTC… sampl… SeuratPro…        173           53 0               A            
##  6 TCTG… sampl… SeuratPro…         70           48 0               A            
##  7 TGGT… sampl… SeuratPro…         64           36 0               A            
##  8 GCAG… sampl… SeuratPro…         72           45 0               A            
##  9 GATA… sampl… SeuratPro…         52           36 0               A            
## 10 AATG… sampl… SeuratPro…        100           41 0               A            
## # ℹ 70 more rows
## # ℹ 11 more variables: groups <chr>, RNA_snn_res.1 <fct>, file <chr>,
## #   ident <fct>, PC_1 <dbl>, PC_2 <dbl>, PC_3 <dbl>, PC_4 <dbl>, PC_5 <dbl>,
## #   tSNE_1 <dbl>, tSNE_2 <dbl>
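To see what the capture group picks up, we can try the same regular expression on a few of the paths with base R. This is only a sketch of the matching step, not of how extract works internally; the stricter pattern at the end is an assumption about the intent (literal dots, anchored at the start).

```r
# Example paths, copied from the `file` column shown above
file <- c(
    "../data/sample2/outs/filtered_feature_bc_matrix/",
    "../data/sample1/outs/filtered_feature_bc_matrix/",
    "../data/sample2/outs/filtered_feature_bc_matrix/"
)

# Same pattern as in the extract() call: the parentheses capture the sample name
pattern <- "../data/([a-z0-9]+)/outs.+"

# sub() replaces the whole match with the first capture group
sample_name <- sub(pattern, "\\1", file)
sample_name

# Stricter variant (an assumption about intent): escape the literal dots
# and anchor the pattern at the start of the string
strict_pattern <- "^\\.\\./data/([a-z0-9]+)/outs.+"
sample_name_strict <- sub(strict_pattern, "\\1", file)
```

One difference to keep in mind: when the pattern does not match, `sub()` returns the input unchanged, whereas tidyr's extract fills the new column with `NA`.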

Preliminary plots

Set colours and theme for plots.

# Use colourblind-friendly colours
friendly_cols <- dittoSeq::dittoColors()

# Set theme
custom_theme <-
    list(
        scale_fill_manual(values=friendly_cols),
        scale_color_manual(values=friendly_cols),
        theme_bw() +
            theme(
                panel.border=element_blank(),
                axis.line=element_line(),
                panel.grid.major=element_line(size=0.2),
                panel.grid.minor=element_line(size=0.1),
                text=element_text(size=12),
                legend.position="bottom",
                aspect.ratio=1,
                strip.background=element_blank(),
                axis.title.x=element_text(margin=margin(t=10, r=10, b=10, l=10)),
                axis.title.y=element_text(margin=margin(t=10, r=10, b=10, l=10))
            )
    )

## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We can treat pbmc_small_polished as a tibble for plotting.

Here we plot number of features per cell.

pbmc_small_polished |>
    ggplot(aes(nFeature_RNA, fill=groups)) +
    geom_histogram() +
    custom_theme

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here we plot the total counts (nCount_RNA) per cell, by group.

pbmc_small_polished |>
    ggplot(aes(groups, nCount_RNA, fill=groups)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(width=0.1) +
    custom_theme

Here we plot abundance of two features for each group.

pbmc_small_polished |>
    join_features(features=c("HLA-DRA", "LYZ")) |>
    ggplot(aes(groups, .abundance_counts + 1, fill=groups)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(size=nCount_RNA), alpha=0.5, width=0.2) +
    scale_y_log10() +
    custom_theme

## tidySingleCellExperiment says: join_features produces duplicate cell names to accomadate the long data format. For this reason, a data frame is returned for independent data analysis. Assay feature abundance is appended as .abundance_counts and .abundance_logcounts.
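The message above mentions duplicate cell names in the long format. A small base R sketch with a made-up counts matrix shows why: each cell appears once per requested feature. The column names follow the ones in the message; the values and cell names are hypothetical.

```r
# Toy counts matrix (made-up values): 2 features x 3 cells, filled column by column
counts <- matrix(
    c(0, 1,
      2, 0,
      5, 3),
    nrow = 2,
    dimnames = list(c("HLA-DRA", "LYZ"), c("cell_a", "cell_b", "cell_c"))
)

# Long format: one row per (cell, feature) pair.
# as.vector() walks the matrix column by column, so cells are repeated
# in blocks and features cycle within each block.
long <- data.frame(
    .cell = rep(colnames(counts), each = nrow(counts)),
    .feature = rep(rownames(counts), times = ncol(counts)),
    .abundance_counts = as.vector(counts),
    stringsAsFactors = FALSE
)
long

# Each cell name now occurs once per feature
any(duplicated(long$.cell))
```

Because the cell identifier is no longer unique, the result cannot map back onto the columns of a SingleCellExperiment, which is consistent with a plain data frame being returned.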

Preprocess the dataset

We can also treat pbmc_small_polished as a SingleCellExperiment object and proceed with data processing with Bioconductor packages, such as scran [@lun2016pooling] and scater [@mccarthy2017scater].

# Identify variable genes with scran
variable_genes <-
    pbmc_small_polished |>
    modelGeneVar() |>
    getTopHVGs(prop=0.1)

# Perform PCA with scater
pbmc_small_pca <-
    pbmc_small_polished |>
    runPCA(subset_row=variable_genes)

## Warning in check_numbers(k = k, nu = nu, nv = nv, limit = min(dim(x)) - : more
## singular values/vectors requested than available

## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
## TRUE, : You're computing too large a percentage of total singular values, use a
## standard svd instead.

## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
## TRUE, : did not converge--results might be invalid!; try increasing work or
## maxit

pbmc_small_pca

## # A SingleCellExperiment-tibble abstraction: 80 × 18
## # Features=230 | Cells=80 | Assays=counts, logcounts
##    .cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
##    <chr> <fct>           <dbl>        <int> <fct>           <fct>         <chr> 
##  1 ATGC… SeuratPro…         70           47 0               A             g2    
##  2 CATG… SeuratPro…         85           52 0               A             g1    
##  3 GAAC… SeuratPro…         87           50 1               B             g2    
##  4 TGAC… SeuratPro…        127           56 0               A             g2    
##  5 AGTC… SeuratPro…        173           53 0               A             g2    
##  6 TCTG… SeuratPro…         70           48 0               A             g1    
##  7 TGGT… SeuratPro…         64           36 0               A             g1    
##  8 GCAG… SeuratPro…         72           45 0               A             g1    
##  9 GATA… SeuratPro…         52           36 0               A             g1    
## 10 AATG… SeuratPro…        100           41 0               A             g1    
## # ℹ 70 more rows
## # ℹ 11 more variables: RNA_snn_res.1 <fct>, file <chr>, sample <chr>,
## #   ident <fct>, PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, PC5 <dbl>,
## #   tSNE_1 <dbl>, tSNE_2 <dbl>

If a tidyverse-compatible package is not included in the tidySingleCellExperiment collection, we can use as_tibble to permanently convert tidySingleCellExperiment into a tibble.

# Create pairs plot with GGally
pbmc_small_pca |>
    as_tibble() |>
    select(contains("PC"), everything()) |>
    GGally::ggpairs(columns=1:5, ggplot2::aes(colour=groups)) +
    custom_theme

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Identify clusters

We can proceed with cluster identification with scran.

pbmc_small_cluster <- pbmc_small_pca

# Assign clusters to the 'colLabels' of the SingleCellExperiment object
colLabels(pbmc_small_cluster) <-
    pbmc_small_pca |>
    buildSNNGraph(use.dimred="PCA") |>
    igraph::cluster_walktrap() %$%
    membership |>
    as.factor()

## Warning in (function (to_check, X, clust_centers, clust_info, dtype, nn, :
## detected tied distances to neighbors, see ?'BiocNeighbors-ties'

# Reorder columns
pbmc_small_cluster |> select(label, everything())

## # A SingleCellExperiment-tibble abstraction: 80 × 19
## # Features=230 | Cells=80 | Assays=counts, logcounts
##    .cell  label orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents
##    <chr>  <fct> <fct>           <dbl>        <int> <fct>           <fct>        
##  1 ATGCC… 2     SeuratPro…         70           47 0               A            
##  2 CATGG… 2     SeuratPro…         85           52 0               A            
##  3 GAACC… 2     SeuratPro…         87           50 1               B            
##  4 TGACT… 1     SeuratPro…        127           56 0               A            
##  5 AGTCA… 2     SeuratPro…        173           53 0               A            
##  6 TCTGA… 2     SeuratPro…         70           48 0               A            
##  7 TGGTA… 1     SeuratPro…         64           36 0               A            
##  8 GCAGC… 2     SeuratPro…         72           45 0               A            
##  9 GATAT… 2     SeuratPro…         52           36 0               A            
## 10 AATGT… 2     SeuratPro…        100           41 0               A            
## # ℹ 70 more rows
## # ℹ 12 more variables: groups <chr>, RNA_snn_res.1 <fct>, file <chr>,
## #   sample <chr>, ident <fct>, PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>,
## #   PC5 <dbl>, tSNE_1 <dbl>, tSNE_2 <dbl>

And interrogate the output as if it were a regular tibble.

# Count number of cells for each cluster per group
pbmc_small_cluster |>
    count(groups, label)

## tidySingleCellExperiment says: A data frame is returned for independent data analysis.

## # A tibble: 8 × 3
##   groups label     n
##   <chr>  <fct> <int>
## 1 g1     1        12
## 2 g1     2        14
## 3 g1     3        14
## 4 g1     4         4
## 5 g2     1        10
## 6 g2     2        11
## 7 g2     3        10
## 8 g2     4         5

We can identify and visualise cluster markers by combining SingleCellExperiment, tidyverse functions and tidyHeatmap [@mangiola2020tidyheatmap].

# Identify top 10 markers per cluster
marker_genes <-
    pbmc_small_cluster |>
    findMarkers(groups=pbmc_small_cluster$label) |>
    as.list() |>
    map(~ .x |>
        head(10) |>
        rownames()) |>
    unlist()

# Plot heatmap
pbmc_small_cluster |>
    join_features(features=marker_genes) |>
    group_by(label) |>
    heatmap(.feature, .cell, .abundance_counts, .scale="column")

## tidySingleCellExperiment says: join_features produces duplicate cell names to accomadate the long data format. For this reason, a data frame is returned for independent data analysis. Assay feature abundance is appended as .abundance_counts and .abundance_logcounts.

## tidyHeatmap says: (once per session) from release 1.7.0 the scaling is set to "none" by default. Please use scale = "row", "column" or "both" to apply scaling

## Warning: The `.scale` argument of `heatmap()` is deprecated as of tidyHeatmap 1.7.0.
## ℹ Please use scale (without dot prefix) instead: heatmap(scale = ...)
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
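The marker-selection step above takes the first rows of each per-cluster table and concatenates the gene names. The base R sketch below mirrors that logic on made-up tables (the gene rankings and p-values are hypothetical) and shows that the same gene can be selected for more than one cluster; whether to deduplicate before plotting is a judgment call.

```r
# Toy stand-in for the per-cluster marker tables (made-up rankings);
# rows are assumed to be ordered from strongest to weakest marker
marker_tables <- list(
    `1` = data.frame(p.value = c(0.001, 0.01, 0.2),
                     row.names = c("GNLY", "NKG7", "LYZ")),
    `2` = data.frame(p.value = c(0.002, 0.03, 0.4),
                     row.names = c("NKG7", "CD79A", "MS4A1"))
)

# Keep the top rows per cluster and collect the gene names
top_n <- 2
top_markers <- unlist(lapply(marker_tables, function(tab) rownames(head(tab, top_n))))

# A gene can rank highly in several clusters, so drop repeats if needed
unique_markers <- unique(unname(top_markers))
unique_markers
```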

Reduce dimensions

We can calculate the first 3 UMAP dimensions using the SingleCellExperiment framework and scater.

pbmc_small_UMAP <-
    pbmc_small_cluster |>
    runUMAP(ncomponents=3)

And we can plot the result in 3D using plotly.

pbmc_small_UMAP |>
    plot_ly(
        x=~`UMAP1`,
        y=~`UMAP2`,
        z=~`UMAP3`,
        color=~label,
        colors=friendly_cols[1:4]
    )

plotly screenshot

Cell type prediction

We can infer cell type identities using SingleR [@aran2019reference] and manipulate the output using tidyverse.

# Get cell type reference data
blueprint <- celldex::BlueprintEncodeData()

# Infer cell identities
cell_type_df <-

    assays(pbmc_small_UMAP)$logcounts |>
    Matrix::Matrix(sparse = TRUE) |>
    SingleR::SingleR(
        ref = blueprint,
        labels = blueprint$label.main,
        method = "single"
    ) |>
    as.data.frame() |>
    as_tibble(rownames="cell") |>
    select(cell, first.labels)

# Join UMAP and cell type info
data(cell_type_df)
pbmc_small_cell_type <-
    pbmc_small_UMAP |>
    left_join(cell_type_df, by="cell")

## Warning in is_sample_feature_deprecated_used(x, .cols):
## tidySingleCellExperiment says: from version 1.3.1, the special columns
## including cell id (colnames(se)) has changed to ".cell". This dataset is
## returned with the old-style vocabulary (cell), however, we suggest to update
## your workflow to reflect the new vocabulary (.cell).

# Reorder columns
pbmc_small_cell_type |>
    select(cell, first.labels, everything())

## Warning in is_sample_feature_deprecated_used(.data, .cols):
## tidySingleCellExperiment says: from version 1.3.1, the special columns
## including cell id (colnames(se)) has changed to ".cell". This dataset is
## returned with the old-style vocabulary (cell), however, we suggest to update
## your workflow to reflect the new vocabulary (.cell).

## # A SingleCellExperiment-tibble abstraction: 80 × 23
## # Features=230 | Cells=80 | Assays=counts, logcounts
##    cell          first.labels orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
##    <chr>         <chr>        <fct>           <dbl>        <int> <fct>          
##  1 ATGCCAGAACGA… CD4+ T-cells SeuratPro…         70           47 0              
##  2 CATGGCCTGTGC… CD8+ T-cells SeuratPro…         85           52 0              
##  3 GAACCTGATGAA… CD8+ T-cells SeuratPro…         87           50 1              
##  4 TGACTGGATTCT… CD4+ T-cells SeuratPro…        127           56 0              
##  5 AGTCAGACTGCA… CD4+ T-cells SeuratPro…        173           53 0              
##  6 TCTGATACACGT… CD4+ T-cells SeuratPro…         70           48 0              
##  7 TGGTATCTAAAC… CD4+ T-cells SeuratPro…         64           36 0              
##  8 GCAGCTCTGTTT… CD4+ T-cells SeuratPro…         72           45 0              
##  9 GATATAACACGC… CD4+ T-cells SeuratPro…         52           36 0              
## 10 AATGTTGACAGT… CD4+ T-cells SeuratPro…        100           41 0              
## # ℹ 70 more rows
## # ℹ 17 more variables: letter.idents <fct>, groups <chr>, RNA_snn_res.1 <fct>,
## #   file <chr>, sample <chr>, ident <fct>, label <fct>, PC1 <dbl>, PC2 <dbl>,
## #   PC3 <dbl>, PC4 <dbl>, PC5 <dbl>, tSNE_1 <dbl>, tSNE_2 <dbl>, UMAP1 <dbl>,
## #   UMAP2 <dbl>, UMAP3 <dbl>

We can easily summarise the results. For example, we can see how cell type classification overlaps with cluster classification.

# Count number of cells for each cell type per cluster
pbmc_small_cell_type |>
    count(label, first.labels)

## tidySingleCellExperiment says: A data frame is returned for independent data analysis.

## # A tibble: 11 × 3
##    label first.labels     n
##    <fct> <chr>        <int>
##  1 1     CD4+ T-cells     2
##  2 1     CD8+ T-cells     8
##  3 1     NK cells        12
##  4 2     B-cells         10
##  5 2     CD4+ T-cells     6
##  6 2     CD8+ T-cells     2
##  7 2     Macrophages      1
##  8 2     Monocytes        6
##  9 3     Macrophages      1
## 10 3     Monocytes       23
## 11 4     Erythrocytes     9
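As a quick way to read this table, we can ask which cell type dominates each cluster and how much of the cluster it accounts for. The sketch below re-enters the counts from the output above and uses base R only; the `dominant` and `fraction` names are introduced here for illustration.

```r
# Counts copied from the table above
cluster_counts <- data.frame(
    label = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4),
    first.labels = c(
        "CD4+ T-cells", "CD8+ T-cells", "NK cells",
        "B-cells", "CD4+ T-cells", "CD8+ T-cells", "Macrophages", "Monocytes",
        "Macrophages", "Monocytes",
        "Erythrocytes"
    ),
    n = c(2, 8, 12, 10, 6, 2, 1, 6, 1, 23, 9),
    stringsAsFactors = FALSE
)

# For each cluster, keep the row with the largest count
by_cluster <- split(cluster_counts, cluster_counts$label)
dominant <- do.call(rbind, lapply(by_cluster, function(d) d[which.max(d$n), ]))

# Fraction of each cluster accounted for by its most frequent cell type
cluster_sizes <- unname(vapply(by_cluster, function(d) sum(d$n), numeric(1)))
dominant$fraction <- dominant$n / cluster_sizes
dominant
```

Clusters 3 and 4 are almost entirely one cell type, while cluster 2 is mixed, which suggests the clustering and the reference-based labels only partly agree on this small dataset.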

We can easily reshape the data for building information-rich faceted plots.

pbmc_small_cell_type |>

    # Reshape and add classifier column
    pivot_longer(
        cols=c(label, first.labels),
        names_to="classifier", values_to="label"
    ) |>

    # UMAP plots for cell type and cluster
    ggplot(aes(UMAP1, UMAP2, color=label)) +
    geom_point() +
    facet_wrap(~classifier) +
    custom_theme

## tidySingleCellExperiment says: A data frame is returned for independent data analysis.

We can easily plot gene correlation per cell category, adding multi-layer annotations.

pbmc_small_cell_type |>

    # Add some mitochondrial abundance values
    mutate(mitochondrial=rnorm(dplyr::n())) |>

    # Plot correlation
    join_features(features=c("CST3", "LYZ"), shape="wide") |>
    ggplot(aes(CST3 + 1, LYZ + 1, color=groups, size=mitochondrial)) +
    geom_point() +
    facet_wrap(~first.labels, scales="free") +
    scale_x_log10() +
    scale_y_log10() +
    custom_theme

## Warning in is_sample_feature_deprecated_used(x, .cols):
## tidySingleCellExperiment says: from version 1.3.1, the special columns
## including cell id (colnames(se)) has changed to ".cell". This dataset is
## returned with the old-style vocabulary (cell), however, we suggest to update
## your workflow to reflect the new vocabulary (.cell).

Nested analyses

A powerful tool we can use with tidySingleCellExperiment is tidyverse nest. We can easily perform independent analyses on subsets of the dataset. First we classify cell types into lymphoid and myeloid, and then nest based on the new classification.

pbmc_small_nested <-
    pbmc_small_cell_type |>
    filter(first.labels != "Erythrocytes") |>
    mutate(cell_class=dplyr::if_else(`first.labels` %in% c("Macrophages", "Monocytes"), "myeloid", "lymphoid")) |>
    nest(data=-cell_class)

## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `data = map(...)`.
## Caused by warning in `is_sample_feature_deprecated_used()`:
## ! tidySingleCellExperiment says: from version 1.3.1, the special columns including cell id (colnames(se)) has changed to ".cell". This dataset is returned with the old-style vocabulary (cell), however, we suggest to update your workflow to reflect the new vocabulary (.cell).
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

pbmc_small_nested

## # A tibble: 2 × 2
##   cell_class data           
##   <chr>      <list>         
## 1 lymphoid   <SnglCllE[,40]>
## 2 myeloid    <SnglCllE[,31]>
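The sizes of the two nested objects can be cross-checked against the cell type counts reported earlier. The base R sketch below applies the same filter and lymphoid/myeloid rule to those counts; it is a consistency check on the numbers shown in this vignette, not a replacement for nest.

```r
# Cell type counts copied from the cluster-by-cell-type table earlier in this section
first.labels <- c(
    "CD4+ T-cells", "CD8+ T-cells", "NK cells",
    "B-cells", "CD4+ T-cells", "CD8+ T-cells", "Macrophages", "Monocytes",
    "Macrophages", "Monocytes",
    "Erythrocytes"
)
n <- c(2, 8, 12, 10, 6, 2, 1, 6, 1, 23, 9)

# Same rule as the filter() and mutate() calls above
keep <- first.labels != "Erythrocytes"
cell_class <- ifelse(
    first.labels[keep] %in% c("Macrophages", "Monocytes"),
    "myeloid", "lymphoid"
)

# Total number of cells per class
cells_per_class <- vapply(split(n[keep], cell_class), sum, numeric(1))
cells_per_class
```

The totals agree with the 40 and 31 columns shown for the lymphoid and myeloid objects.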

Now, independently for the lymphoid and myeloid subsets, we can (i) find variable features, (ii) reduce dimensions, and (iii) cluster, using both tidyverse and SingleCellExperiment seamlessly.

pbmc_small_nested_reanalysed <-
    pbmc_small_nested |>
    mutate(data=map(
        data, ~ {
            # (i) Find variable features within this subset
            variable_genes <-
                .x |>
                modelGeneVar() |>
                getTopHVGs(prop=0.3)

            # (ii) Reduce dimensions using the subset-specific variable features
            .x <- runPCA(.x, subset_row=variable_genes)

            # (iii) Cluster on the new PCA
            colLabels(.x) <-
                .x |>
                buildSNNGraph(use.dimred="PCA") |>
                igraph::cluster_walktrap() %$%
                membership |>
                as.factor()

            .x |> runUMAP(ncomponents=3)
        }
    ))

pbmc_small_nested_reanalysed

## # A tibble: 2 × 2
##   cell_class data           
##   <chr>      <list>         
## 1 lymphoid   <SnglCllE[,40]>
## 2 myeloid    <SnglCllE[,31]>

We can then unnest and plot the new classification.

pbmc_small_nested_reanalysed |>

    # Convert to tibble otherwise SingleCellExperiment drops reduced dimensions when unifying data sets.
    mutate(data=map(data, ~ .x |> as_tibble())) |>
    unnest(data) |>

    # Define unique clusters
    unite("cluster", c(cell_class, label), remove=FALSE) |>

    # Plotting
    ggplot(aes(UMAP1, UMAP2, color=cluster)) +
    geom_point() +
    facet_wrap(~cell_class) +
    custom_theme
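The unite step matters because cluster labels restart within each subset, so label 1 in the lymphoid object is unrelated to label 1 in the myeloid object. A base R sketch on a made-up cell table shows the equivalent operation, assuming the default `_` separator of unite.

```r
# Toy cell table (made-up): cluster labels restart at 1 within each subset
cells <- data.frame(
    cell_class = c("lymphoid", "lymphoid", "myeloid", "myeloid"),
    label = c("1", "2", "1", "2"),
    stringsAsFactors = FALSE
)

# Base R equivalent of unite("cluster", c(cell_class, label), remove=FALSE)
cells$cluster <- paste(cells$cell_class, cells$label, sep = "_")
cells

# Two distinct labels become four distinct clusters
length(unique(cells$label))
length(unique(cells$cluster))
```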

We can perform a large number of functional analyses on data subsets. For example, we can identify intra-sample cell-cell interactions using SingleCellSignalR [@cabello2020singlecellsignalr], and then compare whether interactions are stronger or weaker across conditions. The code below demonstrates how this analysis could be performed. It won’t work with this small example dataset as we have just two samples (one for each condition). But some example output is shown below and you can imagine how you can use tidyverse on the output to perform t-tests and visualisation.

pbmc_small_nested_interactions <-
    pbmc_small_nested_reanalysed |>

    # Unnest based on cell category
    unnest(data) |>

    # Create unambiguous clusters
    mutate(integrated_clusters=first.labels |> as.factor() |> as.integer()) |>

    # Nest based on sample
    nest(data=-sample) |>
    mutate(interactions=map(data, ~ {

        # Produce variables. Yuck!
        cluster <- colData(.x)$integrated_clusters
        data <- data.frame(assays(.x) |> as.list() |> extract2(1) |> as.matrix())

        # Ligand/Receptor analysis using SingleCellSignalR
        data |>
            cell_signaling(genes=rownames(data), cluster=cluster) |>
            inter_network(data=data, signal=_, genes=rownames(data), cluster=cluster) %$%
            `individual-networks` |>
            map_dfr(~ bind_rows(as_tibble(.x)))
    }))

pbmc_small_nested_interactions |>
    select(-data) |>
    unnest(interactions)

If the dataset were not so small, and interactions could be identified, you would see something like the output below.

data(pbmc_small_nested_interactions)
pbmc_small_nested_interactions

## # A tibble: 100 × 9
##    sample  ligand          receptor ligand.name receptor.name origin destination
##    <chr>   <chr>           <chr>    <chr>       <chr>         <chr>  <chr>      
##  1 sample1 cluster 1.PTMA  cluster… PTMA        VIPR1         clust… cluster 2  
##  2 sample1 cluster 1.B2M   cluster… B2M         KLRD1         clust… cluster 2  
##  3 sample1 cluster 1.IL16  cluster… IL16        CD4           clust… cluster 2  
##  4 sample1 cluster 1.HLA-B cluster… HLA-B       KLRD1         clust… cluster 2  
##  5 sample1 cluster 1.CALM1 cluster… CALM1       VIPR1         clust… cluster 2  
##  6 sample1 cluster 1.HLA-E cluster… HLA-E       KLRD1         clust… cluster 2  
##  7 sample1 cluster 1.GNAS  cluster… GNAS        VIPR1         clust… cluster 2  
##  8 sample1 cluster 1.B2M   cluster… B2M         HFE           clust… cluster 2  
##  9 sample1 cluster 1.PTMA  cluster… PTMA        VIPR1         clust… cluster 3  
## 10 sample1 cluster 1.CALM1 cluster… CALM1       VIPR1         clust… cluster 3  
## # ℹ 90 more rows
## # ℹ 2 more variables: interaction.type <chr>, LRscore <dbl>
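With several samples per condition, the comparison across conditions could be sketched as a two-sample t-test on the LRscore of one ligand-receptor pair. Everything in the block below is hypothetical: the six samples, the `condition` column and the score values are made up for illustration and are not output of the analysis above.

```r
# Made-up LRscore values for one ligand-receptor pair across six
# hypothetical samples (three per condition)
scores <- data.frame(
    sample = paste0("sample", 1:6),
    condition = rep(c("g1", "g2"), each = 3),
    LRscore = c(0.61, 0.66, 0.58, 0.74, 0.79, 0.71),
    stringsAsFactors = FALSE
)

# Welch two-sample t-test comparing interaction strength between conditions
test <- t.test(LRscore ~ condition, data = scores)

# Group means and p-value
test$estimate
test$p.value
```

In a real analysis this test would be repeated per interaction, for example after nesting by ligand and receptor, and the p-values would need multiple-testing correction.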

Aggregating cells

Sometimes, it is necessary to aggregate the gene-transcript abundance from a group of cells into a single value. This is needed, for example, when comparing groups of cells across different samples with fixed-effect models.

In tidySingleCellExperiment, cell aggregation can be achieved using the aggregate_cells function.

pbmc_small |>
  aggregate_cells(groups, assays = "counts")

## class: SummarizedExperiment 
## dim: 230 2 
## metadata(0):
## assays(1): counts
## rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
## rowData names(0):
## colnames(2): g1 g2
## colData names(4): .aggregated_cells groups orig.ident file
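Conceptually, a pseudobulk profile is often built by summing the counts of all cells in a group. The base R sketch below does this on a made-up matrix; it assumes aggregation by summation, which is the usual pseudobulk convention but is not confirmed here as what aggregate_cells does internally.

```r
# Toy counts matrix (made-up values): 2 genes x 4 cells, filled column by column
counts <- matrix(
    c(1, 0,
      2, 3,
      0, 4,
      5, 1),
    nrow = 2,
    dimnames = list(c("gene_a", "gene_b"), paste0("cell", 1:4))
)

# Group membership of each cell
groups <- c("g1", "g2", "g1", "g2")

# rowsum() sums rows by group, so transpose to put cells in rows,
# aggregate, and transpose back to genes x groups
pseudobulk <- t(rowsum(t(counts), group = groups))
pseudobulk
```

The result has one column per group, matching the `colnames(2): g1 g2` shape of the SummarizedExperiment shown above.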


tidysinglecellexperiment's Issues

join_features doesn't allow choosing which assay when multiple present

Noticed that it doesn't appear possible to choose which assay to join_features from when multiple assays are present, see workshop comment here tidytranscriptomics-workshops/bioc2022_tidytranscriptomics#14 (comment)

join_features help does have

... | Parameters to pass to join wide, i.e. assay name to extract feature abundance from and gene prefix, for shape="wide"

But these two calls below (logcounts and counts) give the same result. Or am I giving the assay in the wrong way?

sce_obj |>
    
  join_features(
    features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide",  "logcounts"
  )

sce_obj |>
    
  join_features(
    features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide", "counts"
  )

feature request: join_features shape="long" should carry the rowData as well, similarly to `tidySummarizedExperiment`

Error from plotting `sce` object: '.abundance_counts' not found

Hi there,

I have an sce object obtained after doing CATALYST clustering of flow cytometry data, and I want to use the tidySingleCellExperiment package to manipulate the object.

I was trying to plot some channels/markers as below, but I am unsure which data .abundance_counts is drawn from, given the error below.

Would you mind giving me some pointers?

Thank you.

> F37_sce_backboneClustering
# A SingleCellExperiment-tibble abstraction: 4,955,453 × 17
# Features=270 | Assays=exprs
   .cell                 sample_id    cell_count_prePeacoQC cell_count_postPeacoQC manualInspect_strang…¹
   <chr>                 <fct>        <fct>                 <fct>                  <fct>                 
 1 CATALYST28meta16_12_1 F37_10D_CD4… 50819                 50785                  Expected              
 2 CATALYST28meta16_3_2  F37_10D_CD4… 50819                 50785                  Expected              
 3 CATALYST28meta16_3_3  F37_10D_CD4… 50819                 50785                  Expected              
 4 CATALYST28meta16_2_4  F37_10D_CD4… 50819                 50785                  Expected              
 5 CATALYST28meta16_3_5  F37_10D_CD4… 50819                 50785                  Expected              
 6 CATALYST28meta16_2_6  F37_10D_CD4… 50819                 50785                  Expected              
 7 CATALYST28meta16_8_7  F37_10D_CD4… 50819                 50785                  Expected              
 8 CATALYST28meta16_3_8  F37_10D_CD4… 50819                 50785                  Expected              
 9 CATALYST28meta16_3_9  F37_10D_CD4… 50819                 50785                  Expected              
10 CATALYST28meta16_3_10 F37_10D_CD4… 50819                 50785                  Expected              
# ℹ 4,955,443 more rows
# ℹ abbreviated name: ¹manualInspect_strangeFCS
# ℹ 12 more variables: preFilter_liveCD3CD8_count <fct>, flowJo_prePeacoQC_gMFI_PE <fct>,
#   flowCore_postPeacoQC_medMFI_PE <fct>, cluster_id <fct>, CATALYST28meta16 <fct>, nColID <fct>,
#   CATALYST28meta16_nColID <chr>, CATALYST28meta16_uniqueLetterID <fct>, UMAP1 <dbl>, UMAP2 <dbl>,
#   TSNE1 <dbl>, TSNE2 <dbl>
# ℹ Use `print(n = ...)` to see more rows

> F37_sce_backboneClustering |>
+     join_features(features=c("CD4", "CD8")) |>
+     ggplot(aes(CATALYST28meta16, .abundance_counts + 1, fill=CATALYST28meta16)) +
+     geom_boxplot(outlier.shape=NA)  +
+     scale_y_log10() +
+     custom_theme

tidySingleCellExperiment says: This operation lead to duplicated cell names. A data frame is returned for independent data analysis.
Error in `geom_boxplot()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `FUN()`:
! object '.abundance_counts' not found
Backtrace:
  1. base (local) `<fn>`(x)
  2. ggplot2:::print.ggplot(x)
  4. ggplot2:::ggplot_build.ggplot(x)
  5. ggplot2:::by_layer(...)
 12. ggplot2 (local) f(l = layers[[i]], d = data[[i]])
 13. l$compute_aesthetics(d, plot)
 14. ggplot2 (local) compute_aesthetics(..., self = self)
 15. base::lapply(aesthetics, eval_tidy, data = data, env = env)
 16. rlang (local) FUN(X[[i]], ...)
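A likely explanation, inferred from the naming pattern in the join_features message in the README (`.abundance_counts`, `.abundance_logcounts`), is that the abundance column is named after the assay. This object has a single assay called exprs, so the column would be `.abundance_exprs` rather than `.abundance_counts`. This is an assumption based on that pattern, not something confirmed in this thread.

```r
# Assay names as printed in the object header above ("Assays=exprs")
assay_names <- c("exprs")

# Assumed naming pattern: ".abundance_" followed by the assay name
abundance_columns <- paste0(".abundance_", assay_names)
abundance_columns
```

If that holds, replacing `.abundance_counts` with `.abundance_exprs` in the `aes()` call should resolve the error. Calling `colnames()` on the output of join_features would confirm the actual column name.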

`tidySCE` Speedup aggregate_cells

This has taken inspiration and motivation from

https://twitter.com/lcolladotor/status/1687475222687936512

Bug when using `joinFeatures` + `SingleCellExperiment` with 1 cell

There is some weird behaviour when using a SingleCellExperiment with one cell. Firstly, note that we have two .cell columns in the second example, and both of them become part of the full coldata:

This example works fine:

> tidySingleCellExperiment::pbmc_small[, 1:2] |> tidySingleCellExperiment::join_features("HLA-DRA")
# A SingleCellExperiment-tibble abstraction: 2 × 20
# Features=230 | Cells=2 | Assays=counts, logcounts
  .cell   .feat…¹ .abun…² .abun…³ orig.…⁴ nCoun…⁵ nFeat…⁶ RNA_s…⁷ lette…⁸ groups
  <chr>   <chr>     <dbl>   <dbl> <fct>     <dbl>   <int> <fct>   <fct>   <chr>
1 ATGCCA… HLA-DRA       0    0    Seurat…      70      47 0       A       g2
2 CATGGC… HLA-DRA       1    4.78 Seurat…      85      52 0       A       g1
# … with 10 more variables: RNA_snn_res.1 <fct>, file <chr>, ident <fct>,
#   PC_1 <dbl>, PC_2 <dbl>, PC_3 <dbl>, PC_4 <dbl>, PC_5 <dbl>, tSNE_1 <dbl>,
#   tSNE_2 <dbl>, and abbreviated variable names ¹.feature, ².abundance_counts,
#   ³.abundance_logcounts, ⁴orig.ident, ⁵nCount_RNA, ⁶nFeature_RNA,
#   ⁷RNA_snn_res.0.8, ⁸letter.idents
# ℹ Use `colnames()` to see all variable names

But as soon as we use one cell:

> tidySingleCellExperiment::pbmc_small[, 1] |> tidySingleCellExperiment::join_features("HLA-DRA")
New names:
• `.cell` -> `.cell...1`
• `.cell` -> `.cell...2`
tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A tibble: 1 × 35
  .feature .abundance_…¹ .abun…² .cell…³ .cell…⁴ orig.…⁵ nCoun…⁶ nFeat…⁷ RNA_s…⁸
  <chr>            <dbl>   <dbl> <chr>   <chr>   <fct>     <dbl>   <int> <fct>
1 HLA-DRA              0       0 .cell   ATGCCA… Seurat…      70      47 0
# … with 26 more variables: letter.idents <fct>, groups <chr>,
#   RNA_snn_res.1 <fct>, file <chr>, ident <fct>, PC_1 <dbl>, PC_2 <dbl>,
#   PC_3 <dbl>, PC_4 <dbl>, PC_5 <dbl>, PC_6 <dbl>, PC_7 <dbl>, PC_8 <dbl>,
#   PC_9 <dbl>, PC_10 <dbl>, PC_11 <dbl>, PC_12 <dbl>, PC_13 <dbl>,
#   PC_14 <dbl>, PC_15 <dbl>, PC_16 <dbl>, PC_17 <dbl>, PC_18 <dbl>,
#   PC_19 <dbl>, tSNE_1 <dbl>, tSNE_2 <dbl>, and abbreviated variable names
#   ¹.abundance_counts, ².abundance_logcounts, ³.cell...1, ⁴.cell...2, …
# ℹ Use `colnames()` to see all variable names
> tidySingleCellExperiment::pbmc_small[, 1] |> tidySingleCellExperiment::join_features("HLA-DRA") |> colnames()
New names:
• `.cell` -> `.cell...1`
• `.cell` -> `.cell...2`
tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
 [1] ".feature"             ".abundance_counts"    ".abundance_logcounts"
 [4] ".cell...1"            ".cell...2"            "orig.ident"
 [7] "nCount_RNA"           "nFeature_RNA"         "RNA_snn_res.0.8"
[10] "letter.idents"        "groups"               "RNA_snn_res.1"
[13] "file"                 "ident"                "PC_1"
[16] "PC_2"                 "PC_3"                 "PC_4"
[19] "PC_5"                 "PC_6"                 "PC_7"
[22] "PC_8"                 "PC_9"                 "PC_10"
[25] "PC_11"                "PC_12"                "PC_13"
[28] "PC_14"                "PC_15"                "PC_16"
[31] "PC_17"                "PC_18"                "PC_19"
[34] "tSNE_1"               "tSNE_2"

Speed up `nest` and `unnest` for massive datasets in `tidySingleCellExperiment`

There is an example in the tidyseurat repository of how we can speed up nesting for very simple use cases using the split and merge functionality of Seurat.

`options("restore_SingleCellExperiment_show" = FALSE)` doesn't work?

It looks like options("restore_SingleCellExperiment_show" = FALSE) doesn't work to revert back to tidy view?

See the workshop here https://tidytranscriptomics-workshops.github.io/bioc2022_tidytranscriptomics/articles/tidytranscriptomics_case_study.html#introduction-to-tidysinglecellexperiment

It worked fine with tidySummarizedExperiment here https://saskiafreytag.github.io/biocommons-r-intro/60-next-steps/index.html

Inconsistency between assay data used and what is advised in SingleR?

Hi there,

Q1: In the tutorial, logcounts were used, but the SingleR documentation strongly advises against using any transformed data and prefers raw counts. Would you mind clarifying whether there is a reason you use logcounts here?

Q2: If my sce object is flow data, should I simply set Matrix::Matrix(sparse = F)?

Thank you.

# Get cell type reference data
blueprint <- celldex::BlueprintEncodeData()

# Infer cell identities
cell_type_df <-

    assays(pbmc_small_UMAP)$logcounts %>%
    Matrix::Matrix(sparse = TRUE) %>%
    SingleR::SingleR(
        ref = blueprint,
        labels = blueprint$label.main,
        method = "single"
    ) %>%
    as.data.frame() %>%
    as_tibble(rownames="cell") %>%
    select(cell, first.labels)

From SingleR (https://bioconductor.org/books/release/SingleRBook/classic-mode.html):
"For the test data, the assay data need not be log-transformed or even (scale) normalized. This is because SingleR() computes Spearman correlations within each cell, which is unaffected by monotonic transformations like cell-specific scaling or log-transformation. It is perfectly satisfactory to provide the raw counts for the test dataset to SingleR(), which is the reason for setting assay.type.test=1 in our previous SingleR() call for the Grun dataset."
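The quoted point about Spearman correlation can be checked with a small base R sketch. The simulated counts and the `reference_profile` vector below are made up for illustration and are not SingleR inputs; the sketch only shows that rank-based correlation is unchanged by log-transformation or by scaling a cell's counts.

```r
set.seed(42)

# Simulated raw counts for one cell across 200 genes (illustrative data only).
raw_counts <- rpois(200, lambda = 20)

# Hypothetical reference expression profile for the same genes.
reference_profile <- raw_counts + rnorm(200, sd = 5)

# Spearman correlation on raw, log-transformed and library-size-scaled values.
rho_raw    <- cor(raw_counts, reference_profile, method = "spearman")
rho_log    <- cor(log1p(raw_counts), reference_profile, method = "spearman")
rho_scaled <- cor(raw_counts / sum(raw_counts) * 1e4, reference_profile,
                  method = "spearman")

# Monotonic transformations keep the ranks, so all three agree.
all.equal(rho_raw, rho_log)
all.equal(rho_raw, rho_scaled)
```

Under this reading, passing logcounts or counts as test data should lead to the same within-cell ranks, which is consistent with the quoted passage.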

Installation error with latest (1.7.2) in R 4.0.1: ERROR: unable to collate and parse R files for package ‘tidySingleCellExperiment’

devtools::install_github("stemangiola/tidySingleCellExperiment")
Downloading GitHub repo stemangiola/tidySingleCellExperiment@HEAD
✔ checking for file ‘/tmp/RtmpKpe7AU/remotes4c012bf2b/stemangiola-tidySingleCellExperiment-27335c1/DESCRIPTION’ (502ms)
─ preparing ‘tidySingleCellExperiment’:
✔ checking DESCRIPTION meta-information ...
─ installing the package to process help pages
-----------------------------------
─ installing source package ‘tidySingleCellExperiment’ ...
** using staged installation
** R
Error in parse(outFile) :
/tmp/Rtmp2rFGO3/Rbuild4c363d10ffec/tidySingleCellExperiment/R/print_method.R:39:27: unexpected '>'
38:
39: number_of_features = x |>
^
ERROR: unable to collate and parse R files for package ‘tidySingleCellExperiment’
─ removing ‘/tmp/Rtmp2rFGO3/Rinst4c364825a588/tidySingleCellExperiment’
-----------------------------------
ERROR: package installation failed
Error: Failed to install 'tidySingleCellExperiment' from GitHub:
! System command 'R' failed

add join_by function from dplyr

dplyr added the new join_by() helper, which is passed to the `by` argument of the join functions, e.g. left_join(by = join_by(...)).
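As a minimal sketch of what `join_by()` looks like on plain tibbles (assuming dplyr 1.1.0 or later; the `cells` and `samples` tables and their column names are made up for illustration):

```r
library(dplyr)

# Illustrative per-cell and per-sample tables with differently named keys.
cells <- tibble(
    cell = c("c1", "c2", "c3"),
    sample_id = c("s1", "s1", "s2")
)
samples <- tibble(
    sample = c("s1", "s2"),
    condition = c("ctrl", "treated")
)

# join_by() is passed through the `by` argument of the join verb.
annotated <- left_join(cells, samples, by = join_by(sample_id == sample))
annotated
```

Supporting this for SingleCellExperiment objects would presumably mean that the package's own join methods accept the same `by` specification.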

try to not re-declare generics

comply to Bioconductor review

DESCRIPTION

I think the package should be re-named for maximum clarity as TidySingleCellExperiment -- the abbreviation SCE could mean anything. The capitalization follows Bioconductor convention.

We are well aligned, as I don't like acronyms either. As a note, sce is the common name for a SingleCellExperiment object (e.g. in this official source) and is widely accepted by the community.

In order to respect both Bioconductor tradition and tidyverse standards (after all, this is an equal-rights marriage), the name tidySingleCellExperiment can be a good solution. (I understand the S4 point of view, but this package is dedicated to end-users, so elegance in design and vocabulary has them as its priority.)

Having said that, at BioC-Asia2020 we will gather the opinion of the community with a poll at our workshop.

Title: field opening quote not closed; is it necessary?
Description: Not sure what an 'invisible layer' is as a programming concept?
Imports: not all packages listed here are actually used directly? Remove packages that provide functionality that is not directly referenced.

TESTS

consider organising test files so that they have the same structure as the R/ files -- each tests/testthat/test-* is testing code defined in the corresponding R/* file. This organization makes it easier to navigate the code.
hmm, I find the coding style here very difficult to read, and wonder why it does not follow tidyverse guidelines? https://style.tidyverse.org/pipes.html#whitespace .
Likewise with tests, it would seem that nesting pipes within expect_() is antithetic to readability -- instead, create a temporary variable from a piped expression, and use the variable in the expect_() statement?

I opted for full tidyverse style without the need to create temporary variables

test-dplyr.R:16 It is never appropriate to use direct slot access; use accessors instead.

R (comments are on specific lines of code, but apply throughout)

data.R I was surprised that these data sets were not documented here, where they are defined? The documentation in man/pbmc_small.Rd, for instance, is inadequate -- provide a description of what the object represents, how it was derived, how it is used in the package.
dplyr_methods.R:58 Redefining the generic is not appropriate. Instead, define S3 methods that implement functionality for the specific classes you wish to provide methods for, and export the methods. Since you expect the user to use the dplyr generics, place dplyr in the Depends: field of the DESCRIPTION file.

I worked on it, and I agree. Respecting this standard is on our to-do list for tidySCE, tidySE and tidybulk. However, I was not able to make things work under this standard in this branch (either the export from roxygen disappears, or the method.class does not get exported). I ask you to give us a release cycle to fix this. Although this is a better approach, the current one does not negatively affect the use of any package.
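The reviewer's suggestion (add a method for the existing dplyr generic rather than redefining the generic) can be sketched as follows. The `cell_table` class and its constructor are hypothetical and exist only for this illustration; this is not the package's implementation.

```r
library(dplyr)

# Hypothetical class used only for illustration: a data frame tagged "cell_table".
new_cell_table <- function(df) {
    structure(df, class = c("cell_table", "data.frame"))
}

# An S3 method for the existing dplyr generic; the generic itself is untouched.
filter.cell_table <- function(.data, ...) {
    # as.data.frame() drops the extra class so the data frame method runs.
    plain <- as.data.frame(.data)
    new_cell_table(dplyr::filter(plain, ...))
}

cells <- new_cell_table(data.frame(
    cell = c("c1", "c2", "c3"),
    nCount = c(70, 85, 10)
))

# The user keeps calling dplyr's filter(); dispatch picks the class method.
kept <- filter(cells, nCount > 50)
class(kept)
```

Inside a package, the method would additionally need to be registered in the NAMESPACE (an `S3method(filter, cell_table)` directive, which roxygen2 normally generates from `@export` on the method).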

dplyr_methods.R:157 I find the mix of 'base' and tidy functionality confusing; why use cbind() rather than the tidy equivalent? In the next line,

I used cbind to cbind two SingleCellExperiment objects

dplyr_methods.R:189 Never use direct slot access @, instead use accessors colData(). It's strange to see as.data.frame() in this line, rather than as_tibble(), and likewise creating the second argument as a DataFrame()???

tts[[1]] is a SingleCellExperiment; colData(tts[[1]]) is a DataFrame that cannot be converted to tibble directly; DataFrame is used to update colData from SCE

dplyr_methods.R:224 This formatting

message("tidySCE says: A data frame is returned for
independent data analysis.")
results in quasi-arbitrary formatting for output, where the white space introduced for programmatic convenience is echoed literally to the user

tidySCE says: A data frame is returned for
independent data analysis.
Rely on message()'s internal use of paste0() or, if precise formatting is important, use strwrap()

message(
"tidySCE says: A data frame is returned for ",
"independent data analysis."
)
or (presumably in a helper function for re-use across your code)

txt <- paste(
"tidySCE says: A data frame is returned for ",
"independent data analysis."
)
message(paste(strwrap(txt, exdent = 4), collapse = "\n"))
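The helper the reviewer alludes to could look roughly like this. `wrap_message()` is a hypothetical name, not a function from the package; it joins the pieces with `paste0()`, wraps the result with `strwrap()`, and emits a single message.

```r
# Hypothetical helper (not part of the package): build one string from the
# pieces, wrap it to a target width, and emit it as a single message.
wrap_message <- function(..., width = 60, exdent = 4) {
    txt <- paste0(...)
    wrapped <- paste(
        strwrap(txt, width = width, exdent = exdent),
        collapse = "\n"
    )
    message(wrapped)
    invisible(wrapped)
}

wrap_message(
    "tidySCE says: A data frame is returned for ",
    "independent data analysis."
)
```

Because the line breaks come from `strwrap()` rather than from the source code layout, the output no longer depends on how the string literal happens to be indented.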

dplyr_methods.R:622 long lines of piped code and nested conditional tests are unreadable. Reformat, e.g.,

tst <- intersect(
    cols %>%
        names(),
    get_special_columns(.data) %>%
        c(get_needed_columns())
) %>%
    length() %>%
    gt(0)
if (tst) {

(why do you write c(get_needed_columns())) instead of get_needed_columns()?)

this is a concatenation "a" %>% c("b")
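To make the concatenation point concrete, here is a small sketch using magrittr. The two character vectors stand in for the results of `get_special_columns()` and `get_needed_columns()`, and `requested` is a made-up set of column names; all three are assumptions for illustration.

```r
library(magrittr)

# Hypothetical stand-ins for get_special_columns() and get_needed_columns().
special_columns <- c("PC_1", "PC_2")
needed_columns <- c(".cell")

# Piping into c() appends the second vector to the first ...
piped <- special_columns %>% c(needed_columns)

# ... which is the same as calling c() on both vectors directly.
direct <- c(special_columns, needed_columns)
identical(piped, direct)

# The same check written procedurally, as the reviewer suggests.
requested <- c("PC_1", "nCount_RNA")
view_only <- c(special_columns, needed_columns)
touches_view_only <- length(intersect(requested, view_only)) > 0
touches_view_only
```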

In the stop() message immediately after, reconsider use of stop(sprintf()) and instead rely on stop()'s internal use of paste0()

columns <-
    get_special_columns(.data) %>%
    c(get_needed_columns()) %>%
    paste(collapse=", ")
stop(
    "tidySCE says: you are trying to rename a column that is view only ",
    columns, " ",
    "(it is not present in the colData). If you want to mutate a view-only ",
    "column, make a copy and mutate that one."
)

For such long messages adopting a strwrap() discipline as outlined above would seem to be highly desirable.

For the moment we have adopted the stop paste0 function suggested above. In the future this will be an area of improvement.

dplyr_methods.R:775 this code is repeated several times. Write a function instead to remove code duplication.

the code is similar but not identical

dplyr_methods.R:1146 it should not be necessary to preface functions with their package names; I don't think this helps readability of the code.

I understand the concern; however, since I am overwriting some tidyverse functions, this helps avoid some inconsistencies. We plan to improve the code clarity as the package matures.

methods.R:3 it's strange to see a class definition at the top of a file called 'methods.R'; a more common organization would give the file the name of the class -- tidySCE.R. The Bioconductor / S4 convention (this is an S4 class, after all...) would name this class TidySCE, and perhaps, further recognizing the importance of being clear to the user and that the user will seldom have to type out the word in its entirety (e.g., because of autocompletion in RStudio), it would be better to name the class explicitly TidySingleCellExperiment.

We agree, we wait to create this file once the package name has been settled (first point in the review list).

methods.R:106 highly nested calls like this are just too complicated for the user to parse. Adopt a more procedural programming style with less nesting.

MAN

not sure what the figures/ directory is doing here?

It's used for the figures in the README, as here: https://r-pkgs.org/whole-game.html

try to override show function without redefining a new class

add aggregate_cells

Hello @william-hutchison , this feature request noted by @michael love could be a nice little project.

tidytranscriptomics-workshops/LoveMangiola2022_tidytranscriptomics#2

request from Michael is here: tidytranscriptomics-workshops/LoveMangiola2022_tidytranscriptomics#2 (comment)

as_tibble() and slice_sample() cannot handle a column named $cell

If you're dying to reproduce the whole thing, see https://github.com/VanAndelInstitute/ExpDesign2021/blob/main/vignettes/project2.md for gory details.

The bug:

R> as_tibble(tidybarnyard) %>% filter(method == "inDrops") %>% slice_sample()
Error: Can't transform a data frame with duplicate names.
Run `rlang::last_error()` to see where the error occurred.

Currently the fix is to rename that column. But that's a crappy fix!

R> names(colData(tidybarnyard))[4] <- "barcode"
R> as_tibble(tidybarnyard) %>% filter(method == "inDrops") %>% slice_sample()
# A tibble: 1 × 5
  cell                                        name    experiment method  barcode
  <chr>                                       <chr>   <chr>      <chr>   <chr>  
1 Mixture2.inDrops.ACCACAGA-AAGAGCGT-CTTGGTGT Mixtur… Mixture2   inDrops ACCACA…

Session info:

R> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    graphics  grDevices datasets  stats     utils     methods  
[8] base     

other attached packages:
 [1] forcats_0.5.1                  readr_2.1.0                   
 [3] tidyverse_1.3.1                stringr_1.4.0                 
 [5] RcppML_0.5.2                   tidySingleCellExperiment_1.4.0
 [7] SingleCellExperiment_1.16.0    SummarizedExperiment_1.24.0   
 [9] Biobase_2.54.0                 GenomicRanges_1.46.0          
[11] GenomeInfoDb_1.30.0            IRanges_2.28.0                
[13] S4Vectors_0.32.2               BiocGenerics_0.40.0           
[15] MatrixGenerics_1.6.0           matrixStats_0.61.0            
[17] nvimcom_0.9-122                colorout_1.2-2                
[19] skeletor_1.0.4                 gtools_3.9.2                  
[21] useful_1.2.6                   knitr_1.36                    
[23] yardstick_0.0.8                workflowsets_0.1.0            
[25] workflows_0.2.4                tune_0.1.6                    
[27] tidyr_1.1.4                    tibble_3.1.6                  
[29] rsample_0.1.1                  recipes_0.1.17                
[31] purrr_0.3.4                    parsnip_0.1.7.9001            
[33] modeldata_0.1.1                infer_1.0.0                   
[35] ggplot2_3.3.5                  dplyr_1.0.7                   
[37] dials_0.0.10                   scales_1.1.1                  
[39] broom_0.7.10                   tidymodels_0.1.4              
[41] BiocManager_1.30.16           

loaded via a namespace (and not attached):
  [1] readxl_1.3.1           backports_1.3.0        RcppEigen_0.3.3.9.1   
  [4] plyr_1.8.6             lazyeval_0.2.2         splines_4.1.1         
  [7] listenv_0.8.0          usethis_2.1.3          digest_0.6.28         
 [10] foreach_1.5.1          htmltools_0.5.2        fansi_0.5.0           
 [13] magrittr_2.0.1         memoise_2.0.0.9000     tzdb_0.2.0            
 [16] remotes_2.4.1          globals_0.14.0         modelr_0.1.8          
 [19] gower_0.2.2            hardhat_0.1.6          prettyunits_1.1.1     
 [22] colorspace_2.0-2       rvest_1.0.2            haven_2.4.3           
 [25] xfun_0.28              tcltk_4.1.1            callr_3.7.0           
 [28] crayon_1.4.2           RCurl_1.98-1.5         jsonlite_1.7.2        
 [31] roxygen2_7.1.2         survival_3.2-13        iterators_1.0.13      
 [34] glue_1.5.0             gtable_0.3.0           ipred_0.9-12          
 [37] zlibbioc_1.40.0        XVector_0.34.0         DelayedArray_0.20.0   
 [40] pkgbuild_1.2.0         future.apply_1.8.1     DBI_1.1.1             
 [43] Rcpp_1.0.7             viridisLite_0.4.0      GPfit_1.0-8           
 [46] lava_1.6.10            prodlim_2019.11.13     htmlwidgets_1.5.4     
 [49] httr_1.4.2             ellipsis_0.3.2         pkgconfig_2.0.3       
 [52] nnet_7.3-16            dbplyr_2.1.1           utf8_1.2.2            
 [55] tidyselect_1.1.1       rlang_0.4.12           DiceDesign_1.9        
 [58] cellranger_1.1.0       munsell_0.5.0          tools_4.1.1           
 [61] cachem_1.0.6           cli_3.1.0              generics_0.1.1        
 [64] devtools_2.4.2         fastmap_1.1.0          processx_3.5.2        
 [67] fs_1.5.0               future_1.23.0          xml2_1.3.2            
 [70] compiler_4.1.1         rstudioapi_0.13        plotly_4.10.0         
 [73] testthat_3.1.0         reprex_2.0.1           lhs_1.1.3             
 [76] stringi_1.7.5          ps_1.6.0               desc_1.4.0            
 [79] lattice_0.20-45        Matrix_1.3-4           vctrs_0.3.8           
 [82] pillar_1.6.4           lifecycle_1.0.1        furrr_0.2.3           
 [85] data.table_1.14.2      bitops_1.0-7           R6_2.5.1              
 [88] KernSmooth_2.23-20     parallelly_1.28.1      sessioninfo_1.2.1     
 [91] codetools_0.2-18       MASS_7.3-54            assertthat_0.2.1      
 [94] pkgload_1.2.3          rprojroot_2.0.2        withr_2.4.2           
 [97] GenomeInfoDbData_1.2.7 hms_1.1.1              parallel_4.1.1        
[100] grid_4.1.1             rpart_4.1-15           timeDate_3043.102     
[103] class_7.3-19           pROC_1.18.0            lubridate_1.8.0

Implement `group_by` and `with_groups` for `tidySingleCellExperiment` and `tidyseurat`

the issue for tidyseurat is here

stemangiola/tidyseurat#65

`tidySCE`: `slice` masks tidyverse::slice

@HelenaLC on one occasion, loading tidyverse first and tidySingleCellExperiment second, I get

The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice

Specifically slice seems to be the only method that overrides the generics for some reason.

When/if you have time, could you please have a quick look? Feel free to reject the invitation if you don't have throughput.

Thanks!

Adding aggregate_cells in README and Vignette

Hello @william-hutchison , when you have time could you please add aggregate_cells in the function list of README and Vignette, and add an example in the README and Vignette?

Repo website link is broken

The repo's website is a broken link:

Points to https://stemangiola.github.io/tidySCE/articles/introduction.html instead of https://stemangiola.github.io/tidySingleCellExperiment/index.html

update github action

Hello @mblue9 , when you have time (no rush at all), do you think you could replicate the edits you have done for tidybulk?

I think tidyseurat and tidyBioconductor have the same problem with R 4.1.

Thanks a lot!

`tidySCE` Functionality to see and access the altExp slot

Hi all,

I am really liking your package, it has made many operations infinitely easier. We tend to prefer the tidySingleCellExperiment to Seurat, however, the one thing we have noticed is that there is no functionality to access the altExp slot, where our CITE-seq data are stored. In the standard SingleCellExperiment print method the altExpNames are listed at the bottom, whereas this is completely hidden in the tibble abstraction. Could this possibly be added?

Many thanks in advance.

adapt aggregate_features() to tidyseurat which is based on ttservice

@william-hutchison aggregate_cells() in tidyseurat relies on the generic method in ttservice, could you please adapt this method to tidyseurat, so they can be used together if both tidyseurat and tidySCE are loaded at the same time?

stemangiola/tidyseurat#61

It is just a few lines of code to be modified in the methods

subsample sce object based on factor in colData

I would like to sub-sample a SingleCellExperiment object based on a factor in colData.

I have a singleCellExperiment object:

> sce
# A SingleCellExperiment-tibble abstraction: 13,268,769 × 6
# Features=42 | Assays=exprs

with some colData:

> colData(sce)
DataFrame with 13268769 rows and 5 columns
         sample_id condition patient_id    label1 cluster_id
          <factor>  <factor>   <factor> <numeric>   <factor>
1            D929I       Ref      D929I        36        302
2            D929I       Ref      D929I        29        285
3            D929I       Ref      D929I        50        103
4            D929I       Ref      D929I        36        302
5            D929I       Ref      D929I        51        181
...            ...       ...        ...       ...        ...
13268765     D232I       Ref      D232I        51        201
13268766     D232I       Ref      D232I        28        304
13268767     D232I       Ref      D232I        50        5  
13268768     D232I       Ref      D232I        51        184
13268769     D232I       Ref      D232I        18        364

I would like to subsample based on the cluster_id column such that I have max X (500) events of each cluster.

I can get the selection of cells using the following code:

> sce %>% group_by(cluster_id) %>% slice_sample(n=500) %>% ungroup()
tidySingleCellExperiment says: A data frame is returned for independent data analysis.
# A tibble: 200,000 × 6
   .cell    sample_id condition patient_id label1 cluster_id
   <chr>    <fct>     <fct>     <fct>       <dbl> <fct>     
 1 4002318  D0749I    Ref       D0749I         60 1         
 2 10259368 D590I     Ref       D590I          60 1         
 3 12615676 D232I     Ref       D232I          25 1         
 4 6765422  D694I     Ref       D694I          25 1         
 5 9415336  D0553I    Ref       D0553I         60 1         
 6 7245671  D694I     Ref       D694I          25 1         
 7 7177144  D694I     Ref       D694I          42 1         
 8 7002069  D694I     Ref       D694I          49 1         
 9 8732040  D615I     Ref       D615I          60 1         
10 3989255  D0749I    Ref       D0749I         60 1         
# … with 199,990 more rows
# ℹ Use `print(n = ...)` to see more rows

But I don't know how I would use this to filter the original singleCellExperiment object.

Could you please give me a pointer?

Thanks
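One possible approach, sketched in base R on toy data: compute integer column indices per cluster and subset the object with them. The matrix `expr` and the data frame `cell_data` are stand-ins for the assay and the colData, and all names and sizes here are assumptions for illustration.

```r
set.seed(1)

# Toy stand-ins for an SCE: a features x cells matrix plus per-cell metadata.
n_cells <- 20
expr <- matrix(
    rpois(3 * n_cells, lambda = 5),
    nrow = 3,
    dimnames = list(paste0("marker", 1:3), paste0("cell", seq_len(n_cells)))
)
cell_data <- data.frame(
    cluster_id = factor(rep(c("1", "2", "3"), times = c(12, 6, 2))),
    row.names = colnames(expr)
)

# Keep at most `max_per_cluster` randomly chosen cells from every cluster.
max_per_cluster <- 5
keep <- unlist(lapply(
    split(seq_len(n_cells), cell_data$cluster_id),
    function(i) i[sample.int(length(i), min(length(i), max_per_cluster))]
), use.names = FALSE)
keep <- sort(keep)

# Subset the columns (cells) and the matching metadata rows.
expr_sub <- expr[, keep, drop = FALSE]
cell_data_sub <- cell_data[keep, , drop = FALSE]
table(cell_data_sub$cluster_id)
```

With an actual SingleCellExperiment, the same integer index can be used for column subsetting as `sce[, keep]`, with `colData(sce)$cluster_id` in place of `cell_data$cluster_id`. Clusters smaller than the cap are kept whole.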

Consider adding summarize.SingleCellExperiment as an explicit alias of summarise.SingleCellExperiment

This is a very small bauble, but see the following issue:

suppressMessages(library(tidySingleCellExperiment))

pbmc_small |>
  summarise(nCount_RNA = mean(nCount_RNA))
#> tidySingleCellExperiment says: A data frame is returned for independent data analysis.
#> # A tibble: 1 × 1
#>   nCount_RNA
#>        <dbl>
#> 1       245.

pbmc_small |> 
  summarize(nCount_RNA = mean(nCount_RNA))
#> Error in summarize(pbmc_small, nCount_RNA = mean(nCount_RNA)): could not find function "summarize"

Created on 2023-06-06 with reprex v2.0.2

What I mean to show here is that, when dplyr is not attached, you can run into some namespace issues with summarise vs. summarize method deployment for SingleCellExperiment's
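For context, in dplyr the two spellings refer to the same function, so the error above is about which name is exported and visible, not about a missing method. A short check on a plain data frame (using the built-in `mtcars` data, unrelated to the package) illustrates this:

```r
# Both spellings point at the same dplyr function object.
same_function <- identical(dplyr::summarize, dplyr::summarise)
same_function

# Calling the American spelling through the dplyr namespace works even
# when dplyr is not attached.
res <- dplyr::summarize(mtcars, mean_mpg = mean(mpg))
res
```

Until an alias is exported, calling `dplyr::summarize()` explicitly may serve as a workaround, on the assumption that dispatch then reaches the registered `summarise` method for SingleCellExperiment.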

README has an incredibly long assay(pbmc_small_tidy)

Hello @mblue9, sorry I merged without double-checking. The README has an incredibly long assay(pbmc_small_tidy), have a look :)

Clarification of getting back/disabling the viewing of sce as tibble?

Hi there,

Thanks for developing a very useful package.

Would you mind clarifying how we can disable the tibble view ("get back" the sce class of the sce object)?

After I installed the package, immediately my sce object becomes a tibble object without me setting or running any function to specify that, which was a bit unexpected.

`tidySCE` deal with missing assay names

Inspired by

@csoneson at

stemangiola/tidySummarizedExperiment#78

Beta testing - do you know somebody?

Hello @mblue9 ,

I was thinking, do you happen to know anybody working with bioconductor (or Seurat) for single cell, that would like to start using any of these packages?

Clarifications about `.abundance_counts`

Hi @stemangiola - My apologies for re-asking these questions following up from #92. Would you mind giving me some pointers?

Q1. Would you mind confirming that the duplicated cell identifiers are created as a byproduct of the plotting function, and not because the input data has duplicated cell identifiers?

Q2. What exactly does the .abundance_ function calculate when there is one feature and when there are multiple features? Can we control how the expression is grouped for plotting, e.g., median, ...?

Q3. Is it possible to modify the code below so that we can split a cell cluster by some threshold of expression of a marker or set of markers and plot these cell splits next to each other in the bar plot?

For example, in the plot below, CATALYST28meta16 is my cluster IDs on the x-axis. I want to have (instead of 1 bar of all the cells per cluster) 2 bars for each cluster, and one bar is of a group of cells that have "low" CD3 expression and the other bar is of a group of cells that have high CD3 expression?

speed up as_tibble in case I have nested columns in the metadata e.g. TCR

An example file is /stornext/Bioinf/data/bioinf-data/Papenfuss_lab/projects/mangiola.s/PostDoc/covid19pbmc/data/all_batches/raw_counts/C120_COVID_PBMC_batch6/data/SCEs/C120_COVID_PBMC_batch6.demultiplexed.SCE.rds

Method for S4 DataFrame for left_join

This way we can use colData(sce) in left join into another SCE object without having to convert to S3 tibble.

Data representation of `tidySingleCellExperiment`