nanostring-biostats / fastreseg

FastReseg: detection and correction of cell segmentation errors based on the spatial profiles of transcripts.

Home Page: https://nanostring-biostats.github.io/FastReseg/

License: Other

R 23.92% HTML 76.08%

fastreseg's Introduction

FastReseg

An R package for detection and correction of cell segmentation errors based on the spatial profiles of transcripts

Dev notes

The FastReseg package processes spatial transcriptome data through five modules:

  • Preprocess on whole dataset

    1. runPreprocess() to get the baseline and cutoffs for transcript score and spatial local density from the whole dataset.
  • Parallel processing on individual FOVs: the core wrapper fastReseg_perFOV_full_process() contains all three modules in this step.

    1. runSegErrorEvaluation() to detect cells with cell segmentation error based on spatial dependency of transcript score in a given transcript data.frame.
    2. runTranscriptErrorDetection() to identify transcript groups with poor fit to current cell segments.
    3. runSegRefinement() to re-segment the low-fit transcript groups given their neighborhood.
  • Combine outcomes from multiple FOVs into one

    1. Pipeline wrappers combine the resegmentation outputs from individual FOVs into one.

For convenience, two pipeline wrapper functions are included for streamlined processing of multi-FOV datasets to different exit points.

  • fastReseg_flag_all_errors(): performs preprocessing, then evaluates and flags segmentation errors in all input files; optionally returns a gene expression matrix in which all putative contaminating transcripts are trimmed from the current cell segments.
  • fastReseg_full_pipeline(): performs preprocessing, then detects and corrects cell segmentation errors by trimming, splitting, and merging based on the local neighborhood of poor-fit transcript groups; can process multiple input files in parallel.

System requirements

  • R (>= 3.5.0)
  • UNIX, Mac or Windows
  • see DESCRIPTION for full dependencies

Demo

See the "vignettes" folder.

  • tutorial.Rmd and tutorial.html for example usage of the streamlined pipeline wrappers and of the modular functions for individual tasks.
  • a__flagErrorOnly_on_SMIobject.R for flagging segmentation errors without correcting, interfacing FastReseg with SMI TAP pipeline (Giotto).
  • b__fastReseg_on_SMIobject.R for running the entire resegmentation workflow on a given example dataset, interfacing FastReseg with the SMI TAP pipeline (Giotto).

Workflow:

[image: schematic of FastReseg workflow]

Installation

Install the development version from GitHub

if (!requireNamespace("GiottoClass", quietly = TRUE))
  remotes::install_github("drieslab/GiottoClass", upgrade = "never",
                          ref = "6d9d385beebcc57b78d34ffbe30be1ef0a184681")

devtools::install_github("Nanostring-Biostats/FastReseg",
                         build_vignettes = TRUE, ref = "main")

fastreseg's People

Contributors

lidanwu, dan11mcguire


Watchers

Patrick Danaher, Evelyn Metzger, Nicole Ortogero

fastreseg's Issues

faster replacement for leiden clustering

Here's an idea for a much faster alternative to leiden clustering:

How can we tell if a group should be merged with a neighbor or not? I think the key pattern is that groups that don't deserve to be merged with their neighbors make very non-convex shapes when merged. See examples below:

[image: examples of merged groups forming highly non-convex shapes]

A possible metric for this behavior is the following:

  1. Calculate both a convex hull and a concave hull for: (a) the group alone, (b) the neighbor alone, (c) the group and neighbor together.
  2. Take the areas within those hulls. Call them convex_group, concave_group, convex_neighbor, concave_neighbor, convex_both, concave_both.
  3. Measure the area of whitespace within the convex hulls:
  • whitespace_both = area(convex_both) - area(concave_both)
  • whitespace_separate = area(convex_group) - area(concave_group) + area(convex_neighbor) - area(concave_neighbor)
  4. Take the increase in whitespace that results from merging the group with the neighbor: whitespace_both - whitespace_separate.
    Possibly, scale this statistic by the area of the group, i.e. (whitespace_both - whitespace_separate) / area(concave_group).

This would measure the area pointed to in the below:

[image: the whitespace area created by merging the group with its neighbor]

Convex hulls can be calculated with spatstat. Concave hulls with https://search.r-project.org/CRAN/refmans/concaveman/html/concaveman.html
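As a rough sketch, the metric above could be computed as follows (this assumes the concaveman package linked above with its default concavity; polygon_area is a plain shoelace helper written for this sketch):

```r
# Shoelace formula for the area of a polygon given its vertices in order
polygon_area <- function(xy) {
  x <- xy[, 1]; y <- xy[, 2]
  abs(sum(x * c(y[-1], y[1]) - c(x[-1], x[1]) * y)) / 2
}

# Areas of the convex and concave hulls of a point set
hull_areas <- function(pts) {
  convex  <- pts[grDevices::chull(pts), , drop = FALSE]
  concave <- concaveman::concaveman(pts)   # matrix of concave-hull vertices
  c(convex = polygon_area(convex), concave = polygon_area(concave))
}

# Increase in hull whitespace caused by merging group and neighbor,
# scaled by the group's own (concave) area; large values suggest the
# merged shape is non-convex, i.e. the pair should not be merged.
merge_whitespace <- function(group_xy, neighbor_xy) {
  g <- hull_areas(group_xy)
  n <- hull_areas(neighbor_xy)
  b <- hull_areas(rbind(group_xy, neighbor_xy))
  ws_both <- b["convex"] - b["concave"]
  ws_sep  <- (g["convex"] - g["concave"]) + (n["convex"] - n["concave"])
  unname((ws_both - ws_sep) / g["concave"])
}
```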

Workflows using only parts of the FastReseg pipeline

I can imagine 3 popular FastReseg workflows:

  1. Just score cells for doublet potential using the LRT, and stop there.
  2. Proceed to calling transcript groups and scoring them as good/bad fits within the cell.
  3. Proceed to reassigning transcript groups. (Complete resegmentation workflow)

So: the wrapper function should make it easy for people to indicate how far they want to progress through steps 1,2,3.

wrapper functions

I think it would help this get used if we had a wrapper function that made it simple to run. The main goals I have in mind are:

  • Run resegmentation separately in each FOV to speed things up (running too many FOVs at once made R slow to a crawl)
  • Make it easy to run the complete algorithm without a lot of preliminaries.

Here's one way to get there:

  • Function for compiling all the necessary data from a single FOV (input: slide ID and FOV ID)
  • Function for running the whole resegmentation procedure from the above data
  • Function to gather data and run segmentation for all the slides/FOVs in a study.

prepping FastReseg for handoff to software

We want to get this package to a point where we can hand it over to the software team to integrate into our pipelines. For that we'll need:

  1. Easy-to-understand, highly modular code
  2. Thorough documentation
  3. Confidence that it improves / doesn't hurt downstream analysis

1: architecture / modular code:

  • Move more code from the big wrapper into its subsidiary functions. See here for an example of a wrapper function that contains almost nothing but calls to subsidiary functions. Data checks should be moved within subsidiary functions when possible.
  • The ideal modular function performs a single task, and has relatively few inputs / outputs. A function can call as many subsidiary functions as needed.
  • I recommend starting by diagramming the flow of data through the pipeline. See the readme here for an example of such a diagram. This is useful both to help you organize your functions and to help readers understand the code.

2: documentation

  • Easy-to-follow vignettes will be key. A vignette should usually only have a couple dozen R commands; all the extra/supporting stuff would ideally go into functions.
  • A thorough README is essential. It should include installation instructions and an explanation of the main things the package does.

3: getting confidence it works:

You've done a lot here already. I think the best thing now is to put the package in people's hands and let them try it out. Note they'll only try it if it's well-documented and easy to use, so move on to this after 1 & 2 are done.

segment of missing genes in reference profiles

Genes not present in the reference profiles are currently invisible to the resegmentation algorithm. Two potential ways to put those missing genes back into the cell segments:

  • Assign transcripts of those missing genes to their nearest-neighbor cells based on a distance cutoff. Downside: computationally intensive.

  • Set the tLLR score for those missing genes to the maximum (0) for all cell types, so that they won't affect the linear regression or SVM analysis much. Downside: too many counts of missing genes might be misleading for the SVM and affect the transcript-number and tLLR cutoffs for resegmentation.
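A minimal sketch of the second option, assuming a genes x cell-types log-likelihood matrix like the one produced by scoreGenesInRef (the helper name pad_missing_genes and the gene/cell-type names in the example are hypothetical):

```r
# Pad the per-gene log-likelihood matrix so that genes absent from the
# reference profiles receive the maximum tLLR score (0) for every cell type.
pad_missing_genes <- function(loglik, all_genes) {
  missing <- setdiff(all_genes, rownames(loglik))
  if (length(missing) == 0) return(loglik)
  pad <- matrix(0, nrow = length(missing), ncol = ncol(loglik),
                dimnames = list(missing, colnames(loglik)))
  # reorder so rows follow the full gene list of the transcript data
  rbind(loglik, pad)[all_genes, , drop = FALSE]
}
```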

Remove dependencies on Giotto objects

Background

The TAP and the commercial DA will both use Seurat objects rather than Giotto objects. But the exact format of that Seurat object remains to be determined.

Takeaways:

  • FastReseg shouldn't ever rely on a Giotto object
  • It's not yet time to incorporate a Seurat object into FastReseg.
  • My sense is that FastReseg only uses the Giotto object for things like cell type and maybe cell type profiles. We can input this info to FastReseg as vectors and matrices.

Entry points to FastReseg

Using this issue to summarize the ways people will use fastreseg, now and in the future:

Current use:

  • Assume data is stored in the TAP file structure, load it up using prepSMI_for_fastReseg(), then run the FastReseg wrapper. This approach will only be used for ~1 year.
  • Assemble the necessary data by hand, then run the FastReseg wrapper (I expect this will almost never happen)

Future uses:

We are beginning to sketch out conventions for our data. This will take the form of a hdf5 file, with a well-specified format. Seurat objects can then be created that will point to that file.

So eventually, we'll need a function to prep data in that format for FastReseg. This will be the primary path, and it will replace the workflow from the TAP file structure.

Takeaways:

  • Don't invest too much in the function based on the TAP file structure.

suggested workflow and code organization

The schematic below shows how the code might be modularized. This is mostly how you have it organized already. I'm mostly trying to:

  • Make it easy to update when we get a new file structure
  • Keep the major function as simple as possible
  • Make everything as modular as possible

Blue boxes are data; orange boxes are functions.

[image: schematic of suggested code modularization]

speeding up slow steps

Capturing thoughts on how we might speed up the slow steps:

  1. Instead of a Delaunay network, would it be faster to calculate a radius-based network? The latter seems like it might be computationally simpler, and I see no reason why it wouldn't serve our purposes just as well.
  • Also, the grouptranscripts_delanuay function is very long and complicated. I bet using dbscan would simplify it dramatically.
  2. To break a list of flagged transcripts into groups, the dbscan algorithm might be best. It's blazingly fast, and it would produce very similar (identical?) groupings to building a Delaunay network with a maximal distance requirement.
  3. Using leiden clustering for resegmentation: apart from using a radius-based network, I don't see a way to speed this up without using an entirely different clustering algorithm.
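For point 2, dbscan::dbscan(xy, eps, minPts = 1) would return exactly the connected components of the eps-radius graph. A dependency-free sketch of essentially the same grouping, using single-linkage clustering cut at height eps:

```r
# Group flagged transcripts into spatially connected components: two
# transcripts share a group iff they are linked by a chain of points each
# within `eps` of the next (equivalent to dbscan with minPts = 1).
group_flagged <- function(xy, eps) {
  hc <- stats::hclust(stats::dist(xy), method = "single")
  stats::cutree(hc, h = eps)   # integer group label per transcript
}
```

For very large FOVs, dbscan's spatial indexing would scale far better than the full distance matrix used in this sketch.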

functions should take plain arguments, not configs

We probably don't need the config objects like we used in the TAP pipeline; it'd be simpler if our functions just took simple arguments.

For advanced settings, we could use a "config" or "control" argument used in many functions. For an example, see mclust::Mclust or umap::umap.
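A hypothetical sketch of that pattern (function and argument names below are illustrative placeholders, not the FastReseg API):

```r
# Advanced knobs live in a constructor with documented defaults,
# mirroring the control objects of mclust::Mclust or umap::umap.
reseg_control <- function(max_iter = 10, score_cutoff = -2, verbose = FALSE) {
  list(max_iter = max_iter, score_cutoff = score_cutoff, verbose = verbose)
}

# Common settings are plain arguments; everything advanced rides in `control`.
run_reseg <- function(transcripts, cell_ids, control = reseg_control()) {
  stopifnot(is.list(control))
  # ... resegmentation would happen here, reading advanced settings from
  # `control`; returning it just keeps this sketch self-contained
  control
}
```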

Some questions about FastReseg for CosMx data

Hi Nanostring team,

I came upon this repo while looking through your available modules. I am currently working with CosMx data and have opted to use Baysor at the moment for resegmentation. I was wondering if you have benchmarked this tool against other more widely known tools.

I guess one outstanding issue with the CosMx data at the moment is that there is no way to integrate the different IF stains as additional information to inform segmentation (e.g. CD68 markers for macrophages). Do you think that is something that could be integrated eventually?

Thank you!

prepSMI_for_fastReseg failure on slides with low number of FOVs

Sorry, I don't have permission to branch in this repo, so I have to submit issues instead.

FastReseg/R/preprocessing.R

Lines 222 to 224 in 10e6337

if ( length(exprs_files_list[[idx]])==0 ){
  stop("No target call files are found.")
}

I think this line needs to have the "[[idx]]" removed, which would become:

if ( length(exprs_files_list)==0 ){ stop("No target call files are found.") }

From the name of the object, I assume there was a transition from a list approach that held all slide FOVs in a single object to a for-loop approach, but the [[idx]] part was not removed from this line. On most datasets, when the number of FOVs is ~20 and the number of slides is <6, there is no issue because the number of FOVs is always larger than the number of slides. But in the MAYO dataset we have some slides with only 2 FOVs. That slide was the 9th slide in the dataset, and with only 2 FOVs the call exprs_files_list[[idx]] tried to find the 9th element in a vector of length 2, which errors out. I made the above switch and it worked fine.

create mask or polygons of cells for FastReseg outputs

A mask of cell labels and polygons of cell boundaries are useful for data visualization. After resegmentation, one gets a data.frame with the spatial coordinates of each transcript and its cell_ID assignment. We need a feature to create the cell mask and polygon image from that transcript data.frame.
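As a rough sketch of the polygon half, each cell's boundary could be approximated by the convex hull of its assigned transcripts (the column names x, y, cell_ID are assumed here; concave hulls or alpha shapes would trace irregular cells more faithfully):

```r
# Build one boundary polygon per cell from a resegmented transcript
# data.frame: split coordinates by cell_ID and take each convex hull.
cell_polygons <- function(df) {
  lapply(split(df[, c("x", "y")], df$cell_ID), function(xy) {
    xy <- as.matrix(xy)
    xy[grDevices::chull(xy), , drop = FALSE]   # hull vertices, in order
  })
}
```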

likelihood calculation tLLRv2

@lidanwu @davidpross @patrickjdanaher

Wondering about a few details in FastReseg.
It seems like the goal is to get an n_cells x n_clusters matrix of cell-level log-likelihoods:

tLLRv2_cellMatrix <- counts %*% tLLRv2_geneMatrix

But the scoreGenesInRef function seems to calculate the log of the mean profile, instead of a log-likelihood, in producing tLLRv2_geneMatrix:

libsize <- colSums(ref_profiles, na.rm = TRUE)
norm_exprs <- Matrix::t(Matrix::t(ref_profiles) / libsize)
loglik <- log(norm_exprs)
return(loglik)

I think with the above formula, discrepancies between the original nb_clust-derived cell types and tLLRv2_maxCellType could potentially occur for a few reasons:

  1. likelihood is not additive by count. i.e.,
    loglik(count=2, mu) is not equal to 2*loglik(count=1, mu)

  2. counts is not normalized by library size like tLLRv2_geneMatrix is.

  3. background not modeled.
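Point 1 is easy to verify numerically (the mu and size values here are arbitrary):

```r
# The NB log-likelihood of observing a count of 2 is not twice the
# log-likelihood of observing a count of 1 under the same mean, so a
# matrix product of counts with a log-profile is not a log-likelihood.
mu <- 0.5
size <- 10
ll2 <- stats::dnbinom(2, size = size, mu = mu, log = TRUE)
ll1 <- stats::dnbinom(1, size = size, mu = mu, log = TRUE)
ll2 - 2 * ll1   # nonzero
```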

If you assume the negative binomial model, maybe the approach below would work.
At the least, I wonder whether there should be anything but minimal discrepancy between the original cell_type labels from nb_clust and the cell types from tLLRv2_maxCellType, if the only major difference is background.

(similar to nbclust (R/nbclust.R); I think Dave R. may have an optimized implementation)

tLLRv2_cellMatrix_v2 <- apply(refProfiles, 2, function(ref) {
  libsize <- Matrix::rowSums(counts)
  ## yhat is an n_cells x n_genes matrix of expected values
  yhat <- matrix(libsize, ncol = 1) %*% matrix(ref, nrow = 1)
  lls <- stats::dnbinom(x = as.matrix(counts),
                        size = size,
                        mu = yhat,
                        log = TRUE)
  rowSums(lls)
})
