IKAP - Identifying K mAjor cell Population groups in single-cell RNA-seq analysis

License: MIT License

Article:
IKAP - Identifying K mAjor cell Population groups in single-cell RNA-seq analysis
Yun-Ching Chen, Abhilash Suresh, Chingiz Underbayev, Clare Sun, Komudi Singh, Fayaz Seifuddin, Adrian Wiestner, Mehdi Pirooznia. https://academic.oup.com/gigascience/article/8/10/giz121/5579995

* Note: for Seurat 3, please see the Seurat3_code folder.

Installation

Please install the following R libraries before installing IKAP:
Seurat, dplyr, reshape2, PRROC, WriteXLS, rpart, stringr, and rpart.plot

IKAP installation:

First, install the devtools package from CRAN. Start R and run:
```
install.packages("devtools")
```
Load the devtools package.
```
library(devtools)
```

Install IKAP:
```
devtools::install_github("NHLBI-BCB/IKAP")
```

The main function, IKAP, takes a Seurat object containing the normalized expression matrix; all other parameters fall back to their default values if not specified. IKAP explores sets of cell groups (clusterings) by varying the resolution (r) and the number of top principal components (nPC) used for Seurat SNN clustering. From all explored sets it picks a few candidate sets and marks one as the best, i.e. the set most likely to produce distinguishing marker genes.

Note: By default, IKAP regresses out the percentage of mitochondrial gene counts and the total UMI counts, and scales the expression matrix using the Seurat ScaleData function. These two values should be saved in the Seurat metadata under the column names 'percent.mito' and 'nUMI', respectively. To regress out different confounding variables or to use different column names, save these variables in the Seurat metadata and set 'confounders' (an IKAP parameter) to their column names in the Seurat metadata data frame.
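
Before calling IKAP, it can help to confirm that the confounder columns actually exist in the metadata, since columns that are not found are skipped for regression with only a warning. The following is a minimal sketch in base R that uses a plain data frame as a stand-in for the Seurat metadata; `meta`, its values, and `check_confounders` are hypothetical names for illustration and are not part of IKAP.

```r
# Stand-in for the Seurat metadata data frame (hypothetical values).
meta <- data.frame(
  nUMI = c(1200, 3400, 2100),
  percent.mt = c(0.02, 0.05, 0.03),
  row.names = c("cell1", "cell2", "cell3")
)

# Split the requested confounders into those present in the metadata
# and those missing from it.
check_confounders <- function(meta, confounders = c("percent.mito", "nUMI")) {
  list(
    present = intersect(confounders, colnames(meta)),
    missing = setdiff(confounders, colnames(meta))
  )
}

res <- check_confounders(meta)
print(res$missing)  # columns that would not be regressed out

# If the columns use different names, they can be passed explicitly, e.g.
# Seurat_obj <- IKAP(Seurat_obj, out.dir = "./IKAP",
#                    confounders = c("percent.mt", "nUMI"))
```

Here the mitochondrial percentage is stored as 'percent.mt' rather than 'percent.mito', so the check reports 'percent.mito' as missing.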

IKAP Workflow

Usage:

```
Seurat_obj <- IKAP(Seurat_obj, out.dir = "./IKAP")
```

Returned data and output files (saved in the output directory, default = ./IKAP/):

Seurat object: IKAP returns a Seurat object with all explored sets in the metadata data frame.

PC_K.pdf:

The heatmap shows the statistics for every combination of r and nPC explored. Candidate sets are marked as 'X' with the best marked as 'B'. The corresponding cell membership can be found in the metadata of the returned Seurat object with column name 'PC?K?'. For example, if 'B' (the best set) is marked at nPC = 20 and k = 8, the corresponding cell membership is stored in column 'PC20K8' in the metadata.
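
As a rough illustration of the naming rule above, the sketch below builds the column name from nPC and k and copies that membership into a new column. It uses a plain data frame in place of the Seurat metadata (for a Seurat object the metadata data frame is typically reached via `Seurat_obj@meta.data`); the cell names, the memberships, the helper `set_column`, and the new column name `cluster` are all hypothetical.

```r
# Stand-in for the metadata of the Seurat object returned by IKAP
# (hypothetical cells and memberships).
meta <- data.frame(
  PC20K8 = c(1, 3, 3, 8),
  PC15K5 = c(2, 2, 5, 1),
  row.names = paste0("cell", 1:4)
)

# Build the metadata column name from the nPC and k read off PC_K.pdf.
set_column <- function(nPC, k) paste0("PC", nPC, "K", k)

best <- set_column(nPC = 20, k = 8)

# Copy the chosen membership into a separate column as a factor.
meta$cluster <- factor(meta[[best]])
print(table(meta$cluster))
```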

data.xls and markers.all.rds:

The first sheet of data.xls holds the statistics (plotted in PC_K.pdf) used to determine the candidate sets. The other sheets list the (upregulated) marker genes for the candidate sets. The R object markers.all.rds contains a data frame of marker genes for every candidate set.

*.png:

Heatmaps show the expression of the top 10 marker genes (ranked by expression fold change) from each cell group for the candidate sets. They are plotted using the Seurat DoHeatmap function.
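
The top-10 selection can be sketched in base R as follows. This is an illustration only, not IKAP's own code: the marker table is simulated, its columns (`cluster`, `gene`, `avg_logFC`) are assumed to resemble Seurat::FindAllMarkers output, and `top_markers` is a hypothetical helper. Seurat 4 and later name the fold-change column `avg_log2FC` instead of `avg_logFC`, so the sketch looks for either name.

```r
# Hypothetical marker table shaped like Seurat::FindAllMarkers output.
set.seed(1)
markers <- data.frame(
  cluster = rep(c("1", "2"), each = 15),
  gene = paste0("gene", 1:30),
  avg_logFC = runif(30, 0.25, 3)
)

# Use whichever fold-change column is present (avg_log2FC in newer Seurat).
fc_col <- intersect(c("avg_log2FC", "avg_logFC"), colnames(markers))[1]

# Keep the n rows with the largest fold change within each cluster.
top_markers <- function(markers, fc_col, n = 10) {
  per_cluster <- lapply(split(markers, markers$cluster), function(df) {
    head(df[order(df[[fc_col]], decreasing = TRUE), ], n)
  })
  do.call(rbind, per_cluster)
}

top10 <- top_markers(markers, fc_col)
print(table(top10$cluster))
```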

DT_plot.pdf, DT_summary.rds, and DT.rds:

Decision tree output files. A decision tree is built using marker genes for every cell group in every candidate set using R package rpart. All decision trees are plotted in DT_plot.pdf. Classification errors are summarized in the R object DT_summary.rds. DT.rds is the output object from rpart.
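
To give a feel for what a per-group decision tree and its classification error look like, here is a minimal sketch with rpart. It is not IKAP's implementation: the built-in iris data stand in for an expression matrix, the one-group-versus-rest setup is an assumption, and the error is measured on the training data itself.

```r
library(rpart)

# iris stands in for an expression matrix: rows play the role of cells,
# the numeric columns play the role of marker genes, and Species plays
# the role of the cell group.
data(iris)
d <- iris[, 1:4]
d$in_group <- factor(iris$Species == "versicolor")  # one group vs. the rest

# Fit a classification tree and measure its error on the training data.
fit <- rpart(in_group ~ ., data = d, method = "class")
pred <- predict(fit, newdata = d, type = "class")
err <- mean(pred != d$in_group)
print(err)
```

In a tree like this, each internal node is a threshold on one feature (a marker gene in the IKAP setting) and each leaf assigns cells to the group or to the rest, so a low error indicates that a few features suffice to separate the group.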

*tSNE.pdf:

tSNE plots for candidate sets.

Functions in the R script:

IKAP: The main function runs the following steps:
- (1) regress out confounding variables and scale data using Seurat::ScaleData;
- (2) find variable genes for principal component analysis (PCA) using Seurat::FindVariableGenes;
- (3) perform PCA using Seurat::RunPCA;
- (4) estimate k.max;
- (5) explore ranges of k and nPC and compute gap statistics;
  - GapStatistic, ObservedLogW, and ExpectedLogW:
    Compute gap statistics given a data matrix (used for computing Euclidean distances between data points) and K sets of clusters with k = 1 … K. GapStatistic calls ObservedLogW and ExpectedLogW to compute the sum of within-group distances for the observed data and for random data, respectively.
  - BottomUpMerge and NearestCluster: Generate sets of cell groups by exploring ranges of k and nPC. BottomUpMerge finds k.max groups using Seurat::FindClusters and then repeatedly merges the two nearest clusters, as measured by NearestCluster.
- (6) select candidate sets;
  - SelectCandidate:
    Select candidate sets based on gap statistics.
- (7) compute marker genes using Seurat::FindAllMarkers;
  - ComputeMarkers:
    Compute marker genes for all cell groups in all candidate sets using Seurat::FindAllMarkers. In addition, compute the area under the ROC curve (AUROC) for each marker gene using the R package PRROC. Plot marker gene heatmap(s) using Seurat::DoHeatmap.
- (8) build decision trees;
  - DecisionTree:
    Build decision trees for all cell groups in all candidate sets using the R package rpart and compute the classification error for each candidate set.
- (9) plot tSNE plots and PC_K.pdf.
  - PlotSummary:
    Mark the best set based on classification error and plot PC_K.pdf.
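
The gap statistic used in step (5) compares the observed within-group dispersion with the dispersion expected for random data. The sketch below illustrates the general idea in base R under several assumptions that differ from IKAP: it clusters with stats::kmeans instead of Seurat SNN clustering with bottom-up merging, it uses the within-cluster sum of squares as the dispersion measure, and it draws reference data uniformly from the bounding box of the observed data. `log_w` and `gap_statistic` are hypothetical helper names.

```r
set.seed(42)

# Toy data: two well-separated groups of 30 points in two dimensions.
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# Log of the pooled within-cluster sum of squares for k clusters.
log_w <- function(data, k) {
  log(kmeans(data, centers = k, nstart = 10)$tot.withinss)
}

# Gap(k) = mean over reference sets of log(W*_k) minus the observed log(W_k).
gap_statistic <- function(data, k.max = 3, n.ref = 20) {
  ranges <- apply(data, 2, range)  # 2 x d matrix: min and max per column
  sapply(seq_len(k.max), function(k) {
    expected <- replicate(n.ref, {
      # Reference data drawn uniformly within the observed ranges.
      ref <- apply(ranges, 2, function(r) runif(nrow(data), r[1], r[2]))
      log_w(ref, k)
    })
    mean(expected) - log_w(data, k)
  })
}

gap <- gap_statistic(x)
print(round(gap, 2))
```

For data with two clear groups, the gap at k = 2 should exceed the gap at k = 1, which is the kind of gain that candidate selection looks for.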

License

MIT license: https://opensource.org/licenses/MIT

Contact

If you have any question, please contact: [email protected]

ikap's Issues

IKAP Error: could not find function "FindVariableGenes"

Hi @NHLBI-BCB
I am trying to use IKAP on a merged Seurat object, but I get a weird error saying that the function "FindVariableGenes" cannot be found. I also checked ?FindVariableGenes(), which works perfectly fine.

Error in FindVariableGenes(sobj, mean.function = ExpMean, dispersion.function = LogVMR,  : 
  could not find function "FindVariableGenes"
In addition: Warning message:
In IKAP(inhouse_wt_norm_log_var_feat.integrated, out.dir = "./IKAP") :
  nUMIpercent.mitonot in Seurat metadata: skipped for regression.

Has anyone encountered this issue before, and can you advise how to resolve it?

Thanks

Extract the best nPC and K and add it to the metadata

Dear IKAP team,
Thanks for the incredible package. I applied IKAP to my scRNA-seq data and got the results, but I don't know how to extract the best set and add it to my metadata, or add it as a cluster column in my Seurat object.
Any solution or recommendation would be appreciated.

Assay used in FindAllMarkers

Hi,
If the default assay is either integrated or SCT, should assay="RNA" be set in the "FindAllMarkers" function here so that it is not applied to the SCT or integrated assay?

Thank you,
J

Error: Problem with `filter()` input `..1`. x object 'avg_logFC' not found

This is using IKAP_Seurat3.R, on a PBMC 5K dataset.

SCT or SCT-integrated data is used. Skip data scaling.
SCT or integrated data is used. Skip data variable feature finding.
Running PCA ... 
PC_ 1 
Positive:  LYZ, CST3, FCN1, S100A9, CTSS, S100A8, FTL, MNDA, PSAP, TYROBP 
	   AIF1, VCAN, HLA-DRA, NEAT1, HLA-DRB1, FCER1G, CD14, S100A12, LST1, FTH1 
	   SERPINA1, FGL2, TYMP, KLF4, IL1B, S100A6, GRN, CD68, MS4A6A, CD74 
Negative:  LTB, IL32, IL7R, RPS27, TRAC, TRBC1, CCL5, TRBC2, RPS12, GZMA 
	   NKG7, CD3D, CD3E, CD69, MALAT1, KLRB1, CST7, CD247, RPS29, CTSW 
	   CD3G, CD7, RPL13, RPS3, RPS27A, RPL10, RPL3, EEF1A1, LDHB, RPS18 
PC_ 2 
Positive:  CD74, HLA-DRA, CD79A, HLA-DQA1, MS4A1, HLA-DRB1, IGHM, BANK1, CD79B, HLA-DPA1 
	   IGKC, HLA-DPB1, LTB, HLA-DQB1, LINC00926, TNFRSF13C, RALGPS2, IGHD, VPREB3, CD37 
	   SPIB, CD22, RPS12, EEF1A1, RPS27, TCF4, FAM129C, RPL13, IGHA1, BLK 
Negative:  NKG7, GZMA, CST7, CCL5, GNLY, PRF1, KLRD1, FGFBP2, GZMH, CTSW 
	   GZMB, FCGR3A, KLRF1, TRDC, SPON2, CD247, HOPX, ADGRG1, CCL4, CLIC3 
	   KLRB1, GZMM, TTC38, MATK, HCST, KLRC2, TYROBP, TBX21, IL2RB, EFHD2 
PC_ 3 
Positive:  CD74, HLA-DRA, HLA-DRB1, HLA-DPB1, HLA-DQA1, HLA-DPA1, CD79A, MS4A1, IGHM, CD79B 
	   BANK1, IGKC, HLA-DQB1, NKG7, GZMB, GNLY, LINC00926, FCGR3A, FGFBP2, PRF1 
	   KLRD1, CST7, TNFRSF13C, IGHD, GZMH, RALGPS2, GZMA, CD37, SPIB, VPREB3 
Negative:  IL7R, TRAC, TPT1, LDHB, RPS12, CD3E, TCF7, TRBC1, S100A8, TRBC2 
	   LEF1, RPL13, EEF1A1, MAL, IL32, TRABD2A, S100A9, S100A12, CD3G, VCAN 
	   CD3D, AQP3, NOSIP, RPL32, RCAN3, RPS14, LRRN3, RPL34, CCR7, BCL11B 
PC_ 4 
Positive:  S100A12, S100A8, VCAN, S100A9, CD14, MNDA, NCF1, CD79A, CYP1B1, MS4A1 
	   IGHM, CSF3R, VNN2, RBP7, APLP2, PLBD1, RGS2, PADI4, CTSD, BANK1 
	   CD36, MEGF9, CES1, IGKC, BST1, S100A6, TALDO1, QPCT, CDA, LINC00926 
Negative:  HLA-DPA1, HLA-DPB1, FCGR3A, CSF1R, RHOC, COTL1, IFITM3, TCF7L2, CDKN1C, LST1 
	   AIF1, HLA-DRB1, C1QA, HLA-DQB1, MS4A7, SMIM25, FCER1G, CLEC10A, HES4, ABI3 
	   CTSL, CD74, CST3, MAFB, SIGLEC10, CPVL, WARS, LRRC25, HMOX1, CAMK1 
PC_ 5 
Positive:  FCER1A, IL1B, CLEC10A, PLD4, ATF3, CD74, GSN, GAS6, CCDC88A, HLA-DMA 
	   CST3, RNASE6, DNASE1L3, PPP1R14B, ID2, ITM2C, IL3RA, RGS1, C12orf75, RUNX2 
	   MS4A6A, CCDC50, HLA-DRB1, APP, HLA-DQB1, IER3, SERPINF1, UGCG, EGR1, HLA-DPB1 
Negative:  FCGR3A, SMIM25, FTL, CD79B, MS4A7, CD79A, CTSS, CDKN1C, MS4A1, SERPINA1 
	   IFITM3, TCF7L2, AIF1, LST1, CFD, HES4, BANK1, SIGLEC10, LINC00926, CEBPB 
	   MTSS1, CTSL, C5AR1, HMOX1, CD37, LILRB2, POU2F2, TNFRSF13C, LRRC25, LYN 
Determine k.max.
k.max = 19 
Perform clustering for every nPC:
Iteration for nPC = 10 , r = 1.0, 1.2, 1.4
Iteration for nPC = 11 , r = 1.0, 1.2, 1.4
Iteration for nPC = 12 , r = 1.0, 1.2, 1.4
Iteration for nPC = 13 , r = 1.0, 1.2
Iteration for nPC = 14 , r = 1.0, 1.2, 1.4
Iteration for nPC = 15 , r = 1.0, 1.2, 1.4
Iteration for nPC = 16 , r = 1.0, 1.2, 1.4
Iteration for nPC = 17 , r = 1.0, 1.2, 1.4
Iteration for nPC = 18 , r = 1.0, 1.2, 1.4
Iteration for nPC = 19 , r = 1.0, 1.2, 1.4, 1.6
Iteration for nPC = 20 , r = 1.0, 1.2, 1.4, 1.6, 1.8
Iteration for nPC = 21 , r = 1.0, 1.2, 1.4, 1.6, 1.8
Iteration for nPC = 22 , r = 1.0, 1.2, 1.4, 1.6, 1.8
Iteration for nPC = 23 , r = 1.0, 1.2, 1.4, 1.6, 1.8
Iteration for nPC = 24 , r = 1.0, 1.2, 1.4, 1.6
Iteration for nPC = 25 , r = 1.0, 1.2, 1.4, 1.6
Iteration for nPC = 26 , r = 1.0, 1.2, 1.4
Iteration for nPC = 27 , r = 1.0, 1.2, 1.4
Iteration for nPC = 28 , r = 1.0, 1.2, 1.4
Iteration for nPC = 29 , r = 1.0, 1.2, 1.4, 1.6
Iteration for nPC = 30 , r = 1.0, 1.2, 1.4
Compute marker gene lists ... 
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
Saving 6.11 x 4.43 in image
Calculating cluster 1
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=14s  
Calculating cluster 3
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s  
Calculating cluster 2
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=14s  
Error: Problem with `filter()` input `..1`.
x object 'avg_logFC' not found
ℹ Input `..1` is `top_n_rank(10, avg_logFC)`.
ℹ The error occurred in group 1: cluster = "1".
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In dir.create(out.dir, recursive = T) :
 Error: Problem with `filter()` input `..1`.
x object 'avg_logFC' not found
ℹ Input `..1` is `top_n_rank(10, avg_logFC)`.
ℹ The error occurred in group 1: cluster = "1".
Run `rlang::last_error()` to see where the error occurred. 
19. stop(fallback)
18. signal_abort(cnd)
17. abort(c(cnd_bullet_header(), x = conditionMessage(e), i = cnd_bullet_input_info(),
    i = cnd_bullet_cur_group_label()), class = "dplyr_error")
16. h(simpleError(msg, call))
15. .handleSimpleError(function (e)
{
    local_call_step(dots = dots, .index = env_filter$current_expression,
        .fn = "filter") ...
14. ~avg_logFC
13. desc(wt)
12. rank(x, ties.method = "min", na.last = "keep")
11. min_rank(desc(wt))
10. top_n_rank(~10, ~avg_logFC)
9. mask$eval_all_filter(dots, env_filter)
8. withCallingHandlers(mask$eval_all_filter(dots, env_filter), error = function(e) {
    local_call_step(dots = dots, .index = env_filter$current_expression,
        .fn = "filter")
    abort(c(cnd_bullet_header(), x = conditionMessage(e), i = cnd_bullet_input_info(),  ...
7. filter_rows(.data, ...)
6. filter.data.frame(x, top_n_rank({
    {
        n
    } ...
5. filter(x, top_n_rank({
    {
        n
    } ...
4. top_n(., 10, avg_logFC)
3. sobj.markers %>% group_by(cluster) %>% top_n(10, avg_logFC) at IKAP_Seurat3.R#165
2. ComputeMarkers(sobj, gap.gain, candidates, out.dir) at IKAP_Seurat3.R#342
1. IKAP(counts_seurat, out.dir = "./IKAP")

Adjust pc.range in IKAP for SCTransform()-processed Seurat objects?

Dear @NHLBI-BCB and @xizhihui

Thank you for developing the package!

I have processed my data using SCTransform() (which replaces NormalizeData(), ScaleData() and FindVariableFeatures()) and have performed RunPCA, RunUMAP, FindNeighbors() and FindClusters().

SCTransform() allows using 30 PCs. Should I set pc.range = 30, or just use the default value of 20 and let IKAP() decide? I am confused because, even though the IKAP package documentation says pc.range = 20 is the default, I still see iterations running for nPC greater than 20.

Also, would you mind advising me how to obtain the resolution of a recommended (PC,k) combination at the end?

Thank you very much for your help!

Allow running IKAP on other reductions, such as Harmony

I wonder if it is possible to adapt IKAP to run on reductions other than PCA (such as Harmony for integrated data).

Running IKAP on fully processed Seurat project

Hello guys,
Your tool is great. I would like to ask one question: I have applied IKAP to a Seurat object that was fully processed and filtered. By filtered, I mean that I removed empty droplets and doublets. Is it OK to apply IKAP to this final Seurat object? I tried running it and it worked well, but I want to make sure I am using the tool correctly.

Thank you very much indeed!

scanpy support

Dear,
Are you planning to support scanpy (another single-cell data analysis tool, similar to Seurat)?

Parallelisation

Hi, thanks for developing IKAP, which seems to be a useful tool. I am running it on a large Seurat object and am wondering if there are ways to make it faster (e.g. parallelisation). Any hint on how to do this would be really appreciated!

Cannot install IKAP

Hi, I am struggling to install IKAP. I get an error saying that lazy loading failed:

```
Installing package into ‘/Users/thoa0003/Library/R/4.0/library’
(as ‘lib’ is unspecified)

installing source package ‘IKAP’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
sh: line 1: 5292 Killed: 9 R_TESTS= '/Library/Frameworks/R.framework/Resources/bin/R' --no-save --no-restore --no-echo 2>&1 < '/var/folders/69/67vbnhlj22zfy4dyt7pq6rjrs4rzdf/T//RtmptHT1Fq/file14a5161c5baa'
ERROR: lazy loading failed for package ‘IKAP’
removing ‘/Users/thoa0003/Library/R/4.0/library/IKAP’
Error: Failed to install 'IKAP' from GitHub:
(converted from warning) installation of package ‘/var/folders/69/67vbnhlj22zfy4dyt7pq6rjrs4rzdf/T//RtmpX3ZxLn/file13bd28fc8600/IKAP_0.0.0.9000.tar.gz’ had non-zero exit status
```

Thank you in advance for your help.

Understanding IKAP output

Hi,

First of all, thanks for your useful tool!
I have run your software on my single-nuclei RNA-seq data and it seems to work fine, apart from some warnings about visualization problems for some features, but that's OK.
I was wondering how to interpret the figure produced in the file DT_plot.pdf.

Could you explain how to interpret this figure, for example?
Thanks in advance