
License: GNU General Public License v3.0



scibetR

A pure R version of scibet, a portable and fast single-cell type identifier. Because it is written entirely in R, it runs slower than the original functions in scibet.

Installation Guide

Installing scibetR
To install scibetR, run:

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("zwj-tina/scibetR")

Tutorial

Library

library(ggplot2)
library(tidyverse)
library(scibetR)
library(viridis)
library(ggsci)

Load the data

In the expression matrix (TPM), rows should be cells and columns should be genes, except for the last column, which should be the "label" of each cell.

path_da <- "~/test.rds.gz"
expr <- readr::read_rds(path_da)
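For readers without a TPM file at hand, the layout described above can be mocked up in base R. This is only a minimal sketch: the gene names, cell-type labels, and values are invented for illustration.

```r
# Toy expression table in the layout described above:
# rows = cells, columns = genes, last column = "label".
# Gene names, labels, and values are made up for illustration.
set.seed(1)
n_cells <- 6
toy_expr <- data.frame(
  GeneA = runif(n_cells, 0, 100),
  GeneB = runif(n_cells, 0, 100),
  GeneC = runif(n_cells, 0, 100),
  label = rep(c("T cell", "B cell"), each = 3)
)
rownames(toy_expr) <- paste0("cell", seq_len(n_cells))

# Quick layout checks before passing the table on.
stopifnot(names(toy_expr)[ncol(toy_expr)] == "label")
stopifnot(all(vapply(toy_expr[, -ncol(toy_expr)], is.numeric, logical(1))))
```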

E(ntropy)-test for supervised gene selection

etest_gene <- SelectGene_R(expr, k = 50)
etest_gene
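The statistic behind this selection can be sketched in base R. The version below is inferred from the SelectGene_R source quoted in the sparse-matrix issue on this page: for each gene, the score is the log2 of the average per-label mean expression minus the average of the log2 per-label means. `toy_etest` and the toy data are hypothetical and are not the package's implementation.

```r
# Sketch of the E-test score (assumed from the SelectGene_R source quoted in
# the issues): score = log2(mean over labels) - mean over labels of log2.
# `toy_etest` is a hypothetical helper, not part of scibetR.
toy_etest <- function(expr, k = 2) {
  genes  <- setdiff(colnames(expr), "label")
  # genes x labels matrix of per-label mean expression
  means  <- sapply(split(expr[, genes, drop = FALSE], expr$label), colMeans)
  scores <- log2(rowMeans(means + 1)) - rowMeans(log2(means + 1))
  names(sort(scores, decreasing = TRUE))[seq_len(min(k, length(scores)))]
}

toy <- data.frame(
  Marker = c(100, 120, 0, 1),    # differs strongly between labels
  House  = c(50, 52, 49, 51),    # roughly constant across labels
  label  = c("A", "A", "B", "B")
)
toy_etest(toy, k = 1)
```

Genes whose mean expression differs between labels get a large score, while genes with similar expression in every label score close to zero.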

scibetR: Single Cell Identifier Based on Entropy Test

For the reference set (TPM), rows should be cells, columns should be genes, and the last column should be "label".
For the query set (TPM), rows should be cells and columns should be genes. Example:

tibble(
  ID = 1:nrow(expr),
  label = expr$label
) %>%
  dplyr::sample_frac(0.7) %>%
  dplyr::pull(ID) -> ID

train_set <- expr[ID,]      #construct reference set
test_set <- expr[-ID,]      #construct query set

prd <- SciBet_R(train_set, test_set[,-ncol(test_set)])
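Once predictions are available, they can be compared with the held-out labels. The sketch below assumes that `prd` is a character vector with one predicted label per query cell (the shape of the default `result = "list"` output is assumed here, not confirmed), and uses toy vectors in place of real output.

```r
# Hypothetical stand-ins for real output:
# `truth` would be test_set$label, `pred` would be the vector returned by SciBet_R.
truth <- c("T cell", "T cell", "B cell", "NK cell", "B cell")
pred  <- c("T cell", "B cell", "B cell", "NK cell", "B cell")

accuracy  <- mean(pred == truth)                     # fraction of correct calls
confusion <- table(truth = truth, predicted = pred)  # per-type breakdown

accuracy
confusion
```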

False positive control

Because reference scRNA-seq datasets are never complete, cells whose type is absent from the reference may be falsely assigned to a known cell type. By using a null dataset as background, SciBet controls these potential false positives while maintaining high prediction accuracy for cells whose types are covered by the reference dataset (positive cells). This example uses the following three datasets.

null <- readr::read_rds('~/null.rds.gz')
reference <- readr::read_rds('~/reference.rds.gz')
query <- readr::read_rds('~/query.rds.gz')

In the query set, “negative cells” account for more than 60% of the cells.

ori_label <- query$label
table(ori_label)

The confidence score of each query cell is calculated with the function conf_score_R.

query <- query[,-ncol(query)]
c_score <- conf_score_R(ref = reference, query = query, null_expr = null, gene_num = 500)
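One way to use the scores is to keep only confident predictions and mark the rest as unassigned. The sketch below assumes that `c_score` is a numeric vector with one value per query cell. The cutoff of 0.4 and the toy values are illustrative assumptions, not values prescribed by the package.

```r
# Toy confidence scores and predictions; in practice these would come from
# conf_score_R() and SciBet_R(). The cutoff is an illustrative assumption.
c_score_toy <- c(0.91, 0.12, 0.55, 0.30)
pred_toy    <- c("T cell", "B cell", "NK cell", "T cell")
cutoff      <- 0.4

# Keep confident calls, mark the rest as unassigned.
final_label <- ifelse(c_score_toy >= cutoff, pred_toy, "unassigned")
final_label
```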

Entropy calculation

Compute expression entropy. Arguments:
expr: the expression data frame. Rows should be cells and columns should be genes.
window: the window size for expression value discretization.
low: the lower limit for normalizing expression entropy.

ent_res <- Entropy_R(expr,window=120,low=2000)

Returns:

# A tibble: 11,516 x 5
   gene  mean.expr entropy    fit norm_ent
   <chr>     <dbl>   <dbl>  <dbl>    <dbl>
 1 A2M       63.2    1.26  0.191    0.248 
 2 AAAS      73.9    1.03  0.210    0.204 
 3 AACS       8.73   0.412 0.0419   0.0813
 4 AAED1     37.8    0.631 0.136    0.124 
 5 AAGAB     65.7    1.18  0.196    0.233 
 6 AAK1      13.8    0.683 0.0622   0.135 
 7 AAMDC     13.1    0.414 0.0596   0.0817
 8 AAMP     159.     1.63  0.316    0.322 
 9 AAR2      49.3    0.952 0.163    0.188 
10 AARS      39.6    0.922 0.141    0.182 
# … with 11,506 more rows
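A typical follow-up is to rank genes by `norm_ent`. The sketch below copies three rows from the output above into a plain data frame. Whether high or low normalized entropy is of interest depends on the analysis and is not specified here.

```r
# A few rows copied from the output above, as a plain data frame.
ent_toy <- data.frame(
  gene     = c("A2M", "AACS", "AAMP"),
  norm_ent = c(0.248, 0.0813, 0.322)
)

# Order genes from highest to lowest normalized entropy.
ranked <- ent_toy[order(ent_toy$norm_ent, decreasing = TRUE), ]
ranked$gene
```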

LoadModel

x: a SciBet model, given as a matrix, that serves as the trained reference. To facilitate matrix multiplication, its rows are genes and its columns are labels.

y <- LoadModel_R(x, genes = NULL, labels = NULL)

It returns a function of the form:

Bet_R <- function(expr, result="list"){
}
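The model layout and the closure pattern can be illustrated with a toy example. In the sketch below, `toy_load_model` is a hypothetical stand-in for the returned function. Its scoring step (a matrix product of log2-transformed expression with the model, followed by an argmax) is an assumption made for illustration and is not the package's Gambler_R routine. The model values are invented.

```r
# Toy model: rows = genes, columns = labels (values are made-up log2 probabilities).
model <- matrix(
  log2(c(0.7, 0.2, 0.1,    # label "A": GeneA, GeneB, GeneC
         0.1, 0.2, 0.7)),  # label "B"
  nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), c("A", "B"))
)

# Hypothetical stand-in for the closure returned by LoadModel_R.
# The scoring step (matrix product + argmax) is an assumption, not Gambler_R itself.
toy_load_model <- function(x, genes = rownames(x), labels = colnames(x)) {
  function(expr) {
    have_genes <- intersect(genes, colnames(expr))
    expra  <- log2(as.matrix(expr[, have_genes, drop = FALSE]) + 1)
    scores <- expra %*% x[have_genes, , drop = FALSE]  # cells x labels
    labels[max.col(scores, ties.method = "first")]
  }
}

query_toy <- data.frame(GeneA = c(90, 1), GeneB = c(10, 10), GeneC = c(2, 80),
                        row.names = c("cell1", "cell2"))
toy_load_model(model)(query_toy)
```

Note the orientation: the query has cells as rows and genes as columns, while the model has genes as rows and labels as columns, so the product is a cells-by-labels score matrix.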

Correction

Add %>% at line 312 of the Marker_heatmap() function.


scibetr's Issues

scibetR error on Windows: could not find function "np.exp2"

Hi Wenjie,
I ran into an issue when running scibetR:

prd <- SciBet_R(train_set, test_set[,-ncol(test_set)])
Error in np.exp2(total - np.max(total)) :
could not find function "np.exp2"

Could you please help me figure out the reason?
Thanks a lot!
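For context, `np.exp2` and `np.max` are NumPy (Python) function names and are not defined in base R, which is consistent with the error above. A possible workaround is to express the same computation in base R. This is offered as an assumption about what the call is meant to compute, not as the maintainer's fix. `exp2_normalize` is a hypothetical helper name, and the final normalization step is an added assumption.

```r
# Base-R equivalent of the NumPy-style expression exp2(total - max(total)),
# followed by normalization so the values sum to 1 (normalization is assumed).
exp2_normalize <- function(total) {
  shifted <- 2^(total - max(total))  # subtracting the max avoids underflow to all zeros
  shifted / sum(shifted)
}

exp2_normalize(c(-1000, -1001, -1002))
```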

Error in out[1:k, ] : incorrect number of dimensions

Hi, I ran into an error when using the package.
expr = as.data.frame(t(counts(ref))) # the reference dataset; rows are cells and columns are genes
expr = expr %>%
  mutate(group_ref) # add the label column
que = as.data.frame(t(counts(sce))) # the query dataset; rows are cells and columns are genes
prd <- SciBet_R(expr, que, k=100)

Error in out[1:k, ] : incorrect number of dimensions

I don't know what's wrong with my code, could you please help me solve the issue?
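One plausible cause, not confirmed by the maintainer: this is the message R gives when a two-index subset is applied to a plain vector. In the SelectGene_R source quoted elsewhere on this page, `out` is built with cbind and then reordered with `out[order(...), ]`. If that matrix ends up with a single column, R drops it to a vector and the later `out[1:k, ]` fails. That could happen when no column named `label` is found, since `mutate(group_ref)` creates a column called `group_ref`. The sketch below reproduces only the mechanism, with toy values.

```r
# A one-column matrix, as `out` would be if no per-label columns were added.
out <- matrix(c(0.3, 0.1, 0.2), ncol = 1,
              dimnames = list(c("g1", "g2", "g3"), "t_scores"))

# Reordering without drop = FALSE silently turns the matrix into a vector.
sorted <- out[order(-out[, "t_scores"]), ]
is.null(dim(sorted))

# Two-index subsetting of that vector raises the error from the report.
err <- tryCatch(sorted[1:2, ], error = function(e) conditionMessage(e))
err

# Keeping the matrix shape avoids the failure.
kept <- out[order(-out[, "t_scores"]), , drop = FALSE]
rownames(kept[1:2, , drop = FALSE])
```

If this is indeed the cause, naming the column explicitly, for example `mutate(label = group_ref)`, would be the first thing to try.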

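Question about the LoadModel_R function

According to the scibet documentation, the model input is a matrix with samples as rows, gene names as columns, and the label as the last column.
However, judging from the description of LoadModel_R, a transposition seems to be required, giving a matrix with rownames = samplename and colnames = genename.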

function (x, genes = NULL, labels = NULL)
{
  prob <- x
  if (is.null(genes))
    genes <- rownames(x)
  if (is.null(labels))
    labels <- colnames(x)
  function(expr, result = "list") {
    have_genes <- intersect(genes, colnames(expr))
    expra <- log1p(as.matrix(expr[, have_genes]))/log(2)
    switch(result, list = Gambler_R(expra, prob[have_genes, ], FALSE), table = {
      out <- Gambler_R(expra, prob[have_genes, ], TRUE)
      rownames(out) <- have_genes
      return(out)
    })
  }
}

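Performance is clearly much higher than the author claims

I ran several datasets, none of them very large, and the accuracy was above 98% on each. Does anyone know why? There are four datasets of 3000-4000 cells each. Judging by the measured accuracy, this looks like an excellent method. I saw the same when I ran BERT. Do such small datasets give a natural advantage?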

Add sparse matrix support

test_j <- SciBet_R(mat, iD_m)
Error in expr$label : $ operator not defined for this S4 class
I ran into this problem, and I think the following code changes could solve it.

Here train and test are both sparse matrices.

train_label <- factor(CellType, levels = unique(CellType))

SciBet_R <- function (train, train_label, test, k = 1000, result = "list")
{
  Learn_R(train, train_label, NULL, k)(test, result)
}

Learn_R <- function (expr, train_label, geneset = NULL, k = 1000)
{
  labels <- train_label
  if (is.null(geneset)) {
    geneset <- SelectGene_R(expr, k, train_label)
  }
  labell <- levels(labels)
  expr_select <- expr[, geneset]
  label_total <- matrix(0, length(geneset), 1)
  for (i in labell) {
    label_TPM <- expr_select[labels == i, ]
    label_mean <- colSums(log2(label_TPM + 1))
    a <- matrix(label_mean, , 1)
    colnames(a) <- i
    label_total <- cbind(label_total, a)
  }
  label_total <- label_total[, -1]
  rownames(label_total) <- geneset
  label_t <- t(label_total)
  prob <- log2(label_t + 1) - log2(rowSums(label_t) + length(geneset))
  prob <- t(prob)
  genes <- rownames(prob)
  function(test, result = "list") {
    have_genes <- intersect(genes, colnames(test))
    testa <- log1p(as.matrix(test[, have_genes]))/log(2)
    switch(result, list = Gambler_R(testa, prob[have_genes, ], FALSE), table = {
      out <- Gambler_R(testa, prob[have_genes, ], TRUE)
      rownames(out) <- have_genes
      return(out)
    })
  }
}

SelectGene_R <- function (expr, k = 1000, train_label)
{
  labels <- train_label
  labels_set <- levels(labels)
  label_total <- matrix(0, ncol(expr) - 1, 1)
  for (i in labels_set) {
    label_TPM <- expr[train_label == i, ][, -ncol(expr)]
    label_mean <- colMeans(label_TPM)
    a <- matrix(label_mean, , 1)
    colnames(a) <- i
    label_total <- cbind(label_total, a)
  }
  label_total <- label_total[, -1]
  log_E <- log2(rowMeans(label_total + 1))
  E_log <- rowMeans(log2(label_total + 1))
  t_scores <- log_E - E_log
  out <- cbind(label_total, t_scores)
  rownames(out) <- colnames(expr)[-ncol(expr)]
  out <- out[order(-out[, "t_scores"]), ]
  select <- out[1:k, ]
  out <- rownames(select)
  return(out)
}
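When labels are passed separately, as in the proposal above, every column of the expression matrix is a gene, so per-label means can be taken over all columns. The sketch below illustrates that idea with a dense base-R matrix standing in for a sparse one. `label_means` is a hypothetical helper, and whether it carries over unchanged to a given sparse matrix class is an assumption to verify.

```r
# Per-label mean expression with labels supplied separately from the matrix.
# A dense base-R matrix stands in for a sparse one; `label_means` is hypothetical.
label_means <- function(expr, train_label) {
  # genes x labels matrix; drop = FALSE keeps single-cell groups as matrices
  sapply(levels(train_label), function(l) {
    colMeans(expr[train_label == l, , drop = FALSE])
  })
}

expr_toy <- matrix(c(1, 3, 10, 2, 4, 20), nrow = 3,
                   dimnames = list(paste0("cell", 1:3), c("GeneA", "GeneB")))
lab_toy  <- factor(c("A", "A", "B"), levels = c("A", "B"))
label_means(expr_toy, lab_toy)
```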