ocbe-uio / discbio


11.0 2.0 5.0 235.99 MB

A user-friendly R pipeline for biomarker discovery in single-cell transcriptomics

License: Other

R 1.49% Jupyter Notebook 98.51%

transcriptomics biomarker-discovery single-cell-analysis scrna-seq jupyter-notebook r-package

discbio's People

Contributors

Stargazers

Watchers

Forkers

minrk nikolaospapachristou dami82 takshan enformatik

discbio's Issues

The PPI function is writing to the user's disk

Summary

The following lines of PPI() always write a file to the user's local drive and then read it back:

DIscBIO/R/PPI.R

Lines 42 to 45 in 5b2b23a

repo_content <- content(repos)
#results <- read_tsv(repo_content)
write.table(repo_content, file = "data.csv", sep = ",")
results <- read.table(file = "data.csv", sep = ",")

DIscBIO version

1.1.0

Expected output

R functions are not supposed to write to the user's working directory unless explicitly told to.
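
One way to avoid touching the working directory is sketched below. It assumes that `repo_content` is a single character string of tab-delimited text; the actual return type of `content(repos)` inside `PPI()` is not shown in this issue, so the sample string and both helper names (`parse_in_memory`, `parse_via_tempfile`) are hypothetical.

```r
# Hypothetical stand-in for the text returned by content(repos);
# the real content in PPI() may be shaped differently.
repo_content <- "geneA\tgeneB\tscore\nTP53\tMDM2\t0.999\nBRCA1\tBARD1\t0.998\n"

# Option 1: parse the text in memory, with no file involved
parse_in_memory <- function(txt, sep = "\t") {
    read.table(text = txt, sep = sep, header = TRUE, stringsAsFactors = FALSE)
}

# Option 2: if a file is unavoidable, write to a temporary file
# and delete it when the function exits
parse_via_tempfile <- function(txt, sep = "\t") {
    tmp <- tempfile(fileext = ".tsv")
    on.exit(unlink(tmp), add = TRUE)
    writeLines(txt, tmp)
    read.table(tmp, sep = sep, header = TRUE, stringsAsFactors = FALSE)
}

results <- parse_in_memory(repo_content)
results_tmp <- parse_via_tempfile(repo_content)
```

Both variants leave the working directory untouched; the first is probably preferable whenever the downloaded content is already available as text.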

In a big dataset that has about 1400 cells the "plotSilhouette" function is not generating any plot.


New vignette hangs with reduced dataset

Summary

Under DIscBIO version 0.0.0.9004, the new vignette hangs with the reduced dataset.

What works

The following code works fine and is identical to the code currently in the vignette (minus commented-out code, for brevity):

## ----options, echo=FALSE------------------------------------------------------
library(knitr)
opts_chunk$set(fig.width=7, fig.height=7)

## -----------------------------------------------------------------------------
library(DIscBIO)
library(enrichR)

## -----------------------------------------------------------------------------
DataSet <- valuesG1msReduced
head(DataSet)

## -----------------------------------------------------------------------------
sc <- DISCBIO(DataSet)    

## -----------------------------------------------------------------------------
sc<-NoiseFiltering(sc,percentile=0.9, CV=0.2)

####  Normalizing the reads without any further gene filtering
sc<-Normalizedata(sc, mintotal=1000, minexpr=0, minnumber=0, maxexpr=Inf, downsample=FALSE, dsn=1, rseed=17000) 

####  Additional gene filtering step based on gene expression

sc<-FinalPreprocessing(sc,GeneFlitering="NoiseF",export = TRUE) ### GeneFlitering can be either "NoiseF" or "ExpF"

## -----------------------------------------------------------------------------
#OnlyExpressionFiltering=TRUE           
OnlyExpressionFiltering=FALSE         

if (OnlyExpressionFiltering==TRUE){
    MIinExp<- mean(rowMeans(DataSet,na.rm=TRUE))
    MIinExp
    MinNumber<- round(length(DataSet[1,])/3)    # To be expressed in at least one third of the cells.
    MinNumber
    sc<-Normalizedata(sc, mintotal=1000, minexpr=MIinExp, minnumber=MinNumber, maxexpr=Inf, downsample=FALSE, dsn=1, rseed=17000) #### In this case this function is used to filter out genes and cells.
    sc<-FinalPreprocessing(sc,GeneFlitering="ExpF",export = TRUE)  
}

## -----------------------------------------------------------------------------
sc<- Clustexp(sc,cln=3,quiet=TRUE)    #### K-means clustering to get three clusters
plotGap(sc)        ### Plotting gap statistics
# sc<- Clustexp(sc, clustnr=20,bootnr=50,metric="pearson",do.gap=T,SE.method="Tibs2001SEmax",SE.factor=.25,B.gap=50,cln=K,rseed=17000)

## -----------------------------------------------------------------------------
sc<- comptSNE(sc,rseed=15555,quiet = TRUE)
cat("\t","     Cell-ID"," Cluster Number","\n")
sc@cpart

## -----------------------------------------------------------------------------
# Silhouette of k-means clusters
par(mar=c(6,2,4,2))
plotSilhouette(sc,K=3)       # K is the number of clusters

## -----------------------------------------------------------------------------
Jaccard(sc,Clustering="K-means", K=3, plot = TRUE)     # Jaccard of k-means clusters

## -----------------------------------------------------------------------------
############ Plotting K-means clusters
plottSNE(sc)
plotKmeansLabelstSNE(sc) # To plot the ID of the cells in each cluster
plotSymbolstSNE(sc,types=sub("(\\_\\d+)$","", names(sc@ndata))) # To plot the ID of the cells in each cluster

## -----------------------------------------------------------------------------
outlg<-round(length(sc@fdata[,1])/200)     # The cell will be considered as an outlier if it has a minimum of 0.5% of the number of filtered genes as outlier genes. 
Outliers<- FindOutliersKM(sc, K=3, outminc=5,outlg=outlg,probthr=.5*1e-3,thr=2**-(1:40),outdistquant=.75, plot = TRUE, quiet = FALSE)

RemovingOutliers=FALSE     
# RemovingOutliers=TRUE                    # Removing the defined outlier cells based on K-means Clustering

if(RemovingOutliers==TRUE){
    names(Outliers)=NULL
    Outliers
    DataSet=DataSet[-Outliers]
    dim(DataSet)
    colnames(DataSet)
    cat("Outlier cells were removed, now you need to start from the beginning")
}

## -----------------------------------------------------------------------------
sc<-KmeanOrder(sc,quiet = FALSE, export = TRUE)
plotOrderKMtsne(sc)

## -----------------------------------------------------------------------------
KMclustheatmap(sc,hmethod="single", plot = TRUE) 

## -----------------------------------------------------------------------------
g='ENSG00000000003'                   #### Plotting the expression of  MT-RNR2
plotExptSNE(sc,g)

## ----degKM--------------------------------------------------------------------
####### differential expression analysis between cluster 1 and cluster 3 of the Model-Based clustering using FDR of 0.05
cdiff <- DEGanalysis2clust(
  sc, Clustering="K-means", K=3, fdr=0.1, name="Name", export=TRUE, quiet=TRUE
)

## -----------------------------------------------------------------------------
#### To show the result table
head(cdiff[[1]])                  # The first component 
head(cdiff[[2]])                  # The second component

What doesn't work

The next line, however, hangs:

cdiff <- DEGanalysis(
  sc, Clustering="K-means", K=3, fdr=0.1, name="Name", export=TRUE,
  quiet=FALSE
)

The last output lines before the freeze are these:

Number of thresholds chosen (all possible thresholds) = 115
Getting all the cutoffs for the thresholds...
Getting number of false positives in the permutation...
'select()' returned 1:many mapping between keys and columns
Up-regulated genes in the Cl2 in Cl1 VS Cl2
Estimating sequencing depths...
Resampling to get new data matrices...

ClustDiffGenes is generating opposite tables

Expected behavior

The up should be down and the down should be up

Additional context

This is most likely due to the ">" in line 122.

Reorganize internal functions

The package's internal functions are scattered across several files in the R/ folder. For example, there's the internal-functions.R file, which contains several functions; then there's samr-adapted.R, containing internal functions adapted from the samr package; then there are a few individually isolated functions like reformatSiggenes.R and replaceDecimals.R. It would be great if this were more consistent.

One suggestion would be to keep the samr functions separated and aggregate all the rest into internal-functions.R.

Remove 1 dependency

Solving #38 involved adding one dependency (withr). This triggers a build note complaining about the number of dependencies, so it would be great to get back to 20 or fewer. Solving #26 might be the quickest way to get this one closed as well.

Adapt package to different types of organisms

DIscBIO was developed based on two datasets using human and mouse genes. It would be great if it could be adapted to work on other organisms.

Adapted details from @SystemsBiologist

What to change

Conquer is a collection of analysis-ready public scRNA-seq data sets. We would like to add it to our manuscript. It has about 40 datasets from three organisms: human, zebrafish and mouse. When I wrote DIscBIO I was focusing on humans, but now we want to make it applicable to any organism with a taxonomy ID. To do so, we need to change lines 157-159 of DIscBIO-classes.R from

DIscBIO/R/DIscBIO-classes.R

Lines 157 to 159 in 0c90899

shortNames <- substr(rownames(tmpExpdataAll), 1, 4)
geneTypes <- factor(
    c(ENSG = "ENSG", ERCC = "ERCM", ENSG = "ENSM")[shortNames]

to

shortNames <- substr(rownames(tmpExpdataAll), 1, 3)
geneTypes <- factor(
    c(ENS = "ENS", ERC = "ERC")[shortNames]

I did not change the code because the dev branch is not working and I was worried about making the situation worse. Could you change the code once dev is working again?
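
The effect of shortening the prefix from four to three characters can be illustrated with a small sketch. The identifiers below are made-up examples of human, mouse and zebrafish Ensembl-style IDs plus an ERCC spike-in, and `classify_gene_ids` is a hypothetical helper, not a function from the package.

```r
# Illustrative identifiers: human, mouse, zebrafish, a spike-in and an unknown
gene_ids <- c("ENSG00000000003", "ENSMUSG00000000001",
              "ENSDARG00000000001", "ERCC-00002", "UNKNOWN1")

# Hypothetical helper mirroring the proposed three-character lookup
classify_gene_ids <- function(ids) {
    short_names <- substr(ids, 1, 3)
    lookup <- c(ENS = "ENS", ERC = "ERC")
    # Identifiers with an unrecognized prefix become NA
    factor(unname(lookup[short_names]), levels = c("ENS", "ERC"))
}

gene_types <- classify_gene_ids(gene_ids)
table(gene_types, useNA = "ifany")
```

With a three-character prefix, all Ensembl-style IDs fall into the same "ENS" group regardless of organism, which seems to be the intent of the proposed change; anything else surfaces as NA rather than being silently mislabeled.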

Expected behavior

The problem will be in 3 functions (DEGanalysis2clust, DEGanalysis and ClustDiffGenes)
The outcome of ClustDiffGenes() is not perfect but it is OK
The main problem is in DEGanalysis2clust and DEGanalysis. They are not working at all.

Testing code

library(MultiAssayExperiment)
GSE41265 <- readRDS("~/GSE41265.rds")
Dataset=assays(experiments(GSE41265)[["gene"]])[["count"]]
rownames(Dataset) <- as.list(sub("*\\..*", "", unlist(rownames(Dataset))))
sc<- DISCBIO(Dataset)
sc<- Clustexp(sc,cln=2,quiet=F,clustnr=6,rseed=17000)    
Cdiff<-DEGanalysis2clust(sc,Clustering="K-means",K=2,fdr=0.05,name="M",export = TRUE,quiet=F)  
Cdiff<-DEGanalysis(sc,Clustering="K-means",K=2,fdr=0.05,name="All",export = TRUE,quiet=F)   ####### differential expression analysis between all clusters
CdiffBinomial<-ClustDiffGenes(sc,K=2,export = T,fdr=.01,quiet=F)

For now, it would be great if DEGanalysis and DEGanalysis2clust could work even without gene names, as ClustDiffGenes does.

Freezing on DEGanalysis

Describe the bug

There are some circumstances that cause the DEGanalysis function to hang.

To Reproduce

Download the pan_indrop_matrix_8000cells_18556genes dataset
From the download directory, run R
Run the following code in R (it may take a few minutes):

library(DIscBIO)
load("data/pan_indrop_matrix_8000cells_18556genes.rda")

# ==============================================================================
# Determining constants
# ==============================================================================
n_genes <- 500
K <- 3

# ==============================================================================
# Subsetting and formatting datasets
# ==============================================================================
sc_dataframe <- pan_indrop_matrix_8000cells_18556genes[, seq_len(n_genes)]
sc <- DISCBIO(sc_dataframe)

# ==============================================================================
# Performing operations
# ==============================================================================
MIinExp <- mean(rowMeans(sc_dataframe, na.rm=TRUE))
MinNumber <- round(length(sc_dataframe[1, ]) / 10)
sc <- Normalizedata(
    sc, mintotal=1000, minexpr=MIinExp, minnumber=MinNumber, maxexpr=Inf,
    downsample=FALSE, dsn=1, rseed=17000
)
sc <- FinalPreprocessing(sc, GeneFlitering="ExpF", export=FALSE, quiet=TRUE)
sc <- Clustexp(sc, cln=K, quiet=TRUE)
sc <- comptSNE(sc, rseed=15555, quiet=TRUE)

# This is the part that freezes
cdiff <- DEGanalysis(
    sc, Clustering="K-means", K=K, fdr=0.10, name="all_clusters",
    export=FALSE, quiet=FALSE, plot=FALSE, nresamp=5, nperms=10
)

Observe that the code freezes with the output of DEGanalysis being:

The dataset is ready for differential expression analysis[1] "Cl2" "Cl1" "Cl3"
Number of comparisons:  6 
Estimating sequencing depths...
Resampling to get new data matrices...
perm= 1
perm= 2
perm= 3
perm= 4
perm= 5
perm= 6
perm= 7
perm= 8
perm= 9
perm= 10
Number of thresholds chosen (all possible thresholds) = 1283
Getting all the cutoffs for the thresholds...
Getting number of false positives in the permutation...
'select()' returned 1:many mapping between keys and columns
Low-regulated genes in the Cl1 in Cl2 VS Cl1
'select()' returned 1:many mapping between keys and columns
Up-regulated genes in the Cl1 in Cl2 VS Cl1
Estimating sequencing depths...
Resampling to get new data matrices...

Ctrl+C doesn't quit the function, only killing R does the trick.

Expected behavior

For a simpler set of parameters, for example n_genes <- 100 and K <- 2, the code above ends with the following output:

Up-regulated genes in the Cl1 in Cl2 VS Cl1
  Comparisons Target cluster Gene number                                   File name Gene number                                   File name
1  Cl2 VS Cl1            Cl1         477  Up-regulated-all_clustersCl1inCl2VSCl1.csv         941 Low-regulated-all_clustersCl1inCl2VSCl1.csv
2  Cl2 VS Cl1            Cl2         477 Low-regulated-all_clustersCl2inCl2VSCl1.csv         941  Up-regulated-all_clustersCl2inCl2VSCl1.csv

Moreover, the structure of cdiff is:

List of 2
 $ : chr [1:1422, 1:2] "ENSG00000005022" "ENSG00000006327" "ENSG00000008394" "ENSG00000008517" ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "DEGsE" "DEGsS"
 $ :'data.frame':       2 obs. of  6 variables:
  ..$ Comparisons   : chr [1:2] "Cl2 VS Cl1" "Cl2 VS Cl1"
  ..$ Target cluster: chr [1:2] "Cl1" "Cl2"
  ..$ Gene number   : int [1:2] 477 477
  ..$ File name     : chr [1:2] "Up-regulated-all_clustersCl1inCl2VSCl1.csv" "Low-regulated-all_clustersCl2inCl2VSCl1.csv"
  ..$ Gene number   : int [1:2] 941 941
  ..$ File name     : chr [1:2] "Low-regulated-all_clustersCl1inCl2VSCl1.csv" "Up-regulated-all_clustersCl2inCl2VSCl1.csv"

Software metainformation

Operating System: Ubuntu 18.04
R version: 4.0.0
DIscBIO version or commit number: 0.99.6

Remove calls to KmeanOrder from notebook

The following Notebooks contain calls to the legacy function KmeanOrder():

As per a warning on that function, KmeanOrder() has been replaced by pseudoTimeOrdering(), so all calls to KmeanOrder() should be replaced ASAP.

Ideally, just changing the name of the function called should suffice; if bugs occur, fixes should be made to pseudoTimeOrdering().

This change would allow us to remove KmeanOrder, thus reducing the code footprint and check time of the package (something DIscBIO needs).

Prepare for upcoming Seurat v5 release

I am opening this issue as a notification because DIscBIO is listed here as a package that relies (depends/imports/suggests) on Seurat. As you may know, we recently released Seurat v5 as a beta in March of this year, with new updates for spatial, multimodal, and massively scalable analysis. For more information on updates and improvements, check out our website https://satijalab.org/seurat/.
We are now preparing to release Seurat v5 to CRAN, and plan to submit it on October 23rd. While we have tried our best to keep things backward-compatible, it is possible that updates to Seurat and SeuratObject might break your existing functionality or tests. We wanted to reach out before the new version is on CRAN, so that there's time to report issues/incompatibilities and prepare you for any changes in your code base that might be necessary.

We apologize for any disruption or inconvenience, but hope that the improvements to Seurat v5 will benefit your users going forward.
To test the upcoming release, you can install Seurat from the seurat5 branch using the instructions available on this page: https://satijalab.org/seurat/articles/install.

Thank you!
Seurat v5 team

Fix linting

The linter workflow has found several refactoring improvements to suggest. Solving all the warnings raised should be enough to close this issue.

PPI error

Error

Error: `file` must be a string, raw vector or a connection.
Traceback:
1. PPI(data, FileName)
2. read_tsv(repo_content)
3. read_delimited(file, tokenizer, col_names = col_names, col_types = col_types, 
 .     locale = locale, skip = skip, skip_empty_rows = skip_empty_rows, 
 .     comment = comment, n_max = n_max, guess_max = guess_max, 
 .     progress = progress)
4. col_spec_standardise(data, skip = skip, skip_empty_rows = skip_empty_rows, 
 .     comment = comment, guess_max = guess_max, col_names = col_names, 
 .     col_types = col_types, tokenizer = tokenizer, locale = locale)
5. datasource(file, skip = skip, skip_empty_rows = skip_empty_rows, 
 .     comment = comment)
6. stop("`file` must be a string, raw vector or a connection.", 
 .     call. = FALSE)

Steps to reproduce

[13.39] Salim Ghannoum I am always running the notebook
[13.39] Salim Ghannoum CTCs-Binder-Part2.ipynb
[13.40] Salim Ghannoum I printed the repo_content

So it is not empty; I don't know why read_tsv() is not working.

Standardize function names

The names of DIscBIO functions are not consistent: some use Pascal case (e.g. ClassVectoringDT, PCAplotSymbols), others use Dromedary case (e.g. clustheatmap, plotGap); even the spelling of similar functions can change (e.g. plotOrderTsne vs. plotLabelstSNE). This can confuse users. It would be great if all functions followed a single naming convention.

Binder failing to start

Waiting for build to start...
Picked Git content provider.
Cloning into '/tmp/repo2dockerbqeislym'...
HEAD is now at b9a1dc3 Merge branch 'release-1.2.0'
Error during build: .0.0 is not valid SemVer st

Possible solution provided on jupyterhub/repo2docker#1140 (comment).

Retry URLs on Networking and PPI functions

Summary

The Networking() and PPI() functions use httr::GET() to retrieve URLs. They only try the URL once, which sometimes fails due to network errors (a common problem among users with less-than-perfect internet connections). It would be great if these calls were wrapped in loops that try GET several (e.g. 3) times until they get a status_code() of 200, which means a successful URL retrieval.
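
A minimal sketch of such a retry loop is shown below. To keep it self-contained, the request is passed in as a function and replaced by a mock here; `retry_request` and `flaky_fetch` are hypothetical names, and in the package the fetcher would presumably be something like `function() httr::GET(url)`.

```r
# fetch: a zero-argument function performing one request and returning an
#        object with a `status_code` element (as httr responses have).
retry_request <- function(fetch, max_tries = 3, wait = 0) {
    for (attempt in seq_len(max_tries)) {
        # Treat an R-level error (e.g. a dropped connection) as a failed attempt
        response <- tryCatch(fetch(), error = function(e) NULL)
        if (!is.null(response) &&
            identical(as.integer(response$status_code), 200L)) {
            return(response)
        }
        if (attempt < max_tries) Sys.sleep(wait)
    }
    stop("request failed after ", max_tries, " attempts", call. = FALSE)
}

# Mock fetcher that fails twice and then succeeds
calls <- 0
flaky_fetch <- function() {
    calls <<- calls + 1
    if (calls < 3) stop("simulated network error")
    list(status_code = 200L, body = "ok")
}

res <- retry_request(flaky_fetch, max_tries = 3)
```

httr also provides `RETRY()`, which repeats a request a configurable number of times; that may be a simpler option than a hand-written loop if its back-off behavior suits the package.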

To-dos

Create MRE
Add loop to Networking()
Add loop to PPI()

MRE

The code below lacks downloading steps for the scripts and data from the Binder.

library("DIscBIO")
source("DIscBIO-CTCs-Binder-Part1.r")
source("DIscBIO-CTCs-Binder-Part2.r")

Networking(data, FileName)
PPI(data, FileName)

Failing TravisCI builds

Describe the bug

TravisCI builds for DIscBIO are failing

Steps to Reproduce

Access https://travis-ci.org/ocbe-uio/DIscBIO
Click on "restart build"

Expected behavior

A passing build.

Additional info

The last passing build was https://travis-ci.org/github/ocbe-uio/DIscBIO/builds/721646083. The only change between commit 0567683 and the next one (c7d9929) is a few lines in README.md. Maybe the new lines regarding BiocManager::install caused this?

P.S.: to check the difference between the two commits, run

git diff c7d99297a2dea98763d79248ffd0d675a3be64b5 0567683379186cf3b60a37f861932bb86696e04b

on a terminal at the DIscBIO working directory.

DIscBIO doesn't install

Describe the bug

Installation of DIscBIO from CRAN fails.

Steps to Reproduce

From an R interactive session, run:

install.packages("DIscBIO")

Expected output

Output that ends in:

* DONE (DIscBIO)

The downloaded source packages are in
        ‘/tmp/Rtmpgn8yf4/downloaded_packages’

The name of the temporary directory, Rtmpgn8yf4, will probably be different in your case.

Obtained output

The downloaded source packages are in
        ‘/tmp/Rtmpgn8yf4/downloaded_packages’
Warning messages:
1: In install.packages("DIscBIO") :
  installation of package ‘rJava’ had non-zero exit status
2: In install.packages("DIscBIO") :
  installation of package ‘RWekajars’ had non-zero exit status
3: In install.packages("DIscBIO") :
  installation of package ‘RWeka’ had non-zero exit status
4: In install.packages("DIscBIO") :
  installation of package ‘DIscBIO’ had non-zero exit status

Remove philentropy dependency

The poorman package is scheduled for removal from CRAN, which affects the philentropy package and, by extension, DIscBIO. We use philentropy for the calculation of the Jaccard distances, so that functionality can be rewritten.
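
As a rough illustration of how little code a dependency-free replacement might need, here is a base-R sketch of the set-based Jaccard distance. Whether DIscBIO needs this set-based form or a numeric-vector form is an assumption not settled by this issue, and the function names are hypothetical.

```r
# Jaccard similarity between two sets: |A intersect B| / |A union B|
jaccard_similarity <- function(a, b) {
    a <- unique(a)
    b <- unique(b)
    union_size <- length(union(a, b))
    # Convention chosen for this sketch: two empty sets count as identical
    if (union_size == 0) return(1)
    length(intersect(a, b)) / union_size
}

# Jaccard distance is the complement of the similarity
jaccard_distance <- function(a, b) 1 - jaccard_similarity(a, b)

# Example: two clusters sharing two of four distinct cells
cluster_a <- c("cell1", "cell2", "cell3")
cluster_b <- c("cell2", "cell3", "cell4")
d <- jaccard_distance(cluster_a, cluster_b)
```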

Demo data is too large for examples

With 59 838 observations and 94 variables, the valuesG1ms dataset that comes with the package is too large for some function examples.

@SystemsBiologist, is it possible to add a second data example, with a subset of valuesG1ms? Which is the best way to subset the data and still have the dataset make sense? One idea is to just keep 33 columns (G1–G1.10, S1–S1.10, G2–G2.10) and, say, the first 1 000 rows of the dataset. Is this reasonable?
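
The proposed subsetting could look roughly like the sketch below. It runs on a toy data frame whose column names are only assumed to follow the G1, G1.1, ..., S1, ..., G2, ... pattern mentioned above; the real valuesG1ms column names should be checked before reusing the pattern.

```r
# Toy stand-in for valuesG1ms: 3 prefixes x 31 columns, 2000 rows
prefixes <- c("G1", "S1", "G2")
toy_names <- unlist(lapply(prefixes, function(p) c(p, paste0(p, ".", 1:30))))
set.seed(1)
toy <- as.data.frame(matrix(rpois(2000 * length(toy_names), 5), nrow = 2000))
colnames(toy) <- toy_names

# Keep the bare prefix plus suffixes .1 to .10 for each group (11 x 3 columns)
keep_cols <- grep("^(G1|S1|G2)(\\.([1-9]|10))?$", colnames(toy), value = TRUE)

# Keep at most the first 1000 rows
reduced <- toy[seq_len(min(1000, nrow(toy))), keep_cols]
dim(reduced)
```

Taking the first rows is the simplest choice, but it is not necessarily representative; a random sample of rows might be a better fit if the row order in the dataset is meaningful.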

Replace call to boot() from Jaccard()

DIscBIO contains the boot package as a dependency just for the purpose of using the boot() function inside Jaccard() (see here). If this were to be replaced by an in-house solution, there would be one fewer dependency for DIscBIO (which is currently depending on 21 non-default packages; this generates a NOTE from devtools::check()).
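
A minimal in-house replacement for the ordinary nonparametric bootstrap might look like the sketch below. It only mimics the `statistic(data, indices)` calling convention of `boot()` and returns a plain list; `simple_boot` is a hypothetical name, and none of the other features of the boot package are covered.

```r
# data:      vector, matrix or data frame (rows are resampled)
# statistic: function(data, indices, ...) returning a numeric vector
# R:         number of bootstrap replicates
simple_boot <- function(data, statistic, R, ...) {
    n <- NROW(data)
    # Statistic on the original data
    t0 <- statistic(data, seq_len(n), ...)
    # Statistic on R resamples drawn with replacement
    reps <- vapply(
        seq_len(R),
        function(i) statistic(data, sample.int(n, n, replace = TRUE), ...),
        FUN.VALUE = numeric(length(t0))
    )
    # One row per replicate, one column per component of the statistic
    list(t0 = t0, t = matrix(reps, nrow = R, byrow = TRUE), R = R)
}

# Example: bootstrap the mean of a numeric vector
set.seed(42)
x <- rnorm(50)
mean_stat <- function(d, idx) mean(d[idx])
b <- simple_boot(x, mean_stat, R = 200)
```

Because `Jaccard()` would only need the replicate values, a plain list may be enough; code that relies on methods for "boot" objects would need more work.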

Unable to view code of Binder notebook

Summary

Clicking on the "View as code" icon of some notebook files leads to a 404 error page.

Expected output

I don't know if I'm using the notebook properly, but let's take as an example in this page:

https://nbviewer.jupyter.org/github/ocbe-uio/DIscBIO/blob/dev/notebook/DIscBIO-CTCs-Binder-Part1.ipynb

I would like to copy the R code chunks, but manually selecting each chunk and Ctrl+C/Ctrl+V-ing my way through each one of them sounds like waaay much more work than it should. Normal Python notebooks can export just the code chunks for simple copying and pasting, so I assume Binder allows a reader to do that too. I also assume that can be done by clicking here:

Obtained output

When I click on the highlighted icon in the image above, this is where I end up:

So maybe the icon is pointing to the wrong place. Please fix or advise.

Binder links on README point to dev

The dev branch is unstable; Binder should be pointed to master by default.

Some functions implicitly require clustering

Some functions are very explicit about their dependency on other functions, for example running

comptSNE(DISCBIO(valuesG1ms))

Returns

Error in comptSNE(DISCBIO(valuesG1ms)) : run clustexp before comptsne

However, there are several other functions such as KmeanOrder which have the same requirement, but are not explicit about it. Running

KmeanOrder(DISCBIO(valuesG1ms), export = FALSE)

Returns

Error in [<-(*tmp*, cid, , value = colMeans(pcareduceres[names(clusterid[clusterid == : subscript out of bounds

(the original message was in Nynorsk, "indeksen ligg utanfor grensene"; the point is that the first argument is out of bounds).

The function is clearly expecting some other clustering function such as clustexp to be run beforehand.

There are a few functions like this, and I can fix this by adding a similar validation algorithm to them as I go through the demo script, but I was wondering what this validation algo contains. Is it enough to check if length(object@cpart) == 0 or should I be checking other slots? Is clustexp the only clustering function that needs to be run before or are there alternatives?
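
For discussion, the check floated above could be packaged as a small helper. The sketch uses a toy S4 class with only a `cpart` slot; `ToySC` and `check_clustered` are hypothetical names, and whether `cpart` alone is a sufficient test is exactly the open question.

```r
library(methods)

# Toy class with only the slot discussed above; the real DISCBIO class has more
setClass("ToySC", representation(cpart = "numeric"))

# Hypothetical helper: fail early with a readable message
check_clustered <- function(object, caller) {
    if (length(object@cpart) == 0) {
        stop("run clustexp before ", caller, call. = FALSE)
    }
    invisible(TRUE)
}

unclustered <- new("ToySC")
clustered <- new("ToySC", cpart = c(cell1 = 1, cell2 = 2))

# The unclustered object triggers the explicit error message
msg <- tryCatch(
    check_clustered(unclustered, "KmeanOrder"),
    error = function(e) conditionMessage(e)
)
ok <- check_clustered(clustered, "KmeanOrder")
```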

Loading count matrix from 10x dataset in DIscBIO

Dear Author,

Thank you for the excellent tool. I have tested the tool on the test dataset of CTC and it works flawlessly. I then tried to use it on a 10x dataset from GSE136103. It has 10 healthy and 10 diseased samples.

I have processed these using Seurat as follows

data.10x = list()
data.10x[[1]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041161/cellranger/")
data.10x[[2]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041162/cellranger/")
data.10x[[3]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041163/cellranger/")
data.10x[[4]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041164/cellranger/")
data.10x[[5]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041165/cellranger/")
data.10x[[6]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041166/cellranger/")
data.10x[[7]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041167/cellranger/")
data.10x[[8]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041168/cellranger/")
data.10x[[9]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041169/cellranger/")
data.10x[[10]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041150/cellranger/")
data.10x[[11]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041151/cellranger/")
data.10x[[12]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041152/cellranger/")
data.10x[[13]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041153/cellranger/")
data.10x[[14]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041154/cellranger/")
data.10x[[15]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041155/cellranger/")
data.10x[[16]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041156/cellranger/")
data.10x[[17]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041157/cellranger/")
data.10x[[18]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041158/cellranger/")
data.10x[[19]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041159/cellranger/")
data.10x[[20]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041160/cellranger/")





### Create vector of sample names
samples = c("GSM4041150","GSM4041151","GSM4041152","GSM4041153","GSM4041154","GSM4041155","GSM4041156","GSM4041157","GSM4041158", "GSM4041159","GSM4041160",         "GSM4041161","GSM4041162","GSM4041163","GSM4041164","GSM4041165","GSM4041166","GSM4041167","GSM4041168","GSM4041169")


#Create Seurat Objects
scrna.list = list()
for (i in 1:length(data.10x)) {
  scrna.list[[i]] = CreateSeuratObject(counts = data.10x[[i]], min.cells=5, min.features=50, project=samples[i]);
  scrna.list[[i]][["DataSet"]] = samples[i];
}

rm(data.10x)

### Merge Seurat object into a single object
scrna <- merge(x=scrna.list[[1]], y=c(scrna.list[[2]],scrna.list[[3]],scrna.list[[4]],scrna.list[[5]],scrna.list[[6]],scrna.list[[7]],
                                      scrna.list[[8]],scrna.list[[9]],scrna.list[[10]],scrna.list[[11]],scrna.list[[12]],scrna.list[[13]],
                                      scrna.list[[14]],scrna.list[[15]],scrna.list[[16]],scrna.list[[17]],scrna.list[[18]],scrna.list[[19]],
                                      scrna.list[[20]]), add.cell.ids =  c("GSM4041161","GSM4041162",
                                      "GSM4041163","GSM4041164","GSM4041165","GSM4041166","GSM4041167","GSM4041168","GSM4041169",
                                      "GSM4041150","GSM4041151","GSM4041152","GSM4041153","GSM4041154","GSM4041155","GSM4041156",
                                      "GSM4041157","GSM4041158", "GSM4041159","GSM4041160"));

Then computed the count matrix per sample by taking average as below

avg.counts <- AverageExpression(object = scrna)
write.csv(avg.counts, file="Liver.csv")

Then I use DIscBIO pipeline as outlined for further analyses

library(DIscBIO)

Loading Dataset

FileName<-"liver"
DataSet <- read.csv(file = paste0(FileName,".csv"), sep = ",",header=T)
rownames(DataSet)<-DataSet[,1]
DataSet<-(DataSet[,-1])
cat(paste0("The ", FileName," contains:","\n","Genes: ",length(DataSet[,1]),"\n","cells: ",length(DataSet[1,]),"\n"))
sc<- DISCBIO(DataSet)

In the next step everything becomes infinity

S1<-summary(colSums(DataSet,na.rm=TRUE))            # It gives an idea about the number of reads across cells
print(S1)

I am not sure where I am going wrong or how I can fix this. In addition, is DIscBIO capable of handling 10x data, or is it only for FACS-sorted and SMART-seq data?

Could you please guide me and suggest how to fix this problem?

Thank you

Unit tests failing

Unit tests, local and on GitHub Actions, are either failing (usually due to a pathfinding problem) or taking forever to finish (because the datasets are large and the tests need adaptations, e.g. smaller bootstrapping runs).

Pass local tests
Pass test-coverage.yaml

As a matter of fact, I'm not sure the Binder tests should be part of the package, since they use datasets that are external to it (i.e., they are only used on the notebook). Alternatively, use smaller versions of them.

Recreate unit tests

The unit tests for the package have been simplified so the package could conform to CRAN policies. There is no need to completely remove them from the repository, as they only need to be excluded from the package itself. Having unit tests is very important in maintaining compatibility across versions.
Hence, I propose:

Re-adding the Binder notebook content as unit tests
Adding the unit tests to .Rbuildignore so they do not add overhead to the CRAN check

Networking outputs different network from hyperlink

[20:21, 10.10.2020] Salim Ghannoum: I have just noticed something strange in DIscBIO, the Networking() is giving a network that is not the same as the hyperlink
[20:22, 10.10.2020] Salim Ghannoum: You can see that in the binder for MLS or part 2 or part 4
[20:25, 10.10.2020] Salim Ghannoum: I went through the code but could not understand perfectly the part you added, could you please check it when you have time?

Aggregate K/MB selection into one function

Several functions detect whether k-means or model-based clustering was performed. The code to do so is duplicated in each function, even though the procedure is almost identical and should be aggregated into one internal function.

Here are some examples of code chunks containing such redundancy:

DIscBIO/R/DIscBIO-generic-pseudoTimeOrdering.R

Lines 28 to 45 in 0c90899

# ======================================================================
# Validating
# ======================================================================
ran_k <- length(object@kmeans$kpart) > 0
ran_m <- length(object@MBclusters) > 0
if (ran_k) {
    Obj <- object@fdata
    Names <- object@cpart
    lpsmclust <- Exprmclust(Obj, K = 4, reduce = F, cluster = Names)
    lpsorder <- TSCANorder(lpsmclust)
} else if (ran_m) {
    Obj <- object@fdata
    Names <- names(object@MBclusters$clusterid)
    lpsmclust <- object@MBclusters
    lpsorder <- TSCANorder(lpsmclust)
} else {
    stop("run clustexp before this pseudoTimeOrdering")
}

DIscBIO/R/DIscBIO-generic-plotSilhouette.R

Lines 25 to 44 in 0c90899

# ======================================================================
# Validation
# ======================================================================
ran_clustexp <- length(object@kmeans$kpart) > 0
ran_exprmclust <- length(object@MBclusters) > 0
if (ran_clustexp) {
    kpart <- object@kmeans$kpart
    DIS <- object@distances
} else if (ran_exprmclust) {
    kpart <- object@MBclusters$clusterid
    y <- clustfun(object@fdata, clustnr = 3, bootnr = 50,
        metric = "pearson", do.gap = TRUE, SE.method = "Tibs2001SEmax",
        SE.factor = .25, B.gap = 50, cln = 0, rseed = NULL, quiet = TRUE )
    DIS <- as.matrix(y$di)
} else {
    stop("run clustexp or exprmclust before plotSilhouette")
}
if (length(unique(kpart)) < 2) {
    stop("only a single cluster: no silhouette plot")
}

DIscBIO/R/DIscBIO-generic-plottSNE.R

Lines 15 to 28 in 0c90899

# ======================================================================
# Validating
# ======================================================================
ran_k <- length(object@tsne) > 0
ran_m <- length(object@MBtsne) > 0
if (ran_k) {
    part <- object@kmeans$kpart
    x <- object@tsne
} else if (ran_m) {
    part <- object@MBclusters$clusterid
    x <- object@MBtsne
} else {
    stop("run comptsne before plottSNE")
}

Also, the code makes the functions prefer k-means over MB: if both were run, k-means is used. This should be documented. Ideally, the user should be made aware of this, and perhaps be prompted to choose or warned of the choice made by the function.
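
One possible shape for such a shared helper is sketched below, using a toy S4 class with only the two slots that the first two excerpts inspect. `ToyDISCBIO` and `detect_clustering` are hypothetical names, and the t-SNE variant (which checks the tsne and MBtsne slots instead) would need its own condition or an extra argument.

```r
library(methods)

# Toy class carrying only the slots the excerpts above look at
setClass("ToyDISCBIO", representation(kmeans = "list", MBclusters = "list"))

# Hypothetical internal helper: report which clustering to use
detect_clustering <- function(object, caller = "this function") {
    ran_k <- length(object@kmeans$kpart) > 0
    ran_m <- length(object@MBclusters) > 0
    if (ran_k && ran_m) {
        # Make the existing k-means preference visible to the user
        message("both k-means and model-based results found; using k-means")
    }
    if (ran_k) return("K-means")
    if (ran_m) return("MB")
    stop("run clustexp or exprmclust before ", caller, call. = FALSE)
}

only_k <- new("ToyDISCBIO", kmeans = list(kpart = c(a = 1, b = 2)))
only_m <- new("ToyDISCBIO", MBclusters = list(clusterid = c(a = 1, b = 1)))
both <- new("ToyDISCBIO",
    kmeans = list(kpart = c(a = 1, b = 2)),
    MBclusters = list(clusterid = c(a = 1, b = 1))
)

choice_k <- detect_clustering(only_k)
choice_m <- detect_clustering(only_m)
choice_both <- suppressMessages(detect_clustering(both))
```

Each calling function could then branch on the returned label instead of repeating the slot checks, and the message keeps the k-means preference from being silent.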

