YKO-barseq

Building singularity image

On a computer with sudo access, run:

# while in directory containing Singularity file
sudo singularity build bartender.sif Singularity

Running bartender extractor

singularity exec \
    -B /data/SBGE/cory/YKO-barseq:/home/wellerca/ \
    bartender/bartender.sif bartender_extractor_com \
    -f seq/MS3059950-600V3_19109498_S7_L001_R1_001.fastq \
    -o pre  \
    -p CGAGC[34]C -m 1

Output:

Running bartender extractor
bartender_extractor seq/MS3059950-600V3_19109498_S7_L001_R1_001.fastq pre 1 "(CGAG.|CGA.C|CG.GC|C.AGC|.GAGC)([ATCGN]{34})(C)" CGAGC C 3 1
Totally there are 1187764 reads in seq/MS3059950-600V3_19109498_S7_L001_R1_001.fastq file!
Totally there are 1118562 valid barcodes from seq/MS3059950-600V3_19109498_S7_L001_R1_001.fastq file
Totally there are 1118562 valid barcodes whose quality pass the quality condition
The estimated sequence error from the prefix and suffix parts is 0.0311966
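The log above shows how the pattern CGAGC[34]C with -m 1 is expanded into a regular expression: each position of the 5-nt prefix is replaced in turn by a wildcard, followed by a 34-nt capture group and the suffix. The sketch below only reproduces that expansion for illustration; the function names are hypothetical, and the real extractor does additional work (for example quality filtering) that is not modelled here.

```python
import re


def one_mismatch_alternatives(prefix: str) -> str:
    """Return a regex group matching `prefix` with at most one substituted base."""
    # Replace each position by '.', last position first, as in the logged regex.
    variants = [prefix[:i] + "." + prefix[i + 1:] for i in reversed(range(len(prefix)))]
    return "(" + "|".join(variants) + ")"


def build_pattern(prefix: str, length: int, suffix: str) -> str:
    """Assemble prefix group, fixed-length barcode group, and suffix group."""
    # The suffix is kept exact here because the logged regex shows it as (C).
    return f"{one_mismatch_alternatives(prefix)}([ATCGN]{{{length}}})({suffix})"


PATTERN = build_pattern("CGAGC", 34, "C")


def extract(read: str):
    """Return the 34-nt region between prefix and suffix, or None if absent."""
    match = re.search(PATTERN, read)
    return match.group(2) if match else None
```

PATTERN is identical to the regular expression printed in the extractor log, so extract() can be used to spot-check individual reads by hand.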

Formatting barcodes

The pre_barcode.txt file written by the extractor (output prefix pre) contains a 34-mer nucleotide sequence on each line, but we only want the 20-nucleotide barcode sequence contained within it.

python3 format_barcodes.py pre_barcode.txt > barcodes.txt
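format_barcodes.py itself is not reproduced here. As a rough sketch of what a trimming step of this kind might look like, the code below assumes that each input line has the form sequence,read_index and that the 20-nt barcode sits at a fixed offset inside the 34-mer. Both are assumptions: BARCODE_START is a hypothetical placeholder, and the actual script may locate the barcode differently.

```python
BARCODE_START = 0    # hypothetical offset of the barcode within the 34-mer
BARCODE_LENGTH = 20  # barcode length stated in the text


def trim_line(line, start=BARCODE_START, length=BARCODE_LENGTH):
    """Cut the barcode window out of one 'sequence,read_index' line."""
    seq, _, rest = line.rstrip("\n").partition(",")
    barcode = seq[start:start + length]
    if len(barcode) != length:
        return None  # sequence too short to contain a full barcode
    # Keep the trailing field (assumed to be a read index) if one is present.
    return f"{barcode},{rest}" if rest else barcode


def main(path):
    """Print the trimmed form of every usable line in `path`."""
    with open(path) as handle:
        for line in handle:
            trimmed = trim_line(line)
            if trimmed is not None:
                print(trimmed)
```

Calling main("pre_barcode.txt") and redirecting standard output to barcodes.txt would mirror the command above, under the stated assumptions.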

Running bartender cluster

singularity exec \
    -B /data/SBGE/cory/YKO-barseq:/home/wellerca/ \
    bartender/bartender.sif bartender_single_com  \
    -f barcodes.txt \
    -o barcode_clusters  \
    -d 2 \
    -s 5

Output:

Running bartender
Loading barcodes from the file
It takes 00:00:01 to load the barcodes from barcodes.txt
Shortest barcode length: 20
Longest barcode length: 20
Start to group barcode with length 20
Using two sample unpooled test
Transforming the barcodes into seed clusters
Initial number of unique reads:  64431
The distance threshold is 2
Clustering iteration 1
Clustering iteration 2
Clustering iteration 3
Clustering iteration 4
Identified 18272 barcodes with length 20
The clustering process takes 00:00:01
Start to dump clusters to file with prefix barcode_clusters
Start to remove pcr effects
***(Overall error rate estimated from the clustering result)***
Total number of clusters after removing PCR effects: 18272
The estimated error rate is 0.00340786
The overall running time 00:00:05 seconds.
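The log reports a distance threshold of 2 (the -d 2 option). Assuming this threshold is a Hamming distance between equal-length barcodes, the comparison can be illustrated as follows. This is only a sketch of the distance itself, with hypothetical function names; the clustering performed by bartender involves more than a pairwise check (seeding, iteration, and a statistical test, as the log indicates).

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def within_threshold(a: str, b: str, d: int = 2) -> bool:
    """True if the two barcodes differ at no more than d positions."""
    return hamming(a, b) <= d
```

For example, two 20-mers that differ at a single base are within the threshold, while two that differ at three bases are not.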

Take most abundant seq (consensus) per cluster and plot

library(data.table)
library(ggplot2)
library(ggthemes)  # provides theme_few(), used in the plots below
library(ggrepel)

dat <- fread('barcode_clusters_barcode.csv')

consensus <- dat[, .SD[which.max(Frequency)], by=Cluster.ID]

setnames(consensus, "Unique.reads", "consensus")
consensus[, Frequency := NULL]
setkey(consensus, Cluster.ID)
setkey(dat, Cluster.ID)

dat.merge <- merge(dat, consensus)
consensus_counts <- dat.merge[, list("N" = sum(Frequency)), by=consensus][order(-N)]
consensus_counts[, abundance_rank := 1:.N]

fwrite(consensus_counts, file="consensus_counts.csv", quote=F, row.names=F, col.names=T, sep=",")


consensus_counts[N > 3000, text_label := consensus]


ggplot(consensus_counts[N>=2], aes(x=abundance_rank, y=N)) + geom_point() +
scale_y_continuous(trans='log10', 
                    breaks=c(1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6),
                    labels=c("1", "10", "100", "1000", "10000", "100000", "1000000")) +
labs(x="Barcodes ranked by abundance",
        y="Abundance",
        title="Barcodes with cluster counts >= 2") +
theme_few(12) +
geom_text_repel(aes(label=text_label))

ggplot(consensus_counts[N>=2 & N < 10000], aes(x=abundance_rank, y=N)) + geom_point() +
labs(x="Barcodes ranked by abundance",
        y="Abundance",
        title="Barcodes with cluster counts >= 2 and < 10000") +
theme_few(12) +
geom_text_repel(aes(label=text_label))
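To make the consensus step easier to follow, here is the same logic on a few toy rows in plain Python: pick the most frequent sequence in each cluster as its consensus, sum all frequencies in the cluster, and rank by the total. The rows and the function name are invented for illustration and are not taken from the real data.

```python
from collections import defaultdict

# Toy rows in the column order (Unique.reads, Frequency, Cluster.ID).
rows = [
    ("AAAA", 10, 1),
    ("AAAT", 2, 1),
    ("CCCC", 5, 2),
    ("CCCG", 1, 2),
]


def consensus_counts(rows):
    """Return (consensus, N, abundance_rank) tuples, most abundant first."""
    best = {}                  # cluster id -> (sequence, frequency) of top read
    totals = defaultdict(int)  # cluster id -> summed frequency
    for seq, freq, cluster_id in rows:
        totals[cluster_id] += freq
        # Strict '>' keeps the first row on ties, like which.max in R.
        if cluster_id not in best or freq > best[cluster_id][1]:
            best[cluster_id] = (seq, freq)
    # Sum per consensus sequence, as the R code groups by consensus.
    per_consensus = defaultdict(int)
    for cluster_id, total in totals.items():
        per_consensus[best[cluster_id][0]] += total
    ranked = sorted(per_consensus.items(), key=lambda item: -item[1])
    return [(seq, n, rank) for rank, (seq, n) in enumerate(ranked, start=1)]
```

With the toy rows, cluster 1 collapses to AAAA with a count of 12 and cluster 2 to CCCC with a count of 6.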
