pachterlab / scrna-seq-tcc-prep

Preprocessing of single-cell RNA-Seq (deprecated)

Home Page: http://kallistobus.tools

License: GNU General Public License v3.0

Jupyter Notebook 98.68% Python 1.32%

scrna-seq-tcc-prep's Introduction

The methods implemented in the code in this repository have been superseded by the kallisto|bustools workflow.

Single-cell RNA-Seq TCC prep

This repository contains the scripts needed to generate transcript compatibility count (TCC) matrices from single-cell RNA-Seq data. The workflow covers error correction of barcodes, collapsing of UMIs, and pseudoalignment of reads to a transcriptome to obtain transcript compatibility counts. The scripts use kallisto for pseudoalignment.

We currently support the 10X Chromium technology; support for more technologies is underway.

Instructions for processing 10X Chromium 3' digital expression data

Getting started

The getting started tutorial explains how to process the small example in the example_dataset directory. This is a good starting point for checking that the necessary programs are correctly installed. You will need kallisto (≥ 0.43.0), python (≥ 2.7.10), scipy (≥ 0.16.0), scikit-learn (= 0.16.1) and Jupyter Notebook (≥ 4.0.6) installed; Jupyter is not strictly necessary but is highly recommended. Please note that there appears to be a problem with tSNE in scikit-learn v0.17.1.

Workflow organization

The processing workflow consists of four steps:

  1. Preparation of a configuration file that contains the parameters needed for the processing.
  2. Identification of "true" cell barcodes according to read coverage followed by error correction when possible.
  3. Creation of read/UMI files for each cell.
  4. Pseudoalignment of reads associated with each cell using kallisto, deduplication according to UMIs, and generation of transcript compatibility counts (TCCs) for each cell.

Following the pre-processing, the transcript compatibility counts (TCC) matrix can be analyzed using a Jupyter Notebook.

Creation of the configuration file

The parameters needed to run the processing are specified in a config.json file. The following parameters need to be specified:

  • NUM_THREADS: the number of threads available for processing.
  • WINDOW: a lower and an upper threshold for the expected number of cells in the experiment. It is used to determine the number of cells in the experiment from read coverage data.
  • SOURCE_DIR: path to the source directory that contains the .py scripts
  • BASE_DIR: this must contain the path to the (demultiplexed) FASTQ files from the sequencing. Note that our workflow does not currently demultiplex reads and you may have to do so with 10X's software; we plan to provide a demultiplexing script in the future.
  • sample_idx: The sample index used for the run e.g., for SI-3A-A10 -> sample_idx: ["ACAGCAAC", "CGCAATTT", "GAGTTGCG", "TTTCGCGA"].
  • SAVE_DIR: path to a directory where intermediate files will be saved.
  • dmin: the minimum distance between barcodes needed for error correction to be performed.
  • BARCODE_LENGTH: length of the barcodes.
  • OUTPUT_DIR: directory in which to output results.
  • kallisto: the path to the kallisto binary, the location of the kallisto index file for the appropriate transcriptome, and the path where the TCC matrix will be saved.
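
As a rough illustration, a config.json containing the parameters listed above could be generated as follows. Every path and value here is a placeholder, and the sub-key names under kallisto (binary, index, TCC_output) are assumed for illustration only; check the example configuration shipped with the repository for the exact layout.

```python
import json

# Placeholder values only; adjust every path and number to your own run.
config = {
    "NUM_THREADS": 8,
    "WINDOW": [500, 5000],  # lower/upper bound on the expected number of cells
    "SOURCE_DIR": "/path/to/source/",
    "BASE_DIR": "/path/to/demultiplexed_fastq/",
    "sample_idx": ["ACAGCAAC", "CGCAATTT", "GAGTTGCG", "TTTCGCGA"],  # SI-3A-A10
    "SAVE_DIR": "/path/to/intermediate_files/",
    "dmin": 5,
    "BARCODE_LENGTH": 14,  # placeholder; use the barcode length of your chemistry
    "OUTPUT_DIR": "/path/to/output/",
    # The three sub-key names below are assumptions, not taken from this README.
    "kallisto": {
        "binary": "/path/to/kallisto",
        "index": "/path/to/transcriptome.idx",
        "TCC_output": "/path/to/tcc_output/",
    },
}

config_text = json.dumps(config, indent=2)
print(config_text)  # write this text to config.json
```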

Barcode analysis and selection

The workflow operates on demultiplexed Chromium-prepared sequencing samples (the raw barcode and read files can be converted to FASTQ using the cellranger demux 10x software). Once the gzipped FASTQ files have been obtained, the first step in our workflow is to identify "true" barcodes, and to error correct barcodes that are close to true barcodes, yet associated with sufficiently low read coverage to be confidently identified as containing an error. The script get_cell_barcodes.py in the source directory performs the identification and error correction and is called with python get_cell_barcodes.py config.json.
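
The idea behind the error correction can be sketched as follows. This is a simplified illustration and not the code in get_cell_barcodes.py: it assumes that a low-coverage barcode is reassigned when it lies at Hamming distance 1 from exactly one "true" barcode, and that a true barcode accepts corrections only if it is at least dmin away from every other true barcode. The function names and the toy barcodes are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcodes(counts, cell_barcodes, dmin_threshold):
    """Map non-cell barcodes to a true barcode at Hamming distance 1.

    A true barcode accepts corrections only if its nearest other true
    barcode is at least dmin_threshold mismatches away.
    """
    cells = list(cell_barcodes)
    # d_min of each true barcode: distance to its nearest other true barcode.
    dmin = {
        c: min((hamming(c, o) for o in cells if o != c), default=len(c))
        for c in cells
    }
    correctable = [c for c in cells if dmin[c] >= dmin_threshold]
    mapping = {}
    for barcode in counts:
        if barcode in cell_barcodes:
            continue
        hits = [c for c in correctable if hamming(barcode, c) == 1]
        if len(hits) == 1:  # unambiguous single-mismatch neighbour
            mapping[barcode] = hits[0]
    return mapping


# Toy read counts per barcode (made-up data).
counts = Counter({"AAAAAAAA": 900, "CCCCCCCC": 850, "AAAAAAAT": 3, "GGGGTTTT": 2})
mapping = correct_barcodes(counts, {"AAAAAAAA", "CCCCCCCC"}, dmin_threshold=5)
print(mapping)
```

In this toy input, AAAAAAAT is one mismatch away from AAAAAAAA and is reassigned to it, while GGGGTTTT is far from both true barcodes and is left uncorrected.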

While get_cell_barcodes.py can be run from the command line, we strongly encourage users to instead perform this step using the Jupyter Notebook 10xGet_cell_barcodes.ipynb in the notebooks directory. The interactive notebook produces summary statistics and figures that are useful for both quality control and for the setting of parameters for error correction.

Cell file generation

Once barcodes have been identified and (some) erroneous barcodes corrected, the next step is to generate individual read and UMI files for each cell for processing by kallisto. This can be performed with the command python error_correct_and_split.py config.json.
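
Conceptually, this step groups reads and UMIs by their (corrected) cell barcode. The sketch below illustrates that grouping in memory; it is not the implementation in error_correct_and_split.py, and the record layout (barcode, UMI, read) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def split_by_cell(records, cell_barcodes, correction):
    """Group (barcode, umi, read) records by corrected cell barcode."""
    per_cell = defaultdict(lambda: {"umis": [], "reads": []})
    for barcode, umi, read in records:
        # Apply the error-correction map, if this barcode has an entry.
        barcode = correction.get(barcode, barcode)
        if barcode not in cell_barcodes:
            continue  # not a cell barcode, even after correction
        per_cell[barcode]["umis"].append(umi)
        per_cell[barcode]["reads"].append(read)
    return dict(per_cell)


# Toy records (made-up data).
records = [
    ("AAAAAAAA", "ACGT", "read1"),
    ("AAAAAAAT", "TTGA", "read2"),  # corrected to AAAAAAAA
    ("GGGGTTTT", "CCCC", "read3"),  # discarded
]
cells = split_by_cell(records, {"AAAAAAAA"}, {"AAAAAAAT": "AAAAAAAA"})
print(cells)
```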

Pseudoalignment

The computation of transcript compatibility counts is performed using kallisto by running python compute_TCCs.py config.json followed by python prep_TCC_matrix.py config.json. The first script runs kallisto and the second script computes a pairwise distance matrix between cells that is essential for analysis. Running the two scripts generates the three files needed for analysis: TCC_matrix.dat, pwise_dist_L1.dat and nonzero_ec.dat.
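
To make the pairwise distance concrete, the following sketch computes L1 distances between cells from a small TCC matrix. It assumes that each cell's counts are first normalized to sum to 1, which this README does not state; the function names and the toy matrix are illustrative and are not taken from prep_TCC_matrix.py.

```python
def normalize(row):
    """Scale one cell's counts so that they sum to 1 (assumed preprocessing)."""
    total = float(sum(row))
    return [x / total for x in row] if total else list(row)


def pairwise_l1(tcc_rows):
    """Symmetric matrix of L1 distances between normalized TCC rows."""
    probs = [normalize(row) for row in tcc_rows]
    n = len(probs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(abs(a - b) for a, b in zip(probs[i], probs[j]))
            dist[i][j] = dist[j][i] = d
    return dist


# Rows are cells, columns are equivalence classes (made-up counts).
tcc = [[4, 0, 4], [2, 0, 2], [0, 6, 0]]
dist = pairwise_l1(tcc)
for row in dist:
    print(row)
```

With normalization, the first two cells have identical profiles and are at distance 0, while the third cell shares no equivalence class with them and is at the maximum distance of 2.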

Note that the entire workflow can be run using the master script 10xDetect_and_Prep.py, although, as explained above, we recommend examining the barcode data using the Jupyter Notebook 10xGet_cell_barcodes.ipynb. After the barcode analysis and selection step, the rest of the workflow can be completed by running python 10xPrepData.py config.json.
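
For reference, the individual commands named in the sections above can be chained from Python as in the sketch below. The script names come from this README; the assumption that all four scripts sit in the same source directory, and the helper names build_commands and run_workflow, are illustrative.

```python
import os
import subprocess

# Script names as given in this README, in the order they are run.
STEPS = [
    "get_cell_barcodes.py",
    "error_correct_and_split.py",
    "compute_TCCs.py",
    "prep_TCC_matrix.py",
]


def build_commands(source_dir, config_path):
    """Return one command (as an argument list) per workflow step."""
    return [["python", os.path.join(source_dir, step), config_path] for step in STEPS]


def run_workflow(source_dir, config_path):
    """Run the steps in order, stopping at the first one that fails."""
    for cmd in build_commands(source_dir, config_path):
        subprocess.check_call(cmd)


# Only print the commands here; call run_workflow(...) to execute them.
commands = build_commands("source", "config.json")
for cmd in commands:
    print(" ".join(cmd))
```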

Analysis

The TCC_matrix.dat file contains a matrix that specifies, for each cell, a list of transcript sets with associated counts. Those counts, called transcript compatibility counts, are explained in Ntranos et al. 2016. They are the starting point for downstream analysis of the data.
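
As a conceptual sketch of what a transcript compatibility count is, the example below counts, for one cell, the number of distinct UMIs observed for each set of compatible transcripts. It assumes pseudoalignment has already assigned each read a transcript set; the input layout and the function name are hypothetical, and kallisto's internal deduplication is not reproduced here.

```python
from collections import defaultdict


def tcc_for_cell(assignments):
    """Count distinct UMIs per transcript set for one cell.

    assignments: iterable of (transcript_ids, umi) pairs, one per read.
    """
    umis_per_set = defaultdict(set)
    for transcripts, umi in assignments:
        # Reads with the same transcript set and the same UMI count once.
        umis_per_set[frozenset(transcripts)].add(umi)
    return {tset: len(umis) for tset, umis in umis_per_set.items()}


# Toy assignments (made-up transcript ids and UMIs).
assignments = [
    (["t1", "t2"], "ACGT"),
    (["t2", "t1"], "ACGT"),  # duplicate of the first read: same set, same UMI
    (["t1", "t2"], "GGCA"),
    (["t3"], "ACGT"),
]
tcc = tcc_for_cell(assignments)
print(tcc)
```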

The analysis workflow for an experiment will depend on the specifics of the data and the questions associated with it. To help users get started, we have provided two examples based on datasets distributed by 10X: an experiment with both human and mouse cells and an analysis of peripheral blood mononuclear cells.

Contributions

This 10X Chromium 3' digital expression processing workflow was designed and implemented by Vasilis Ntranos with some input from Lior Pachter. Páll Melsted added the --umi option to kallisto which allows for deduplicating reads using associated unique molecular identifiers (UMIs).

scrna-seq-tcc-prep's People

Contributors

dylanbannon, lakigigar, vasilisnt

scrna-seq-tcc-prep's Issues

Very low number of cells detected

Hi, I'm confused by this output of the first script:

get_cell_barcodes.py

...
NUMBER_OF_SEQUENCED_BARCODES = 35910128
Detecting Cells...
NUM_OF_DISTINCT_BARCODES = 5309260
CELL_WINDOW: [100, 5000]
Cell_barcodes_detected: 100
NUM_OF_READS_in_CELL_BARCODES = 3200555
Calculating d_min...
number of cell barcodes to error-correct: 86 ( dmin >= 5 )
Writing output...
....

It sounds fishy to me that it always detects whatever I specify as the lowest number of cells.
If I set the window [500,5000] cells, it will find 500 cells. If I set it to 100,5000, it will find 100 cells. Is this expected? If not, any ideas what I'm doing wrong?

thanks!
Max

kallisto transcriptome website is down

Hi,

I am following the getting started tutorial (http://pachterlab.github.io/kallisto/10xstarting.html) for single cell RNAseq but when I get to the point:

Download the “human-mouse” transcriptome from the kallisto transcriptome website.

I cannot access the link (http://pachterlab.github.io/kallisto/transcriptomes/), I get the following message:

Sorry this page does not exist =(

Could you direct me to the right website? alternatively, could you describe the "human-mouse" transcriptome file?

Thanks!!

Code outdated

Hi,

I would like to try your code for a 10x scRNA-Seq analysis. However, in the latest V2 chemistry, they changed the design of the barcodes and of the R1 and R2 reads. In short, R1 is 26 bp long and includes the cell barcode and UMI, while R2 is 98 bp long and contains the transcript sequence. The independent I7 index is saved in an I1.fastq file. So your Jupyter notebook is almost useless without substantial changes.
Not sure whether you will be actively developing this code or not, but just in case you are, please check 10x for their latest development.

Best,
Ying
