
Spatial Transcriptomics Pipeline


The ST Pipeline contains the tools and scripts needed to process and analyze raw Spatial Transcriptomics and Visium data in FASTQ format and to generate datasets for downstream analysis. The ST Pipeline can also be used to process single-cell RNA-seq data, as long as a file with the barcodes identifying each cell is provided (same template as the files in the folder "ids").

The ST Pipeline has been optimized for speed and robustness, and it is easy to use, with many parameters to adjust all the settings. The ST Pipeline is fully parallel and has constant memory use. It allows skipping any of the steps and using either the genome or the transcriptome as reference.

The following files/parameters are commonly required:

  • FASTQ files (read 1 containing the spatial information and the UMI, and read 2 containing the genomic sequence)
  • A genome index generated with STAR
  • An annotation file in GTF or GFF3 format (optional when using a transcriptome)
  • The file containing the barcodes and array coordinates (see the folder "ids" for reference). This file contains 3 columns (BARCODE, X and Y); a short example of the format follows this list. If you provide a file with barcodes identifying cells (for example), the ST Pipeline can be used for single-cell data. This file is optional if the data is not barcoded (for example, bulk RNA-seq data).
  • A name for the dataset
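
For reference, the barcode file is a plain tab-delimited text file with one barcode per row and the columns BARCODE, X and Y. A minimal sketch (the sequences and coordinates below are made up, not real barcodes):

AAGCTAGCTAGCTAGCTA  1  1
CCGTAGCTTCGATCGATG  2  1
GGATCGTACGATCGTACG  1  2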

The ST Pipeline has multiple parameters, mostly related to trimming, mapping and annotation, but generally the default values are good enough. You can see a full description of the parameters by typing "st_pipeline_run.py --help" after you have installed the ST Pipeline.

The input FASTQ files can also be provided in gzip or bzip2 format.

In its default mode, the ST Pipeline performs the following steps:

  • Quality trimming (read 1 and read 2):
    • Remove low-quality bases
    • Sanity checks (same read length, read order, etc.)
    • Check UMI quality
    • Remove artifacts (PolyT, PolyA, PolyG, PolyN and PolyC) of user-defined length
    • Check AT and GC content
    • Discard reads in which a minimum number of bases failed any of the checks above
  • Contaminant filter, e.g. rRNA genome (optional)
  • Mapping with STAR (only read 2)
  • Demultiplexing with Taggd (only read 1)
  • Keep reads (read 2) that contain a valid barcode and are correctly mapped
  • Annotate the reads with htseq-count (slightly modified version)
  • Group annotated reads by barcode (spot position), gene and genomic location (with an offset) to get a read count (a minimal sketch of this counting step follows the list)
  • In the grouping/counting, only unique molecules (UMIs) are kept
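
Conceptually, the grouping/counting step collapses reads that share the same spot barcode, gene and UMI into a single molecule. A minimal Python sketch of this idea (not the pipeline's actual implementation; it ignores the genomic-location offset and mismatch-tolerant UMI clustering):

from collections import defaultdict

def count_unique_molecules(reads):
    # reads: iterable of (barcode, gene, umi) tuples for annotated reads
    umis = defaultdict(set)
    for barcode, gene, umi in reads:
        umis[(barcode, gene)].add(umi)
    # the count for each (spot, gene) pair is the number of distinct UMIs
    return {key: len(umi_set) for key, umi_set in umis.items()}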

A more detailed graphical description of the workflow can be found in the documents workflow.pdf and workflow_extended.pdf.

The output is a matrix of counts (genes as columns, spots as rows). The ST pipeline will also output a log file with useful stats and information.
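
If you want to inspect the counts matrix in Python, here is a minimal sketch assuming the pandas package is installed and the output matrix is named stdata.tsv (the actual file name depends on --expName):

import pandas as pd

# spots as rows, genes as columns
counts = pd.read_csv("stdata.tsv", sep="\t", index_col=0)
print(counts.shape)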

Installation

We recommend installing a virtual environment manager such as pyenv or Anaconda before you install the pipeline.

The ST Pipeline requires Python 3.6 or higher.

You can install the ST Pipeline from PyPI:

pip install stpipeline

Alternatively, you can build the ST Pipeline yourself:

First clone the repository

git clone <stpipeline repository> 

or download a tar/zip from the releases section and unzip it

unzip stpipeline_release.zip

Access the cloned ST Pipeline folder or the folder where the tar/zip file has been decompressed.

cd stpipeline

To install the pipeline, type:

python setup.py build
python setup.py install

To run the tests, type:

python setup.py test
python -m unittest testrun.py

To see the different options, type:

st_pipeline_run.py --help

Requirements

The ST Pipeline requires STAR to be installed on the system (minimum version 2.5.4 if you use an ST Pipeline version >= 1.6.0): https://github.com/alexdobin/STAR

If you use Anaconda, you can install STAR with:

conda install -c bioconda star
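
The genome index for the --ref-map parameter can be built with STAR's genomeGenerate mode, for example (the paths, file names and thread count below are placeholders):

STAR --runMode genomeGenerate --genomeDir /path/to/index --genomeFastaFiles genome.fa --sjdbGTFfile annotation.gtf --runThreadN 8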

The ST Pipeline requires Samtools to be installed on the system. If you use Anaconda, you can install Samtools with:

conda install -c bioconda samtools openssl=1.0

The ST Pipeline needs a computer with at least 32 GB of RAM (depending on the size of the genome) and 8 CPU cores.

Dependencies

The ST Pipeline depends on some Python packages that will be automatically installed during the installation process. You can see them in the file dependencies.txt

Example

An example run would be

st_pipeline_run.py --expName test --ids ids_file.txt --ref-map path_to_index --log-file log_file.txt --output-folder /home/me/results --ref-annotation annotation_file.gtf file1.fastq file2.fastq 

Visium

To process Visium datasets it is recommended to use these options:

--demultiplexing-mismatches 1
--demultiplexing-kmer 4
--umi-allowed-mismatches 2
--umi-start-position 16
--umi-end-position 28
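
Putting these together with the basic example above, a Visium run could look like this (file names and paths are placeholders):

st_pipeline_run.py --expName visium_test --ids visium_barcodes.txt --ref-map path_to_index --ref-annotation annotation_file.gtf --demultiplexing-mismatches 1 --demultiplexing-kmer 4 --umi-allowed-mismatches 2 --umi-start-position 16 --umi-end-position 28 --log-file log_file.txt --output-folder /home/me/results file1.fastq file2.fastq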

Ensembl IDs

If you used an Ensembl annotation file and would like to change the output file so it contains gene names instead of Ensembl IDs, you can use this tool that comes with the ST Pipeline:

convertEnsemblToNames.py --annotation path_to_annotation_file --output st_data_updated.tsv st_data.tsv

Merge demultiplexed FASTQ files

If you used different indexes for sequencing and need to merge the files, you can use the script merge_fastq.py:

merge_fastq.py --run-path path_to_run_folder --out-path path_to_output --identifiers S1 S2 S3 S4

where the identifiers are strings that identify each demultiplexed sample.

Filter out genes by gene type

If you want to remove genes corresponding to certain gene types from the dataset (matrix in TSV), for instance to keep only protein_coding genes, you can do so with the script filter_gene_type_matrix.py:

filter_gene_type_matrix.py --gene-types-keep protein_coding --annotation path_to_annotation_file stdata.tsv

You may include the parameter --ensembl-ids if your genes are represented as Ensembl gene IDs instead of gene names.
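
For example (the annotation path and matrix name below are placeholders):

filter_gene_type_matrix.py --gene-types-keep protein_coding --ensembl-ids --annotation path_to_annotation_file stdata.tsv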

Remove spots from dataset

If you want to remove spots from a dataset (matrix in TSV), for instance to keep only the spots inside the tissue, you can do so with the script adjust_matrix_coordinates.py:

adjust_matrix_coordinates.py --outfile new_stdata.tsv --coordinates-file coordinates.txt stdata.tsv

where coordinates.txt is a tab-delimited file with 6 columns:

orig_x orig_y new_x new_y new_pixel_x new_pixel_y

Only spots whose coordinates are present in the file will be kept, and you can optionally update the coordinates in the matrix, choosing either the new array coordinates or the new pixel coordinates.
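
For example, a coordinates file could look like this (the values below are made up; check the script's --help for the exact expected format):

orig_x  orig_y  new_x  new_y  new_pixel_x  new_pixel_y
3       5       3      5      1450.2       2310.8
4       5       4      5      1580.6       2312.1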

Quality stats

The ST Pipeline generates useful statistical information in the log file, but if you want to obtain more detailed information about the quality of the data, you can run the following script:

st_qa.py stdata.tsv 

If you want to compute quality stats for multiple samples, you can run:

multi_qa.py stdata1.tsv stdata2.tsv stdata3.tsv stdata4.tsv

multi_qa.py generates violin plots, correlation plots/tables and other useful information, and it allows log-transforming the counts for the correlations.

Documentation

You can see a more detailed documentation in the folder "doc_out".

Example data

A real dataset, obtained from the public data of the following publication (http://science.sciencemag.org/content/353/6294/78), can be found in the folder called "data".

License

The ST Pipeline is open source under the MIT license, which means that you can use it, change it and redistribute it, but you must always refer to our license (see LICENSE and AUTHORS).

Reference

If you use the ST Pipeline, please cite its publication:

ST Pipeline: an automated pipeline for spatial mapping of unique transcripts. Bioinformatics. doi: 10.1093/bioinformatics/btx211

Contact

For questions, bugs, feedback, etc., you can contact Jose Fernandez Navarro: [email protected]


st_pipeline's Issues

importing the ST pipeline output to seurat/STutility

Hello, I received Visium data processed with the ST Pipeline and am having trouble creating a Seurat or STutility object (in R). Do you have any suggestions? I couldn't find anything available so far, since both tools focus on Space Ranger output and expect that as the standard input. Thank you for the help.

spot coord problem

Hi Jose,

I've been trying to extract the spot coordinates from the Staffli object in order to integrate single-cell RNA-seq with the spatial data.
I got the x and y with data.frame(GetStaffli(A1_ST)[[]][,c(1,2)]).
When I run my own code, the plot looks like this:
image001

If I use ST.FeaturePlot, the plot looks like this:
image003

The image generated by ST.FeaturePlot is correct since it is the same as CellRanger's. But the image from my code is upside down. I want to figure out what I did wrong, i.e. whether I picked up the wrong coordinate values. Would you please advise? Thanks

Running st pipeline “name 'out_rw' is not defined”

Hi,

Thank you very much for your work.

I ran the ST Pipeline with the following command:

st_pipeline_run.py \
--output-folder $OUTPUT \
--temp-folder $TMP \
--umi-start-position 16 \
--umi-end-position 26 \
--ids /home/zhangqiang/spatial_trans/$ID \
--ref-map $MAP \
--ref-annotation $ANN \
--expName $EXP \
--htseq-no-ambiguous \
--verbose \
--threads 16 \
--log-file $OUTPUT/${EXP}_log.txt \
--star-two-pass-mode \
--no-clean-up \
$FW $RV

I get this error message :

ST Pipeline, parameters loaded
ST Pipeline, logger created
ST Pipeline, sanity check passed. Starting the run.
Error running the pipeline
name 'out_rw' is not defined

Do you know what causes this error? How can I fix it?

Thank you very much for your help!!

Qiang

problem installing

Dear all,

Sadly I am not able to install the stpipeline. When using pip in a fresh conda env I get the following error:

Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/private/var/folders/z4/dc80cgy93pj6rgjpx46mnh680000gn/T/pip-install-9dekewms/taggd_cbeca8e7f968423694db01796dba5096/setup.py", line 4, in
import numpy
ModuleNotFoundError: No module named 'numpy'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

When using git clone, I also get an error when running setup.py. It says it can't find the file stpipeline/common/cdistance.c, which indeed is not in that directory!

Happy for your help,
philipp

Doubt about the output files?

Respected one, I have installed stpipeline and run the command. It generated a read.bed, a stats.csv and a log.txt file.
Actually, for my work I need the BAM files that are generated after STAR for my spatial data, and I need the BAM for further downstream analysis. That BAM should have information on the neighboring spatial cells. Can you help me with how to get the BAM here, or with the STAR command (especially for ST data) to get my desired BAM files? Hoping for a reply.

Spot barcode and UMI tags in the BAM file

Hello,

I am interested in the intermediate BAM file (i.e. the analysis-ready BAM), and I would like to identify the tags that represent the barcodes and the UMIs in the BAM. I am working with an earlier-generation ST (pre-Visium) dataset.

  1. Following SpatialTranscriptomicsResearch#112, I have used
    --temp-folder $TEMPDIR
    --no-clean-up, and considered $TEMPDIR/annotated.bam. Please correct me if I am wrong here, especially as $TEMPDIR also contains demultiplexed_matched.bam, mapped.bam, and R2_quality_trimmed.bam, and I am not sure which one to pick.

  2. I am guessing that 'B0' represents the barcodes and 'B3' represents the UMIs in the BAM file. Is this correct?

Can you please help me with this?

Here is my input.

st_pipeline_run.py \
--expName P2R1 \
--ids $IDS \
--ref-map $STAR_OUT \
--star-two-pass-mode \
--output-folder $OUT \
--temp-folder $TEMP \
--no-clean-up \
--threads 5 \
--ref-annotation $GTF $F1 $F2

Import error: _intel_fast_memcpy

ImportError: /rwthfs/rz/cluster/home/nm514167/miniconda3/envs/bam/lib/python3.7/site-packages/st_pipeline/stpipeline/common/unique_events_parser.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _intel_fast_memcmp

When I run the ST Pipeline, it shows an error when importing unique_events_parser: there is an undefined symbol, _intel_fast_memcmp.

Having trouble understanding how to use this pipeline with real-world spatial data.

Hi,
I am trying this out on the example 10x Visium dataset for prostate cancer (https://www.10xgenomics.com/resources/datasets/human-prostate-cancer-adenocarcinoma-with-invasive-carcinoma-ffpe-1-standard-1-3-0).

My end use case is to have a basic image model that could predict the expression of a given gene across the entire tissue at the cell level.

In this dataset, in the input files section, we have FASTQ files and a probe set file, and that's it.
For the other output files, I'm not sure whether those could (or need to) be used here.

If you could help with some concrete steps for using this, it would really help a lot.

Thank you in advance.
