adeslatt / long-read-proteogenomics

This project forked from sheynkman-lab/long-read-proteogenomics

A workflow for enhanced protein isoform detection through integration of long-read RNA-seq and mass spectrometry-based proteomics.

License: MIT License

Python 47.05% Dockerfile 4.22% Jupyter Notebook 0.54% R 0.65% Nextflow 39.94% Shell 7.61%

long-read-proteogenomics's Introduction

Sheynkman-Lab/Long-Read-Proteogenomics

Updated: 2021 July 24

This is the repository for the Long-Read Proteogenomics workflow. Written in Nextflow, it is a modular workflow beneficial to both the transcriptomics and proteomics fields. The data from both long-read Iso-Seq sequencing with PacBio and mass spectrometry-based proteomics are used in the classification and analysis of protein isoforms expressed in Jurkat cells and are described in the publication Enhanced protein isoform characterization through long-read proteogenomics, which will be made public in Fall 2021.

A goal in the biomedical field is to delineate the protein isoforms that are expressed and have pathophysiological relevance. Towards this end, new approaches are needed to detect protein isoforms in clinical samples. Mass spectrometry (MS) is the main methodology for protein detection; however, poor coverage and incompleteness of protein databases limit its utility for isoform-resolved analysis. Fortunately, long-read RNA-seq approaches from PacBio and Oxford Nanopore platforms offer opportunities to leverage full-length transcript data for proteomics.

We introduce enhanced protein isoform detection through integrative “long read proteogenomics”. The core idea is to leverage long-read RNA-seq to generate a sample-specific database of full-length protein isoforms. We show that incorporation of long read data directly in the MS protein inference algorithms enables detection of hundreds of protein isoforms intractable to traditional MS. We also discover novel peptides that confirm translation of transcripts with retained introns and novel exons. Our pipeline is available as an open-source Nextflow pipeline, and every component of the work is publicly available and immediately extendable.

Proteogenomics is providing new insights into cancer and other diseases. The proteogenomics field will continue to grow, and, paired with increases in long-read sequencing adoption, we envision use of customized proteomics workflows tailored to individual patients.

We acknowledge the beginning kernels of this work were formed during the Fall of 2020 at the Cold Spring Harbor Laboratory Biological Data Science Codeathon.

We acknowledge Lifebit and the use of their platform, Lifebit CloudOS, which was key in the development of the open-source Nextflow workflow used in this work.

How to use this repository and Quick Start

This workflow is complex. It brings together two measurement technologies in a long-read proteogenomics approach that integrates sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. To orient the user to the steps involved in transforming raw measurement data into fully resolved, identified and annotated results, we have developed this quick start and wiki documentation, including vignettes.

How to use this repository

This repository is organized into modules, and parts of it could be useful to different researchers for annotating their own raw data. The workflow is written in Nextflow, allowing it to be run on virtually any platform with alterations to the configurations and other adaptations. The visitor is encouraged to fork, clone, adapt and contribute. All are encouraged to use GitHub Issues to communicate with the contributors to this open source software project. Software additions, modifications and contributions are made through GitHub Pull Requests.

Details of the module processes are documented in the Wiki within this repository, which also links to the third-party resources used in this workflow.

Vignettes have been developed to go into greater detail, walking the visitor through the visualization capabilities of the final annotated results and through the workflow presented here in the quick start.

Quick Start

This quick start was performed on a MacBook Pro running macOS Big Sur version 11.4 with 16 GB of 2667 MHz DDR4 RAM and a 2.3 GHz 8-core Intel Core i9 processor.

The visitor will be walked through the prerequisites, cloning the repository, and executing the workflow with the demonstration data that is also used in the GitHub Actions.

Obtain the Docker Desktop Application

In this quick start, the Docker Desktop application for Mac with an Intel chip was used. Follow the instructions there to install it.

Configure the Docker Desktop Application

On the MacBook Pro running macOS Big Sur version 11.4 with 16 GB of RAM, it was necessary to configure the Docker resources to use 6 GB of RAM.
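To confirm that the memory setting took effect, the value reported by the Docker daemon can be inspected from a terminal. The following is a minimal sketch, assuming the `docker` CLI is on the PATH and the daemon is running; the helper name `bytes_to_mib` is hypothetical and only used for illustration. The reported value may be somewhat lower than the configured amount.

```shell
# Convert a byte count to whole mebibytes (integer division).
# bytes_to_mib is a hypothetical helper, not part of this repository.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Ask the Docker daemon how much memory it can use, if docker is available.
if command -v docker >/dev/null 2>&1; then
  total_bytes=$(docker info --format '{{.MemTotal}}' 2>/dev/null || true)
  case "$total_bytes" in
    ''|*[!0-9]*)
      echo "Could not read memory from 'docker info' (is Docker running?)" ;;
    *)
      echo "Docker memory: $(bytes_to_mib "$total_bytes") MiB" ;;
  esac
else
  echo "docker not found on PATH"
fi
```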

Obtain miniconda

On the MacBook Pro, the 64-bit version of miniconda was downloaded and installed following the installation instructions.

Create and activate a new conda environment lrp.

conda create -n lrp
conda activate lrp

Install Nextflow.

Install and set the Nextflow version.

conda install -c bioconda nextflow -y
export NXF_VER=20.01.0
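
Before moving on, it can help to confirm that the tools installed so far are visible in the current shell. This is a minimal sketch, not part of the repository; the function name `check_tool` is hypothetical.

```shell
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper used only for this illustration.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the prerequisites used in this quick start.
missing=0
for tool in docker conda nextflow; do
  check_tool "$tool" || missing=1
done

# Show the Nextflow version pinned through the environment, if any.
echo "NXF_VER=${NXF_VER:-not set}"

if [ "$missing" -eq 1 ]; then
  echo "One or more prerequisites are missing; see the steps above."
fi
```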

Clone this repository

Now with the environment ready, we can clone.

git clone https://github.com/sheynkman-lab/Long-Read-Proteogenomics
cd Long-Read-Proteogenomics

Run the pipeline with the test_without_sqanti.config

This Quick start uses the test_without_sqanti.config configuration file found in the conf directory of this repository.

nextflow run main.nf --config conf/test_without_sqanti.config 
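
When experimenting with different configuration files, the command above can be wrapped in a small shell function. The sketch below is an illustration only: the function name `run_lrp` and the `DRY_RUN` variable are hypothetical, and adding Nextflow's `-resume` option (which reuses cached results from earlier runs) is an assumption about what is convenient, not something the quick start requires.

```shell
# Build and run the quick start command with an optional config path.
# run_lrp and DRY_RUN are hypothetical names used only for this sketch.
run_lrp() {
  config="${1:-conf/test_without_sqanti.config}"
  # Replace the function's positional parameters with the full command.
  set -- nextflow run main.nf --config "$config" -resume
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of executing it.
    echo "$*"
  else
    "$@"
  fi
}

# Show the command that would be executed, without starting the pipeline.
DRY_RUN=1 run_lrp
```

Calling `run_lrp` without `DRY_RUN=1` from the root of the cloned repository would start the pipeline with the test configuration.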

For details regarding the processes and results produced, please see the Wiki and the Vignette: Workflow with test data.

To visualize results, please see the visualization capabilities of the final annotated results.

Documentation and Workflow Vignettes

The sheynkman-lab/Long-Read-Proteogenomics pipeline comes with documentation. Details about each of the processes that make up the pipeline are found in the Wiki. There you will find:

  1. Third-party tools
  2. Input parameters
  3. Output files
  4. Pipeline processes descriptions
  5. Vignette: Visualization
  6. Vignette: Workflow with test data

Workflow overview

The workflow accepts as input raw PacBio data and performs the assembly of predicted protein isoforms with high probability of existing in the sample. This database is then used in MetaMorpheus to search raw mass spectrometry data against the PacBio reference. MetaMorpheus will use protein isoform read counts during protein inference. Two other protein databases are employed for the purposes of comparison. One is from UniProt and the other is from GENCODE. A series of Jupyter notebooks can be used to perform all final comparisons and data analysis.

LRP Pipeline_v2

Using Zenodo

To make the data more accessible and FAIR, the indexed files were transferred to Zenodo using zenodo-upload from the University of Virginia's Gloria Sheynkman Lab Amazon S3 buckets.

Using Nextflow, configuration items can access locations in Google Cloud Platform (GCP) buckets (gs://), Amazon Web Services (AWS) buckets (s3://) and Zenodo locations (https://) seamlessly.
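
As an illustration of what such configuration items might look like, the sketch below writes a small Nextflow configuration file with one entry per kind of location. The file name, the parameter names and every bucket, record and file name are hypothetical placeholders, not values used by this pipeline. Reading from GCP or AWS buckets would additionally require credentials for those platforms.

```shell
# Write an example Nextflow config with one placeholder per storage type.
# All names and locations below are hypothetical, for illustration only.
cat > example_locations.config <<'EOF'
// Hypothetical parameter names and placeholder locations.
params {
    example_from_gcp    = 'gs://example-bucket/path/to/input.fasta'
    example_from_aws    = 's3://example-bucket/path/to/input.fasta'
    example_from_zenodo = 'https://zenodo.org/record/RECORD_ID/files/input.fasta'
}
EOF

# Display the generated file.
cat example_locations.config
```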

The main reasons for using Zenodo rather than AWS S3 or GCP GS buckets are:

  1. Data versioning (of primary importance): In S3 or GS buckets, data can be overwritten at the same path at any point, possibly breaking the pipeline.
  2. Cost: These datasets are tiny, but the principle stands: the less storage, the better.
  3. Access: Most users of the pipeline can easily access Zenodo and will be able to use the data. AWS and GCP have entry barriers.
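
The versioning concern in the first point can be made concrete: if a file at a fixed path is silently replaced, a checksum recorded earlier no longer matches. The sketch below is a generic illustration, not part of this pipeline. It assumes an MD5 checksum was recorded for the file when it was first published, and the helper names `file_md5` and `verify_md5` are hypothetical.

```shell
# Print the MD5 checksum of a file, using md5sum (Linux) or md5 (macOS).
# file_md5 and verify_md5 are hypothetical helpers for this illustration.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d ' ' -f 1
  else
    md5 -q "$1"
  fi
}

# Succeed only if the file's checksum equals the expected value.
# $1: path to the file, $2: expected MD5 checksum
verify_md5() {
  [ "$(file_md5 "$1")" = "$2" ]
}

# Demonstration with a small local file standing in for a downloaded dataset.
printf 'hello\n' > lrp_md5_demo.txt
recorded=$(file_md5 lrp_md5_demo.txt)

# Simulate the file being overwritten at the same path.
printf 'changed\n' > lrp_md5_demo.txt
if verify_md5 lrp_md5_demo.txt "$recorded"; then
  echo "checksum matches"
else
  echo "checksum mismatch: the file changed since the checksum was recorded"
fi
```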

Details on how these data were transferred from AWS S3 buckets are described in AWS to Zenodo.

Contributors

This is a joint project between the Sheynkman Lab, the Smith Lab, Lifebit and Science and Technology Consulting, LLC.

Repository template

This pipeline was generated using a modification of the nf-core template. You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

long-read-proteogenomics's People

Contributors

bj8th, adeslatt, gsheynkman, rmillikin, cgpu, trishorts, rmmiller22
