
sheynkman-lab / long-read-proteogenomics


A workflow for enhanced protein isoform detection through integration of long-read RNA-seq and mass spectrometry-based proteomics.

License: MIT License

Python 55.90% Dockerfile 3.51% Jupyter Notebook 0.45% R 0.54% Nextflow 33.26% Shell 6.34%
pipeline proteomics long-read-sequencing isoforms nextflow

long-read-proteogenomics's Introduction


This repository contains the complete software and documentation to execute the Long-Read-Proteogenomics workflow.

Digital Object Identifiers

For the Genome Biology Manuscript: Enhanced Protein Isoform Characterization through Long Read Proteogenomics.

DOI | Description
(Zenodo DOI badge) | Contains the version of the repository used for execution and generation of data
(Zenodo DOI badge) | Contains the input data from Jurkat samples and reference data used in execution of the Long-Read-Proteogenomics workflow contained in this repository
(Zenodo DOI badge) | Contains the output data from executing the Long-Read-Proteogenomics workflow using the Zenodo version of this repository
(Zenodo DOI badge) | Contains the versions of the analysis code and figure-generation code that take as input the output data from executing the Long-Read-Proteogenomics workflow version specified above
(Zenodo DOI badge) | Contains the test data used with the GitHub Actions to ensure changes to this repository still execute and perform correctly
Sequence Read Archive (SRA) Project Reference | Description
PRJNA783347 | Long-read RNA sequencing project for Jurkat samples
PRJNA193719 | Short-read RNA sequencing project for Jurkat samples

Sheynkman-Lab/Long-Read-Proteogenomics

Updated: 2022 January 30

This is the repository for the Long-Read Proteogenomics workflow. Written in Nextflow, it is a modular workflow beneficial to both the transcriptomics and proteomics fields. Data from both long-read Iso-Seq sequencing with PacBio and mass spectrometry-based proteomics were used in the classification and analysis of protein isoforms expressed in Jurkat cells, as described in the publication Enhanced protein isoform characterization through long-read proteogenomics, which will be made public in Fall 2022.

The output data resulting from the execution of this workflow for the manuscript Enhanced Protein Isoform Characterization through Long Read Proteogenomics may be found here [insert Zenodo Reference here]. The analysis used to produce the figures for the manuscript may be found in the companion repository Long-Read Proteogenomics Analysis.

A goal in the biomedical field is to delineate the protein isoforms that are expressed and have pathophysiological relevance. Towards this end, new approaches are needed to detect protein isoforms in clinical samples. Mass spectrometry (MS) is the main methodology for protein detection; however, poor coverage and incompleteness of protein databases limit its utility for isoform-resolved analysis. Fortunately, long-read RNA-seq approaches from PacBio and Oxford Nanopore platforms offer opportunities to leverage full-length transcript data for proteomics.

We introduce enhanced protein isoform detection through integrative “long read proteogenomics”. The core idea is to leverage long-read RNA-seq to generate a sample-specific database of full-length protein isoforms. We show that incorporation of long read data directly in the MS protein inference algorithms enables detection of hundreds of protein isoforms intractable to traditional MS. We also discover novel peptides that confirm translation of transcripts with retained introns and novel exons. Our pipeline is available as an open-source Nextflow pipeline, and every component of the work is publicly available and immediately extendable.

Proteogenomics is providing new insights into cancer and other diseases. The proteogenomics field will continue to grow, and, paired with increases in long-read sequencing adoption, we envision use of customized proteomics workflows tailored to individual patients.

We acknowledge the beginning kernels of this work were formed during the Fall of 2020 at the Cold Spring Harbor Laboratory Biological Data Science Codeathon.

We acknowledge Lifebit and the use of their platform, Lifebit CloudOS, which was key in the development of the open-source Nextflow workflow used in this work.

How to use this repository and Quick Start

This workflow is complex, bringing together two measurement technologies in a long-read proteogenomics approach that integrates sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. To orient the user to the steps involved in transforming raw measurement data into these fully resolved, identified, and annotated results, we have developed this quick start and wiki documentation, including vignettes.

How to use this repository

This repository is organized into modules, and parts of it may be useful to different researchers annotating their own raw data. The workflow is written in Nextflow, allowing it to be run on virtually any platform with alterations to the configurations and other adaptations. The visitor is encouraged to fork, clone, adapt, and contribute. All are encouraged to use GitHub Issues to communicate with the contributors to this open-source software project. Software additions, modifications, and contributions are made through GitHub Pull Requests.

Module process details are documented in the Wiki of this repository, along with links to the third-party resources used in this workflow.

Vignettes have been developed to go into greater detail: one walks the visitor through the visualization capabilities of the final annotated results, and another walks the visitor through the workflow presented here in the quick start.

Quick Start

These quick start steps were performed on a MacBook Pro running Big Sur Version 11.4 with 16 GB 2667 MHz DDR4 RAM and a 2.3 GHz 8-core Intel Core i9 processor.

The visitor will be walked through the prerequisites, cloning the repository, and executing the workflow with the demonstration data also used in the GitHub Actions.

Obtain the Docker Desktop application

In this quick start, the Docker Desktop application for a Mac with an Intel chip was used. Follow the instructions there to install.

Configure the Docker Desktop application

On the MacBook Pro running Big Sur Version 11.4 with 16 GB RAM, it was necessary to configure the Docker Desktop resources to use 6 GB of RAM.

Obtain and install miniconda

On the MacBook Pro, the 64-bit version of miniconda was downloaded and installed following the installation instructions.

Create and activate a new conda environment lrp.

To begin, open a terminal window. To ensure the miniconda installation has taken effect, restart the terminal shell. On the Mac, this is done within a zsh shell environment.

exec -l zsh

If you already have the environment, you can see what conda environments you have with the following command:

conda info --envs

If you haven't already created a conda environment for this work, create and activate it now.

conda create -n lrp
conda activate lrp

Install Nextflow.

Install and set the Nextflow version.

conda install -c bioconda nextflow -y
export NXF_VER=20.01.0

Clone this repository

Now with the environment ready, we can clone.

git clone https://github.com/sheynkman-lab/Long-Read-Proteogenomics
cd Long-Read-Proteogenomics

Run the pipeline with the test_without_sqanti.config


This Quick start uses the test_without_sqanti.config configuration file found in the conf directory of this repository.

nextflow run main.nf --config conf/test_without_sqanti.config 

For details regarding the processes and results produced, please see the Wiki and the Vignette: Workflow with test data.

To visualize results, please see the visualization capabilities of the final annotated results.

Documentation and Workflow Vignettes

Details about each of the processes that make up the sheynkman-lab/Long-Read-Proteogenomics pipeline are found in the Wiki. There you will find:

  1. Third-party tools
  2. Input parameters
  3. Output files
  4. Pipeline processes descriptions
  5. Vignette: Visualization
  6. Vignette: Workflow with test data

Workflow overview

The workflow accepts raw PacBio data as input and performs the assembly of predicted protein isoforms with a high probability of existing in the sample. This database is then used in MetaMorpheus to search raw mass spectrometry data against the PacBio reference. MetaMorpheus uses protein isoform read counts during protein inference. Two other protein databases are employed for comparison: one from UniProt and the other from GENCODE. A series of Jupyter notebooks can be used to perform all final comparisons and data analysis.

(Workflow diagram: LRP Pipeline_v2)

Using Zenodo

To make the data more accessible and FAIR, the indexed files were transferred from the University of Virginia's Gloria Sheynkman Lab Amazon S3 buckets to Zenodo using zenodo-upload.

Using Nextflow, configuration items can access locations in Google Compute Platform (GCP) buckets (gs://), Amazon Web Services (AWS) buckets (s3://) and Zenodo locations (https://) seamlessly.

The main reasons for choosing Zenodo over AWS S3 or GCP GS are:

  1. Data versioning (of primary importance): in S3 or GS buckets, data can be overwritten for the same path at any point, possibly breaking the pipeline.
  2. Cost: these datasets are tiny, but the principle stands: the less storage the better.
  3. Access: most users of the pipeline can most easily access Zenodo and will be able to use the data. AWS and GCP have entry barriers.

Details on how these data were transferred and moved from AWS S3 buckets are described in the AWS to Zenodo documentation.

Contributors

This is a joint project between the Sheynkman Lab, the Smith Lab, Lifebit and Science and Technology Consulting, LLC.

Repository template

This pipeline was generated using a modification of the nf-core template. You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x. ReadCube: Full Access Link

long-read-proteogenomics's People

Contributors

adeslatt · bj8th · cgpu · gsheynkman · mayankmurali · rmillikin · rmmiller22 · trishorts


long-read-proteogenomics's Issues

Readme template

module name:
overview:
input:

  • A
  • B
  • C

output:

  • D
  • E
  • F

source module(s):
target module(s):
dependencies:
threads:
original source:
shell:

New Zenodo dataset

Added CPAT input module files (Hexamer, logitModel) to Zenodo.

10.5281/zenodo.4263373

Add tutorials shared in Slack to GitHub for longevity

I am self-assigning myself for most of the tutorials shared in Slack, to be added to GitHub for longevity; happy to share the tutorial write-ups if someone offers to contribute as well.

Tutorials that shouldn't be forgotten after the codeathon:

  • github I: Contribution etiquette, issues for tracking tasks, pull requests (author's and reviewer's perspective)
  • github II: Command-line git commands for developing and pushing code to GitHub; how to resolve conflicts and keep branches in sync when many people contribute
  • conda: Basic commands for managing dependencies, including creating and activating envs and finding and installing packages
  • docker: Finding, building, and using Docker images/containers; setting up a DockerHub account; docker login from the command line; template duo of Dockerfile and environment.yml
  • zenodo: Creating a record and updating the record with revisions from the user interface; retrieving https links for all files in a Zenodo record
  • nextflow: The anatomy of a Nextflow process; resources for further reading, such as Nextflow patterns; how to define the number of CPUs or the container per process type

Download reference genome

Link for human reference genome, canonical chromosomes only:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/GRCh38.primary_assembly.genome.fa.gz

Need to confirm this is correct.

Commands for Iso-Seq, SQANTI, CPAT for adding to nextflow

Commands for PacBio and ORF calling modules:

### Iso-Seq commands I ran in an interactive session on the UVA cluster:

### Input: jurkat.ccs.bam, NEB_primers.fasta, hg38.fa (for alignment)

# create an index
pbindex jurkat2.ccs.bam

module load isoseqenv
lima --isoseq --dump-clips --peek-guess -j 40 jurkat.ccs.bam NEB_primers.fasta jurkat.demult.bam
isoseq3 refine --require-polya jurkat.demult.NEB_5p--NEB_3p.subreadset.xml NEB_primers.fasta jurkat.flnc.bam

# clustering of reads, can only make faster by putting more cores on machine (cannot parallelize)
isoseq3 cluster jurkat.flnc.bam jurkat.polished.bam --verbose --use-qvs

# align reads to the genome, takes few minutes (40 core machine)
pbmm2 align hg38.fa jurkat.polished.transcriptset.xml jurkat.aligned.bam --preset ISOSEQ --sort -j 40 --log-level INFO

# collapse redundant reads
isoseq3 collapse jurkat.aligned.bam jurkat.collapsed.gff



### SQANTI commands run via slurm

### Input: jurkat.collapsed.fasta, jurkat.collapsed.abundance.txt, gencode.v35.annotation.gtf, hg38.fa

source activate SQANTI3.env

python sqanti3_qc.py jurkat.collapsed.fasta gencode.v35.annotation.gtf hg38_canon.fa -o jurkat -d SQANTI3_out_v2/ --fl_count jurkat.collapsed.abundance.txt -n8

source deactivate

Expected output:

    jurkat.params.txt
    jurkat_classification.txt
    jurkat_corrected.faa
    jurkat_corrected.fasta
    jurkat_corrected.gtf
    jurkat_junctions.txt
    jurkat_sqanti_report.pdf



### CPAT commands

### Input: Human_Hexamer.tsv, Human_logitModel.RData, jurkat_corrected.fasta (from SQANTI)

CPAT commands to use:

cpat.py -x Human_Hexamer.tsv -d Human_logitModel.RData -g ./SQANTI3_out/jurkat_corrected.fasta --min-orf=50 --top-orf=50 -o jurkat_cpat 1> jurkat_cpat.output 2> jurkat_cpat.error

Add PULL_REQUEST.md template in .github/

Example:

Overview

This PR updates the Nextflow specific files for the ORF calling modules.

Description

The changes implement the following:

....

How can I test this works?

Assuming you have cloned the Long-Read-Proteogenomics repo, run the following commands to check out the branch cgpu-updates-orf-calling that implements the change:

# Navigate into the repo folder
cd Long-Read-Proteogenomics

# Navigate to the branch that has the code
git checkout adds-nextflow-refined-db

# Navigate to the folder with the Nextflow-ified standalone module:
cd modules/PG_RefinedDatabaseGeneration
nextflow run refined_db_generation.nf \
--orfs https://zenodo.org/record/4279863/files/orf-testset-fraciton16.csv \
--seq https://zenodo.org/record/4279863/files/jurkat_corrected.fasta \
--sample 'jurkat'
# alternative --seq test input: https://zenodo.org/record/4279863/files/toy_for_christina.fasta.txt

  1. Positive Nextflow test (should complete successfully)

 nextflow run lr_orfcalling.nf --fasta https://zenodo.org/record/4278034/files/toy_for_christina.fasta.txt

  2. Negative Nextflow test (should fail early)

Neglect providing the required parameter --fasta. This should fail early with an informative error message.

 nextflow run lr_orfcalling.nf

Structure of repo

Suggest that we have a folder "modules" for all the individual modules so they are not in the top directory.
At the same level as modules, we can have Nextflow processes, etc., that will call the modules.

Issue installing nextflow within Lifebit CloudOS

I don't seem to have permissions to install nextflow within the lifebit/cloudos

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
ERROR conda.core.link:_execute(700): An error occurred while uninstalling package 'conda-forge/linux-64::certifi-2020.6.20-py37hc8dfbb8_0'.
Rolling back transaction: done

[Errno 13] Permission denied: '/opt/conda/lib/python3.7/site-packages/certifi-2020.6.20-py3.7.egg-info/PKG-INFO' -> '/opt/conda/lib/python3.7/site-packages/certifi-2020.6.20-py3.7.egg-info/PKG-INFO.c~'
()

jovyan@d010449df382:/mnt/shared/ubuntu/session_data/Long-Read-Proteogenomics$

Write out distinct entries for same-pb-protein sequence but different gene

@bj8th
@rmmiller22 and I were discussing how we want to handle PacBio sequences that have the same protein sequence but map up to different genes. For now, we'd like to keep them, as we are keeping such cases for UniProt and GENCODE.
I found the part of the code where it grabs the gene (below). Could you add a bit to check whether the group of same-protein-sequence PacBio accessions corresponds to more than one gene, and if so, output multiple entries?

gene = orfs[orfs['pb_acc'] == base_acc].iloc[0]['gene']
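
A minimal sketch of how this could look, assuming orfs is a pandas DataFrame with pb_acc and gene columns as in the line above; the base_acc grouping column and the helper name are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical sketch: for each group of PacBio accessions that collapse to the
# same protein sequence (grouped here by an assumed 'base_acc' column), emit one
# output entry per distinct gene instead of a single entry.
import pandas as pd

def entries_per_gene(orfs: pd.DataFrame, base_acc: str) -> pd.DataFrame:
    """Return one row per distinct gene among accessions collapsed to base_acc."""
    group = orfs[orfs['base_acc'] == base_acc]          # assumed grouping column
    rows = []
    for gene in group['gene'].dropna().unique():
        pb_accs = group.loc[group['gene'] == gene, 'pb_acc'].tolist()
        rows.append({'base_acc': base_acc, 'gene': gene, 'pb_accs': '|'.join(pb_accs)})
    return pd.DataFrame(rows)
```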

"Minimal or complete" UniProte and GENCODE proteomes

Just wondering if we'll be limiting the UniProt and GENCODE proteomes to include only those genes observed in PacBio. I don't know that we should. In fact, there is some reason not to limit them. But I'd be glad to see this topic discussed.

Add information to refine database metatable

Request more information to be added to the refine database table.

Currently jurkat_orf_aggregated.tsv outputs:
pb_accs, base_acc, FL, CPM

Requesting also:
genename
orf quality (e.g., Clear_Best_ORF)
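
A hedged sketch of how the requested columns might be merged into the aggregated table with pandas; the per-ORF file name and its column names (jurkat_best_orf.tsv, orf_calling_confidence) are assumptions for illustration and may differ from the actual pipeline outputs:

```python
# Hypothetical sketch: add gene name and ORF-quality call (e.g., Clear_Best_ORF)
# to the aggregated refined-database table.
import pandas as pd

agg = pd.read_csv('jurkat_orf_aggregated.tsv', sep='\t')       # pb_accs, base_acc, FL, CPM
orf_calls = pd.read_csv('jurkat_best_orf.tsv', sep='\t')       # assumed per-ORF table

extra = orf_calls[['pb_acc', 'gene', 'orf_calling_confidence']]
merged = (agg.merge(extra, left_on='base_acc', right_on='pb_acc', how='left')
             .rename(columns={'gene': 'genename'})
             .drop(columns=['pb_acc']))
merged.to_csv('jurkat_orf_aggregated_with_gene.tsv', sep='\t', index=False)
```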

Troubleshooting ORFCalling nextflow with Anne

Hi @adeslatt,
I attempted to run the baby nextflow code to run Transdecoder.
I corrected a few minor bugs (e.g., lr_orfcalling.nr -> lr_orfcalling.nf).
The help message displayed correctly.

The Nextflow processing step does appear to be running, but returns error 127.
(screenshot)

When I look at the latest nextflow log, the problem appears to be during DEBUG nextflow.cli.Launcher:
(screenshot)

What I'm wondering is if it is looking for nextflow.config? I had an earlier version of nextflow.config in the directory, which led to problems with display of the help method, so I "stashed" away nextflow.config. But nextflow appears to continue to look for it. Not sure if that has something to do with the issue.

I compared nextflow.config with lr_orfcalling_nextflow.config, and the only difference appears to be pointers to the executors and test configs.
(screenshot)

I also went into the .command.sh file that nextflow is running, and see that there are two Transdecoder commands to run:
(screenshot)
But, when these are run, Transdecoder is not found. So, does that mean we need to add code to have docker install Transdecoder in case it doesn't find it? Or do we need to remember to install Transdecoder before running this code?

I also looked at the aws.config file, and it appears that Transdecoder is not listed. Should we change Transdecoder in the command to ORFtransdecoder?
(screenshot)

Figure out how others can upload to the LRP Zenodo

Only I (Gloria) have been uploading data to Zenodo.

Need to figure out how others can upload data to the Zenodo repo.
Christina had found a tool (upload_zenodo on github?) where others can upload data with a token.

Append UniProt sequences for non-observed genes

For input to MetaMorpheus. E.g., if in the PacBio data we observe transcripts transcribed from 15k genes, there are 5k genes not observed. We should take the 5k canonical UniProt protein sequences for the genes not observed and append them to the 15k we observed, for a total database of 20k proteins.
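
A minimal sketch of this append step, assuming Biopython and pandas are available; the FASTA file names and the use of the SQANTI classification's associated_gene column are assumptions for illustration:

```python
# Hypothetical sketch: append canonical UniProt sequences for genes with no
# observed PacBio transcripts to the PacBio-derived protein database.
from Bio import SeqIO
import pandas as pd

# genes observed in the PacBio data (table/column names assumed)
observed_genes = set(pd.read_table('jurkat_classification.txt')['associated_gene'].dropna())

def uniprot_gene(record):
    """Pull the gene symbol from a UniProt header's GN= field, if present."""
    for token in record.description.split():
        if token.startswith('GN='):
            return token[3:]
    return None

pacbio_records = list(SeqIO.parse('pacbio_refined.fasta', 'fasta'))    # assumed file name
uniprot_records = SeqIO.parse('uniprot_canonical.fasta', 'fasta')      # assumed file name
to_append = [r for r in uniprot_records if uniprot_gene(r) not in observed_genes]

SeqIO.write(pacbio_records + to_append, 'pacbio_plus_uniprot.fasta', 'fasta')
```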

Include read_gtf function in ORF calling script.

@bj8th I fixed some minor bugs in the ORF calling script (commit 5386c75)
It is still missing the function for "read_gtf". Please add that function in and I'll try running again.

Also, what is the expected format for calling the script on a command line?
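
For illustration only, a minimal read_gtf sketch using pandas (not the repository's actual implementation; the attribute parsing shown is an assumption):

```python
# Hypothetical sketch: load a GTF into a pandas DataFrame and pull out
# transcript accessions from the attribute column.
import pandas as pd

GTF_COLUMNS = ['seqname', 'source', 'feature', 'start', 'end',
               'score', 'strand', 'frame', 'attribute']

def read_gtf(path: str) -> pd.DataFrame:
    """Read a GTF file, skipping header comment lines."""
    gtf = pd.read_csv(path, sep='\t', comment='#', header=None, names=GTF_COLUMNS)
    # extract transcript_id from the attribute column, e.g. transcript_id "PB.1.1";
    gtf['transcript_id'] = gtf['attribute'].str.extract(r'transcript_id "([^"]+)"',
                                                        expand=False)
    return gtf
```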

Iso-Seq command code

For @adeslatt

Iso-Seq commands I ran in an interactive session on the UVA cluster:

Input is a jurkat.ccs.bam

# create an index
pbindex jurkat2.ccs.bam

module load isoseqenv
lima --isoseq --dump-clips --peek-guess -j 40 jurkat.ccs.bam NEB_primers.fasta jurkat.demult.bam
isoseq3 refine --require-polya jurkat.demult.NEB_5p--NEB_3p.subreadset.xml NEB_primers.fasta jurkat.flnc.bam

# clustering of reads, can only make faster by putting more cores on machine (cannot parallelize)
isoseq3 cluster jurkat.flnc.bam jurkat.polished.bam --verbose --use-qvs

# align reads to the genome, takes few minutes (40 core machine)
pbmm2 align hg38.fa jurkat.polished.transcriptset.xml jurkat.aligned.bam --preset ISOSEQ --sort -j 40 --log-level INFO

# collapse redundant reads
isoseq3 collapse jurkat.aligned.bam jurkat.collapsed.gff

Help to containerize the protein database mapping scripts

I merged to dev the scripts to map isoforms between the protein databases. This involves making blast databases, running blast pairwise between all three input databases (gencode, uniprot, pacbio), and parsing the output.

I would like to now dockerize/containerize/nextflowify this workflow.
@adeslatt I'm looking at your example from rmats, and will start from that. Let me know what the next steps would be, thanks!
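
A rough sketch of the pairwise BLAST steps described above as they might run inside a container, assuming NCBI BLAST+ is on the PATH; the FASTA file and database names are placeholders, not the repository's actual paths:

```python
# Hypothetical sketch: build a protein BLAST database from each input FASTA and
# run blastp pairwise between all three databases, with tabular output for parsing.
import itertools
import subprocess

fastas = {'gencode': 'gencode.fasta', 'uniprot': 'uniprot.fasta', 'pacbio': 'pacbio.fasta'}

# make a protein BLAST database from each input FASTA
for name, fasta in fastas.items():
    subprocess.run(['makeblastdb', '-in', fasta, '-dbtype', 'prot', '-out', name],
                   check=True)

# run blastp for every ordered pair of databases
for query, target in itertools.permutations(fastas, 2):
    subprocess.run(['blastp', '-query', fastas[query], '-db', target,
                    '-outfmt', '6', '-out', f'{query}_vs_{target}.tsv'],
                   check=True)
```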

Filter the Uniprot-Gencode multimapping genes

@kyuubi430 Filter out the genes that do not have a one-to-one relationship between UniProt gene and GENCODE gene.
Note: these are likely cases in which one UniProt gene maps up to multiple genes because there are multiple copies of the gene on the genome (GENCODE is genome-centric).

Report back to @rmmiller22 and @gsheynkman the fraction of genes that have this multi-mapping status. Needs to be less than 5% to proceed.
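
A hedged sketch of how the multi-mapping fraction could be computed with pandas, assuming a hypothetical two-column mapping table (uniprot_gene, gencode_gene); file and column names are placeholders:

```python
# Hypothetical sketch: report the fraction of UniProt genes mapping to more than
# one GENCODE gene, and keep only the one-to-one pairs.
import pandas as pd

pairs = pd.read_table('uniprot_gencode_gene_map.tsv').drop_duplicates()

genes_per_uniprot = pairs.groupby('uniprot_gene')['gencode_gene'].nunique()
multi = genes_per_uniprot[genes_per_uniprot > 1]
fraction_multi = len(multi) / len(genes_per_uniprot)
print(f'{fraction_multi:.1%} of UniProt genes map to more than one GENCODE gene')

one_to_one = pairs[~pairs['uniprot_gene'].isin(multi.index)]
one_to_one.to_csv('uniprot_gencode_one_to_one.tsv', sep='\t', index=False)
```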

wiki

Please add to this issue any items that you'd like to see added to a wiki

testing - continuous testing?

Do we want to have any kind of continuous testing at the local (module) level? @cgpu, should we think about using these local mini Nextflow runs that are module-centric, while keeping the main.nf for the entirety? Just wondering if you think this makes sense.

Understanding channels in nextflow script.

@cgpu

In the example here - https://www.nextflow.io/example1.html
Will the intermediate files stored in the channel record be stored in the /data directory you mentioned? At the start of the codeathon, you mentioned that if we need to store intermediate files, they can be written there?
Or, will the contents of record not be saved after the pipeline completes, unless we explicitly store in /data?

Added files to Zenodo

I added jurkat.collapsed.gtf to Zenodo: please use version 12 (10.5281/zenodo.4320967).

hg38_canon.fa is on the Sheynkman lab project storage
/Volumes/sheynkman/comp/pacbio_processing/jurkat/hg38_canon.fa

Uniprot and Pacbio isoforms map to two different Gencode isoforms

There are some cases in which an isoform that is the same in UniProt and PacBio maps up to two different GENCODE isoforms.
I had manually looked at these cases and they are edge cases in which multimapping isoforms slipped through the 95%+ identity threshold.

six-frame translation module path through pipeline

Seems like we could do a six-frame translation to FASTA and send it through MetaMorpheus for peptide identification. Currently it's connected to peptide analysis in the graphic. But if we don't search that DB for peptides in MM, then we won't ever see them.
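
A minimal sketch of such a six-frame translation using Biopython, with placeholder file names (e.g., reusing the SQANTI-corrected transcript FASTA as input); this is an illustration, not the pipeline's module:

```python
# Hypothetical sketch: translate all six reading frames of each transcript and
# write a FASTA protein database that could be searched in MetaMorpheus.
from Bio import SeqIO
from Bio.Seq import Seq

def six_frame_translations(seq: Seq):
    """Yield (label, protein) for all six reading frames (3 forward, 3 reverse)."""
    for strand_name, strand in (('fwd', seq), ('rev', seq.reverse_complement())):
        for frame in range(3):
            subseq = strand[frame:]
            subseq = subseq[:len(subseq) - len(subseq) % 3]  # trim to a codon multiple
            yield f'{strand_name}_frame{frame}', subseq.translate()

with open('jurkat_six_frame.fasta', 'w') as out:
    for record in SeqIO.parse('jurkat_corrected.fasta', 'fasta'):
        for label, protein in six_frame_translations(record.seq):
            out.write(f'>{record.id}_{label}\n{protein}\n')
```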
