sgibb / cleaver
Cleavage of polypeptide sequences
Home Page: http://sgibb.github.io/cleaver/
testthat 0.8 comes with a new recommended structure for storing your tests. To better meet CRAN recommended practices, testthat now recommends that you put your tests in tests/testthat, instead of inst/tests (this makes it possible for users to choose whether or not to install tests). With this new structure, you'll need to use test_check() instead of test_package() in the test file (usually tests/testthat.R) that runs all testthat unit tests.
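A minimal sketch of what tests/testthat.R might look like after the move, assuming the package name cleaver from this repository; the individual test files would then live in tests/testthat/.

```r
## tests/testthat.R: executed by R CMD check
library("testthat")
library("cleaver")

## runs every test file found in tests/testthat/
test_check("cleaver")
```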
It is now quite clear that 100% digestion efficiency with trypsin should not be assumed in proteomics workflows. Inefficient trypsin digestion also poses a very serious problem in absolute quantitation workflows that use isotopically labelled standards.
Isotopic standards are currently used as follows: the peptides to be quantified are synthesised in labelled form, and a known amount of each labelled peptide is spiked into the sample prior to its analysis by LC-MS. After the acquisition, the amount of unlabelled peptide (and hence of its protein of origin) is computed as follows:
quantity_unlabelled = signal_unlabelled/signal_labelled * quantity_labelled
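As a worked illustration of the ratio above, here is a small sketch in base R. The function name quantify_unlabelled and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```r
## hypothetical helper implementing the ratio formula above
quantify_unlabelled <- function(signal_unlabelled, signal_labelled,
                                quantity_labelled) {
  signal_unlabelled / signal_labelled * quantity_labelled
}

## illustrative numbers only: 50 units of labelled standard spiked in,
## unlabelled signal twice as intense as the labelled one
quantity_unlabelled <- quantify_unlabelled(signal_unlabelled = 2e6,
                                           signal_labelled = 1e6,
                                           quantity_labelled = 50)
quantity_unlabelled
## [1] 100
```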
Consider quantitation of the following peptide: VTTYFPSVNLR. Below is a piece of protein sequence it originates from:
GNIR.VTTYFPSVNLR.KSSQK
Note that to release the peptide from the protein, digestion must occur after R. However, this R is followed by K, which is expected to result in two dead-end products:
VTTYFPSVNLR and VTTYFPSVNLRK
As a result, the amount of the VTTYFPSVNLR peptide is no longer proportional to the protein amount, and if absolute quantitation is performed using this peptide only, the amount of protein will be underestimated (a specific example of this happening is given in ref1).
The most obvious way to counteract the problem is to ignore peptides like this. However, this is not usually possible, given that only a limited number of peptides suitable for quantitation is available for each protein. Thus the best solution is to mimic the cleavage site by adding 3 amino acids before and after the peptide.
However, consider the following peptide:
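The two dead-end products can be reproduced with cleaver itself. This is only a sketch: it assumes that allowing one missed cleavage in cleave() is a reasonable stand-in for incomplete digestion at the RK site, and it uses the protein stretch from above written without the dots.

```r
library("cleaver")

## protein stretch from the example above, without the cleavage markers
stretch <- "GNIRVTTYFPSVNLRKSSQK"

## allow zero or one missed cleavage
peptides <- cleave(stretch, enzym = "trypsin", missedCleavages = 0:1)[[1]]

## check whether both dead-end products are among the cleavage products
c("VTTYFPSVNLR", "VTTYFPSVNLRK") %in% peptides
```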
QNGRLR.HFTIPSHR.ARAGR
If we add RLR to the N-terminus of the peptide sequence, the cleavage site again does not mimic what happens in the protein, since if cleavage occurs after the first R in the protein it yields a dead-end product:
LR.HFTIPSHR
Hence the overhang needs to be extended by 3 aa before the RLR. However, this extension of overhangs is not always possible, since there is a limit to the length of peptide that can be synthesised (usually a synthetic peptide is no longer than 20 aa). Additional parameters therefore need to be passed to the model to determine the optimal compromise.
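The trade-off between overhang size and synthesisable length could be sketched as follows. The helper flank_peptide, its arguments and the default limit of 20 aa are hypothetical illustrations and are not part of cleaver; the protein stretch is the example from above written without the dots.

```r
## hypothetical helper: extend a peptide by `overhang` residues on both
## sides within its protein and report whether the result is still
## short enough to synthesise
flank_peptide <- function(protein, peptide, overhang = 3L, max_length = 20L) {
  start <- as.integer(regexpr(peptide, protein, fixed = TRUE))
  if (start == -1L) {
    stop("peptide not found in protein")
  }
  end <- start + nchar(peptide) - 1L
  ## clamp the flanked region to the protein boundaries
  from <- max(1L, start - overhang)
  to <- min(nchar(protein), end + overhang)
  flanked <- substr(protein, from, to)
  list(sequence = flanked,
       length = nchar(flanked),
       fits = nchar(flanked) <= max_length)
}

protein <- "QNGRLRHFTIPSHRARAGR"
short <- flank_peptide(protein, "HFTIPSHR", overhang = 3L)
long <- flank_peptide(protein, "HFTIPSHR", overhang = 6L)
short$sequence
long$fits
```

A real implementation would also need to decide what to do when the extended peptide exceeds the limit, which is the compromise mentioned above.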
I will write out a detailed outline of the workflow if this functionality is to be added to cleaver.
references:
When we digest proteins with a certain number of missed cleavages (0:M), the number of cleavage sites per peptide is expected to lie in the range 0 to M. But for a certain number of peptides, the number of cleavage sites exceeds the missedCleavages value specified in the initial digestion.
In the example below, there are 78 peptides that have more than 2 cleavage sites, even though the allowed number of missed cleavages was defined as missedCleavages=0:2 during trypsin digestion.
Test proteins fasta: proteins.fasta.gz
library(cleaver)
## read fasta
proteins <- readAAStringSet("proteins.fasta.gz")
## number of proteins in proteins.fasta
length(proteins)
## [1] 38
## digest proteins with trypsin
cleaved <- cleaver::cleave(proteins, missedCleavages = 0:2, enzym = "trypsin")
## unlist into AAStringSet
peptides <- unlist(cleaved)
## rename individual peptides as: id::peptide
names(peptides) <- paste0(base::strsplit(names(cleaved), "\\|")[[1]][2],
"::", as.character(peptides))
## get cleaved sites within peptides
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")
## number of peptides with cleavage sites > 2
length(missed[elementNROWS(missed) > 2])
## [1] 78
## peptides with more than 2 cleavage sites
head(missed[elementNROWS(missed) > 2])
## $`A6NL46::RRKK`
[1] 1 2 3
$`A6NL46::RRKK`
[1] 1 2 3
$`A6NL46::RRAVSMDNGAKFLR`
[1] 1 2 11
$`A6NL46::RRPMIYVESSEESSDEQPDEVESPTQSQDSTPAEEREDEGASAAQGQEPEADSQELVQPKTGCELGDGPDTK`
[1] 1 36 60
$`A6NL46::RRQEGKCK`
[1] 1 2 6
$`A6NL46::RRGSSIPQFTNSPTMVIMVGLPARGK`
[1] 1 2 24
And there's also a mismatch between the number of ranges and peptides after enzymatic digestion:
cleaved <- cleaver::cleave(proteins, missedCleavages = 0, enzym = "trypsin")
ranges <- cleaver::cleavageRanges(proteins, missedCleavages = 0, enzym = "trypsin")
sites <- cleaver::cleavageSites(proteins, enzym = "trypsin")
sum(lengths(cleaved))
## [1] 17072
sum(lengths(ranges) )
## [1] 23260
sum(lengths(sites))
## [1] 23222
sum(lengths(sites)) + length(proteins)
## [1] 23260
peptides <- unlist(cleaved)
names(peptides) <- paste0(base::strsplit(names(cleaved), "\\|")[[1]][2],
"::", as.character(peptides))
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")
length(missed[elementNROWS(missed) > 0])
## [1] 55
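One thing that may be worth checking, offered here as an assumption and not confirmed anywhere in this report: cleave() has a unique argument that defaults to TRUE and drops duplicated cleavage products per sequence, whereas cleavageRanges() returns one range per product. The toy sequence below is hypothetical and was chosen only because it yields repeated peptides; it sketches how that alone could make the two counts differ.

```r
library("cleaver")

## made-up sequence that produces the same peptide several times
toy <- "AKAKAK"

with_unique <- cleave(toy, enzym = "trypsin", missedCleavages = 0,
                      unique = TRUE)[[1]]
without_unique <- cleave(toy, enzym = "trypsin", missedCleavages = 0,
                         unique = FALSE)[[1]]
## for character input the ranges come back as a two-column matrix
ranges <- cleavageRanges(toy, enzym = "trypsin", missedCleavages = 0)[[1]]

length(with_unique)
length(without_unique)
nrow(ranges)
```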
It has previously been demonstrated that trypsin has digestion problems if the AA in the vicinity of the K|R in the cleavage sites are
PLGS takes some of these rules into consideration and allows missed-cleavage peptides to be pepFrag1 or pepFrag2 (i.e. suitable for quantitation) if the cleavage site is followed by P, K, R, D or E. Note that the rule only applies to the amino acid directly following K|R, although D and E seem to have an inhibitory effect on trypsin activity at positions -3:+3.
My suggestion is to allow peptides with "special" missed cleavages for quantitation, i.e. when creating a vector of proteotypic peptides, both peptides with sequence XXXXK and XXXXKEXXXR should be present.
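A rough sketch of how such a filter might look in base R. The helper name and the peptides are made up for illustration; in practice the input vectors could come from cleave() with missedCleavages = 0 and missedCleavages = 1. Only the residue directly following K|R is checked, as in the rule described above.

```r
## hypothetical helper: TRUE if a K or R inside the peptide is directly
## followed by one of the residues named in the rule (P, K, R, D, E)
has_special_missed_cleavage <- function(peptides, following = "PKRDE") {
  grepl(paste0("[KR][", following, "]"), peptides)
}

## made-up peptides for illustration
fully_cleaved <- c("AAAAK", "EAAAR")
missed <- c("AAAAKEAAAR", "EAAARGGGGK")

## keep fully cleaved peptides plus the "special" missed cleavages
candidates <- c(fully_cleaved, missed[has_special_missed_cleavage(missed)])
candidates
```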
references:
Here, the regex for pepsin 1.3 ([FLWY]) is less stringent than the one for pepsin > 2 ([FL]).
Lines 60 to 63 in c02266a
Should the label for these regular expressions be swapped?
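To make "less stringent" concrete, the two character classes quoted above can be compared on a made-up sequence in base R. This only looks at the bracketed classes as quoted in this issue; the full rules in cleaver may contain further context conditions that are not shown here.

```r
## made-up sequence for illustration
toy_sequence <- "AFLWYGFL"
residues <- strsplit(toy_sequence, "")[[1]]

## positions matched by each character class quoted in the issue
broad <- which(grepl("[FLWY]", residues))
narrow <- which(grepl("[FL]", residues))

## every position matched by [FL] is also matched by [FLWY]
all(narrow %in% broad)
## and [FLWY] matches additional positions
length(broad) > length(narrow)
```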