GithubHelp home page GithubHelp logo

ragp's Introduction

ragp

Build Status

The goal of ragp is to facilitate mining of plant hydroxyproline rich glycoproteins (HGRPs), and especially arabinogalactan protein sequences (AGPs) from NGS data. The functions in the package consist of two groups: 1. web server prediction scraping and 2. sequence analyses. Functions in group 2 were made with HGRPs in mind and will be of limited use out of this scope while functions in group 1 are available for a wide range of annotation applications such as domain prediction, cellular localization, presence of trans-membrane regions and GO annotation. All of ragp functions expect protein sequences in single letter code.

Installation

You can install ragp from github with:

# install.packages("devtools")
devtools::install_github("missuse/ragp")

Example

This is a basic example which shows you how to fetch SignalP predictions:

library(seqinr) #to create a fasta file with protein sequences
library(ragp)
data(at_nsp) #a data frame of 2700 Arabidopsis protein sequences 

#produce a fasta file from at_nsp data abailible in ragp:
seqinr::write.fasta(sequence = strsplit(at_nsp$sequence, ""),
                    name = at_nsp$Transcript.id, file = "at_nsp.fasta")

file_list <- split_fasta(path_in = "at_nsp.fasta", #path to the FASTA formated file`
                         path_out = "splited_at_nsp", #path to the splited files on which integers and file type will be apended automatically
                         num_seq = 1000) #number of sequences in destination files, usually this will be 10 - 20k

#get the SignalP predictions
signalp_pred_1 <- get_signalp_file(file = file_list[1])

First at_nsp.fasta file was generated from the protein sequences using the library seqinr. This file was then split into smaller fasta files each containing a 1000 sequences using the ragp function split_fasta. This is for illustration purposes - generally big fasta files should be split to smaller ones containing 10 - 20 thousand sequences. And finally SignalP predictions were obtained using the function get_signalp_file. Similarly Phobius and TargetP can be accessed by the functions get_phobius_file and get_targetp_file.

To fetch the GPI modification site predictions from big-PI the function get_big_pi can be used:

ind <- c(145, 147, 160, 170,
         189, 203, 214, 224) #some indexes
big_pi_pred <- get_big_pi(sequence = at_nsp$sequence[ind],
                          id = at_nsp$Transcript.id[ind],
                          verbose = FALSE)

Similarly, domains can be identified by hmmscan using get_hmm function:

pfam_pred <- get_hmm(sequence = at_nsp$sequence[1:20], #a vector of protein sequences as strings
                     id = at_nsp$Transcript.id[1:20],
                     verbose = FALSE)

To enrich with GO terms:

pfam_pred_go <- pfam2go(pfam_pred)

first 10 rows of the GO result with selected columns:

id name acc GO_name GO_acc
ATCG00660.1 Ribosomal_L20 PF00453.17 GO:structural constituent of ribosome GO:0003735
ATCG00660.1 Ribosomal_L20 PF00453.17 GO:rRNA binding GO:0019843
ATCG00660.1 Ribosomal_L20 PF00453.17 GO:intracellular GO:0005622
ATCG00660.1 Ribosomal_L20 PF00453.17 GO:ribosome GO:0005840
ATCG00660.1 Ribosomal_L20 PF00453.17 GO:translation GO:0006412
AT2G43600.1 Glyco_hydro_19 PF00182.18 GO:cell wall macromolecule catabolic process GO:0016998
AT2G43600.1 Glyco_hydro_19 PF00182.18 GO:chitin catabolic process GO:0006032
AT2G43600.1 Glyco_hydro_19 PF00182.18 GO:chitinase activity GO:0004568
AT2G43600.1 Glyco_hydro_19 PF00182.18 GO:cell wall macromolecule catabolic process GO:0016998
AT2G43600.1 Glyco_hydro_19 PF00182.18 GO:chitin catabolic process GO:0006032

Perform motif and amino acid bias classification of HRGP's:

PAST_bias <- maab(sequence = at_nsp$sequence, #a vector of protein sequences as strings
                  id = at_nsp$Transcript.id) #a vector of protein identifiers as strings
                 

Predict hydroxyproline sites in sequences:

hyp_pred <- predict_hyp(sequence = at_nsp$sequence,
                        id = at_nsp$Transcript.id)

Output is a list of two elements. The first element "prediction"" is a data frame. First 10 rows:

id substr P_pos prob HYP
AT2G43600.1 FSQNCMDTSCPGLKECCSRWG 31 0.02 No
AT2G43600.1 EYCGFFCFSGPCNIKGKSYGY 58 0.02 No
AT2G43600.1 YGYDYNVDAGPRGKIETVITS 76 0.03 No
AT2G43600.1 ERYCSKSKKYPCEPGKNYYGR 163 0.04 No
AT2G43600.1 CSKSKKYPCEPGKNYYGRGLL 166 0.02 No
AT2G43600.1 YYGAGKHLGLPLLKDPDLVSR 194 0.02 No
AT2G43600.1 KHLGLPLLKDPDLVSRSPEVA 199 0.02 No
AT2G43600.1 LKDPDLVSRSPEVAFKFAMWF 206 0.02 No
AT2G43600.1 AMWFWNRNVRPALYLGFGEIT 223 0.02 No
AT2G43600.1 EMLGVTPDQGLDC 267 0.00 No

predict_hyp is also availible as a shiny app. Details on how hydroxyproline sites are predicted will be availible soon.

AGPs are characterized by the presence of so called AGII glycomodules - amino acid dimers: OA, OS, OT, AO, SO and TO (and probably OG, OV, GO and VO) which are in close proximity to each other. Where: O - hydroxyproline, A - alanine, S - serine, T - threnonine, G - glycine and V - valine. scan_ag function attempts to find the mentioned dimers according to user specified rules. Example:

at_nsp_ag <- scan_ag(sequence = at_nsp$sequence,
                     id = at_nsp$Transcript.id,
                     dim = 3, #at least 3 dimers must be present
                     div = 10, #no more than 10 amino acids apart
                     type = "conservative") #dimers will be defined as: PA, PS, PT, AP, SP, TP

List several sequences. possible AGII glycomodules are in caps:

ind <- c(145, 147, 160, 170,
         189, 203, 214, 224)
at_nsp_ag$sequence[ind]
#> [1] "mayatilmifsvvalmsgerahaavdcsslilnmadclsfvtsgstvvkpegtccsglktvvrtgpeclceafknsgslgltldlskaaslPSvckvaAPPSarcglsvsgdpPAtAPglSPTagagAPAlssganaaTPvsSPrssdasllsvsfafvifmalissfy"                                                                                                                                                                                                       
#> [2] "mgyrrsyaitfvalvaalwsvtkaqPSsscvstlttlSPclsyitgnstTPSqpccsrldsviksspqcicsavnspipniglninrtqalqlpnacniqTPpltqcnaatgPTaqpPAPSPTekTPdvtlTPTslpgarsgvgggsktvpsvgtgsssrnvdplplhflmfavlvvctssfl"                                                                                                                                                                                         
#> [3] "mrlllsllfllalttyssatyclcrdgvgekdlqtsidyacgvlkdcnpihekgpcyqpntikshcdwavntyfqrfgqisgscnfsgtattsqnlPStvvtgclyPSSPgsagtTPTtgTPSgtqtfpgpPAfgPAgdfdPSgnngAPSlfisialslgfsvviafl"                                                                                                                                                                                                        
#> [4] "mhhhhhpcnrkpfttifsffllylnlhnqqiiearnpsqfttnpspdvsipeikrhlqqygylpqnkesddvsfeqalvryqknlglpitgkpdsdtlsqillprcgfpddvepktapfhtgkkyvyfpgrprwtrdvplkltyafsqenltpylaptdirrvfrrafgkwasvipvsfietedyviadikigffngdhgdgepfdgvlgvlahtfspengrlhldkaetwavdfdeekssvavdlesvavheighvlglghssvkdaamyptlkprskkvnlnmddvvgvqslygtnpnftlnsllasetstnladgsrirsqgmiystlstvialcflnw"                          
#> [5] "makmqlsifiavvalivcsasaktaSPPAPvlpPTPAPAPAPenvnltellsvagpfhtfldyllstgvietfqnqannteegitifvpkddafkaqknpplsnltkdqlkqlvlfhalphyyslsefknlsqsgpvstfaggqyslkftdvsgtvridslwtrtkvsssvfstdpvavyqvnrvllpeaifgtdvppmPAPAPAPivsAPSdSPSvadsegasSPksshknsgqklllapismvisglvalfl"                                                                                                                  
#> [6] "mmrgaaptgvvsvmvlmilvllkqiesasangsslglpprkfcniyqgswvydksyplydskncpfierqfncksngrpdseylkyrwqpsgcnlprfngldflgrimkgkklmfvgdslslnqwqsltcllhnaAPkanststrSPSglsvfsfPAynssimfsrnaflvdivgappkrvmkldsissgslwktadvlvfnswhwwlhtdrkqpwdaimsgnvtvkdmdrlvayekammtwakwidqnidpsktkvffqgispdhgrarewskqggkgscigetkpimgssylagphaaemvvakviktmknqarlmdvtlmsqlrkdghpsvygfgghrmadcshwclsgvpdswnqllyselfhs"
#> [7] "maliknnifftsllifvtlfgvavggtvhkvgntkgwtmiggdyeawassrvfqvgdtlvfaynkdyhdvtevthndfemcesskplrryktgsdsisltkpglqhficgvpghckkgqklqihvlpaslghvavpvpgpvrsqssssSPSPSPlvdppvnnAPqyqmgPTPAshsaasadfiftfsfdltlidlctffilffilv"                                                                                                                                                                  
#> [8] "marsfaiavicivliagvtgqAPTSPPTaTPAPPTPTTPpPAaTPppvsAPppvttSPppvttAPpPAnppppvsSPpPASPpPATPppvaSPpppvaSPpPATPppvaTPpPAPlaSPPAqvPAPAPTtkpdSPSPSPSsSPplPSsdAPgPStdsiSPAPSPTdvndqngaskmvsslvfgsvlvwfmi"

Before I wrote OA, OS, OT, AO, SO and TO (and probably OG, OV, GO and VO) are AGII glycomodules but the above output considers P instead of O since most of the time the positions of O are unknown. However if the sequence argument to scan_ag contains O's at some positions, scan_ag will consider only them. To do this the sequence output of predict_hyp function can be used. Example:

at_nsp_ag <- scan_ag(sequence = hyp_pred$sequence, #hyp_pred was created a little bit back
                     id = at_nsp$Transcript.id,
                     dim = 3, #at least 3 dimers must be present
                     div = 10, #no more than 10 amino acids apart
                     type = "conservative")
#> sequence vector contains O, O will be considered instead of P

at_nsp_ag$sequence[ind]
#> [1] "mayatilmifsvvalmsgerahaavdcsslilnmadclsfvtsgstvvkpegtccsglktvvrtgpeclceafknsgslgltldlskaaslpsvckvaappsarcglsvsgdpOAtAOglSOTagagAOAlssganaatpvssprssdasllsvsfafvifmalissfy"                                                                                                                                                                                                       
#> [2] "mgyrrsyaitfvalvaalwsvtkaqpssscvstlttlspclsyitgnsttpsqpccsrldsviksspqcicsavnspipniglninrtqalqlpnacniqtppltqcnaatgOTaqpOAOSOTekTOdvtlTOTslogarsgvgggsktvpsvgtgsssrnvdplplhflmfavlvvctssfl"                                                                                                                                                                                         
#> [3] "mrlllsllfllalttyssatyclcrdgvgekdlqtsidyacgvlkdcnpihekgpcyqpntikshcdwavntyfqrfgqisgscnfsgtattsqnlpstvvtgclypsSOgsagtTOTtgTOSgtqtfpgoOAfgOAgdfdpsgnngapslfisialslgfsvviafl"                                                                                                                                                                                                        
#> [4] "mhhhhhpcnrkpfttifsffllylnlhnqqiiearnpsqfttnpsodvsipeikrhlqqygylpqnkesddvsfeqalvryqknlglpitgkpdsdtlsqillprcgfpddvepktapfhtgkkyvyfpgrprwtrdvplkltyafsqenltpylaptdirrvfrrafgkwasvipvsfietedyviadikigffngdhgdgepfdgvlgvlahtfspengrlhldkaetwavdfdeekssvavdlesvavheighvlglghssvkdaamyptlkprskkvnlnmddvvgvqslygtnpnftlnsllasetstnladgsrirsqgmiystlstvialcflnw"                          
#> [5] "makmqlsifiavvalivcsasaktaSOOAOvloOTOAOAOAOenvnltellsvagpfhtfldyllstgvietfqnqannteegitifvpkddafkaqknpplsnltkdqlkqlvlfhalphyyslsefknlsqsgpvstfaggqyslkftdvsgtvridslwtrtkvsssvfstdpvavyqvnrvllpeaifgtdvoompAOAOAOivsAOSdSOSvadsegasspksshknsgqklllapismvisglvalfl"                                                                                                                  
#> [6] "mmrgaaptgvvsvmvlmilvllkqiesasangsslglpprkfcniyqgswvydksyplydskncpfierqfncksngrpdseylkyrwqpsgcnlprfngldflgrimkgkklmfvgdslslnqwqsltcllhnaapkanststrsosglsvfsfpaynssimfsrnaflvdivgappkrvmkldsissgslwktadvlvfnswhwwlhtdrkqpwdaimsgnvtvkdmdrlvayekammtwakwidqnidpsktkvffqgispdhgrarewskqggkgscigetkpimgssylagphaaemvvakviktmknqarlmdvtlmsqlrkdghpsvygfgghrmadcshwclsgvpdswnqllyselfhs"
#> [7] "maliknnifftsllifvtlfgvavggtvhkvgntkgwtmiggdyeawassrvfqvgdtlvfaynkdyhdvtevthndfemcesskplrryktgsdsisltkpglqhficgvpghckkgqklqihvlpaslghvavovogovrsqssssSOSOSOlvdpovnnapqyqmgotoashsaasadfiftfsfdltlidlctffilffilv"                                                                                                                                                                  
#> [8] "marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTTOoOAaTOoovsAOoovttSOoovttAOoOAnoooovsSOoOASOoOATOoovaSOooovaSOoOATOoovaTOoOAOlaSOOAqvOAOAOTtkodSOSOSOSsSOolOSsdAOgOStdsiSOAOSOTdvndqngaskmvsslvfgsvlvwfmi"

Extensin motifs in the form of SPPP+ are also detected by scan_ag if in correct context, to avoid this add exclude_ext = "yes" as an argument:

at_nsp_ag <- scan_ag(sequence = hyp_pred$sequence, 
                     id = at_nsp$Transcript.id,
                     dim = 3, #at least 3 dimers must be present
                     div = 10, #no more than 10 amino acids apart
                     type = "conservative",
                     exclude_ext = "yes")
#> sequence vector contains O, O will be considered instead of P

at_nsp_ag$sequence[ind]
#> [1] "mayatilmifsvvalmsgerahaavdcsslilnmadclsfvtsgstvvkpegtccsglktvvrtgpeclceafknsgslgltldlskaaslpsvckvaappsarcglsvsgdpOAtAOglSOTagagAOAlssganaatpvssprssdasllsvsfafvifmalissfy"                                                                                                                                                                                                       
#> [2] "mgyrrsyaitfvalvaalwsvtkaqpssscvstlttlspclsyitgnsttpsqpccsrldsviksspqcicsavnspipniglninrtqalqlpnacniqtppltqcnaatgOTaqpOAOSOTekTOdvtlTOTslogarsgvgggsktvpsvgtgsssrnvdplplhflmfavlvvctssfl"                                                                                                                                                                                         
#> [3] "mrlllsllfllalttyssatyclcrdgvgekdlqtsidyacgvlkdcnpihekgpcyqpntikshcdwavntyfqrfgqisgscnfsgtattsqnlpstvvtgclypsSOgsagtTOTtgTOSgtqtfpgoOAfgOAgdfdpsgnngapslfisialslgfsvviafl"                                                                                                                                                                                                        
#> [4] "mhhhhhpcnrkpfttifsffllylnlhnqqiiearnpsqfttnpsodvsipeikrhlqqygylpqnkesddvsfeqalvryqknlglpitgkpdsdtlsqillprcgfpddvepktapfhtgkkyvyfpgrprwtrdvplkltyafsqenltpylaptdirrvfrrafgkwasvipvsfietedyviadikigffngdhgdgepfdgvlgvlahtfspengrlhldkaetwavdfdeekssvavdlesvavheighvlglghssvkdaamyptlkprskkvnlnmddvvgvqslygtnpnftlnsllasetstnladgsrirsqgmiystlstvialcflnw"                          
#> [5] "makmqlsifiavvalivcsasaktaSOOAOvloOTOAOAOAOenvnltellsvagpfhtfldyllstgvietfqnqannteegitifvpkddafkaqknpplsnltkdqlkqlvlfhalphyyslsefknlsqsgpvstfaggqyslkftdvsgtvridslwtrtkvsssvfstdpvavyqvnrvllpeaifgtdvoompAOAOAOivsAOSdSOSvadsegasspksshknsgqklllapismvisglvalfl"                                                                                                                  
#> [6] "mmrgaaptgvvsvmvlmilvllkqiesasangsslglpprkfcniyqgswvydksyplydskncpfierqfncksngrpdseylkyrwqpsgcnlprfngldflgrimkgkklmfvgdslslnqwqsltcllhnaapkanststrsosglsvfsfpaynssimfsrnaflvdivgappkrvmkldsissgslwktadvlvfnswhwwlhtdrkqpwdaimsgnvtvkdmdrlvayekammtwakwidqnidpsktkvffqgispdhgrarewskqggkgscigetkpimgssylagphaaemvvakviktmknqarlmdvtlmsqlrkdghpsvygfgghrmadcshwclsgvpdswnqllyselfhs"
#> [7] "maliknnifftsllifvtlfgvavggtvhkvgntkgwtmiggdyeawassrvfqvgdtlvfaynkdyhdvtevthndfemcesskplrryktgsdsisltkpglqhficgvpghckkgqklqihvlpaslghvavovogovrsqssssSOSOSOlvdpovnnapqyqmgotoashsaasadfiftfsfdltlidlctffilffilv"                                                                                                                                                                  
#> [8] "marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTTOoOAaTOoovsAOoovttsooovttaoooanoooovssoooasoooatooovasoooovasoooaTOoovaTOoOAOlaSOOAqvOAOAOTtkodSOSOSOSsSOolOSsdAOgOStdsiSOAOSOTdvndqngaskmvsslvfgsvlvwfmi"

to switch of all PPP+ from beeing detected:

at_nsp_ag <- scan_ag(sequence = hyp_pred$sequence,
                     id = at_nsp$Transcript.id,
                     dim = 3, #at least 3 dimers must be present
                     div = 10, #no more than 10 amino acids apart
                     type = "conservative",
                     exclude_ext = "all")
#> sequence vector contains O, O will be considered instead of P

at_nsp_ag$sequence[ind]
#> [1] "mayatilmifsvvalmsgerahaavdcsslilnmadclsfvtsgstvvkpegtccsglktvvrtgpeclceafknsgslgltldlskaaslpsvckvaappsarcglsvsgdpOAtAOglSOTagagAOAlssganaatpvssprssdasllsvsfafvifmalissfy"                                                                                                                                                                                                       
#> [2] "mgyrrsyaitfvalvaalwsvtkaqpssscvstlttlspclsyitgnsttpsqpccsrldsviksspqcicsavnspipniglninrtqalqlpnacniqtppltqcnaatgOTaqpOAOSOTekTOdvtlTOTslogarsgvgggsktvpsvgtgsssrnvdplplhflmfavlvvctssfl"                                                                                                                                                                                         
#> [3] "mrlllsllfllalttyssatyclcrdgvgekdlqtsidyacgvlkdcnpihekgpcyqpntikshcdwavntyfqrfgqisgscnfsgtattsqnlpstvvtgclypsSOgsagtTOTtgTOSgtqtfpgoOAfgOAgdfdpsgnngapslfisialslgfsvviafl"                                                                                                                                                                                                        
#> [4] "mhhhhhpcnrkpfttifsffllylnlhnqqiiearnpsqfttnpsodvsipeikrhlqqygylpqnkesddvsfeqalvryqknlglpitgkpdsdtlsqillprcgfpddvepktapfhtgkkyvyfpgrprwtrdvplkltyafsqenltpylaptdirrvfrrafgkwasvipvsfietedyviadikigffngdhgdgepfdgvlgvlahtfspengrlhldkaetwavdfdeekssvavdlesvavheighvlglghssvkdaamyptlkprskkvnlnmddvvgvqslygtnpnftlnsllasetstnladgsrirsqgmiystlstvialcflnw"                          
#> [5] "makmqlsifiavvalivcsasaktaSOOAOvloOTOAOAOAOenvnltellsvagpfhtfldyllstgvietfqnqannteegitifvpkddafkaqknpplsnltkdqlkqlvlfhalphyyslsefknlsqsgpvstfaggqyslkftdvsgtvridslwtrtkvsssvfstdpvavyqvnrvllpeaifgtdvoompAOAOAOivsAOSdSOSvadsegasspksshknsgqklllapismvisglvalfl"                                                                                                                  
#> [6] "mmrgaaptgvvsvmvlmilvllkqiesasangsslglpprkfcniyqgswvydksyplydskncpfierqfncksngrpdseylkyrwqpsgcnlprfngldflgrimkgkklmfvgdslslnqwqsltcllhnaapkanststrsosglsvfsfpaynssimfsrnaflvdivgappkrvmkldsissgslwktadvlvfnswhwwlhtdrkqpwdaimsgnvtvkdmdrlvayekammtwakwidqnidpsktkvffqgispdhgrarewskqggkgscigetkpimgssylagphaaemvvakviktmknqarlmdvtlmsqlrkdghpsvygfgghrmadcshwclsgvpdswnqllyselfhs"
#> [7] "maliknnifftsllifvtlfgvavggtvhkvgntkgwtmiggdyeawassrvfqvgdtlvfaynkdyhdvtevthndfemcesskplrryktgsdsisltkpglqhficgvpghckkgqklqihvlpaslghvavovogovrsqssssSOSOSOlvdpovnnapqyqmgotoashsaasadfiftfsfdltlidlctffilffilv"                                                                                                                                                                  
#> [8] "marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTtoooaatooovsaooovttsooovttaoooanoooovssoooasoooatooovasoooovasoooatooovatoooAOlaSOOAqvOAOAOTtkodSOSOSOSsSOolOSsdAOgOStdsiSOAOSOTdvndqngaskmvsslvfgsvlvwfmi"

ragp's People

Contributors

ts404 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.