
bigbio / proteomics-sample-metadata


The Proteomics Experimental Design file format: Standard for experimental design annotation

License: GNU General Public License v2.0

Python 59.50% Perl 40.50%
metadata sdrf pride-metadata proteomics msrun-metadata multiomics sdrf-proteomics proteomics-datasets proteomics-community proteomics-experiments

proteomics-sample-metadata's Introduction

Proteomics Sample Metadata Format


Improving metadata annotation of Proteomics datasets

Metadata is essential in proteomics data repositories and is crucial to interpret and reanalyze the deposited data sets. While ProteomeXchange partners capture a general description and standard data file formats for every dataset, the information linking samples to data files is mostly missing. Recently, members of the European Bioinformatics Community for Mass Spectrometry (EuBIC - https://eubic-ms.org/) have created this open-source project to enable the standardization of sample metadata of public proteomics data sets.

The Proteomics Sample Metadata Project aims to standardize the way ProteomeXchange partners and the proteomics community capture the relation between the samples and the data generated within a PX submission. We have adapted the MAGE-TAB v1.1 format to capture the metadata needed for proteomics experiments and to allow automated re-processing. MAGE-TAB (MicroArray Gene Expression Tabular) is the file format used to store the metadata and sample information of transcriptomics experiments. By repurposing and extending MAGE-TAB for proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. MAGE-TAB is divided into two main files: IDF (Investigation Description Format) and SDRF (Sample and Data Relationship Format). We will describe how these two files are adapted for proteomics.

Our goal is to ensure maximum reusability of the deposited data. Our work aims to define the minimum information required to report the experimental design of proteomics experiments, enabling the use and reuse of the deposited data by the proteomics community. The following Use Cases should be considered to design the Proteomics Sample Metadata Format:

  • The SDRF for Proteomics should be fully compatible with MAGE-TAB version v1.1 that is used to represent transcriptomics data.
  • The IDF part of the MAGE-TAB should be compatible with the current proteomeXchange.xml file format.
  • The "Sample and Data Relationship Format for Proteomics (SDRF-Proteomics)" based on the SDRF part of MAGE-TAB should capture the Sample to Data relationships.
  • The resulting file format SHOULD enable data submitters and curators to annotate a proteomics dataset at different levels, including the sample metadata (e.g. organism and tissues), technical metadata (e.g. instrument model) and the experimental design.
  • The resulting file format SHOULD facilitate the automatic reanalysis of public proteomics datasets, by providing a better representation of quantitative datasets in public repositories.

IDF

ProteomeXchange resources developed a file format called submission.px which captures the same information as the MAGE-TAB IDF. We have developed a set of tools to automatically translate from submission.px to IDF.

SDRF (SDRF-Proteomics)

While the general description of the experiment is captured for all PX submissions and experiments, the Sample to Data information is missing (or not standardized) for all PX datasets. The standardization of the SDRF (within MAGE-TAB) for proteomics is the main objective of this project (read more about SDRF-Proteomics).

Final PSI specification

The final HUPO-PSI specification is: SDRF HUPO-PSI

How to contribute

External contributors, researchers and the proteomics community are more than welcome to contribute to this project.

Contribute to the specification: you can contribute ideas or refinements by opening an issue in the issue tracker or submitting a pull request (PR).

In the annotated projects folder users can see the public datasets that have been annotated so far by the contributors. If you would like to join these efforts, fork this repo and open a pull request (PR) with your annotated project. If you don't have a project in mind, you can take one project from the issues and annotate it.

Annotate a dataset in 5 steps:

  • Read the SDRF-Proteomics specification.
  • Depending on the type of dataset, choose the appropriate sample template.
  • Annotate the corresponding ProteomeXchange PXD dataset following the guidelines.
  • Validate your SDRF file:

To validate your SDRF, first install the sdrf-pipelines tool with pip:

pip install sdrf-pipelines

Then validate the SDRF file:

parse_sdrf validate-sdrf --sdrf_file sdrf.tsv

You can read more about the validator here.

  • Fork the current repository, add a folder with the ProteomeXchange accession and the annotated sdrf.tsv
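Before running the validator, a quick local check of the header row can catch obvious omissions. The sketch below is only an illustration and does not replace parse_sdrf: the column names in ASSUMED_REQUIRED and the helper missing_columns are assumptions made for this example, so consult the specification for the actual required columns.

```python
import csv
import io

# Assumed column names, for illustration only; the specification
# defines the real set of required columns.
ASSUMED_REQUIRED = ["source name", "assay name", "comment[data file]"]


def missing_columns(tsv_text, required=ASSUMED_REQUIRED):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = [h.strip().lower() for h in next(reader, [])]
    return [name for name in required if name not in header]


# Tiny in-memory SDRF with a header row and one sample row.
example = (
    "source name\tcharacteristics[organism]\tcomment[data file]\n"
    "sample 1\thomo sapiens\trun1.raw\n"
)
print(missing_columns(example))
```

An empty list means every assumed column is present; the full validation rules still live in the sdrf-pipelines validator.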

30 Minutes Guide to MAGE-TAB for Proteomics

Documentation page (https://proteomics-sample-metadata.readthedocs.io/en/latest/)

We have created a 30-minute guide to the file format in the GitHub repository. Additionally, the following materials are relevant for new users:

Core contributors and collaborators

The project is run by different groups:

  • Yasset Perez-Riverol (PRIDE Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
  • Timo Sachsenberg (OpenMS Team, Tübingen University, Germany)
  • Anja Füllgrabe (Expression Atlas Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
  • Nancy George (Expression Atlas Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
  • Mathias Walzer (PRIDE Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
  • Pablo Moreno (Expression Atlas Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
  • Juan Antonio Vizcaíno (PRIDE Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
  • Oliver Alka (OpenMS Team, Tübingen University, Germany)
  • Julianus Pfeuffer (OpenMS Team, Tübingen University, Germany)
  • Marc Vaudel (University of Bergen, Norway)
  • Harald Barsnes (University of Bergen, Norway)
  • Niels Hulstaert (Compomics, University of Gent, Belgium)
  • Lennart Martens (Compomics, University of Gent, Belgium)
  • Expression Atlas Team (European Bioinformatics Institute - EMBL-EBI, U.K.)
  • Lev Levitsky (INEP team, INEPCP RAS, Moscow, Russia)
  • Elizaveta Solovyeva (INEP team, INEPCP RAS, Moscow, Russia)
  • Stefan Schulze (University of Pennsylvania, USA)
  • Veit Schwämmle (Protein Research Group, University of Southern Denmark, Denmark)
  • ProteomicsDB Team (Technical University of Munich, Germany)
  • David Bouyssié (ProFI/IPBS, University of Toulouse, CNRS, Toulouse, France)
  • Nicholas Carruthers (Wayne State University, USA)
  • Paul Rudnick (NCI, Proteomic Data Commons, USA)
  • Enrique Audain (University Medical Center Schleswig-Holstein, Germany)
  • Marie Locard-Paulet (Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark)
  • Johannes Griss (Department of Dermatology, Medical University of Vienna, Austria)
  • Chengxin Dai (Chongqing Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China)
  • Julian Uszkoreit (Medical Faculty, Medizinisches Proteom-Center and Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr-University Bochum, Germany)
  • Dirk Winkelhardt (Medical Faculty, Medizinisches Proteom-Center and Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr-University Bochum, Germany)
  • Kanami Arima (Toyama University of International Studies, Toyama, Japan)
  • Shin Kawano (Toyama University of International Studies, Toyama, Japan)
  • Ruri Okamoto (Toyama University of International Studies, Toyama, Japan)

IMPORTANT: If you contribute to this specification, please make sure to add your name to the list of contributors.

Code of Conduct

As part of our efforts toward delivering open and inclusive science, we follow the Contributor Covenant Code of Conduct for Open Source Projects.

How to cite

  • Dai C, Füllgrabe A, Pfeuffer J, Solovyeva EM, Deng J, Moreno P, Kamatchinathan S, Kundu DJ, George N, Fexova S, Grüning B, Föll MC, Griss J, Vaudel M, Audain E, Locard-Paulet M, Turewicz M, Eisenacher M, Uszkoreit J, Van Den Bossche T, Schwämmle V, Webel H, Schulze S, Bouyssié D, Jayaram S, Duggineni VK, Samaras P, Wilhelm M, Choi M, Wang M, Kohlbacher O, Brazma A, Papatheodorou I, Bandeira N, Deutsch EW, Vizcaíno JA, Bai M, Sachsenberg T, Levitsky LI, Perez-Riverol Y. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat Commun. 2021 Oct 6;12(1):5854. doi: 10.1038/s41467-021-26111-3. PMID: 34615866; PMCID: PMC8494749. Manuscript
  • Perez-Riverol, Yasset, European Bioinformatics Community for Mass Spectrometry. "Towards a sample metadata standard in public proteomics repositories." Journal of Proteome Research (2020) Manuscript.

Copyright notice

This information is free; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This information is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this work; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

proteomics-sample-metadata's People

Contributors

2018g005, baranme, bibishi, daichengxin, davco6, deeptijk, di-hardt, douerww, enriquea, foellmelanie, javizca, julianu, karinschork, kleefischda, levitsky, luxxii, mlocardpaulet, mvaudel, mwang87, ncarrut, patroklossam, pcm32, rudnickp, ruriokamoto, savita-nferx, stschulze, timosachsenberg, veitveit, xuefeiz, ypriverol


proteomics-sample-metadata's Issues

Encoding more information about the Enzyme

@trishorts commented in #18 that it would be possible to encode more information about the enzyme:

We could have a plain-to-read key-value system, but the only one I can come up with is complicated. Still, I think you can keep it all.

Cleavage Before CB=
Cleavage After CA=
Blocked Before BB=
Blocked After BA=
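To make the proposal concrete, here is a minimal sketch of how such a key-value string could be parsed. The separator (";"), the example string and the function name parse_enzyme_rule are assumptions made for illustration; the comment above only proposes the four keys.

```python
# The four keys proposed in the comment, mapped to readable names.
KNOWN_KEYS = {
    "CB": "cleavage before",
    "CA": "cleavage after",
    "BB": "blocked before",
    "BA": "blocked after",
}


def parse_enzyme_rule(text, sep=";"):
    """Parse a string such as 'CA=KR; BB=P' into a dictionary.

    The ';' separator is an assumption; the proposal does not define one.
    """
    rule = {}
    for part in text.split(sep):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        key = key.strip().upper()
        if key not in KNOWN_KEYS:
            raise ValueError(f"unknown key: {key!r}")
        rule[KNOWN_KEYS[key]] = value.strip()
    return rule


# Hypothetical example string, not taken from the specification.
print(parse_enzyme_rule("CA=KR; BB=P"))
```

Rejecting unknown keys keeps typos from silently turning into a different cleavage rule.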

Depletion/enrichment

It seems you have already started the discussion about an enrichment column in #20, but I cannot see this in the code, so:
It would be nice to have some columns to specify the enrichment and depletion methods used, since they could influence the search parameters and the interpretation of the results.
For enrichment, it seems fairly easy, since there are a few commonly used strategies (e.g. phospho or glyco, or even more specifically TiO2 enrichment).
Depletion may be more tricky, so perhaps a yes/no option is a good choice.

MSrun metadata

We probably also need to define units (e.g., as done in mzML) for the CV parameter

Currently the example in README.MD is missing the unit:
"ScanSettings": [
  {
    "accession": "MS:1000016",
    "cvLabel": "MS",
    "name": "scan start time",
    "value": "0.00342666666666667"
  },
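As a sketch of what a unit-aware entry could look like, the snippet below extends the same parameter with unit fields. The three unit field names are hypothetical and only modelled on the unit attributes of an mzML cvParam, and the unit shown (minute, UO:0000031) assumes the value above is expressed in minutes, which the README example does not state.

```python
import json

scan_start_time = {
    "accession": "MS:1000016",
    "cvLabel": "MS",
    "name": "scan start time",
    "value": "0.00342666666666667",
    # Hypothetical unit fields, modelled on mzML cvParam unit attributes.
    # The choice of "minute" is an assumption about the value above.
    "unitCvLabel": "UO",
    "unitAccession": "UO:0000031",
    "unitName": "minute",
}


def has_unit(param):
    """Return True when both the unit accession and unit name are filled in."""
    return all(param.get(key) for key in ("unitAccession", "unitName"))


print(json.dumps(scan_start_time, indent=2))
print(has_unit(scan_start_time))
```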

search-metadata: values as strings/floats/integers/etc

Currently, the type of values is not consistent, as sometimes values are given as integer/float (e.g. "missedCleavages" : 2), sometimes as string representing an integer/float (e.g. "tolerance" : {"value" : "0.6"}).
I think they should either be all given as strings, or the type for each value would need to be defined and then given in that type.
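One way to see the second option in practice is a small normalisation pass that turns numeric-looking strings into numbers. This is only a sketch: the function name coerce is made up for this example, and a real format would more likely declare the expected type of each field explicitly.

```python
def coerce(value):
    """Turn numeric-looking strings into int or float; leave other values as is."""
    if isinstance(value, dict):
        return {key: coerce(item) for key, item in value.items()}
    if isinstance(value, list):
        return [coerce(item) for item in value]
    if isinstance(value, str):
        # Try int first so that "2" stays an integer rather than 2.0.
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
    return value


# The two inconsistent styles mentioned above.
params = {"missedCleavages": 2, "tolerance": {"value": "0.6"}}
print(coerce(params))
```

Guessing types from the text is fragile (for example, an identifier made only of digits would become a number), which is an argument for defining the type per field instead.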

search-metadata: universal and engine specific parameters

The current search-metadata json contains an AnalysisSoftwareList (ASL) to list the employed engines and a SpectrumIdentificationProtocolList (SIPL) to list the used parameters.

I wanted to discuss here, how this deals with multiple engines, especially in regard to engines having common and unique parameters?

  1. If the position in ASL corresponds to the position in SIPL, then all common parameters would occur multiple times in the different list entries.
  2. If the position is not the same for both lists, but SIPL has just one list with all parameters, it is not clear which parameter belongs to which engine (for unique parameters).

It might also be worth it to first discuss if in general all parameters should be included or only the "important" or "common" ones, which then implies that someone would need to define which ones that would be.

I would be in favor of including all parameters and rather following 2. than 1.
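A possible shape for option 2 that still resolves the ambiguity is to keep one flat parameter list and tag engine-specific entries with the engines they apply to. Everything in this sketch is hypothetical: the engine names, the parameter names and the "engines" key are not part of the current search-metadata JSON.

```python
# Placeholder engine names standing in for the AnalysisSoftwareList.
analysis_software = ["EngineA", "EngineB"]

# One flat parameter list. "engines": None marks a parameter shared by
# all engines; otherwise the list names the engines it applies to.
parameters = [
    {"name": "missedCleavages", "value": 2, "engines": None},
    {"name": "exampleOnlyForA", "value": "x", "engines": ["EngineA"]},
]


def parameters_for(engine, params):
    """Collect the common parameters plus those specific to one engine."""
    return {
        p["name"]: p["value"]
        for p in params
        if p["engines"] is None or engine in p["engines"]
    }


for engine in analysis_software:
    print(engine, parameters_for(engine, parameters))
```

Common parameters are stored once, and each unique parameter still points back to its engine.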

Encoding PTMs parameters into one-line Experimental Design

@hbarsnes @mvaudel @StSchulze

We have continued working with the metadata experimental design.

See example, https://github.com/PRIDE-Archive/pride-metadata-standard/tree/master/experimental-design#2-the-sample-and-data-relationship-format

However, if we want to encode search parameters, it would be great to encode PTMs and other search parameters as key-value pairs. I have seen that MSGF+, Comet and MaxQuant encode PTMs as single-line strings, which is great, because we can encode PTM variables as a string that will be easy to translate into the search strings.

MSGF+ :

StaticMod=C2H3N1O1,     C,  fix, any,       Carbamidomethyl       # Fixed Carbamidomethyl C (alkylation)
StaticMod=229.1629,     *,  fix, N-term,    TMT6plex
StaticMod=229.1629,     K,  fix, any,       TMT6plex

Comet:

variable_mod1 = 15.9949 M 0 3
variable_mod2 = 0.0 X 0 3
variable_mod3 = 0.0 X 0 3
variable_mod4 = 0.0 X 0 3
variable_mod5 = 0.0 X 0 3
variable_mod6 = 0.0 X 0 3

CRUX:

C+57.02146,2M+15.9949,1STY+79.966331

I think we can propose a way to encode these PTMs as strings within the metadata files.

Name ; aminoacid; type; position; UnimodAccession

Where:
Name: Name of the modification.
aminoacid: Target amino acid.
type: Fixed, Variable, Custom
position: Any, N-Term, Protein N-term
UnimodAccession: Unimod accession

The Unimod accession can be replaced with delta mass.
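A minimal sketch of a parser for the proposed five-field string follows. The class and function names are made up for this example, the example line is illustrative, and the check on the type field simply mirrors the three values listed above.

```python
from typing import NamedTuple

ALLOWED_TYPES = {"fixed", "variable", "custom"}


class Modification(NamedTuple):
    name: str
    amino_acid: str
    mod_type: str
    position: str
    accession_or_mass: str  # Unimod accession, or a delta mass as text


def parse_modification(line):
    """Split 'Name ; aminoacid; type; position; UnimodAccession' into fields."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    mod = Modification(*fields)
    if mod.mod_type.lower() not in ALLOWED_TYPES:
        raise ValueError(f"unknown modification type: {mod.mod_type!r}")
    return mod


# Illustrative line for a fixed carbamidomethylation of cysteine.
mod = parse_modification("Carbamidomethyl; C; Fixed; Any; UNIMOD:4")
print(mod)
```

Because the last field is kept as text, the same parser accepts a delta mass in place of the accession.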

The 100 most downloaded projects from PRIDE

It would be great if we could annotate this list:

PXD004732 -- ProteomeTools -- 104608
PXD000561 -- A draft map of the human proteome -- 97948
PXD002815 -- A human interactome in three quantitative dimensions organized by stoichiometries and abundances -- 50187
PXD005336 -- Target Landscape of Clinical Kinase Inhibitors -- 46021
PXD004452 -- HeLa proteome of 12,250 protein-coding genes -- 38148
PXD010154 -- A deep proteome and transcriptome abundance atlas of 29 healthy human tissues -- 37040
PXD000865 -- Mass spectrometry based draft of the human proteome -- 29908
PXD010595 -- ProteomeTools – Part II -- 26210
PXD008840 -- A proteomic landscape of diffuse-type gastric cancer -- 24881
PXD005130 -- Extracellular Matrix Proteomics Identifies Molecular Signature of Atherosclerotic Plaques from Symptomatic Patients -- 19620
PXD004242 -- Plasma Proteome Profiling Reveals the Effects of Weight Loss on the Apolipoprotein Family and Systemic Inflammation Status -- 19439
PXD005354 -- Pharmacoproteomic characterisation of human colon and rectal cancer - CRC65 Full Proteomes -- 19111
PXD005946 -- Global Proteome Analysis of the NCI-60 Cell Line Panel, part 3 -- 17832
PXD001406 -- Impact of Regulatory Variation from RNA to Protein -- 17559
PXD008750 -- Proteogenomic Analysis of Medulloblastoma -- 15225
PXD006512 -- Proteome Landscape of Early Stage Hepatocellular Carcinoma Identifies Proteomic Subtypes and New Therapeutic Targets -- 15018
PXD009602 -- Mapping the HLA ligandome of Colorectal Cancer Reveals an Imprint of Malignant Cell Transformation -- 14829
PXD000065 -- HiRIEF deep proteomics of A431 human vulvar cancer cell line and N2A mouse neuroblastoma cell line -- 13169
PXD006537 -- Large Cohort Proteomics Profiling of Human Brain Cortex -- 13636
PXD001796 -- Integration of transcriptome and proteome annotation in the naive Ixodes ricinus midgut with genome sequencing -- 11520
PXD000138 -- Synthetic (Phospho)Peptide Library -- 11505
PXD000612 -- Ultra-deep human phosphoproteome reveals different regulatory nature of Tyr and Ser/Thr-based signaling -- 11116
PXD000953 -- The Pan-Human Library: A repository of assays to quantify 10 000 proteins by SWATH-MS -- 11028
PXD013231 -- Analysis of 1,508 plasma samples of the DiOGenes study - Robust, single shot capillary flow data-independent acquisition to decipher proteomic profiles of weight loss and maintenance -- 10606
PXD007048 -- Cell-specific proteome analyses of human bone marrow upon aging. -- 10065
PXD002952 -- LFQbench enables a multi-centered benchmark study demonstrating robust proteomic label-free quantification -- 9952
PXD006675 -- Region and cell-type resolved quantitative proteomic map of the human heart and its application to atrial fibrillation -- 9856
PXD002322 -- Panorama of ancient metazoan macromolecular complexes: Homo sapiens -- 9504
PXD007635 -- The immunopeptidomic landscape of ovarian carcinoma -- 9450
PXD006607 -- Proteomic analysis of human Medulloblastoma reveals distinct activated pathways between subgroups -- 10561
PRD000721 -- Phosphoproteomics of leaf growth in maize -- 8913
PXD006895 -- SubCellBarcode: Scalable proteome-wide mapping of protein localization and relocalization -- 8749
PXD002319 -- Panorama of ancient metazoan macromolecular complexes: Caenorhabditis elegans -- 8601
PXD004397 -- Normal human mitral valve proteome: a preliminary investigation by gel-based and gel-free proteomic approaches -- 8462
PRD000066 -- Quantitative Proteomics Analysis of the Secretory Pathway -- 8459
PXD001250 -- Cell-type and brain-region resolved mouse brain proteome -- 8411
PXD002255 -- Enrichment strategy for searching missing protein -- 8411
PXD005353 -- Pharmacoproteomic characterisation of human colon and rectal cancer - CPTAC Full Proteomes -- 8164
PXD008675 -- Gut Microbial Ecosystem Multiomics in Inflammatory Bowel Disease -- 8158
PXD005520 -- Human Plasma Gel Filtration Fractions MS Data -- 7697
PXD006003 -- Proteomics of melanoma response to immunotherapy reveals dependence on mitochondrial function -- 8011
PXD004894 -- Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry -- 7013
PXD000279 -- MaxLFQ label-free quantification algorithm benchmark datasets -- 6856
PXD006291 -- HiRIEFII proteomics and proteogenomics of A431 cells and five histologically normal tissues -- 6815
PXD002870 -- Protein Turnover Rates in Normal and Hypertrophy Mouse Hearts -- 6635
PXD000117 -- Human exosome proteome -- 6455
PXD000815 -- Breast cancer tumors -- 6452
PRD000004 -- GPMDB Submission: Plasma Proteome -- 6418
PXD010271 -- Individual variability of protein expression in human tissues -- 6381
PXD005940 -- Global Proteome Analysis of the NCI-60 Cell Line Panel -- 6141
PXD009348 -- Plasma proteome profiling reveals global and specific changes of inflammatory and lipid homeostasis markers after bariatric surgery -- 6043
PXD003903 -- HipSci project pilot submission for 18 IPS cell lines -- 6025
PXD002486 -- Target identification for 11q13 amplicon -- 6018
PXD004785 -- Proteogenomics reanalysis of human testis data -- 5912
PXD006201 -- Characterisation of protein ubiquitination using UbiSite technology -- 5903
PXD005235 -- Genomic determinants of protein abundance variation in colorectal cancer cell lines -- 5852
PXD004023 -- The immunopeptidome presents a selected portion of the human genome with distinct features to CD8 T cells -- 5769
PXD010429 -- Proteogenomic Landscape of Squamous Cell Lung Cancer 2: Tandem Mass Tag Datasets -- 5579
PXD004682 -- Comparison of Lung Cancer Proteome Profiles 1: Label Free Quantification -- 5573
PXD005445 -- CEGS Proteomics -- 5517
PXD003469 -- Functional Proteogenomics Reveals Biomarkers and Therapeutic Targets in Lymphomas -- 5282
PXD003271 -- Proteomic analysis of clear cell renal cell carcinoma tissue versus matched normal kidney tissue -- 5168
PXD004352 -- Social Network Architecture of Human Immune Cells Unveiled by Quantitative Proteomics -- 5152
PXD002619 -- Proteomic maps of breast cancer subtypes -- 5061
PXD007160 -- Global quantitative analysis of the human brain proteome in Alzheimer’s and Parkinson’s Disease -- 4949
PXD006833 -- Human bladder,colon,kidney,liver cancer LC MS/MS -- 4932
PXD006122 -- Proteomics on post-mortem human brain tissue of patients with Alzheimer`s, Parkinson`s and Lewy body dementias. -- 4909
PXD001471 -- Proteomic analysis of cellular soluble proteins from human bronchial smooth muscle cells by combining nondenaturing micro 2DE and quantitative LC-MS/MS: Preparation of more than 4000 native protein maps -- 4870
PXD001212 -- A mammalian transcription factor-specific peptide repository for targeted proteomics -- 4827
PXD002854 -- Plasma proteome profiling to assess human health and disease -- 4734
PXD003907 -- Human Fecal Gut Microbiome proteomics -- 4694
PXD000197 -- AKT2 interacting proteins_ MBP- and MAP-TAPs -- 4669
PXD008222 -- The Proteomic Landscape of Triple-Negative Breast Cancer -- 4547
PXD002081 -- CPTAC, TCGA Cancer Proteome Study of Colorectal Tissue: Proteome, VU, part 2 -- 4456
PXD000948 -- Macro-porous Reversed Phase Separation of Proteins Combined with Reversed Phase Separation of Phosphopeptides and Tandem Mass Spectrometry for Profiling the Phosphoproteome of MDA-MB-231 cells -- 4433
PXD002080 -- CPTAC, TCGA Cancer Proteome Study of Colorectal Tissue: Proteome, VU, part 1 -- 4377
PXD002395 -- 11 human cell lines -- 4357
PXD010899 -- Plasma Hirief -- 4330
PXD001064 -- Quantitative variability of 342 plasma proteins in a human twin population -- 4257
PXD002082 -- CPTAC, TCGA Cancer Proteome Study of Colorectal Tissue: Proteome, VU, part 3 -- 4240
PXD001020 -- Redox-state of pentraxin 3 as a novel biomarker for  resolution of inflammation and survival in sepsis -- 4204
PXD002801 -- Defining the effects of genetic variation on a proteome-wide scale -- 4177
PXD006932 -- Performance evaluation of the Q Exactive HF-X for shotgun proteomics -- 4164
PXD002086 -- CPTAC, TCGA Cancer Proteome Study of Colorectal Tissue: Proteome, VU, part 7 -- 3991
PXD009270 -- Comprehensive analysis of proteomic landscape remodeling during carcinogenesis -- 3975
PXD004193 -- A compendium of RNA-binding proteins that regulate microRNA biogenesis -- 3914
PXD000321 -- Signatures for Mass Spectrometry Data Quality, part 2 of 5 -- 3908
PRD000073 -- HUPO Plasma Proteome Project 2 (ProteomExchange) -- 3907
PXD002087 -- CPTAC, TCGA Cancer Proteome Study of Colorectal Tissue: Proteome, VU, part 8 -- 3841
PXD002088 -- CPTAC, TCGA Cancer Proteome Study of Colorectal Tissue: Proteome, VU, part 9 -- 3813
PXD003133 -- HEK293 Quality Control example dataset -- 3727
PRD000644 -- Protemic analysis of human hippocampus shows differential protein expression in the different hippocampal subfields -- 3712
PXD005573 -- Optimization of Experimental Parameters in Data-Independent Mass Spectrometry Significantly Increases Depth and Reproducibility of Results -- 3676
PXD002412 -- A method to characterize antibody selectivity immunoprecipitation -- 3671
PXD005554 -- Analysis of ECM-enriched samples using different methods -- 3665
PXD002084 -- CPTAC, TCGA Cancer Proteome Study of Colorectal Tissue: Proteome, VU, part 5 -- 3639
PXD010557 -- HipSci: the  iPSC proteomic compendium -- 3643
PXD002049 -- CPTAC proteomic analysis of TCGA colon and rectal carcinomas using standard and customized databases, part 9 -- 3567
PXD004896 -- N-terminomics Proteogenomics -- 3560

Suggestions about reporting of modifications

As far as I can see, if we want the file to be computer readable, one of these two options MUST take place:

  • 'Database Accession' for modifications is made Mandatory (at the moment it is optional).

OR

  • If 'Database Accession' is still optional, then 'Monoisotopic mass' MUST be Mandatory (at the moment it is optional).

This would cover cases like novel modifications with no name (but with a known delta mass).

I am not sure how this can be expressed in terms of cardinality in the table.
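Even if the rule is hard to express as table cardinality, it is easy to state as a check: a modification is acceptable when it has an accession or a monoisotopic mass. The sketch below assumes a dictionary with the hypothetical keys "accession" and "monoisotopic_mass"; neither name comes from the specification.

```python
def modification_is_identifiable(mod):
    """Return True if the modification has an accession or a monoisotopic mass."""
    has_accession = bool(str(mod.get("accession") or "").strip())
    mass = mod.get("monoisotopic_mass")
    # bool is a subclass of int in Python, so exclude it explicitly.
    has_mass = isinstance(mass, (int, float)) and not isinstance(mass, bool)
    return has_accession or has_mass


print(modification_is_identifiable({"accession": "UNIMOD:4"}))
print(modification_is_identifiable({"monoisotopic_mass": 57.021464}))
print(modification_is_identifiable({"name": "novel modification"}))
```

The second call covers the case of a novel modification with no name but a known delta mass.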

Also, if 'Database Accession' is made mandatory, then separate information about the 'Target Amino acid' is less essential.

Finally I would change 'Database accession' to 'Modification accession'. It is much more intuitive.

JSON format for each MSRun

I open this issue to discuss the MSRun JSON format. In this first draft we divided the JSON into two main parts: one with the MSRun file metadata (filename, checksum, organism, instrument, etc.) and another one with the data itself. The "contextSource" makes it possible to distinguish whether the data relates to the peptides inside the sample or to the sample as a whole. We also follow the QC code standards defined by the HUPO-PSI:

https://github.com/HUPO-PSI/qcML-development

Not all the QC parameters that we have in our database are defined yet in the latest HUPO-PSI specification, but we hope that they will be soon.

sample description ambiguity

What if we have more than one organism in a sample?
Notable examples would be metaproteomics or xenograft samples.

Other experimental readouts, external identifiers

A raw file is never acquired alone: experimental designs include other experimental readouts like a QC gel, functional follow-up, or other molecular data (sequencing, metabolites, protein properties characterization). Should these be included as well or do you want to stick to the MS file? If external identifiers are available (e.g. sequencing or metabolites), it would be extremely useful to have them linked here.

search-metadata: input files

In the current search-metadata json format I don't see entries for the input/output file(s) that correspond to the search-metadata json. However, to my understanding, one PRIDE project can have multiple result files corresponding to the analysis of different input files (at least using ProteomeXchange one also needs to specify then, which files belong together). Since search parameters might vary, the corresponding input/output file(s) would need to be specified, in my opinion.

Comments on proposed proteomics SDRF spec

@ypriverol I've had a look at the guidelines on the experimental-design page and here are my comments from the transcriptomics (ArrayExpress/Expression Atlas) usage point-of-view. I hope this makes sense and is helpful!

"The SDRF file is a tab-delimited format where each ROW corresponds to a Sample"

This is not quite right, at least from the MAGE-TAB definition there is no strict rule like that. It's more about the relationship between the elements in each node column.
Of course you can define it that strictly in your rule set, but it will limit your options how you may represent the experimental flow graph.
One example: Sample 1 generates 2 raw data files, either by technical replication or because the method does it like this. Going with one row per sample you can do:
Sample 1 -> File1 -> File2
Now what would you do if you have processed data generated from these two raw data files?
Sample 1 -> File1 -> File2 -> ProcessedFile1
Great! But what if you have two processed files, where processed file 1 was generated from raw file 1 and processed file 2 was generated from raw file 2? You can't express this relationship in one row.
The alternative could be:
Sample 1 -> File1 -> ProcessedFile1
Sample 1 -> File2 -> ProcessedFile2
This also represents a neutral relationship between File1 and File2 and wouldn't inadvertently imply that File2 was generated out of File1.
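The difference between the two layouts can be made explicit by reading each row as a chain of "derived from" links between neighbouring columns. This is only a sketch of that reading, with a made-up helper name, and not how any MAGE-TAB parser is implemented.

```python
def derived_from(rows):
    """Read every row as a chain and return the (parent, child) links it implies."""
    edges = set()
    for row in rows:
        for parent, child in zip(row, row[1:]):
            edges.add((parent, child))
    return edges


# One row per sample: File2 appears to be derived from File1.
one_row = [("Sample 1", "File1", "File2", "ProcessedFile1")]

# One row per data file: File1 and File2 are independent of each other.
two_rows = [
    ("Sample 1", "File1", "ProcessedFile1"),
    ("Sample 1", "File2", "ProcessedFile2"),
]

print(("File1", "File2") in derived_from(one_row))
print(("File1", "File2") in derived_from(two_rows))
print(sorted(derived_from(two_rows)))
```

In the two-row layout each processed file is linked only to the raw file it came from.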

" for some samples the value is unknown. In those cases users SHOULD use NA."

NA is ambiguous so we don't use it and prefer the explicit versions: "not available" (for unknown) and "not applicable" (for attribute not relevant)

" Case sensitive"

OK. What are your rules on white space characters? By definition SDRF is insensitive to spaces, so "SourceName" = "source name"
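A whitespace-insensitive comparison of headers can be sketched as below. Removing all whitespace follows the comment above; folding case as well is an assumption needed to make "SourceName" and "source name" compare equal, and it would conflict with a strictly case-sensitive rule.

```python
def normalize_header(header):
    """Drop all whitespace and fold case so equivalent headers compare equal."""
    return "".join(header.split()).lower()


print(normalize_header("SourceName") == normalize_header("source name"))
print(normalize_header("Term Source REF"))
```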

"… format developed by the RNA-Seq community."

This link will go away soon; I would link to the FGED page instead: http://fged.org/projects/mage-tab/

" the source database or file for ontology terms in these columns may be given in an adjacent “Term Source” column immediately to the right of the “Characteristics” column. In the absence of a “Term Source” column the value is assumed to be user defined."

I don't quite understand what you mean here. Can you give an example of what this refers to and how the "Term Source" is going to be used?
Also the field is called "Term Source REF" in MAGE-TAB.

" characteristics[phenotype]: sample treated with drug A"

As a curation side-note: "phenotype" is not a good term to describe that attribute. Better annotation is "compound: drug A" and "compound: none" (for the control). The second reason why this is better is that "drug A" can be mapped to the ontology term for drug A, while "sample treated with drug A" is not an ontology term.

"Multiple values (columns) for the same Characteristics term are allowed."

MAGE-TAB allows this but we don't recommend this. If you have multiple phenotypes, you can specify what it refers to or use another more specific term, e.g. "immunophenotype". We rather use a non-ontology term than repeating an ontology-compliant category. Computers don't like this either.

" Users can provide the corresponding URI of the ontology/CV term as a value"

You allow this but also the "Term Source REF" (see above, maybe I got this wrong), which one is preferred?

Before going over to the run section, what I was missing in the Source/Sample section is the requirement that the value for the Source Name should be unique for each biological replicate. I think it needs clarification what entity should be regarded as a unique source, i.e. when do you create separate identifiers and when do you use the same. In our case, we create separate samples for sub-specimens from the same donor.
Looking at https://github.com/bigbio/proteomics-metadata-standard/blob/master/annotated-projects/PXD012203/sdrf.txt: there are normal and disease samples grouped under the same source name linked to the same data file (How is this possible? Although this might be outdated).
On the other hand https://github.com/bigbio/proteomics-metadata-standard/blob/master/annotated-projects/PXD002049/sdrf.tsv has separate samples for separate fractions of the same experimental sample. You probably should discuss what should be the standard for this common case. In case you go for separate source identifiers, it makes sense to think about another attribute (column) that links the fractions from the same source sample (similarly to technical replicates). Or you could treat them the same as technical replicates (see next point, replace run with fraction).

"The use of Comment is mainly aimed at differentiating Sample Characteristics from the Sample/MSrun properties."

I don't think I like this idea much. The MAGE-TAB way to distinguish between different entities is adding another "node". As we discussed briefly maybe adding "Assay Name" can help here to mark the transition from the sample characteristics to the MS run characteristics. The Assay Name column would hold some identifier of the run. This is also very useful to denote multiple runs from the same sample (technical replicates).
Example:

|===
| Source Name | Characteristics […] | Assay Name | Characteristics […] | Data File
| Sample 1 | x | Run 1 | y | File 1
| Sample 1 | x | Run 2 | z | File 2
|===

Note: we use only Comment columns after Assay Name, but there is no good reason for it. Either Comment or Characteristics is fine.
Another argument for introducing assay is that "assay" is also the terminology used by the DSP data model, so it's generally compliant with the EBI-wide efforts.
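
A minimal sketch of how an Assay Name column could be used to find technical replicates, written in Python. The rows are transcribed from the example above; the function name `runs_per_source` is hypothetical, and the rule "more than one run per source means technical replicates" is an assumption that only holds for unfractionated samples.

```python
from collections import defaultdict

# Rows transcribed from the example above; only the columns needed here.
rows = [
    {"Source Name": "Sample 1", "Assay Name": "Run 1", "Data File": "File 1"},
    {"Source Name": "Sample 1", "Assay Name": "Run 2", "Data File": "File 2"},
]


def runs_per_source(rows):
    """Map each Source Name to the list of Assay Names (runs) derived from it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Source Name"]].append(row["Assay Name"])
    return dict(grouped)


# Assumption: a source with more than one run has technical replicates.
replicated = {src: runs for src, runs in runs_per_source(rows).items() if len(runs) > 1}
print(replicated)
```

Because the run identifier lives in its own column, this grouping needs no parsing of file names.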

"Comment [Label]"

"Label" without comment is actually an allowed SDRF header. Why not use it? Especially since it might be mandatory?

"source id"

Is source id the same as source name?

technical replicate examples

I don't understand your technical replicate examples. We seem to be using different strategies here. The same file name is duplicated here but links to different source ids and fraction numbers. Does this mean the same file contains data from multiple entities?
E.g. "000261_C05_P0001563_A00_B00K_F1_TR1.RAW" is both annotated as technical replicate 1 and 2, what is the logic here?
Also it seems in this part you are switching back to the old style of annotating the data file rather than the sample going through the stages of processing with the data file being the last column (as it's the final output).

"comment[associated file uri]"

I'm being picky here, but do you need "associated"? What does associated refer to? Why not Comment[file uri]? The association comes naturally by the position of the value in the table (same row next to the data file column).

modification parameters

The description of the modifications with key/value pairs is a clever idea, I like that. I just think these annotations should come before the data file name because the modifications are done to the sample before the assay (MS run), right? This keeps the order of the columns following the experimental workflow (which generally makes understanding an experiment just by looking at the SDRF much easier). As said above, having three sections for 1) sample 2) assay 3) data files makes the structure clearer.

Minimum sample attributes

"disease" does apply to plants. But you would want "age" or "developmental stage" for plants, this is important (so at least mark it as optional). "strain/breed" for plants would be called "ecotype/cultivar".
"cell type" is not always applicable. Maybe that was what you wanted to say with the asterisk but there is no footnote on this.
Did you have any thoughts on units e.g. for age attribute?

Some errors to correct

Discordance in FTP URL in column 10 in sdrf file (comment[associated file uri])

In the annotation of project PXD000612, the FTP URL does not exist for the following samples:

pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep4_pH11.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep4_pH3.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep4_pH4.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep4_pH5.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep4_pH6.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep4_pH8.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep5_pH11.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep5_pH3.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep5_pH4.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep5_pH5.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep5_pH6.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep5_pH8.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep6_pH11.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep6_pH3.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep6_pH4.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep6_pH5.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep6_pH6.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Control_rep6_pH8.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep1_pH11.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep1_pH3.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep1_pH4.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep1_pH5.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep1_pH6.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep1_pH8.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep2_pH11.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep2_pH3.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep2_pH4.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep2_pH5.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep2_pH6.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep2_pH8.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep3_pH11.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep3_pH3.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep3_pH4.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep3_pH5.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep3_pH6.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep3_pH8.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep4_pH11.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep4_pH3.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep4_pH4.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep4_pH5.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep4_pH6.raw
pY_EXQ5_KiSh_SA_LabelFree_HeLa_Proteome_Nocodazole_rep4_pH8.raw

and for the rest of the samples the URL is different e.g.:
For 20120309_EXQ5_KiSh_SA_LabelFree_HeLa_Phospho_EGF_rep4_FT3.raw
instead of this:
ftp://ftp.pride.ebi.ac.uk/pride/archive/projects/PXD000612/20120309_EXQ5_KiSh_SA_LabelFree_HeLa_Phospho_EGF_rep4_FT3.raw
it should be this:
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2014/08/PXD000612/20120309_EXQ5_KiSh_SA_LabelFree_HeLa_Phospho_EGF_rep4_FT3.raw
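
The rewrite from the first URL form to the second could be sketched in Python as follows. Both prefixes are copied from the example above; the function name `fix_pride_uri` is hypothetical, and applying the `2014/08` path segment to every file of PXD000612 is an assumption that the single example does not confirm.

```python
# Prefixes copied from the two example URLs above.
OLD_PREFIX = "ftp://ftp.pride.ebi.ac.uk/pride/archive/projects/PXD000612/"
NEW_PREFIX = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2014/08/PXD000612/"


def fix_pride_uri(uri: str) -> str:
    """Rewrite a URI that starts with the old prefix; leave any other URI unchanged."""
    if uri.startswith(OLD_PREFIX):
        return NEW_PREFIX + uri[len(OLD_PREFIX):]
    return uri


if __name__ == "__main__":
    name = "20120309_EXQ5_KiSh_SA_LabelFree_HeLa_Phospho_EGF_rep4_FT3.raw"
    print(fix_pride_uri(OLD_PREFIX + name))
```

This only changes the string; whether the file actually exists at the new location still has to be checked against the FTP server.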

Data without fractions

Hi all,

thanks for your efforts to generate better metadata annotations :)

In the Experiment Design description I am missing an explanation on how to handle unfractionated data. Comment [Fraction Identifier] is included in the templates, therefore I assume one should not delete the column. So far I have decided to enter "1" for every dataset that has no fractions. Would it be more clear to write "not applicable"?

Maybe you can add one sentence to clarify this in section "3. From Samples to Assay (MSRun)".

Best,
Melanie

Minor Refinements in Spec Document 3

In the case of the "label free sample" ontology/CV term, I guess that NA would also be possible here? Or maybe we want to discourage this use, and make this a MUST?

MS2 analyzer type and fragmentation energy

  1. For some Orbitraps it's possible to measure fragments in the ion trap or the Orbitrap. It would be nice to add this information (probably next to the comment [instrument]). It may be a bit redundant since there is information about fragment tolerance, but an explicit "MS2 analyser type" column would be a more straightforward characterization of the experiment.
  2. There seems to be no way of specifying HCD/CID energy. Since some tools can predict fragmentation spectra for particular energy, this parameter may be helpful with in silico libraries for DIA and different dot product scores.

Minor refinements in Spec Document 2

In general there are too many SHOULD statements in the text. I think it would be good to change them to MUST in a number of cases:

For instance, "In case of multiplex experiments such as TMT, SILAC, and/or ITRAQ the corresponding label SHOULD be added". Replace SHOULD with MUST.

We need to improve the examples

  • Is it Characteristics[data file] (PXD000612) or Comment[data file] (NCI60)?
  • Also using 'sample n' for a Source name is misleading if there is a Sample column (and if there is not, too).
  • What keeps us from using/adapting the Labeled extracts column?

Adding Experiment type to the sample metadata

I think it would be a good idea to include the Proteomics Data Acquisition Method to explain the type of experiment, this term is in the NCI Thesaurus OBO Edition ontology and the terms are:

  • Data-Dependent Acquisition
  • Data-Independent Acquisition
  • Parallel Reaction Monitoring
  • Selected Reaction Monitoring
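
If such a column were added, its values could be checked against the four terms listed above. This is a sketch only: the function name `is_known_acquisition_method` is hypothetical, matching is done on the term labels and not on ontology accessions, and the case-insensitive comparison is an assumption.

```python
# The four term labels listed above, lower-cased for comparison.
ACQUISITION_METHODS = {
    "data-dependent acquisition",
    "data-independent acquisition",
    "parallel reaction monitoring",
    "selected reaction monitoring",
}


def is_known_acquisition_method(value: str) -> bool:
    """Return True if the value matches one of the listed labels, ignoring case and outer spaces."""
    return value.strip().lower() in ACQUISITION_METHODS


if __name__ == "__main__":
    for value in ["Data-Dependent Acquisition", "DDA"]:
        print(value, is_known_acquisition_method(value))
```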

Minor refinements in SDRF Spec Document

In the table:

|===
| source id | comment[data file] | comment[label] | comment[fraction identifier] | comment[technical replicate]
| 1 | 000261_C05_P0001563_A00_B00K_F1_TR1.RAW | label free sample | 1 | 1
| 2 | 000261_C05_P0001563_A00_B00K_F2_TR2.RAW | label free sample | 2 | 2
| 3 | 000261_C05_P0001563_A00_B00K_F1_TR1.RAW | label free sample | 1 | 2
| 4 | 000261_C05_P0001563_A00_B00K_F2_TR2.RAW | label free sample | 2 | 2
|===

As far as I can see, four different raw files are needed here, but at the moment there are only two. Each fraction is normally a different raw file.

The same applies to the following table in the text:

|===
| source id | comment[data file] | comment[label] | comment[fraction identifier] | comment[technical replicate] | comment[biological replicate]
| 1 | 000261_C05_P0001563_A00_B00K_F1_TR1.RAW | label free sample | 1 | 1 | patient 1
| 2 | 000261_C05_P0001563_A00_B00K_F2_TR2.RAW | label free sample | 2 | 2 | patient 1
| 3 | 000261_C05_P0001563_A00_B00K_F1_TR1.RAW | label free sample | 1 | 2 | patient 1
| 4 | 000261_C05_P0001563_A00_B00K_F2_TR2.RAW | label free sample | 2 | 2 | patient 1
| 5 | 000261_C05_P9999999_A00_B00K_F1_TR1.RAW | label free sample | 1 | 1 | patient 2
| 6 | 000261_C05_P9999999_A00_B00K_F2_TR2.RAW | label free sample | 2 | 2 | patient 2
| 7 | 000261_C05_P9999999_A00_B00K_F1_TR1.RAW | label free sample | 1 | 2 | patient 2
| 8 | 000261_C05_P9999999_A00_B00K_F2_TR2.RAW | label free sample | 2 | 2 | patient 2
|===
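
The repeated file names in these tables can be found mechanically. Below is a minimal Python sketch that uses the four rows of the first table; the function name `repeated_data_files` is hypothetical, and it assumes the rule stated above that each row should point to its own raw file.

```python
from collections import Counter

F1 = "000261_C05_P0001563_A00_B00K_F1_TR1.RAW"
F2 = "000261_C05_P0001563_A00_B00K_F2_TR2.RAW"

# Rows transcribed from the first table above.
rows = [
    {"source id": "1", "comment[data file]": F1,
     "comment[fraction identifier]": "1", "comment[technical replicate]": "1"},
    {"source id": "2", "comment[data file]": F2,
     "comment[fraction identifier]": "2", "comment[technical replicate]": "2"},
    {"source id": "3", "comment[data file]": F1,
     "comment[fraction identifier]": "1", "comment[technical replicate]": "2"},
    {"source id": "4", "comment[data file]": F2,
     "comment[fraction identifier]": "2", "comment[technical replicate]": "2"},
]


def repeated_data_files(rows, column="comment[data file]"):
    """Return every file name that is listed on more than one row, with its row count."""
    counts = Counter(row[column] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


print(repeated_data_files(rows))
```

For the first table both file names are reported twice, which matches the observation that only two of the four expected raw files are present.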

source vs individual vs replicate

Hi,

it would be great to add more clarity and/or more examples about the relationship between:
source name, individual and replicates

I understand individual as individual (human) being. In most (or even all?) cases this will correspond to biological replicate. As long as there are no technical replicates it will also correspond to source name. Or how exactly is source defined? If one tissue is cut into 5 pieces I assume this corresponds to 5 different sources?

Cheers,
Melanie

experimental design 'file_path'

do we want file paths?
Might be a good idea but should then be relative to the submission folder.
On the other hand, this will then reduce immediate reusability.

search-metadata: complex analysis pipelines

I would like to discuss here how the search-metadata JSON could deal with complex analysis pipelines.
The current format is based on the mzIdentML format, which doesn't support workflows (i.e. giving the order in which the engines were used or in which parameters were changed). However, this would be important for slightly more complex workflows like cascaded searches or similar.
I guess one question in this regard would be how often this occurs (i.e. results from a pipeline more complex than just a simple identification engine + validation/filtering). But since PRIDE, in principle, also supports quantification results, I would imagine it to be quite relevant.

One way might be to include a "history" entry, similar to what we did in Ursgal (https://github.com/ursgal/ursgal) or the requirement to upload intermediate result files.

Clarify the reporting of channels for multiplex experiments (e.g. TMT)

From what I understand from the spec file, the way to report labels would be to use the column 'comment[label]'. But the only example there is a 'label free sample'.

  • For TMT experiments the SDRF uses the PRIDE ontology terms under sample label. Here are some examples of TMT channels:

TMT126, TMT127, TMT127C, TMT127N, TMT128, TMT128C, TMT128N, TMT129, TMT129C, TMT129N, TMT130, TMT130C, TMT130N, TMT131

So, this means that the actual channels are included under 'Label'.

One cleaner way would be to use 'Label' to indicate the overall label (TMT in this case) and a different column comment[channel] for the actual channel.

But in any case this needs to be clarified.

Also an example needs to be added for SILAC, because in this case, the actual channels would be in fact different modifications in Unimod/PSI-MOD.
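
The proposed split into an overall label and a separate channel could be sketched in Python as below. The function name `split_tmt_label` and the idea of deriving both values from the existing term are assumptions for illustration; the pattern only covers names shaped like the TMT examples listed above and does not handle SILAC.

```python
import re

# "TMT" followed by a three-digit reporter mass and an optional N/C suffix.
_TMT_PATTERN = re.compile(r"^(TMT)(\d{3}[NC]?)$")


def split_tmt_label(label: str):
    """Split e.g. 'TMT127C' into ('TMT', '127C'); return (label, None) for anything else."""
    text = label.strip()
    match = _TMT_PATTERN.match(text)
    if match is None:
        return text, None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for value in ["TMT126", "TMT127C", "label free sample"]:
        print(value, "->", split_tmt_label(value))
```

With this split, the first value would go under 'Label' and the second under a hypothetical comment[channel] column.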
