
A centralized location for storing curated data from cBioPortal


datahub's Introduction

cBioPortal Public Datahub

The datahub is a repository for data storage only. It contains staging files which are validated and can be loaded directly into the cBioPortal.

Behind the scenes, git-lfs (https://github.com/github/git-lfs) is used to manage the large files.

Test Status

Validation status of all studies on the Datahub master branch. Validation runs weekly using the validation code from the cBioPortal master branch. It also checks whether the studies on cbioportal.org and on Datahub are in sync.

How to Download Data

Downloading zip files of individual studies

On the cbioportal.org datasets page, a zipped file with the staging files of each study can be downloaded. These zip files are compressed versions of the study folders in the master branch of this repository.

Example: downloading an individual study with git-lfs

It is also possible to download uncompressed staging files from this repository with git-lfs.

After you have installed git-lfs, configure it not to download all data files right away:

git lfs install --skip-repo --skip-smudge

Clone the git repository and install lfs hooks into it:

git clone https://github.com/cBioPortal/datahub.git
cd datahub
git lfs install --local --skip-smudge

Download the data files for a study folder, for example brca_tcga:

git lfs pull -I public/brca_tcga

How to Upload Data

Create a new branch from the 'master' branch.

git checkout master
git pull origin master
git checkout -b [name_of_your_new_branch]

For general background on creating and managing branches within GitHub, see: Git Branching and Merging.

Commit changes, and push the branch back to GitHub.

[back to the root directory]
git add .
git commit -m '[notes_for_your_change]'
git push origin [name_of_your_new_branch]

Open a Pull Request on GitHub to the 'master' branch.

For instructions on submitting a pull-request, please see: Using Pull Requests and Sending Pull Requests.

Download a complete MySQL export of the latest database

http://download.cbioportal.org/mysql-snapshots/mysql-snapshots-toc.html

License

The data are available under the ODC Open Database License (ODbL). You are free to share and modify the data as long as you attribute any public use of the database, or works produced from the database; keep the resulting data-sets open; and offer your shared or adapted version of the data-set under the same ODbL license.

TCGA data are available under the Broad Institute GDAC TCGA Analysis Pipeline License. The Cancer Genome Atlas Consortium is pleased to provide the research community with preliminary data prior to publication. Users are requested to carefully consider that these data are preliminary and have yet to be validated. Researchers are warned that the preliminary data have a significant uncertainty, are likely to change, and should be used with caution.

User Assistance

For questions, please post on our user discussion group at: https://groups.google.com/g/cbioportal


datahub's Issues

Number of rows/columns different in rppa/rppa_zscores for brca

Hi,

I was wondering why the number of rows (226) and columns (845) in the data_rppa_Zscores.txt is different from the number of rows (227) and columns (939) in data_rppa.txt for the brca study.

In the Zscore file, the first row after the header seems to be missing. For the columns, I thought there might be samples with only NA values for which Zscores couldn't be calculated, but this doesn't seem to be the case.

Kind regards,
Sander
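A comparison like the one described above can be sketched in Python. This is a minimal illustration, not part of the cBioPortal validator: the function name table_shape_diff and the toy tables are hypothetical, and the sketch assumes both files are tab-delimited with one header row and the row identifier in the first column.

```python
import csv
import io


def table_shape_diff(raw_text, zscore_text, id_column=0):
    """Return row IDs and column headers present in one table but not the other."""
    def load(text):
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        header, body = rows[0], rows[1:]
        return set(header), {row[id_column] for row in body if row}

    raw_cols, raw_ids = load(raw_text)
    z_cols, z_ids = load(zscore_text)
    return {
        "rows_only_in_raw": sorted(raw_ids - z_ids),
        "columns_only_in_raw": sorted(raw_cols - z_cols),
        "columns_only_in_zscores": sorted(z_cols - raw_cols),
    }


# Toy data (hypothetical): the z-score table lacks one row and one sample column.
raw = "Composite.Element.REF\tS1\tS2\tS3\nA|A\t1\t2\t3\nB|B\t4\t5\t6\n"
zscores = "Composite.Element.REF\tS1\tS2\nB|B\t0.5\t-0.5\n"
diff = table_shape_diff(raw, zscores)
```

Listing the differing row IDs and sample columns, rather than only the counts, shows which rows and samples were dropped.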

Missing CANCER_TYPE_DETAILED in tcga studies?

In cbioportal.org we now see the pancancer histogram view for BRCA study:

However, this does not seem to appear in the study loaded from datahub? Is the CANCER_TYPE_DETAILED field missing or empty in the datahub tcga studies?

CCSK data set

The file: target_ccsk_pa_01_cna_hg19.seg has failures:

Sample ID not defined in clinical file in lines 2, 3, 4, (2655 more)

Unknown chromosome, must be one of (24|20|21|22|23|1|3|2|5|4|7|6|9|8|Y|X|11|10|13|12|15|14|17|16|19|18) in lines 30, 61, 458, (20 more)

and
Blank cell found in column in line 2497
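The chromosome check behind the message above could look roughly like the following. It is a sketch under assumptions: the allowed set is taken from the validator message quoted in this issue, while the function name and the column layout (ID, chrom, loc.start, loc.end, num.mark, seg.mean) are assumptions for illustration and not the validator's actual code.

```python
# Allowed values as listed in the validator message: 1 to 24, plus X and Y.
ALLOWED_CHROMOSOMES = {str(n) for n in range(1, 25)} | {"X", "Y"}


def invalid_chromosome_lines(seg_lines, chrom_column=1):
    """Return (line_number, value) pairs for rows with a disallowed chromosome."""
    bad = []
    for line_number, line in enumerate(seg_lines, start=1):
        if line_number == 1:  # skip the header row
            continue
        fields = line.rstrip("\n").split("\t")
        value = fields[chrom_column] if len(fields) > chrom_column else ""
        if value not in ALLOWED_CHROMOSOMES:
            bad.append((line_number, value))
    return bad


# Toy rows (hypothetical), not taken from target_ccsk_pa_01_cna_hg19.seg.
seg_lines = [
    "ID\tchrom\tloc.start\tloc.end\tnum.mark\tseg.mean\n",
    "S1\t1\t100\t200\t10\t0.1\n",
    "S1\tchr2\t100\t200\t10\t0.2\n",
    "S2\tMT\t100\t200\t10\t-0.3\n",
    "S2\tX\t100\t200\t10\t0.0\n",
]
bad_rows = invalid_chromosome_lines(seg_lines)
```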

brca_tcga.tar.gz what should be the datatype for protein expression?

Hello
I am trying to upload brca_tcga.tar.gz to the latest version of cbioportal. During the validation, files meta_protein_quantification.txt and meta_protein_quantification_Zscores.txt produce errors saying the datatype is incorrect. In http://cbioportal.readthedocs.io/en/latest/File-Formats.html I found that protein expression can be only rppa. But files meta_rppa.txt and meta_rppa_Zscores.txt already exist among brca_tcga.tar.gz files. How to import meta_protein_quantification.txt and meta_protein_quantification_Zscores.txt? Is there any other data type for protein expression?
Best regards,
Marian

Error in skcm_tcga study

The skcm study seems to contain some errors that show up in the validator. I generated the following report with the latest version of the validator, which is the hotfix of Nov 4 plus PRs 1866, 1868, and 1873.

Summarised errors:

Value in column 'n_ref_count' is invalid (.)
Value in column 'Validation_Status' is invalid (---)
Sample ID not defined in clinical file (241 times)

Full validation report:

Report_skcm.html.zip

CBTTC-0005 Data Set

The file called data_expression_median.txt in CBTTC-0005 has the following failure messages:

Value is neither a real number nor NA in lines 2, 3, 4, (22947 more) and column number 21

and

Entrez gene id is non-positive in lines 12, 73, 129, (2414 more)
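The two checks named in these messages can be approximated as below. This is a hedged sketch: the helper names are hypothetical, and treating only finite numbers as "real" is an assumption rather than documented validator behavior.

```python
import math


def is_real_or_na(cell):
    """True if the cell is the literal NA or parses as a finite number."""
    if cell == "NA":
        return True
    try:
        return math.isfinite(float(cell))
    except ValueError:
        return False


def entrez_is_positive(cell):
    """True if the cell parses as an integer greater than zero."""
    try:
        return int(cell) > 0
    except ValueError:
        return False
```

For example, is_real_or_na("") and entrez_is_positive("-1") both return False, matching the kind of cells the messages above complain about.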

Issues with RNA-seq data of nepc_wcm_2016

The RNA-seq data does not look like z-score profiles.

There is no case list for the RNA-seq profile

Validator: Sample-level attributes that should be on patient-level

It should not fail for an attribute. Also in our config file, it is a patient based attribute.

AML Data Set

The file called data_mutations_extended.txt in the data set AML contains the following error that needs to be addressed:

Invalid column header, file cannot be parsed

Sample IDs ov_tcga

@pieterlukasse mentioned in #11:

"Found strange sample ids in the end of the file. So the last samples contain much longer part of the "barcode" (and actually do not comply to the TCGA barcode format...) and potentially overlap with samples in previous lines, given data loader will only use the first part of the barcode, i.e. only TCGA-42-2591-01 portion of TCGA-42-2591-01A-21)"

TCGA-13-0800    TCGA-13-0800-10 cd5a08e6-343a-494e-b73a-2b060c33451d    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] NO      [Not Available] [Not Availab
TCGA-13-0801    TCGA-13-0801-01 dfaf19b4-03b4-49d3-b39f-01aaeb897a2a    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] 0.6     NO      1.5     [Not Available] [Not Availab
TCGA-13-0801    TCGA-13-0801-10 d2c929db-171c-4b5c-9de5-613407999d56    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] NO      [Not Available] [Not Availab
TCGA-13-0802    TCGA-13-0802-01 9e78346d-14d1-4c3e-a144-5b90c32a2731    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] 1.1     NO      1.4     [Not Available] [Not Availab
TCGA-13-0802    TCGA-13-0802-10 c1ab65a8-155e-40dc-a2ae-d526e0ee5138    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] NO      [Not Available] [Not Availab
TCGA-13-0803    TCGA-13-0803-01 13227d89-2bd2-4775-80a5-3fc08927dd25    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] 0.9     NO      1.3     [Not Available] [Not Availab
TCGA-13-0803    TCGA-13-0803-10 81bdc7f0-3cbf-4965-976d-6fb3bd080d72    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] NO      [Not Available] [Not Availab
TCGA-13-0804    TCGA-13-0804-01 7432f954-a16f-4053-bedd-aa776f939f71    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] 1.3     NO      1.5     [Not Available] [Not Availab
TCGA-13-0804    TCGA-13-0804-10 47946a17-b529-4490-894b-e74baf6beb3c    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] NO      [Not Available] [Not Availab
TCGA-13-0805    TCGA-13-0805-01 01cd17b5-aa30-4a47-a1c5-501f9aedb3d2    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] 0.8     NO      1       [Not Available] [Not Availab
TCGA-13-0805    TCGA-13-0805-10 49ad654d-d1de-4cf8-a689-9193bb0926db    [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] [Not Available] NO      [Not Available] [Not Availab
TCGA-10-0931-01A-21     TCGA-10-0931-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-42-2582-01A-21     TCGA-42-2582-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-61-2610-01A-21     TCGA-61-2610-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-13-0901-01A-21     TCGA-13-0901-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-13-1411-01A-21     TCGA-13-1411-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-42-2589-01A-21     TCGA-42-2589-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-36-2552-01A-21     TCGA-36-2552-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-13-1410-01A-21     TCGA-13-1410-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-36-2540-01A-21     TCGA-36-2540-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-04-1341-01A-21     TCGA-04-1341-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-24-1416-01A-21     TCGA-24-1416-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
TCGA-42-2591-01A-21     TCGA-42-2591-01A-21-20  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA

@zheins mentioned in #11 a fix is underway, but this has not been merged yet.
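To find the overlaps described in the quote, each ID can be truncated to its sample-level portion and the results grouped. The sketch below assumes the truncation works as quoted (TCGA-42-2591-01A-21 becomes TCGA-42-2591-01); the regular expression and function names are illustrative assumptions, not the data loader's implementation.

```python
import re
from collections import defaultdict

# Sample-level prefix: TCGA-<2 chars>-<4 chars>-<2 digits> (an assumed pattern).
SAMPLE_LEVEL = re.compile(r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-\d{2})")


def sample_level_id(barcode):
    """Return the sample-level prefix of a barcode, or None if it does not match."""
    match = SAMPLE_LEVEL.match(barcode)
    return match.group(1) if match else None


def find_collisions(sample_ids):
    """Group IDs that collapse to the same sample-level prefix."""
    groups = defaultdict(list)
    for sample_id in sample_ids:
        groups[sample_level_id(sample_id)].append(sample_id)
    return {key: ids for key, ids in groups.items()
            if key is not None and len(ids) > 1}
```

Calling find_collisions on the second column of the clinical file would list any long IDs that shadow an existing sample.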

Missing meta files in TCGA (Provisional) studies

Some meta files are missing, and therefore the data files cannot be loaded.

meta_clinical_supp_oncotree.txt
meta_gistic_genes_amp.txt
meta_gistic_genes_del.txt
meta_mutsig.txt

Validation and Verification Status should not be so strict

Public studies have different values, and the validator should report them as a warning rather than an error.

coadread_tcga & ov_tcga: errors in mass spectrometry data

To circumvent this problem, remove the protein_quantification files from these studies.

coadread_tcga: Both data_protein_quantification and data_protein_quantification_Zscores have duplicate column names, with different values in the columns (some even 0, and others >20). This does not look correct. They look a bit too different to be technical replicates.

ov_tcga data_protein_quantification.txt: There's a lot of missing data, this does not seem correct. When data is missing, these values should contain NA instead of being empty.

ov_tcga data_protein_quantification_Zscores.txt: The last column misses the lower 75% of data.

Missing documentation on seeds for old cBio releases

In a code review thread regarding how we document compatible seed databases across releases of cBioPortal, @inodb, @pieterlukasse and I came to agree that it would be best to maintain a list of seed files for old releases in the corresponding README file on datahub.

Updates to the seed files, such as this one, have so far removed links to the old versions of seed files from the documentation. These links should be restored, referencing the seed files in specific commits, and we should start requiring future backwards-incompatible updates of the seed files to maintain links to previous versions.

Contributing without downloading all first

Would be good to document the --skip-smudge option to help people that just want to upload a new file (without having to download all first)

--skip-smudge : https://github.com/github/git-lfs/blob/master/docs/man/git-lfs-install.1.ronn

TCGA Disease Free Survival Issue (moved from cbioportal project)

Reported by end user:

I’m trying to reproduce your disease free survival plots but am unclear how your pipelines select from multiple relapse events for the same patients. For example patient TCGA-CU-A72E has two new tumor event times: 256 days and 364 days. Cbioportal selects 364 days. Isn’t 256 more correct if we are trying to measure disease free survival?

More info from user:

Here is a link to gdc data portal from which you can download clinical metadata for TCGA-CU-A72E: https://portal.gdc.cancer.gov/files/4f3ae24f-eecd-4ba3-a592-cb9af99ed6e2

Download, de-compress, open and go to follow-ups section. Under follow-ups you’ll find two new tumor event entries (256 and 364).

BLCA GDAC firehose merged clinical files also shows these two events.

Notes from Ethan: Reached out to Ben, who confirmed that this calculation is done within one of the MSK pipelines (that is not currently on github). Also confirmed with @schultzn that we should be using the first event, not the second.
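The agreed rule (use the first event) amounts to taking the minimum over a patient's new tumor event times. The following is only an illustration of that rule with a hypothetical function name; the MSK pipeline that performs the real calculation is not public, so nothing here reflects its code.

```python
def disease_free_days(new_tumor_event_days):
    """Return the earliest new tumor event time, or None if there is none."""
    days = [d for d in new_tumor_event_days if d is not None]
    return min(days) if days else None


# Patient TCGA-CU-A72E from the report above has events at 256 and 364 days.
first_event = disease_free_days([256, 364])
```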

TCGA Case lists contain samples that are not in study

Infant MLL Study summary - uncertainty

When I look into the Study summary of Infant MLL-Rearranged Acute Lymphoblastic Leukemia, I see an uncertainty in the list of mutated genes.

The number of samples (#) is higher than the number of profiled samples.

TCGA staging files enhancements

This ticket is meant for tackling some of the remaining warning ⚠️ messages we still get from the validation reports. The goal is to have a new version (v2) of the staging files without these issues:

Notify:
@zheins, @n1zea144

reorganizing the order of RNA Seq in Data Set

If I want to order the data sets by RNA Seq, the list is not fully sorted.
Could you please check this?

Archive for study hnsc_tcga_pub wrong Metadata

The archive for the study hnsc_tcga_pub contains wrong metadata.
The file meta_RNA_Seq_v2_expression_median.txt has the wrong "datatype" "Z-SCORE" instead of the expected "CONTINUOUS".

The file meta_RNA_Seq_v2_expression_median.txt contains:

cancer_study_identifier: hnsc_tcga_pub
genetic_alteration_type: MRNA_EXPRESSION
datatype: Z-SCORE
stable_id: hnsc_tcga_pub_rna_seq_v2_mrna
show_profile_in_analysis_tab: false
profile_description: Expression levels for 20532 genes in 303 hnsc cases (RNA Seq V2 RSEM).
profile_name: mRNA expression (RNA Seq V2 RSEM)
data_filename: data_RNA_Seq_v2_expression_median.txt
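A quick way to spot this kind of mismatch is to parse the meta file into key/value pairs and compare the datatype with the data file name. The sketch below is a heuristic under assumptions: the function names are hypothetical, and the rule "a Z-SCORE profile should point at a file with zscore in its name" is an assumption made for illustration, not a validator rule.

```python
def parse_meta(text):
    """Parse 'key: value' lines of a meta file into a dict."""
    meta = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def looks_mislabelled(meta):
    """Heuristic: Z-SCORE datatype on a file not named like a z-score file."""
    is_zscore = meta.get("datatype") == "Z-SCORE"
    named_zscore = "zscore" in meta.get("data_filename", "").lower()
    return is_zscore and not named_zscore


# Contents as quoted in the issue above.
meta_text = """cancer_study_identifier: hnsc_tcga_pub
genetic_alteration_type: MRNA_EXPRESSION
datatype: Z-SCORE
stable_id: hnsc_tcga_pub_rna_seq_v2_mrna
show_profile_in_analysis_tab: false
data_filename: data_RNA_Seq_v2_expression_median.txt
"""
meta = parse_meta(meta_text)
```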

Missing clinical attributes for paad_icgc

When running the validation from the command line (validateData.py), I noticed that the study "paad_icgc" does not contain clinical data ("data_clinical.txt"). On cBioPortal.org there is no clinical data for it either (only mutation data). However, the publication (http://www.ncbi.nlm.nih.gov/pubmed/23103869) does contain clinical data as far as I understand (e.g., figure 2 shows some survival curves), which indicates that the initial study should contain clinical data and we lost it somewhere. Could we check that somehow?

lgg_ucsf_2014: errors when loading the study

The study lgg_ucsf_2014 has missing files. Specifically, it lacks the meta_study file and the meta files for the different timepoint data files. After I added these, more errors were raised by the validator:

missing data in mesothelioma tcga

From a user:

"I downloaded many of the TCGA, Provisional data sets from the portal (http://www.cbioportal.org/data_sets.jsp) and found only mesothelioma had several files missing, including: data_mutations_extended.txt. Could you please let me know if this file is available for download?"

Add CPTAC data

Would be nice to have the CPTAC data in the next version (Ovarian, breast, and colorectal?)

Notify: @zheins

brca_tcga: errors in mass spec data

Fixes needed:

data_protein_quantification.txt and data_protein_quantification_Zscores.txt: change the header from Hugo_Symbol to Composite.Element.REF.
data_protein_quantification.txt and data_protein_quantification_Zscores.txt: fill the empty fields.

z-score file missing for `mrna` profile

E.g. in TCGA BRCA study we have a file for mrna, but no respective file for mrna_median_Zscore profile data.

@zheins @n1zea144 : could you check for this missing file? It seems to be there on the public portal, so it is just missing here.

Missing archives

There are at least 7 studies that are listed in cbioportal.org but which have no corresponding data archive in datahub/public. They are listed below.
Is it possible to add these archives to this repository?

List updated 13.10.2017

Loading CPTAC data generates many "gene" records

When loading brca study with CPTAC mass spectrometry data, the portal will generate a large amount of new "gene" records to store the data reported for each separate isoform(?) in the CPTAC files (72,159 new records in gene table!)

Here are some concerns:

the query page becomes slow when typing "PHOSPHOPROTEIN" in the Genes box (each new protein "gene" also gets this alias). The resulting drop down is very slow.
depending on what each symbol means in the CPTAC data file, this solution might not be scalable. For example: are SORBS1_pT72 or SORBS1_pT82_S89 encoding modifications to the canonical protein sequence known for gene SORBS1 rather than symbols of well known isoforms? If so, we risk an explosion of the number of records in the gene table as each study finds new modifications.

Another question I had when looking at the data (see data sample below) is:

how is the entry SORBS1|SORBS1 made? Is this an aggregation of all the other SORBS1|* items? How is this aggregation done?

Data sample from file:

SORBS1|SORBS1   0.545571655184  1.31369690336   1.20131762167   1.1320980343    0.54739111875   1.19041192239   2.73163154855   0.948705044244  1.33867851356   2.12510951076   1.01727605533   1.3008073214
SORBS1|SORBS1_pT72      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.319789927879  0.325725496261  0.453594164798  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
SORBS1|SORBS1_pS76      0.0     0.0     0.0     0.0     0.0     0.0     0.802830082054  1.10324826511   1.43253238093   0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
SORBS1|SORBS1_pS77      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
SORBS1|SORBS1_pS78      0.0312601610742 0.247415810145  1.62051540467   1.11476716945   0.849155398463  2.08833175784   0.0     0.0     0.0     0.0     0.0     0.0     1.32989448951   0.754650122826  0.78
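To gauge how many gene-table records a file like this would create per gene, the row identifiers can be split and counted. This is a sketch that assumes the identifier layout visible in the sample (gene, a pipe, then the gene name optionally followed by an underscore and a site label); the function name is hypothetical and the meaning of the site labels is exactly what the issue leaves open.

```python
from collections import Counter


def split_cptac_id(identifier):
    """Split 'GENE|GENE_site' into (gene, site); site is None for 'GENE|GENE'."""
    gene, _, entity = identifier.partition("|")
    site = entity[len(gene) + 1:] if entity.startswith(gene + "_") else None
    return gene, site


# Identifiers from the data sample above.
ids = [
    "SORBS1|SORBS1",
    "SORBS1|SORBS1_pT72",
    "SORBS1|SORBS1_pS76",
    "SORBS1|SORBS1_pS77",
    "SORBS1|SORBS1_pS78",
]
entries_per_gene = Counter(split_cptac_id(i)[0] for i in ids)
```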

Missing 5 meta files in thca_tcga study

These files in the thca_tcga study are missing meta files:

tcga_pub studies cannot be loaded

Currently none of the 16 _tcga_pub studies can be loaded. @pulyakhina and I have validated them all, and these are the most common errors:

Clinical data files in old format: data_clinical.txt instead of _data_patient.txt and _data_sample.txt
Stable ids in meta files are in an invalid format: stable_id: blca_tcga_pub_gistic instead of stable_id: gistic
For .seg and multiple meta files, file type cannot be determined. Could not determine the file type. Please check your meta files for correct configuration. genetic_alteration_type: PROTEIN_LEVEL, datatype: CONTINUOUS
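The stable id fix in the second item is mechanical: drop the study identifier prefix. A minimal sketch, with a hypothetical function name and assuming the prefix is always the study identifier followed by an underscore:

```python
def strip_study_prefix(stable_id, study_id):
    """Remove a leading '<study_id>_' from a stable id, if present."""
    prefix = study_id + "_"
    return stable_id[len(prefix):] if stable_id.startswith(prefix) else stable_id


fixed = strip_study_prefix("blca_tcga_pub_gistic", "blca_tcga_pub")
```

Ids that already lack the prefix are returned unchanged, so the function can be applied to every meta file in a study without checking first.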

Error reports:
Reports_tcga_pub.zip

Affected studies:

blca_tcga_pub
brca_tcga_pub
coadread_tcga_pub
gbm_tcga_pub
hnsc_tcga_pub
kich_tcga_pub
kirc_tcga_pub
laml_tcga_pub
lgggbm_tcga_pub
luad_tcga_pub
lusc_tcga_pub
ov_tcga_pub
prad_tcga_pub
stad_tcga_pub
thca_tcga_pub
ucec_tcga_pub

Missing description RNA-Seq normalization

In the meta file for the RNA-Seq of the TCGA provisional data, a description of how the data is normalized is missing. It only states: Expression levels for 20532 genes in 171 gbm cases (RNA Seq V2 RSEM).
The file does not contain raw read counts, because the values are floats. It also contains some extreme outliers. Information on this is probably located on https://gdc.cancer.gov/ which hosts the TCGA data.

As an example, some statistics on the values in gbm_tcga:

> expr_data[1:5,1:5]
          TCGA-02-0047-01 TCGA-02-0055-01 TCGA-02-2483-01 TCGA-02-2485-01 TCGA-02-2486-01
100134869          6.7611         15.6973         13.9398         14.9571          4.8049
10357             54.7036         31.3945         60.3441         91.8238         62.5366
155060           232.9512        162.0182        135.0923        417.6190        276.2195
26823              0.0000          0.5606          0.0000          1.9048          0.0000
280660             0.0000          0.0000          0.0000          0.0000          0.0000

> range(expr_data)
[1]       0 1026361

Boxplots including outliers:

Boxplots without outliers:

Validator Issue: Change fusion DNA and RNA Support to DNA_support & RNA_support

To make headers consistent, can we make the header in fusion file to:
DNA_support & RNA_support?

Timeline file for events like status fails on validation

Event types like status, specimen, surgery, etc., which are single-day events, don't have a stop date. The validator fails when it does not find a stop date in any timeline file. A stop date is only needed for durations like treatments. Other events may or may not have a stop date. Therefore, maybe report this as a warning rather than an error.
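One possible reading of this proposal is sketched below. Everything in it is an assumption made for illustration: the column names START_DATE, STOP_DATE and EVENT_TYPE, the set of duration event types, and the severity labels are not taken from the validator.

```python
# Event types treated as durations (an assumption for this sketch).
DURATION_EVENT_TYPES = {"TREATMENT"}


def check_stop_date(event):
    """Return 'ok' or 'warning' for one timeline row given as a dict."""
    if event.get("STOP_DATE", "").strip():
        return "ok"
    if event.get("EVENT_TYPE", "").upper() in DURATION_EVENT_TYPES:
        return "warning"  # a duration without an end is suspicious but loadable
    return "ok"  # single-day events need no stop date
```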

IPOP data

There is no RNA-Seq count for study 4
survival plot is not shown for the 2 new studies
Priority

CCSK.html

data_cna.txt in the CCSK data set has an issue defined as "Invalid CNA value: possible values are [-2, -1, 0, 1, 2, NA]"
The values encountered are "-0.0072, 0.0078, 0.0799, (1023 more)"
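The check behind this message can be reproduced with a small set lookup. The allowed values come from the message quoted above; the function name is hypothetical.

```python
# Allowed discrete copy-number values, as listed in the validator message.
ALLOWED_CNA = {"-2", "-1", "0", "1", "2", "NA"}


def invalid_cna_values(cells):
    """Return the cells that are not one of the allowed discrete CNA values."""
    return [cell for cell in cells if cell.strip() not in ALLOWED_CNA]


offending = invalid_cna_values(["-2", "0", "NA", "-0.0072", "0.0078", "0.0799"])
```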

After uploading a study it does not appear in cBioPortal unless it is restarted

Hello authors

I am trying to use cBioPortal for visualization of our sequencing data. Could you suggest how to overcome a problem with data loading? Newly uploaded data is not visible in cBioPortal until the next restart, which is not optimal. Perhaps you know an easy way to disable some cache so that cBioPortal always queries the database?

Thank you
Best regards,
Marian Caikovski

Naming inconsistencies

Should these tars be renamed to match their corresponding cancer study identifiers?

Cancer Study Identifier	tar filename
nepc_wcm_2016	prad_cornell_2016.tar.gz
thyroid_mskcc_2016	thca_mskcc_2016.tar.gz

meso_tcga: errors in data_mutation_extended

The new maf file contains some errors:

@zheins

Broken seed data - references to missing tables

The schema file for the seed data doesn't contain a reference to attribute_metadata, yet the data files attempt to lock that table.

At a guess, the seed data is from a version that predates the schema; this table is dropped by a migration in any event.

In any event, you can't load the schema and the data right now.

Profile name and description error in msk_impact_2017

Chromosome (field 5) NA for >50% of COADREAD mutation data

Hi there,

It seems that for >50% of all mutations in coadread/tcga/data_mutations_extended.txt, field 5 (chromosome) is NA. To be more precise, only chromosomes >=10 are reported.

$ cut -f5 data_mutations_extended.txt | sort | uniq -c
   3410 10
   4743 11
   4832 12
   1872 13
   2361 14
   2723 15
   2715 16
   3961 17
   1480 18
   5046 19
   1990 20
    821 21
   1288 22
      1 Chromosome
  47344 NA
      1 #version 2.4

I did a quick check on a couple of other indications, and there it seems to happen too.

Thanks,
Markus
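The same tally can be made in Python, which also makes it easy to compute the NA share. This is a sketch with hypothetical function names; it assumes a tab-delimited file whose first non-comment line is the header and whose fifth column is Chromosome, as in the shell command above.

```python
from collections import Counter


def chromosome_counts(lines, column_index=4):
    """Count values in one tab-separated column, skipping '#' lines and the header."""
    counts = Counter()
    header_seen = False
    for line in lines:
        if line.startswith("#"):
            continue
        if not header_seen:  # first non-comment line is the column header
            header_seen = True
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column_index:
            counts[fields[column_index]] += 1
    return counts


def na_fraction(counts):
    """Share of rows whose value is NA."""
    total = sum(counts.values())
    return counts.get("NA", 0) / total if total else 0.0


# Counts as reported in the issue above (header and version lines left out).
reported = Counter({
    "10": 3410, "11": 4743, "12": 4832, "13": 1872, "14": 2361,
    "15": 2723, "16": 2715, "17": 3961, "18": 1480, "19": 5046,
    "20": 1990, "21": 821, "22": 1288, "NA": 47344,
})
reported_na_fraction = na_fraction(reported)
```

With the reported counts, the NA share works out to about 56% of the mutation rows.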

If you download the tarball you will see SAMPLE_ID and PATIENT_ID in data_clinical.txt are LUSC-XX-XXXX, but Tumor_Sample_Barcode is TCGA-XX-XXXX in data_mutations_extended.txt. The above url has empty columns for clinical data.

msk_impact_2017

Requested changes

meta_cna.txt

profile_name: MSK-IMPACT Clinical Sequencing Cohort (MSKCC)
to
profile_name: Putative copy-number alterations

Replace value in profile_name and profile_description with correct value. (see #65)

meta_fusions.txt

stable_id: msk_impact_2017_mutations
to
stable_id: fusion

Add: data_filename: data_fusions.txt

data_mutations_extended.txt

Line 7058: pp.C189_A190delinsWH to p.C189_A190delinsWH
Line 13878: p. L482_E483delinsF* to p.L482_E483delinsF*
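Both corrections follow one pattern: a malformed "p." prefix on the protein change. A small normalizer, sketched here with a hypothetical function name, would handle these two cases; whether other malformed prefixes occur in the file is not known from this issue.

```python
import re


def normalize_protein_change(value):
    """Collapse a repeated 'p' and any space after the dot into a plain 'p.' prefix."""
    return re.sub(r"^p+\.\s*", "p.", value.strip())
```

For the two lines listed above, this turns pp.C189_A190delinsWH into p.C189_A190delinsWH and "p. L482_E483delinsF*" into p.L482_E483delinsF*, and leaves already valid values untouched.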

meta_study.txt

Add: add_global_case_list: true

data_fusions.txt

Replace 0 for Entrez_Gene_Id with real values

data_mutations.txt

Replace pp.C189_A190delinsWH with p.C189_A190delinsWH

Gene panels

@zheins mentioned there is gene panel data for msk_impact, but it's currently not included in the files.

Should we upload the unpacked files instead of gz files

For discussion:

Often I just want to look at one file, but I have to download the whole study to do that.

Gzipped files are also not good for comparing or keeping track of changes.

And if I find a small issue, I want to be able to fix just one txt file instead of uploading the whole gzipped file.

one line difference in CPTAC brca files

data_protein_quantification_Zscores.txt has 84233 lines while data_protein_quantification.txt has 84234 lines. Is this correct, or was one line left out of the Zscores file by mistake?

Empty segment files in brca study (provisional)

Hi,

Segment data for the brca study is missing. There are files for it, brca_tcga_data_cna_hg19.seg and brca_tcga_meta_cna_hg19_seg.txt, but the .seg file seems to be empty. This segment data is available on cbioportal.org for this study, but not in this download file.