
alexslemonade / refinebio-examples

Repository size: 796.55 MB

Example workflows for refine.bio data

Home Page: https://www.refine.bio

License: Other

Languages: HTML 99.89%, TeX 0.07%, R 0.02%, Python 0.01%, Dockerfile 0.01%, Shell 0.01%, CSS 0.01%
Topics: biodata, gene-expression, notebook, pathway-analysis, differential-expression


refinebio-examples's People

Contributors: actions-user, cansavvy, cbethell, davidsmejia, dvenprasad, jaclyn-taroni, jashapiro


refinebio-examples's Issues

Validate differential expression results using a refine.bio dataset

From @cgreene:

process your own data, do differential expression for it & a refinebio dataset & compare

This is also an example I used when chatting with folks at the AACR booth and it seemed to be highly relevant to potential users.

A few notes about some decision points:

  • We might consider grabbing the example "user's own data" as processed data from ArrayExpress and GEO.
  • Do we quantile normalize the "user's own data" using the refine.bio QN target?
  • Is looking at the overlap of differentially expressed genes at some common cutoff (FDR < 0.05) with a Venn diagram a sufficient comparison? (See the sketch after this list.)
  • What gene identifiers will we use?
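
A minimal sketch of the overlap comparison mentioned above, assuming two results data frames (`own_results` and `refinebio_results`) with a `Gene` column and limma-style `adj.P.Val` FDR values; all object and column names here are placeholders, not decided parts of the example.

```r
# Hedged sketch: object and column names below are placeholders.
library(VennDiagram)

fdr_cutoff <- 0.05
own_de       <- own_results$Gene[own_results$adj.P.Val < fdr_cutoff]
refinebio_de <- refinebio_results$Gene[refinebio_results$adj.P.Val < fdr_cutoff]

# Venn diagram of differentially expressed genes at the common cutoff
venn.diagram(
  x = list(own_data = own_de, refinebio = refinebio_de),
  filename = "de_overlap_venn.png"
)

# A Jaccard index gives a single summary number to report alongside the diagram
length(intersect(own_de, refinebio_de)) / length(union(own_de, refinebio_de))
```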

From UE testing: hyperlinks in READMEs should open in a new tab

It's helpful to have links open in a new tab or window so users don't lose their place in the READMEs. This appears to be especially important when installing the required software.

Here's an example using the pattern we use in the refine.bio docs:

[**R**](https://cran.r-project.org/)

becomes

<a href = "https://cran.r-project.org/" target = "blank">**R**</a>

Upload create_gct_file.R to S3 and update when changes are pushed to master

Context

To make refine.bio data compatible with GenePattern, which requires a particular file format, we've included a script that takes a refine.bio TSV and outputs a GCT file.

This script is currently located in the scripts folder. We explain how to use this script in the README for relevant fields (see: https://github.com/AlexsLemonade/refinebio-examples/tree/master/differential-expression#create-a-gct-file):

Create a GCT file

Convert a provided gene expression tab-separated values (TSV) file into a 'gene cluster text' (GCT) file for use in GenePattern. To create a GCT-formatted file from a refine.bio TSV data file, download the create_gct_file.R script.

The link above is to https://github.com/AlexsLemonade/refinebio-examples/blob/master/scripts/create_gct_file.R

A user like myself may know how to wget this file, but the only instructions we have for the intended users essentially assume that they will have downloaded the entire repository, which may be unnecessary if they are using GenePattern.

In addition, we've designed other modules to be self-contained (e.g., each has its own data folder) so that downstream users could potentially delete local copies of the modules they are not using and/or move the modules they are using outside of the overarching GitHub repository structure.

Problem or idea

Ideally, the link to the R script in the relevant READMEs (the differential-expression example quoted above and clustering) would download the latest version of that script, and users could put a local copy wherever they want. I believe this would make the script easier to use for the intended audience.
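
As a stopgap, the README could point users at the raw file directly; a minimal sketch, assuming the raw.githubusercontent.com mirror of the master-branch path linked above stays valid:

```r
# Download only the script, not the whole repository; the URL assumes the
# raw GitHub mirror of scripts/create_gct_file.R on master.
download.file(
  url = "https://raw.githubusercontent.com/AlexsLemonade/refinebio-examples/master/scripts/create_gct_file.R",
  destfile = "create_gct_file.R"
)
```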

@arielsvn suggested the following:

  • Sync this file to S3
  • Update the file on S3 anytime changes to this file are pushed to master

Next steps

@arielsvn @kurtwheeler do you have ideas about how we would accomplish this and how complex it would be to implement?

Come up with a UE test plan to test the examples

Test users looking at and using the examples.

Questions:

  • What are specific things we want to know from testing?
  • What are we unsure about the way examples are currently presented?
  • Which type of users should we target?
  • Which examples would be appropriate to test?

tagging @jaclyn-taroni and @cansav09 for input.

[Investigation] How might we make a bookdown with Rmd chapters in multiple input folders?

It's unclear to me how easy or hard it would be to build a bookdown while keeping our current module-per-folder structure.

Here's the relevant bookdown chapter
https://bookdown.org/yihui/rmarkdown/bookdown-project.html

This issue made me think it wasn't very possible: rstudio/bookdown#107

But these issues made me think it's more possible if we change the yml file:
rstudio/bookdown#418
https://stackoverflow.com/questions/40578609/set-r-bookdown-input-directory
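
If the yml route pans out, a hypothetical _bookdown.yml might list chapters from each module folder explicitly via rmd_files; the file names below are placeholders, and whether bookdown accepts subdirectory paths like this is exactly what needs investigating.

```yml
# Hypothetical sketch only -- not a verified configuration.
book_filename: "refinebio-examples"
rmd_files:
  - "index.Rmd"
  - "clustering/clustering_example.Rmd"
  - "differential-expression/differential_expression_example.Rmd"
```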

Differential expression analysis examples

We'll need to use both gene expression matrices and metadata from refine.bio to compare two groups, preferably in an experiment with a straightforward case vs. control or vehicle vs. drug type of design.
Microarray:

  • Using R (limma? open for discussion; see the sketch below this list)
  • Using GenePattern notebooks
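
For the R/limma option, a minimal two-group sketch, assuming `expr_mat` is a log-scale genes-by-samples matrix from refine.bio and `metadata$group` has levels "control" and "treatment"; all names are placeholders.

```r
# Hedged sketch of a two-group limma comparison; object names are placeholders.
library(limma)

design <- model.matrix(~ group, data = metadata)  # intercept + grouptreatment
fit <- lmFit(expr_mat, design)
fit <- eBayes(fit)
topTable(fit, coef = "grouptreatment", number = 10)  # top genes by moderated t-statistic
```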

RNA-seq

Explore Code Ocean

Explore Code Ocean as a potential avenue to reduce the barrier to entry for researchers to set up and run the examples.

This is relevant to Bio Experts who don't have experience setting up their own programming environment.

"Getting Started" section materials to be added to refinebio-examples book

Background

As part of making our refinebio-examples more approachable, we'll need to add some content that equips users with foundational knowledge. Here's a rough outline and some materials we can borrow from:

  • We should probably borrow from the intro-to-R training module where appropriate: https://github.com/AlexsLemonade/training-modules/tree/master/intro-to-R-tidyverse
  • This will need to get broken out into smaller issues as we decide what topics are needed in this section.

Rough outline:

Resources we should link to

This getting started section should also include resources for people who may need more practice with the basics. Some suggestions:

Apply general recommendations for each module

From @dvenprasad:

General Thoughts/Recommendations:

  • Include installation instructions for dplyr and other packages? (See the setup-chunk sketch at the end of this section.)
  • Link to the GenePattern Notebook instead of the preview (am I supposed to create my own notebook and try to replicate the image, or is that notebook available for me to plug and play?)
  • Point notebook links to the rendered notebooks
  • Add a blurb recommending that users move their own data into the data folder of the example
  • Add a list of required packages or expected setup in the README for each example.

And from #40:

Add a list of packages/software they need to have set up to run the example notebooks.

A statement about downloading in the README (#62 (comment)):

To run the examples yourself, you will need to clone or download this repository < https://github.com/AlexsLemonade/refinebio-examples >
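
For the package-installation recommendation above, a setup chunk along these lines could sit near the top of each README or notebook. This is a sketch; the package list is illustrative only, not the actual requirements for any module.

```r
# Sketch of a setup chunk; the package list here is only an example.
required_packages <- c("dplyr", "readr", "ggplot2")
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
```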

Apply these recommendations to the following modules

  • Clustering (@cansav09)
  • Differential expression (@cansav09)
  • Dimension reduction (@cansav09)
  • Ensembl ID conversion (@cansav09)
  • QN your own data (@jaclyn-taroni)
  • Ortholog mapping (@jaclyn-taroni)
  • Pathway analysis (@jaclyn-taroni)
  • Validate differential expression (@cansav09)

How do we instruct users to obtain example datasets and Rmd files?

Some users are intimidated by GitHub. It is not very clear how one can get an individual notebook, either from the repo or from the rendered HTML R notebook.

For a different script, we had talked about uploading it to S3 on commit to master. That could be an option here as well.

Explore other possible solutions so it is simpler for people to get individual notebooks.

The band-aid fix for this would be to explicitly mention in the docs and README how to get these individual notebooks.

Add inline comments in notebooks where users need to replace code

Users didn't always feel confident about what they needed to replace. They tended to skim the comments before the code blocks and often missed what they needed to do.

Add inline comments like:

  • Replace path with the path to your data file
  • Replace with your organism

so it is explicit what they are expected to do.
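
For example, something along these lines; the file path and organism below are placeholders for whatever the notebook actually uses.

```r
# Sketch of the inline-comment style; values are placeholders.
expression_df <- readr::read_tsv("data/expression_data.tsv")  # <-- Replace with the path to YOUR data file
organism <- "Homo sapiens"                                    # <-- Replace with YOUR organism
```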


Add terms and conditions

We need a link to the T&C in the main README and to state:

In using these data, you agree to our terms and conditions.

  1. A terms and conditions markdown file needs to be placed in each module's data folder.
  2. A link and the above statement need to be added to each module's README.

We need to do this for the following modules:

  • Clustering (@cansav09)
  • Differential expression (@cansav09)
  • Dimension reduction (@cansav09)
  • Ensembl ID conversion (@cansav09)
  • QN your own data (@jaclyn-taroni)
  • Ortholog mapping (@jaclyn-taroni)
  • Pathway analysis (@jaclyn-taroni)
  • Validate differential expression (@cansav09)
  • Batch correction (@cansav09)

PCA example

This is woven into #24, but we can do a short example that's just PCA + visualization. I have a slight preference for using a small RNA-seq dataset that is case vs. control or treatment vs. mock.
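
A rough sketch of how small that example could be, assuming `expr_mat` is a genes-by-samples expression matrix and `metadata$group` labels case vs. control; all names are placeholders.

```r
# Hedged sketch: PCA plus a two-component visualization; names are placeholders.
library(ggplot2)

pca <- prcomp(t(expr_mat))  # prcomp expects samples as rows
pca_df <- data.frame(pca$x[, 1:2], group = metadata$group)

ggplot(pca_df, aes(x = PC1, y = PC2, color = group)) +
  geom_point()
```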

Train a PLIER model on a dataset that is aggregated by species

Related: #24

If we pick a large (enough) dataset that we aggregate by species for our batch correction example, we can then use it as training data for PLIER. Specifically, @cansav09 and I have talked about obtaining many datasets from a particular cell line (e.g., MCF-7, HEK293) for the batch correction example. We can then potentially train a PLIER model on the data with and without batch correction and compare. We'll have to be a bit careful about how we frame the comparison, though, as users may be linked from the docs to this example. We'll need to include sufficient context. cc @dvenprasad
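
A rough sketch of the training step, assuming the PLIER package's `PLIER()` function takes a genes-by-samples expression matrix and a binary genes-by-pathways prior matrix; all object names are placeholders and the exact comparison is left open.

```r
# Hedged sketch only: assumes PLIER::PLIER(data, priorMat); expr_uncorrected,
# expr_corrected, and prior_mat are placeholder objects.
library(PLIER)

plier_uncorrected <- PLIER(as.matrix(expr_uncorrected), prior_mat)
plier_corrected   <- PLIER(as.matrix(expr_corrected),   prior_mat)

# The two models' latent variables and pathway associations could then be
# compared to frame the effect of batch correction.
```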

Organize examples by technology/use case, rather than analysis type

Related to #101 and #92

If we present things as a follow along tutorial, it makes sense to me to organize material by technology or use case rather than analysis type. Specifically, duplicating analysis types like gene identifier conversion across separate microarray and RNA-seq "tracks" allows for some repetition of concepts and more opportunities to vary things in a way that supports learning (e.g., use different species [#98]).

Here are some initial thoughts about how to organize things, where the listing of individual analysis types is incomplete and for illustrative purposes only:

  • Introduction
    • Setting up RStudio
    • Obtaining files locally, etc.
    • Skip to "advanced use cases" (more on that below)
  • Microarray
    • Background on how we process microarray data (similar information to what's in the docs)
    • Clustering
    • Gene ID conversion
    • Differential gene expression analysis
    • Pathway analysis
  • RNA-seq
    • Background on how we process RNA-seq data (similar information to what's in the docs)
    • Clustering
    • Gene ID conversion
    • Differential gene expression analysis
    • Pathway analysis
  • Larger cross-experiment, cross-platform experiments & compendium
    • How we process these datasets in general and in the specific compendium case (similar information to what's in the docs)
    • Machine learning examples
  • "Advanced usage"
    • Normalizing your own data
    • API usage examples
    • Eventually: Python Client
    • Eventually: Command line
    • Eventually: R Client

For the clients, this may be more a place where we link out to other documentation specific to those clients, while including some information about installation, for example.

Buff up the main README page

Based on the recommendations explained in person by @dvenprasad, here's a list of changes that need to be made to the main README. This is partially adapted from the Notion link provided by @dvenprasad.

More setup info

  • Add a connection to training-modules
  • Add more setup information (links to RStudio installation, etc.)
  • What packages are used throughout? tidyverse

More GenePattern background info:

  • Fix links #48
  • Add more context about how the notebooks are designed to be used
  • Give them a heads-up that they will need to log in or create an account if they use the GenePattern notebooks

More info on modifying the notebooks

  • Add instructions about how to use/modify the notebooks with your own data
  • Note that the Gene column label is not always in the downloaded files (X1); also determine the pattern for that and why it happens
  • Add a blurb recommending that users move their own data into the data folder of the example

Advanced Topics: Batch correction example

It would be good to show examples of how to correct for batch effects. For this example, we would use refine.bio data that is aggregated by species. Presumably we'd include some sort of initial litmus test that, if followed, would help identify whether batch correction is needed in the first place.
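
One candidate method we could demonstrate is ComBat from the sva package; a minimal sketch, assuming `expr_mat` is a genes-by-samples matrix and `metadata$batch` records which dataset each sample came from. The names are placeholders and ComBat is only one option, not a settled choice.

```r
# Hedged sketch: ComBat is one candidate method, not a settled choice.
library(sva)

corrected_mat <- ComBat(dat = as.matrix(expr_mat), batch = metadata$batch)

# The "litmus test" could be as simple as PCA colored by batch, before and after.
```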

Create DESeqDataSet from refine.bio TSV file

This is related to #102 in my opinion.

For RNA-seq data where quantile normalization is skipped, we supply "bias corrected counts without an offset" in a TSV file. We can use this file with DESeq2::DESeqDataSetFromMatrix() and then have subsequent examples where we perform transformations, etc.
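
A minimal sketch of that construction, assuming the TSV's first column (`Gene`) holds gene IDs, the remaining columns are samples, and `metadata` has a matching `group` column; the file name and object names are placeholders.

```r
# Hedged sketch; "data/EXAMPLE.tsv", metadata, and group are placeholders.
library(DESeq2)

counts_df <- readr::read_tsv("data/EXAMPLE.tsv")
count_mat <- as.matrix(counts_df[, -1])
rownames(count_mat) <- counts_df$Gene

# The bias-corrected counts are not integers, so round before building the object
dds <- DESeqDataSetFromMatrix(
  countData = round(count_mat),
  colData = metadata,
  design = ~ group
)
```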

Pathway analysis module

Pathway analysis is very likely to be a popular use case. I propose a new pathway-analysis module that contains two use cases: 1) a single dataset and 2) meta-analysis. In a way, this issue is related to #24 and #30—that is, one could combine several studies, aggregating by species, and then do pathway analysis. It may be more desirable to analyze datasets separately.

Here's what I think this module needs:

  • Pathway analysis in a single dataset with QuSAGE (qusage, publication); see the sketch after this list
  • Pathway meta-analysis with QuSAGE (implemented in package as of 2.0.0, publication)
  • A pointer to Gene Set Enrichment Analysis (GenePattern) in the README
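
A minimal sketch of the single-dataset case, assuming a log-expression matrix `expr_mat`, a character vector `labels` of group assignments containing "treatment" and "control", and a named list of gene sets `gene_sets`; all of these are placeholders.

```r
# Hedged sketch of qusage's single-dataset workflow; objects are placeholders.
library(qusage)

qs_results <- qusage(expr_mat, labels, contrast = "treatment-control", geneSets = gene_sets)
qsTable(qs_results, number = 10)  # top gene sets by pathway activity
```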

Fix GenePattern notebook links

  • Fix GenePattern notebook links including CLSFileCreator link.
  • Link to GenePattern Notebook instead of the preview
  • Link notebook links to rendered notebooks

Reorganize repo such that different workflows are in their own directories

Different workflows are all in the main directory. I believe a strategy where each workflow (e.g., clustering, differential expression analyses, ortholog mapping) is in its own directory will make this easier for folks to follow.

I've sketched out what I think this should look like:

(Photo of the sketched directory structure.)

To reiterate the points from that sketch (see also the illustrative layout after this list):

  • The top or main directory README would serve mostly as an introduction and a table of contents for the rest of the repository
  • Each workflow's directory would contain a README (named README.md to facilitate display in the GitHub web interface) that includes the relevant content currently housed in the top-directory README and a link to the GCT script where applicable
  • create_gct_file.R (and any other scripts) would live in a new directory scripts/
  • Any data would live in a new directory data/ -- note that this probably requires some tweaks to the setup chunks in notebooks so that they can access the data/ directory
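
For concreteness, a layout following the points above might look something like this; the module and file names are illustrative only, and the placement of data/ follows the bullet above rather than a final decision.

```
refinebio-examples/
├── README.md                  # introduction + table of contents
├── scripts/
│   └── create_gct_file.R
├── data/
├── clustering/
│   ├── README.md
│   └── clustering_example.Rmd
└── differential-expression/
    ├── README.md
    └── differential_expression_example.Rmd
```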

Related: #11 - specifically, we'll want to make sure everything is consistent

Users should obtain example data via the regenerate files mechanism

Context

Related to: AlexsLemonade/refinebio-frontend#633, AlexsLemonade/refinebio-docs#99, AlexsLemonade/refinebio-frontend#636

We've designed the Share and Regenerate Files functionality of refine.bio to facilitate dataset sharing. Our examples should follow this model.

Idea and Solution

This ticket has two components:

  1. Remove any data tracked with git from the repository.
  2. Replace that information with a link to a shared dataset. This information is probably best served in a "Data Requirements" section in each module's README.

Number two above is a bit trickier than it may seem. Specifically, notebooks in a module expect metadata and gene expression data to be in the data folder and we've generally ignored the JSON files.

I'm thinking it will go something like: Follow the link > Regenerate files > Enter email and accept T&C > Download dataset via your email from the refine.bio mail robot > Unzip data in the top-level directory of a module > rename folder to data ?

We'll also have to pay special attention to the larger datasets that used to be zipped up. The notebooks have steps for unzipping them that need to get taken out.

We need to do this for the following modules:

  • Clustering (@cansav09)
  • Differential expression (@cansav09)
  • Dimension reduction (@cansav09)
  • Ensembl ID conversion (@cansav09)
  • QN your own data (@jaclyn-taroni)
  • Ortholog mapping (@jaclyn-taroni)
  • Pathway analysis (@jaclyn-taroni)
  • Validate differential expression (@cansav09)
  • Batch correction (@cansav09)

Using refine.bio QN targets with your own data

Related to: AlexsLemonade/refinebio-docs#73

I think a notebook in this repository, which we then link to from the documentation, is the best way to address the issue above.

Any example should contain the following components:

  • Using the API to obtain the QN target file
  • Importing your own data and the QN target into R and using preprocessCore to normalize your own data (sketched below)
  • A plot demonstrating the effect of quantile normalization (ECDF perhaps)
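
A minimal sketch of the normalization step, assuming `own_expr` is the user's own genes-by-samples matrix already matched to the QN target's gene order and `qn_target` is the numeric target vector obtained from the API; both are placeholders.

```r
# Hedged sketch; own_expr and qn_target are placeholders, and genes must be in
# the same order as the QN target before this step.
library(preprocessCore)

qn_expr <- normalize.quantiles.use.target(
  x = as.matrix(own_expr),
  target = qn_target
)

# ECDF before vs. after for one sample, to demonstrate the effect
plot(ecdf(own_expr[, 1]))
plot(ecdf(qn_expr[, 1]), add = TRUE, col = "red")
```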

WebGestaltR package problems - version on CRAN mirror may be old

A workshop participant was trying to follow refinebio-examples/pathway-analysis/ora_with_webgestaltr.Rmd, but the function listIdType was not recognized even though she had loaded the WebGestaltR library. It turns out that an older version of WebGestaltR called this function listIDType (with a capital D). This also prevented the main WebGestaltR function from properly downloading its data, since it uses this function internally. I reproduced the same problem on my own computer.

I'm unsure why the package version installed from CRAN was not working properly, but when I installed the WebGestaltR package using devtools::install_github("bzhanglab/WebGestalt"), these problems were resolved.

Simplify differential expression analysis

As noted in #6 (comment), the experiment selected for differential expression analysis (the same one used in compare-processing) has two factors: time point and genotype. These two factors are not taken into account in the differential expression example. We should pick a "simpler" experiment (e.g., a single factor like treatment) for our differential expression analysis example.
