
alexslemonade / refinebio-examples

Repository size: 796.55 MB

Example workflows for refine.bio data

Home Page: https://www.refine.bio

License: Other

Languages: HTML 99.89%, TeX 0.07%, R 0.02%, Python 0.01%, Dockerfile 0.01%, Shell 0.01%, CSS 0.01%
Topics: biodata, gene-expression, notebook, pathway-analysis, differential-expression


refinebio-examples's People

Contributors: actions-user, cansavvy, cbethell, davidsmejia, dvenprasad, jaclyn-taroni, jashapiro


refinebio-examples's Issues

Validate differential expression results using a refine.bio dataset

From @cgreene:

process your own data, do differential expression for it & a refinebio dataset & compare

This is also an example I used when chatting with folks at the AACR booth and it seemed to be highly relevant to potential users.

A few notes about some decision points:

  • We might consider grabbing the example "user's own data" as processed data from ArrayExpress and GEO.
  • Do we quantile normalize the "user's own data" using the refine.bio QN target?
  • Is looking at the overlap of differentially expressed genes at some common cutoff (FDR < 0.05) with a Venn diagram a sufficient comparison? (See the sketch after this list.)
  • What gene identifiers will we use?
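
A minimal sketch of the overlap comparison mentioned above, assuming two results data frames (`own_results` and `refinebio_results`) with a `Gene` column and limma-style `adj.P.Val` FDR values; all object and column names here are placeholders, not decided parts of the example.

```r
# Hedged sketch: object and column names below are placeholders.
library(VennDiagram)

fdr_cutoff <- 0.05
own_de       <- own_results$Gene[own_results$adj.P.Val < fdr_cutoff]
refinebio_de <- refinebio_results$Gene[refinebio_results$adj.P.Val < fdr_cutoff]

# Venn diagram of differentially expressed genes at the common cutoff
venn.diagram(
  x = list(own_data = own_de, refinebio = refinebio_de),
  filename = "de_overlap_venn.png"
)

# A Jaccard index gives a single summary number to report alongside the diagram
length(intersect(own_de, refinebio_de)) / length(union(own_de, refinebio_de))
```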

From UE testing: hyperlinks in READMEs should open in a new tab

It's helpful to have links open in a new tab or window so users don't lose their place in the READMEs. This appears to be especially important when installing the required software.

Here's an example using the pattern we use in the refine.bio docs:

[**R**](https://cran.r-project.org/)

becomes

<a href = "https://cran.r-project.org/" target = "blank">**R**</a>

Upload create_gct_file.R to S3 and update when changes are pushed to master

Context

To make refine.bio data compatible with GenePattern, which requires a particular file format, we've included a script that takes a refine.bio TSV and outputs a GCT file.

This script is currently located in the scripts folder. We explain how to use this script in the README for relevant fields (see: https://github.com/AlexsLemonade/refinebio-examples/tree/master/differential-expression#create-a-gct-file):

Create a GCT file

Convert a provided gene expression tab-separated values (TSV) file into a 'gene cluster text' (GCT) file for use in GenePattern. To create a GCT-formatted file from a refine.bio TSV data file, download the create_gct_file.R script.

The link above is to https://github.com/AlexsLemonade/refinebio-examples/blob/master/scripts/create_gct_file.R

A user like myself may know how to wget this file, but the only instructions we have for the intended users essentially assume that they will have downloaded the entire repository, which may be unnecessary if they are using GenePattern.

In addition, we've designed other modules to be self-contained (e.g., each has its own data folder) so that downstream users could potentially delete local copies of the modules they are not using and/or move the modules they are using outside of the overarching GitHub repository structure.

Problem or idea

Ideally, the link to the R script in the relevant READMEs (the differential-expression example quoted above and clustering) would download the latest version of that script, and users could put a local copy wherever they want. I believe this would make the script easier to use for the intended audience.
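
As a stopgap, the README could point users at the raw file directly; a minimal sketch, assuming the raw.githubusercontent.com mirror of the master-branch path linked above stays valid:

```r
# Download only the script, not the whole repository; the URL assumes the
# raw GitHub mirror of scripts/create_gct_file.R on master.
download.file(
  url = "https://raw.githubusercontent.com/AlexsLemonade/refinebio-examples/master/scripts/create_gct_file.R",
  destfile = "create_gct_file.R"
)
```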

@arielsvn suggested the following:

  • Sync this file to S3
  • Update the file on S3 anytime changes to this file are pushed to master

Next steps

@arielsvn @kurtwheeler do you have ideas about how we would accomplish this and how complex it would be to implement?

Come up with a UE test plan to test the examples

Test users looking at and using the examples.

Questions:

  • What are specific things we want to know from testing?
  • What are we unsure about the way examples are currently presented?
  • Which type of users should we target?
  • Which examples would be appropriate to test?

tagging @jaclyn-taroni and @cansav09 for input.

[Investigation] How might we make a bookdown with Rmd chapters in multiple input folders?

It's unclear to me how easy or hard it would be to build a bookdown while keeping our current module-per-folder structure.

Here's the relevant bookdown chapter
https://bookdown.org/yihui/rmarkdown/bookdown-project.html

This issue made me think it wasn't very possible: rstudio/bookdown#107

But these issues made me think it's more possible if we change the yml file:
rstudio/bookdown#418
https://stackoverflow.com/questions/40578609/set-r-bookdown-input-directory
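
If the yml route pans out, a hypothetical _bookdown.yml might list chapters from each module folder explicitly via rmd_files; the file names below are placeholders, and whether bookdown accepts subdirectory paths like this is exactly what needs investigating.

```yml
# Hypothetical sketch only -- not a verified configuration.
book_filename: "refinebio-examples"
rmd_files:
  - "index.Rmd"
  - "clustering/clustering_example.Rmd"
  - "differential-expression/differential_expression_example.Rmd"
```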

Differential expression analysis examples

We'll need to use both gene expression matrices and metadata from refine.bio to compare two groups, preferably in an experiment with a straightforward case vs. control or vehicle vs. drug type of design.
Microarray:

  • Using R (limma? open for discussion; see the sketch below this list)
  • Using GenePattern notebooks
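
For the R/limma option, a minimal two-group sketch, assuming `expr_mat` is a log-scale genes-by-samples matrix from refine.bio and `metadata$group` has levels "control" and "treatment"; all names are placeholders.

```r
# Hedged sketch of a two-group limma comparison; object names are placeholders.
library(limma)

design <- model.matrix(~ group, data = metadata)  # intercept + grouptreatment
fit <- lmFit(expr_mat, design)
fit <- eBayes(fit)
topTable(fit, coef = "grouptreatment", number = 10)  # top genes by moderated t-statistic
```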

RNA-seq

Explore Code Ocean

Explore Code Ocean as a potential avenue to reduce the barrier to entry for researchers to set up and run the examples.

This is relevant to Bio Experts who don't have experience setting up their own programming environment.

"Getting Started" section materials to be added to refinebio-examples book

Background

As part of making our refinebio-examples more approachable, we'll need to add some content that equips users with foundational knowledge. Here's a rough outline and some materials we can borrow from:

  • We should probably borrow from the intro-to-R training module where appropriate: https://github.com/AlexsLemonade/training-modules/tree/master/intro-to-R-tidyverse
  • This will need to get broken out into smaller issues as we decide what topics are needed in this section.

Rough outline:

Resources we should link to

This getting started section should also include resources for people who may need more practice with the basics. Some suggestions:

Apply general recommendations for each module

From @dvenprasad:

General Thoughts/Recommendations:

  • Include installation instructions for dplyr and other packages? (See the setup-chunk sketch at the end of this section.)
  • Link to the GenePattern Notebook instead of the preview (am I supposed to create my own notebook and try to replicate the image, or is that notebook available for me to plug and play?)
  • Point notebook links to the rendered notebooks
  • Add a blurb recommending that users move their own data into the data folder of the example
  • Add a list of required packages or expected setup in the README for each example.

And from #40:

Add a list of packages/software they need to have set up to run the example notebooks.

A statement about downloading in the README (#62 (comment)):

To run the examples yourself, you will need to clone or download this repository < https://github.com/AlexsLemonade/refinebio-examples >
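
For the package-installation recommendation above, a setup chunk along these lines could sit near the top of each README or notebook. This is a sketch; the package list is illustrative only, not the actual requirements for any module.

```r
# Sketch of a setup chunk; the package list here is only an example.
required_packages <- c("dplyr", "readr", "ggplot2")
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
```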

Apply these recommendations to the following modules

  • Clustering (@cansav09)
  • Differential expression (@cansav09)
  • Dimension reduction (@cansav09)
  • Ensembl ID conversion (@cansav09)
  • QN your own data (@jaclyn-taroni)
  • Ortholog mapping (@jaclyn-taroni)
  • Pathway analysis (@jaclyn-taroni)
  • Validate differential expression (@cansav09)

How do we instruct users to obtain example datasets and Rmd files?

Some users are intimidated by GitHub. It is not very clear how one can get an individual notebook, either from the repo or from the rendered HTML R notebook.

For a different script, we had talked about uploading it to S3 on commit to master. That could be an option here as well.

Explore other possible solutions so it is simpler for people to get individual notebooks.

The band-aid fix for this would be to explicitly mention in the docs and README how to get these individual notebooks.

Add inline comments in notebooks where users need to replace code

Users didn't always feel confident about what they needed to replace. They tended to skim the comments before the code blocks and often missed what they needed to do.

Add inline comments like:

  • Replace path with the path to your data file
  • Replace with your organism

so it is explicit what they are expected to do.
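
For example, something along these lines; the file path and organism below are placeholders for whatever the notebook actually uses.

```r
# Sketch of the inline-comment style; values are placeholders.
expression_df <- readr::read_tsv("data/expression_data.tsv")  # <-- Replace with the path to YOUR data file
organism <- "Homo sapiens"                                    # <-- Replace with YOUR organism
```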


Add terms and conditions

We need a link to the T&C in the main README and to state:

In using these data, you agree to our terms and conditions.

  1. A terms and conditions markdown file needs to be placed in each module's data folder.
  2. A link and the above statement need to be added to each module's README.

We need to do this for the following modules:

  • Clustering (@cansav09)
  • Differential expression (@cansav09)
  • Dimension reduction (@cansav09)
  • Ensembl ID conversion (@cansav09)
  • QN your own data (@jaclyn-taroni)
  • Ortholog mapping (@jaclyn-taroni)
  • Pathway analysis (@jaclyn-taroni)
  • Validate differential expression (@cansav09)
  • Batch correction (@cansav09)

PCA example

This is woven into #24, but we can do a short example that's just PCA + visualization. I have a slight preference for using a small RNA-seq dataset that is case vs. control or treatment vs. mock.
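
A rough sketch of how small that example could be, assuming `expr_mat` is a genes-by-samples expression matrix and `metadata$group` labels case vs. control; all names are placeholders.

```r
# Hedged sketch: PCA plus a two-component visualization; names are placeholders.
library(ggplot2)

pca <- prcomp(t(expr_mat))  # prcomp expects samples as rows
pca_df <- data.frame(pca$x[, 1:2], group = metadata$group)

ggplot(pca_df, aes(x = PC1, y = PC2, color = group)) +
  geom_point()
```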

Train a PLIER model on a dataset that is aggregated by species

Related: #24

If we pick a large (enough) dataset that we aggregate by species for our batch correction example, we can then use it as training data for PLIER. Specifically, @cansav09 and I have talked about obtaining many datasets from a particular cell line (e.g., MCF-7, HEK293) for the batch correction example. We can then potentially train a PLIER model on the data with and without batch correction and compare. We'll have to be a bit careful about how we frame the comparison, though, as users may be linked from the docs to this example. We'll need to include sufficient context. cc @dvenprasad
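
A rough sketch of the training step, assuming the PLIER package's `PLIER()` function takes a genes-by-samples expression matrix and a binary genes-by-pathways prior matrix; all object names are placeholders and the exact comparison is left open.

```r
# Hedged sketch only: assumes PLIER::PLIER(data, priorMat); expr_uncorrected,
# expr_corrected, and prior_mat are placeholder objects.
library(PLIER)

plier_uncorrected <- PLIER(as.matrix(expr_uncorrected), prior_mat)
plier_corrected   <- PLIER(as.matrix(expr_corrected),   prior_mat)

# The two models' latent variables and pathway associations could then be
# compared to frame the effect of batch correction.
```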

Organize examples by technology/use case, rather than analysis type

Related to #101 and #92

If we present things as a follow along tutorial, it makes sense to me to organize material by technology or use case rather than analysis type. Specifically, duplicating analysis types like gene identifier conversion across separate microarray and RNA-seq "tracks" allows for some repetition of concepts and more opportunities to vary things in a way that supports learning (e.g., use different species [#98]).

Here are some initial thoughts about how to organize things, where the listing of individual analysis types is incomplete and for illustrative purposes only:

  • Introduction
    • Setting up RStudio
    • Obtaining files locally, etc.
    • Skip to "advanced use cases" (more on that below)
  • Microarray
    • Background on how we process microarray data (similar information to what's in the docs)
    • Clustering
    • Gene ID conversion
    • Differential gene expression analysis
    • Pathway analysis
  • RNA-seq
    • Background on how we process RNA-seq data (similar information to what's in the docs)
    • Clustering
    • Gene ID conversion
    • Differential gene expression analysis
    • Pathway analysis
  • Larger cross-experiment, cross-platform experiments & compendium
    • How we process these datasets in general and in the specific compendium case (similar information to what's in the docs)
    • Machine learning examples
  • "Advanced usage"
    • Normalizing your own data
    • API usage examples
    • Eventually: Python Client
    • Eventually: Command line
    • Eventually: R Client

For the clients, this may be more a place where we link out to other documentation specific to those clients, while including some information about installation, for example.

Buff up the main README page

Based on the recommendations explained in person by @dvenprasad, here's a list of changes that need to be made to the main README. This is partially adapted from the Notion link provided by @dvenprasad.

More setup info

  • Add a connection to training-modules
  • Add more setup information (links to RStudio installation, etc.)
  • What packages are used throughout? tidyverse

More GenePattern background info:

  • Fix links #48
  • Add more context about how the notebooks are designed to be used
  • Give them a heads-up that they will need to log in or create an account if they use the GenePattern notebooks

More info on modifying the notebooks

  • Add instructions about how to use/modify the notebooks with your own data
  • Note that the Gene column label is not always in the downloaded files (X1); also determine the pattern for that and why it happens
  • Add a blurb recommending that users move their own data into the data folder of the example

Advanced Topics: Batch correction example

It would be good to show examples of how to correct for batch effects. For this example, we would use refine.bio data that is aggregated by species. Presumably we'd include some sort of initial litmus test that, if followed, would help identify whether batch correction is needed in the first place.
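
One candidate method we could demonstrate is ComBat from the sva package; a minimal sketch, assuming `expr_mat` is a genes-by-samples matrix and `metadata$batch` records which dataset each sample came from. The names are placeholders and ComBat is only one option, not a settled choice.

```r
# Hedged sketch: ComBat is one candidate method, not a settled choice.
library(sva)

corrected_mat <- ComBat(dat = as.matrix(expr_mat), batch = metadata$batch)

# The "litmus test" could be as simple as PCA colored by batch, before and after.
```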

Create DESeqDataSet from refine.bio TSV file

This is related to #102 in my opinion.

For RNA-seq data where quantile normalization is skipped, we supply "bias corrected counts without an offset" in a TSV file. We can use this file with DESeq2::DESeqDataSetFromMatrix() and then have subsequent examples where we perform transformations, etc.
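
A minimal sketch of that construction, assuming the TSV's first column (`Gene`) holds gene IDs, the remaining columns are samples, and `metadata` has a matching `group` column; the file name and object names are placeholders.

```r
# Hedged sketch; "data/EXAMPLE.tsv", metadata, and group are placeholders.
library(DESeq2)

counts_df <- readr::read_tsv("data/EXAMPLE.tsv")
count_mat <- as.matrix(counts_df[, -1])
rownames(count_mat) <- counts_df$Gene

# The bias-corrected counts are not integers, so round before building the object
dds <- DESeqDataSetFromMatrix(
  countData = round(count_mat),
  colData = metadata,
  design = ~ group
)
```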

Pathway analysis module

Pathway analysis is very likely to be a popular use case. I propose a new pathway-analysis module that contains two use cases: 1) a single dataset and 2) meta-analysis. In a way, this issue is related to #24 and #30—that is, one could combine several studies, aggregating by species, and then do pathway analysis. It may be more desirable to analyze datasets separately.

Here's what I think this module needs:

  • Pathway analysis in a single dataset with QuSAGE (qusage, publication); see the sketch after this list
  • Pathway meta-analysis with QuSAGE (implemented in package as of 2.0.0, publication)
  • A pointer to Gene Set Enrichment Analysis (GenePattern) in the README
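
A minimal sketch of the single-dataset case, assuming a log-expression matrix `expr_mat`, a character vector `labels` of group assignments containing "treatment" and "control", and a named list of gene sets `gene_sets`; all of these are placeholders.

```r
# Hedged sketch of qusage's single-dataset workflow; objects are placeholders.
library(qusage)

qs_results <- qusage(expr_mat, labels, contrast = "treatment-control", geneSets = gene_sets)
qsTable(qs_results, number = 10)  # top gene sets by pathway activity
```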

Fix GenePattern notebook links

  • Fix GenePattern notebook links including CLSFileCreator link.
  • Link to GenePattern Notebook instead of the preview
  • Link notebook links to rendered notebooks

Reorganize repo such that different workflows are in their own directories

Different workflows are all in the main directory. I believe a strategy where each workflow (e.g., clustering, differential expression analyses, ortholog mapping) is in its own directory will make this easier for folks to follow.

I've sketched out what I think this should look like:

(Photo of the sketched directory structure.)

To reiterate the points from that sketch (see also the illustrative layout after this list):

  • The top or main directory README would serve mostly as an introduction and a table of contents for the rest of the repository
  • Each workflow's directory would contain a README (named README.md to facilitate display in the GitHub web interface) that includes the relevant content currently housed in the top-directory README and a link to the GCT script where applicable
  • create_gct_file.R (and any other scripts) would live in a new directory scripts/
  • Any data would live in a new directory data/ -- note that this probably requires some tweaks to the setup chunks in notebooks so that they can access the data/ directory
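
For concreteness, a layout following the points above might look something like this; the module and file names are illustrative only, and the placement of data/ follows the bullet above rather than a final decision.

```
refinebio-examples/
├── README.md                  # introduction + table of contents
├── scripts/
│   └── create_gct_file.R
├── data/
├── clustering/
│   ├── README.md
│   └── clustering_example.Rmd
└── differential-expression/
    ├── README.md
    └── differential_expression_example.Rmd
```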

Related: #11 - specifically, we'll want to make sure everything is consistent

Users should obtain example data via the regenerate files mechanism

Context

Related to: AlexsLemonade/refinebio-frontend#633, AlexsLemonade/refinebio-docs#99, AlexsLemonade/refinebio-frontend#636

We've designed the Share and Regenerate Files functionality of refine.bio to facilitate dataset sharing. Our examples should follow this model.

Idea and Solution

This ticket has two components:

  1. Remove any data tracked with git from the repository.
  2. Replace that information with a link to a shared dataset. This information is probably best served in a "Data Requirements" section in each module's README.

Number two above is a bit trickier than it may seem. Specifically, notebooks in a module expect metadata and gene expression data to be in the data folder and we've generally ignored the JSON files.

I'm thinking it will go something like: Follow the link > Regenerate files > Enter email and accept T&C > Download dataset via your email from the refine.bio mail robot > Unzip data in the top-level directory of a module > rename folder to data ?

We'll also have to pay special attention to the larger datasets that used to be zipped up. The notebooks have steps for unzipping them that need to get taken out.

We need to do this for the following modules:

  • Clustering (@cansav09)
  • Differential expression (@cansav09)
  • Dimension reduction (@cansav09)
  • Ensembl ID conversion (@cansav09)
  • QN your own data (@jaclyn-taroni)
  • Ortholog mapping (@jaclyn-taroni)
  • Pathway analysis (@jaclyn-taroni)
  • Validate differential expression (@cansav09)
  • Batch correction (@cansav09)

Using refine.bio QN targets with your own data

Related to: AlexsLemonade/refinebio-docs#73

I think a notebook in this repository, which we then link to from the documentation, is the best way to address the issue above.

Any example should contain the following components:

  • Using the API to obtain the QN target file
  • Importing your own data and the QN target into R and using preprocessCore to normalize your own data (sketched below)
  • A plot demonstrating the effect of quantile normalization (ECDF perhaps)
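
A minimal sketch of the normalization step, assuming `own_expr` is the user's own genes-by-samples matrix already matched to the QN target's gene order and `qn_target` is the numeric target vector obtained from the API; both are placeholders.

```r
# Hedged sketch; own_expr and qn_target are placeholders, and genes must be in
# the same order as the QN target before this step.
library(preprocessCore)

qn_expr <- normalize.quantiles.use.target(
  x = as.matrix(own_expr),
  target = qn_target
)

# ECDF before vs. after for one sample, to demonstrate the effect
plot(ecdf(own_expr[, 1]))
plot(ecdf(qn_expr[, 1]), add = TRUE, col = "red")
```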

WebGestaltR package problems - version on CRAN mirror may be old

A workshop participant was trying to follow refinebio-examples/pathway-analysis/ora_with_webgestaltr.Rmd, but the function listIdType was not recognized even though she had loaded the WebGestaltR library. It turns out that an older version of WebGestaltR called this function listIDType (with a capital D). This also prevented the main WebGestaltR function from properly downloading its data, since it uses this function internally. I reproduced the same problem on my own computer.

I'm unsure why the package version installed from CRAN was not working properly, but when I installed the WebGestaltR package using devtools::install_github("bzhanglab/WebGestalt"), these problems were resolved.

Simplify differential expression analysis

As noted in #6 (comment), the experiment selected for differential expression analysis (the same one used in compare-processing) has two factors: time point and genotype. These two factors are not taken into account in the differential expression example. We should pick a "simpler" experiment (e.g., a single factor like treatment) for our differential expression analysis example.
