This repository contains example workflows of how to use data downloaded from refine.bio.
Go to the Getting Started Section
Example workflows for refine.bio data
Home Page: https://www.refine.bio
License: Other
From @cgreene:
process your own data, do differential expression for it & a refinebio dataset & compare
This is also an example I used when chatting with folks at the AACR booth and it seemed to be highly relevant to potential users.
A few notes about some decision points:
We need to link these examples to the use cases section of our documentation: https://github.com/AlexsLemonade/refinebio-docs/blob/master/docs/main_text.md#use-cases-for-downstream-analysis
This requires the following:
Depends on #14
It's helpful to have links open in a new tab or window so users don't lose their place in the READMEs. This appears to be especially important when installing the required software.
Here's an example using the pattern we use in the refine.bio docs:
[**R**](https://cran.r-project.org/)
becomes
<a href = "https://cran.r-project.org/" target = "blank">**R**</a>
To make refine.bio data compatible with GenePattern, which requires a particular file format, we've included a script that takes a refine.bio TSV and outputs a GCT file.
This script is currently located in the `scripts` folder. We explain how to use this script in the relevant README (see: https://github.com/AlexsLemonade/refinebio-examples/tree/master/differential-expression#create-a-gct-file):
Create a GCT file
Convert a provided gene expression tab-separated values (TSV) file into a 'gene cluster text' (GCT) file for use in GenePattern. To create a GCT-formatted file from a refine.bio TSV data file, download the create_gct_file.R script.
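For context, the GCT 1.2 format that GenePattern expects begins with a version line and a dimensions line (data rows, then sample columns) before the data table. The gene IDs and values below are made-up placeholders:

```
#1.2
2	3
NAME	Description	sample1	sample2	sample3
ENSG00000141510	na	5.2	4.8	6.0
ENSG00000012048	na	3.1	2.9	3.3
```

The `Description` column is required by the format even when there is nothing meaningful to put in it, which is part of why a conversion script is needed at all.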
The link above is to https://github.com/AlexsLemonade/refinebio-examples/blob/master/scripts/create_gct_file.R
A user like myself may know how to `wget` this file. For the intended users, the only instructions we have essentially assume that they will have downloaded the entire repository, which may be unnecessary if they are only using GenePattern.
In addition, we've designed other modules to be self-contained (e.g., each has its own `data` folder) so that downstream users could potentially delete local copies of the modules they are not using and/or move the modules they are using outside of the overarching GitHub repository structure.
Ideally, the link to the R script in the relevant READMEs (the `differential-expression` example quoted above and `clustering`) would download the latest version of that script, and users could put a local copy wherever they want. I believe this would make things easier for the intended audience.
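As a sketch, the script can also be fetched directly from R without cloning anything; the `raw.githubusercontent.com` URL below is my assumption about where the latest `master` copy of the blob quoted above is served:

```r
# Fetch the latest create_gct_file.R without cloning the whole repository.
# The raw.githubusercontent.com URL is an assumption -- it mirrors the
# GitHub blob URL quoted above.
download.file(
  "https://raw.githubusercontent.com/AlexsLemonade/refinebio-examples/master/scripts/create_gct_file.R",
  destfile = "create_gct_file.R"
)
```

A one-liner like this could go in the README right next to the script link, so GenePattern-only users never have to touch the rest of the repository.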
@arielsvn suggested the following:
@arielsvn @kurtwheeler do you have ideas about how we would accomplish this and how complex it would be to implement?
We envisioned GenePattern notebooks as a lower barrier to entry than our R Notebooks. However, our script, instructions, etc. about getting things formatted for use with GenePattern may be insufficient. @dvenprasad will go through this and make recommendations about what needs to be improved.
Test users looking at/using the examples.
Questions:
tagging @jaclyn-taroni and @cansav09 for input.
It was unclear to me how easy or hard it would be to make a bookdown while keeping our current module-per-folder structure.
Here's the relevant bookdown chapter
https://bookdown.org/yihui/rmarkdown/bookdown-project.html
This issue made me think it wasn't very possible: rstudio/bookdown#107
But these issues made me think it's more possible if we change the yml file:
rstudio/bookdown#418
https://stackoverflow.com/questions/40578609/set-r-bookdown-input-directory
From @dvenprasad:
For normalizing your own data, explicitly state what types of files/formats can be plugged into the normalize-your-own-data script.
Add a message so users know when the script has finished creating its output.
Once AlexsLemonade/refinebio#1280 is deployed, we want to redownload the data and update examples.
We'll need to use both gene expression matrices and metadata from refine.bio to compare 2 groups, preferably in an experiment with a straightforward case vs. control or vehicle vs. drug type of design.
Microarray:

- limma (? open for discussion)

RNA-seq:

- limma (see this reference)

Explore Code Ocean as a potential avenue to reduce the barrier to entry for researchers to set up and run examples. This is relevant to Bio Experts who don't have experience setting up their own programming environment.
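A two-group limma comparison like the one described above can be sketched as follows. This uses a toy matrix; with real refine.bio files, the expression matrix and group labels would come from the data TSV and metadata, and all names here are placeholders:

```r
library(limma)

# Toy expression matrix: 100 genes x 6 samples (3 control, 3 treated)
set.seed(42)
expr <- matrix(rnorm(600), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))
group <- factor(rep(c("control", "treated"), each = 3))

# Standard limma workflow: linear model per gene, then empirical Bayes shrinkage
design <- model.matrix(~group)
fit <- eBayes(lmFit(expr, design))
topTable(fit, coef = "grouptreated", number = 5)  # top-ranked genes
```

Because limma handles both microarray intensities and (via voom or trend) RNA-seq, a single worked example could plausibly serve both tracks, which may be relevant to the "open for discussion" point above.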
In our goal of making our refinebio-examples more approachable, we'll need to add some content that equips users with some foundational knowledge. Here's a rough outline and some materials that we can borrow from:
We should probably think about borrowing from the intro to R training-module where appropriate https://github.com/AlexsLemonade/training-modules/tree/master/intro-to-R-tidyverse
This will need to get broken out into smaller issues as we decide what topics are needed in this section
This getting started section should also include resources for people who may need more practice with the basics. Some suggestions:
From @dvenprasad:
General Thoughts/Recommendations:
- Include install instructions for dplyr and other packages?
- Link to the GenePattern Notebook instead of the preview (am I supposed to create my own notebook and try to replicate the image, or is that notebook available for me to plug and play?)
- Link notebook links to rendered notebooks
- Add a blurb recommending that users move their own data into the `data` folder of the example
- Add a list of required packages or expected setup in the README for each example.
And from #40:
Add a list of packages/software they need to have set up to run the example notebooks.
A statement about downloading in the README (#62 (comment)):
To run the examples yourself, you will need to clone or download this repository < https://github.com/AlexsLemonade/refinebio-examples >
Apply these recommendations to the following modules
Some users are intimidated by GitHub. It is not very clear how one can get individual notebooks, either from the repo or from the rendered HTML R notebook.
We had talked about having a script to upload them to S3 on commit to master in a different context. That could be an option here.
Explore other possible solutions so it is simpler for people to get individual notebooks.
The band-aid fix for this would be to explicitly mention in the docs and README how to get these individual notebooks.
Users didn't always feel confident about what they needed to replace. They tended to skim through the comments before the code blocks and often missed what they needed to do.

Add inline comments like:

- Replace path with path to your data file
- Replace with your organism

so it is explicit what they are expected to do.
Related: AlexsLemonade/refinebio#512
If we start offering different options for downloading individual RNA-seq datasets (likely via the API), we can add a differential expression analysis example that uses a workflow that's specifically for RNA-seq.
Check for style consistency
We need a link to the T&C in the main README and to state:
In using these data, you agree to our terms and conditions.
`data` folder.

Not all of our users will be familiar with GitHub, and our use of GitHub requires a lot of repetition of text. Would using bookdown be "friendlier?"
Here's an excellent example: Claus O. Wilke's Fundamentals of Data Visualization
Mapping between zebrafish and human identifiers, probably using hcop
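The join step of such a mapping can be sketched with a toy ortholog table. All IDs, column names, and values below are hypothetical; a real mapping could come from an HCOP bulk download:

```r
# Toy ortholog table -- in practice this would come from HCOP, and the
# column names would match whatever the downloaded file actually uses.
orthologs <- data.frame(
  zebrafish_ensembl_gene = c("ENSDARG00000000001", "ENSDARG00000000002"),
  human_symbol = c("GENE_A", "GENE_B")  # hypothetical mapping
)

# Toy zebrafish expression data keyed on the same Ensembl IDs
expr <- data.frame(
  Gene = c("ENSDARG00000000001", "ENSDARG00000000002"),
  sample1 = c(5.2, 3.1)
)

# Attach human identifiers to the expression rows
merged <- merge(expr, orthologs,
                by.x = "Gene", by.y = "zebrafish_ensembl_gene")
```

One-to-many ortholog relationships will duplicate rows on the merge, so the example will need to say something about how to resolve those.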
This is woven into #24, but we can do a short example that's just PCA + visualization. I have a slight preference for using a small RNA-seq dataset that is case vs. control or treatment vs. mock.
Related: #24
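The PCA + visualization step can be sketched in a few lines of base R. The matrix and group labels below are toy placeholders for a real refine.bio expression matrix and its metadata:

```r
# Toy genes x samples matrix; a real one would come from a refine.bio TSV
set.seed(1)
expr <- matrix(rnorm(600), nrow = 100,
               dimnames = list(NULL, paste0("sample", 1:6)))
group <- rep(c("case", "control"), each = 3)  # hypothetical labels

# prcomp expects samples as rows, so transpose the genes x samples matrix
pca <- prcomp(t(expr), scale. = TRUE)

# Scatter the first two principal components, colored by group
plot(pca$x[, "PC1"], pca$x[, "PC2"],
     col = factor(group), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = unique(group), col = 1:2, pch = 19)
```

A ggplot2 version may fit the tidyverse framing of the other examples better; the base-graphics version just keeps the dependency list short.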
If we pick a large (enough) dataset that we aggregate by species for our batch correction example, we can then use it as training data for PLIER. Specifically, @cansav09 and I have talked about obtaining many datasets from a particular cell line (e.g., MCF-7, HEK293) for the batch correction example. We can then potentially train a PLIER model on the data with and without batch correction and compare. We'll have to be a bit careful about how we frame the comparison, though, as users may be linked from the docs to this example. We'll need to include sufficient context. cc @dvenprasad
We may need to create a project, or we can make an example for importing the data. A bit more research is required.
Add a list of packages/software they need to have set up to run the example notebooks.
Related to: AlexsLemonade/refinebio-docs#119
If we present things as a follow along tutorial, it makes sense to me to organize material by technology or use case rather than analysis type. Specifically, duplicating analysis types like gene identifier conversion across separate microarray and RNA-seq "tracks" allows for some repetition of concepts and more opportunities to vary things in a way that supports learning (e.g., use different species [#98]).
Here are some initial thoughts about how to organize things, where the listing of individual types of analyses are incomplete and for illustrative purposes:
For the clients, this may be more like a place where we link out to other documentation specific to those clients, but include some info about installation, for example.
It doesn't have one right now.
Based on the recommendations explained in person by @dvenprasad, here's a list of changes that need to be made to the main README. This is partially adapted from the Notion link provided by @dvenprasad:
- tidyverse
- The `Gene` column label not always being in the downloaded files (read in as `X1`)
It would be good to show examples of how to correct for batch effects. For this example, we would use refine.bio data that is aggregated by species. Presumably this should include some sort of initial litmus test that, if followed, would help identify whether batch correction is needed in the first place.
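One candidate method for the batch correction described above is ComBat from the sva package; this is a suggestion, not a settled choice, and the data below is a toy stand-in for a species-aggregated refine.bio matrix:

```r
library(sva)

# Toy data: 50 genes x 10 samples drawn from two hypothetical studies
set.seed(7)
expr <- matrix(rnorm(500), nrow = 50)
batch <- rep(c("study_A", "study_B"), each = 5)

# ComBat returns a matrix with the batch effect adjusted out
corrected <- ComBat(dat = expr, batch = batch)
```

The "litmus test" could be as simple as running PCA before and after correction and checking whether samples cluster by study rather than by biology.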
@dvenprasad found in the usability evals that an error came up with `as.data.frame != TRUE` in `tibble::columns_as_rownames`. Need to investigate and fix this.
This is related to #102 in my opinion.
For RNA-seq data where quantile normalization is skipped, we supply "bias corrected counts without an offset" in a TSV file. We can use this file with `DESeq2::DESeqDataSetFromMatrix()` and then have subsequent examples where we perform transformations, etc.
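A minimal sketch of that entry point, using toy data in place of a real refine.bio TSV and metadata file (all names are placeholders):

```r
library(DESeq2)

# Toy stand-in for a refine.bio counts TSV: 100 genes x 6 samples.
set.seed(3)
counts <- matrix(rpois(600, lambda = 10), nrow = 100,
                 dimnames = list(paste0("ENSG", 1:100), paste0("sample", 1:6)))
metadata <- data.frame(treatment = rep(c("vehicle", "drug"), each = 3))

dds <- DESeqDataSetFromMatrix(
  countData = round(counts),  # the bias-corrected counts are non-integer,
                              # and DESeq2 requires integers, hence round()
  colData = metadata,
  design = ~treatment
)
```

The `round()` step is worth calling out explicitly in the example, since it is the non-obvious wrinkle of using bias-corrected counts with DESeq2.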
From Ensembl gene IDs, which we deliver, to something like gene symbols or Entrez IDs
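One way to sketch this conversion is with AnnotationDbi and a species annotation package; `org.Hs.eg.db` here is an assumption (the example would need the package matching its dataset's species):

```r
# Map Ensembl gene IDs to gene symbols. org.Hs.eg.db is the human
# annotation package; swap in the package for your species.
library(org.Hs.eg.db)
library(AnnotationDbi)

ensembl_ids <- c("ENSG00000141510", "ENSG00000012048")  # example IDs

symbols <- mapIds(org.Hs.eg.db,
                  keys = ensembl_ids,
                  keytype = "ENSEMBL",
                  column = "SYMBOL",
                  multiVals = "first")  # how to handle 1:many mappings
```

The `multiVals` argument is a good hook for the example to discuss, since identifier conversion is rarely one-to-one.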
It's been a minute since I originally wrote these; some of them could use streamlining and further refinement in general.
Pathway analysis is very likely to be a popular use case. I propose a new `pathway-analysis` module that contains two use cases: 1) a single dataset and 2) meta-analysis. In a way, this issue is related to #24 and #30—that is, one could combine several studies, aggregating by species, and then do pathway analysis. It may be more desirable to analyze datasets separately.
Here's what I think this module needs:
- qusage (…, publication)
- … (2.0.0, publication)

Different workflows are all in the main directory. I believe a strategy where each workflow (e.g., clustering, differential expression analyses, ortholog mapping) is in its own directory will make this easier for folks to follow.
I've sketched out what I think this should look like:
To reiterate the points from that sketch:
- Each module gets its own `README.md` (to facilitate display on the GitHub web interface) that includes the relevant content currently housed in the top directory README, plus a link to the GCT script where applicable
- `create_gct_file.R` (and any other scripts) would live in a new `scripts/` directory
- Each module's data would live in its own `data/` folder -- note that this probably requires some tweaks to the setup chunks in notebooks such that they can access the `data/` directory

Related: #11 - specifically, we'll want to make sure everything is consistent
Presenting examples as follow-along tutorials will better help orient novice R users. We can explore Bookdown as a medium to do this. #92
Related to: AlexsLemonade/refinebio-frontend#633, AlexsLemonade/refinebio-docs#99, AlexsLemonade/refinebio-frontend#636
We've designed the `Share` and `Regenerate Files` functionality of refine.bio to facilitate dataset sharing. Our examples should follow this model.
This ticket has two components:
Number two above is a bit trickier than it may seem. Specifically, notebooks in a module expect metadata and gene expression data to be in the `data` folder, and we've generally ignored the JSON files.

I'm thinking it will go something like: Follow the link > Regenerate files > Enter email and accept T&C > Download dataset via your email from the refine.bio mail robot > Unzip data in the top-level directory of a module > Rename the folder to `data`?
We'll also have to pay special attention to the larger datasets that used to be zipped up. The notebooks have steps for unzipping them that need to be taken out.
Related to: AlexsLemonade/refinebio-docs#73
I think a notebook in this repository that we then link to in the documentation is the best way to accomplish the above issue.
Any example should contain the following components:

- `preprocessCore` to normalize one's own data

In mid-2019, @dvenprasad did a series of usability evaluations. This epic is to track issues related to recommendations that stem from that series of evaluations.
A workshop participant was trying to follow `refine.bio-examples/pathway-analysis/ora_with_webgestaltr.Rmd`, but the function `listIdType` was not being recognized even though she had loaded the `WebGestaltR` library. It turns out an older version of WebGestaltR called this function `listIDType` (with a capital D). This also kept the main `WebGestaltR` function from properly downloading the data, since it uses this function internally. I reproduced this same problem on my own computer.

I'm unsure why the package version installed from CRAN was not working properly, but when I installed the WebGestaltR package using `devtools::install_github("bzhanglab/WebGestalt")`, these problems were resolved.
Across Rmd, Rscript, etc.
As noted in #6 (comment), the experiment selected for differential expression analysis (the same one used in `compare-processing`) has two factors: time point and genotype. These two factors are not taken into account in the differential expression example. We should pick a "simpler" experiment (e.g., a single factor like treatment) for our differential expression analysis example.
Before we start on an entire restructuring of this repo, we need to make a release of this repo and also make sure that that release is what is linked in the front end on refine.bio.
This is an Epic to keep track of our issues to make refinebio-examples into a more user-friendly, follow-along tutorial.
Here are other tutorials as inspiration: