Bioconductor on Microsoft Azure

Nitesh Turaga; Erdal Cosgun; and Vince Carey

Teaser: Using Bioconductor on Microsoft Azure cloud resources for scalable genomic computing.

Introduction

The Bioconductor project promotes the statistical analysis and comprehension of current and emerging high-throughput biological assays. Bioconductor is a strong proponent of open-source and open development of software, and of collaborative, literate, and reproducible research.

As the scale of genomic data grows exponentially, cloud services are increasingly used to cope with its size. Cloud computing suits genomic analysis because the volume of data varies with the analysis setting, and resources can be scaled to match. This elasticity and scalability make it easy for a small lab or a large company to take advantage of Bioconductor's open-source software and data resources.

Bioconductor website

Bioconductor Docker Images

Bioconductor produces Docker images so that users can run the latest stable versions of R and Bioconductor from either the command line or an RStudio UI. These images are built with the system libraries needed to install (and compile) over 2000 Bioconductor packages.

These Docker images, hosted by Bioconductor, are available on the Microsoft Container Registry (MCR) and are freely available to the public under the open-source Artistic-2.0 license.

docker pull mcr.microsoft.com/bioconductor/bioconductor_docker:RELEASE_3_14

These images can be used with Azure Container Instances (ACI) by following the available launch instructions.

An added benefit of these Docker images is the availability of pre-compiled Bioconductor package binaries. These binaries speed up the installation of packages on the Docker image and give users an efficient research computing environment where exploratory data analysis is faster.

Because this feature is still in development, it is currently available on a branch of the CRAN package BiocManager that will be merged soon. An example of installing binary packages is given below:

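# Install the BiocManager branch that supports binary packages
BiocManager::install('Bioconductor/BiocManager@anvil')

# Packages to install from the pre-compiled binaries
pkgs <- c('BiocParallel', 'rsbml', 'rhdf5')

BiocManager::install(pkgs)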
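
After an installation like the one above, it can be useful to confirm that the packages are actually available in the R library. The following is a minimal sketch using only base R; the variable name `installed` is illustrative and not part of the original post.

```r
# Same package list as in the installation example.
pkgs <- c("BiocParallel", "rsbml", "rhdf5")

# requireNamespace() returns TRUE when a package can be loaded and FALSE
# otherwise, without attaching it to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: one entry per package.
print(installed)

# List any packages that still need attention.
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can be passed to BiocManager::install() again.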

Bioconductor Hubs - AnnotationHub and ExperimentHub data

Bioconductor distributes its annotation and experiment hub data through Azure Storage containers. The Bioconductor AnnotationHub resource provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard locations (e.g., UCSC, Ensembl) can be discovered. The resource includes metadata about each resource, e.g., a textual description, tags, and date of modification.

ExperimentHub provides a central location where curated data from experiments, publications or training courses can be accessed. Each resource has associated metadata, tags and date of modification.
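
As a rough illustration of how the two hubs are searched from R, the sketch below uses the AnnotationHub and ExperimentHub packages. It assumes both packages are installed and that the machine has network access; the helper `search_hub` and the search terms are hypothetical, chosen only for illustration.

```r
# Assumption: AnnotationHub and ExperimentHub are installed, e.g. via
# BiocManager::install(c("AnnotationHub", "ExperimentHub")).
library(AnnotationHub)
library(ExperimentHub)

# Hypothetical helper: search a hub and return the ID and title of each hit.
search_hub <- function(hub, terms) {
  hits <- query(hub, terms)  # match the terms against the hub metadata
  data.frame(id = names(hits), title = hits$title, stringsAsFactors = FALSE)
}

# Creating a hub object fetches the metadata index (needs network access).
ah <- AnnotationHub()
head(search_hub(ah, c("Homo sapiens", "Ensembl", "GTF")))

eh <- ExperimentHub()
head(search_hub(eh, "alpineData"))

# A single resource can then be downloaded and loaded by its ID with
# hub[["<id>"]], using an ID taken from the search results.
```

Only the metadata is downloaded during a search; the underlying files are fetched when a resource is retrieved by its ID.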

As of this post (1/27/2022), about 2.5 TB of data has been distributed to scientists around the world in support of genomic research.

BioconductorHubs Usage Stats for 1 month

Jupyter Notebooks and Virtual Machines for Bioconductor on Microsoft Azure

The Genomics Data Lake provides various public datasets that you can access for free and integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats. The Genomics Data Lake is hosted in the West US 2 and West Central US Azure regions, so allocating compute resources in one of those regions is recommended for affinity. The Bioconductor Annotation and Experiment Hub data will be available on the Microsoft Genomics Data Lake in mid-February 2022.

You can find sample Jupyter notebooks in this repo that download Genomics Data Lake data into your R/Bioconductor project. Another useful way to use Bioconductor packages on Azure is through the built-in Genomics Data Science VMs. You can easily deploy Windows or Linux VMs from this link.
