Functions for handling RNA-seq files and formats as input and output for scrattch functions.

scrattch.io: scrattch File Input/Output Handling

Installation

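scrattch.io requires the rhdf5 package from Bioconductor. The older biocLite() installer is no longer available for recent versions of R, so install rhdf5 with BiocManager:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("rhdf5")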

Once rhdf5 is in place, scrattch.io can be installed from GitHub using the devtools package:

install.packages("devtools")
devtools::install_github("AllenInstitute/scrattch.io")

If you'd like to use the development branch, where we're testing out new code, it can be installed using:

devtools::install_github("AllenInstitute/scrattch.io", ref = "dev")

.tome files

A major component of scrattch.io is a set of functions for writing and reading .tome files, an open, modular, and extensible HDF5-based format for transcriptomics data.

Why another HDF5 format for transcriptomics?

Existing formats for transcriptomics are designed either for fast computation, like .loom, or for a small storage footprint, like the .h5 files generated by 10x Genomics' Cell Ranger. The goal of .tome is to combine compact storage with reasonably fast random access of both genes and samples.

This is accomplished by storing the main data matrix in a sparse format, based on dgCMatrix from the R Matrix package, stored in both orientations. This structure is also chunked and compressed to speed access and reduce file size. The compression level can be changed depending on how quickly you need to read your data (see ?write_tome_data for details).
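As a rough illustration of why storing both orientations helps, the sketch below builds a small dgCMatrix in memory and inspects its three underlying vectors. The toy values and the gene and sample names are made up for this example, and nothing here reflects the exact on-disk layout that scrattch.io writes. It only shows that a compressed-column matrix gives contiguous access to columns, so a transposed copy is needed for equally direct access to rows.

```r
library(Matrix)

# Toy count matrix: 4 genes (rows) x 3 samples (columns), mostly zeros.
counts <- sparseMatrix(
  i = c(1, 3, 2, 4, 4),
  j = c(1, 1, 2, 2, 3),
  x = c(5, 2, 7, 1, 9),
  dims = c(4, 3),
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# A dgCMatrix is described by three vectors:
counts@x  # non-zero values, stored column by column
counts@i  # zero-based row index of each value
counts@p  # zero-based offsets marking where each column starts

# All values for sample 2 sit in one contiguous slice of x.
col <- 2
col_slice <- (counts@p[col] + 1):counts@p[col + 1]
sample2_values <- counts@x[col_slice]

# The transposed copy makes each gene a column, so one gene's values
# across all samples are contiguous as well.
t_counts <- t(counts)
gene <- 4
gene_slice <- (t_counts@p[gene] + 1):t_counts@p[gene + 1]
gene4_values <- t_counts@x[gene_slice]
```

Keeping both copies roughly doubles the stored non-zero values, which is the trade-off for fast access along either axis.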

The practical upshot of this strategy is that .tome files are roughly one-tenth the size of .loom files for data from 10x Genomics experiments, while still allowing gene or sample data to be read quickly for display.

Many additional kinds of metadata can be stored in .tome files as well, from sample annotations to precomputed statistics.

The .tome cheatsheet on Google Docs is a helpful reference for where scrattch.io stores these within the HDF5 file structure, and which functions can be used to read and write these objects.

.tome is intended to be extensible. Want to store something that isn't already provided? Check out the Generic functions section of the .tome cheatsheet to add your own data however it makes sense to you.

.loom files

scrattch.io also includes simple functions for reading matrices, annotations, and projections from .loom files with read_loom_dgCMatrix(), read_loom_anno(), and read_loom_projections(), respectively.

You can find out more about the .loom format, developed by the Linnarsson lab, here: loompy.org

A more complete implementation of the .loom format in R is available from the Satija lab's loomR package on GitHub here: mojaveazure/loomR

10X Genomics files

scrattch.io can read the data matrix from the .h5 files that Cell Ranger outputs in HDF5 Gene-Barcode Matrix Format, using read_10x_dgCMatrix().

.h5ad files

scrattch.io also supports reading the main data matrix from .h5ad files that are generated by tools like Scanpy with read_h5ad_dgCMatrix().

The `scrattch` suite

scrattch.io is one component of the scrattch suite of packages for Single Cell RNA-seq Analysis for Transcriptomic Type CHaracterization from the Allen Institute.

License

The license for this package is available on GitHub at: https://github.com/AllenInstitute/scrattch.io/blob/master/LICENSE

Level of Support

We plan to update this tool occasionally, with no fixed schedule. Community involvement is encouraged through both issues and pull requests.

Contribution Agreement

If you contribute code to this repository through pull requests or other mechanisms, you are subject to the Allen Institute Contribution Agreement, which is available in full at: https://github.com/AllenInstitute/scrattch.io/blob/master/CONTRIBUTION

scrattch.io's Issues

export from tome to any other format?

Hi, we have a tome file that we need to process. Is there any function or way to get the data out of the .tome file in a standard format? Like .mtx, .h5, a .csv or .tsv file with the genes on the lines and the first column being the geneId (possibly the symbol separated by | or similar) ?

I can see that tome is very good at importing files, but I cannot see an export function...

thanks!
Max
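One possible workaround, sketched here under assumptions: once the matrix is in memory as a dgCMatrix (for example via read_tome_dgCMatrix(), whose arguments are not shown here), the Matrix package and base R can write it out in common formats. The toy matrix, gene symbols, and file names below are stand-ins invented for this example, not part of scrattch.io.

```r
library(Matrix)

# Stand-in for a matrix read from a .tome file:
# genes in rows, samples in columns.
mat <- sparseMatrix(
  i = c(1, 2, 3),
  j = c(1, 2, 2),
  x = c(4, 1, 6),
  dims = c(3, 2),
  dimnames = list(c("Gad1", "Slc17a7", "Snap25"), c("cell_a", "cell_b"))
)

out_dir <- tempdir()

# Matrix Market (.mtx) output, with row and column names in side files
# because the .mtx format does not carry dimnames.
mtx_file <- file.path(out_dir, "counts.mtx")
writeMM(mat, mtx_file)
writeLines(rownames(mat), file.path(out_dir, "genes.tsv"))
writeLines(colnames(mat), file.path(out_dir, "barcodes.tsv"))

# Dense CSV with one gene per line and the gene ID in the first column.
# This expands the zeros, so it is only practical for modest matrix sizes.
dense <- data.frame(gene = rownames(mat), as.matrix(mat), check.names = FALSE)
csv_file <- file.path(out_dir, "counts.csv")
write.csv(dense, csv_file, row.names = FALSE)
```

For large datasets the .mtx route is usually the safer choice, since it keeps the data sparse on disk.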

Add hdf5 installation to README

Unix/Linux and Mac users may not have hdf5, and installing rhdf5 will not retrieve these dependencies.

For Mac:

Install Homebrew
brew install hdf5

For CentOS:
sudo yum install hdf5-devel

`read_loom_dgCMatrix` dimnames error

Hi - I am reading a loom file using read_loom_dgCMatrix and I receive the following error:

Reading samples 1 to 5000Error in dimnamesGets(x, value) : length of Dimnames[[1]] (9757) is not equal to Dim[1] (5000)

I am not sure why the error is arising; other loom files can be read without an issue.

Thanks!
Joe

Investigate DelayedArrays for .tome

Currently, .tome files store data matrices in a format similar to dgCMatrix from the Matrix package. This is great for making the files much smaller (nice for shinyapps.io deployments), but other implementations of on-disk matrices are more complete.

DelayedArray looks like an interesting set of methods for large on-disk arrays/matrices stored in an HDF5 file, and converting for compatibility with these arrays could have nice add-on effects - as the packages for matrixStats support continue to mature, we could benefit from the added functionality.

Here's a nice workshop chapter on implementation:
https://bioconductor.github.io/BiocWorkshops/effectively-using-the-delayedarray-framework-to-support-the-analysis-of-large-datasets.html

To try this out, we'll probably first need low-level write_tome_darray() and read_tome_darray() functions. From there, I can see if it makes sense to replace the dgCMatrices for count data, or offer DelayedArrays as an option alongside dgCMatrices. The latter could be nice - one compact structure for portability and Shiny display and another larger format for computation.

Appending samples to a .tome

It should be possible to append samples to an existing .tome file. This would allow us to incrementally update a single file instead of recompiling the entire thing from scratch.

This is probably straightforward for adding columns as vectors, but may be more complicated for adding samples. Maybe I can expand and add values to sample-indexed sparse matrices. Gene-indexed matrices will have to be rebuilt.
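The asymmetry between the two orientations can be seen with an in-memory sketch. The helper append_columns() below is hypothetical and is not part of scrattch.io. It assumes the sample-indexed matrix has samples as columns of a dgCMatrix, so new samples can be added by extending the three compressed-column vectors without touching the existing entries.

```r
library(Matrix)

# Hypothetical helper: append the columns of `extra` to `m` by extending
# the x, i, and p vectors. Existing entries are left untouched.
append_columns <- function(m, extra) {
  stopifnot(nrow(m) == nrow(extra))
  n_existing <- m@p[length(m@p)]  # number of non-zero values already stored
  new(
    "dgCMatrix",
    x = c(m@x, extra@x),
    i = c(m@i, extra@i),
    p = c(m@p, n_existing + extra@p[-1]),  # shift the new column offsets
    Dim = c(nrow(m), ncol(m) + ncol(extra))
  )
}

# Two existing samples, then one new sample with two non-zero genes.
old <- sparseMatrix(i = c(1, 3), j = c(1, 2), x = c(5, 2), dims = c(3, 2))
added <- sparseMatrix(i = c(2, 3), j = c(1, 1), x = c(7, 4), dims = c(3, 1))
combined <- append_columns(old, added)

# The gene-indexed (transposed) copy has no such shortcut: each new sample
# inserts values in the middle of many gene columns, so it is rebuilt here.
t_combined <- t(combined)
```

In an HDF5 file, the same idea would require extendable datasets for the three vectors. Whether that fits the .tome layout is an open question in this issue.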

H5Ldelete error

Error in H5Ldelete(h5loc = loc$H5Identifier, name = name) : Specified link doesn't exist.

This error shows up constantly and prevents some functions from working when "overwrite = TRUE". Usually re-running the function after getting this error solves the problem, but I've run into a situation where I cannot update a variable at all once it is written without completely deleting the tome and restarting. If that comes up again, I'll update this issue with more specifics.

Installation issues

Hello!

I had some issues just installing the scrattch.io package using your instructions so that I could work with tome files. I used conda to install R v3.5.1 (that's the only way I could find to install "rhdf5" using biocLite as recommended in the README), but when I run

> devtools::install_github("AllenInstitute/scrattch.io")

I get an error:

Error in rbind(info, getNamespaceInfo(env, "S3methods")) : 
  number of columns of matrices must match (see arg 2)

This happens when I try to install with both the regular and the dev commands.

I found online a suggestion to just try reinstalling the package that came up in the error (https://forum.linuxconfig.org/t/error-in-rbind-info-getnamespaceinfo-env-somepackage-gnu-r/3649), but that doesn't seem to work. I just get a message about "S3methods" not being available for the R version I was using (again only using the older version because biocLite wasn't available for new R versions).

I was finally able to install this package by starting over from a fresh conda install of R v3.6.0 and then using conda to install rhdf5 and devtools.

It would be great if you could update your install instructions to include this information about using conda, or amend the current instruction to be more explicit about what R version you are using, alternatives to biocLite for newer R versions, or just ones that consistently work outside your ecosystem.

Also, you don't mention the need to install the "devtools" package in your install instructions, but running the scrattch.io install command without having installed that package leads to an error too.

Confusing function: save_sparse_matrix_h5()

The save_sparse_matrix_h5() function looks like it writes a tome-style matrix to an HDF5 file. It should probably be renamed write_tome_dgCMatrix() or should be dropped.

This is confusing, as most _h5() functions currently refer to the .h5 format specification generated by 10X Genomics.

The write_dgCMatrix_h5() function writes a matrix to an HDF5 file with names that match a 10X output file.

Reading a subset from the matrices is slow.

Via Jeremy:

If reading more than ~5% of the samples or genes using read_tome_[sample/gene]_data(), it's often faster to read the entire matrix with read_tome_dgCMatrix().

This may be due to having both open and close in read_tome_vector(). I think this can be optimized to either take one pass at read_tome_vector() or else keep the connection open for iteration over read_tome_vector().
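To make the "one pass" idea concrete, here is a minimal in-memory sketch. The vectors x, i, and p stand in for the datasets stored in the file, and read_columns_one_pass() is a hypothetical helper rather than an existing scrattch.io function. It computes every index needed for the requested columns up front and gathers the values in a single step, instead of looping over columns and opening and closing a connection each time.

```r
library(Matrix)

# Hypothetical helper: gather several columns from compressed-column
# vectors in one indexing step.
read_columns_one_pass <- function(x, i, p, n_rows, cols) {
  starts <- p[cols] + 1L        # first position of each requested column
  counts <- p[cols + 1L] - p[cols]  # number of stored values per column
  idx <- unlist(Map(function(s, n) s + seq_len(n) - 1L, starts, counts))
  sparseMatrix(
    i = i[idx] + 1L,            # stored row indices are zero-based
    p = c(0L, cumsum(counts)),  # offsets for the subset matrix
    x = x[idx],
    dims = c(n_rows, length(cols))
  )
}

# Toy matrix standing in for the stored data: 4 genes x 3 samples.
m <- sparseMatrix(
  i = c(1, 3, 2, 4, 4),
  j = c(1, 1, 2, 2, 3),
  x = c(5, 2, 7, 1, 9),
  dims = c(4, 3)
)

subset_cols <- c(3L, 1L)
picked <- read_columns_one_pass(m@x, m@i, m@p, nrow(m), subset_cols)
```

With an HDF5 backend, the equivalent would be a single indexed read per dataset with the file handle kept open, but how that maps onto read_tome_vector() would need to be checked against the package source.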

read_tome functions should check for existence

Some read_tome functions don't check if an object exists, causing an error to be thrown by H5Dread. This should be changed so that h5ls() is used to see if something exists, and if not, throw a simpler error or warning.
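A minimal sketch of the proposed check, with assumptions stated: rhdf5::h5ls() returns a data frame with group and name columns, and the helpers below (h5_listing_paths(), tome_object_exists(), read_if_exists()) are hypothetical names invented for this example. The object paths used in the demonstration are also illustrative only.

```r
# Build full object paths from an h5ls()-style listing.
# Top-level objects have group "/", so a doubled leading slash is collapsed.
h5_listing_paths <- function(listing) {
  paths <- paste(listing$group, listing$name, sep = "/")
  sub("^//", "/", paths)
}

# Hypothetical helper: TRUE if `target` exists in the HDF5 file `tome`.
tome_object_exists <- function(tome, target) {
  listing <- rhdf5::h5ls(tome)
  target <- paste0("/", sub("^/+", "", target))  # normalize the leading slash
  target %in% h5_listing_paths(listing)
}

# Hypothetical guarded reader: fail with a short, readable message
# before the low-level HDF5 read is attempted.
read_if_exists <- function(tome, target) {
  if (!tome_object_exists(tome, target)) {
    stop(target, " not found in ", tome, call. = FALSE)
  }
  rhdf5::h5read(tome, target)
}

# Demonstration of the path logic on a hand-built listing
# (no HDF5 file is needed for this part).
example_listing <- data.frame(
  group = c("/", "/data"),
  name = c("data", "exon"),
  stringsAsFactors = FALSE
)
example_paths <- h5_listing_paths(example_listing)
```

Listing a large file on every read has its own cost, so caching the listing or checking only the relevant group may be worth considering.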