kojix2 / ruby-htslib

HTSlib bindings for Ruby

Home Page: https://kojix2.github.io/ruby-htslib/

License: MIT License

Ruby 99.47% Dockerfile 0.53%
htslib bioinformatics ruby bam sam genomics bcf

ruby-htslib's Introduction

ruby-htslib

ruby-htslib provides Ruby bindings for HTSlib, a C library for high-throughput sequencing data formats. It lets you read and write file formats commonly used in genomics, such as SAM, BAM, VCF, and BCF, from Ruby.

🍎 Feel free to fork it!

Requirements

  • Ruby 3.1 or above.
  • HTSlib
    • Ubuntu : apt install libhts-dev
    • macOS : brew install htslib
    • Windows : mingw-w64-htslib is automatically fetched when installing the gem (RubyInstaller only).
    • Build from source code (see the Development section)

Installation

gem install htslib

If you have installed htslib with apt on Ubuntu or Homebrew on macOS, pkg-config will automatically detect the location of the shared library. If pkg-config does not find it, set PKG_CONFIG_PATH. Alternatively, you can specify the directory of the shared library by setting the environment variable HTSLIBDIR.

export HTSLIBDIR="/your/path/to/htslib" # Directory where libhts.so is located

ruby-htslib also works on Windows. If you use RubyInstaller, htslib will be prepared automatically.

Usage

HTS::Bam - SAM / BAM / CRAM - Sequence Alignment Map file

Reading fields

require 'htslib'

bam = HTS::Bam.open("test/fixtures/moo.bam")

bam.each do |r|
  pp name: r.qname,
     flag: r.flag,
     chrm: r.chrom,
     strt: r.pos + 1,
     mapq: r.mapq,
     cigr: r.cigar.to_s,
     mchr: r.mate_chrom,
     mpos: r.mpos + 1,
     isiz: r.isize,
     seqs: r.seq,
     qual: r.qual_string,
     MC:   r.aux("MC")
end

bam.close

With a block

HTS::Bam.open("test/fixtures/moo.bam") do |bam|
  bam.each do |r|
    puts r.to_s
  end
end

HTS::Bcf - VCF / BCF - Variant Call Format file

Reading fields

require 'htslib'

bcf = HTS::Bcf.open("test/fixtures/test.bcf")

bcf.each do |r|
  p chrom:  r.chrom,
    pos:    r.pos,
    id:     r.id,
    qual:   r.qual.round(2),
    ref:    r.ref,
    alt:    r.alt,
    filter: r.filter,
    info:   r.info.to_h,
    format: r.format.to_h
end

bcf.close

With a block

HTS::Bcf.open("test/fixtures/test.bcf") do |bcf|
  bcf.each do |r|
    puts r.to_s
  end
end

HTS::Faidx - FASTA / FASTQ - Nucleic acid sequence

fa = HTS::Faidx.open("test/fixtures/moo.fa")
fa.seq("chr1:1-10") # => CGCAACCCGA # 1-based
fa.close

HTS::Tabix - GFF / BED - TAB-delimited genome position file

tb = HTS::Tabix.open("test/fixtures/test.vcf.gz")
tb.query("poo", 2000, 3000) do |line|
  puts line.join("\t")
end
tb.close

Low-level API

HTS::LibHTS is the middle layer between the high-level Ruby API and the low-level C code. It exposes the native C functions through Ruby-FFI.

require 'htslib'

a = HTS::LibHTS.hts_open("a.bam", "r")
b = HTS::LibHTS.hts_get_format(a)
p b[:category]
p b[:format]

The low-level API makes it possible to perform detailed operations, such as calling CRAM-specific functions.

Macro functions

HTSlib is designed to improve performance with many macro functions. However, it is not possible to call C macro functions directly from Ruby-FFI. To overcome this, important macro functions have been re-implemented in Ruby, allowing them to be called in the same way as native functions.
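
As a rough illustration of what such a reimplementation can look like, the sketch below ports a few CIGAR macros from htslib's sam.h (bam_cigar_op, bam_cigar_oplen, bam_cigar_opchr, bam_cigar_gen) to plain Ruby. The module name MacroSketch is hypothetical and is not how ruby-htslib organizes its code; only the bit layout follows htslib.

```ruby
# Pure-Ruby stand-ins for a few CIGAR macros from htslib's sam.h.
# MacroSketch is a hypothetical module name, not part of ruby-htslib.
module MacroSketch
  BAM_CIGAR_STR   = "MIDNSHP=XB"
  BAM_CIGAR_SHIFT = 4
  BAM_CIGAR_MASK  = 0xf

  module_function

  # bam_cigar_op(c): the low 4 bits hold the operation code.
  def bam_cigar_op(c)
    c & BAM_CIGAR_MASK
  end

  # bam_cigar_oplen(c): the remaining high bits hold the operation length.
  def bam_cigar_oplen(c)
    c >> BAM_CIGAR_SHIFT
  end

  # bam_cigar_opchr(c): the operation as a character, "?" if out of range.
  def bam_cigar_opchr(c)
    BAM_CIGAR_STR[bam_cigar_op(c)] || "?"
  end

  # bam_cigar_gen(l, o): pack a length and an operation code into one integer.
  def bam_cigar_gen(l, o)
    (l << BAM_CIGAR_SHIFT) | o
  end
end

packed = MacroSketch.bam_cigar_gen(10, 0)
puts "#{MacroSketch.bam_cigar_oplen(packed)}#{MacroSketch.bam_cigar_opchr(packed)}" # prints "10M"
```

Because these are ordinary Ruby methods, they can be called next to the FFI-attached functions without the caller needing to know which ones were macros in C.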

Garbage Collection and Memory Freeing

A small number of commonly used structs, such as Bam1 and Bcf1, are implemented using FFI's ManagedStruct. This allows for automatic memory release when Ruby's garbage collection is triggered. On the other hand, other structs are implemented using FFI::Struct, and they will require manual memory release.

Need more speed?

Try Crystal. HTS.cr is implemented in Crystal language and provides an API compatible with ruby-htslib.

Documentation

Development

Compile from source code

GNU Autotools is required to compile htslib. To get started with development:

git clone --recursive https://github.com/kojix2/ruby-htslib
cd ruby-htslib
bundle install
bundle exec rake htslib:build
bundle exec rake test

Macro functions are reimplemented

HTSlib has many macro functions. These macro functions cannot be called from FFI and must be reimplemented in Ruby.

Use the latest Ruby

Use Ruby 3.1 or newer to take advantage of new features. This is possible because we have a small number of users.

Keep compatibility with Crystal language

Compatibility with Crystal language is important for Ruby-htslib development.

  • HTS.cr - HTSlib bindings for Crystal

Return value

The most challenging part is the return value. In Crystal, a method is expected to return a single type, whereas Ruby methods that return objects of several different classes are very common. For example, a Crystal method whose return value is one of six types (Int32, Int64, Float32, Float64, Nil, or String) is awkward to work with. Crystal does allow this through union types, but the code gets a little messy. In Ruby, this is routine and doesn't cause any problems.
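
To make the Ruby side of this concrete, here is a minimal sketch of a method that returns an Integer, a Float, a String, or nil depending on its input. The helper name convert_tag and its arguments are hypothetical and are not part of the ruby-htslib API; the type characters follow the SAM optional-field types.

```ruby
# Hypothetical helper: convert a raw tag value according to its SAM type
# character. The return class differs per branch, which is ordinary Ruby.
def convert_tag(type, raw)
  case type
  when "i" then Integer(raw) # integer tag
  when "f" then Float(raw)   # floating-point tag
  when "Z" then raw.to_s     # string tag
  end                        # any other type falls through and returns nil
end

p convert_tag("i", "42").class   # Integer
p convert_tag("f", "0.5").class  # Float
p convert_tag("Z", "60M").class  # String
p convert_tag("?", "x").class    # NilClass
```

In Crystal the equivalent method would have a union return type, and every caller would need to narrow that union before using the value.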

Memory management

Both Ruby and Crystal use garbage collection. However, the policy for releasing allocated C structures differs slightly. In Ruby-FFI, you can define a self.release method on an FFI::ManagedStruct subclass; it is called when the struct is garbage collected. So you don't have to worry about memory in high-level APIs such as Bam::Record or Bcf::Record. Crystal requires you to define a finalize method on each class, so you need to define it in Bam::Record and Bcf::Record.

Macro functions

In ruby-htslib, C macro functions are added to LibHTS. In Crystal, however, LibHTS is a lib, so methods cannot be added to it; they are added to LibHTS2 instead.

Naming convention

If you are not sure how to name a method, follow the Rust-htslib API. This is a very weak rule: if a more appropriate name is found later in Ruby, it will replace the old one.

Support for bitfields of structures

Since Ruby-FFI does not support structure bit fields, the following extension is used.

  • ffi-bitfield - Extension of Ruby-FFI to support bitfields.

Automatic validation

In the script directory, there are several tools to help implement ruby-htslib. Scripts using c2ffi can check the coverage of htslib functions in Ruby-htslib. They are useful when new versions of htslib are released.

  • c2ffi is a tool to create JSON format metadata from C header files.

Contributing

Ruby-htslib is a library under development, so even minor improvements like typo fixes are welcome! Please feel free to send us your pull requests.

Ownership and Commit Rights

Do you need commit rights to the ruby-htslib repository?
Do you want to get admin rights and take over the project?
If so, please feel free to contact @kojix2.

Why do you implement htslib in a language like Ruby, which is not widely used in bioinformatics?

One of the greatest joys of using a minor language like Ruby in bioinformatics is that nothing stops you from reinventing the wheel. Reinventing the wheel can be fun. But with languages like Python and R, where many bioinformatics masters work, there is no chance for beginners to create htslib bindings. Bioinformatics file formats, libraries, and tools are very complex, and I need to learn how to understand them. So I started to implement the HTSlib bindings myself to better understand how the pioneers of bioinformatics felt when establishing the file formats and how they created their tools. I hope one day we can work on bioinformatics using the Ruby and Crystal languages, not to replace other languages such as Python and R, but to add new power and value to this advancing field.

Links

Funding support

This work was supported partially by Ruby Association Grant 2020.

License

MIT License.

ruby-htslib's Issues

Make Method Chaining Fun

Currently, ruby-htslib methods are not that easy to method chain. However, in Ruby programming, method chaining is a lot of fun, kind of like %>% in the R language.

To make this work, you need to make sure that many methods return self.

With that change, Ruby should be able to detect htslib errors.

A typical code might look like this:

def magical_method(sesame)
  r = LibHTS.hts_magical_method(sesame)
  raise "Error" if r < 0
  self
end
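
A runnable version of the same pattern, as a sketch: FakeLib stands in for LibHTS, and its functions, the ChainableReader class, and the NativeError class are all made up for illustration. Each wrapper checks the native return code, raises on a negative value, and otherwise returns self so that calls can be chained.

```ruby
# FakeLib imitates C functions that return 0 on success and a negative
# value on failure. It is a stand-in, not the real LibHTS.
module FakeLib
  module_function

  def set_threads(n)
    n.positive? ? 0 : -1
  end

  def set_cache_size(bytes)
    bytes >= 0 ? 0 : -1
  end
end

class ChainableReader
  class NativeError < StandardError; end

  attr_reader :threads, :cache_size

  def set_threads(n)
    check(FakeLib.set_threads(n), "set_threads")
    @threads = n
    self # returning self is what makes chaining possible
  end

  def set_cache_size(bytes)
    check(FakeLib.set_cache_size(bytes), "set_cache_size")
    @cache_size = bytes
    self
  end

  private

  # Turn a negative C-style return code into a Ruby exception.
  def check(code, name)
    raise NativeError, "#{name} failed with code #{code}" if code < 0
  end
end

reader = ChainableReader.new.set_threads(4).set_cache_size(1024)
p reader.threads    # 4
p reader.cache_size # 1024
```

With this shape a failing call stops the chain with an exception instead of silently passing an error code to the next method.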

Use Ruby 3.1 or higher

ruby-htslib is a private project with no users other than myself.
So, I prefer to use new features of Ruby as much as possible, rather than keeping backwards compatibility.

I will set the gemspec to use Ruby 3.1 or later. But, I will leave branch v0.1.1 for those using Ruby 2.7 for some reason.

Let's continue developing ruby-htslib.

FFI::Struct or FFI::ManagedStruct?

FFI::Struct seems to need to be released explicitly after allocating memory. On the other hand, FFI::ManagedStruct seems to be able to register functions necessary for memory release.
Should I use ManagedStruct?

Preparing test files

In order to develop Ruby-htslib comfortably, you need good test files.
Preparing these files is just as important as writing the code.

In order to avoid copyright issues, these files should be generated by simulation.

Looping with Enumerable is not fast enough

Looping with Enumerable is not fast enough.

Currently, ruby-htslib executes the loop by including Enumerable in the classes that represent HTS file formats such as BAM and BCF.

This is a common practice in Ruby. It provides a consistent API. This is a good thing.
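
For readers unfamiliar with the pattern, here is a minimal sketch of it. RecordSource is a hypothetical class backed by an Array rather than a real HTS file; defining each and including Enumerable is all it takes to get map, select, each_slice, lazy, and the rest.

```ruby
# RecordSource is a hypothetical stand-in for a file class such as HTS::Bam.
# It only needs to define #each; Enumerable supplies the rest of the API.
class RecordSource
  include Enumerable

  def initialize(records)
    @records = records
  end

  def each
    # Return an Enumerator when no block is given, as core classes do.
    return to_enum(:each) unless block_given?

    @records.each { |record| yield record }
    self
  end
end

src = RecordSource.new(%w[chr1 chr2 chrX])
p src.map(&:upcase)                         # ["CHR1", "CHR2", "CHRX"]
p src.select { |name| name.end_with?("X") } # ["chrX"]
p src.lazy.map(&:upcase).first(1)           # ["CHR1"]
```

Every record passes through a Ruby block call in this design, which is where the per-record overhead comes from.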

However, it is not fast enough.
For example, comparing samtools and ruby-htslib, ruby-htslib is much slower, probably because ruby-htslib executes loops in Ruby's Enumerable, while Samtools is implemented in C. This fact makes it difficult to find an area where ruby-htslib can be useful.

Looping with Enumerable is slow.

This is a recurring problem in data processing in Ruby. (And all scripting languages, not just Ruby, have this problem)

There are several approaches that do not use Enumerable.

  • numo-narray executes compiled code and does not use enumerable.
  • Apache Arrow includes enumerable, but the methods added by enumerable are slow and there are other alternatives.
  • SQL executes the query on the SQL side, not the Ruby side. Can we do the same thing with ruby-htslib?

This is a very difficult problem to solve, but in order to find some hints, I tried to put my thoughts on the problem in writing with the help of the DeepL translator.

Function xyz not found in htslib

Hey there kojix2,

bam = HTS::Bam.open("foobar.bam")

I get some warnings I guess:

Function 'bam_mods_query_type' not found in [/home/Programs/Htslib/1.15.1/lib/libhts.so]
Function 'bam_mods_recorded' not found in [/home/Programs/Htslib/1.15.1/lib/libhts.so]

Not sure whether these are caused by the htslib gem, or by the upstream htslib.

If you can, and know whether this is an issue, and how to resolve it, perhaps the main
README could mention it. Or if it is a warning that could be suppressed. Anyway just
reporting here to let you know!

Suggestion: statistics overview? E. g. a common statistics overview for ruby-htslib

Hey there kojix2,

Today I tested ruby-htslib a bit following the examples in the main README e. g.

bam = HTS::Bam.open("foobar.bam")

bam.each do |r|
  pp name: r.qname,

And so forth. It shows a lot of information and outputs a LOT. E. g. a 67 MB .bam
file contains so much data.

Would it be possible to add some "central" statistics module into ruby-htslib,
and show in the main README how to use it?

This statistics module should show a summary of the content of the .bam
file such as how many entries it contains and so forth, any information
that is useful, but that can be displayed on the commandline, without
scrolling too much outside. E. g. a full "paper size" at max, not more.
A summary.

The rationale for this proposal would be to get a quick overview from
a .bam file without needing to go through .each - just to understand what
this .bam file is all about. And it would be useful if ruby-htslib could support
this directly as-is, if possible, so people can just use it rather than create
their own custom ad-hoc solution (they can do so anyway but I think it
would be more convenient if ruby-htslib can support this directly).

Anyway it is just a suggestion, please feel free to proceed in any way you
see fit.
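
As a rough idea of what such a summary could compute, here is a sketch that makes one pass over records and counts a few SAM flag bits. It uses stub records so that it runs without a BAM file; the Record struct and the summarize helper are hypothetical, and it assumes the flag is available as a plain Integer, which may differ from what ruby-htslib actually returns.

```ruby
# Stub records standing in for what iterating a BAM file yields.
# Assumption of this sketch: flag and mapq are plain Integers.
Record = Struct.new(:flag, :mapq)

# SAM flag bits, as defined in the SAM specification.
UNMAPPED  = 0x4
REVERSE   = 0x10
DUPLICATE = 0x400

# Hypothetical helper: one pass over any Enumerable of records.
def summarize(records)
  stats = { total: 0, unmapped: 0, reverse: 0, duplicate: 0 }
  mapq_sum = 0
  records.each do |r|
    stats[:total]     += 1
    stats[:unmapped]  += 1 if r.flag.anybits?(UNMAPPED)
    stats[:reverse]   += 1 if r.flag.anybits?(REVERSE)
    stats[:duplicate] += 1 if r.flag.anybits?(DUPLICATE)
    mapq_sum += r.mapq
  end
  # Avoid dividing by zero for an empty input.
  stats[:mean_mapq] = stats[:total].zero? ? nil : mapq_sum.fdiv(stats[:total])
  stats
end

records = [Record.new(0, 60), Record.new(16, 30), Record.new(4, 0), Record.new(1040, 60)]
summarize(records).each { |key, value| puts "#{key}: #{value}" }
```

Since the helper only calls each, it would work on any object that yields records with flag and mapq, and the output stays a handful of lines regardless of file size.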

On Faidx and Tabix, automatic release fails after explicitly closing a file

    class Tbx < FFI::ManagedStruct
      layout \
        :conf,           TbxConf,
        :idx,            HtsIdx.ptr,
        :dict,           :pointer

      def self.release(ptr)
        LibHTS.tbx_destroy(ptr) unless ptr.null?
      end
    end
    FaiFormatOptions = enum(:FAI_NONE, :FAI_FASTA, :FAI_FASTQ)

    class Faidx < FFI::Struct # FIXME: ManagedStruct
      layout :bgzf,      BGZF.ptr,
             :n,         :int,
             :m,         :int,
             :name,      :pointer,
             :hash,      :pointer,
             :format,    FaiFormatOptions

      def self.release(ptr)
        LibHTS.fai_destroy(ptr) unless ptr.null?
      end
    end

These ManagedStructs fire self.release as expected, but will double-free the memory if the file is already closed.
Why? I must have made some mistake...
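
One general way to avoid this kind of double free is to let the explicit close and the automatic release share a flag, so that whichever runs second becomes a no-op. The sketch below shows the idea in plain Ruby with ObjectSpace finalizers instead of FFI; GuardedHandle and its destroy callback are hypothetical, and this does not claim to be the cause of, or the fix for, the problem in ruby-htslib.

```ruby
# GuardedHandle is a hypothetical wrapper, not an FFI class. The destroy
# callback stands in for a native function such as a *_destroy call.
class GuardedHandle
  # Built in a class method so the callback does not capture the instance;
  # a finalizer that references its own object would keep it alive.
  def self.release_once(state, destroy)
    lambda do |_object_id = nil|
      next if state[:freed] # already released, do nothing

      state[:freed] = true
      destroy.call(state[:resource])
    end
  end

  def initialize(resource, destroy)
    @state = { resource: resource, freed: false }
    @release = self.class.release_once(@state, destroy)
    ObjectSpace.define_finalizer(self, @release)
  end

  # Explicit close goes through the same guarded callback as the finalizer.
  def close
    @release.call
    self
  end

  def closed?
    @state[:freed]
  end
end

calls = []
handle = GuardedHandle.new(:fake_pointer, ->(resource) { calls << resource })
handle.close
handle.close # second close is a no-op
p calls      # [:fake_pointer]
```

The same shared-flag idea applies when the automatic path is a self.release class method: the explicit close has to record somewhere that release can see that the memory is already gone.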

Duplicate naming; Bcf#filter

require "htslib"

a = HTS::Bcf.open "foo.vcf"
a.filter # Which does it mean? filter field in a vcf file? Or the equivalent of select?

seek rewind

HTS::Bam and HTS::Bcf need to behave like Ruby file objects. That is, they should have methods to move the file pointer, like seek and rewind.

It seems that those methods can be implemented by using functions such as bgzf_seek and cram_seek.
