Hamster - Efficient, Immutable, Thread-Safe Collection classes for Ruby

GitHub:   http://github.com/harukizaemon/hamster
RubyGems: https://rubygems.org/gems/hamster
email:    [email protected]
IRC:      ##haruki_zaemon on freenode

Introduction

Hamster started out as an implementation of Hash Array Mapped Tries (HAMT) for Ruby (see lamp.epfl.ch/papers/idealhashtrees.pdf) and has since expanded to include implementations of other Persistent Data Structures (see en.wikipedia.org/wiki/Persistent_data_structure) including Set, List, Stack, Queue, and Vector.

Hamster collections are immutable. Whenever you modify a Hamster collection, the original is preserved and a modified copy is returned. This makes them inherently thread-safe and sharable. (For an interesting perspective on why immutability itself is inherently a good thing, you might like to take a look at Matthias Felleisen’s Function Objects presentation: www.ccs.neu.edu/home/matthias/Presentations/ecoop2004.pdf)

Hamster collection classes remain space efficient by making use of some very well understood and very simple techniques that enable sharing between copies.
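To give a feel for what sharing between copies means, here is a minimal sketch of an immutable linked list in plain Ruby. The Cell class and its methods are hypothetical names made up for this illustration; this is not Hamster's actual implementation, just the general idea that a new copy can point at the old structure rather than duplicating it.

```ruby
# Hypothetical immutable linked-list cell (illustration only, not Hamster code).
class Cell
  attr_reader :head, :tail

  def initialize(head, tail = nil)
    @head = head
    @tail = tail
    freeze # no instance variable can be reassigned after construction
  end

  # Returns a NEW list whose tail is the existing list; nothing is copied.
  def cons(item)
    Cell.new(item, self)
  end

  def to_a
    items = []
    cell = self
    while cell
      items << cell.head
      cell = cell.tail
    end
    items
  end
end

original = Cell.new(3).cons(2).cons(1)
copy     = original.cons(0)

original.to_a               # => [1, 2, 3]
copy.to_a                   # => [0, 1, 2, 3]
copy.tail.equal?(original)  # => true, the copy shares all three original cells
```

Adding an item allocates exactly one new cell no matter how long the list is, which is why the copy is cheap.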

Hamster collections are almost always closed under a given operation. That is, whereas many of Ruby’s built-in collection methods (Enumerable#map, for example) return arrays regardless of the receiver’s class, Hamster collections will return an instance of the same class wherever possible.
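The standard library behaviour being contrasted here is easy to see with Ruby's own Set. The snippet below uses only the stdlib and says nothing about Hamster's API; it simply shows the array coming back and the extra conversion step you need to get a set again.

```ruby
require 'set'

numbers = Set[1, 2, 3]

doubled = numbers.map { |n| n * 2 }
doubled.class       # => Array, the result is no longer a Set

# Getting back to a Set takes an explicit conversion.
doubled_set = numbers.map { |n| n * 2 }.to_set
doubled_set.class   # => Set
```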

And lastly, Hamster lists are lazy – where Ruby’s language constructs permit – making it possible to, among other things, process “infinitely large” lists. (Note: Ruby 1.9 supports a form of laziness using Enumerator. However, enumerators are implemented using Fibers, which unfortunately can’t be shared across threads.)

Installation

Hamster is distributed as a gem via rubygems (rubygems.org/gems/hamster) or as source via GitHub (github.com/harukizaemon/hamster).

Installation via the gem is easy:

> gem install hamster

Once installed, all that remains is to make the collection classes available in your code:

require 'hamster'

Installation via bundler is even easier; add the following to your Gemfile:

gem "hamster"

If you prefer, you can instead require individual classes as necessary:

require 'hamster/list'
require 'hamster/stack'
require 'hamster/queue'
require 'hamster/hash'
require 'hamster/set'
require 'hamster/vector'

Examples

Most Hamster classes support an API that resembles their standard library counterpart, with the caveat that any modification returns a new instance. What follows are some simple examples to help illustrate this point.

Hash

Constructing a Hamster Hash is almost as simple as a regular one:

simon = Hamster.hash(:name => "Simon", :gender => :male)  # => {:name => "Simon", :gender => :male}

Accessing the contents will be familiar to you:

simon[:name]          # => "Simon"
simon.get(:gender)    # => :male

Updating the contents is a little different from what you are used to:

james = simon.put(:name, "James")   # => {:name => "James", :gender => :male}
simon                               # => {:name => "Simon", :gender => :male}
james[:name]                        # => "James"
simon[:name]                        # => "Simon"

As you can see, updating the hash returned a copy leaving the original intact. Similarly, deleting a key returns yet another copy:

male = simon.delete(:name)    # => {:gender => :male}
simon                         # => {:name => "Simon", :gender => :male}
male.has_key?(:name)          # => false
simon.has_key?(:name)         # => true

Hamster’s hash doesn’t provide an assignment (#[]=) method. The reason for this is simple yet irritating: Ruby assignment methods always return the assigned value, no matter what the method itself returns. For example:

counters = Hamster.hash(:odds => 0, :evens => 0)
counters[:odds] += 1          # => 1

Because of this, the returned copy would be lost, making the construct useless. Instead, #put accepts a block in place of an explicit value, so we can still do something similar:

counters.put(:odds) { |value| value + 1 }    # => {:odds => 1, :evens => 0}

or more succinctly:

counters.put(:odds, &:next)   # => {:odds => 1, :evens => 0}
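The assignment problem can be reproduced without Hamster. The sketch below uses a hypothetical TinyImmutableHash class (invented for this illustration, not part of Hamster) that does define #[]= and returns a new copy from it, just to show that Ruby throws that copy away.

```ruby
# Hypothetical immutable wrapper around Hash (illustration only).
class TinyImmutableHash
  def initialize(data = {})
    @data = data.dup.freeze
  end

  def [](key)
    @data[key]
  end

  # Returns a new instance, but Ruby ignores the return value of
  # assignment methods: the expression evaluates to the assigned value.
  def []=(key, value)
    TinyImmutableHash.new(@data.merge(key => value))
  end

  # Block form: the block receives the current value and returns the new one.
  def put(key, value = nil)
    value = yield(self[key]) if block_given?
    TinyImmutableHash.new(@data.merge(key => value))
  end

  def to_h
    @data
  end
end

counters = TinyImmutableHash.new({ :odds => 0, :evens => 0 })

result = (counters[:odds] += 1)
result           # => 1, an Integer; the new copy built by #[]= is lost
counters[:odds]  # => 0, the original is untouched

updated  = counters.put(:odds) { |value| value + 1 }
succinct = counters.put(:odds, &:next)
updated[:odds]   # => 1
```

With #put the new copy is the value of the expression, so it can be assigned or chained like any other result.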

Set

List

Lists have a head – the value of the item at the head of the list – and a tail – containing the remaining items. For example:

list = Hamster.list(1, 2, 3)

list.head    # => 1
list.tail    # => Hamster.list(2, 3)

To add to a list, you use #cons:

original = Hamster.list(1, 2, 3)
copy = original.cons(0)           # => Hamster.list(0, 1, 2, 3)

Notice how modifying a list actually returns a new list. That’s because Hamster lists are immutable. Thankfully, just like Hamster Set and Hash, they’re also very efficient at making copies!

Lists are, where possible, lazy. That is, they try to defer processing items until absolutely necessary. For example, given a crude function to detect prime numbers:

def prime?(n)
  2.upto(Math.sqrt(n).round) { |i| return false if n % i == 0 }
  true
end

The following code will only call prime? as many times as necessary to generate the first 3 prime numbers between 10000 and 1000000:

Hamster.interval(10000, 1000000).filter { |i| prime?(i) }.take(3)    # => 0.0009s

Compare that to the conventional equivalent which needs to calculate all possible values in the range before taking the first 3:

(10000..1000000).select { |i| prime?(i) }.take(3)   # => 10s
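If you want to see the difference in work done rather than in seconds, you can count the calls to prime?. The sketch below does not use Hamster at all; it assumes Ruby 2.0 or later and uses the standard library's Enumerator::Lazy as a stand-in for a lazy list, over a smaller range so the eager version finishes quickly.

```ruby
def prime?(n)
  2.upto(Math.sqrt(n).round) { |i| return false if n % i == 0 }
  true
end

calls = 0

# Lazy: candidates are tested one at a time until three primes are found.
lazy_result = (10_000..20_000).lazy.select { |i| calls += 1; prime?(i) }.first(3)
lazy_calls = calls

calls = 0

# Eager: every candidate in the range is tested before take(3) runs.
eager_result = (10_000..20_000).select { |i| calls += 1; prime?(i) }.take(3)
eager_calls = calls

lazy_result   # => [10007, 10009, 10037]
lazy_calls    # => 38
eager_calls   # => 10001
```

Both versions return the same three primes; the lazy one simply stops testing at 10037.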

Besides Hamster.list there are other ways to construct lists:

Hamster.interval(from, to) (aliased as .range) creates a lazy list equivalent to a list containing all the values between from and to without actually creating a list that big.

Hamster.stream { ... } allows you to create infinite lists. Each time a new value is required, the supplied block is called. For example, to generate a list of integers you could do:

count = 0
integers = Hamster.stream { count += 1 }

Hamster.repeat(x) creates an infinite list with x the value for every element.

Hamster.replicate(n, x) creates a list of size n with x the value for every element.

Hamster.iterate(x) { |x| ... } creates an infinite list where the first item is calculated by applying the block to the initial argument, the second item by applying the block to the previous result, and so on. For example, a simpler way to generate a list of integers would be:

integers = Hamster.iterate(1) { |i| i + 1 }

or even more succinctly:

integers = Hamster.iterate(1, &:next)
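For the curious, an infinite list like this can be modelled as a cell whose tail is a block that only runs when someone asks for it. The following is a rough sketch in plain Ruby, not Hamster's source: LazyCell and lazy_iterate are hypothetical names, and I am assuming the list starts with the initial value itself, which matches the integers example above.

```ruby
# Hypothetical lazy cons cell: the head is known up front, the tail is
# computed on first access and then remembered.
class LazyCell
  attr_reader :head

  def initialize(head, &tail_block)
    @head = head
    @tail_block = tail_block
  end

  def tail
    # Simple memoisation; a real implementation would need to make this
    # thread-safe, which this sketch does not attempt.
    @tail ||= @tail_block.call
  end

  def take(n)
    items = []
    cell = self
    n.times do
      items << cell.head
      cell = cell.tail
    end
    items
  end
end

# Builds an infinite list: value, block.(value), block.(block.(value)), ...
def lazy_iterate(value, &block)
  LazyCell.new(value) { lazy_iterate(block.call(value), &block) }
end

integers = lazy_iterate(1, &:next)
integers.take(5)   # => [1, 2, 3, 4, 5]

powers = lazy_iterate(1) { |i| i * 2 }
powers.take(4)     # => [1, 2, 4, 8]
```

Only the cells that are actually visited ever get built, which is what makes the list safe to describe as infinite.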

You also get Enumerable#to_list so you can slowly transition from built-in collection classes to Hamster.

And finally, you get IO#to_list allowing you to lazily process huge files. For example, imagine the following code to process a 100MB file:

File.open("my_100_mb_file.txt") do |io|
  lines = []
  io.each_line do |line|
    break if lines.size == 10
    lines << line.chomp.downcase.reverse
  end
end

How many times/how long did you read the code before it became apparent what the code actually did? Now compare that to the following:

File.open("my_100_mb_file.txt") do |io|
  io.map(&:chomp).map(&:downcase).map(&:reverse).take(10)
end

Unfortunately, though the second example reads nicely, it takes around 13 seconds to run (compared with 0.033 seconds for the first) even though we’re only interested in the first 10 lines! However, using a little #to_list magic, we can get the running time back down to 0.033 seconds!

File.open("my_100_mb_file.txt") do |io|
  io.to_list.map(&:chomp).map(&:downcase).map(&:reverse).take(10)
end

How is this even possible? It’s possible because IO#to_list creates a lazy list whereby each line is only ever read and processed as needed, in effect converting it to the first example without all the syntactic, imperative, noise.
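You can check the “only read as needed” claim for yourself without a 100MB file. The sketch below is not Hamster: it assumes Ruby 2.0 or later and uses the standard library's Enumerator::Lazy in place of #to_list, with a StringIO standing in for the file so the example is self-contained.

```ruby
require 'stringio'

# 1000 lines of fake input: "Line 0\n", "Line 1\n", ...
io = StringIO.new((0...1000).map { |i| "Line #{i}\n" }.join)

result = io.each_line.lazy
           .map(&:chomp)
           .map(&:downcase)
           .map(&:reverse)
           .first(10)

result.first   # => "0 enil"
result.size    # => 10
io.lineno      # => 10, only the first ten lines were ever read
```

Each line flows through all three maps before the next one is read, so the pipeline stops as soon as ten results exist.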

Stack

Queue

Vector

Disclaimer

Hamster started out as a spike to prove a point and has since morphed into something I actually use. My primary concern has been to round out the functionality with good test coverage and clean, readable code.

Performance is pretty good – especially with lazy lists – but there are some things which may blow the stack due to a lack of Tail-Call-Optimisation in Ruby.

Documentation is sparse but I’ve tried as best I can to write specs that read as documentation. I’ve also tried to alias methods as their Enumerable equivalents where possible to ease code migration.

Reporting bugs

Please report all bugs on the GitHub issue tracker located at: github.com/harukizaemon/hamster/issues

But I still don’t understand why I should care?

As mentioned earlier, persistent data structures perform a copy whenever they are modified, meaning there is never any chance that two threads could be modifying the same instance at any one time. And, because they are very efficient copies, you don’t need to worry about using up gobs of memory in the process.

Even if threading isn’t a concern, because they’re immutable, you can pass them around between objects, methods, and functions in the same thread and never worry about data corruption; no more defensive calls to #dup!
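A rough approximation of that guarantee using only built-in Ruby is a frozen Hash combined with non-mutating methods such as Hash#merge. This is a sketch of the idea rather than Hamster itself: with_port is a hypothetical helper, and unlike Hamster, Hash#merge copies every entry instead of sharing structure.

```ruby
settings = { :host => "example.org" }.freeze

# Hypothetical helper: returns a modified copy instead of mutating its argument.
def with_port(options, port)
  options.merge(:port => port)
end

updated = with_port(settings, 8080)

settings.key?(:port)   # => false, the caller's hash is untouched
updated[:port]         # => 8080

# Any attempt to mutate the frozen original raises
# (FrozenError, a RuntimeError subclass, on Ruby 2.5+).
mutation_blocked = begin
  settings[:port] = 80
  false
rescue RuntimeError
  true
end
mutation_blocked       # => true
```

Since nothing can change settings behind your back, there is no reason to #dup it before handing it to other code.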

OK, that sounds mildly interesting. What’s the downside–there’s always a downside?

There’s a potential performance hit when compared with MRI’s built-in, native, hand-crafted C-code implementation of Hash. For example:

hash = Hamster.hash
(1..10000).each { |i| hash = hash.put(i, i) }   # => 0.05s
(1..10000).each { |i| hash.get(i) }             # => 0.008s

versus

hash = {}
(1..10000).each { |i| hash[i] = i }   # => 0.004s
(1..10000).each { |i| hash[i] }       # => 0.001s

That seems pretty bad?

Well, yes and no. The previous comparison wasn’t really fair. Sure, if all you want to do is replace your existing uses of Hash in single-threaded environments then don’t even bother. However, if you need something that can be used efficiently in concurrent environments where multiple threads are accessing (reading AND writing) the contents, things get much better.

Do you have a better example?

A more realistic comparison might look like:

hash = Hamster.hash
(1..10000).each { |i| hash = hash.put(i, i) }   # => 0.05s
(1..10000).each { |i| hash.get(i) }             # => 0.008s

versus

hash = {}
(1..10000).each { |i|
  hash = hash.dup
  hash[i] = i
}                                 # => 19.8s

(1..10000).each { |i| hash[i] }   # => 0.001s

What’s even better – or worse depending on your perspective – is that after all that, the native Hash version still isn’t thread-safe and still requires some synchronisation around it, slowing it down even further.

The Hamster version, on the other hand, was unchanged from the original whilst remaining inherently thread-safe, and roughly 400 times faster (0.05s versus 19.8s).
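The cost of the copy-per-write approach with a native Hash is easy to measure for yourself using the standard library's Benchmark module. The sketch below covers only the plain-Hash side and uses a smaller n so it runs quickly; timings will vary by machine and Ruby version, so none are quoted here.

```ruby
require 'benchmark'

n = 2_000

# Mutating a single hash in place.
in_place = {}
in_place_time = Benchmark.realtime do
  (1..n).each { |i| in_place[i] = i }
end

# Copying before every write: each dup copies all existing entries,
# so the total work grows roughly with the square of n.
copied = {}
copied_time = Benchmark.realtime do
  (1..n).each do |i|
    copied = copied.dup
    copied[i] = i
  end
end

puts format("in place: %.4fs, dup per write: %.4fs", in_place_time, copied_time)
```

Both loops end up with the same contents; the only difference is how much copying happened along the way.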

Sure, but as you say, you still need synchronisation so why bother with the copying?

Well, I could show you a synchronised version, but I’d have to re-write/wrap most Hash methods to make them generic, or at the very least write some application-specific code that synchronised using a Mutex and … well … it’s hard, I always make mistakes, and I always end up with weird edge cases and race conditions, so I’ll leave that as an exercise for you :)

And don’t forget that even if threading isn’t a concern for you, the safety provided by immutability alone is worth it, not to mention the lazy implementations.
