
gen_stage's Introduction

GenStage

GenStage is a specification for exchanging events between producers and consumers.

This project currently provides the following functionality:

  • GenStage (docs) - a behaviour for implementing producer and consumer stages

  • ConsumerSupervisor (docs) - a supervisor designed for consuming events from GenStage and starting a child process per event
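For a quick taste, here is a minimal producer/consumer sketch (illustrative only; the module names are made up, and the examples below are the complete, annotated versions):

defmodule Counter do
  use GenStage

  def init(initial), do: {:producer, initial}

  # Emit exactly `demand` integers, counting up from the current state.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  def init(:ok), do: {:consumer, :ok}

  # Consumers receive events in batches and never emit events themselves.
  def handle_events(events, _from, state) do
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end

{:ok, producer} = GenStage.start_link(Counter, 0)
{:ok, consumer} = GenStage.start_link(Printer, :ok)
GenStage.sync_subscribe(consumer, to: producer)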

You may also be interested in two other projects built on top of GenStage:

  • Flow - computational parallel flows on top of GenStage

  • Broadway - concurrent and multi-stage data ingestion and data processing pipelines

Examples

Examples for using GenStage and ConsumerSupervisor can be found in the examples directory:

  • ProducerConsumer - a simple example of setting up a pipeline of A -> B -> C stages and having events flowing through it

  • ConsumerSupervisor - an example of how to use one or more ConsumerSupervisors as consumers to a producer that works as a counter

  • GenEvent - an example of how to use GenStage to implement an alternative to GenEvent that leverages concurrency and provides more flexibility regarding buffer size and back-pressure

  • RateLimiter - an example of performing rate limiting in a GenStage pipeline

Installation

GenStage requires Elixir v1.5. Just add :gen_stage to your list of dependencies in mix.exs:

def deps do
  [{:gen_stage, "~> 1.0"}]
end

License

Same as Elixir under Apache License 2.0. Check NOTICE and LICENSE for more information.

gen_stage's People

Contributors

aaronrenner, adrianomitre, amatalai, axelson, davidsulc, ericentin, fishcakez, foo42, gavinjoyce, gustf, hi-rustin, jbampton, jfis, josevalim, kianmeng, lmarlow, lostkobrakai, maennchen, myronmarston, paulswartz, pcmarks, richmorin, ronanh, sanrodari, savonarola, seivan, silviurosu, whatyouhide, wojtekmach, zeeshanlakhani


gen_stage's Issues

Use Process.send/3 with `:noconnect` when sending consumer/producer messages in GenStage process

A monitor is active on both sides of a subscription (and for a subscribe message from the consumer, a disconnect means the :DOWN arrives immediately, without a subscription ever being set up), so it is unhelpful to reconnect when sending a message: the counterparty is guaranteed to receive a :DOWN before the message arrives and will ignore the message. Attempting a reconnect on send also blocks the caller. In other words, we would block in order to send a message that is never going to be handled.

We can avoid this by always using Process.send(pid, msg, [:noconnect]) when inside the GenStage process (when a monitor is active on pid). Since Process.monitor/1 is always called first, a connection attempt will always be made at the start of a subscription (on both sides). This means a nodedown will not slow down a GenStage except when setting up a subscription.

Even when a subscription is half open (e.g. consumer sent subscription, disconnect occurs, producer receives subscription, monitors and acks and consumer receives :DOWN before :ack) a disconnect will close the open half, so :noconnect is still valid.
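A minimal sketch of the suggested change (a hypothetical private helper, not the actual gen_stage internals):

# Inside the GenStage process a monitor on `pid` is already active, so a
# :DOWN message will clean up the subscription; skip the auto-connect.
defp send_noconnect(pid, msg) do
  Process.send(pid, msg, [:noconnect])
end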

Docs around demand modes are inconsistent

The docs for GenStage.ask/3 say:

This is an asynchronous request typically used by consumers in :manual demand mode.

But according to GenStage.demand/2, the only valid demand modes are :forward and :accumulate:

@doc """
Sets the demand mode for a producer.

When `:forward`, the demand is always forwarded to the `handle_demand`
callback. When `:accumulate`, demand is accumulated until its mode is
set to `:forward`. This is useful as a synchronization mechanism, where
the demand is accumulated until all consumers are subscribed. Defaults
to `:forward`.

This command is asynchronous.
"""
@spec demand(stage, :forward | :accumulate) :: :ok
def demand(stage, mode) when mode in [:forward, :accumulate] do
  cast(stage, {:"$demand", mode})
end

This is quite confusing -- it's not clear what is meant by :manual demand mode given that demand/2 says only :forward and :accumulate are supported demand modes.
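For context, :manual in GenStage.ask/3 refers to the per-subscription demand mode a consumer chooses by returning {:manual, state} from handle_subscribe/4, which is distinct from the producer-wide mode set by demand/2. A sketch of a manual-mode consumer:

defmodule ManualConsumer do
  use GenStage

  def init(:ok), do: {:consumer, :ok}

  # Take over demand management for this subscription.
  def handle_subscribe(:producer, _opts, from, state) do
    GenStage.ask(from, 10)
    {:manual, state}
  end

  # Process events, then explicitly ask for more.
  def handle_events(events, from, state) do
    GenStage.ask(from, length(events))
    {:noreply, [], state}
  end
end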

Trouble with producer that initially has no events

I've been playing around with gen_stage today to prototype some stuff, and I'm having trouble with an example that feels like it should be really simple. I'm trying to build a really simple worker pool: the idea is that I start up a single producer and N consumers. The consumers are workers that demand jobs (0-arity functions) and then run them. An enqueue_jobs function is provided so you can enqueue work. Here's what I've got:

# worker_pool.exs
alias Experimental.GenStage
require Logger

defmodule JobWorkerPool do
  def start_link(worker_count, subscribe_options) do
    {:ok, producer_pid} = GenStage.start_link(__MODULE__.JobProducer, :ok)
    subscribe_options = Keyword.put(subscribe_options, :to, producer_pid)

    Enum.each(1..worker_count, fn _ ->
      {:ok, consumer_pid} = GenStage.start_link(__MODULE__.Worker, :ok)
      GenStage.sync_subscribe(consumer_pid, subscribe_options)
    end)

    {:ok, producer_pid}
  end

  def enqueue_jobs(pid, jobs) do
    :ok = GenStage.call(pid, {:enqueue_jobs, jobs})
  end

  defmodule JobProducer do
    use GenStage

    def init(:ok), do: {:producer, :queue.new()}

    def handle_call({:enqueue_jobs, jobs}, _from, queue) do
      Logger.info "Enqueued #{length jobs} jobs"
      queue = Enum.reduce(jobs, queue, &:queue.in(&1, &2))
      {:reply, :ok, [], queue}
    end

    def handle_demand(demand, queue) do
      Logger.info "Handling #{demand} demand with a queue of size #{:queue.len(queue)}"
      {reversed_jobs, queue} = take_jobs(queue, demand, [])

      if System.get_env("FILL_IN_FAKE_JOBS") && Enum.empty?(reversed_jobs) do
        fake_jobs = Enum.map(1..demand, fn _ -> :fake_job end)
        {:noreply, fake_jobs, queue}
      else
        {:noreply, Enum.reverse(reversed_jobs), queue}
      end
    end

    defp take_jobs(queue, 0, jobs), do: {jobs, queue}
    defp take_jobs(queue, n, jobs) when n > 0 do
      case :queue.out(queue) do
        {:empty, ^queue} -> {jobs, queue}
        {{:value, job}, queue} -> take_jobs(queue, n - 1, [job | jobs])
      end
    end
  end

  defmodule Worker do
    use GenStage

    def init(:ok), do: {:consumer, nil}

    if System.get_env("FILL_IN_FAKE_JOBS") do
      def handle_events([:fake_job | _], _from, nil) do
        Process.sleep(50)
        {:noreply, [], nil}
      end
    end

    def handle_events(jobs, _from, nil) do
      Logger.info "Handling #{length jobs} job events"
      Enum.each(jobs, &(&1.()))
      {:noreply, [], nil}
    end
  end
end

{:ok, pid} = JobWorkerPool.start_link(4, max_demand: 10)

jobs = Enum.map(1..100, fn i ->
  fn -> IO.puts "performed job #{i}" end
end)

JobWorkerPool.enqueue_jobs(pid, jobs)
Process.sleep(:infinity)

(Ignore the System.get_env("FILL_IN_FAKE_JOBS") bit for the moment -- it's a workaround that I explain below.)

When I run this with mix run worker_pool.exs, you can see that the workers send demand before any jobs have been enqueued (as you would expect), and then they apparently don't ever ask again, so things just sit there and nothing happens:

$ mix run worker_pool.exs
23:21:26.015 [info] Handling 10 demand with a queue of size 0
23:21:26.015 [info] Handling 10 demand with a queue of size 0
23:21:26.015 [info] Handling 10 demand with a queue of size 0
23:21:26.015 [info] Handling 10 demand with a queue of size 0
23:21:26.015 [info] Enqueued 100 jobs

However, if I fake it out and provide fake events just to satisfy the demand the consumers asked for (implemented conditionally using the FILL_IN_FAKE_JOBS env var), it works:

$ FILL_IN_FAKE_JOBS=1 mix run worker_pool.exs
23:22:53.963 [info] Handling 10 demand with a queue of size 0
23:22:53.963 [info] Handling 10 demand with a queue of size 0
23:22:53.963 [info] Handling 10 demand with a queue of size 0
23:22:53.963 [info] Handling 10 demand with a queue of size 0
23:22:53.963 [info] Enqueued 100 jobs
23:22:54.018 [info] Handling 5 demand with a queue of size 100
23:22:54.019 [info] Handling 5 demand with a queue of size 95
23:22:54.019 [info] Handling 5 demand with a queue of size 90
23:22:54.019 [info] Handling 5 demand with a queue of size 85
performed job 1
performed job 6
performed job 11
23:22:54.070 [info] Handling 5 job events
performed job 16
23:22:54.070 [info] Handling 5 job events
23:22:54.070 [info] Handling 5 job events
23:22:54.070 [info] Handling 5 job events
23:22:54.070 [info] Handling 5 demand with a queue of size 80
performed job 2
performed job 7
performed job 12
performed job 17
performed job 3
# ...

So, a few questions/comments:

  • When the producer returns an empty list of events from handle_demand, why do consumers stop sending demand? Apparently they give up and never ask again, which seems like a bug.
  • I read through the docs a couple of times to see if I was missing something and couldn't find anything suggesting that consumers unsubscribe (or halt, or whatever) if they don't get the asked-for events, so I found this behavior completely surprising. If this is by design, it'd be nice if the docs explained enough of the rationale behind it that users who try this have a way to understand what's going on.
  • Interestingly enough, the success of the fake job list hack depends on the size of the list. If I send back a list of 5-10 fake jobs (half the max_demand or more), the hack works. But if I return a list of 1-4 fake jobs, it doesn't do anything, and I get the same behavior of the consumers no longer requesting work. This suggests that this issue is related to the min_demand/max_demand options, but I haven't been able to improve things by experimenting with those settings.
  • Is there a better way to build a worker pool on top of GenStage than how I'm trying to do it here?

I'm hoping we can figure out a solution to these problems, because I'm quite keen to use GenStage in production soon :).
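One pattern worth noting here is buffering unmet demand inside the producer, as described in GenStage's documentation; below is a sketch adapted to the producer above (names are illustrative, and this is not presented as a confirmed fix for the issue):

defmodule BufferedJobProducer do
  use GenStage

  def init(:ok), do: {:producer, {:queue.new(), 0}}

  # Remember any demand we cannot satisfy and dispatch it later.
  def handle_demand(incoming, {queue, pending}) do
    dispatch(queue, pending + incoming, [])
  end

  # When jobs arrive, immediately satisfy whatever demand is pending.
  def handle_call({:enqueue_jobs, jobs}, _from, {queue, pending}) do
    queue = Enum.reduce(jobs, queue, &:queue.in(&1, &2))
    {:noreply, events, state} = dispatch(queue, pending, [])
    {:reply, :ok, events, state}
  end

  defp dispatch(queue, 0, events), do: {:noreply, Enum.reverse(events), {queue, 0}}

  defp dispatch(queue, demand, events) do
    case :queue.out(queue) do
      {{:value, job}, queue} -> dispatch(queue, demand - 1, [job | events])
      {:empty, queue} -> {:noreply, Enum.reverse(events), {queue, demand}}
    end
  end
end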

Polling option for producer stage

I am currently using GenStage to retrieve data from a Redis queue and then process it in my Elixir program. When creating the producer stage, I got a little confused about how to implement it: the producer's handle_demand function will be called just after startup by the consumers, but will not be called again after all events are consumed. I can add events to my queue in Redis, but the producer will not check the queue again.

This problem can be solved easily by calling the producer's handle_demand periodically, for example:

defmodule Producer do
  use GenStage
  @polling_interval 1000

  def init(state) do
    Process.send_after(self(), :poll, @polling_interval)
    {:producer, state}
  end

  def handle_info(:poll, state) do
    Process.send_after(self(), :poll, @polling_interval)
    handle_demand(0, state)
  end

  # `state` accumulates demand that previous calls could not satisfy.
  def handle_demand(demand, state) when demand >= 0 do
    events = RedisQueue.take(demand + state)
    count = Enum.count(events)
    {:noreply, events, demand + state - count}
  end
end

I think this use case must be pretty common when consuming tasks from an external queue. I suggest adding a new polling option when starting the producer:

GenStage.start_link(Producer, state, polling_interval: 1000)

I am not sure if I am missing something, if this option is out of scope, or if this would actually be a useful option to have. Please let me know.
Thanks !

Proposal for DynamicSupervisor

A DynamicSupervisor is a supervisor designed to supervise
and manage many children dynamically.

It is a spawn-off of the :simple_one_for_one strategy
found in the regular Supervisor.

We have a couple goals by introducing a dynamic supervisor:

  • Simplify the API and usage of both Supervisor modules. Most
    of the documentation in the Supervisor module is full of
    conditionals: "if the supervisor type is :simple_one_for_one,
    it will behave as X, otherwise as Y." The differences in
    behaviour with little surrounding context make supervisors
    hard to learn, understand and use;
  • Provide out-of-the-box supervisor sharding for cases where
    the supervisor itself may be a scalability concern;
  • Provide a built-in registry to avoid developers unnecessarily
    using dependencies like gproc or incorrect dependencies like
    global;
  • Implement the GenStage specification
    so dynamic supervisors can subscribe to producers and spawn
    children dynamically based on demand;

The first bullet is about implementing a DynamicSupervisor
module with the same API and functionality as a :simple_one_for_one
Supervisor. That's relatively straightforward to do, and therefore
we will focus on the other functionality for the rest of this proposal.

Shards

The DynamicSupervisor is going to provide automatic sharding. Imagine
the following start_link call:

DynamicSupervisor.start_link(MySupervisor, args, [])

it will start a single supervisor with the specification defined by
MySupervisor. By passing the :shards option, the DynamicSupervisor
will start N supervisors (let's call them shards) under the parent
supervisor with the specification defined by MySupervisor:

DynamicSupervisor.start_link(MySupervisor, args, [shards: 3])

In other words, a regular dynamic supervisor will look like:

      /-- child1
     /--- child2
[sup] --- ...
     \--- childy
      \-- childz

With shards, we have:

                           /-- child1
                          /--- child2
          /--------[shard] --- ...
         /                \--- childy
        /                  \-- childz
       /
      /                    /-- child1
     /                    /--- child2
[sup]--------------[shard] --- ...
     \                    \--- childy
      \                    \-- childz
       \
        \                  /-- child1
         \                /--- child2
          \--------[shard] --- ...
                          \--- childy
                           \-- childz

Those N shards will write to the same ETS table. The supervisor
will redirect commands like start_child to one of the shards
(probably by using a consistent hashing algorithm) while commands
like which_children/1 and count_children/1 will read from the
ETS table and return correct results.

The :shards option requires a positive integer or :schedulers
as its value. If :schedulers is given, the number of shards started
will be the same as the number of schedulers online.
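A minimal sketch of how a shard might be picked for start_child (assuming plain hashing via :erlang.phash2/2; the proposal itself only says "probably by using a consistent hashing algorithm"):

# Route a start_child request to a shard by hashing the child id.
# :erlang.phash2/2 returns an integer in 0..shard_count-1.
defp shard_for(child_id, shard_count) do
  :erlang.phash2(child_id, shard_count)
end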

Registry

The supervisor will also work as a registry by starting it with
the registry option:

DynamicSupervisor.start_link(MySupervisor, args, [registry: MySupervisor, name: MySupervisor])

Note: although not strictly required, we recommend the registry
name to be the same name as the supervisor name.

Besides the start_child/2 function, start_child/3 will also
be added, which allows a process to be started with a given id:

DynamicSupervisor.start_child(MySupervisor, "hello", args)

That will start a new child with id of "hello". Registry lookups are
done with the {:via, ..., ...} option:

location = {:via, DynamicSupervisor, {:id, MySupervisor, "hello"}}
GenServer.call(location, :perform_action)

Sharded registry

The registry and shards feature can be used together, which means
all shards will be written to the same registry. Furthermore, the
registry itself can be used to look up a particular shard:

location = {:via, DynamicSupervisor, {:shard, MySupervisor, 0}}
DynamicSupervisor.start_child(location, "hello", args)

This will start a child in the supervisor at shard 0 with ID hello,
completely bypassing the main supervisor in the shard case.

DynamicSupervisor as consumer

Finally, the DynamicSupervisor can be used as a consumer in
a GenStage pipeline. In such cases, the supervisor will be
able to send demand upstream and receive events. Every time
an event is received, a child will be started for that
supervisor. In order to provide such a feature, the supervisor
init/1 may return the same options as a GenStage's init/1
would:

def init(arg) do
  GenStage.async_subscription(self(), SomeProducer)

  children = [
    worker(MyWorker, [])
  ]

  {:ok, children, max_demand: 100, min_demand: 50}
end

In case of a sharded supervisor, the supervisor will work as
a proxy to all shards. Every time the supervisor is asked
to subscribe to a given producer, it will redirect the subscription
request to all shards (and it will persist those subscriptions in
case the shards crash, forcing them to resubscribe when they restart).

M-N spec update

Here is the updated SPEC.

It updates the current SPEC and replaces the "two-step subscription"
proposal by providing a mechanism where both producers and consumers
can start subscriptions.

Updated spec

Sent by both:

  • {:"$gen_subscribe", [{consumer_pid, ref}] | [{producer_pid, ref}], options} -
    both producers and consumers can start subscriptions. Once subscribe
    is sent or received by the consumer, it can immediately start
    sending demand to the producer. The ref is unique to identify the
    subscription. Both sides must monitor the opposite side so clean-up
    happens in case of crashes.

Sent by consumer:

  • {:"$gen_ask", {pid, ref}, count} -
    used to ask data from a producer. The ref identifies the
    subscription. The producer MUST emit data up to the counter to the
    pid identified by ref - even if it does not match the pid in
    the :"$gen_ask" message. The producer MUST send a reply (detailed
    below), even if it does not know the given reference, in which case
    the reply MUST be an :eos. Following messages will increase the
    counter kept by the producer. ask/3 is a convenience function to
    send this message.
  • {:"$gen_unsubscribe", {pid, ref}, reason} -
    cancels the current producer/consumer relationship. The producer
    MUST send a :"$gen_route" :eos message as a reply (detailed
    below) to the original subscriber. If it does not know the given
    ref, the reply is sent to pid. However there is no guarantee
    the message will be received (for example, the producer may crash
    just before sending the confirmation). For such cases, it is recommended
    for the producer to be monitored. unsubscribe/3 is a convenience
    function to send this message.

Sent by producer:

  • {:"$gen_route", {pid, ref}, [event]} -
    used to send data to a consumer. The ref identifies the
    subscription. The third argument is a non-empty list of events.
    route/3 is a convenience function to send this message.
  • {:"$gen_route", {pid, ref}, {:eos, reason}} -
    signals the end of the "event stream" identified by ref. Reason
    may be :done, :halted or :ignored (for unknown asks and
    unsubscribes). route/3 is a convenience function to send this
    message.
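To make the exchange concrete, here is a hypothetical round trip under this draft spec (the pids, ref, and demand value are illustrative):

# Consumer side: monitor the producer, subscribe, then ask for 10 events.
ref = Process.monitor(producer)
send(producer, {:"$gen_subscribe", [{self(), ref}], []})
send(producer, {:"$gen_ask", {self(), ref}, 10})

# The producer replies with events...
#   {:"$gen_route", {consumer, ref}, [event1, event2]}
# ...and eventually signals the end of the stream:
#   {:"$gen_route", {consumer, ref}, {:eos, :done}}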

M-N Router

Here we define the semantics of an M-N router that allows M producers
to send data to N consumers. The connections between producers and
consumers are established directly and not managed by the router.
However, all consumers are also subscribed to the router, which allows
the router to dynamically send data.

When a new producer is added, the GenRouter will send N subscribe
messages to the producer, each referencing all the existing N
consumers. After it will send a subscribe message to all existing N
consumers referencing the producer. The GenRouter will remain
subscribed to the producer according to its own strategy.

When a new consumer is added, a demand will be established between
router and consumer, where the router is effectively a producer.
This will be used by ad-hoc events, for example, via
GenRouter.sync_notify. After the router-consumer relationship is
established, the router will send M subscribe messages to all existing
M producers referencing the consumer as well as M messages to the
consumer referencing all M producers.

PENDING This mechanism ensures the router has established the
relationship between M producers and N consumers, however, once
a given producer has an event to send to a consumer, which consumer
should it choose?

Failure semantics

Because the router is able to reconnect producers to consumers,
the router does not need to crash if a producer or a consumer
crashes. Furthermore, producers do not need to crash if a
consumer crashes, nor does a consumer need to crash if a producer
crashes.

The only exception is the router-consumer relationship. If the
router crashes, consumers must crash, otherwise a new router
may start and duplicate the relationships between the M producers
and N consumers.

PENDING Therefore we need to decide whether the failure semantics,
let's call it linking (even if it may end-up implemented with
monitors) is a property of the subscription (i.e. linking only
happens when the consumer starts the subscription) or if the
semantics are specified by an explicit option in the subscribe
message.

DynamicSupervisor ignores min_demand

The min_demand option—while stored and computed—is ignored in DynamicSupervisor. See: https://github.com/elixir-lang/gen_stage/blob/master/lib/dynamic_supervisor.ex#L510-L516

Demo:

defmodule DynSupDemo do
  alias Experimental.DynamicSupervisor
  use DynamicSupervisor

  def start_link do
    DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init([]) do
    children = [
      worker(Consumer, [], restart: :temporary)
    ]

    {:ok, children, [strategy: :one_for_one,
                     subscribe_to: [{Producer,
                                     min_demand: 999,
                                     max_demand: 1000
                                    }]]}
  end
end

defmodule Producer do
  use Experimental.GenStage
  alias Experimental.{GenStage, DynamicSupervisor}

  def start_link do
    GenStage.start_link(__MODULE__, 1, name: __MODULE__)
  end

  def init(counter) do
    {:producer, counter}
  end

  def handle_demand(demand, counter) when demand > 0 do
    IO.puts "==>#{demand} --- #{DynamicSupervisor.count_children(DynSupDemo).active}"
    # This exists to stagger the event completion (see Consumer)
    counter = cond do
      counter >= 10 -> 1
      counter -> counter
    end
    list = Enum.to_list(counter..(demand - 1 + counter))
    IO.inspect list
    {:noreply, list, counter + demand}
  end
end

defmodule Consumer do
  use GenServer

  def start_link(event) do
    GenServer.start_link(__MODULE__, event)
  end

  def init(args) do
    send(self(), :process)
    {:ok, args}
  end

  def handle_info(:process, state) do
    IO.inspect state
    # Stagger completion.
    :timer.sleep(300 + state * 100)
    {:stop, :normal, state}
  end
end

You can see that even though min_demand is set to 999, the DynamicSupervisor will ask for more events after each event is processed. Is this the intended behavior, a bug, or am I using DynamicSupervisor for the wrong purpose?

Introduce Flow.departition

When working with Flow, it creates multiple partitions, and those partitions give us only a fragmented view of the data. To get a full view back, we need to merge the data together. We could handle this by providing a Flow.departition/2 that receives a flow and puts the partitioned data back together according to the given function. The only downside is that we lose parallelism.
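A hypothetical usage sketch (the exact shape of Flow.departition is not final; the merge function here is illustrative):

# Count words per partition, then merge the per-partition maps into one,
# trading parallelism for a unified view in the final step.
~w(the quick brown fox jumps over the lazy dog the fox)
|> Flow.from_enumerable()
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
|> Flow.departition(fn left, right ->
  Map.merge(left, right, fn _word, c1, c2 -> c1 + c2 end)
end)
|> Enum.to_list()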

Can't view process info in observer for GenStage

When I double-click on a GenStage process in observer I get the following crash:

iex(6)> Child (unknown) crashed exiting:  <0.870.0> {function_clause,
                                             [{lists,mapfoldl,
                                               [#Fun<observer_html_lib.3.79402349>,
                                                false,
                                                {"State",
                                                 [{subscribe_to,
                                                   [twitch_producer]}]}],
                                               [{file,"lists.erl"},
                                                {line,1352}]},
                                              {lists,mapfoldl,3,
                                               [{file,"lists.erl"},
                                                {line,1354}]},
                                              {lists,mapfoldl,3,
                                               [{file,"lists.erl"},
                                                {line,1354}]},
                                              {observer_html_lib,
                                               expandable_term_body,3,
                                               [{file,"observer_html_lib.erl"},
                                                {line,104}]},
                                              {observer_html_lib,
                                               expandable_term,3,
                                               [{file,"observer_html_lib.erl"},
                                                {line,55}]},
                                              {observer_procinfo,
                                               '-init_state_page/3-fun-0-',3,
                                               [{file,"observer_procinfo.erl"},
                                                {line,288}]},
                                              {observer_procinfo,
                                               init_state_page,3,
                                               [{file,"observer_procinfo.erl"},
                                                {line,291}]},
                                              {observer_procinfo,init_panel,
                                               4,
                                               [{file,"observer_procinfo.erl"},
                                                {line,102}]}]}

I'm on Elixir 1.3.1 and OTP 18

Introduce a notification system

Notifications are never dropped and always sent to all consumers.

Notifications must preserve ordering relative to previously dispatched events; therefore we should implement them using a wheel.

Async event delivery is undocumented/unsupported

As far as I can tell, if a consumer requests events, the producer must immediately return that many events from the handle_demand callback, or else drop down to the raw message protocol to send events asynchronously. Otherwise the consumer has to issue another request in order to ever receive any events.

Is async_notify/sync_notify meant to be used for this purpose? How is a callback module supposed to handle notify messages? By raw interpretation of the notify message in handle_info?
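For reference, producers are not limited to emitting events from handle_demand; any callback can return events once data becomes available, satisfying previously received demand. A sketch (the message shape is illustrative):

# Emit events from handle_info when new data arrives, instead of
# dropping down to the raw message protocol.
def handle_info({:new_data, events}, state) do
  {:noreply, events, state}
end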

GenRouter -> GenBroker

This proposal introduces two new components into Elixir,
GenStage and Broker.

Stages are computation stages that send and/or receive data
from other stages. When a stage sends data, it acts as
a producer. When it receives data, it acts as a consumer.
Stages may take both producer and consumer roles at once.
From now on, when we mention "producer" and "consumer", we
imply a stage taking its producer or consumer roles.

When data is sent between stages, it is done by a message
protocol that provides back-pressure. It starts by the
consumer stage subscribing to the producer stage and
asking for events. A consumer stage will never receive
more data than it has asked for from its producer stage.

By default, a stage may only connect to a single producer
and/or a single consumer. A broker lifts this limitation
by allowing M producers to connect to N subscribers according
to a given strategy.

This document describes the messages received by both
producers and consumer roles. It also specifies both stage
and broker behaviours.

Message protocol

This section specifies the message protocol for both producers
and consumers. Most developers won't implement those messages
but rely on GenStage and Broker behaviours defined in later
sections.

Producer

The producer is responsible for sending events to consumers
based on demand.

A producer MUST manage at least one subscription by receiving a
subscription message from a consumer stage or from a consumer
broker. Once a subscription is established, new connections
MAY be established and demand MAY be received.

Except for the initial subscription message, the producer does
not distinguish between its consumers. All messages it must
receive are defined below:

  • {:"$gen_producer", from :: {consumer_pid, subscription_ref}, {:stage, options}} -
    sent by the consumer to the producer to start a new subscription.

    Once sent, the consumer MAY immediately send demand to the producer.
    The subscription_ref is unique to identify the subscription. The
    consumer MUST monitor the producer for clean-up purposes in case of
    crashes. The consumer MUST NOT establish new connections over this
    subscription.

    Once received, the producer MUST monitor the consumer. If the producer
    already has a subscription, it MAY ignore future subscriptions by
    sending a disconnect reply (defined in the Consumer section) except
    for cases where the new subscription matches the subscription_ref.
    In such cases, the producer MUST crash.

  • {:"$gen_producer", from :: {consumer_pid, subscription_ref}, {:broker, strategy, options}} -
    sent by the consumer to the producer to start a new subscription.

    The consumer MAY establish new connections by sending :connect
    messages defined below. The subscription_ref is unique to identify
    the subscription. The consumer MUST monitor the producer for clean-up
    purposes in case of crashes.

    Once received, the producer MUST monitor the consumer. The producer
    MUST initialize the strategy by calling strategy.init(from, options).
    If the producer already has a subscription, it MAY ignore future
    subscriptions by sending a disconnect reply (defined in the Consumer
    section) except for cases where the new subscription matches the
    subscription_ref. In such cases, the producer MUST crash.

  • {:"$gen_producer", from :: {pid, subscription_ref}, {:connect, consumers :: [pid]}} -
    sent by the consumer to producers to start new connections.

    Once sent, the consumer MAY immediately send demand to the producer.
    The subscription_ref is unique to identify the subscription.

    Once received, the producer MUST call strategy.connect(consumers, from, state)
    if one is available. If the subscription_ref is unknown, the
    producer MUST send an appropriate disconnect reply to each consumer.

  • {:"$gen_producer", from :: {consumer_pid, subscription_ref}, {:disconnect, reason}} -
    sent by the consumer to disconnect a given consumer-subscription pair.

    Once received, the producer MAY call strategy.disconnect(reason, from, state)
    if one is available. The strategy MUST send a disconnect message to the
    consumer pid. If the consumer_pid refers to the process that started
    the subscription, all connections MUST be disconnected. If the
    consumer-subscription is unknown, a disconnect MUST still be sent with
    proper reason. In all cases, however, there is no guarantee the message
    will be delivered (for example, the producer may crash just before sending
    the confirmation).

  • {:"$gen_producer", from :: {consumer_pid, subscription_ref}, {:ask, count}} -
    sent by consumers to ask data from a producer for a given consumer-subscription pair.

    Once received, the producer MUST call strategy.ask(count, from, state)
    if one is available. The producer MUST send data up to the demand. If the
    pair is unknown, the producer MUST send an appropriate disconnect reply.

Consumer

The consumer is responsible for starting the subscription
and sending demand to producers.

A consumer MUST manage at least one subscription by sending a
subscription message to a producer. Once a subscription is
established, new connections MAY be established and demand MAY
be sent. Once demand is sent, messages may be received as
defined below:

  • {:"$gen_consumer", from :: {producer_pid, subscription_ref}, {:connect, producers :: [pid]}} -
    sent by producers to consumers to start new connections.

    Once received, the consumer MAY immediately send demand to
    the producer. The subscription_ref is unique to identify
    the subscription. If the subscription is not known, a
    disconnect message must be sent back to each producer.

  • {:"$gen_consumer", from :: {producer_pid, subscription_ref}, {:disconnect, reason}} -
    sent by producers to disconnect a given producer-subscription pair.

    It is used as a confirmation for client disconnects OR whenever
    the producer wants to cancel some upstream demand. Reason may be
    :done, :halted or :unknown_subscription.

  • {:"$gen_consumer", from :: {producer_pid, subscription_ref}, [event]} -
    events sent by producers to consumers.

    subscription_ref identifies the subscription. The third argument
    is a non-empty list of events. If the subscription is unknown, the
    events must be ignored.

GenStage

GenStage is a generic stage that may act as a producer,
consumer or both. It is built on top of a GenServer with
the following changes:

  • init(args) may return {:ok, state, opts} where opts
    MAY contain keys such as:
    • :subscribe_to - the producer to subscribe to (enables consumer)
    • :max_demand - the maximum demand it may ask from producer
    • :min_demand - the minimum demand which, once reached, requests for more demand upstream
  • handle_event(event, from, state) invoked on consumers.
    Must return the same as GenServer.handle_info/2.
  • handle_call/3, handle_cast/2 and handle_info/2 will
    be changed to allow emitting events (for producers).
  • handle_demand(demand, from, state) invoked on producers.
    Must return the same as GenStage.handle_call/2.

TODO: Should we copy all of the GenServer API (call, cast, multicall) into GenStage? Part of it?
Or should we ask them to use GenServer?

Consumer example

A simple consumer that inspects events:

defmodule InspectConsumer do
  use GenStage

  def init(_) do
    # TODO: How to specify options for the subscription itself?
    # I.e. the options in {:"$gen_producer", from, {:stage, options}}?
    {:ok, %{}, subscribe_to: ..., max_demand: 50, min_demand: 25}
  end

  def handle_event(event, _from, state) do
    IO.inspect event
    {:noreply, state}
  end
end

Producer example

A simple producer that returns data according to a counter:

defmodule CounterProducer do
  use GenStage

  def init(_) do
    {:ok, 0}
  end

  def handle_demand(demand, _from, counter) do
    {:dispatch, Enum.to_list(counter..(counter + demand - 1)), counter + demand}
  end
end

Broker

The broker is responsible for connecting M producers to
N consumers. The connections between producers and consumers
are established directly and not intermediated by the broker.
This means consumers will send demand to M producers and
producers will send events to N consumers. How the demand is
handled by the producer is done via a broker strategy.

Subscribing a consumer to a broker is the same as subscribing it
to any other producer. A broker may also subscribe itself to a
producer, the only difference from the producer perspective is
that subscription message is tagged as :broker with a strategy
instead of :stage (as specified in the "$gen_producer" messages
defined in earlier sections).

A broker will never send demand to its producers. That's because
the producer is never expected to send events directly to the
broker. Demand is always received directly from consumers and
events are sent directly to consumers to avoid overhead.

A broker, however, will receive demand from consumers. Such
demand is used to dynamically dispatch events through the broker.

Finally, a broker is responsible for monitoring all producers
and consumers and for relaying the proper connect and disconnect
messages to producers and consumers.

Connection management

When a new producer is added to the broker, the broker will send
N connect messages to the producer, each referencing all the
existing N consumers. Afterwards it will send a connect message to all
existing N consumers referencing the producer. The Broker will
remain subscribed to the producer but never send demand upstream.

When a new consumer is added to the broker, a demand will be
established between broker and consumer, where the broker is
effectively a producer. This will be used for dynamic broker
dispatch. After the broker-consumer relationship is established,
the broker will send M subscribe messages to all existing M
producers referencing the consumer as well as M messages to
the new consumer referencing all M producers.

Broker strategy

TODO: specify all callbacks in the broker strategy

Dynamic broker dispatch

TODO: specify how dynamic dispatch through the broker works

Consumers stop requesting events if they process events too quickly

I have a pipeline consisting of the following:

producer (via from_enumerable(list)) -> producer_consumer -> producer_consumer -> consumer

My consumer appears to be processing events too quickly, and stops requesting more events after a few batches.

On v0.3.0, Process.sleep(250) is required in order for the consumer to work properly.
On master, weirdly Process.sleep(1) is required for it to work.

I have an .exs script that replicates this issue with my current mix project here

Proposal for DynamicSupervisor

A DynamicSupervisor is a supervisor designed to supervise
and manage many children dynamically.

It is a spawn-off of the :simple_one_for_one strategy
found in the regular Supervisor.

We have a couple goals by introducing a dynamic supervisor:

  • Simplify the API and usage of both Supervisor modules. Most
    of the documentation in the Supervisor module is full of
    conditionals: "if the supervisor type is :simple_one_for_one,
    it will behave as X, otherwise as Y." The differences in
    behaviour with little surrounding context make supervisors
    hard to learn, understand and use;
  • Provide max_children limit to ensure supervisors cannot be overloaded;
  • Implement the GenStage specification
    so dynamic supervisors can subscribe to producers and spawn
    children dynamically based on demand;

The first bullet is about implementing a DynamicSupervisor
module with the same API and functionality as a :simple_one_for_one
Supervisor. That's relatively straightforward to do, and therefore
we will focus on the other functionality for the rest of this proposal.

Overloaded

The supervisor will allow a new option called :max_children. Once
:max_children is reached, start_child/2 will return
{:error, :overloaded}.

DynamicSupervisor as consumer

Finally, the DynamicSupervisor can be used as a consumer in
a GenStage pipeline. In such cases, the supervisor will be
able to send demand upstream and receive events. Every time
an event is received, a child will be started for that
supervisor. In order to provide such a feature, the supervisor
init/1 may return the same options as a GenStage's init/1
would:

def init(arg) do
  GenStage.async_subscription(self(), SomeProducer)

  children = [
    worker(MyWorker, [])
  ]

  {:ok, children, max_demand: 100, min_demand: 50}
end

File.stream! causes Flows to never exit

This issue can be reproduced using examples from the Flow documentation, for example:

Elixir Version: 1.3.2
gen_stage version: 0.4.1
erlang version: OTP 19

      alias Experimental.GenStage.Flow
      File.stream!("path/to/some/file")
      |> Flow.from_enumerable()
      |> Flow.flat_map(&String.split(&1, " "))
      |> Flow.partition()
      |> Flow.reduce(fn -> %{} end, fn word, acc ->
        Map.update(acc, word, 1, & &1 + 1)
      end)
      |> Enum.to_list()

If you run this snippet, Enum.to_list will never output anything.

If you change the File.stream! to File.read! and turn the result into an enumerable with something like String.split(contents, "\n"), then Enum.to_list will output as expected.
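Spelled out, that workaround looks like this (a sketch; note it reads the entire file into memory, unlike File.stream!):

alias Experimental.GenStage.Flow

File.read!("path/to/some/file")
|> String.split("\n")
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split(&1, " "))
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, acc ->
  Map.update(acc, word, 1, & &1 + 1)
end)
|> Enum.to_list()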

Expected behaviour:

Flows should terminate when all records have been read from a file when using File.stream!

Need documentation for cancelling flows of finite producers

The original question was brought up on this ElixirForum post but here's a recap.

I am creating a finite GenStage producer that reads the body of a hackney http request. Ideally, I'd like to hook this Producer up to a Flow using Flow.from_stage/1. I've been able to create a basic producer that reads the request, but don't know how to shut down the flow after the body has been read. Is this possible?

@josevalim mentioned that I needed to use GenStage.async_notify(self(), {:producer, :done}) to shut down all of the consumers, and this behavior would be good to document. It would also be good to mention how GenStage.Streamer tracks the subscriptions and cancellations of its consumers when the :consumers option is set to :permanent. It looks like once all of the consumers have been removed, it returns {:stop, :normal, state} to shut itself down. Without this tracking, it appears the producer won't shut down.

I'm also wondering: where should the documentation for this go? Is GenStage.async_notify(self(), {:producer, :done}) specific to Flow, or does it work for GenStage as well? I noticed the GenStage.from_enumerable/2 docs mention you could also send GenStage.async_notify(self(), {:producer, :halted}). What does that do?

Thanks so much for your help!

Allow redirections on subscribe

This will be needed if we want to subscribe to the flow coordinator. However, if we want to allow such subscriptions, developers will need to explicitly choose how they want to consume the flow (per event or per state/batch/window).

We also need to guarantee redirections are atomic. We need to call handle_subscribe for every new redirect and only then call handle_cancel for the "redirector" process.

Default demand should be 1

I think the default demand should be one, or there should be no default at all.

Premises

  1. When people play around with Flow, they're expecting basically a configurable concurrent Enum. Obviously there's a lot more there, but this comparison is evident both in Flow's API as well as the examples given in Flow / GenStage's docs which include explicit comparisons to Enum base pipelines.

Issues with current defaults:

  • Much too high for many uses. For anything IO bound or where the time taken to perform an operation dominates the runtime of the overall flow, any default other than 1 is entirely too high. The current defaults specifically are orders of magnitude too high.
  • Counterintuitive. Given premise 1, people are used to thinking about consuming enumerables one thing at a time. Batching an enumerable takes an explicit call to do so. As both my own experience and the experience of others will testify, there have been many cases where we used flow to build something and saw everything happen sequentially because we didn't know there was batching happening under the covers. While such batching may be useful when trying to maximize the throughput of certain use cases, I think it is more natural to consider batching as something you opt into rather than need to opt out of.

Issues with the proposal of 1, and responses.

  • Much too low for many uses. While true, it seems more natural to start with too little grouping and add batching on, rather than start with an arbitrary amount of batching and have to adjust up or down.

Issues with any default other than 1:

  • There are definitely cases where 1 is the ideal default. This is less true for every other number.
  • No matter what number you choose, it's going to be wrong for a lot of cases. You can try to pick numbers you think suit the majority of cases, but this is hard to determine a priori. 1 actually does relatively well here, however. The other factor in choosing a value is intuitiveness, and I think 1 works in that respect as well, due to premise 1.

i'm just wondering, how's it going?

Hi. I've heard about GenRouter from ElixirConf. Man, it's like OTP's missing cool new gen_*, highly usable! So I'm just wondering about the current state of the idea. Do you have any news?

Support more options around buffering

@fishcakez I was implementing a simple GenEvent using the Broadcast dispatcher:

defmodule EventManager do
  use GenStage

  def start_link(opts) do
    GenStage.start_link(__MODULE__, opts, opts)
  end

  def notify(manager, events, timeout \\ 5000) when is_list(events) do
    GenStage.call(manager, {:notify, events}, timeout)
  end

  def init(opts) do
    {:producer, :empty_state, [dispatcher: GenStage.BroadcastDispatcher] ++ opts}
  end

  def handle_call({:notify, events}, _from, state) do
    {:reply, :ok, events, state}
  end
end

I realized there are a couple differences related to buffering:

  • the manager above will buffer events when there are no consumers
  • once the buffer limit is reached, we will always keep the first entries and discard the latest ones

We already have :buffer_size, which handles the size of the buffer. I think we should add two new options (sketched after this list):

  • Allow :discard_buffer_when_no_consumers with values true and false (default)
  • Support :buffer_keep with values :first and :last. The default should be to keep the last, because keeping the first seems like weird behaviour? Alternatively, we can support only keeping the last and wait until someone complains.
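How the proposed options might look at init time (a sketch; option names as proposed above, not necessarily a final API):

def init(opts) do
  {:producer, :empty_state,
   [dispatcher: GenStage.BroadcastDispatcher,
    buffer_size: 10_000,
    buffer_keep: :last] ++ opts}
end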

Consumer event list in handle_events return

When reading the recent blog post I got confused by:

# We are a consumer, so we never emit events.
{:noreply, [], sleeping_time}

If a consumer can never emit events, why am I required to return a list of events for it not to emit?

Of course, if I don't make this list an empty list, I'll get an error:

defp dispatch_events(events, %{type: :consumer} = stage) do
  :error_logger.error_msg('GenStage consumer ~p cannot dispatch events (an empty list must be returned): ~p~n', [name(), events])
  stage
end

But if it knows the return value is always [], why do I have to tell it?

I'd argue that handle_events should have the type:

{:noreply, new_state} |
{:noreply, new_state, :hibernate} |
{:noreply, [event], new_state} |
{:noreply, [event], new_state, :hibernate} |
{:stop, reason, new_state} when new_state: term, reason: term, event: term

And when handling mod.handle_events, match on the type of process currently running, since only a consumer should be using the return type with no list of events.

Now I also notice that :noreply appears to be superfluous. You can never reply, so why say :noreply at all? It is an odd way to differentiate from :stop, so maybe unnecessary?

Lack of parallelism when using the `Flow` API without partition

First of all, this is great stuff and I'm super excited to start using this for real! One problem that I'm facing now though is that I'm not observing any parallelism when using the Flow API without calling partition. For example:

1..8
|> Flow.from_enumerable
|> Flow.map(fn(x) -> IO.puts(x); :timer.sleep(1000) end)
|> Flow.run

From what I understood from the documentation, the map block should run in parallel and, therefore, on my 8-core machine this should take roughly 1 second. Also, the output should be printed out of order. Investigating a little more, if I just print the current pid in the map block, it returns the same one every time, which shows there's no parallelism. However, if I use Flow.partition right after Flow.from_enumerable, it indeed parallelizes the execution.

I'm on Elixir 1.3.2, Erlang 19 and gen_stage 0.4.3.
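For reference, the variant that did parallelize, per the observation above:

1..8
|> Flow.from_enumerable()
|> Flow.partition()
|> Flow.map(fn x -> IO.puts(x); :timer.sleep(1000) end)
|> Flow.run()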

list |> Flow.from_enumerable(max_demand: n)

Right now, customizations to the number of stages or the demand numbers require the use of Flow.new. This disrupts the general flow of a pipeline, i.e.:

File.stream!("path/to/file")
|> Flow.from_enumerable()
|> Flow.etc...

If I want to provide options it's now

Flow.new(opts)
|> Flow.from_enumerable(File.stream!("path/to/file"))
|> Flow.etc...

File.stream! is, for all intents and purposes, the logical source of the Flow, but it is now relegated to a secondary position. I'm proposing:

File.stream!("path/to/file")
|> Flow.from_enumerable(opts)
|> Flow.etc...

producer_consumer as a producer

Hi! Perhaps I want to implement some sort of cache with GenStage. So in the scheme [A] -> [B] -> [C], [A] is a producer, [C] is a consumer, and [B] is a producer-consumer, and I want to cache results from [A] in [B]. Now I want to analyze demands in [B] and reply immediately (without a demand message to [A]) if I have the result in the cache. Is it possible to override handle_info in [B] to intercept messages like $gen_producer and implement the above logic? As I understand it, handle_demand is never called in a producer_consumer. Maybe it would be more flexible to call this callback in a producer_consumer, with a {:noreply, :forward} reply as the default behavior and a possible {:reply, event} reply for the above cases?

Maybe GenStage is not designed for such cases at all?

thanks!

Two-step subscriptions

Today there is a chance GenRouter can be a bottleneck if we send all demand and events through it. The following proposal attempts to remove that bottleneck by allowing two-step subscriptions.

Two-step subscriptions

First the consumer will send a subscription request to the subscription manager. The subscription manager will then reply to such request with the actual producers. The subscription manager may also be the producer. The subscription manager may tell the consumer of new producers at any time. The following workflow is proposed:

  1. Consumer sends a subscription message to the manager
  2. Manager replies with producer processes
  3. Consumer starts sending ask message to producer processes

@fishcakez, notice I have removed the gen_subscribe between producer and consumer because of the explicit step with the subscription manager. Is this acceptable?

Why does it matter?

Imagine an M-N scenario where M producers want to send messages to N consumers. Assuming we are doing demand-driven routing (you ask, you get it), it would work like this today:

[P 1] -\            /- [C 1]
[...] --- [Router] --- [...]
[P M] -/            \- [C N]

In the scenario above, the GenRouter becomes a bottleneck. However, with this extension, consumers are connected directly with producers and bypass the router. "C 1" will send demand directly to all M producers, as will "C 2" and so forth. Events are also sent directly to consumers. This also means the GenRouter no longer needs to track the demand (which could otherwise become out of sync and cause further issues).

Other implications

This means the In/Out components of a router must be coupled if we want better performance. In other words, GenRouter should no longer have both In/Out callback modules. Furthermore, given this new role, it may make sense to generally allow a GenRouter to work as a registry (with notify and whereis APIs). However, we need to figure out how the optimizations above would work with other strategies like broadcast, sharding and so on.

GenStage.init typespec mismatch

When I run dialyzer on my code I get this:

lib/priceline/hotel_producer_stage.ex:11: The inferred return type of init/1 ({'producer',_}) has nothing in common with 'ignore' | {'ok',_} | {'stop',_} | {'ok',_,'hibernate' | 'infinity' | non_neg_integer()}, which is the expected return type for the callback of 'Elixir.GenServer' behaviour

This is the relevant part of my stage:

defmodule Priceline.LoadHotelStage do
  alias Experimental.GenStage
  use GenStage

  def init(stream) do
    {:producer, stream}
  end

  def handle_demand(demand, stream) when demand > 0 do
    ...
  end
end

No new demand upon re-subscription?

When a consumer subscribes to a producer, then after some time its subscription is cancelled and it resubscribes, no new demand is sent. Is that intentional?

Here's a test case: the producer tracks all subscription refs in a set and cancels them all when it receives a :cancel_all cast.

alias Experimental.{GenStage}

defmodule P do
  use GenStage

  def init(:ok) do
    {:producer, MapSet.new}
  end

  def handle_demand(demand, subscriptions) do
    IO.puts "new demand: #{demand}"
    {:noreply, [], subscriptions}
  end

  def handle_subscribe(:consumer, opts, {_, ref}, subscriptions) do
    IO.puts "#{inspect ref} subscribed"
    {:automatic, MapSet.put(subscriptions, ref)}
  end

  def handle_cancel(_reason, {_, ref}, subscriptions) do
    IO.puts "#{inspect ref} cancelled"
    {:noreply, [], MapSet.delete(subscriptions, ref)}
  end

  def handle_cast(:cancel_all, subscriptions) do
    Enum.each(subscriptions, fn ref ->
      IO.puts "asking #{inspect ref} to cancel"
      GenStage.cancel({self(), ref}, :bye)
    end)
    {:noreply, [], subscriptions}
  end
end

defmodule C do
  use GenStage

  def init(:ok) do
    {:consumer, :ok}
  end

  def handle_events(events, _from, state) do
    {:noreply, [], state}
  end
end

{:ok, p} = GenStage.start(P, :ok)
{:ok, c} = GenStage.start(C, :ok)

GenStage.sync_subscribe(c, to: p, cancel: :temporary)
GenStage.cast(p, :cancel_all)
:timer.sleep(1_000)
GenStage.sync_subscribe(c, to: p, cancel: :temporary)
Process.sleep(:infinity)

Consider guarding against GenStage.sync_subscribe arguments

I accidentally passed a tuple to GenStage.sync_subscribe (instead of a pid), and got this very confusing error:

** (FunctionClauseError) no function clause matching in GenServer.whereis/1
     (elixir) lib/gen_server.ex:772: GenServer.whereis({:ok, #PID<0.622.0>})
     (elixir) lib/gen_server.ex:594: GenServer.call/3
    (my_app) lib/my_module.ex:19: MyModule.execute/0

Flow.zip as counterpart to Stream.zip

It might be useful to have Flow.zip alongside Stream.zip.

Flow.zip subscribes to all :producer stages given as arguments, and only emits events containing one event payload from each :producer it subscribed to.

It might be useful to deviate from Stream.zip(left, right) and create a Flow.zip(left, [right]) in order not to be forced to pick the order of zipping at the call site.

I'm too new to Elixir to have an opinion whether

  • Flow.zip(left, [right]) is the right form, or rather
  • Flow.zip([]), which (I guess) could not be used in |> notation.

This idea started on the Elixir forum.

Consider automatically computing statistics

We could keep start_time and store the number of events received, the number of event messages, the amount of demand received, and the number of demand requests. Those should be relatively cheap to store and compute.

/cc @fishcakez
