
Verk


Verk is a job processing system backed by Redis. It uses the same job definition as Sidekiq/Resque.

The goal is to be able to isolate the execution of a queue of jobs as much as possible.

Every queue has its own supervision tree:

  • A pool of workers;
  • A QueueManager that interacts with Redis to get jobs and enqueue them back to be retried if necessary;
  • A WorkersManager that will interact with the QueueManager and the pool to execute jobs.

Verk holds one connection to Redis per queue, plus one dedicated to the ScheduleManager and one general connection for other use cases such as deleting a job from the retry set or enqueuing new jobs.

The ScheduleManager fetches jobs from the retry set and enqueues them back to their original queue when they are ready to be retried.

It also has one GenStage producer called Verk.EventProducer.

The image below is an overview of Verk's supervision tree running with a single queue named default that has 5 workers.

Supervision Tree

Feature set:

  • Retry mechanism with exponential backoff
  • Dynamic addition/removal of queues
  • Reliable job processing (RPOPLPUSH and Lua scripts to the rescue)
  • Error and event tracking

Installation

First, add :verk to your mix.exs dependencies:

def deps do
  [
    {:verk, "~> 1.0"}
  ]
end

and run $ mix deps.get.

Add :verk to your applications list if your Elixir version is 1.3 or lower:

def application do
  [
    applications: [:verk]
  ]
end

Add Verk.Supervisor to your supervision tree:

defmodule Example.App do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec
    tree = [supervisor(Verk.Supervisor, [])]
    opts = [name: Simple.Sup, strategy: :one_for_one]
    Supervisor.start_link(tree, opts)
  end
end
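
The example above uses Supervisor.Spec, which applies to older Elixir versions. On Elixir 1.5 and later (where Supervisor.Spec is deprecated) the equivalent child-spec style should look roughly like this; a sketch, assuming Verk.Supervisor exposes a standard child spec:

defmodule Example.App do
  use Application

  def start(_type, _args) do
    # Verk.Supervisor as a plain child spec (assumption: it defines child_spec/1).
    children = [Verk.Supervisor]
    opts = [name: Simple.Sup, strategy: :one_for_one]
    Supervisor.start_link(children, opts)
  end
end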

Finally we need to configure how Verk will process jobs.

Configuration

Example configuration for Verk with two queues: default and priority.

The default queue will have at most 25 jobs being processed at a time, and priority at most 10.

config :verk, queues: [default: 25, priority: 10],
              max_retry_count: 10,
              max_dead_jobs: 100,
              poll_interval: 5000,
              start_job_log_level: :info,
              done_job_log_level: :info,
              fail_job_log_level: :info,
              node_id: "1",
              redis_url: "redis://127.0.0.1:6379"

Verk supports the convention {:system, "ENV_NAME", default} for reading environment configuration at runtime using Confex:

config :verk, queues: [default: 25, priority: 10],
              max_retry_count: 10,
              max_dead_jobs: 100,
              poll_interval: {:system, :integer, "VERK_POLL_INTERVAL", 5000},
              start_job_log_level: :info,
              done_job_log_level: :info,
              fail_job_log_level: :info,
              node_id: "1",
              redis_url: {:system, "VERK_REDIS_URL", "redis://127.0.0.1:6379"}

Now Verk is ready to start processing jobs! πŸŽ‰

Workers

A job is defined by a module and arguments:

defmodule ExampleWorker do
  def perform(arg1, arg2) do
    arg1 + arg2
  end
end

This job can be enqueued using Verk.enqueue/1:

Verk.enqueue(%Verk.Job{queue: :default, class: "ExampleWorker", args: [1,2], max_retry_count: 5})

This job can also be scheduled using Verk.schedule/2:

perform_at = Timex.shift(Timex.now, seconds: 30)
Verk.schedule(%Verk.Job{queue: :default, class: "ExampleWorker", args: [1,2]}, perform_at)

Retry at

A job can define the function retry_at/2 for custom retry time delay:

defmodule ExampleWorker do
  def perform(arg1, arg2) do
    arg1 + arg2
  end

  def retry_at(failed_at, retry_count) do
    failed_at + retry_count
  end
end

In this example, the first retry will be scheduled a second later, the second retry will be scheduled two seconds later, and so on.

If retry_at/2 is not defined the default exponential backoff is used.

Keys in arguments

By default, Verk decodes keys in job arguments as binary strings. You can change this behavior for jobs enqueued by Verk with the following configuration (illustrated after the list below):

config :verk, :args_keys, value

The following values are valid:

  • :strings (default) - decodes keys as binary strings
  • :atoms - keys are converted to atoms using String.to_atom/1
  • :atoms! - keys are converted to atoms using String.to_existing_atom/1
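
For example, with the default :strings setting, a map enqueued with atom keys arrives in perform/1 with string keys, because the arguments are serialized to JSON in Redis. A minimal sketch (ReportWorker is a hypothetical worker):

defmodule ReportWorker do
  # Enqueued as %{user_id: 1}, the argument arrives here as %{"user_id" => 1}.
  def perform(%{"user_id" => user_id}) do
    IO.puts("Building report for user #{user_id}")
  end
end

Verk.enqueue(%Verk.Job{queue: :default, class: "ReportWorker", args: [%{user_id: 1}]})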

Queues

It's possible to dynamically add and remove queues from Verk.

Verk.add_queue(:new, 10) # Adds a queue named `new` with 10 workers
Verk.remove_queue(:new) # Terminate and delete the queue named `new`

Deployment

The way Verk currently works, there are two pitfalls to pay attention to:

  1. Each worker node's node_id MUST be unique. If a node comes online with a node_id that is already in use by another running node, the second node will re-enqueue all jobs currently in progress on the first node, resulting in jobs being executed multiple times.

  2. Take caution when removing nodes. If a node with jobs in progress is killed, those jobs will not be restarted until another node with the same node_id comes online. If a node with the same node_id never comes online, the jobs will be stuck forever. This means you should not use dynamic node_ids such as Docker container IDs or Kubernetes Deployment pod names.

On Heroku

Heroku provides an experimental environment variable named after the type and number of the dyno.

config :verk,
  node_id: {:system, "DYNO", "job.1"}

It is possible that two dynos with the same name could overlap for a short time during a dyno restart. As the Heroku documentation says:

[...] $DYNO is not guaranteed to be unique within an app. For example, during a deploy or restart, the same dyno identifier could be used for two running dynos. It will be eventually consistent, however.

This means that you are still at risk of violating the first rule above on node_id uniqueness. A slightly naive way of lowering the risk would be to add a delay in your application before the Verk queue starts.

On Kubernetes

We recommend using a StatefulSet to run your pool of workers. StatefulSets add a label, statefulset.kubernetes.io/pod-name, to all their pods with the value {name}-{n}, where {name} is the name of your StatefulSet and {n} is a number from 0 to spec.replicas - 1. StatefulSets maintain a sticky identity for their pods and guarantee that two identical pods are never up simultaneously. This satisfies both of the deployment rules mentioned above.

Define your worker like this:

# StatefulSets require a service, even though we don't use it directly for anything
apiVersion: v1
kind: Service
metadata:
 name: my-worker
 labels:
   app: my-worker
spec:
 clusterIP: None
 selector:
   app: my-worker

---

apiVersion: apps/v1
kind: StatefulSet
metadata:
 name: my-worker
 labels:
   app: my-worker
spec:
 selector:
   matchLabels:
     app: my-worker
 serviceName: my-worker
 # We run two workers in this example
 replicas: 2
 # The workers don't depend on each other, so we can use Parallel pod management
 podManagementPolicy: Parallel
 template:
   metadata:
     labels:
       app: my-worker
   spec:
     # This should probably match up with the setting you used for Verk's :shutdown_timeout
     terminationGracePeriodSeconds: 30
     containers:
       - name: my-worker
         image: my-repo/my-worker
         env:
           - name: VERK_NODE_ID
             valueFrom:
               fieldRef:
                 fieldPath: metadata.labels['statefulset.kubernetes.io/pod-name']

Notice how we use a fieldRef to expose the pod's statefulset.kubernetes.io/pod-name label as the VERK_NODE_ID environment variable. Instruct Verk to use this environment variable as node_id:

config :verk,
  node_id: {:system, "VERK_NODE_ID"}

Be careful when scaling the number of replicas down. Make sure that the pods that will be stopped and never come back do not have any jobs in progress. Scaling up is always safe.

Don't use Deployments for pods that will run Verk. If you hardcode node_id into your config, multiple pods with the same node_id will be online at the same time, violating the first rule. If you use a non-sticky environment variable, such as HOSTNAME, you'll violate the second rule and cause jobs to get stuck every time you deploy.

If your application serves as e.g. both an API and Verk queue, then it may be wise to run a separate Deployment for your API, which does not run Verk. In that case you can configure your application to check an environment variable, VERK_DISABLED, for whether it should handle any Verk queues:

# In your config.exs
config :verk,
  queues: {:system, {MyApp.Env, :verk_queues, []}, "VERK_DISABLED"}

# In some other file
defmodule MyApp.Env do
  def verk_queues("true"), do: {:ok, []}
  def verk_queues(_), do: {:ok, [default: 25, priority: 10]}
end

Then set VERK_DISABLED=true in your Deployment's spec.

EXPERIMENTAL - Generate Node ID

Since Verk 1.6.0 there is an experimental, optional configuration: generate_node_id. If it is set to true, node IDs are managed automatically by Verk.

Under the hood

  • Each time a job is moved to a queue's list of inprogress jobs, the node is added to verk_nodes (SADD verk_nodes node_id) and the queue is added to verk:node:#{node_id}:queues (SADD verk:node:123:queues queue_name);

  • Every frequency milliseconds the node key is set to expire in 2 * frequency milliseconds (PSETEX verk:node:#{node_id} 2 * frequency alive);

  • Every frequency milliseconds the keys of all nodes (verk_nodes) are checked. If a node's key has expired, that node is considered dead and its jobs need to be restored.

To restore them, Verk goes through all the running queues of the dead node (verk:node:#{node_id}:queues) and enqueues jobs from the inprogress list back to the queue. Each "enqueue back from in progress" operation is atomic (<3 Lua), so we won't end up with duplicates.
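
A rough illustration of these Redis operations using Redix (a sketch of the mechanism described above, not Verk's actual code; the node_id, queue name and frequency are placeholders):

node_id = "123"
frequency = 30_000

{:ok, redis} = Redix.start_link("redis://127.0.0.1:6379")

# Register the node and one of its queues.
Redix.command!(redis, ["SADD", "verk_nodes", node_id])
Redix.command!(redis, ["SADD", "verk:node:#{node_id}:queues", "default"])

# Heartbeat: the node key expires unless it is refreshed every `frequency` milliseconds.
Redix.command!(redis, ["PSETEX", "verk:node:#{node_id}", Integer.to_string(2 * frequency), "alive"])

# Another node can detect a dead node by checking whether the key still exists.
Redix.command!(redis, ["EXISTS", "verk:node:#{node_id}"])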

Configuration

The default frequency is 30_000 milliseconds but it can be changed by setting the configuration key heartbeat.

config :verk,
  queues: [default: 5, priority: 5],
  redis_url: "redis://127.0.0.1:6379",
  generate_node_id: true,
  heartbeat: 30_000

Reliability

Verk's goal is to never have a job that exists only in memory. It uses Redis as the single source of truth, so jobs that were being processed can be retried and tracked if a crash happens.

Verk will re-enqueue jobs if the application crashed while they were running. It will also retry jobs that failed, keeping track of the errors that happened.

Jobs that run on top of Verk should be idempotent, as they may run more than once.

Error tracking

One can track when jobs start, finish or fail. This can be useful to build metrics around the jobs. The QueueStats handler builds simple metrics from these events: https://github.com/edgurgel/verk/blob/master/lib/verk/queue_stats.ex

Verk has an Event Manager that notifies the following events:

  • Verk.Events.JobStarted
  • Verk.Events.JobFinished
  • Verk.Events.JobFailed
  • Verk.Events.QueueRunning
  • Verk.Events.QueuePausing
  • Verk.Events.QueuePaused

One can define an error tracking handler like this:

defmodule TrackingErrorHandler do
  use GenStage

  def start_link() do
    GenStage.start_link(__MODULE__, :ok)
  end

  def init(_) do
    filter = fn event -> event.__struct__ == Verk.Events.JobFailed end
    {:consumer, :state, subscribe_to: [{Verk.EventProducer, selector: filter}]}
  end

  def handle_events(events, _from, state) do
    Enum.each(events, &handle_event/1)
    {:noreply, [], state}
  end

  defp handle_event(%Verk.Events.JobFailed{job: job, stacktrace: trace}) do
    MyTrackingExceptionSystem.track(stacktrace: trace, name: job.class)
  end
end

Notice the selector that keeps only JobFailed events. If no selector is set, every event is delivered.

Then add the consumer to your supervision tree:

defmodule Example.App do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec
    tree = [supervisor(Verk.Supervisor, []),
            worker(TrackingErrorHandler, [])]
    opts = [name: Simple.Sup, strategy: :one_for_one]
    Supervisor.start_link(tree, opts)
  end
end

Dashboard

Check Verk Web!

Dashboard

Metrics

Check Verk Stats

License

Copyright (c) 2013 Eduardo Gurgel Pinho

Verk is released under the MIT License. See the LICENSE.md file for further details.

Sponsorship

Initial development sponsored by Carnival.io



verk's Issues

Error in decoding job in WorkerManager causes Application crash

We are seeing this issue come up:

15:49:13.960 [info]  10 jobs readded to the queue php_queue:emails from inprogress list

15:49:13.963 [error] Manager terminating, reason: {%Poison.SyntaxError{message: "Unexpected end of input", token: nil}, [{Poison.Parser, :parse!, 2, [file: 'lib/poison/parser.ex', line: 54]}, {Poison, :decode!, 2, [file: 'lib/poison.ex', line: 83]}, {Verk.Job, :decode!, 1, [file: 'lib/verk/job.ex', line: 19]}, {Verk.WorkersManager, :"-handle_info/2-fun-0-", 3, [file: 'lib/verk/workers_manager.ex', line: 102]}, {Enum, :"-reduce/3-lists^foldl/2-0-", 3, [file: 'lib/enum.ex', line: 1623]}, {Verk.WorkersManager, :handle_info, 2, [file: 'lib/verk/workers_manager.ex', line: 102]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 601]}, {:gen_server, :handle_msg, 5, [file: 'gen_server.erl', line: 667]}]}

15:49:13.964 [error] GenServer :"php_queue:emails.workers_manager" terminating
** (Poison.SyntaxError) Unexpected end of input
    (poison) lib/poison/parser.ex:54: Poison.Parser.parse!/2
    (poison) lib/poison.ex:83: Poison.decode!/2
    (verk) lib/verk/job.ex:19: Verk.Job.decode!/1
    (verk) lib/verk/workers_manager.ex:102: anonymous fn/3 in Verk.WorkersManager.handle_info/2
    (elixir) lib/enum.ex:1623: Enum."-reduce/3-lists^foldl/2-0-"/3
    (verk) lib/verk/workers_manager.ex:102: Verk.WorkersManager.handle_info/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:667: :gen_server.handle_msg/5
Last message: :timeout
State: %Verk.WorkersManager.State{monitors: :"php_queue:emails.workers_manager", pool_name: :"php_queue:emails.pool", pool_size: 10, queue_manager_name: :"php_queue:emails.queue_manager", queue_name: :"php_queue:emails", timeout: 1000}

15:49:13.966 [info]  Application <APPLICATION NAME> exited: shutdown
{"Kernel pid terminated",application_controller,"{application_terminated,<APPLICATION NAME>,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,<APPLICATION NAME>,shutdown})

Crash dump is being written to: erl_crash.dump...done

It looks like what is happening is that the WorkersManager attempts to decode a job that probably has bad JSON; this makes the WorkersManager crash and eventually leads to the parent process crashing.

I am confused as to why this results in our application crashing. We are supervising Verk in the suggested way; I would expect the WorkersManager process to crash and be restarted by Verk.Supervisor instead. Here is our supervisor configuration:

import Supervisor.Spec, warn: false
tree = [supervisor(Verk.Supervisor, [])]
opts = [name: <APPLICATION_NAME>.Supervisor, strategy: :one_for_one]
Supervisor.start_link(tree, opts)

So:

  1. Why is our application crashing?
  2. Is this expected?
  3. What if we call Poison.decode instead and then move jobs with bad JSON into a malformed key in Redis?
    • We can't mark them as dead through DeadSet without modifications, because we rely on proper encoding in DeadSet.add/3.
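
A minimal sketch of what point 3 could look like (hypothetical code, not Verk's implementation; the verk:malformed key name is made up):

defmodule MalformedJobExample do
  # Decode defensively and park bad payloads under a separate key
  # instead of crashing the WorkersManager.
  def parse_job(redis, payload) do
    case Poison.decode(payload, as: %Verk.Job{}) do
      {:ok, job} ->
        {:ok, job}

      {:error, _reason} ->
        # Keep the raw payload around for later inspection instead of losing it.
        Redix.command(redis, ["LPUSH", "verk:malformed", payload])
        :error
    end
  end
end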

Populate enqueued_at when enqueuing jobs

This needs to be done when we enqueue through Verk.enqueue and when jobs move from the scheduled set to the queues.

enqueued_at is just the Unix time of the moment we enqueued the job.
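
A tiny illustration of the idea (assuming the Verk.Job struct exposes an enqueued_at field, as this issue implies):

job = %Verk.Job{queue: :default, class: "ExampleWorker", args: [1, 2]}
# Stamp the job with the current Unix time right before enqueuing it.
Verk.enqueue(%{job | enqueued_at: DateTime.to_unix(DateTime.utc_now())})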

Why not mnesia/dets

I've just started with Elixir, and I wonder what your reasoning was against using a built-in Erlang solution like Mnesia or DETS for this instead of Redis?

I hope that doesn't sound harsh or offend you; this is a great library.

I just want some answers, as it's been on my mind for quite a bit.

Error Retries and Exponential Backoff

Hi,

It is not very apparent from the docs whether the retry mechanism implements any kind of exponential backoff. If not, I believe it would be a good addition. Thanks.
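
For reference, the feature list in the README above mentions a retry mechanism with exponential backoff, and the retry_at/2 callback allows a custom schedule. A Sidekiq-style exponential backoff expressed through retry_at/2 might look like this (a sketch, not necessarily Verk's exact default formula):

defmodule BackoffWorker do
  def perform(_arg), do: :ok

  # Roughly retry_count^4 seconds plus a growing random component.
  def retry_at(failed_at, retry_count) do
    failed_at + :math.pow(retry_count, 4) + 15 + :rand.uniform(30) * (retry_count + 1)
  end
end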

Change Verk.Log to use proper units

We are always logging seconds

[info] process_id=#PID<0.7160.0> NoOpWorker 4e500781cedd2f68fc13e1d6 done: 0 secs

It would be nice to see them in microseconds when less than 1 second. Basically, use the appropriate unit depending on the value.

Verk.Worker.current_job in test environment

This may be related to #48, but I am trying to test my worker that references Verk.Worker.current_job and get an error.

For example, with this worker:

defmodule ExportWorker do
  def perform(random_id) do
    job_id = Verk.Worker.current_job.jid
  end
end

and this test:

  test "performs" do
    result = ExportWorker.perform("abc123")
    assert result == :error
  end

I get this error message:

  1) test performs (ExportWorkerTest)
     test/workers/export_worker_test.exs:36
     ** (UndefinedFunctionError) undefined function :undefined.jid/0 (module :undefined is not available)
     stacktrace:
       :undefined.jid()
       (wombat_worker) ExportWorker.perform/2
       test/workers/export_worker_test.exs:37

Any ideas?

Possible to have an ordered queue?

My question is whether it's possible to process the incoming work in the order it arrives.
If the first job is failing, the other ones would have to wait in line for it to finish: only when the first job is finished would the second in line be processed, and so on.

Add credo

This will be useful to keep the code consistent.

https://github.com/rrrene/credo

Ideally we can hook PRs up to the Credo analysis.

We also need to discuss which rules we should follow, etc.

Change or allow configuration of log levels

The log level for "start" and "done" for each job is info. I would like to not have these logs in production, as we have a high volume of jobs.

I thought about lowering my production log level to warn, but I would like the phoenix request log messages, and they are also at info.

Two solutions that would work for me:

  1. Change start and done to debug. I'm not sure if this would be ideal for others. I would keep fail at info or change it to warn.
  2. Allow for configuring the level of each call (sketched below). I could also see someone wanting to log done and fail, but not start. This should be doable with Logger.log/3; however, the Logger docs recommend using the macros, as they can be optimized out at compile time.
  3. Another option that would be less ideal, but simple and would be an improvement for me, is to just change fail to log at warn.

I'd be happy to write up a pull-request if there's agreement on a direction.
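
For reference, option 2 maps onto the log-level configuration keys shown in the README configuration above; a sketch:

config :verk,
  start_job_log_level: :debug,
  done_job_log_level: :debug,
  fail_job_log_level: :warn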

Timex 3 support

Timex 3 is the latest major version, so Verk should probably support that. Opening this ticket to track that support.

Jobs with high priority (gap)

Is it possible to make it so that jobs from other queues are not started until jobs from the highest-priority queue are completed? Or to place a job at the top of the queue so that it runs with high priority, i.e. RPUSH instead of LPUSH in Verk.enqueue?

Replace "class" in Verk.Job struct

Because class sounds a bit object oriented, I guess it would be nice if it were replaced. I was thinking that module or worker would be a bit better.

Let me know what you think; maybe I can send a pull request with the changes.

How do jobs get fetched from Redis?

I'm a noob at Elixir and looking for a job queue that I can adapt to my needs, and your project looks good.
After a quick look at your source, I'm wondering where new jobs get fetched from Redis.
Is the fetching done via the

def handle_info(:timeout, state) do

function in the Verk.WorkersManager module?

Many thanks for explaining

Timex Version

What would be the impact of upgrading to Timex 2.0?

Handle case where Redis is down then up

I was playing with this, and saw an issue if we disconnect Redis and then reconnect it.

Once Redis is back up, I'm getting this error:

Failed to fetch retry set. Error: {:error, %Redix.Error{message: "NOSCRIPT No matching script. Please use EVAL."}}

It looks like the Lua scripts need to be reloaded. I wonder if there is a way to listen to disconnect/reconnect events from Redix.
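
A sketch of one possible workaround (a hypothetical helper, not Verk's code): retry an EVALSHA once after reloading the scripts when Redis answers NOSCRIPT.

defmodule NoscriptRetryExample do
  def evalsha_with_reload(redis, sha, keys, args) do
    command = ["EVALSHA", sha, Integer.to_string(length(keys))] ++ keys ++ args

    case Redix.command(redis, command) do
      {:error, %Redix.Error{message: "NOSCRIPT" <> _}} ->
        # The script cache was flushed (e.g. Redis restarted); reload and retry once.
        Verk.Scripts.load(redis)
        Redix.command(redis, command)

      result ->
        result
    end
  end
end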

Task :timeout

Hello, how can we configure the Verk task (worker) timeout?
We have an application with several tasks; all of them work fine, but some file transfers take longer than the task timeout and always fail.

How should we configure it?
We've tried workers_manager_timeout: 360000 but we still get timeouts.

[debug] Worker got down, reason: :timeout, [#Reference<0.0.6.1826>, #PID<0.760.0>]
[debug] Rumbl.VideoWorker 20668957681470793693 fail: 31 s
[error] Task #PID<0.782.0> started from #PID<0.760.0> terminating
** (stop) exited in: Task.Supervised.stream(30000)

Thanx

Configurable max retries

For myself, the use case is not wanting to retry certain types of jobs at all, and wanting to retry some others fewer times than the default.

I would be willing to add the feature if it is something that would be beneficial to have in Verk.

Add some random part to WorkersManager's default timeout

Right now it polls for new jobs every second. We should add some random jitter so that the queues do not all try to fetch at the same time and stay somewhat out of sync. Polling in lockstep could be a problem if someone is running hundreds of queues.
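
A minimal sketch of the idea (hypothetical, not Verk's actual code): add up to roughly 10% random jitter to the poll timeout so the queues drift apart over time.

defmodule JitterExample do
  def jittered_timeout(base_timeout) do
    # :rand.uniform/1 returns a value in 1..n, so the jitter is at most ~10%.
    base_timeout + :rand.uniform(max(div(base_timeout, 10), 1))
  end
end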

Generate stats per queue

Right now Verk does not use Redis to store stats about the jobs. We could easily "flush" data from the QueueStats handler to Redis keys.

We could keep track of how many failed and processed jobs there are per queue and also overall.
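
A hedged sketch of what that flush could look like (hypothetical key names, not necessarily what Verk implements):

defmodule StatsFlushExample do
  def flush(redis, queue, processed, failed) do
    Redix.pipeline!(redis, [
      ["INCRBY", "stat:processed:#{queue}", Integer.to_string(processed)],
      ["INCRBY", "stat:failed:#{queue}", Integer.to_string(failed)],
      ["INCRBY", "stat:processed", Integer.to_string(processed)],
      ["INCRBY", "stat:failed", Integer.to_string(failed)]
    ])
  end
end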

Association values are empty inside worker

Hi, here are a few parts of my code from a controller:

campaign = Repo.get!(Campaign, id) |> Repo.preload(group: :users)

Verk.enqueue(%Verk.Job{queue: :default, class: "EmailWorker", args: [campaign], max_retry_count: 5})

And here's what I get inside the worker when inspecting the campaign: args\":[{\"group\":{},\"

Group values are available inside the controller but empty inside the worker.

I also added
@derive {Poison.Encoder, only: [:name, :subject, :body, :group]} to Campaign model
@derive {Poison.Encoder, only: [:user]} to User model

Any advice, please?
Thanks

Have Verk start as a child of the user's supervision tree.

16:46 <jeregrine> is there any reason why I *shouldn't* manually start an application outside of my mix.exs? Say I want to make sure Repo is started before an application?
16:48 <ericmj> jeregrine: it sounds weird to have a dependency depend on your app
16:48 <jeregrine> ericmj: yea it is. We're using the worker queue "verk" and the worker starts jobs which require the Repo before the repo has started
16:48 <jeregrine> its kind of a race condition
16:49 <jeregrine> so whenever we boot we get a couple jobs that fail
16:49 <edgurgel> yeah we would need a way to β€œwait” till the system is up
16:49 <edgurgel> cause Verk itself holds the queues,workers.
16:49 <ericmj> maybe the verk supervisor should be started by the parent app
16:49 <edgurgel> so Verk starts and it tries to run the jobs as fast as possible
16:50 <edgurgel> yeah it would probably be a good solution

Questions around failure handling

Hello!

I've just discovered this library, and I'm interested in how it compares to https://github.com/akira/exq

There were a few things I was unable to determine from the documentation and a very quick scan of the codebase.

  • Is there a limit to how long a worker can take to process a job?
  • When a node goes down and does not recover, how is it determined which "in process" jobs need to be requeued, and how does the failure detection mechanism work?

It'd be great if the algorithm was documented so it's easier to make an informed decision when evaluating worker libs :)

Thanks,
Louis

Verk should be able to start even if Redis connection is not available

Several GenServers include operations in their init() functions that require a connection to Redis, for example:

def init(_) do
  {:ok, redis_url} = Application.fetch_env(:verk, :redis_url)
  {:ok, redis} = Redix.start_link(redis_url)
  Verk.Scripts.load(redis)
  # ...

def init([queue_name]) do
  node_id = Application.get_env(:verk, :node_id, "1")
  {:ok, redis_url} = Application.fetch_env(:verk, :redis_url)
  {:ok, redis} = Redix.start_link(redis_url)
  Verk.Scripts.load(redis)
  # ...

This results in these GenServers failing to start when the Redis connection is not available. The effect is that if an application adds Verk.Supervisor to its supervision tree, the application will not be able to start if Redis is not available.

This doesn't seem to be what we would want in most cases. Instead, I'd expect my application to be able to handle Redis connection retries in case there was a temporary connection disruption during startup.

Possible solution

We could move function calls that rely on the Redis connection to be present out of init() and instead do that work in a callback. This would allow for the work to fail and the GenServer to still start.

For example Verk.QueueManager.init would look something like:

  def init([queue_name]) do
    node_id = Application.get_env(:verk, :node_id, "1")
    Process.send_after(self(), :startup, 0)

    state = %State{queue_name: queue_name, redis: nil, node_id: node_id}

    {:ok, state}
  end

  def handle_info(:startup, state) do
    {:ok, redis_url} = Application.fetch_env(:verk, :redis_url)
    {:ok, redis} = Redix.start_link(redis_url)
    Verk.Scripts.load(redis)

    state = %{state | redis: redis}
    Logger.info "Queue Manager started for queue #{state.queue_name}"
    {:noreply, state}
  end

Error reporting

We need to find a way to report errors that happened within a job so the user can define/use their own error reporting system (Airbrake, Raygun, etc.).

My initial idea was to create events using GenEvent, one of them being "error happened". The same GenEvent could be used to handle metrics as well.

Priority job queue

Having read the documentation, I got the impression that the job queue the workers receive jobs from is FIFO. In many applications, jobs have a priority associated with them. If I wanted to make the job queue type configurable (FIFO, priority queue), how much design change in Verk do you think one would have to make?

Rethink how jobs are managed

Right now the WorkersManager does a couple of tasks:

  • Ask for jobs to be done through the QueueManager;
  • Message the workers to perform a job;
  • Receive feedback from workers related to a job (done or failure messages);
  • Track jobs that are being done in an ETS table;
  • Monitor workers so we know if a job failed because a worker simply died;
  • Acknowledge or schedule for retry through the QueueManager;
  • Clean up the inprogress list of jobs that were left hanging from a previous execution.

This issue is a starting point to discuss whether and how we can split these concerns into different processes while keeping the same feature set we have now.

cc/ @mitchellhenke

Provide a way to read the job metadata

It would be interesting if we could provide a way to fetch metadata about the job so that the worker could know which queue and job_id it is working on right now.

My initial idea is to use the process dictionary, as it's just metadata related to the worker (process) and it will be cleaned up when the job is done or failed. It feels like the perfect use case!

Still need some thought around the API. Worker.metadata?
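
A small sketch of the process-dictionary idea (a hypothetical API, not something Verk ships):

defmodule WorkerMetadataExample do
  @key :verk_job_metadata

  # The worker process would call this right before running perform/N.
  def put(job), do: Process.put(@key, job)

  # Called from inside perform/N to inspect the current job (queue, jid, ...).
  def current, do: Process.get(@key)

  # Called once the job is done or has failed.
  def clear, do: Process.delete(@key)
end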

Utilize Redis Connection Outside Verk

I'm fairly new to Elixir so there might be an obvious answer to this, but is there an easy way to reuse the Redix connection that Verk opens up elsewhere in the code?
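
One straightforward approach is to not reuse Verk's internal connections at all and start a separate, named Redix connection for application code; a sketch, assuming :redis_url is configured as a plain URL string:

# Somewhere in your own supervision tree or startup code.
{:ok, _pid} = Redix.start_link(Application.fetch_env!(:verk, :redis_url), name: :app_redis)
Redix.command!(:app_redis, ["PING"])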

Comparison to Exq

Hey!

More of a question than anything else: could you comment a bit on the differences between this library and exq?

Thanks!

Allow graceful restarts of Verk applications

For context: I'm running my Verk-backed application in Kubernetes. When I do a deploy Kubernetes issues a SIGTERM signal, waits 30 seconds, then issues a SIGKILL. There is also an option to run a custom script as a hook before issuing the SIGTERM.

I don't believe the Erlang VM is handling the Unix signals in any special way; I think that eventually Verk is brutally killed, leaving any in-progress jobs in a potentially weird state. I'd like to have some way to gracefully stop Verk, perhaps by stopping the dequeuing of jobs from Redis and letting it finish any in-progress jobs.

Does Verk have anything right now to support this? Can we somehow leverage GenServer callbacks to accomplish this?
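
A hedged sketch tying this to the :shutdown_timeout setting mentioned in the Kubernetes example above (assuming the value is in milliseconds): keep it below the SIGTERM grace period so in-progress jobs get a chance to finish.

config :verk,
  shutdown_timeout: 25_000 # assumed milliseconds; leaves headroom inside a 30 s grace period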

Retry now from web ui

I think the option to "retry now" selected failed jobs would be useful - i.e. to run failed jobs after deploying a fixed codebase.

Thoughts?
