
qless-core's Introduction

Qless Core


This is the set of all the lua scripts that comprise the qless library. We've begun migrating away from the system of having one lua script per command to a more object-oriented approach where all code is contained in a single unified lua script.

There are a few reasons for making this choice, but essentially it was getting too difficult to maintain and there was a lot of duplicated code in different sections. This also happens to have the added benefit of allowing you to build on top of the qless core library within your own lua scripts through composition.

Building

For ease of development, we've broken this unified file into several smaller submodules that are concatenated together into a qless.lua script. These are:

  • base.lua -- forward declarations and some uncategorized functions
  • config.lua -- all configuration interactions
  • job.lua -- the regular job class
  • queue.lua -- the queue class
  • recurring.lua -- the recurring job class
  • worker.lua -- manage available workers
  • api.lua -- exposes the interfaces that the clients invoke; a very thin wrapper around these classes

In order to build up the qless.lua script, we've included a simple Makefile, though all it does is cat these files together in a particular order:

make qless.lua

If you'd like to use just the core library within your own lua script, you can build a lua script that contains all the classes but none of the wrapping layer that the qless clients use:

make qless-lib.lua

Testing

Historically, tests have appeared only in the language-specific bindings of qless, but that has become a tedious process, not to mention a steep barrier to entry for writing new clients. In light of that, we now include tests directly in qless-core, written in python. To run these, you will need python and the nose and redis libraries. If you have pip installed:

pip install redis nose

To run the tests, there is a directive included in the makefile:

make test

If you have Redis running somewhere other than localhost:6379, you can supply the REDIS_URL environment variable:

REDIS_URL='redis://host:port' make test

Conventions

No more KEYS

When originally developing this, I wrote some functions using the KEYS portion of the lua scripts, but eventually realized that doing so didn't make any sense. For just about all operations there's no way to determine a priori which Redis keys will be touched, and so I abandoned that idea. In many cases there were still vestigial KEYS in use, but that has now changed. No more KEYS!

Time, Time Everywhere

To ease the client logic, every command now takes a timestamp with it. In many cases this argument is ignored, but it is still required to make a valid call. This requirement applies only to the exposed script API, not to the class interface; at the class level, only the functions that actually need the now argument list it.

Documentation

The documentation of the code lives in each of the modules, but it is stripped from the production script to reduce its size.

Features and Philosophy

Locking

A worker is given an exclusive lock on a piece of work when it is given that piece of work. That lock may be renewed periodically, so long as the renewal happens before the provided 'heartbeat' timestamp. Likewise, the job may be completed.

If a worker attempts to heartbeat a job, it may optionally provide an updated JSON blob to describe the job. If the job has been given to another worker, the heartbeat should return false and the worker should yield.

When a worker attempts to heartbeat, the lua script should check whether the worker attempting to renew the lock is the same worker that currently owns it. If so, the lock's expiration is pushed back accordingly and the updated expiration is returned. If not, an exception is raised.
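
The real implementation lives in job.lua; purely as an illustration of the check described above, here is a minimal sketch (the helper name is hypothetical, and the key names follow the conventions documented under "Internal Redis Structure" below):

-- Hypothetical sketch of the heartbeat check; 'expiration' is supplied by
-- the caller, and key names follow ql:j:<jid> and ql:q:<name>-locks.
local function heartbeat(jid, worker, expiration)
  local owner = redis.call('hget', 'ql:j:' .. jid, 'worker')
  if owner ~= worker then
    -- Some other worker owns the lock now; raise an exception
    error('Heartbeat(): job ' .. jid .. ' is not owned by ' .. worker)
  end
  local queue = redis.call('hget', 'ql:j:' .. jid, 'queue')
  -- Push the lock's expiration back and return the updated expiration
  redis.call('hset', 'ql:j:' .. jid, 'expires', expiration)
  redis.call('zadd', 'ql:q:' .. queue .. '-locks', expiration, jid)
  return expiration
end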

Stats

Qless also collects statistics for job wait time (time popped - time put) and job completion time (time completed - time popped). By 'statistics', I mean the average, variance, count and a histogram. Stats for the number of failures and retries for a given queue are also available.

Stats are grouped by day. In the case of job wait time, its stats are aggregated on the day when the job was popped. In the case of completion time, they are grouped by the day it was completed.

Tracking

Jobs can be tracked, which just means that they are accessible and displayable. This can be useful if you just want to keep tabs on the progress of jobs through the pipeline. All the currently-tracked jobs are stored in a sorted set, ql:tracked.

Failures

Failures are stored in such a way that we can quickly summarize the number of failures of a given type, but also which items have succumbed to that type of failure. With that in mind, there is a Redis set, ql:failures, whose members are the names of the various failure lists. Each type of failure then has its own list of instance ids that encountered such a failure. For example, we might have:

ql:failures
=============
upload error
widget failure

ql:f:upload error
==================
deadbeef
...
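
As a rough sketch of how a failure of a given group might be recorded against these keys (the helper name is hypothetical; the real logic lives in job.lua):

-- Hypothetical sketch: remember the failure group and the jid that hit it
local function record_failure(group, jid)
  redis.call('sadd', 'ql:failures', group)
  redis.call('lpush', 'ql:f:' .. group, jid)
end

-- e.g. record_failure('upload error', 'deadbeef')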

Worker Data

We'll keep a sorted set of workers sorted by the last time they had any activity. We'll store this set at ql:workers.

In addition to this list, we'll keep a set of the jids that a worker currently has locks for at ql:w:<worker>:jobs. This should be sorted by the time when we last saw a heartbeat (or pop) for that worker from that job.
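
A minimal sketch of that bookkeeping, assuming it runs on every pop and heartbeat (the helper name is hypothetical):

-- Hypothetical sketch: called on every pop or heartbeat for a worker
local function touch_worker(worker, jid, now)
  -- Most recent activity for this worker
  redis.call('zadd', 'ql:workers', now, worker)
  -- The jobs this worker holds locks for, scored by when we last heard about them
  redis.call('zadd', 'ql:w:' .. worker .. ':jobs', now, jid)
end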

TBD We will likely store data about each worker. Perhaps this, too, can be kept by day.

Job Data Deletion

We should delete data about completed jobs periodically. We should prune both by the policy for the maximum number of retained completed jobs, and by the maximum age for retained jobs. To accomplish this, we'll use a sorted set to keep track of which items should be expired. This set should be stored in the key ql:completed.
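
A minimal sketch of that pruning, assuming ql:completed is scored by completion time and the limits come from the jobs-history and jobs-history-count options listed below (the helper name is hypothetical):

-- Hypothetical sketch of pruning completed jobs by age and by count
local function prune_completed(now)
  local max_age   = tonumber(redis.call('hget', 'ql:config', 'jobs-history')) or 7 * 24 * 60 * 60
  local max_count = tonumber(redis.call('hget', 'ql:config', 'jobs-history-count')) or 50000
  -- Jobs completed before the cutoff, plus everything beyond the newest max_count
  local expired = redis.call('zrangebyscore', 'ql:completed', 0, now - max_age)
  local excess  = redis.call('zrange', 'ql:completed', 0, -(max_count + 1))
  for _, jid in ipairs(expired) do
    redis.call('zrem', 'ql:completed', jid)
    redis.call('del', 'ql:j:' .. jid)
  end
  for _, jid in ipairs(excess) do
    redis.call('zrem', 'ql:completed', jid)
    redis.call('del', 'ql:j:' .. jid)
  end
end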

Configuration Options

The configuration should go in the key ql:config, and here are some of the configuration options that qless is meant to support:

  1. heartbeat (60) -- The default heartbeat in seconds for queues
  2. stats-history (30) -- The number of days to store summary stats
  3. histogram-history (7) -- The number of days to store histogram data
  4. jobs-history-count (50k) -- How many jobs to keep data for after they're completed
  5. jobs-history (7 * 24 * 60 * 60) -- How many seconds to keep jobs after they're completed
  6. heartbeat-<queue name> -- The heartbeat interval (in seconds) for a particular queue
  7. max-worker-age -- How long before workers are considered disappeared
  8. <queue>-max-concurrency -- The maximum number of jobs that can be running in a queue. If this number is reduced, it does not impact any currently-running jobs
  9. max-job-history -- The maximum number of items in a job's history. This can be used to help control the size of long-running jobs' history
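
Given the options above, here is a minimal sketch of reading them with their defaults (the helper names are hypothetical; the real logic lives in config.lua):

-- Hypothetical sketch: all options live in the ql:config hash
local DEFAULTS = {
  ['heartbeat']          = 60,
  ['stats-history']      = 30,
  ['histogram-history']  = 7,
  ['jobs-history-count'] = 50000,
  ['jobs-history']       = 7 * 24 * 60 * 60
}

local function config_get(key)
  -- hget returns false for a missing field, so fall through to the default
  return redis.call('hget', 'ql:config', key) or DEFAULTS[key]
end

-- The per-queue heartbeat overrides the global one when present
local function queue_heartbeat(queue)
  return tonumber(config_get('heartbeat-' .. queue) or config_get('heartbeat'))
end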

Internal Redis Structure

This section describes the internal structure and naming conventions.

Jobs

Each job is stored primarily in a key ql:j:<jid>, a Redis hash, which contains most of the keys that describe the job. A set (possibly empty) of jids on which this job depends is stored in ql:j:<jid>-dependencies. A set (also possibly empty) of jids that rely on the completion of this job is stored in ql:j:<jid>-dependents. For example, ql:j:<jid>:

{
	# This is the same id as identifies it in the key. It should be
	# a hex value of a uuid
	'jid'         : 'deadbeef...',

	# This is a 'type' identifier. Clients may choose to ignore it,
	# or use it as a language-specific identifier for determining
	# what code to run. For instance, it might be 'foo.bar.FooJob'
	'type'        : '...',

	# This is the priority of the job -- lower means more priority.
	# The default is 0
	'priority'    : 0,

	# This is the user data associated with the job. (JSON blob)
	'data'        : '{"hello": "how are you"}',

	# A JSON array of tags associated with this job
	'tags'        : '["testing", "experimental"]',

	# The worker ID of the worker that owns it. Currently the worker
	# id is <hostname>-<pid>
	'worker'      : 'ec2-...-4925',

	# This is the time when it must next check in
	'expires'     : 1352375209,

	# The current state of the job: 'waiting', 'pending', 'complete'
	'state'       : 'waiting',

	# The queue that it's associated with. 'null' if complete
	'queue'       : 'example',

	# The maximum number of retries this job is allowed per queue
	'retries'     : 3,
	# The number of retries remaining
	'remaining'   : 3,

	# The jids that depend on this job's completion
	'dependents'  : [...],
	# The jids that this job is dependent upon
	'dependencies': [...],

	# A list of all the things that have happened to a job. Each entry has
	# the keys 'what' and 'when', but it may also have arbitrary keys
	# associated with it.
	'history'   : [
		{
			'what'  : 'Popped',
			'when'  : 1352075209,
			...
		}, {
			...
		}
	]
}

Queues

A queue is a priority queue and consists of four parts:

  1. ql:q:<name>-scheduled -- sorted set of all scheduled job ids
  2. ql:q:<name>-work -- sorted set (by priority) of all jobs waiting
  3. ql:q:<name>-locks -- sorted set of job locks and expirations
  4. ql:q:<name>-depends -- sorted set of jobs in a queue, but waiting on other jobs

When looking for a unit of work, the client should first choose from the next expired lock. If none are expired, we should next make sure that any jobs that are now eligible (their scheduled time is in the past) are inserted into the work queue. A sorted set of all the known queues is maintained at ql:queues. Currently we keep it sorted by the time when we first saw the queue, but that's a little at odds with only keeping queues around while they're being used.

When a job is completed, it removes itself as a dependency of all the jobs that depend on it. If it was the last job that a dependent job was waiting on, that dependent job is then inserted into its queue's work set.
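
Purely as an illustration of that ordering, here is a minimal sketch of selecting the next jid from one queue (the helper name is hypothetical; the actual logic in queue.lua also handles max-concurrency, job state transitions, and so on):

-- Hypothetical sketch of the selection order described above, for one queue
local function next_jid(queue, now)
  local prefix = 'ql:q:' .. queue
  -- 1) Prefer jobs whose locks have expired
  local expired = redis.call('zrangebyscore', prefix .. '-locks', 0, now, 'LIMIT', 0, 1)
  if expired[1] then return expired[1] end
  -- 2) Move any newly-eligible scheduled jobs into the work queue
  local ready = redis.call('zrangebyscore', prefix .. '-scheduled', 0, now)
  for _, jid in ipairs(ready) do
    local priority = redis.call('hget', 'ql:j:' .. jid, 'priority') or 0
    redis.call('zadd', prefix .. '-work', priority, jid)
    redis.call('zrem', prefix .. '-scheduled', jid)
  end
  -- 3) Take the highest-priority waiting job
  local work = redis.call('zrevrange', prefix .. '-work', 0, 0)
  return work[1]
end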

Stats

Stats are grouped by day and queue. The day portion of the stats key is an integer timestamp of midnight for that day:

<day> = time - (time % (24 * 60 * 60))

Stats are stored under two hashes, ql:s:wait:<day>:<queue> and ql:s:run:<day>:<queue>, for wait times and completion times respectively. Each has the keys:

  • total -- The total number of data points contained
  • mean -- The current mean value
  • vk -- Not the actual variance, but a quantity from which the variance can be computed in a numerically stable, streaming fashion (see the sketch after this list)
  • s1, s2, ..., -- second-resolution histogram counts for the first minute
  • m1, m2, ..., -- minute-resolution for the first hour
  • h1, h2, ..., -- hour-resolution for the first day
  • d1, d2, ..., -- day-resolution for the rest
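
A minimal sketch of the streaming update for total, mean and vk (Welford's method); the helper name is hypothetical, 'key' is one of the wait/run hashes above, and the histogram keys are omitted:

-- Hypothetical sketch of the streaming mean/variance update
local function update_stats(key, value)
  local total = (tonumber(redis.call('hget', key, 'total')) or 0) + 1
  local mean  =  tonumber(redis.call('hget', key, 'mean'))  or 0
  local vk    =  tonumber(redis.call('hget', key, 'vk'))    or 0
  local delta = value - mean
  mean = mean + delta / total
  vk   = vk + delta * (value - mean)
  redis.call('hmset', key, 'total', total, 'mean', mean, 'vk', vk)
  -- variance = vk / total (or vk / (total - 1) for the sample variance)
end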

There is also another hash, ql:s:stats:<day>:<queue>, with keys:

  • failures -- This is how many failures there have been. If a job is run twice and fails both times, this is incremented twice.
  • failed -- This is how many are currently failed
  • retries -- This is how many jobs we've had to retry

Tags

Every job stores a JSON array of the tags associated with it. In addition, the key ql:t:<tag> stores a sorted set of all the jobs associated with that particular tag. The score of each jid in that set is the time when that tag was added to the job. Tagging a job a second time with a tag it already has is a no-op.
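
A minimal sketch of that no-op behavior on re-tagging (the helper name is hypothetical, and the update of the job's own JSON tag list is omitted):

-- Hypothetical sketch of tagging a job with a tag at time 'now'
local function tag_job(jid, tag, now)
  -- Re-tagging with an existing tag is a no-op: keep the original timestamp
  if not redis.call('zscore', 'ql:t:' .. tag, jid) then
    redis.call('zadd', 'ql:t:' .. tag, now, jid)
  end
end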

Implementing Clients

There are a few nuanced aspects of implementing bindings for your particular language that are worth bringing up. The canonical example for bindings should be the python and ruby bindings.

Structure

We recommend using git submodules to keep qless-core in your bindings.

Testing

The majority of tests are implemented in qless-core, and so your bindings should merely test that they provide sensible access to the functionality. This should include a notion of queues, workers, jobs, etc.

Running the Worker

If your language supports dynamic importing of code, and in particular if a class can be imported deterministically from a string identifier, then you should include a worker executable with your release. For example, in Python, given the class foo.Job, that string is enough to know what module to import. As such, a worker binary can just be given a list of queues, a number (and perhaps type) of workers, wait intervals, etc., and then can import all the code required to perform work.

Timestamps

Jobs with identical priorities are popped in the order they were inserted. The caveat is that it's only true to the precision of the timestamps your bindings provide. For example, if you provide timestamps to the second granularity, then jobs with the same priority inserted in the same second can be popped in any order. Timestamps at the thousandths of a second granularity will maintain this property better. While for most applications it's likely not important, it is something to be aware of when writing language bindings.

Filesystem Access

It's intended to be a common use case that bindings provide a worker script or binary that runs several worker subprocesses. These should run with their working directory as a sandbox.

Forking Model

There are a couple of philosophies regarding how to best fork processes to do work. Certainly, there should be a parent process that manages child processes, if for no other reason than to ensure that child workers are well-behaved. But how exactly the child processes work is less clear. We encourage you to make all models available in your client:

  • Fork once for each job -- This has the added benefit of containing potential issues like resource leaks, but it comes at the potentially high cost of forking once for each job.
  • Fork long-running processes -- Forking long-running processes means that you will likely be able to saturate the CPUs on a machine more easily, and it reduces the per-job cost of forking.
  • Coroutines in long-running processes -- Especially for I/O-bound processes this is handy, since you can keep the number of processes relatively small and still get good I/O parallelism.

Each style of worker should be able to listen for worker-specific lock_lost, canceled and put events, each of which can signal that a worker has lost its right to process a job. If that's discovered, a parent process could take the opportunity to stop the child worker that's currently running that job (if it exists). While qless ensures correct behavior when taking action on jobs where a lock has been lost, this is an opportunity to gain efficiency.

Queue Popping Order

Workers are allowed (and encouraged) to pop off of more than one queue. But then we get into the question of what order the queues should be polled in. Workers should support two modes of popping: ordered and round-robin. Consider queues A, B, and C with job counts:

A: 5
B: 2
C: 3

In the ordered version, the order in which the queues are specified determines the order in which jobs are popped. For example, if our queues were ordered C, B, A in the worker, we'd pop jobs off:

C, C, C, B, B, A, A, A, A, A

In the round-robin implementation, a worker pops a job off of each queue as it progresses through all the queues:

C, B, A, C, B, A, C, A, A, A
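
A rough, client-side sketch of the two modes, assuming a pop(queue) helper provided by your binding (the function names are hypothetical; Lua is used here only for consistency with the rest of this document):

-- Hypothetical ordered popping: always try queues in the configured order
local function pop_ordered(queues)
  for _, queue in ipairs(queues) do
    local job = pop(queue)
    if job then return job end
  end
end

-- Hypothetical round-robin popping: start from the queue after the one
-- that last yielded a job
local last = 0
local function pop_round_robin(queues)
  for i = 1, #queues do
    local idx = ((last + i - 1) % #queues) + 1
    local job = pop(queues[idx])
    if job then last = idx; return job end
  end
end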

Internal Style Guide

These aren't meant to be stringent; they're just to keep myself sane, so that when moving between different chunks of code it's all formatted similarly and the same variable names have the same meaning.

  1. Parameter sanitization should be performed as early as possible. This includes making use of assert and error based on the number and type of arguments.
  2. Job ids should be referred to as jid, both internally and in the clients.
  3. Failure types should be described with group. I'm not terribly thrilled with the term, but I thought it was better than 'kind.' After spending some time with a thesaurus, I didn't find anything that appealed to me more.
  4. Job types should be described as klass (nod to Resque), because both 'type' and 'class' are commonly used in languages.


qless-core's Issues

Provide a means for pausing workers

I talked this over with @dlecocq. Documenting here so we don't forget the conversation:

Dan suggested implementing rate limiting as a more flexible, useful mechanism that could also be used to pause workers (i.e. by setting the rate to 0 for a particular queue). Rate limiting is a much larger, more complex feature, though, so we decided that for now we can implement a simple pause/unpause API, and if/when we add rate limiting in the future, the implementation of that API can change to use rate limiting underneath.

Suggested API:

pause_queues("q1", "q2")
unpause_queues("q1", "q2")

Do we plan to have a function to delete a queue?

Hi! We need to dynamically create and delete queues. Are we going to support this functionality? If not, is there any guidance on how we could do it with our own lua script?

More precisely, we are trying to solve the problem that some jobs in a queue take a very long time and some take a short time.
Here is the situation:
Our jobs are grouped in such a way that we do not want the groups to interact with each other. Groups are created dynamically and can go away forever. Currently, we are trying to associate each group with its own queue. However, the number of groups is not fixed, so if groups are created and then go away forever, we have to delete those queues to avoid a leak.

Thanks a lot!

Provide a config option to never fail a job with a `Job exhausted retries in queue` error

We've opted to use a ruby qless middleware to handle retries rather than doing it in the qless-core scripts:

https://github.com/seomoz/qless/blob/master/lib/qless/middleware/retry_exceptions.rb

This is preferable for us because "Job exhausted retries in queue" gives no backtrace or indication of what the original failure was -- so it hides the source of the problem from us. By doing it in ruby, we get the full backtrace and can troubleshoot the problem. In addition, if we call Job#retry, we want it to retry: we don't want qless-core failing it instead with this error.

Currently we're running qless-core and qless branches that have this feature disabled but we don't want to stay on a branch as it makes getting updates hard:

https://github.com/seomoz/qless/tree/kill_exhausted_retries_error
https://github.com/seomoz/qless-core/tree/myron-disable-exhausted-retries-error

Lua Script Error on Redis 3.2

On Redis 3.2, this error breaks qless:

Redis::CommandError - ERR Error running script (call to f_3f9682e7ddb462dca8c60a26d5e88ac70c3a49e9): @user_script:925: @user_script: 925: Lua redis() command arguments must be strings or integers :
  redis (3.3.0) lib/redis/client.rb:121:in `call'
  newrelic_rpm (3.15.1.316) lib/new_relic/agent/instrumentation/redis.rb:42:in `block in call'
  newrelic_rpm (3.15.1.316) lib/new_relic/agent/datastores.rb:111:in `block in wrap'
  newrelic_rpm (3.15.1.316) lib/new_relic/agent/method_tracer.rb:73:in `block in trace_execution_scoped'
  newrelic_rpm (3.15.1.316) lib/new_relic/agent/method_tracer_helpers.rb:82:in `trace_execution_scoped'
  newrelic_rpm (3.15.1.316) lib/new_relic/agent/method_tracer.rb:71:in `trace_execution_scoped'
  newrelic_rpm (3.15.1.316) lib/new_relic/agent/datastores.rb:108:in `wrap'
  newrelic_rpm (3.15.1.316) lib/new_relic/agent/instrumentation/redis.rb:41:in `call'
  redis (3.3.0) lib/redis.rb:2394:in `block in _eval'
  redis (3.3.0) lib/redis.rb:58:in `block in synchronize'
  /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
  redis (3.3.0) lib/redis.rb:58:in `synchronize'
  redis (3.3.0) lib/redis.rb:2393:in `_eval'
  redis (3.3.0) lib/redis.rb:2445:in `evalsha'
   () Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/bundler/gems/qless-e5615c39eaff/lib/qless/lua_script.rb:44:in `_call'
   () Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/bundler/gems/qless-e5615c39eaff/lib/qless/lua_script.rb:26:in `block in call'
   () Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/bundler/gems/qless-e5615c39eaff/lib/qless/lua_script.rb:49:in `handle_no_script_error'
   () Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/bundler/gems/qless-e5615c39eaff/lib/qless/lua_script.rb:25:in `call'
   () Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/bundler/gems/qless-e5615c39eaff/lib/qless.rb:202:in `call'
   () Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/bundler/gems/qless-e5615c39eaff/lib/qless/queue.rb:108:in `put'

ql:tracked set is leaking canceled jobs

I just noticed that the ql:tracked set in my production Qless installation contained ~450,000 entries, while there are fewer than 8,000 jobs this Qless installation is still aware of.

Digging into it a little, it seems to me jids are only removed from the ql:tracked key when they are explicitly untracked. The jid is not removed from the ql:tracked set when the job is canceled. This leads to an ever-growing ql:tracked sorted set, which breaks the qless webui as soon as the tracked lua script takes longer than 5 seconds to loop through the set.

I'm still running on qless 0.9.3, but having a quick glimpse at the current master of qless-core, it seems to me this problem still exists.

Move job failure messages into separate, hash-based keys

qless-core currently stores a job's failure message as part of its data. For most worker types, this means all failed jobs have exception backtraces and messages in their job data. We've seen thousands of jobs fail with identical failure messages, which bloats our memory usage considerably. Additionally, these keys never expire, so the only way to prevent running out of memory is to handle these errors manually.

I discussed this with @myronmarston, and we came up with a few changes we can make to solve this problem:

  • Move job failure messages into separate keys with their own expirations. That way, jobs that fail won't be dropped on the floor, but they won't use up nearly as much memory indefinitely.
  • Key the job failure messages by the hash of the message. This handles the duplicate error message issue.

What do you think, @dlecocq?

max_retries seems a bit restrictive, design-wise

If I want a progressive delay logic, i.e. incrementally increasing delay time, I need to do this myself in the worker when calling retry(). I might want to limit the retry by time (30 days) rather than count.

Given that, it doesn't make sense to me that the worker must worry about the producer not queuing up the job with the correct number of retries. I basically say max_retries=99999 now, which is ugly.

What I'd like to see is the retry() command gaining an option that allows users to skip the default max_retries logic, and allow the worker to decide if the retries have been exhausted.

Use the vararg expression instead of the `arg` table.

The vararg expression was introduced in Lua 5.1.

The vararg system changed from the pseudo-argument `arg` (a table holding the extra arguments) to the vararg expression (`...`).

Since Lua 5.2, `arg` is no longer available (at least without compatibility compile flags).
Instead, it's better to use the vararg expression, like this:

-- e.g. instead of
function f1(a, ...) print(unpack(arg)) end
-- write
function f2(a, ...) print(a, ...) end

If you need to handle the varargs as an array, you can do:

function f1(...)
  -- do not use `arg` as the name for this variable
  local argv, argc = {...}, select('#', ...)
  for i = 1, argc do
    -- handle argv[i]
  end
end

Unable to use the library

I am trying to use qless-core
local queue = require "qless"
But right after that, I get the following error,
./qless.lua:2009: attempt to get length of global 'KEYS' (a nil value)
It looks like I am not initializing the library correctly? Is there any other way to initialize?

Thanks.

Consider truncating history as it grows

We have some old jobs that have been bouncing around our staging environment for a while, and we've retried them repeatedly with larger sets of failed jobs. As a result, they've got over 17K events in their history. This causes memory bloat -- the history for this job is using up ~ 2 MB of redis memory.

While you can argue that we shouldn't keep retrying the job (and you'd be right) it would also be nice if Qless protected against the history becoming too bloated. What do you think about this?

  • Have a max-retained-history-events config setting that defaults to something reasonable (say, 50 or 100).
  • When adding to the history, truncate as needed to keep it under that. Ideally it would always keep the first event plus the last (max - 1) events.
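
As a hedged sketch of the truncation proposed above, operating on the decoded history array (the helper name is hypothetical, and 'max' would come from the suggested config option):

-- Hypothetical sketch: keep the first event plus the last (max - 1) events
local function truncate_history(history, max)
  if #history <= max then return history end
  local kept = { history[1] }
  for i = #history - (max - 2), #history do
    kept[#kept + 1] = history[i]
  end
  return kept
end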

Support throttling on arbitrary resource names

Currently Qless supports throttling at a per-queue level. We have a need to do throttling on an arbitrary named resource (in our case, a MySQL host in our shard ring). To prevent our MySQL hosts from getting overloaded, we've set a hard connection limit of 30 connections for our shard-building jobs. We rescue and retry "too many connections" errors, but it would be more efficient if we could set a max concurrency per host, without having to put jobs in a per-host queue.

So...here's an idea for how we could refactor the current concurrency throttling to be more general:

  • Each job can have a set of named throttlable resources. When enqueing a job you can specify a list of these: queue.put(MyJobClass, { data: 15 }, throttlable_resources: ['foo', 'bar']).
  • The queue name and klass name are implicitly included in the list of throttlable resources, but not actually stored in the redis set qless-core will use for this. (The internal QlessJob#throttlable_resources qless-core API will take care of adding the queue and klass names to this list when things request the throttlable resources).
  • qless-core will maintain a set of counters for each named resource, indicating the current number of jobs that have the named resource in their list of throttlable_resources. In Pop() it will increment the counter for each throttlable resource of the popped job.
  • When a job completes, fails or times out, it will decrement the counters for each throttlable resource.
  • In Pop() it will also check that a potentially popped job's throttlable resources all have available capacity by looking at the counters. If any of the counters are full, it won't pop that job, moving on in the queue to the next job.
  • We might consider using sets of jids (rather than counters) for each throttlable resource, as the set of jids gives us more information: it tells us what all the jobs that are using that resource are. scard can be used to get the count in O(1) time.
  • qless-core would provide a way to set limits on these throttlable resources, potentially using its config API.

In our use case, we would use MySQL host names as our throttlable resources. This could supersede the existing per-queue throttling (as a queue name would be an implicit throttled resource, this could easily support that use case). It would also nicely support per-job-class throttling.

Thoughts, @dlecocq?

/cc @proby

Documentation regarding priority is incorrect

The documentation indicates that a lower value means more priority, which is incorrect according to the implementation.

Items are added to a sorted set using the following formula, where priority defaults to 0:

score = priority - (now / 10000000000)

Using current unix timestamps means items are typically scored with small negative numbers like -0.1383766174, getting smaller over time. Items are popped off using ZREVRANGE, which sorts from largest to smallest. If we set a negative priority, the item will not be fetched until the queue is otherwise empty, given that even a priority of -1 makes the score considerably smaller than the default values.
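
A worked example of that scoring (illustrative numbers only):

-- score = priority - (now / 10000000000), with now = 1383766174:
--   priority  0  ->  score =  0 - 0.1383766174 = -0.1383766174
--   priority  1  ->  score =  1 - 0.1383766174 =  0.8616233826  (popped first by ZREVRANGE)
--   priority -1  ->  score = -1 - 0.1383766174 = -1.1383766174  (popped only once the rest are gone)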

Clarification on how scheduled dates are calculated

queue_obj.scheduled.add(now + delay, self.jid)

I want to confirm that this line indicates that the scheduled date is calculated as the current time plus the number of seconds of delay, resulting in a future date.

I am looking at adding the ability to schedule a job based on a future date/time. From the looks of it, this is "technically" already supported: I just need to pre-calculate the difference between now and the future date in seconds and submit that as the delay.

(Is there also persistence for these future dates even if the whole stack is shut down for a period of time?)

Is my thinking correct on this?

Thanks

this is related to: seomoz/qless#185

Consider adding a MessagePack option (instead of JSON) for storing complex data

Qless currently uses JSON to store job history, data, tag lists, and failure information. Since Redis exposes a MessagePack library to EVALed Lua scripts, it would be nice to have an option to use it for space-saving reasons. Adding it as an option lets users decide if they want to save space with msgpack or be able to browse their redis keys easily without using the qless-core API.

When jobs timeout the history does not reflect that

I've noticed that in pop.lua, when a worker loses its lock on a job, the job is given to another worker and the history is not updated to reflect the original timeout. The history is meant to reflect what happened to the job, and it timing out is an important event that should be included, IMO.

@dlecocq -- is this by design or a simple oversight?

Jobs that exhaust retries don't publish to `failed` channel

If a worker calls retry on a job that has exhausted its retries, the job is marked as failed but never emits a message on the failed pubsub channel.

This means that workers cannot simply attempt to retry and rely on the system to do the right thing - they have to check if the job has any retries remaining and fail it if not.

Resolving dependencies on job failures

From what I can tell, if a job fails, none of its dependent jobs get triggered. Is this expected behavior? If so, it would be nice if the behavior could be configured so that I can depend on a job completing, be it successful or not.

Allow `top tags` API to return all tags

Original:

We see the tags in redis and in the ui, but when we call client.call("tag", "top", offset, count) we always get an empty tag list, i.e. {}. That happens in both the ruby binding and the java binding.

Updated:

The top tags API is currently only designed to return tags with at least 2 jobs tagged as such. This task is to make the minimum number of backing jobs an argument (with a sane default) to the top tags API.

Encode math.huge

I found a problem on my test installation (Windows Redis-3.0 x64).
redis.call(..., math.huge, ...) converts the value to the string "1.#INF" and I get an error
when I go to the web UI at http://localhost:5678/queues/test-queue/running:
Qless::LuaScriptError at /queues/test-queue/running ERR min or max is not a float
The app sends these commands to redis:

1484839149.823831 [0 127.0.0.1:2376] "evalsha" "3f9682e7ddb462dca8c60a26d5e88ac70c3a49e9" "0" "jobs" "1484839149.8228312" "running" "test-queue" "0" "25"      
1484839149.823831 [0 lua] "zrangebyscore" "ql:q:test-queue-locks" "1484839149.8228312" "1.#INF" "LIMIT" "0" "25"                                               

So I suggest replacing math.huge with the string "+inf", as in this example:

redis.call('zrangebyscore', queue:prefix('locks'), now, "+inf", 'LIMIT', offset, count)

Update
Basic test case:

127.0.0.1:6379> eval "redis.call('set', 'a', math.huge) local a=redis.call('get', 'a') return {a, math.huge}" 0
1) "1.#INF"
2) (integer) -9223372036854775808

Broken conditional logic in put.lua

I'm debugging an issue we're having by looking at the qless monitoring output. I noticed this:

1360856508.527433 [8 lua] "zrem" "ql:w::jobs" "64c435622a9d402e9e51e3422cf81216"

This is coming from here:

-- If this had previously been given out to a worker,
-- make sure to remove it from that worker's jobs
if worker then
    redis.call('zrem', 'ql:w:' .. worker .. ':jobs', jid)
end

Worker is a blank string (or some other value that is concatenated as a blank string) but is truthy in the conditional. This may not be a real problem, but it was surprising to me and suggests some false assumptions being made that could cause other bugs later down the road, so I thought I'd mention it.
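
One possible tightening (an assumption about a fix, not the project's actual change) would be to treat a blank worker the same as a missing one:

-- Hypothetical guard: skip the cleanup when 'worker' is missing or blank
if worker and #worker > 0 then
    redis.call('zrem', 'ql:w:' .. worker .. ':jobs', jid)
end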

/cc @dlecocq

Cleanup failed jobs.

What is the correct way to keep only the last N failed jobs?
For now I'm trying to figure out whether it is possible to use qless in my use case.
In my use case I can just throw away a job and forget about it.
We have separate logging infrastructure and we can check the logs there.
I just need to write metrics about the number of failures to graphite.
Each worker will simply try to complete the job or call retry with some delay. If the number of retries is exhausted, qless now marks such a job as failed and never removes it.
(I have over 10M messages per day and around 40% will be marked as failed because they cannot be completed.)
The only solution I see is to keep my own counter and mark all jobs as completed.
But maybe there is some efficient way to remove all failed jobs from the queue?

Fails on redis 6.2.7 and redis 7.0 due to globally reachable lua tables becoming read only

docker run -it --rm -p 127.0.0.1:6379:6379 redis:6.2.7 and make test fails all tests due to tables becoming read only:

...
======================================================================
FAIL: Cannot fail a job that doesn't exist
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/test/qless-core/test/test_fail.py", line 84, in test_fail_nonexistent
    self.lua, 'fail', 1, 'jid', 'worker', 'group', 'message', {})
  File "/home/test/qless-core/test/common.py", line 41, in assertRaisesRegexp
    '%s does not match %s' % (str(exc), regex))
AssertionError: Error running script (call to f_1c4a4e283c4a97440f92a680b4aa0c4c070ee1e9): @user_script:27: user_script:27: Attempt to modify a readonly table does not match does not exist

======================================================================
FAIL: Cannot complete a job that doesn't exist
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/test/qless-core/test/test_job.py", line 99, in test_complete_nonexistent
    self.lua, 'complete', 1, 'jid', 'worker', 'queue', {})
  File "/home/test/qless-core/test/common.py", line 41, in assertRaisesRegexp
    '%s does not match %s' % (str(exc), regex))
AssertionError: Error running script (call to f_1c4a4e283c4a97440f92a680b4aa0c4c070ee1e9): @user_script:27: user_script:27: Attempt to modify a readonly table does not match does not exist
...
Ran 243 tests in 0.368s

FAILED (errors=220, failures=6)

Same with redis 7.

With 6.2.6 and prior releases it still works.

Related redis PR: redis/redis#10651

I have not used lua before, so I'm not sure what the proper way to work around that would be.

Idea: publish a message to workers when a queue moves from empty to non-empty

In our workers the general pattern is to do the following in a loop:

  • Pop jobs until they have no more work to do.
  • Go into a sleep loop where they sleep for x seconds, try to pop a job, then sleep again if there's no job.

This is pretty inefficient and results in steady redis traffic even when there's no work to do.

I'd like to propose an alternate model:

  • When a queue transitions from having no jobs to pop to having jobs to pop (whether via move, retry, put or whatever) have it publish a message like "jobs_available" in a channel that is named after the queue.
  • Then the workers could use sleep (with no arg) to sleep indefinitely, until their subscriber gets a jobs_available message, at which point it would use Thread#run to wake up the worker.

This would result in much less redis traffic and would allow workers to get jobs immediately when they are put on the queue rather than waiting through the 5 second (or whatever) sleep we currently use.

One "gotcha" with this, though: with scheduled/recurring jobs, jobs are put on the queue when a worker calls pop or peek, causing qless to make the state of that stuff consistent. so if nothing calls pop or peek, it'll never move the scheduled job to the waiting state and never notify workers. Thus, we may want to do something like use a long sleep (rather than an indefinite one) or consider having the parent process call peek periodically.

Project stewardship?

Let me start off by saying that I am very grateful to Moz for open sourcing Qless. Plenty of organisations would never do so. So thank you, and please understand that I simply want this tool to be the best it can.

Having said that, the inactivity is disappointing, and as someone attempting to contribute, quite frustrating too. Is this project still actively used by Moz (the organisation)? If so, is anyone (either an individual or a team) responsible for maintaining it?

Right now, across this 'core' repo and the various language bindings there are over 100 open issues and PR's - many of which are years old.

In several cases, @dlecocq has approved PR's but they're never merged.

In several more cases, a relatively small amount of work might be required to either fix an issue, or bring a non-satisfactory PR to a point where it can be merged - but without a reasonable expectation of it being merged, who is going to bother?

Cheers

Stephen
