
suredone / qdone


Command line job queue for SQS

License: ISC License

JavaScript 100.00%
sqs job-queue queue producer consumer aws amazon amazon-web-services cli workers

qdone's Introduction


qdone

Command line job queue for SQS

Features

  • Enqueue and run any command line job with parameters
  • Creates SQS queues (and failed job queues) on demand
  • Minimizes SQS API calls
  • Workers can listen to multiple queues, including wildcards
  • Efficient batch enqueueing of large numbers of jobs
  • Dynamic visibility timeout for long running jobs
  • Dynamic removal of idle queues

qdone was inspired, in part, by experiences with RQ in production.

Installing

npm install -g qdone

Examples

Enqueue a job and run it:

$ qdone enqueue myQueue "echo hello world"
Enqueued job 030252de-8a3c-42c6-9278-c5a268660384

$ qdone worker myQueue
...
Looking for work on myQueue (https://sqs.us-east-1...)
  Found job a23c71b3-b148-47b1-bfbb-f5dbb344ef97
  Executing job command: nice echo hello world
  SUCCESS
  stdout: hello world

Queues are automatically created when you use them:

$ qdone enqueue myNewQueue "echo nice to meet you"
Creating fail queue myNewQueue_failed
Creating queue myNewQueue
Enqueued job d0077713-11e1-4de6-8f26-49ad51e008b9

Notice that qdone also created a failed queue. More on that later.

To queue many jobs at once, put a queue name and command on each line of stdin or a file:

$ qdone enqueue-batch -  # use stdin
queue_0 echo hi
queue_1 echo hi
queue_2 echo hi
queue_3 echo hi
queue_4 echo hi
queue_5 echo hi
queue_6 echo hi
queue_7 echo hi
queue_8 echo hi
queue_9 echo hi
^D
Enqueued job 14fe4e30-bd4f-4415-b902-8df29cb73066 request 1
Enqueued job 60e31392-9810-4770-bfad-6a8f44114287 request 2
Enqueued job 0f26806c-2030-4d9a-94d5-b8d4b7a89115 request 3
Enqueued job 330c3d93-0364-431a-961b-5ace83066e55 request 4
Enqueued job ef64ab68-889d-4214-9ba5-af70d84565e7 request 5
Enqueued job 0fece491-6092-4ad2-b77a-27ccb0bd8e36 request 6
Enqueued job f053b027-3f4a-4e6e-8bb5-729dc8ecafa7 request 7
Enqueued job 5f11b69e-ede1-4ea2-8a60-c994adf2c5a0 request 8
Enqueued job 5079a10a-b13c-4b31-9722-8c1d3b146c28 request 9
Enqueued job 5dfe1008-9a1e-41df-b3bc-614ec5f34660 request 10
Enqueued 10 jobs

If you are using the same queue, requests to SQS will be batched:

$ qdone enqueue-batch -  # use stdin
queue_one echo hi
queue_one echo hi
queue_one echo hi
queue_one echo hi
queue_two echo hi
queue_two echo hi
queue_two echo hi
queue_two echo hi
^D
Enqueued job fb2fa6d1... request 1   # one
Enqueued job 85bfbe92... request 1   # request
Enqueued job cea6d180... request 1   # for queue_one
Enqueued job 9050fd34... request 1   #
Enqueued job 4e729c18... request 2      # another
Enqueued job 6dac2e4d... request 2      # request
Enqueued job 0252ae4b... request 2      # for queue_two
Enqueued job 95567365... request 2      #
Enqueued 8 jobs

Failed jobs

A command fails if it finishes with a non-zero exit code:

$ qdone enqueue myQueue "false"
Enqueued job 0e5957de-1e13-4633-a2ed-d3b424aa53fb

$ qdone worker myQueue
...
Looking for work on myQueue (https://sqs.us-east-1....)
  Found job 0e5957de-1e13-4633-a2ed-d3b424aa53fb
  Executing job command: nice false
  FAILED
  code  : 1
  error : Error: Command failed: nice false

The failed command will be placed on the failed queue.

To retry failed jobs, wait 30 seconds, then listen to the corresponding failed queue:

$ qdone worker myQueue_failed --include-failed
...
Looking for work on myQueue_failed (https://sqs.us-east-1.../qdone_myQueue_failed)
  Found job 0e5957de-1e13-4633-a2ed-d3b424aa53fb
  Executing job command: nice false
  FAILED
  code  : 1
  error : Error: Command failed: nice false

It failed again. It will go back on the failed queue.

In production you will want to either set alarms on the failed queue to make sure it doesn't grow too large, or set all your failed queues to drain to a failed job queue after some number of attempts, and monitor that queue as well.
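
If you take the drain approach, one way to set it up outside of qdone is to attach a redrive policy to the failed queue with the AWS CLI. The queue URL, target queue ARN, and maxReceiveCount below are placeholders:

$ aws sqs set-queue-attributes \
    --queue-url https://sqs.us-east-1.amazonaws.com/YOUR_ACCOUNT_ID/qdone_myQueue_failed \
    --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:YOUR_ACCOUNT_ID:qdone_dead_letter\",\"maxReceiveCount\":\"5\"}"}'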

Listening to multiple queues

It's nice sometimes to listen to a set of queues matching a prefix:

$ qdone worker 'test*'  # use single quotes to keep shell from globbing
...
Listening to queues (in this order):
  test - https://sqs.us-east-1.../qdone_test
  test1 - https://sqs.us-east-1.../qdone_test1
  test2 - https://sqs.us-east-1.../qdone_test2
  test3 - https://sqs.us-east-1.../qdone_test3
  test4 - https://sqs.us-east-1.../qdone_test4
  test5 - https://sqs.us-east-1.../qdone_test5
  test6 - https://sqs.us-east-1.../qdone_test6
  test7 - https://sqs.us-east-1.../qdone_test7
  test8 - https://sqs.us-east-1.../qdone_test8
  test9 - https://sqs.us-east-1.../qdone_test9

Looking for work on test (https://sqs.us-east-1.../qdone_test)
  Found job 2486f4b5-57ef-4290-987c-7b1140409cc6
...
Looking for work on test1 (https://sqs.us-east-1.../qdone_test1)
  Found job 0252ae4b-89c4-4426-8ad5-b1480bfdb3a2
...

The worker will listen to each queue for the --wait-time period, then start over from the beginning.

Long running jobs

Workers prevent others from processing their job by automatically extending the default SQS visibility timeout (30 seconds) as long as the job is still running. You can see this when running a long job:

$ qdone enqueue test "sleep 35"
Enqueued job d8e8927f-5e42-48ae-a1a8-b91e42700942

$ qdone worker test --kill-after 300
...
  Found job d8e8927f-5e42-48ae-a1a8-b91e42700942
  Executing job command: nice sleep 35
  Ran for 15.009 seconds, requesting another 60 seconds
  SUCCESS
...

The SQS API call to extend this timeout (ChangeMessageVisibility) is called at the halfway point before the message becomes visible again. The timeout doubles every subsequent call but never exceeds --kill-after.
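
As a rough illustration, here is a sketch (in JavaScript, not qdone's actual source) of the extension schedule described above for a given --kill-after value:

// Simplified sketch of the visibility extension schedule: the first
// ChangeMessageVisibility call happens halfway through the default 30 second
// timeout, each requested timeout doubles, and none exceeds --kill-after.
function extensionSchedule (killAfter, initialTimeout = 30) {
  const schedule = []
  let elapsed = initialTimeout / 2          // time of the first extension call
  let request = Math.min(initialTimeout * 2, killAfter)
  while (elapsed < killAfter) {
    schedule.push({ atSecond: elapsed, requestSeconds: request })
    elapsed += request / 2                  // next call at the new halfway point
    request = Math.min(request * 2, killAfter)
  }
  return schedule
}

// For --kill-after 300: calls at ~15s (requesting 60s), 45s (120s), 105s (240s), 225s (300s)
console.log(extensionSchedule(300))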

Dynamically removing queues

If you have workers listening on a dynamic number of queues, then any idle queues will negatively impact how quickly jobs can be dequeued and/or increase the number of unnecessary API calls. You can discover which queues are idle using the idle-queues command:

$ qdone idle-queues 'test*' --idle-for 60 > idle-queues.txt
Resolving queues: test*
  done

Checking queues (in this order):
  test - https://sqs.us-east-1.../qdone_test
  test2 - https://sqs.us-east-1.../qdone_test2

Queue test2 has been idle for the last 60 minutes.
Queue test has been idle for the last 60 minutes.
Queue test_failed has been idle for the last 60 minutes.
Queue test2_failed has been idle for the last 60 minutes.
Used 4 SQS and 28 CloudWatch API calls.

$ cat idle-queues.txt
test
test2

Accurate discovery of idle queues cannot be done through the SQS API alone, and requires the use of the more-expensive CloudWatch API (at the time of this writing, ~$0.40/1M calls for SQS API and ~$10/1M calls on CloudWatch). The idle-queues command attempts to make as few CloudWatch API calls as possible, exiting as soon as it discovers evidence of messages in the queue during the idle period.

You can use the --delete option to actually remove a queue if it has been idle:

$ qdone idle-queues 'test*' --idle-for 60 --delete > deleted-queues.txt
...
Deleted test
Deleted test_failed
Deleted test2
Deleted test2_failed
Used 8 SQS and 28 CloudWatch API calls.

$ cat deleted-queues.txt
test
test2

Because of the higher cost of CloudWatch API calls, you may wish to plan your deletion schedule accordingly. For example, at the time of this writing, running the above command (two idle queues, 28 CloudWatch calls) every 10 minutes would cost around $1.20/month. However, if most of the queues are actively used, the number of CloudWatch calls needed goes down. On one of my setups, there are around 60 queues with a dozen queues idle over a two-hour period, and this translates to about 200 CloudWatch API calls every 10 minutes, or $8/month.
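
The arithmetic behind these estimates is straightforward (using the approximate CloudWatch price quoted above; check current AWS pricing before relying on it):

// ~28 CloudWatch calls per run, one run every 10 minutes, ~$10 per 1M calls
const callsPerRun = 28
const runsPerMonth = (30 * 24 * 60) / 10      // ≈ 4320
const costPerCall = 10 / 1e6
console.log((callsPerRun * runsPerMonth * costPerCall).toFixed(2))  // ≈ 1.21

// The busier setup: ~200 CloudWatch calls every 10 minutes
console.log((200 * runsPerMonth * costPerCall).toFixed(2))          // ≈ 8.64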

FIFO Queues

The enqueue and enqueue-batch commands can create FIFO queues with limited features, controlled by the --fifo and --group-id <string> options.

Using the --fifo option with enqueue or enqueue-batch:

  • causes any new queues to be created as FIFO queues
  • causes the .fifo suffix to be appended to any queue names that do not explicitly have them
  • causes failed queues to take the form ${name}_failed.fifo

Using the --group-id option with enqueue or enqueue-batch implies that:

  • Any commands with the same --group-id will be worked on in the order they were received by SQS (see FIFO docs).
  • If you don't set --group-id it defaults to a unique id per call to qdone, so this means messages sent by enqueue-batch will always be processed within the batch in the order you sent them.
  • If you want each message in a batch to have a unique group id (i.e. they don't need to be processed in order, but need to be delivered exactly once) then use the --group-id-per-message option with enqueue-batch.
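
For example, to enqueue an ordered batch into a FIFO queue (the queue name, commands, and group id here are illustrative):

$ qdone enqueue-batch --fifo --group-id order-42 -  # use stdin
orders ./process-order.sh step-1
orders ./process-order.sh step-2
^D

With --fifo, the .fifo suffix is appended to the queue name (and the failed queue becomes orders_failed.fifo), and because both messages share the same group id they will be worked on in the order they were enqueued.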

Enqueue limitations:

  • There is NO option to set group id per-message in enqueue-batch. Adding this feature in the future will change the format of the batch input file.
  • There is NO support right now for Content Deduplication, however a Unique Message Deduplication ID is generated for each command, so retry-able errors should not result in duplicate messages.

Using the --fifo option with worker:

  • causes the .fifo suffix to be appended to any queue names that do not explicitly have them
  • causes the worker to only listen to queues with a .fifo suffix when wildcard names are specified (e.g. test_* or *)

Worker limitations:

  • Failed queues are still only included if --include-failed is set.
  • Regardless of how many workers you have, FIFO commands with the same --group-id will only be executed by one worker at a time.
  • There is NO support right now for only-once processing using the Receive Request Attempt ID

Production Logging

The output examples in this readme assume you are running qdone from an interactive shell. However, if the shell is non-interactive (technically, if stderr is not a tty) then qdone will automatically use the --quiet option and will log failures to stdout as one JSON object per line in the following format:

{
  "event": "JOB_FAILED",
  "timestamp": "2017-06-25T20:21:19.744Z",
  "job": "0252ae4b-89c4-4426-8ad5-b1480bfdb3a2",
  "command": "python /opt/myapp/jobs/reticulate_splines.py 42",
  "exitCode": "1",
  "killSignal": "SIGTERM",
  "stderr": "...",
  "stdout": "reticulating splines...",
  "errorMessage": "You can't kill me using SIGTERM, muwahahahahaha! Oh wait..."
}

Each field in the above JSON except event and timestamp is optional and only appears when it contains data. Note that log events other than JOB_FAILED may be added in the future. Also note that warnings and errors not in the above JSON format will appear on stderr.
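
One convenient way to consume this stream (assuming jq is installed; the queue name is illustrative) is to filter the JSON lines on stdout:

$ qdone worker myQueue 2>> worker.log | jq 'select(.event == "JOB_FAILED") | .job'

Redirecting stderr to a file means it is not a tty, so --quiet production logging kicks in automatically.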

Shutdown Behavior

Send a SIGTERM or SIGINT to qdone and it will exit successfully after any running jobs complete. A second SIGTERM or SIGINT will immediately kill the entire process group, including any running jobs.

Interactive shells and init frameworks like systemd signal the entire process group by default, so jobs may exit prematurely after receiving the group signal.

To get around this problem in systemd, use KillMode=mixed to keep the job from hearing the signal sent to qdone (but still allow systemd to send a SIGKILL to the child if it runs past TimeoutStopSec).

Here is an example systemd service that runs a qdone worker that allows jobs to run for up to an hour. Calls to systemctl stop|restart will block until any running job is safely finished:

[Unit]
Description=qdone long-running job example
AssertPathExists=/usr/bin/qdone

[Service]
Type=simple
Restart=always
RestartSec=30
TimeoutStopSec=3600
KillMode=mixed
ExecStart=/usr/bin/qdone worker long-running-job-queue --kill-after 3600

[Install]
WantedBy=multi-user.target

SQS API Call Complexity

Context: qdone enqueue
Calls:   2 [+3]
Details: One call to resolve the queue name, one call to enqueue the command, three extra calls if the queue does not exist yet.

Context: qdone enqueue-batch
Calls:   q + ceil(c/10) + 3n
Details: q: number of unique queue names in the batch
         c: number of commands in the batch
         n: number of queues that do not exist yet

Context: qdone worker (while listening, per listen round)
Calls:   n + (1 per n×w)
Details: w: --wait-time in seconds
         n: number of queues

Context: qdone worker (while listening with --active-only, per round)
Calls:   2n + (1 per a×w)
Details: w: --wait-time in seconds
         a: number of active queues

Context: qdone worker (while job running)
Calls:   log(t/30) + 1
Details: t: total job run time in seconds
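
For example, an enqueue-batch of 25 commands going to a single queue that already exists (q = 1, c = 25, n = 0) makes 1 + ceil(25/10) + 0 = 4 SQS API calls.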

AWS Authentication

You must provide ONE of:

  • On AWS instances, the instance may have an IAM role that allows the appropriate SQS calls. No further configuration necessary.
  • A credentials file (~/.aws/credentials) containing a [default] section with appropriate keys.
  • Both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables
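
A minimal credentials file looks like this (replace the placeholder values with your own keys):

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY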

Example IAM policy allowing qdone to use queues with its prefix in any region:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:SendMessage",
                "sqs:SendMessageBatch",
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:CreateQueue",
                "sqs:ChangeMessageVisibility"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:*:YOUR_ACCOUNT_ID:qdone_*"
        },
        {
            "Action": [
                "sqs:ListQueues"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:*:YOUR_ACCOUNT_ID"
        }
    ]
}

For the idle-queues subcommand, you must add the following permission (and as of this writing, it is not possible to narrow the scope):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["cloudwatch:GetMetricStatistics"],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

Command Line Usage

usage: qdone [options] <command>

Commands

enqueue         Enqueue a single command
enqueue-batch   Enqueue multiple commands from stdin or a file
worker          Execute work on one or more queues
idle-queues     Check for (and optionally delete) idle queues

Global Options

--prefix string        Prefix to place at the front of each SQS queue name [default: qdone_]
--fail-suffix string   Suffix to append to each queue to generate fail queue name [default: _failed]
--region string        AWS region for Queues [default: us-east-1]
-q, --quiet            Turn on production logging. Automatically set if stderr is not a tty.
-v, --verbose          Turn on verbose output. Automatically set if stderr is a tty.
-V, --version          Show version number
--help                 Print full help message.

Enqueue Usage

usage: qdone enqueue [options] <queue> <command>
usage: qdone enqueue-batch [options] <file...>

<file...> can be one or more filenames or - for stdin

Options

-f, --fifo                Create new queues as FIFOs
-g, --group-id string     FIFO Group ID to use for all messages enqueued in current command. Defaults to a string unique to this invocation.
--group-id-per-message    Use a unique Group ID for every message, even messages in the same batch.
--deduplication-id string A Message Deduplication ID to give SQS when sending a message. Use this
                          option if you are managing retries outside of qdone, and make sure the ID is
                          the same for each retry in the deduplication window. Defaults to a string
                          unique to this invocation.
--prefix string           Prefix to place at the front of each SQS queue name [default: qdone_]
--fail-suffix string      Suffix to append to each queue to generate fail queue name [default: _failed]
--region string           AWS region for Queues [default: us-east-1]
-q, --quiet               Turn on production logging. Automatically set if stderr is not a tty.
-v, --verbose             Turn on verbose output. Automatically set if stderr is a tty.
-V, --version             Show version number
--help                    Print full help message.

Worker Usage

usage: qdone worker [options] <queue...>

<queue...> one or more queue names to listen on for jobs

If a queue name ends with the * (wildcard) character, worker will listen on all queues that match the name up to the wildcard. Place arguments like this inside quotes to keep the shell from globbing local files.

Options:

-k, --kill-after number   Kill job after this many seconds [default: 30]
-w, --wait-time number    Listen at most this long on each queue [default: 20]
--include-failed          When using '*' do not ignore fail queues.
--active-only             Listen only to queues with pending messages.                                  
--drain                   Run until no more work is found and quit. NOTE: if used with
                         --wait-time 0, this option will not drain queues.
--prefix string           Prefix to place at the front of each SQS queue name [default: qdone_]
--fail-suffix string      Suffix to append to each queue to generate fail queue name [default: _failed]
--region string           AWS region for Queues [default: us-east-1]
-q, --quiet               Turn on production logging. Automatically set if stderr is not a tty.
-v, --verbose             Turn on verbose output. Automatically set if stderr is a tty.
-V, --version             Show version number
--help                    Print full help message.

Idle queues usage

usage: qdone idle-queues [options] <queue...>

Options:

-o, --idle-for number   Minutes of inactivity after which a queue is considered
                        idle. [default: 60]
--delete                Delete the queue if it is idle. The fail queue also must be
                        idle unless you use --unpair.
--unpair                Treat queues and their fail queues as independent. By default
                        they are treated as a unit.
--include-failed        When using '*' do not ignore fail queues. This option only
                        applies if you use --unpair. Otherwise, queues and fail queues
                        are treated as a unit.
--prefix string         Prefix to place at the front of each SQS queue name [default: qdone_]
--fail-suffix string    Suffix to append to each queue to generate fail queue name [default: _failed]
--region string         AWS region for Queues [default: us-east-1]
-q, --quiet             Turn on production logging. Automatically set if stderr is not a tty.
-v, --verbose           Turn on verbose output. Automatically set if stderr is a tty.
-V, --version           Show version number
--help                  Print full help message.

qdone's People

Contributors

dependabot[bot], greenkeeper[bot], ryanwitt


Forkers

ryanwitt

qdone's Issues

Worker children not killed immediately when reaching --kill-after

Observed in production that child processes do not exit and become orphaned once --kill-after is reached. Saw this on the command line.

Also see separate orphaned child processes on some production machines.

Theory on what's happening:

qdone uses child_process.exec (not to be confused with exec(3)) to execute children within a shell.

It seems like exec's timeout option (that we use to send SIGTERM to the child if it reaches the timeout) does not actually kill the child of the shell (observed on Ubuntu 16.04).

Seems like a fairly useless option for exec. Maybe a better option for execFile?

Workarounds:

  1. Kill the qdone process group?
  2. Find the PID of the child (not the child shell, but the shell's child) and manually signal that.

I'm starting this ticket to record my findings, but it may be worth checking node issues as well to see if anybody has run into this strange design on exec.

Move current chatty output to --verbose mode

There's a lot going on in the output.

We probably only need something like this for the default output for worker:

SUCCEEDED a78fefa8-5c43-40d9-9f8e-c733442049d9: /path/to/command some args
FAILED b83fefa8-1c43-30d9-of8e-4733442049d3: /path/to/command some args

Enqueue is probably fine as-is.

Add --delay option

SQS provides a feature where message delivery can be delayed up to 15 minutes (900 seconds) per message. It would be great to provide a --delay <seconds> option to qdone enqueue|enqueue-batch to use this feature.

Use cases include rate limits, exponential backoff throttles and scheduling events.

If user explicitly listens on a queue, it should always be resolved

Right now qdone worker without --always-resolve ignores queues that don't exist at invocation. This runs contrary to user expectations that if a queue is listed explicitly, we should listen on it.

We may want to think about treating wildcard queues and explicitly listed queues as separate concepts internally.

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on Greenkeeper branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because it uses your CI build statuses to figure out when to notify you about breaking changes.

Since we didn’t receive a CI status on the greenkeeper/initial branch, it’s possible that you don’t have CI set up yet.

If you have already set up a CI for this repository, you might need to check how it’s configured. Make sure it is set to run on all new branches. If you don’t want it to run on absolutely every branch, you can whitelist branches starting with greenkeeper/.

Once you have installed and configured CI on this repository correctly, you’ll need to re-trigger Greenkeeper’s initial pull request. To do this, please click the 'fix repo' button on account.greenkeeper.io.

enqueue --fifo interacts poorly with callers using exponential backoff

User encountered a problem where the same job was being sent to a fifo queue multiple times due to exponential backoff code wrapping qdone.

In this case, qdone generates a MessageDeduplicationID for each message for each retry, allowing multiple instances of the same message through.

To fix this, we could bring exponential backoff and retry into qdone, pending discovery of whether the errors we see are retryable.

Alternately, we could introduce an option that pre-seeds qdone with a random string that it uses to create a consistent MessageDeduplicationID for each message enqueued. This approach should work for enqueue-batch too.

QueueDoesNotExist: The specified queue does not exist.

Sentry Issue: QDONE-F

When used with idle-queues --delete, we often see temporary failures to resolve queues in _cheapIdleCheck(). These should be safe to ignore, as they will disappear at the next listening round. Catch this error instead of throwing.

QueueDoesNotExist: The specified queue does not exist.
  File "/usr/lib/node_modules/qdone/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_json1_0.js", line 1537, in de_QueueDoesNotExistRes
    const exception = new models_0_1.QueueDoesNotExist({
  File "/usr/lib/node_modules/qdone/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_json1_0.js", line 607, in de_GetQueueAttributesCommandError
    throw await de_QueueDoesNotExistRes(parsedOutput, context);
  File "node:internal/process/task_queues", line 96, in processTicksAndRejections
  File "/usr/lib/node_modules/qdone/node_modules/@smithy/middleware-serde/dist-cjs/index.js", line 35, in <anonymous>
    const parsed = await deserializer(response, options);
  File "/usr/lib/node_modules/qdone/node_modules/@smithy/core/dist-cjs/index.js", line 165, in <anonymous>
    const output = await next({
...
(5 additional frame(s) were not displayed)

exception deleting failed queues in paired mode when fail queue does not exist

When making a call like qdone idle-queues --delete 'test*', I'm getting a failure halfway through the list of queues because one of the failed queues does not exist:

AWS.SimpleQueueService.QueueDeletedRecently: You must wait 60 seconds after deleting a queue before you can create another with the same name.
at Request.extractError (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/protocol/query.js:47:29)
at Request.callListeners (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:105:20)
at Request.emit (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:77:10)
at Request.emit (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:683:14)
at Request.transition (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:22:10)
at AcceptorStateMachine.runTo (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/state_machine.js:14:12)
at /usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/state_machine.js:26:10
at Request.<anonymous> (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:38:9)
at Request.<anonymous> (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:685:12)
at Request.callListeners (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:115:18)

Qdone should ignore this non-existence and keep going. Maybe print a message that the queue does not exist.

Protect against double receive

SQS guarantees at least once delivery, but sometimes jobs are not idempotent and should not be executed more than once.

It would be helpful to provide an optional way for users to prevent duplicate messages using a fact store like Redis. We already have the MessageId to use as a key.

We could obtain a Redis lock, and mirror the message visibility timeout extension calls to extend the lock TTL using the same values we send to AWS, finally setting a long TTL (same as message retention) upon job finish and successful SQS delete call.

Of course, failure in any of these writes or the Redis instance could defeat the safety of this feature.
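
A rough sketch of the idea (hypothetical and not part of qdone; ioredis is used here only as an example client, and the key name is made up):

// Claim a message by MessageId; only the worker that wins the SET NX lock runs the job.
const Redis = require('ioredis')
const redis = new Redis(process.env.REDIS_URL)

async function claimMessage (messageId, visibilityTimeoutSeconds) {
  const result = await redis.set(
    `qdone:seen:${messageId}`, 'claimed', 'EX', visibilityTimeoutSeconds, 'NX'
  )
  return result === 'OK'   // null means another worker already claimed it
}

// Mirror each ChangeMessageVisibility extension onto the lock TTL.
async function extendClaim (messageId, newTimeoutSeconds) {
  await redis.expire(`qdone:seen:${messageId}`, newTimeoutSeconds)
}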

Log child process information

As long as qdone is logging jobs, it would be great to log things about the child process, like how much CPU it used, how many wall-clock seconds it used, peak memory, etc.

Maybe we should log qdone overhead as well.

SIGTERM should allow running job to finish work

In practice, users want the option to request a worker shutdown but still allow the worker to finish. We should catch SIGTERM and set the worker in a mode that quits when the child is done.

This is a breaking API change. Users will have to SIGKILL qdone to force termination of the child before --kill-after timeout is up.

FIFO option

Some users need the ability to make queues FIFOs. The enqueue command should gain a --fifo option.

What should we do about fail queues in this case?

Add option for unique group ids for every message in a FIFO batch

The current behavior for qdone enqueue-batch --fifo is to assign a single group id to all messages in a batch.

Sometimes you don't need to guarantee messages are ordered within a batch and would rather workers be able to pick them all up at once.

Therefore it would be useful to have a --group-id-per-message flag to ensure each message gets a different unique group id.

Reset SQS message visibility after failed job

The current algorithm does exponential backoff to request new visibility timeouts, but can leave a failed job invisible for the duration of the timeout.

Since we know when a job fails, it would be good to make the message visible again in this case.
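
A minimal sketch of that fix (an assumption about how it could look, using the @aws-sdk/client-sqs API that already appears in this repo's stack traces):

const { SQSClient, ChangeMessageVisibilityCommand } = require('@aws-sdk/client-sqs')
const sqs = new SQSClient({})

// After a job fails, make its message visible again right away instead of
// letting it sit out the remainder of the extended visibility timeout.
async function releaseFailedMessage (queueUrl, receiptHandle) {
  await sqs.send(new ChangeMessageVisibilityCommand({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: 0
  }))
}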

Missing region in config

I'm getting this ConfigError from aws-sdk
It seems like it's not taking the region from the .aws/credentials file

I fixed it by adding a new ENV called AWS_REGION to the project

Error [ConfigError]: Missing region in config
at Request.VALIDATE_REGION (.../node_modules/qdone/node_modules/aws-sdk/lib/event_listeners.js:92:45)
at Request.callListeners (..../node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
at callNextListener (..../node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:96:12)
at ..../node_modules/qdone/node_modules/aws-sdk/lib/event_listeners.js:86:9
at finish (..../node_modules/qdone/node_modules/aws-sdk/lib/config.js:349:7)
at ..../node_modules/qdone/node_modules/aws-sdk/lib/config.js:391:9
at Object.<anonymous> (..../node_modules/qdone/node_modules/aws-sdk/lib/credentials/credential_provider_chain.js:111:13)
at Object.arrayEach (..../node_modules/qdone/node_modules/aws-sdk/lib/util.js:516:32)
at resolveNext (..../node_modules/qdone/node_modules/aws-sdk/lib/credentials/credential_provider_chain.js:110:20)
at ..../node_modules/qdone/node_modules/aws-sdk/lib/credentials/credential_provider_chain.js:126:13 {
  code: 'ConfigError',
  time: 2022-12-21T16:32:00.266Z
}

Bulk enqueue

It would be nice to queue a bunch of jobs from client languages without waiting for a node process to boot. Also, it would be nice not to write client libraries for everything yet.

We should have a bulk enqueue mode that loads queue, command pairs from files and stdin.

This probably should support a different queue for each line.

Something like:

$ qdone enqueue << EOF
queue1 "/usr/bin/env php /path/to/some/script arg arg arg"
queue1 "/usr/bin/env php /path/to/some/script arg arg more-different-arg"
queue2 "/usr/bin/env php /path/to/some/script yet more args"
...
EOF

The underlying calls should be handled efficiently, including batching messages to individual queues.

Option to listen to multiple queues in parallel instead of in sequence

Right now, qdone listens on one queue for --wait-time seconds, then moves on to the next queue.

We could get lower latency by listening to all queues, returning when we find data on any of them, and abort()ing the listen requests to the other queues.

This could potentially starve queues if one consistently wins the race for returning data first.

DLQ (dead letter queue) support

qdone's builtin failed queues are nice, but if a job repeatedly fails, it can be useful to get a developer's attention on it by sending it to a master failed queue after some number (say 3 or 5) of attempts. Furthermore, it may be an advantage to have dynamically named DLQs, so this should allow for that as well.

Add options --dlq-name NAME and --dlq-after 5 to activate DLQ support on failed queues.


In SureDone, there are several scenarios where this is currently a potential problem:

  • channel imports trigger creation of dynamic user-based product-level import queues
    • those jobs fail for some reason and are never caught, since the failed queues do not send to dead letter queues
  • bulk jobs still mysteriously don't complete sometimes
    • bulk jobs are put on dynamically created user-based queues, and failures are invisible without a DLQ
  • critical sold-action inventory update queues are put on dynamically created user-based queues
    • we currently have no visibility into if or when these processes fail without a DLQ

Remove idle dynamic queues

It's much more efficient to not listen on dynamic queues if they are idle for a long time. There should be some facility to either listen only on queues that are going to have data, or remove idle queues.

Qdone restarts too fast when there are no queues present

Nov 04 19:50:06 ip-10-172-65-227 qdone[12283]: AWS.SimpleQueueService.NonExistentQueue: 
...
Nov 04 19:50:06 ip-10-172-65-227 systemd[1]: qdone-chunker.service: Main process exited, code=exited, status=1/FAILURE
Nov 04 19:50:06 ip-10-172-65-227 systemd[1]: qdone-chunker.service: Unit entered failed state.
Nov 04 19:50:06 ip-10-172-65-227 systemd[1]: qdone-chunker.service: Failed with result 'exit-code'.

One gotcha with systemd integration is that when qdone has no queues to listen on, it can fail faster than normal, causing systemd to move the unit to the failed state and not restart it.

This came up in a use case where we were deleting idle queues, and several weeks into deployment, a brief lull in job activity caused all queues to be deleted.

You can fix this with systemd config, but I'd prefer qdone behave consistently and not trip people up.

The default behavior should be to listen for --wait-time seconds, even if there are no queues to listen on.

Add --retry-count option

This would control the number of failures allowed before the queue sends jobs to failed.

usage: --retry-count <number>

Log successful jobs

Right now when --verbose is not set, qdone logs failed jobs. It would be nice to log successful jobs as well.

Add --tag option

SQS now has cost tagging, and we should support it at the command line for enqueue. The CLI should support --tag <tag> in a Key=Value format like the AWS CLI uses, and should support multiple tags.

enqueue-batch fails when command lines exceed 256k for one batch

Steps to reproduce

  1. Create a file with 10 lines, each command should be 100kb long

  2. Try to enqueue it

$ ./qdone enqueue-batch too-big.txt
Creating fail queue test_failed
Creating queue test
TypeError: Cannot read property 'Failed' of undefined
    at /Users/ryan/src/qdone-master/src/enqueue.js:128:17
    at _fulfilled (/Users/ryan/src/qdone-master/node_modules/q/q.js:854:54)
    at self.promiseDispatch.done (/Users/ryan/src/qdone-master/node_modules/q/q.js:883:30)
    at Promise.promise.promiseDispatch (/Users/ryan/src/qdone-master/node_modules/q/q.js:816:13)
    at /Users/ryan/src/qdone-master/node_modules/q/q.js:624:44
    at runSingle (/Users/ryan/src/qdone-master/node_modules/q/q.js:137:13)
    at flush (/Users/ryan/src/qdone-master/node_modules/q/q.js:125:13)
    at _combinedTickCallback (internal/process/next_tick.js:73:7)
    at process._tickDomainCallback (internal/process/next_tick.js:128:9)
bash: update_terminal_cwd: command not found

Expected behavior

Large command lines should enqueue in an appropriate number of api calls (2 commands per api call, in this test case).
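
A sketch of the expected batching logic (an illustration, not qdone's current code): split each SendMessageBatch call on both the 10-entry limit and the 256 KiB payload limit, whichever is hit first.

function chunkForSqs (bodies, maxBytes = 256 * 1024, maxEntries = 10) {
  const batches = [[]]
  let bytes = 0
  for (const body of bodies) {
    const size = Buffer.byteLength(body, 'utf8')
    const current = batches[batches.length - 1]
    // Start a new batch if adding this body would exceed either limit
    if (current.length >= maxEntries || (current.length > 0 && bytes + size > maxBytes)) {
      batches.push([])
      bytes = 0
    }
    batches[batches.length - 1].push(body)
    bytes += size
  }
  return batches
}

// Ten 100 KB commands => five SendMessageBatch calls of 2 messages each
console.log(chunkForSqs(Array(10).fill('x'.repeat(100 * 1024))).map(b => b.length))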

Deleting idle queues doesn't work properly with FIFO

When using idle-queues --delete, the normal behavior is to delete failed queues along with their normal counterpart.

SQS's naming scheme for FIFO queues (appending .fifo) messes this up, because our normal _failed suffix becomes _failed.fifo. This is handled properly in enqueue and worker but not in idle-queues.

The fix is to generate the failed queue names properly in the idle-queues command.

--active-only expensive when many queues are active, add caching option

After switching to --active-only on jobs that have a large number of dynamic queues, we notice that we start spending a lot of money on GetQueueAttributes calls:

(Screenshot: Screen Shot 2019-06-25 at 12.05.01 PM)

This makes sense when comparing the --active-only API call complexity with the base case: when a is high, so is the number of calls:

Context: qdone worker (while listening, per listen round)
Calls:   n + (1 per n×w)
Details: w: --wait-time in seconds
         n: number of queues

Context: qdone worker (while listening with --active-only, per round)
Calls:   2n + (1 per a×w)
Details: w: --wait-time in seconds
         a: number of active queues

However the state of the active queues is very cacheable, especially if queues tend to have large backlogs, as ours do.

I propose we add three options:

  • --cache-url that takes a redis://... cluster url [no default]
  • --cache-ttl-seconds that takes a number of seconds [default 10]
  • --cache-prefix that defines a cache key prefix [default qdone]

The presence of the --cache-url option will cause the worker to cache GetQueueAttributes for each queue for the specified ttl. Probably can use mget for this, if we're careful about key slots.
