suredone / qdone

Command line job queue for SQS

License: ISC License

JavaScript 100.00%
sqs job-queue queue producer consumer aws amazon amazon-web-services cli workers

qdone's Issues

Remove idle dynamic queues

It's much more efficient to not listen on dynamic queues if they are idle for a long time. There should be some facility to either listen only on queues that are going to have data, or remove idle queues.

Add --tag option

SQS now has cost tagging, and we should support it at the command line for enqueue. The CLI should accept --tag <tag> in a Key=Value format similar to the AWS CLI, and should allow multiple tags.
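
A minimal sketch of how repeated --tag Key=Value arguments could be turned into a tag map. The helper name parseTags is hypothetical and not part of qdone; splitting on the first = only is an assumption, chosen so that values may themselves contain =.

```javascript
// Hypothetical helper: ['Team=ops', 'Env=prod'] -> { Team: 'ops', Env: 'prod' }
function parseTags (args) {
  const tags = {};
  for (const arg of args) {
    const i = arg.indexOf('=');
    // Require a non-empty key before the first '='.
    if (i <= 0) throw new Error(`Tags must look like Key=Value, got: ${arg}`);
    tags[arg.slice(0, i)] = arg.slice(i + 1);
  }
  return tags;
}
```

The resulting object has the shape that the SQS tagging calls take for their Tags parameter, so it could be passed along when the queue is created.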

Qdone restarts too fast when there are no queues present

Nov 04 19:50:06 ip-10-172-65-227 qdone[12283]: AWS.SimpleQueueService.NonExistentQueue: 
...
Nov 04 19:50:06 ip-10-172-65-227 systemd[1]: qdone-chunker.service: Main process exited, code=exited, status=1/FAILURE
Nov 04 19:50:06 ip-10-172-65-227 systemd[1]: qdone-chunker.service: Unit entered failed state.
Nov 04 19:50:06 ip-10-172-65-227 systemd[1]: qdone-chunker.service: Failed with result 'exit-code'.

One gotcha with systemd integration is that when qdone has no queues to listen on, it can fail faster than normal, causing systemd to move the unit to the failed state and not restart it.

This came up in a use case where we were deleting idle queues, and several weeks into deployment, a brief lull in job activity caused all queues to be deleted.

You can fix this with systemd config, but I'd prefer qdone behave consistently and not trip people up.

The default behavior should be to listen for --wait-time seconds, even if there are no queues to listen on.

Move current chatty output to --verbose mode

There's a lot going on in the output.

We probably only need something like this for the default output for worker:

SUCCEEDED a78fefa8-5c43-40d9-9f8e-c733442049d9: /path/to/command some args
FAILED b83fefa8-1c43-30d9-9f8e-4733442049d3: /path/to/command some args

Enqueue is probably fine as-is.

enqueue --fifo interacts poorly with callers using exponential backoff

User encountered a problem where the same job was being sent to a fifo queue multiple times due to exponential backoff code wrapping qdone.

In this case, qdone generates a new MessageDeduplicationId for each message on every retry, so SQS lets multiple copies of the same message through.

To fix this, we could bring exponential backoff and retry into qdone, pending discovery of whether the errors we see are retryable.

Alternatively, we could introduce an option that pre-seeds qdone with a random string, which it uses to create a consistent MessageDeduplicationId for each message enqueued. This approach should work for enqueue-batch too.
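
A minimal sketch of the seeded approach, assuming the caller passes the same seed on every retry. The function name deduplicationId and the choice of SHA-256 are illustrative assumptions, not qdone's implementation.

```javascript
const crypto = require('crypto');

// Hypothetical helper: the same (seed, command) pair always yields the same id,
// so a retried enqueue is recognised by SQS as a duplicate.
function deduplicationId (seed, command) {
  return crypto
    .createHash('sha256')
    .update(`${seed}\n${command}`)
    .digest('hex'); // 64 hex characters, within the 128-character limit
}
```

Note that SQS only deduplicates within a 5-minute window, so this would protect against short retry loops rather than retries spread over a longer period.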

SIGTERM should allow running job to finish work

In practice, users want the option to request a worker shutdown but still allow the worker to finish. We should catch SIGTERM and set the worker in a mode that quits when the child is done.

This is a breaking API change. Users will have to SIGKILL qdone to force termination of the child before --kill-after timeout is up.
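
The deferred shutdown described above could look roughly like this. It is a sketch under assumptions: createWorker, startJob and finishJob are hypothetical names, and the exit function is injected so the logic can be exercised without ending the process.

```javascript
// `proc` is anything that emits 'SIGTERM' (normally the global `process`).
function createWorker (proc, exit) {
  const worker = { jobRunning: false, shutdownRequested: false };
  proc.on('SIGTERM', () => {
    worker.shutdownRequested = true;
    if (!worker.jobRunning) exit(0); // idle, so quit right away
  });
  worker.startJob = () => { worker.jobRunning = true; };
  worker.finishJob = () => {
    worker.jobRunning = false;
    if (worker.shutdownRequested) exit(0); // shutdown was deferred until now
  };
  return worker;
}

// Usage with the real process object:
// const worker = createWorker(process, code => process.exit(code));
```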

Bulk enqueue

It would be nice to queue a bunch of jobs from client languages without waiting for a node process to boot. Also, it would be nice not to write client libraries for everything yet.

We should have a bulk enqueue mode that loads queue, command pairs from files and stdin.

This probably should support a different queue for each line.

Something like:

$ qdone enqueue << EOF
queue1 "/usr/bin/env php /path/to/some/script arg arg arg"
queue1 "/usr/bin/env php /path/to/some/script arg arg more-different-arg"
queue2 "/usr/bin/env php /path/to/some/script yet more args"
...
EOF

The underlying calls should be handled efficiently, including batching messages to individual queues.
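
A rough sketch of the parsing and batching step, assuming the line format shown in the example above. parseBulkLines and toBatches are hypothetical names; the batch size of 10 reflects the SendMessageBatch entry limit.

```javascript
// Group commands by queue from lines of the form: queue "command"
function parseBulkLines (text) {
  const byQueue = new Map();
  for (const line of text.split('\n')) {
    const match = line.trim().match(/^(\S+)\s+(.+)$/);
    if (!match) continue; // skip blank or malformed lines
    const queue = match[1];
    const command = match[2].replace(/^"(.*)"$/, '$1'); // strip optional quotes
    if (!byQueue.has(queue)) byQueue.set(queue, []);
    byQueue.get(queue).push(command);
  }
  return byQueue;
}

// Split each queue's commands into groups of at most `size` entries.
function toBatches (byQueue, size = 10) {
  const batches = [];
  for (const [queue, commands] of byQueue) {
    for (let i = 0; i < commands.length; i += size) {
      batches.push({ queue, commands: commands.slice(i, i + size) });
    }
  }
  return batches;
}
```

Grouping by queue first means each API call targets a single queue, which is what a batch send requires.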

Add option for unique group ids for every message in a FIFO batch

The current behavior for qdone enqueue-batch --fifo is to assign a single group id to all messages in a batch.

Sometimes you don't need to guarantee messages are ordered within a batch and would rather workers be able to pick them all up at once.

Therefore it would be useful to have a --group-id-per-message flag to ensure each message gets a different unique group id.
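
As an illustration, the flag could switch between one shared group id and a fresh id per entry. This is a sketch: buildFifoEntries and the groupIdPerMessage option are hypothetical, and using the raw command as MessageBody is an assumption about the message format.

```javascript
const { randomUUID } = require('crypto');

// Build SendMessageBatch-style entries for a FIFO queue.
function buildFifoEntries (commands, { groupIdPerMessage = false } = {}) {
  const sharedGroupId = randomUUID();
  return commands.map((command, i) => ({
    Id: String(i),
    MessageBody: command,
    // One id for the whole batch keeps ordering; one id per message lets
    // workers pick up all messages at once.
    MessageGroupId: groupIdPerMessage ? randomUUID() : sharedGroupId,
    MessageDeduplicationId: randomUUID()
  }));
}
```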

Log child process information

As long as qdone is logging jobs, it would be great to log things about the child process, like how much CPU it used, how many wall-clock seconds it used, peak memory, etc.

Maybe we should log qdone overhead as well.

Protect against double receive

SQS guarantees at least once delivery, but sometimes jobs are not idempotent and should not be executed more than once.

It would be helpful to provide an optional way for users to prevent duplicate messages using a fact store like Redis. We already have the MessageId to use as a key.

We could obtain a Redis lock, and mirror the message visibility timeout extension calls to extend the lock TTL using the same values we send to AWS, finally setting a long TTL (same as message retention) upon job finish and successful SQS delete call.

Of course, failure in any of these writes or the Redis instance could defeat the safety of this feature.
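
The locking flow can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: createLockStore is a hypothetical name, and against a real Redis instance the acquire step would be a single SET with the NX and EX options rather than a Map lookup.

```javascript
// In-memory stand-in for Redis, keyed by the SQS MessageId.
function createLockStore (now = () => Date.now()) {
  const expiry = new Map(); // messageId -> expiry time in milliseconds
  const isLive = id => expiry.has(id) && expiry.get(id) > now();
  return {
    // Returns true if this worker may run the job, false for a duplicate.
    acquire (id, ttlSeconds) {
      if (isLive(id)) return false;
      expiry.set(id, now() + ttlSeconds * 1000);
      return true;
    },
    // Mirror each visibility timeout extension, then call once more with the
    // message retention period after the job finishes and the delete succeeds.
    extend (id, ttlSeconds) {
      expiry.set(id, now() + ttlSeconds * 1000);
    }
  };
}
```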

Missing region in config

I'm getting this ConfigError from aws-sdk.
It seems like it's not taking the region from the .aws/credentials file.

I fixed it by adding a new environment variable called AWS_REGION to the project.

Error [ConfigError]: Missing region in config
at Request.VALIDATE_REGION (.../node_modules/qdone/node_modules/aws-sdk/lib/event_listeners.js:92:45)
at Request.callListeners (..../node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
at callNextListener (..../node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:96:12)
at ..../node_modules/qdone/node_modules/aws-sdk/lib/event_listeners.js:86:9
at finish (..../node_modules/qdone/node_modules/aws-sdk/lib/config.js:349:7)
at ..../node_modules/qdone/node_modules/aws-sdk/lib/config.js:391:9
at Object.<anonymous> (..../node_modules/qdone/node_modules/aws-sdk/lib/credentials/credential_provider_chain.js:111:13)
at Object.arrayEach (..../node_modules/qdone/node_modules/aws-sdk/lib/util.js:516:32)
at resolveNext (..../node_modules/qdone/node_modules/aws-sdk/lib/credentials/credential_provider_chain.js:110:20)
at ..../node_modules/qdone/node_modules/aws-sdk/lib/credentials/credential_provider_chain.js:126:13 {
  code: 'ConfigError',
  time: 2022-12-21T16:32:00.266Z
}

QueueDoesNotExist: The specified queue does not exist.

Sentry Issue: QDONE-F

When used with idle-queues --delete, we often see temporary failure to resolve queues in _cheapIdleCheck(). These should be able to be safely ignored, as they will disappear at the next listening round. Catch this error instead of throwing.

QueueDoesNotExist: The specified queue does not exist.
  File "/usr/lib/node_modules/qdone/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_json1_0.js", line 1537, in de_QueueDoesNotExistRes
    const exception = new models_0_1.QueueDoesNotExist({
  File "/usr/lib/node_modules/qdone/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_json1_0.js", line 607, in de_GetQueueAttributesCommandError
    throw await de_QueueDoesNotExistRes(parsedOutput, context);
  File "node:internal/process/task_queues", line 96, in processTicksAndRejections
  File "/usr/lib/node_modules/qdone/node_modules/@smithy/middleware-serde/dist-cjs/index.js", line 35, in <anonymous>
    const parsed = await deserializer(response, options);
  File "/usr/lib/node_modules/qdone/node_modules/@smithy/core/dist-cjs/index.js", line 165, in <anonymous>
    const output = await next({
...
(5 additional frame(s) were not displayed)
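
A sketch of catching the error instead of throwing. It assumes the error exposes either the name QueueDoesNotExist seen in the trace above or the older code AWS.SimpleQueueService.NonExistentQueue; isMissingQueueError and attributesOrNull are hypothetical helpers, and getAttributes stands in for whatever call fetches the queue attributes.

```javascript
// True for the "queue is gone" errors, under either naming.
function isMissingQueueError (err) {
  return Boolean(err) && (
    err.name === 'QueueDoesNotExist' ||
    err.code === 'AWS.SimpleQueueService.NonExistentQueue'
  );
}

// Resolve to null for a missing queue so the caller can skip it;
// any other error is rethrown unchanged.
async function attributesOrNull (getAttributes, queueUrl) {
  try {
    return await getAttributes(queueUrl);
  } catch (err) {
    if (isMissingQueueError(err)) return null;
    throw err;
  }
}
```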

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on Greenkeeper branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because it uses your CI build statuses to figure out when to notify you about breaking changes.

Since we didn’t receive a CI status on the greenkeeper/initial branch, it’s possible that you don’t have CI set up yet.
We recommend using:

If you have already set up a CI for this repository, you might need to check how it’s configured. Make sure it is set to run on all new branches. If you don’t want it to run on absolutely every branch, you can whitelist branches starting with greenkeeper/.

Once you have installed and configured CI on this repository correctly, you’ll need to re-trigger Greenkeeper’s initial pull request. To do this, please click the 'fix repo' button on account.greenkeeper.io.

Add --delay option

SQS provides a feature where message delivery can be delayed up to 15 minutes (900 seconds) per message. It would be great to provide a --delay <seconds> option to qdone enqueue|enqueue-batch to use this feature.

Use cases include rate limits, exponential backoff throttles and scheduling events.
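
A possible shape for the option handling, as a sketch. parseDelay and withDelay are hypothetical names; DelaySeconds is the SQS send parameter, and the 900-second ceiling comes from the limit stated above.

```javascript
const MAX_DELAY_SECONDS = 900;

// Validate the --delay argument and return it as an integer.
function parseDelay (value) {
  const text = String(value).trim();
  const seconds = Number(text);
  if (text === '' || !Number.isInteger(seconds) || seconds < 0 || seconds > MAX_DELAY_SECONDS) {
    throw new Error(`--delay must be an integer from 0 to ${MAX_DELAY_SECONDS}, got: ${value}`);
  }
  return seconds;
}

// Return a copy of the send parameters with the delay applied.
function withDelay (params, delay) {
  return { ...params, DelaySeconds: parseDelay(delay) };
}
```

One caveat to check before implementing: FIFO queues only support a delay set on the queue, not per message, so the option would likely need to be rejected or handled differently together with --fifo.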

Deleting idle queues doesn't work properly with FIFO

When using idle-queues --delete, the normal behavior is to delete failed queues along with their normal counterpart.

SQS's naming scheme for FIFO queues (appending .fifo) messes this up, because our normal _failed becomes _failed.fifo. This is handled properly in enqueue and worker but not in idle-queues.

The fix is to generate the failed queue names properly in the idle-queues command.
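
The naming rule can be sketched as follows. failedQueueName is a hypothetical helper, and _failed is assumed as the default suffix based on the description above.

```javascript
// Insert the fail suffix before a trailing .fifo, otherwise append it.
function failedQueueName (queueName, failSuffix = '_failed') {
  const fifoSuffix = '.fifo';
  if (queueName.endsWith(fifoSuffix)) {
    return queueName.slice(0, -fifoSuffix.length) + failSuffix + fifoSuffix;
  }
  return queueName + failSuffix;
}
```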

Reset SQS message visibility after failed job

The current algorithm does exponential backoff to request new visibility timeouts, but can leave a failed job invisible for the duration of the timeout.

Since we know when a job fails, it would be good to make the message visible again in this case.
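
One way to do this is a ChangeMessageVisibility call with a timeout of zero, which makes the message visible again immediately. The sketch below only builds the request parameters; resetVisibilityParams is a hypothetical name, and how the request is sent depends on the SDK version in use.

```javascript
// Parameters for a ChangeMessageVisibility request that releases the message.
function resetVisibilityParams (queueUrl, receiptHandle) {
  return {
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: 0 // 0 seconds: visible to other consumers right away
  };
}
```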

DLQ (dead letter queue) support

qdone's built-in failed queues are nice, but if a job repeatedly fails, it can be useful to get a developer's attention by sending it to a master failed queue after some number of attempts (say 3 or 5). Furthermore, it may be an advantage to have dynamically named DLQs, so this allows for that as well.

Add an option --dlq-name NAME and --dlq-after 5 to activate DLQ support on failed queues.
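
In SQS terms, these two options would map onto a queue's RedrivePolicy attribute. A minimal sketch, where dlqAttributes is a hypothetical helper and the ARN of the DLQ is assumed to have been resolved from --dlq-name beforehand:

```javascript
// Build the queue attribute that sends messages to a DLQ after
// `dlqAfter` receives. SQS expects RedrivePolicy as a JSON string.
function dlqAttributes (dlqArn, dlqAfter) {
  return {
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: dlqAfter
    })
  };
}
```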


In SureDone, there are several scenarios where this is currently a potential problem:

  • channel imports trigger creation of dynamic, user-based, product-level import queues
    • those jobs fail for some reason and are never caught, because the failed queues do not send to dead letter queues
  • bulk jobs still mysteriously don't complete sometimes
    • bulk jobs are put on dynamically created user-based queues, and failures are invisible without a DLQ
  • critical sold action inventory updates are put on dynamically created user-based queues
    • we currently have no visibility into if/when these processes fail without a DLQ

Option to listen to multiple queues in parallel instead of in sequence

Right now, qdone listens on one queue for --wait-time seconds, then moves on to the next queue.

We could get lower latency by listening to all queues, returning when we find data on any of them, and abort()ing the listen requests to the other queues.

This could potentially starve queues if one consistently wins the race for returning data first.
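
A sketch of the race, assuming a receive(queue, signal) function that resolves to an array of messages and rejects when its AbortSignal fires. listenAny and receive are hypothetical names; the sketch does nothing about the starvation concern.

```javascript
// Listen on every queue at once and return the first one that has messages.
// Resolves to null if no queue returns data.
async function listenAny (queues, receive) {
  const controllers = queues.map(() => new AbortController());
  const attempts = queues.map((queue, index) =>
    receive(queue, controllers[index].signal).then(messages => {
      // Treat an empty result as a loss so Promise.any keeps waiting.
      if (!messages || messages.length === 0) throw new Error('empty');
      return { queue, messages, index };
    }));
  try {
    const winner = await Promise.any(attempts);
    // Cancel the listen requests that are still pending on other queues.
    controllers.forEach((controller, i) => {
      if (i !== winner.index) controller.abort();
    });
    return winner;
  } catch (err) {
    return null; // every attempt was empty or failed
  }
}
```

Rotating or shuffling the queue order between rounds would be one possible way to reduce starvation, but that is untested here.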

exception deleting failed queues in paired mode when fail queue does not exist

When making a call like qdone idle-queues --delete 'test*', I'm getting a failure halfway through the list of queues because one of the failed queues does not exist:

AWS.SimpleQueueService.QueueDeletedRecently: You must wait 60 seconds after deleting a queue before you can create another with the same name.
at Request.extractError (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/protocol/query.js:47:29)
at Request.callListeners (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:105:20)
at Request.emit (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:77:10)
at Request.emit (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:683:14)
at Request.transition (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:22:10)
at AcceptorStateMachine.runTo (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/state_machine.js:14:12)
at /usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/state_machine.js:26:10
at Request.<anonymous> (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:38:9)
at Request.<anonymous> (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/request.js:685:12)
at Request.callListeners (/usr/lib/node_modules/qdone/node_modules/aws-sdk/lib/sequential_executor.js:115:18)

Qdone should ignore this non-existence and keep going. Maybe print a message that the queue does not exist.

Worker children not killed immediately when reaching --kill-after

Observed in production that child processes do not exit and become orphaned once --kill-after is reached. Saw this on the command line.

Also see separate orphaned child processes on some production machines.

Theory on what's happening:

qdone uses child_process.exec (not to be confused with exec(3)) to execute children within a shell.

It seems like exec's timeout option (that we use to send SIGTERM to the child if it reaches the timeout) does not actually kill the child of the shell (observed on Ubuntu 16.04).

Seems like a fairly useless option for exec. Maybe a better option for execFile?

Workarounds:

  1. Kill the qdone process group?
  2. Find the PID of the child (not the child shell, but shell's child) and manually signal that.

I'm starting this ticket to record my findings, but it may be worth checking node issues as well to see if anybody has run into this strange design on exec.
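
Workaround 1 could be sketched like this, as a variant that signals the child's own process group rather than qdone's. It assumes a POSIX system; runWithKillAfter is a hypothetical name. Spawning with detached: true puts the shell and its children in their own process group, and a negative pid passed to process.kill signals that whole group.

```javascript
const { spawn } = require('child_process');

// Run `command` in a shell and send SIGTERM to its whole process group
// if it is still running after `killAfterMs` milliseconds.
function runWithKillAfter (command, killAfterMs, onExit) {
  const child = spawn(command, { shell: true, detached: true, stdio: 'ignore' });
  const timer = setTimeout(() => {
    try {
      process.kill(-child.pid, 'SIGTERM'); // negative pid targets the group
    } catch (err) {
      // The group has already exited, so there is nothing left to signal.
    }
  }, killAfterMs);
  child.on('error', err => {
    clearTimeout(timer);
    onExit(null, null, err);
  });
  child.on('exit', (code, signal) => {
    clearTimeout(timer);
    onExit(code, signal);
  });
  return child;
}
```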

Log successful jobs

Right now when --verbose is not set, qdone logs failed jobs. It would be nice to log successful jobs as well.

--active-only expensive when many queues are active, add caching option

After switching to --active-only on jobs that have a large number of dynamic queues, we notice that we start spending a lot of money on GetQueueAttributes calls:

(Screenshot attached: Screen Shot 2019-06-25 at 12 05 01 PM)

This makes sense when you compare the API call complexity of --active-only with the base case: when a (the number of active queues) is high, so is the number of calls:

Context                                                      | Calls            | Details
qdone worker (while listening, per listen round)             | n + (1 per n×w)  | w: --wait-time in seconds; n: number of queues
qdone worker (while listening with --active-only, per round) | 2n + (1 per a×w) | w: --wait-time in seconds; a: number of active queues

However the state of the active queues is very cacheable, especially if queues tend to have large backlogs, as ours do.

I propose we add three options:

  • --cache-url that takes a redis://... cluster url [no default]
  • --cache-ttl-seconds that takes a number of seconds [default 10]
  • --cache-prefix that defines a cache key prefix [default qdone]

The presence of the --cache-url option will cause the worker to cache the GetQueueAttributes result for each queue for the specified TTL. We can probably use MGET for this, if we're careful about key slots.
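
The caching behaviour can be sketched with an in-memory Map standing in for Redis. createAttributeCache is a hypothetical name, the key layout is an assumption, and the defaults mirror the proposed --cache-prefix and --cache-ttl-seconds values.

```javascript
// In-memory stand-in for the proposed Redis cache of queue attributes.
function createAttributeCache ({ prefix = 'qdone', ttlSeconds = 10, now = () => Date.now() } = {}) {
  const store = new Map(); // key -> { attributes, expiresAt }
  const keyFor = queueName => `${prefix}:${queueName}`;
  return {
    keyFor,
    // Returns the cached attributes, or undefined on a miss or expired entry.
    get (queueName) {
      const hit = store.get(keyFor(queueName));
      return hit && hit.expiresAt > now() ? hit.attributes : undefined;
    },
    set (queueName, attributes) {
      store.set(keyFor(queueName), { attributes, expiresAt: now() + ttlSeconds * 1000 });
    }
  };
}
```

A worker would call get first and only issue GetQueueAttributes, followed by set, on a miss.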

enqueue-batch fails when command lines exceed 256k for one batch

Steps to reproduce

  1. Create a file with 10 lines, where each command is 100 KB long

  2. Try to enqueue it

$ ./qdone enqueue-batch too-big.txt
Creating fail queue test_failed
Creating queue test
TypeError: Cannot read property 'Failed' of undefined
    at /Users/ryan/src/qdone-master/src/enqueue.js:128:17
    at _fulfilled (/Users/ryan/src/qdone-master/node_modules/q/q.js:854:54)
    at self.promiseDispatch.done (/Users/ryan/src/qdone-master/node_modules/q/q.js:883:30)
    at Promise.promise.promiseDispatch (/Users/ryan/src/qdone-master/node_modules/q/q.js:816:13)
    at /Users/ryan/src/qdone-master/node_modules/q/q.js:624:44
    at runSingle (/Users/ryan/src/qdone-master/node_modules/q/q.js:137:13)
    at flush (/Users/ryan/src/qdone-master/node_modules/q/q.js:125:13)
    at _combinedTickCallback (internal/process/next_tick.js:73:7)
    at process._tickDomainCallback (internal/process/next_tick.js:128:9)
bash: update_terminal_cwd: command not found

Expected behavior

Large command lines should enqueue in an appropriate number of API calls (2 commands per API call, in this test case).
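
A sketch of size-aware batching, assuming the 256 KB per-request limit from the issue title and the 10-entry limit of a batch send. splitIntoBatches is a hypothetical helper, and it counts only the command bytes, ignoring any per-message overhead qdone may add.

```javascript
const MAX_BATCH_BYTES = 256 * 1024;
const MAX_BATCH_ENTRIES = 10;

// Group commands so that no batch exceeds the byte or entry limits.
function splitIntoBatches (commands) {
  const batches = [];
  let current = [];
  let currentBytes = 0;
  for (const command of commands) {
    const bytes = Buffer.byteLength(command, 'utf8');
    if (bytes > MAX_BATCH_BYTES) throw new Error('Command exceeds the maximum message size');
    // Start a new batch when adding this command would break a limit.
    if (current.length === MAX_BATCH_ENTRIES || currentBytes + bytes > MAX_BATCH_BYTES) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(command);
    currentBytes += bytes;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

With ten commands of 100 KB each, this yields five batches of two commands, matching the expected behavior above.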

FIFO option

Some users need the ability to make queues FIFOs. The enqueue command should gain a --fifo option.

What should we do about fail queues in this case?

If user explicitly listens on a queue, it should always be resolved

Right now qdone worker without --always-resolve ignores queues that don't exist at invocation. This runs contrary to user expectations that if a queue is listed explicitly, we should listen on it.

We may want to think about treating wildcard queues and explicitly listed queues as separate concepts internally.

Add --retry-count option

This would control the number of failures allowed before the queue sends jobs to failed.

usage: --retry-count <number>
