stackstorm / orquesta Goto Github PK

View Code? Open in Web Editor NEW

98.0 20.0 39.0 1.98 MB

Orquesta is a graph based workflow engine for StackStorm. Questions? https://github.com/StackStorm/st2/discussions

Home Page: https://docs.stackstorm.com/orquesta/

License: Apache License 2.0

Makefile 0.16% Python 99.84%

workflow-engine workflow-execution stackstorm orquesta workflows devops automation orchestration

orquesta's People

Contributors

Stargazers

Watchers

orquesta's Issues

Nesting When statements in Orquesta

Right now when writing check logic in orquesta I have to have a when block for each conditional. It's not graceful and harder to manage.
For ex:
Instead of ,

When ( succeeded)
      When ( x=true)
           Do thing
       When (y= true)
            Do thang

We have to do

When ( succeeded and x=true)
   do thing
When (succeeded and y = true)
   do thang

It's not clean and harder to read/maintain. Especially in some of our workflows the conditionals are large and you end up having many statements with repeated keywords in them.

YAQL methods can't handle an array of dict value at 'items' of 'with' parameter

An execution of following workflow would fail with this error message when user specifes an array of dict value.

'YaqlEvaluationException: Unable to evaluate expression ''<% ctx(array_input).distinct() %>''. TypeError: unhashable type: ''BaseDict'

Metadata file (test_input.yaml)

---
name: test_input
description: Test action to check input parameter would be handled correctly
runner_type: orquesta
entry_point: workflows/test_input.yaml
enabled: true
parameters:
  array_input:
    type: array

Orquesta file (workflows/test_input.yam)

version: 1.0

input:
  - array_input

description: Intermediate workflow to check performance

tasks:
  task:
    action: core.echo
    with:
      items: <% ctx(array_input).distinct() %>
    input:
      message: <% item() %>

Execution result

Environment of st2

Note

This problem was happend at calling YAQL method 'distinct' to evaluate array of dict value which was specified in 'items' parameter.

Jinja - 'referenced before assignment' with if/else blocks

Summary

When orquesta tries to graph a run of a workflow, it will error when a variable is referenced before its defined (yay!)
However, if the variable that isn't defined, is inside of a Jinja IF/ELSE block that wouldn't be selected anyway, it still errors.

This behavior is different from Mistral. I'm not sure if this is intentional, or what the preferred behavior should be. I could see arguments either way.

Reproduction

---
version: '1.0'
input:
  - payload
tasks:
  start:
    action: core.noop
    next:
      - when: "{{ ctx(payload).hostname is defined }}"
        do:
          - task1
        publish:
          - hostname: "{{ ctx(payload).hostname }}"
          - method: host
      - when: "{{ ctx(payload).client is defined }}"
        do:
          - task1
        publish:
          - client: "{{ ctx(payload).client }}"
          - method: client
  task1:
    action: core.local
    input:
      cmd: "echo '{% if ctx(method) == 'client' %}{{ ctx(client) }}{% elif ctx(method) == 'host' %}{{ ctx(hostname) }}{% endif %}'"

Input for payload

payload: {"hostname":"myhost"}

Expected results

My expectations were that this would work fine as it does in mistral, as this was discovered while I was migrating this Workflow from Mistral to Orquesta.

Observed Results

This WF fails to run because the graph fails to be built.

{
  "output": null,
  "errors": [
    {
      "language": "jinja",
      "type": "context",
      "spec_path": "tasks.task1.input",
      "schema_path": "properties.tasks.patternProperties.^\\w+$.properties.input",
      "message": "Variable \"client\" is referenced before assignment.",
      "expression": "echo '{% if ctx(method) == 'client' %}{{ ctx(client) }}{% elif ctx(method) == 'host' %}{{ ctx(hostname) }}{% endif %}'"
    },
    {
      "language": "jinja",
      "type": "context",
      "spec_path": "tasks.task1.input",
      "schema_path": "properties.tasks.patternProperties.^\\w+$.properties.input",
      "message": "Variable \"hostname\" is referenced before assignment.",
      "expression": "echo '{% if ctx(method) == 'client' %}{{ ctx(client) }}{% elif ctx(method) == 'host' %}{{ ctx(hostname) }}{% endif %}'"
    }
  ]
}

Workaround

Defining all possible vars that are conditionally created with null in the vars: section so that they exist, even if they won't be updated (publish) and used.

---
version: '1.0'
input:
  - payload
vars:
  - hostname: null
  - client: null
tasks:
   ...

current task is not in the context when evaluating expressions in task action and input

The current task does not seem to be set correctly at the time when the YAQL and Jinja expressions are being evaluated for task action and input.

Workflow join is not triggered on complete for failed task(s)

Workflows that have long running tasks in sub-workflows are not waited to complete before the top layer workflow completes. I tried join: all and join: 2 based on other issues on this repo but got the same result each time. Given the following workflows and action files:

ST2 Version:

# st2 --version
st2 3.1.0, on Python 2.7.5

Master workflow:
Action orquesta-join.yaml:

---
description: A basic workflow that demonstrate branching and join.
enabled: true
runner_type: orquesta
entry_point: workflows/orquesta-join.yaml
name: orquesta-join
pack: encore

Workflow orquesta-join.yaml:

---
version: 1.0

description: A basic workflow that demonstrate branching and join.

tasks:
  task1:
    action: core.noop
    next:
      - when: <% completed() %>
        do: task2, task3

  task2:
    action: encore.orquesta-join-4
    next:
      - when: <% completed() %>
        do: finish

  task3:
    action: encore.orquesta-join-2
    next:
      - when: <% completed() %>
        do: finish

  finish:
    # This did not work with either 2 or all
    # join: all
    join: 2
    action: core.noop
    next:
      - do: noop

Sub workflows:
Action orquesta-join-2.yaml:

---
description: A basic workflow that demonstrate branching and join.
enabled: true
runner_type: orquesta
entry_point: workflows/orquesta-join-2.yaml
name: orquesta-join-2
pack: encore

Workflow orquesta-join-2.yaml:

---
version: 1.0

description: A basic workflow that demonstrate branching and join.

tasks:
  task1:
    action: core.echo message="Fee fi"
    next:
      - when: <% succeeded() %>
        do: task2

  task2:
    action: encore.orquesta-join-3

Action orquesta-join-3.yaml:

---
description: A basic workflow that demonstrate branching and join.
enabled: true
runner_type: orquesta
entry_point: workflows/orquesta-join-3.yaml
name: orquesta-join-3
pack: encore

Workflow orquesta-join-3.yaml:

---
version: 1.0

description: A basic workflow that demonstrate branching and join.

tasks:
  task1:
    action: core.echo message="Fee fi"
    next:
      - when: <% succeeded() %>
        do: task2

  task2:
    action: core.local
    input:
      cmd: "sleep 100"

Action orquesta-join-4.yaml:

---
description: A basic workflow that demonstrate branching and join.
enabled: true
runner_type: orquesta
entry_point: workflows/orquesta-join-4.yaml
name: orquesta-join-4
pack: encore

Workflow orquesta-join-4.yaml:

---
version: 1.0

description: A basic workflow that demonstrate branching and join.

tasks:
  task1:
    delay: 20
    action: core.echo message="Fee fi"
    next:
      - when: <% succeeded() %>
        do: task2

  task2:
    action: encore.orquesta-join-3

Error:

st2workflowengine.log

2020-03-09 10:58:07,904 140625357435120 ERROR workflows [-] [5e66592949b3d2060f42bc78] Workflow execution completed with errors. (task_id=u'task2',type=u'error',result={u'output': None, u'errors': []},message=u'Execution failed. See result for details.')
2020-03-09 10:58:08,005 140625358292784 INFO workflows [-] [5e66592749b3d26f3749e3be] Identifying next set (0) of tasks after completion of task "task2", route "0".
2020-03-09 10:58:08,007 140625358292784 INFO workflows [-] [5e66592749b3d26f3749e3be] No tasks identified to execute next.
2020-03-09 10:58:08,108 140625358292784 ERROR workflows [-] [5e66592749b3d26f3749e3be] Workflow execution completed with errors. (task_id=u'task3',type=u'error',result={u'output': None, u'errors': [{u'message': u'Execution failed. See result for details.', u'type': u'error', u'result': {u'output': None, u'errors': []}, u'task_id': u'task2'}]},message=u'Execution failed. See result for details.')
2020-03-09 10:58:08,109 140625358292784 ERROR workflows [-] [5e66592749b3d26f3749e3be] Workflow execution completed with errors. (task_id=u'task2',type=u'error',result={u'output': None, u'errors': [{u'message': u'Execution failed. See result for details.', u'type': u'error', u'result': {u'output': None, u'errors': []}, u'task_id': u'task2'}]},message=u'Execution failed. See result for details.')
2020-03-09 10:58:08,249 140625358292784 ERROR workflows [-] [5e66592749b3d26f3749e3be] Workflow execution completed with errors. (task_id=u'task3',type=u'error',result={u'output': None, u'errors': [{u'message': u'Execution failed. See result for details.', u'type': u'error', u'result': {u'output': None, u'errors': []}, u'task_id': u'task2'}]},message=u'Execution failed. See result for details.')
2020-03-09 10:58:08,250 140625358292784 ERROR workflows [-] [5e66592749b3d26f3749e3be] Workflow execution completed with errors. (task_id=u'task2',type=u'error',result={u'output': None, u'errors': [{u'message': u'Execution failed. See result for details.', u'type': u'error', u'result': {u'output': None, u'errors': []}, u'task_id': u'task2'}]},message=u'Execution failed. See result for details.')

Add something to force an action to succeed even when failed

https://docs.stackstorm.com/orquesta/languages/orquesta.html#engine-commands

One idea could be to add an additional option for pass.

Use Case:

sometimes actions complete gracefully but exit with a fail code. Sometimes this is because a resource wasn't found or some other reason.

For example: st2.kv.get

Sometimes you check for a key to exist. If this is the last action in a workflow, it can cause the workflow to complete as a fail even though it all other operations were success and you aren't actually concerned with this action failing. A work around I've used is having an action called its_ok_if_this_fails for core.noop. This allows the workflow to still complete as a success.

With an engine-command of pass, I wouldn't need this extra step, i could simply have

task1:
  action: core.noop
  next:
    - do: pass
      when: <% failed() %>

Where as currently i need

task1:
  action: core.noop
  next:
    - do: its_ok_if_this_fails
      when: <% failed() %>

its_ok_if_this_fails:
  action: core.noop

If something like this doesn't fit or work for some reason, I would be interested in any ideas that could help address the use case. Thank you!

Exception running workflow - 'ValueError: malformed node or string: <_ast.BinOp object at 0x7f80d29946d8>'

I haven't been able to find where this error is being generated from. As far as I can tell the action follows format and syntax correctly.

Action from workflow that is failing

  send_maintenance_start:
    action: email.send_email
    input:
      account: smarthost
      email_from: "no-reply@XXXXXX"
      email_to: "carlos@XXXXXX"
      subject: Incident detected on <% ctx("hostname") %>
      message: Automatic reboot of <% ctx("hostname") %> to recover from host down.
    next:
      - do: disable_users

Exception from workflowegine log.

2020-01-17 23:33:43,532 140191264317064 INFO (unknown file) [-] [5e22445438b7e59d28ade4fb] Mark task "send_maintenance_start", route "0", in conductor as running.
2020-01-17 23:33:43,580 140191264317064 INFO (unknown file) [-] [5e22445438b7e59d28ade4fb] Requesting execution for task "send_maintenance_start", route "0".
2020-01-17 23:33:43,580 140191264317064 INFO (unknown file) [-] [5e22445438b7e59d28ade4fb] Processing task execution request for task "send_maintenance_start", route "0".
2020-01-17 23:33:43,589 140191264317064 INFO (unknown file) [-] [5e22445438b7e59d28ade4fb] Task execution "5e22445738b7e59c9835e3ff" created for task "send_maintenance_start
", route "0".
2020-01-17 23:33:43,599 140191264317064 WARNING (unknown file) [-] Determining if exception <class 'ValueError'> should be retried.
2020-01-17 23:33:43,599 140191264317064 WARNING (unknown file) [-] Determining if exception <class 'ValueError'> should be retried.
2020-01-17 23:33:43,599 140191264317064 ERROR (unknown file) [-] [5e22445438b7e59d28ade4fb] Failed action execution(s) for task "send_maintenance_start", route "0". malforme
d node or string: <_ast.BinOp object at 0x7f80d29946d8>
Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/util/casts.py", line 36, in _cast_object
    return json.loads(x)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Failure to publish True to a variable using shorthand format

Given the following workflow where there is a shorthand publish x=True, the output of the workflow result in x=false. The camel case True is not accepted in the assignment. If the publish uses all lowercase true, then the output is as expected.

version: 1.0

input:
  - x: false

output:
  - x: <% ctx().x %>

tasks:
  task1:
    action: core.echo message=<% str(ctx().x) %>
    next:
      - publish: x=True

This problem does not exist if publish is done normally in non-shorthand format like the workflow below.

version: 1.0

input:
  - x: false

output:
  - x: <% ctx().x %>

tasks:
  task1:
    action: core.echo message=<% str(ctx().x) %>
    next:
      - publish:
         - x: True

Workflow stuck in running state

If the action is not listed syntactically correct the workflow will stay in a running state forever:
workflow:

version: 1.0
tasks:
  add_domains_to_ticket:
    with:
      items: "domain  in {{ctx().mitigate_domains }}"
      action: "core.echo message={{item(domain)}}"



input:
  - mitigate_domains
  - ticket: null

meta:

pack: intel
enabled: true
runner_type: orquesta
name: ticket_add_domains_mitigate
parameters:
  mitigate_domains:
    type: "array"
    items:
      type: "string"
  ticket:
    type: integer
entry_point: workflows/ticket_add_domains_mitigate.yaml

the fix is:

tasks: add_domains_to_ticket:
  with:
    items: "domain in {{ctx().mitigate_domains }}"
  action: "core.echo message={{item(domain)}}"

Passing true to decrypt of st2kv function does not work

Taking the examples.orquesta-st2kv workflow as an example, https://github.com/StackStorm/st2/blob/master/contrib/examples/actions/workflows/orquesta-st2kv.yaml. If the expression <% st2kv(ctx().key_name, decrypt=ctx().decrypt) %> is changed to <% st2kv(ctx().key_name, decrypt=true) %>, the value is not decrypted. If the literal true is explicitly set to bool like <% st2kv(ctx().key_name, decrypt => bool('true')) %>, then the value is decrypted.

Value in quotes in shorthand publish should be evaluated as string type

For shorthand publish such as x="false" and y="1234" where the value is in quotes (single or double), the value should be kept as string type and not be auto-converted to bool, integer, or other types.

Add ability in task spec to wait for a lock before proceeding

There are use cases where users want to synchronize access to a resource (i.e. server) across multiple and different workflow executions. Currently, StackStorm offers concurrency policies to synchronize executions for one given workflow.

The following proposal allows specification of the lock requirement across different workflow definition so synchronization is possible against different workflow executions on the same resource(s). In the proposal below, a new wait attribute is introduced to the task spec. The wait attribute takes the name of the lock, which can be a unique name or name of the resource. The task will sleep/delay until the lock is acquired. On exit from the task scope, the lock will be released.

An additional wait delay can be specified. If the lock is not acquired within the wait period, the task will abandon the wait and exit with failure. On failure, the task transition will be traversed as normal so that clean up or retry is possible.

tasks:
  task1:
    wait: <% ctx().resource_lock_name %>
    action: ...

tasks:
  task1:
    wait:
      lock: <% ctx().resource_lock_name %>
      delay: 600
    action: ...

The mechanism will require on external platform such as redis which is currently already being used for various use cases in StackStorm (required for concurrency policies and running complex branching workflows, etc.).

The ujson 2.0.x doesn't compatible with Orquesta

The ujson library that orquesta uses major upgraded on Mar 8, 2020. But it doesn't compatible with current Orquesta.
(c.f. https://pypi.org/project/ujson/#history)

Following code is am example to confirm it.

from orquesta import conducting
from orquesta.specs import native as native_specs
from orquesta import statuses
from orquesta.utils import jsonify as json_util

wf_def = """
version: 1.0

vars:
  - xs:
    - fee

tasks:
  task1:
    with: <% ctx(xs) %>
    action: core.echo message=<% item() %>
"""

spec = native_specs.WorkflowSpec(wf_def)

conductor = conducting.WorkflowConductor(spec)
conductor.request_workflow_status(statuses.RUNNING)

actual_tasks = conductor.get_next_tasks()
json_util.deepcopy(actual_tasks)

Case of using ujson(v1.35)

Case of using ujson(v2.0.0)

Case of using ujson(v2.0.2 (current latest version))

Streaming and tailing output

Add support of streaming and tailing output (stream log and progress entries).

Join ALL and conditional branches conflict？

I want to use conditions to determine which paths in the workflow to take, and finally use join all to summarize all the results that meet the criteria.
However, the final step did not execute .

Cleanup/rollback and fail workflow on task error

There is a common use case to launch cleanup/rollback and fail the workflow on task error.

Currently for the workflow below, it doesn't work. When task1 fails and the transition is evaluated, the fail command in the transition will immediately fail the workflow without running cleanup_task.

task1:
  action: mock.some_action_that_will_fail
  next:
    - when: <% succeeded() %>
       do:
          - cleanup_task
    - when: <% failed() %>
       do:
          - cleanup_task
          - fail

To workaround, users would have to do the following.

task1:
  action: mock.some_action_that_will_fail
  next:
    - when: <% succeeded() %>
       do:
          - cleanup_task
    - when: <% failed() %>
       do:
          - cleanup_task_on_failure

cleanup_task:
  action: mock.some_cleanup_action

cleanup_task_on_failure:
  action: mock.some_cleanup_action
  next:
    - do: fail

The above makes for a long winded approach. We want to launch the cleanup task first, set the workflow status to fail, and then let the cleanup task run to completion.

Add function to identify execution routes for a workflow

Add a function that traverses the workflow execution graph and identify the possible execution routes from the graph leaves. This is required by st2flow to display specific routes in the runtime execution graph.

Implement rerun from specific task(s)

Implement rerun of workflow execution from specific task(s). See the command st2 execution re-run <execution-id> --tasks and Mistral runner on how this is implemented. Rerunning from a task should continue with the existing workflow execution record. The orquesta runner should determine if the specified task can be rerun and then request task execution on the workflow engine. Please note that this is unlike rerun of python/shell actions or rerun workflow execution from the start which will create a new database record for the action execution.

Increasing number of items increases execution time exponentially

For the given workflow definition: https://gist.github.com/batk0/32e03d360903cdfcdb8075267419b892 execution time increases exponentially with increasing number of items.

For example:
14 items = 84s
32 items = 181s
76 items = 813s
101 items = 1566s

Also CPU usage of st2workflowengine is sustained at 100% while WF is running

Next task tasks skipped on failure of one of the parallel tasks under current task. Why?

Actual behavior

Expected behavior

transition_ticket has no terminate/fail action

Stackstorm workflow output:

{
  "output": {
    "stdout": "Finished!!!"
  },
  "errors": [
    {
      "message": "Execution failed. See result for details.",
      "type": "error",
      "result": {
        "stdout": "",
        "result": "JiraError HTTP Invalid transition name. None
\t",
        "stderr": "",
        "exit_code": 0
      },
      "task_id": "transition_ticket"
    }
  ]
}

Task Retry in Workflow Definition

SUMMARY

When orquesta workflow tasks fail, there should be a built-in syntax solution for retrying. Today our production workflows make use of a combination of decrementing then publishing retry counter var, and before routing to the task again. Retry is listed as an upcoming feature in the documentation and I'm interested in helping with its implementation.

ISSUE TYPE

Feature Idea

Workflow succeeded before the join task is executed

Given the following workflow, where more than one tasks transition to this complete task regardless of success or failure, the workflow will succeed after task1 and task2 are complete and the complete task will not be executed. The join: all is not satisfied and the conductor thinks that there is no more task to execute.

version: 1.0

tasks:
  task1:
    action: core.noop
    next:
      - when: <% succeeded() %>
        do: complete
      - when: <% failed() %>
        do: complete

  task2:
    action: core.noop
    next:
      - when: <% succeeded() %>
        do: complete
      - when: <% failed() %>
        do: complete

  complete:
    join: all
    action: core.noop

`Unknown function "st2kv"` when running examples.orchestra-st2kv

vagrant@st2test:~$ st2 run examples.orchestra-st2kv key_name="something"
.
id: 5b332e2e55fc8c57d28831cc
action.ref: examples.orchestra-st2kv
parameters:
  key_name: something
status: failed
start_timestamp: Wed, 27 Jun 2018 06:26:54 UTC
end_timestamp: Wed, 27 Jun 2018 06:26:54 UTC
result:
  errors:
  - message: Unknown function "st2kv"
    task_id: create_vm
  output: null

Qrquesta sub-workflow tasks do not quickly continue after completing

Qrquesta main workflow calls out to a Qrquesta sub-workflow. The sub-workflow executes successfully but does not quickly return.

date

Fri Mar 29 22:28:52 CDT 2019

cat /etc/redhat-release

CentOS Linux release 7.6.1810 (Core)

st2 --version

st2 3.0dev (c592795), on Python 2.7.5

cat /etc/st2/st2.conf

mode = standalone
[coordination]
url = redis://:some-pass@localhost:6379

Redis enabled for workflowengine coordination.

redis-server --version

Redis server v=3.2.12 sha=00000000:0 malloc=jemalloc-3.6.0 bits=64 build=7897e7d0e13773f

The sub-workflow would normally finish execution in 15-30 seconds when called directly via st2 run.

All sub-workflow tasks exceed 100 seconds.

5c9ec7b0da6c0b4fdb2f1a25 | succeeded (223s elapsed) | fetch_server_hardware_cpu | easydcim.show_device_parts | Sat, 30 Mar 2019 01:34:40 UTC
5c9ec7b5da6c0b4fdb2f1a29 | succeeded (185s elapsed) | fetch_server_hardware_hdd | easydcim.show_device_parts | Sat, 30 Mar 2019 01:34:45 UTC
5c9ec7bbda6c0b4fdb2f1a2f | succeeded (234s elapsed) | fetch_server_hardware_ram | easydcim.show_device_parts | Sat, 30 Mar 2019 01:34:50 UTC
5c9ec7c3da6c0b4fdb2f1a35 | succeeded (265s elapsed) | fetch_server_hardware_san | easydcim.show_device_parts | Sat, 30 Mar 2019 01:34:58 UTC
5c9ec7ccda6c0b4fdb2f1a3e | succeeded (242s elapsed) | fetch_server_hardware_ssd | easydcim.show_device_parts | Sat, 30 Mar 2019 01:35:07 UTC

st2 run easydcim.show_device_parts id_num=13 type_id=20

...........
id: 5c9ed02bda6c0b66f7876196
action.ref: easydcim.show_device_parts
parameters:
id_num: 13
type_id: 20
status: succeeded
start_timestamp: Sat, 30 Mar 2019 02:10:51 UTC
end_timestamp: Sat, 30 Mar 2019 02:11:14 UTC
result:
output:

Incomplete next staged concurrent task with items if last running nested item fails.

The below workflow didn't execute Nested6 for subsequent items in listed, if it fails for any item in listed. Basically if last running concurrent nested task is failed than it marks the staged task as completed and didn't pick the next task from staged items.

I have proposed a change which worked for me in a PR #221. Where I have update the TASK_STATE_MACHINE_DATA map for current state to new state for event events.ACTION_FAILED_TASK_DORMANT_ITEMS_INCOMPLETE: statuses.RUNNING.

---
version: "1.0"
description: "Workflow desc"
input:
- "nested"
- "break_on"
- "l_c"
- "listed"
tasks:
  Start:
    action: "core.noop"
    next:
    - when: "<% succeeded() %>"
      do: "Nested6"

  Nested6:
    action: "core.local cmd='exit 1'"
    input:
      b: "<% item(list_i) %>"
      l_c: "<% ctx().l_c %>"
    next:
    - when: "<% succeeded() %>"
      do: "Join11"
    - when: "<% failed() %>"
      do: "Join12"
    with:
      items: list_i in <% ctx(listed) %>
      concurrency: 1

  Join11:
    action: "core.noop"
    next:
    - when: "<% failed() %>"
      do: "fail"
    join: "all"
  Join12:
    action: "core.noop"
    next:
    - when: "<% failed() %>"
      do: "fail"
    join: "all"

"ExpressionEvaluationException: Unable to resolve key" error should include list of available keys

Orquesta/Jinja/YAQL currently throws an error when you try to use a key that does not exist:

"YaqlEvaluationException: Unable to resolve key 'result' in expression '<% result().result.output %>' from context."

But this leaves you guessing as to what valid keys you can use.

It would greatly ease workflow development if we dumped a list of valid keys in the error message:

"YaqlEvaluationException: Unable to resolve key 'result' in expression '<% result().result.output %>' from context; must be one of 'option1', 'option2', or 'option3'."

But if we want to avoid l10n/i18n issues with lists (I'm also not sure we handle this at all), we can also simply dump an array of options:

"YaqlEvaluationException: Unable to resolve key 'result' in expression '<% result().result.output %>' from context; must be one of ['option1', 'option2', 'option3']."

Parent status stays in 'resuming' when the inquiry times out

SUMMARY

The Orquesta workflow stays in resuming status if the an inquiry within the workflow doesn't have both when: <% succeeded() %> and when: <% failed() %>.

To reproduce the issue: here

ISSUE TYPE
Bug Report

STACKSTORM VERSION
st2 2.10.3, on Python 2.7.5

Workflow fails unexpectedly (repro included)

SUMMARY

A workflow that we expect to pause is failing.

STACKSTORM VERSION

root@b7e90889c25a:/# st2 --version
st2 3.1.0, on Python 2.7.6

OS, environment, install method

Docker

Steps to reproduce the problem

here is the workflow definition

Install and run the workflow:

root@b7e90889c25a:/# st2 pack install https://github.com/trstruth/bug

        [ succeeded ] download pack
        [ succeeded ] make a prerun
        [ succeeded ] install pack dependencies
        [ succeeded ] register pack

+-------------+-----------------------------------------------+
| Property    | Value                                         |
+-------------+-----------------------------------------------+
| name        | Bug                                           |
| description | A pack of workflow repros for st2 bug reports |
| version     | 0.1.0                                         |
| author      | Tristan                                       |
+-------------+-----------------------------------------------+
root@b7e90889c25a:/# st2 run bug.branch-failure
..
id: 5d51c279cad05a01403cc430
action.ref: bug.branch-failure
parameters: None
status: failed
start_timestamp: Mon, 12 Aug 2019 19:48:09 UTC
end_timestamp: Mon, 12 Aug 2019 19:48:12 UTC
result:
  errors:
  - message: Execution failed. See result for details.
    result:
      failed: true
      return_code: 127
      stderr: 'bash: asdf: command not found'
      stdout: ''
      succeeded: false
    task_id: t2
    type: error
  - message: Execution failed. See result for details.
    result:
      failed: true
      return_code: 127
      stderr: 'bash: asdf: command not found'
      stdout: ''
      succeeded: false
    task_id: t3
    type: error
  output: null
+--------------------------+------------------------+------+------------+-------------------------------+
| id                       | status                 | task | action     | start_timestamp               |
+--------------------------+------------------------+------+------------+-------------------------------+
| 5d51c279cad05a0039b31e91 | succeeded (1s elapsed) | t1   | core.noop  | Mon, 12 Aug 2019 19:48:09 UTC |
| 5d51c27acad05a0039b31e94 | failed (1s elapsed)    | t2   | core.local | Mon, 12 Aug 2019 19:48:10 UTC |
| 5d51c27acad05a0039b31e97 | failed (1s elapsed)    | t3   | core.local | Mon, 12 Aug 2019 19:48:10 UTC |
+--------------------------+------------------------+------+------------+-------------------------------+

Expected Results

Since t3 failing should trigger an inquiry, the failure of the workflow is unexpected.

Actual Results

The workflow fails.

How to use notify with orquesta workflow and jinja expression support?

Issue: Not able to notify in slack channel with workflow output/stdout. How to use workflow context and variable in notify parameter, without writing for each of the task in the workflow

Workflow-meta.yaml

---
description: A Nessus scanner reporter validator workflow
enabled: true
runner_type: orquesta
entry_point: workflows/nessus_report_validation_workflow.yaml
name: nessus_report_validation_workflow
pack: st2pack_poc
notify:
  on-complete:
    routes:
      - slack
    message: "Action has finished"
  on-success:
    routes:
      - slack
    message: "Succeeded: {{action_results.stdout}}"
  on-failure:
    routes:
      - slack
    message: |
     Yikes!! It failed. {{ stdout }} {{ result().stdout }} <% ctx() %> <% result().issue_key %>  <% stdout %>!!
parameters:
...

Slack channel output:

Substring literal "in" breaks with items syntax

SUMMARY

The statement with: interface in <% list("1","2","3") %> fails because the string interface contains the substring in. The resulting error message is

TypeError: The value of "terface in <% list("1","2","3") %>" is not type of list.

I this this happens because it is breaking the regex here

I've created a repro workflow here

ISSUE TYPE

Bug Report

STACKSTORM VERSION

st2 2.10.1, on Python 2.7.6

OS / ENVIRONMENT / INSTALL METHOD

default docker container

decrypt_kv jinja filter fails if the key isn't in the kv store

An action file that has a secret parameter with a default value as follows:

action_param:
type: string
description: "This will fail the action"
default: "{{ st2kv.system.test_param | decrypt_kv }}"
secret: true

will fail with the following message if 'test_param' is not in the kv store:

ERROR: 400 Client Error: Bad Request
MESSAGE: Failed to render parameter "test_param": Referenced datastore item "st2kv.system.test_param" doesn't exist or it contains an empty string

Alternatively, if I pass a value that doesn't exist in the kv store but I don't try to decrypt it, then a blank string gets passed into the action parameter. I would expect an encrypted value to pass a blank string as well instead of failing.

Context variables are being overwritten on join

Given the following workflow where task1, task2, and task3 running in parallel and joining on task99

version: 1.0

description: A basic parallel workflow.

input:
  - x: false
  - y: false
  - z: false

output:
  - data:
      x: <% ctx().x %>
      y: <% ctx().y %>
      z: <% ctx().z %>

tasks:
  init:
    action: core.noop
    next:
      - do: task1, task2, task3

  task1:
    action: core.noop
    next:
      - publish: x=true
        do: task99

  task2:
    action: core.noop
    next:
      - publish: y=true
        do: task99

  task3:
    action: core.noop
    next:
      - publish: z=true
        do: task99

  task99:
    join: all
    action: core.noop

The expected output should be the following.

data:
  x: true
  y: true
  z: true

However, the actual output is this.

data:
  x: false
  y: false
  z: true

When executing task99, orquesta takes the final context from task1, task2, and task3 and merge them together to form the initial context for task99. The final context for task1 has x=true but y and z as false. The final context for task2 has y=true but x and z as false. And so forth for task3. When the context is merged sequentially, the final context for task3 overwrites those from task1 and task2 resulting in x and y being false.

"The join task is unreachable", but it should be

Hi,

Let's consider this workflow:

version: 1.0

input:
  - max_retry_count
  - retry_delay_sec

vars:
  - SOME_MSG: some msg
  - retry_count: 1

tasks:
  task_1:
    action: core.noop
    next:
      - when: <% succeeded() %>
        do:
          - task_2

  task_2:
    action: core.local
    input:
      cmd: '>&2 echo <% ctx(SOME_MSG) %> ; false'
    next:
      - when: <% failed() and ctx(SOME_MSG) in result().stderr %>
        do:
          - task_3
      - when: <% failed() and (not (ctx(SOME_MSG) in result().stderr)) and ctx(retry_count) < ctx(max_retry_count) %>
        publish:
          - retry_count: <% ctx(retry_count) + 1 %>
        do:
          - task_3
      - when: <% failed() and (not (ctx(SOME_MSG) in result().stderr)) and ctx(retry_count) >= ctx(max_retry_count) %>
        do:
          - task_6a
          - task_6b
      - when: <% succeeded() %>
        do:
          - task_5

  task_3:
    action: core.local
    input:
      cmd: 'false'
    next:
      - when: <% succeeded() %>
        do:
          - task_5
      - when: <% failed() %>
        do:
          - task_6a
          - task_6b

  task_4:
    delay: <% ctx(retry_delay_sec) %>
    action: core.noop
    next:
      - when: <% succeeded() or failed() %>
        do:
          - task_2

  task_5:
    action: core.noop

  task_6a:
    action: core.noop
    next:
      - when: <% succeeded() %>
        do:
          - task_7

  task_6b:
    action: core.noop
    next:
      - when: <% succeeded() %>
        do:
          - task_7

  task_7:
    join: 2
    action: core.local
    input:
      cmd: 'false'

Because the values are hard-coded, the workflow should always do:

task_1
task_2
task_3
task_6a and task_6b in parallel
join: task_7

However, when running this workflow, I have the following error:

{
  "output": null,
  "errors": [
    {
      "spec_path": "tasks.task_7",
      "message": "The join task \"task_7\" is unreachable.",
      "type": "semantic",
      "schema_path": "properties.tasks.patternProperties.^\\w+$"
    }
  ]
}

Note that if I remove the 3rd when of task_2 (i.e. - when: <% failed() and (not (ctx(SOME_MSG) in result().stderr)) and ctx(retry_count) >= ctx(max_retry_count) %>), the error disappears and the workflow works.

Any idea what's the problem here?

Thanks!

combined schema (anyOf, allOf, oneOf) error message enhancement

I think this goes here, but if it goes on StackStorm/st2, I can file the issue there.
This issue is about improving the error message returned when a schema validation error occurs due to not matching a oneOf schema.

I ran an orquesta workflow and for the task transition I used the incorrect:

publish:
  message: foobar

instead of the correct:

publish:
  - message: foobar

Doing that gave me an error message like:

    {
      "spec_path": "tasks.validate_input.next[0].publish",
      "message": "{'error_message': 'wrong format for input param foobar: {{ ctx().foobar }}'} is not valid under any of the given schemas",
      "type": "syntax",
      "schema_path": "properties.tasks.patternProperties.^\\w+$.properties.next.items.properties.publish.oneOf"
    },

Then, I went to look up the schema to figure out what I did wrong, mistakenly guessing that it didn't like my use of Jinja. My error was passing in a dict instead of an array. Could the error message be improved to say what the options are when the schemaPath is oneOf? (in this case string or array of unique one item dicts)

Allow YAQL/Jinja expression in join

Currently, join can either be all or an integer that represents number of incoming edges with examples being join: all or join: 3. There are cases where the number of incoming edges need to be determined dynamically and join does not currently accept YAQL or Jinja expression.

Workflow execution with error sometimes

SUMMARY

I have a workflow running by a rule with interval timer that the first task iterates an array using with-item, sometimes I get the error "TypeError: 'NoneType' object has no attribute 'getitem'". The stranger is that doesn't occur in every execution.

STACKSTORM VERSION

st2 3.0.1, on Python 2.7.16

OS, environment, install method

OS: Oracle Linux 7.5
Custom HA installation with ST2 RPM without changes. All services separate as a single VM.

Steps to reproduce the problem

Workflow 1 content:

description: Simple workflow that check VM interface
input:
- lbs
- env_name
- acs_url
tasks:
    each_vip:
        with: <% ctx(lbs) %>
        action: workflows.check-interface-vm
        input:
            lb_id: <% item() %>
            env_name: <% ctx(env_name) %> 
            acs_url: <% ctx(acs_url) %>

Subworkflow

enabled: true
name: check-interface-vm
description: "Subworkflow"
pack: workflows
runner_type: orquesta
parameters:
  lb_id:
    required: true
    type: string
    description: "LB ID."
    position: 0
  env_name:
    required: true
    type: string
    description: "Env."
    position: 1
  acs_url:
    required: false
    type: string
    description: "URL ACS."
    position: 2
entry_point: workflows/check-interface-vm.yaml
data_files:
- file_path: workflows/check-interface-vm.yaml
  content: |
    version: 1.0
    description: "Subworkflow."
    
    input:
      - lb_id
      - env_name
      - acs_url

    tasks:
      get_lb_info:
        action: cloudstack.lb_list_members
        input:
          lb_id: <% ctx(lb_id) %>
          url: <% ctx(acs_url) %>
          apikey: <% st2kv('acs_apikey', decrypt => true) %>
          secretkey: <% st2kv('acs_secretkey', decrypt => true) %>
        next:
          - when: <% succeeded() %>
            publish:
              - lb_info: <% result().result.loadbalancerruleinstance %>
            do: get_vm_info
          - when: <% failed() %>
            do: fail

      get_vm_info:
        with: <% ctx(lb_info) %>
        action: cloudstack.vm_get_info
        input:
          vm_id: <% item().id %>
          url: <% ctx(acs_url) %>
          apikey: <% st2kv('acs_apikey', decrypt => true) %>
          secretkey: <% st2kv('acs_secretkey', decrypt => true) %>
        next:
          - when: <% succeeded() %>
            publish:
              - vm_info: <% result().result.virtualmachine %>
            do: ping_vm
          - when: <% failed() %>
            do: fail

      ping_vm:
        with: <% ctx(vm_info).select($.nic.ipaddress) %>
        action: networking_utils.ping
        input:
          host: "<% item().first().last() %>"
          count: 10
        next:
          - when: <% failed() %>
            publish:  
              - vm_tested: <% result().where($.failed = true).stdout.select($.split(' ')[1]) %>
            do: delete_vm_failed_interface

      delete_vm_failed_interface:
        with: <% ctx(vm_info).where(ctx(vm_tested).contains($.nic.ipaddress.first().last())) %>
        action: cloudstack.vm_force_destroy
        input:
          vm_id: <% item().first().id %>
          url: <% ctx(acs_url) %>
          apikey: <% st2kv('acs_apikey', decrypt => true) %>
          secretkey: <% st2kv('acs_secretkey', decrypt => true) %>
        next:
          - when: <% succeeded() %>
            do: post_slack_deleted
          - when: <% failed() %>
            do: fail

      post_slack_deleted:
        with: <% ctx(vm_info).where(ctx(vm_tested).contains($.nic.ipaddress.first().last())) %>
        action: slack.chat.postMessage
        input:
          text: "[CHECK INTERFACE - `<% ctx(env_name).toUpper() %>`] *<% item().first().name %>* *DOWN*."
          http_method: "POST"
          username: "bot"
          channel: "channel"

Expected Results

All workflows and actions executed with success.

Actual Results

Sometimes all actions inside workflow are executed with success but workflow fails with the message: TypeError: 'NoneType' object has no attribute '__getitem__'

Looking at logs I can see the error message with some identification:

ERROR 140336294574032 workflows [-] [5d6ed69d4341ee7ea9f75a6f] Unknown error
while processing task execution. Failing task execution "5d6ed69f1b80fa05b9fa7d07"

And when I try to get task execution information:

$ st2 execution get 5d6ed69f1b80fa05b9fa7d07 
ERROR: Execution with id 5d6ed69f1b80fa05b9fa7d07 not found.

Getting workflow information I can't see these task execution that failed 5d6ed69f1b80fa05b9fa7d07:

id: 5d6ed69d4341ee7ea9f75a6f
action.ref: workflows.check-farm
parameters:
  env_name: DEV
  lbs:
  - 6c0c9e83-21d4-455d-9c01-e7ecdfc90cf4
  - 715f34af-3a96-4aa4-aac4-2c04baa314d4
status: failed (18s elapsed)
start_timestamp: Tue, 03 Sep 2019 21:09:49 UTC
end_timestamp: Tue, 03 Sep 2019 21:10:07 UTC
result:
  errors:
  - message: Execution failed. See result for details.
    task_id: each_vip
    type: error
  - message: 'TypeError: ''NoneType'' object has no attribute ''__getitem__'''
    route: 0
    task_id: each_vip
    type: error
  output:
    each_vip:
      items:
      - result:
          output: null
        status: succeeded
      - result:
          output: null
        status: succeeded
    finish:
      failed: false
      return_code: 0
      stderr: ''
      stdout: Completed
      succeeded: true
    start:
      failed: false
      return_code: 0
      stderr: ''
      stdout: Started checking [u'6c0c9e83-21d4-455d-9c01-e7ecdfc90cf4', u'715f34af-3a96-4aa4-aac4-2c04baa314d4']
      succeeded: true
+-----------------------------+-------------------------+-------------+------------------------------------+-------------------------------+
| id                          | status                  | task        | action                             | start_timestamp               |
+-----------------------------+-------------------------+-------------+------------------------------------+-------------------------------+
|   5d6ed69e1b80fa05b9fa7d06  | succeeded (1s elapsed)  | start       | core.echo                          | Tue, 03 Sep 2019 21:09:50 UTC |
| + 5d6ed69f1b80fa05b9fa7d09  | succeeded (15s elapsed) | each_vip    | workflows.check-interface-vm | Tue, 03 Sep 2019 21:09:51 UTC |
|    5d6ed6a01b80fa05b9fa7d0e | succeeded (2s elapsed)  | get_lb_info | libcloud.balancer_list_members     | Tue, 03 Sep 2019 21:09:52 UTC |
|    5d6ed6a31b80fa05b9fa7d15 | succeeded (1s elapsed)  | get_vm_info | libcloud.get_vm                    | Tue, 03 Sep 2019 21:09:55 UTC |
|    5d6ed6a41b80fa05b9fa7d19 | succeeded (1s elapsed)  | get_vm_info | libcloud.get_vm                    | Tue, 03 Sep 2019 21:09:56 UTC |
|    5d6ed6a71b80fa05b9fa7d1f | succeeded (5s elapsed)  | ping_vm     | networking_utils.ping              | Tue, 03 Sep 2019 21:09:58 UTC |
|    5d6ed6a71b80fa05b9fa7d21 | succeeded (5s elapsed)  | ping_vm     | networking_utils.ping              | Tue, 03 Sep 2019 21:09:59 UTC |
| + 5d6ed69f1b80fa05b9fa7d0b  | succeeded (13s elapsed) | each_vip    | workflows.check-interface-vm | Tue, 03 Sep 2019 21:09:51 UTC |
|    5d6ed6a11b80fa05b9fa7d11 | succeeded (1s elapsed)  | get_lb_info | libcloud.balancer_list_members     | Tue, 03 Sep 2019 21:09:53 UTC |
|    5d6ed6a31b80fa05b9fa7d17 | succeeded (2s elapsed)  | get_vm_info | libcloud.get_vm                    | Tue, 03 Sep 2019 21:09:55 UTC |
|    5d6ed6a61b80fa05b9fa7d1c | succeeded (5s elapsed)  | ping_vm     | networking_utils.ping              | Tue, 03 Sep 2019 21:09:58 UTC |
|   5d6ed6ae1b80fa05b9fa7d24  | succeeded (1s elapsed)  | finish      | core.echo                          | Tue, 03 Sep 2019 21:10:06 UTC |
+-----------------------------+-------------------------+-------------+------------------------------------+-------------------------------+

If has any other information that I can pass, please let me know.

json/yaml ser/des functions in YAQL

Only json(str) which does "json string -> ojbect" is available for now.

Mistral supports several more functions:
https://github.com/StackStorm/st2mistral/blob/master/st2mistral/functions/data.py

I think it is good to have at least following:

json string to object: already implemented as json()
yaml string to object
object to json string
object to yaml string

Join failure within nested workflows can cause Parent workflow to run indefinitely.

Summary

Using a nested workflow, when a join fails due to "unreachable" in the child workflow can cause the parent workflow to run indefinitely, even though the parent workflow reaches an acceptable completion point.

Error Messages

I've seen two cases of error messages when this scenario presents

"message": "UnreachableJoinError: The join task|route \"aggregate|1\" is partially satisfied but unreachable."

"message": "The join task \"aggregate\" is unreachable. A join task is determined to be unreachable if there are nested forks from multi-referenced tasks that join on the said task. This is ambiguous to the workflow engine because it does not know at which level should the join occurs.",

The longest I've seen it go, was until I manually canceled it the following day at 69,552 seconds (over 19 hours)

Environment details

ST2 Version

st2 --version
st2 3.2.0, on Python 2.7.12

Distro

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"

Other

Installed from one-liner script
Using EC2 server, not Docker or other virtualization
Running on AMI ami-0e32ec5bc225539f5 in AWS

Reproduction Workflow Examples

I have tested with these reproduction workflows that the problem presents itself.

Parent WF

parent_wf.meta.yaml

pack: default
enabled: true
runner_type: orquesta
name: parent_wf
entry_point: workflows/parent_wf.yaml

parent_wf.yaml

version: 1.0
tasks:
  # [483, 337]
  task1:
    action: default.child_wf
    with:
      items: <% ctx(hosts) %>
      concurrency: 3
    next:
      - do:
          - complete
  # [483, 486]
  complete:
    action: core.noop
    join: all
vars:
  - hosts: ["host1","host2","host3"]

Child WF

child_wf.meta.yaml

pack: default
enabled: true
runner_type: orquesta
name: child_wf
entry_point: workflows/child_wf.yaml

child_wf.yaml

version: 1.0
tasks:
  # [489, 163]
  run:
    action: core.noop
    next:
      - do:
          - succeeds
          - fails
  # [348, 313]
  succeeds:
    action: core.local
    input:
      cmd: echo 'success'
    next:
      # #1072c6
      - do:
          - aggregate

  # [666, 311]
  fails:
    action: core.local
    input:
      cmd: echo 'fail'; exit 1
    next:
      # #1072c6
      - do:
          - aggregate

  # [518, 461]
  aggregate:
    action: core.noop
    join: all
    next:
      # #629e47
      - do:
          - continue_wf

  # [518, 593]
  continue_wf:
    action: core.noop

Expected Result

Child workflow join fails because upstream action failure
Parent Workflow sees failure of child workflow
Parent Workflow waits for all child workflows to complete
Parent workflow moves onto complete action
Parent workflow enters Success/Failed State accordingly

Observed Result

Child workflow join fails because upstream action failure
Parent Workflow sees failure of child workflow
Parent Workflow waits for all child workflows to complete
Parent workflow moves onto complete action
Parent workflow continues in running State until canceled manually

Workaround

An acceptable workaround I have found is ensuring that each parallel silo (fork) of the child workflow, prior to being joined, has a core.noop to ensure that a success always happens, which allows the join to succeed and continue gracefully.

This causes the "Expected Result" to be observed.

self looping start task is not identified as start task

If a start task is self looping, the conductor doesn't identify it as a start task.

  task1:
    action: core.noop
    next:
      - when: '<% ctx("loop") = true %>'
        publish:
          - loop: False
        do: task1

self looping and terminating task does not terminate workflow on completion

If the last task in the workflow is self looping and there's no other transition, on exiting of the cycle, the conductor does not terminate the workflow.

  task1:
    action: core.noop
    next:
      - do: task2
  task2:
    action: core.noop
    next:
      - when: '<% ctx("loop") = true %>'
        publish:
          - loop: False
        do: task2

Tasks in a fork from a cycle do not execute

Given the following workflow, the task cheer will only execute once if the task toil cycle back to the task query. The workflow engine did not recognize the fork under decide_cheer to be part of the cycle and blocked the task from updating the workflow state.

version: '1.0'

description: A sample workflow with a fork that stems from a cycle.

vars:
  - cheer: false
  - work: false

tasks:
  # This init task is required to tell the workflow engine where to start.
  # Otherwise with just a cycle, the workflow engine cannot tell where to start.
  init:
    action: core.noop
    next:
      - do: query

  # This is the start of the cycle in the workflow.
  query:
    action: core.noop
    next:
      - when: <% succeeded() %>
        publish: cheer=<% result() %> work=<% result() %>
        do: decide_cheer, decide_work


  # The branch under the decide_work task will cycle back to the query task.
  decide_work:
    next:
      - when: <% ctx().work %>
        do: notify_work, toil
  notify_work:
    action: core.noop
  toil:
    action: core.echo message="This is hard work."
    next:
      - do: query


  # The branch under the decide_cheer task is a fork from the cycle.
  decide_cheer:
    next:
      - when: <% ctx().cheer %>
        do: cheer
  cheer:
    action: core.echo message="You can do it!"

401 Client Error: Unauthorized MESSAGE: Unauthorized - One of Token or API key required.

I am getting a

requests.exceptions.HTTPError: 401 Client Error: Unauthorized
MESSAGE: Unauthorized - One of Token or API key required. for url: http://127.0.0.1:9101/v1/keys/<keyname>

exception when running an action in an orquesta workflow.

If I execute the action outside of the orquesta workflow or in a mistral-v2 workflow it works as expected.

The action is calling: self.action_service.set_key and self.action_service.get_key

It seems like this is failing due to the api token not being present in the call from orquesta.

Canceling workflow with subworkflow can lead to execution stuck in canceling state

Given an orquesta workflow that calls other workflows. If a cancel command is sent before the st2workflowengine is still in the middle of processing the subworkflow request, there is a chance that the main workflow will be stuck in canceling state.

Retries using with-items runs a retry even on objects that succeeded as well

We use the with-items to loop over and run an action.
If it fails for even one of them, it retries over all the items in the array again, instead of just retrying the failed ones.

For instance, if param1 was set to 1,2,3,4 in the below sample workflow:

version:                1.0
description:            Test retry workflow
input:
  - param1
  - slack_channel
tasks:
  split_list:
    action: core.noop
    next:
    - when: <% succeeded() %>
      do: print_val
      publish:
        - list_vals: <% ctx(param1).split(',') %>
  print_val:
    with: <% ctx(list_vals) %>
    action: core.local
    input :
      cmd : >
        val=<% item() %>;
        if [ $val -eq 2 ];then exit -1;fi
    retry:
      when: <% failed() %>
      count: 5
      delay: 2
    next:
      - when: <% succeeded() %>
        publish:
          - status: "SUCCESS"
      - when: <% failed() %>
        do:
          - send_failed_alert
        publish:
          - status: "FAILED"
  send_failed_alert:
    action: slack.chat.postMessage
    input:
      channel: "#<% ctx(slack_channel) %>"
      text: "ERROR: failed on param <% ctx(param1) %> failed"`

And here is the output of the run.

Duplicate task name should throw error

Tested with st2 3.0dev (d86747c), on Python 2.7.12

Orquesta workflows with duplicate task names do not throw an error. They proceed, with what looks like the second task with the same name being called.

I think it should throw an error if duplicate task names are detected.

Example workflow metadata:

---
name: orquesta-dupe
pack: default
description: Duplicate tasks
runner_type: orquesta
entry_point: workflows/orquesta-dupe.yaml
enabled: true

Workflow:

version: 1.0

description: A basic workflow with duplicate task names

tasks:
  task1:
    action: core.local cmd="echo task1"
    next:
      - when: <% succeeded() %>
        do: task2
  task2:
    action: core.local cmd='echo "first task2"'
    next:
      - when: <% succeeded() %>
        do: task3
  task2:
    action: core.local cmd='echo "second task2"'

After registration, inspect the workflow to check for errors:

extreme@ubuntu:/opt/stackstorm/packs$ st2 workflow inspect --action default.orquesta-dupe
No errors found in workflow definition.
extreme@ubuntu:/opt/stackstorm/packs$

Run workflow, look at output:

extreme@ubuntu:/opt/stackstorm/packs$ st2 run default.orquesta-dupe
...
id: 5ca3ff955f627d08b84be500
action.ref: default.orquesta-dupe
parameters: None
status: succeeded
start_timestamp: Wed, 03 Apr 2019 00:34:28 UTC
end_timestamp: Wed, 03 Apr 2019 00:34:34 UTC
result:
  output: null
+--------------------------+------------------------+-------+------------+------------------------------+
| id                       | status                 | task  | action     | start_timestamp              |
+--------------------------+------------------------+-------+------------+------------------------------+
| 5ca3ff975f627d0453668034 | succeeded (1s elapsed) | task1 | core.local | Wed, 03 Apr 2019 00:34:31    |
|                          |                        |       |            | UTC                          |
| 5ca3ff995f627d0453668037 | succeeded (0s elapsed) | task2 | core.local | Wed, 03 Apr 2019 00:34:33    |
|                          |                        |       |            | UTC                          |
+--------------------------+------------------------+-------+------------+------------------------------+
extreme@ubuntu:/opt/stackstorm/packs$ st2 execution get 5ca3ff995f627d0453668037
id: 5ca3ff995f627d0453668037
status: succeeded (0s elapsed)
parameters:
  cmd: echo "second task2"
result:
  failed: false
  return_code: 0
  stderr: ''
  stdout: second task2
  succeeded: true
extreme@ubuntu:/opt/stackstorm/packs$

Note that it was the second task2 that was called.

User feedback

Below is user feedback via Slack. Posting it here so it doesn't get lost with Slack message rollover

StackStorm's New Workflows

Unit Testing

Currently, in my opinion, the biggest downfall of Mistral is the lack of unit
testing around workflows. Without unit testing we're relying on smoke testing to
ensure these workflows behave properly (not very effective or maintainable).

Things we would like to test in a workflow include:

Task execution yes/no
Task execution order
Error handling
Publish expressions (Jinja, YAQL)
Transition expressions (Jinja, YAQL)
Workflow inputs and outputs

I think this would require some sort of ability to mock action output so that
these things can be tested.

Publish and reuse in same task

Currently, in Mistral, all of the publish and publish-on-error statements
occur in bulk meaning i can't reuse one of the variables published in the
current task until the next task is executed.

Example:

task1:
  action: core.local
  input:
    cmd: "echo hello"
  publish:
    output: "{{ task('task1').result.stdout }}"
    # this errors out because output isn't available in the contex yet
    hello_world: "{{ _.output + ' world' }}"

Maybe converting the publish statement to an array would make it easier
to execute in sequence?

task1:
  action: core.local
  input:
    cmd: "echo hello"
  publish:
    - output: "{{ task('task1').result.stdout }}"
    # output should now be available in the current context, so the following 
    # should work
    - hello_world: "{{ _.output + ' world' }}"

When condition on a task level

Sometimes it's necessary to skip a task given a condition. Currently we have to
work around this by adding this "skip" condition into every on-success statement
in the workflow that may call our task. This is a maintenance burdem.

Ansible example:

- name: install stackstorm pack
  shell:
    cmd: "st2 pack install {{ st2_pack }}"
  when: st2_pack is defined

Mistral example (current implementation):

input:
  - servicenow_provision_id: null

task_1:
  action: std.noop
  publish:
    last_task: "task_1"
  on-success:
    - task_servicenow_update: "{{ _.servicenow_provision_id }}"
    - task_2

task_2:
  action: std.noop
  publish:
    last_task: "task_2"
  on-success:
    - task_servicenow_update: "{{ _.servicenow_provision_id }}"
    
task_servicenow_update:
  action: encore_servicenow.provision_state_update
  input:
    provision_id: "{{ _.servicenow_provision_id }}"
    current_state: "{{ _.last_task }}"

Mistral example using a when condition on task_servicenow_update task:

input:
  - servicenow_provision_id: null

task_1:
  action: std.noop
  publish:
    last_task: "task_1"
  on-success:
    - task_servicenow_update
    - task_2

task_2:
  action: std.noop
  publish:
    last_task: "task_2"
  on-success:
    - task_servicenow_update
    
task_servicenow_update:
  action: encore_servicenow.provision_state_update
  when: "{{ _.servicenow_provision_id }}"
  input:
    provision_id: "{{ _.servicenow_provision_id }}"
    current_state: "{{ _.last_task }}"

"Main" task / entry-point

There are many occasions on Slack where users of Mistral are confused by its
default execution behavior. People usually forget to tie tasks together with
explicit on-success, on-error, or on-complete statements, causing tasks
to be run in parallel resulting in odd and unpredictable behaviors.

It might be better to explicity define a "main" or "entry-point" in the workflow.

description: My entry-point workflow
input:
  - x
  - y 
  - z
  
# the tasks that should be executed as "roots"
# this can either be a string, or a list so that >1 can be executed in parallel?
entry-point: my_first_task

tasks:
  my_first_task:
    action: std.noop
    on-success:
      - some_other_task
    
  some_other_task:
    action: std.noop

Linear execution by default

On the same lines as the last point about "main tasks" people new to Mistral are
often confused by the non-linearity as the default operating state of the
workflow engine.

Maybe as an alternative to the "main task" idea above, workflows could be executed
linearly and in parallel/DAG-optimized using an option. I'm going to
suggest a workflow parameter called execution_model. The execution_model
with a value of linear means that tasks are executed in order just like an
actionchain or Ansible. Another implementation could be the parallel execution
model that will switch it over to a Mistral-like parallel execution by creating
a DAG and executing as much in parallel as possible.

description: > 
  My linear workflow, all of these are executed in order
  without the need for on-success.
execution_model: linear
  
tasks:
  task_0:
    action: std.noop
    
  task_1:
    action: std.noop

  task_2:
    action: std.noop

description: > 
  Parallel execution model tries to do as much in parallel as possible, requires
  on-success and on-error to construct our DAG.
execution_model: parallel
  
tasks:
  task_0:
    action: std.noop
    on-success:
      - task_1
    
  task_1:
    action: std.noop
    on-success:
      - task_2
      - task_3

  task_2:
    action: std.noop
    
  task_3:
    action: std.noop

Keep: defined inputs and output

We really like having the inputs and outputs sections explicity defined.
Please keep these!

Keep: allowed temporaries in workflow

We currently use (and maybe abuse) the ability of having input values defined
in the workflow that are not present in the parameters of the action itself.
These are mostly used as temporary variables that are local to the workflow

Inflated workflow execution graph leads to increase resource consumption and delays

Given the workflow definition at https://github.com/StackStorm/st2ci/blob/86655cb71ae0a1ff813b4bd0828399b54206e146/actions/workflows/st2_pkg_e2e_test.yaml, orquesta converts this workflow into an execution graph for runtime. Due to the number of conditional jumps, multiple references to common tasks, and common tasks having many steps, this leads to an inflated execution graph. This results in very high CPU and memory usage and long delayed when the conductor tries to load the workflow definition. The resource consumption and delay are likely occurring where the conductor is trying to identify start tasks for the graph.

error message from a task error event is not logged in the workflow errors output

if there is a task error event, the error message is not logged in the conductor's errors list for the workflow.

Can't access the single result for `with` itme in the current thread

ST2 version: st2 3.2dev, on Python 2.7.12
The bug is found by customer.

result() is for all the results from the with and can’t access the single result for the current thread.
The Workflow that duplicated issue:

actions/orquesta-with-items-hostname.yaml
---
name: orquesta-with-items-hostname
description: A workflow demonstrating with items.
runner_type: orquesta
entry_point: workflows/tests/orquesta-with-items-hostname.yaml
pack: examples
enabled: true

workflows/tests/orquesta-with-items-hostname.yaml
version: 1.0
description: A workflow demonstrating with items and concurrent processing.

vars:
  - members: [{"hostname": "Extreme", }, {"hostname": "Google"}, {"hostname":"StackStorm"}]

tasks:
  task1:
    with: <% ctx(members) %>
    action: examples.get_url
    input:
      hostname: <% item().get(hostname) %>
    next:
      - when: <% task(task1).result.result = "https://extreme.com" %>
        do: notify
  notify:
    action: core.echo message="notify should be called"
output:
  - url: <% task(task1).result %>

actions/get_url.yaml
---
name: "get_url"
pack: "examples"
enabled: true
runner_type: "python-script"
entry_point: "get_url.py"
parameters:
  hostname:
    type: string
    required: true

import json
from st2common.runners.base_action import Action


class GetUrl(Action):
    def run(self, hostname=None):
        if hostname == "Extreme":
            return 'https://extreme.com'
        if hostname == "Google":
            return 'https://goodle.com'
        if hostname =="StackStorm":
            return 'https://StackStorm.com'

Execution result:

st2 run examples.orquesta-with-items-hostname
..
id: 5d66c4d20761294f53faac38
action.ref: examples.orquesta-with-items-hostname
parameters: None
status: failed
start_timestamp: Wed, 28 Aug 2019 18:15:46 UTC
end_timestamp: Wed, 28 Aug 2019 18:15:49 UTC
result:
  errors:
  - message: 'YaqlEvaluationException: Unable to resolve key ''result'' in expression ''<% task(task1).result.result = "https://extreme.com" %>'' from context.'
    route: 0
    task_id: task1
    task_transition_id: notify__t0
    type: error
  output:
    url:
      items:
      - result:
          exit_code: 0
          result: https://extreme.com
          stderr: ''
          stdout: ''
        status: succeeded
      - result:
          exit_code: 0
          result: https://goodle.com
          stderr: ''
          stdout: ''
        status: succeeded
      - result:
          exit_code: 0
          result: https://StackStorm.com
          stderr: ''
          stdout: ''
        status: succeeded
+--------------------------+------------------------+-------+------------------+--------------------+
| id                       | status                 | task  | action           | start_timestamp    |
+--------------------------+------------------------+-------+------------------+--------------------+
| 5d66c4d307612910be906f12 | succeeded (1s elapsed) | task1 | examples.get_url | Wed, 28 Aug 2019   |
|                          |                        |       |                  | 18:15:47 UTC       |
| 5d66c4d307612910be906f14 | succeeded (1s elapsed) | task1 | examples.get_url | Wed, 28 Aug 2019   |
|                          |                        |       |                  | 18:15:47 UTC       |
| 5d66c4d307612910be906f16 | succeeded (2s elapsed) | task1 | examples.get_url | Wed, 28 Aug 2019   |
|                          |                        |       |                  | 18:15:47 UTC       |
+--------------------------+------------------------+-------+------------------+--------------------+

Rerunning `examples.orchestra-join` results in a failure

vagrant@st2test:~$ st2 run examples.orchestra-join
...
id: 5b332ce055fc8c57d28831af
action.ref: examples.orchestra-join
parameters: None
status: succeeded
start_timestamp: Wed, 27 Jun 2018 06:21:20 UTC
end_timestamp: Wed, 27 Jun 2018 06:21:24 UTC
result:
  output:
    messages:
    - Fee fi fo fum
    - I smell the blood of an English man
    - Be alive, or be he dead
    - I'll grind his bones to make my bread
+--------------------------+------------------------+--------+-----------+-------------------------------+
| id                       | status                 | task   | action    | start_timestamp               |
+--------------------------+------------------------+--------+-----------+-------------------------------+
| 5b332ce055fc8c58798f6a20 | succeeded (0s elapsed) | task1  | core.noop | Wed, 27 Jun 2018 06:21:20 UTC |
| 5b332ce055fc8c5821391bd2 | succeeded (1s elapsed) | task2  | core.echo | Wed, 27 Jun 2018 06:21:20 UTC |
| 5b332ce155fc8c5821391bd5 | succeeded (0s elapsed) | task4  | core.echo | Wed, 27 Jun 2018 06:21:21 UTC |
| 5b332ce155fc8c5821391bd8 | succeeded (0s elapsed) | task6  | core.noop | Wed, 27 Jun 2018 06:21:21 UTC |
| 5b332ce255fc8c5821391bdc | succeeded (1s elapsed) | task5  | core.noop | Wed, 27 Jun 2018 06:21:21 UTC |
| 5b332ce255fc8c5821391bdf | succeeded (0s elapsed) | task3  | core.echo | Wed, 27 Jun 2018 06:21:22 UTC |
| 5b332ce255fc8c5821391be2 | succeeded (0s elapsed) | task7  | core.echo | Wed, 27 Jun 2018 06:21:22 UTC |
| 5b332ce355fc8c5821391be4 | succeeded (1s elapsed) | task8  | core.noop | Wed, 27 Jun 2018 06:21:22 UTC |
| 5b332ce355fc8c5821391be7 | succeeded (1s elapsed) | task9  | core.noop | Wed, 27 Jun 2018 06:21:23 UTC |
| 5b332ce455fc8c5821391bea | succeeded (0s elapsed) | task10 | core.noop | Wed, 27 Jun 2018 06:21:24 UTC |
+--------------------------+------------------------+--------+-----------+-------------------------------+
vagrant@st2test:~$ st2 execution re-run 5b332ce055fc8c57d28831af
..
id: 5b332cf155fc8c57d28831b2
action.ref: examples.orchestra-join
parameters: None
status: failed
start_timestamp: Wed, 27 Jun 2018 06:21:37 UTC
end_timestamp: Wed, 27 Jun 2018 06:21:40 UTC
result:
  errors:
  - message: Conflict saving DB object with id "5b332cf255fc8c5821391bf4" and rev "2".
    task_id: task8
  output: null
+--------------------------+------------------------+-------+-----------+-------------------------------+
| id                       | status                 | task  | action    | start_timestamp               |
+--------------------------+------------------------+-------+-----------+-------------------------------+
| 5b332cf155fc8c58798f6a23 | succeeded (0s elapsed) | task1 | core.noop | Wed, 27 Jun 2018 06:21:37 UTC |
| 5b332cf255fc8c5821391bed | succeeded (0s elapsed) | task2 | core.echo | Wed, 27 Jun 2018 06:21:38 UTC |
| 5b332cf255fc8c5821391bf0 | succeeded (0s elapsed) | task4 | core.echo | Wed, 27 Jun 2018 06:21:38 UTC |
| 5b332cf255fc8c5821391bf3 | succeeded (0s elapsed) | task6 | core.noop | Wed, 27 Jun 2018 06:21:38 UTC |
| 5b332cf355fc8c5821391bf6 | succeeded (0s elapsed) | task8 | core.noop | Wed, 27 Jun 2018 06:21:39 UTC |
| 5b332cf355fc8c5821391bfa | succeeded (0s elapsed) | task5 | core.noop | Wed, 27 Jun 2018 06:21:39 UTC |
| 5b332cf455fc8c5821391c01 | succeeded (2s elapsed) | task3 | core.echo | Wed, 27 Jun 2018 06:21:39 UTC |
| 5b332cf355fc8c5821391bfd | succeeded (1s elapsed) | task7 | core.echo | Wed, 27 Jun 2018 06:21:39 UTC |
| 5b332cf455fc8c5821391c02 | requested              | task9 | core.noop | Wed, 27 Jun 2018 06:21:40 UTC |
+--------------------------+------------------------+-------+-----------+-------------------------------+

stackstorm / orquesta Goto Github PK

orquesta's People

Contributors

Stargazers

Watchers

Forkers

orquesta's Issues

Metadata file (test_input.yaml)

Orquesta file (workflows/test_input.yam)

Execution result

Environment of st2

Note

Summary

Reproduction

Expected results

Observed Results

Workaround

Case of using ujson(v1.35)

Case of using ujson(v2.0.0)

Case of using ujson(v2.0.2 (current latest version))

Actual behavior

Expected behavior

SUMMARY

ISSUE TYPE

date

cat /etc/redhat-release

st2 --version

cat /etc/st2/st2.conf

redis-server --version

st2 run easydcim.show_device_parts id_num=13 type_id=20

SUMMARY

STACKSTORM VERSION

OS, environment, install method

Steps to reproduce the problem

Expected Results

Actual Results

SUMMARY

ISSUE TYPE

STACKSTORM VERSION

OS / ENVIRONMENT / INSTALL METHOD

SUMMARY

STACKSTORM VERSION

OS, environment, install method

Steps to reproduce the problem

Expected Results

Actual Results

Summary

Error Messages

Environment details

ST2 Version

Distro

Other

Reproduction Workflow Examples

Parent WF

parent_wf.meta.yaml

parent_wf.yaml

Child WF

child_wf.meta.yaml

child_wf.yaml

Expected Result

Observed Result

Workaround

StackStorm's New Workflows

Unit Testing

Publish and reuse in same task

When condition on a task level

"Main" task / entry-point

Linear execution by default

Keep: defined inputs and output

Keep: allowed temporaries in workflow

Recommend Projects

Recommend Topics

Recommend Org

Jobs