taskcluster / taskgraph

taskgraph's Introduction


Taskgraph

Taskgraph is a Python library to generate graphs of tasks for the Taskcluster CI service. It is the recommended approach for configuring tasks once your project outgrows a single .taskcluster.yml file, and it powers the more than 30,000 tasks that make up Firefox's CI.

For more information and usage instructions, see the docs.

How It Works

Taskgraph leverages the fact that Taskcluster is a generic task execution platform. This means that tasks can be scheduled via its comprehensive API, and aren't limited to being triggered in response to supported events.

Building on this execution platform, Taskgraph allows CI systems to scale to any size or complexity:

  1. A decision task is created via Taskcluster's normal .taskcluster.yml file. This task invokes taskgraph.
  2. Taskgraph evaluates a series of YAML-based task definitions (similar to those other CI offerings provide).
  3. Taskgraph applies transforms on top of these task definitions. Transforms are Python functions that can programmatically alter, or even clone, a task definition (a minimal example follows this list).
  4. Taskgraph applies some optional optimization logic to remove unnecessary tasks.
  5. Taskgraph submits the resulting task graph to Taskcluster via its API.
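
For illustration, here is a minimal sketch of what a transform can look like (the transform and the default value it sets are made up for this example):

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

@transforms.add
def add_max_run_time(config, tasks):
    # Fill in a default max-run-time for any task that omits one.
    for task in tasks:
        task.setdefault("worker", {}).setdefault("max-run-time", 3600)
        yield task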

This combination of declarative task configuration and programmatic alteration is what allows Taskgraph to support CI systems of any scale, including the 30,000+ tasks that make up Firefox's CI.

Installation

Taskgraph supports Python 3.8 and up, and can be installed from PyPI:

pip install taskcluster-taskgraph

Alternatively, the repo can be cloned and installed directly:

git clone https://github.com/taskcluster/taskgraph
cd taskgraph
python setup.py install

In both cases, it's recommended to use a Python virtual environment.

Get Involved

If you'd like to get involved, please see our contributing docs!


taskgraph's Issues

artifact-reference should support non-public artifacts

mozilla-extensions/xpi-template#37
Essentially, we probably want to remove this check.

Given that:

a) we can use artifact-reference for scriptworkers, which often do have private artifact scopes baked in (until we resolve mozilla-releng/scriptworker#426, after which we'll need to grant the scope to the task as normal), and
b) if we grant the given task the proper private artifact scope and enable the taskcluster-proxy, non-scriptworker tasks can also download private artifacts,

I don't see a need for this artificial restriction.

Add a tool for tracing transforms

Imported from: https://bugzilla.mozilla.org/show_bug.cgi?id=1676972

One of the biggest challenges of debugging the taskgraph is tracing what happens in the transforms. Let's say there's an oddity in test-windows7-32/opt-mochitest-browser-chrome-e10s-1 and you want to debug what's going on. The experience is miserable because:

  1. There are thousands of tasks, so whether using logging or a debugger, you'll have to sift through or filter out all of the irrelevant tasks.
  2. The labels are not finalized until the end. So the task isn't going to be called test-windows7-32/opt-mochitest-browser-chrome-e10s-1 in the beginning, making it hard to narrow down on a single task.
  3. Tasks can split out into more tasks which complicates matters further.

I'd love to have a --trace-transforms flag that, when specified, dumps out logging from transforms that is specific to the final task. I envision this being used in conjunction with --tasks-regex so we can limit which task logs are dumped.

The implementation here is going to be difficult, but I think it should be possible. Roughly my idea is:

Have a special TransformLogger that buffers logs in a tree-like data structure. Each node in the tree contains the output from a single yield of a single transform. Each node (except the root) has a single parent that represents the output from the previous transform. Each child node (there can be multiple if a transform splits tasks), contains the output from the next transform. Each leaf node contains a finished task label.

Then, when --trace-transforms is passed in:

For each leaf node matching a task specified by --tasks-regex (might be all of them if not specified), we can reverse back up the output tree and stitch together only the output from nodes along that path, which should give us the full output for only that specific task. We can add some extra logs to delineate where a transform started and finished. We can also log transform durations.

If --trace-transforms is not passed in, we can make the logging calls a no-op so there wouldn't be any perf penalty.
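
A minimal sketch of the data structure this describes (the names are hypothetical):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogNode:
    """Buffered output from a single yield of a single transform."""
    transform: str
    lines: List[str] = field(default_factory=list)
    parent: Optional["LogNode"] = None
    children: List["LogNode"] = field(default_factory=list)
    label: Optional[str] = None  # set on leaf nodes once the label is final

def stitch_trace(leaf: LogNode) -> List[str]:
    # Walk from a finished leaf back to the root, then emit only the
    # output from nodes along that task's path, in transform order.
    path = []
    node: Optional[LogNode] = leaf
    while node is not None:
        path.append(node)
        node = node.parent
    output = []
    for node in reversed(path):
        output.append(f"--- {node.transform} ---")
        output.extend(node.lines)
    return output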

Issue loading parameters on Community CI Decision Task

I ran taskgraph full -p task-id=KMSME5jMQparn-7-8hyOdA to test the task graph for a taskcluster release, but got this output:

2022-05-02 11:27:23,980 - INFO - Loading graph configuration.
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/yaml/reader.py", line 156, in update
    data, converted = self.raw_decode(self.raw_buffer,
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/main.py", line 742, in main
    args.command(vars(args))
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/main.py", line 409, in show_taskgraph
    generate_taskgraph(options, parameters, logdir)
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/main.py", line 188, in generate_taskgraph
    out = format_taskgraph(options, spec, logfile(spec))
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/main.py", line 144, in format_taskgraph
    tg = getattr(tgg, options["graph_attr"])
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/generator.py", line 170, in full_task_graph
    return self._run_until("full_task_graph")
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/generator.py", line 412, in _run_until
    k, v = next(self._run)
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/generator.py", line 264, in _run
    parameters = self._parameters(graph_config)
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/parameters.py", line 326, in get_parameters
    parameters = load_parameters_file(
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/parameters.py", line 309, in load_parameters_file
    kwargs = yaml.load_stream(f)
  File "/opt/homebrew/lib/python3.9/site-packages/taskgraph/util/yaml.py", line 24, in load_stream
    loader = UnicodeLoader(stream)
  File "/opt/homebrew/lib/python3.9/site-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/opt/homebrew/lib/python3.9/site-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/opt/homebrew/lib/python3.9/site-packages/yaml/reader.py", line 135, in determine_encoding
    self.update(1)
  File "/opt/homebrew/lib/python3.9/site-packages/yaml/reader.py", line 164, in update
    raise ReaderError(self.name, position, character,
yaml.reader.ReaderError: unacceptable character #x008b: invalid start byte
  in "<file>", position 1

I made sure to export TASKCLUSTER_ROOT_URL=https://community-tc.services.mozilla.com as well before running. (I received 404s before setting the root URL, likely because Taskgraph defaults to the Firefox CI root.)
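
For what it's worth, byte 0x8b at position 1 matches the gzip magic number (1f 8b), which suggests the downloaded parameters artifact was gzip-compressed but never decompressed before being handed to the YAML loader. A sketch of a possible workaround (the helper is hypothetical, not Taskgraph API):

import gzip
import io

def open_parameters(data: bytes) -> io.TextIOWrapper:
    # Transparently decompress a parameters artifact that was served
    # gzip-encoded (magic bytes 1f 8b) before parsing it as YAML.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")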

Cached tasks don't work for pull-requests on level 1-only repositories

There's some logic in Taskgraph such that the index used for cached tasks in pull requests gets elevated to level 3:

min_level = max(min_level, 3)

We do this so PRs use cached tasks that were generated from a push (rather than from another PR). However, for repos where everything is level 1 (including pushes), these index routes will never exist and we'll never optimize the cached_task away.

Then again, using cached_tasks in a level 1-only repo means that PRs could overwrite index tasks from pushes. One possible way of fixing this would be to add the tasks_for value to the index in this circumstance (sketched below). Or maybe it's easiest to say that cached_tasks aren't supported in level 1-only repos.
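
A sketch of what folding tasks_for into the route could look like (the function and route layout are hypothetical):

def cached_index_route(trust_domain, level, tasks_for, cache_type, digest):
    # On level 1-only repos, segregate pull-request caches from push
    # caches by folding tasks_for into the route.
    route = f"index.{trust_domain}.cache.level-{level}"
    if level == 1:
        route += f".{tasks_for}"  # e.g. 'github-pull-request' vs 'github-push'
    return f"{route}.{cache_type}.{digest}"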

Remove 'always-target' feature

Always-target is a feature that was added to address a narrow use case in Gecko. It's not clear that it is useful outside of Gecko, as other repos tend not to have complicated optimization logic.

There is a slight discrepancy between the two implementations now too: in Gecko, we only use the feature for hg-push graphs. So if we want to merge the two Taskgraphs, we either need to:

  1. Sync the Gecko change over here.
  2. Remove always-target and refactor Gecko to perform the same logic in the target-tasks phase.

As the original author of this feature, I think this feature was implemented the wrong way, and option 2 is a better approach here. So that means we should drop support for always-target here.

Rename `job` transforms to `run`

Similar to #25

The name job has never made sense to me here. Since these transforms are all about setting things up for use with the run-task script, I'd propose calling them the run transforms. But other suggestions are welcome.

Create loader schemas for validating `kind.yml` files

The format of the kind.yml files is currently determined by loaders. We should provide schemas for what these loaders expect and validate kind.yml files against them. This would result in clearer expectations about what is allowed in these files.

This came from a request from :asuth, who is looking into ways of integrating Taskgraph with Searchfox. He'd prefer the schema follow the JSON-schema spec. I'm not sure I'd want two separate schema validation methods in use (jsonschema + voluptuous), but maybe we could have a CI step to export the schema into the JSON-schema format or something.
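
A rough sketch of what a voluptuous schema for the transform loader's expectations could look like (the exact keys would need to match what the loader actually reads):

from voluptuous import Optional, Required, Schema

# Hypothetical schema for a kind.yml consumed by the transform loader.
transform_loader_schema = Schema(
    {
        Required("loader"): str,
        Optional("kind-dependencies"): [str],
        Optional("transforms"): [str],
        Optional("task-defaults"): dict,
        Required("tasks"): {str: dict},
    }
)

# Validation raises voluptuous.Invalid on unexpected or missing keys.
transform_loader_schema(
    {
        "loader": "taskgraph.loader.transform:loader",
        "tasks": {"hello": {"description": "example"}},
    }
)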

Add a transform to easily create `summary` tasks

Often people want a task that can summarize the results of other tasks, either for the entire graph or for some subset. In Gecko there's a code-review task that does just this:
https://searchfox.org/mozilla-central/source/taskcluster/ci/code-review/kind.yml

It basically waits for all tasks with the code-review attribute to finish, and then sends a pulse message to notify consumers that they're ready (the consumers do the actual status inspection).

I propose we:

  1. Create a transform file that does something similar to the code review transform
  2. Implement some pre-defined "behaviors" the task can follow (a sketch of the dependency wiring follows this list). Here are some example behaviors:
  • noop - Task always passes, routes can notify that all tasks being summarized are finished
  • require_pass - Task passes if all dependencies pass, otherwise it fails (can be useful if using an on-exception notify route)
  • custom - Task runs some arbitrary command as normal (?)
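
A rough sketch of the first half of the proposal (the summary-attribute key and the dependency wiring are hypothetical):

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

@transforms.add
def add_summarized_dependencies(config, tasks):
    # Make each summary task depend on every task from this kind's
    # kind-dependencies that carries the configured attribute.
    for task in tasks:
        attribute = task.pop("summary-attribute", "code-review")
        deps = task.setdefault("dependencies", {})
        for dep in config.kind_dependencies_tasks:
            if dep.attributes.get(attribute):
                deps[dep.label] = dep.label
        yield task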

Make TransformConfig.kind_dependencies_tasks into a dictionary keyed by task label

Currently this is a list of tasks:

kind_dependencies_tasks = [

This isn't ideal because if you have the label and want to grab the dependency task, you need to iterate over all tasks in the list and compare labels one by one. In Gecko, we use a dict keyed by label:
https://searchfox.org/mozilla-central/rev/86c98c486f03b598d0f80356b69163fd400ec8aa/taskcluster/gecko_taskgraph/generator.py#54

This allows for simpler logic in many cases. Moving this over to the Gecko method will also be needed to merge the two Taskgraphs.
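
For illustration, the difference for consumers (function names made up):

def find_dep_today(config, label):
    # Today: a linear scan over the list, comparing labels one by one.
    return next(t for t in config.kind_dependencies_tasks if t.label == label)

def find_dep_proposed(config, label):
    # Proposed (the Gecko approach): kind_dependencies_tasks is a dict
    # keyed by label, so lookup is a single indexing operation.
    return config.kind_dependencies_tasks[label]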

Decision task should declare artifacts in task payload

If a decision task fails, e.g. like this one, because of missing scopes when submitting a task, it would be useful for the task definition to be persisted so that a user has full context on the content of the task that could not be submitted.

It looks like taskgraph submits public/task-graph.json, public/target-tasks.json, etc. at runtime, rather than declaring them as task artifacts. This means that if the task fails for any reason before they are dynamically uploaded, the task will not persist this information.

If this is changed so that the task definition includes the list of artifacts that the decision task intends to publish, they will be persisted even if the task fails, which will make debugging easier.

Provide a better error message when invalid dependencies are specified

Over in mozilla-mobile/firefox-android#1578 I was helping @rahulsainani out with a Taskgraph problem. Turns out it was just an invalid dependency being listed, but the error Taskgraph spits out is terrible:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/main.py", line 828, in main
    args.command(vars(args))
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/main.py", line 411, in show_taskgraph
    generate_taskgraph(options, parameters, logdir)
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/main.py", line 190, in generate_taskgraph
    out = format_taskgraph(options, spec, logfile(spec))
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/main.py", line 146, in format_taskgraph
    tg = getattr(tgg, options["graph_attr"])
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/generator.py", line 168, in full_task_graph
    return self._run_until("full_task_graph")
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/generator.py", line 422, in _run_until
    k, v = next(self._run)
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/generator.py", line 335, in _run
    yield self.verify("full_task_graph", full_task_graph, graph_config, parameters)
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/generator.py", line 429, in verify
    verifications(name, obj, *args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/util/verify.py", line 104, in __call__
    verification.verify(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/util/verify.py", line 53, in verify
    graph.for_each_task(
  File "/opt/homebrew/lib/python3.10/site-packages/taskgraph/taskgraph.py", line 32, in for_each_task
    task = self.tasks[task_label]
KeyError: 'generate-baseline-profiles'

We should provide a much nicer error that makes it clear which task has the problem, and provides a hint on how to fix it. I guess one tricky thing here is that we can't make this a verification as we need to iterate through the graph in order to run the verifications in the first place!

So possibly this will need to go directly in generator.py itself.
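
A sketch of what such a check might look like (hypothetical function; the real fix would live in generator.py):

def check_dependencies(full_task_graph):
    # Fail with a readable message before the graph is traversed for
    # the real verifications.
    known = set(full_task_graph.tasks)
    for label, task in full_task_graph.tasks.items():
        for dep_label in task.dependencies.values():
            if dep_label not in known:
                raise Exception(
                    f"Task '{label}' depends on '{dep_label}', which does not "
                    "exist in the graph. Check the 'dependencies' key in its "
                    "definition for typos."
                )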

Drop support for Python 3.6

This version is officially unsupported. The only blocker here is making sure that consumers are using the latest decision docker image as older versions still had 3.6 baked in. Luckily since consumers are pinning Taskgraph, this isn't a hard blocker.

Use the Taskcluster Python client rather than direct http requests

Currently we just make manually crafted API requests to the TC endpoint rather than going through the Taskcluster Python client. This has been working fine for a long time, but we are likely missing out on some robustness and could push some complexity out of Taskgraph.

But the bigger reason to consider switching soonish is that we want to start using the object service in Firefox CI. The official clients already support artifact verification out of the box. If Taskgraph were using the Python client, it would start getting the benefit of artifact integrity as soon as things start switching over to the object service.

Provide more defaults for kinds

One of the bigger barriers to using taskgraph is how intimidating it is to write configs. One particular part of this is how much content there often is in a kind that doesn't directly relate to the task being written, or that is essentially boilerplate most kinds need. As a concrete example, this is a current kind definition used by MozillaVPN:

loader: taskgraph.loader.transform:loader

transforms:
    - taskgraph.transforms.job:transforms
    - taskgraph.transforms.task:transforms

tasks:
    taskgraph-definition:
        worker-type: b-linux
        worker:
            docker-image: {in-tree: base}
            max-run-time: 3600
        description: "Test the full `mozilla_vpn_taskgraph` to validate the latest changes"
        treeherder:
            symbol: test-taskgraph-definition
            kind: test
            platform: tests/opt
            tier: 1
        run:
            using: run-task
            use-caches: true
            cwd: '{checkout}'
            command: >-
                pip3 install -r taskcluster/requirements.txt &&
                taskgraph full --p taskcluster/test/params &&
                taskgraph full

The entirety of the loader and transforms keys is boilerplate, as is much of the run section. Many other parts would be unnecessary if we provided better defaults in the job and/or task transforms. In the end, I think it's possible to shrink the above down to:

tasks:
    taskgraph-definition:
        worker:
            docker-image: {in-tree: base}
        description: "Test the full `mozilla_vpn_taskgraph` to validate the latest changes"
        run:
            command: >-
                pip3 install -r taskcluster/requirements.txt &&
                taskgraph full --p taskcluster/test/params &&
                taskgraph full

...at which point, the entire kind is something that I think we could reasonably expect a project owner to understand and write.

Concretely, I suggest we make the following changes:

  • Use the transform loader as the default loader
  • Provide the job and task transforms as the default transforms for the transform loader.
  • Provide smart defaults for the entirety of the treeherder section:
    • Use the capitalized first letter of the kind for symbol (if multiple kinds share the same first letter, take the first two -- but after that make it required).
    • kind can obviously be trivially filled out
    • tier can default to 1
    • platform can default to the job/task name within the kind
  • Parts of worker can have defaults as follows:
    • chain-of-trust defaults to true
    • taskcluster-proxy defaults to true (unless this is a security risk for a reason I'm unaware of)
    • max-run-time can default to some value (I'll throw out 1800, but I have no real strong opinion)
    • artifacts should use the more or less standard entry of
        artifacts:
            - type: directory
              name: public/build
              path: /builds/worker/artifacts
  • run can also provide a number of defaults
    • using can be run-task
    • use-caches can be true
    • cwd can be {checkout}
    • worker-type can be b-linux
      • I could be convinced this is a bad idea - but I think it's true that unless you're building for Windows or macOS, your task is probably running on Linux.

Because these changes will be making many things implicit where they used to be explicit, we must (as in, do not land without this) update the docs with better reference information if/when we do this. The goal here is to provide very sensible defaults to lower the learning curve to adopting taskgraph (and by extension, Taskcluster) -- we do not want to make it more difficult to use a more sophisticated configuration as projects outgrow early simplicity.

Provide a single way of specifying defaults for custom parameters

There are currently two ways to specify defaults for custom parameters:

  1. There's the defaults_fn argument to the extend_parameters function: https://github.com/ahal/taskgraph/blob/6b25e9e55c64b52e8d1af0dba267dbf3ace43dc6/src/taskgraph/parameters.py#L128
  2. Then there's the decision-parameters function defined in config.yml that the decision task calls: https://github.com/ahal/taskgraph/blob/6b25e9e55c64b52e8d1af0dba267dbf3ace43dc6/src/taskgraph/decision.py#L238

The former only works when running Taskgraph locally, because we ignore those defaults whenever strict=True. The latter only works when running from a Decision task, as it's only ever invoked from decision.py.

This is silly; there should be a single method of providing defaults that works both locally and when running from a Decision task (with the ability to specify different values for the latter).
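
For reference, the first mechanism is used roughly like this today (a sketch; the parameter name is made up):

from taskgraph.parameters import extend_parameters_schema

def get_defaults(repo_root):
    return {"my_custom_parameter": "some-default"}

# Registers the custom parameter and its defaults -- but these defaults
# are only honored when running Taskgraph locally (strict=False).
extend_parameters_schema(
    {"my_custom_parameter": str},
    defaults_fn=get_defaults,
)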

run-task shouldn't fetch tags from the head_repo

a9a5fae made run-task fetch tags from the head repo. In a PR, where the head and base repos are different, tags can also be different, and conflicting, causing the fetch to fail.
I don't know the reason this was added, but maybe it would be enough to explicitly fetch tags from the base repo, as the only thing we should care about from the head repo is the commit?

Generic 'index_builder' is Mercurial specific

The generic 'index_builder' that gets used by default adds routes referencing pushlog-id and such:
https://github.com/taskcluster/taskgraph/blob/main/src/taskgraph/transforms/task.py#L180

We should try to make it truly generic across VCS types as well. We'll either need to move the functionality the current one provides into a new Mercurial-specific builder, or perhaps we can just get Gecko to define it if it's the only consumer that needs those routes.

Additional pre-commit checks

Now that we're using pre-commit, should we enable any additional checks?

Some checks that I wouldn't mind adding:

  • isort
  • type checking
  • conventional commits
  • codespell
  • taskcluster_yml_validator
  • pyupgrade

Suggestions or rebuttals to the above welcome!

Dump Python and other runtime info in run-task

Imported from: https://bugzilla.mozilla.org/show_bug.cgi?id=1696947

It's often handy to know what version of Python a task is running against. However, this info is usually missing from the logs. While it would be tricky to tell which version of Python a command will use (e.g. mach might decide to use 2 vs 3 and we have no way of telling that from the taskgraph), we could get run-task to dump the versions of all default Python executables, e.g.:

$ python --version
$ python2 --version
$ python3 --version

Then we'd only have to know which of those are being used where.
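
A sketch of what run-task could do (the helper name is made up):

import shutil
import subprocess

def dump_python_versions():
    # Print the version of each default Python executable on PATH.
    for exe in ("python", "python2", "python3"):
        path = shutil.which(exe)
        if path:
            result = subprocess.run(
                [path, "--version"], capture_output=True, text=True
            )
            # Older Pythons print the version to stderr.
            version = (result.stdout or result.stderr).strip()
            print(f"{exe} ({path}): {version}")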

Print a nice error message when using `taskgraph {load|build}-image` without docker started

Currently you get the following very confusing output:

$ taskgraph load-image --task-id=ROzeMiL1Rv-5A_uIhJr4UA
Traceback (most recent call last):
  File "/home/ahal/.pyenv/versions/taskgraph/lib/python3.7/site-packages/requests/adapters.py", line 470, in send
    low_conn.endheaders()
  File "/home/ahal/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/ahal/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/ahal/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/ahal/.pyenv/versions/taskgraph/lib/python3.7/site-packages/requests_unixsocket/adapters.py", line 41, in connect
    sock.connect(socket_path)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ahal/dev/taskgraph/src/taskgraph/main.py", line 544, in load_image
    ok = load_image_by_task_id(args["task_id"], args.get("tag"))
  File "/home/ahal/dev/taskgraph/src/taskgraph/docker.py", line 61, in load_image_by_task_id
    result = load_image(artifact_url, tag)
  File "/home/ahal/dev/taskgraph/src/taskgraph/docker.py", line 198, in load_image
    docker.post_to_docker(download_and_modify_image(), "/images/load", quiet=0)
  File "/home/ahal/dev/taskgraph/src/taskgraph/util/docker.py", line 49, in post_to_docker
    headers={"Content-Type": "application/x-tar"},
  File "/home/ahal/.pyenv/versions/taskgraph/lib/python3.7/site-packages/requests/sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/ahal/.pyenv/versions/taskgraph/lib/python3.7/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ahal/.pyenv/versions/taskgraph/lib/python3.7/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/home/ahal/.pyenv/versions/taskgraph/lib/python3.7/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: [Errno 2] No such file or directory
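
A sketch of the kind of guard that would help (the wrapper is hypothetical; requests is already in use on these code paths):

import sys
from typing import Callable

import requests

def run_docker_command(command: Callable[[], object]):
    # Turn the opaque ConnectionError into an actionable message when
    # the docker socket isn't reachable.
    try:
        return command()
    except requests.exceptions.ConnectionError as e:
        print(
            "Could not connect to the docker daemon. Is docker installed "
            f"and running? (original error: {e})",
            file=sys.stderr,
        )
        sys.exit(1)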

Verify artifacts in `run-task` on docker-worker

There's currently a bug where docker-worker doesn't fail if an artifact doesn't exist. Ideally this would be fixed in docker-worker, but given that we're aiming to deprecate it, and given a general lack of engineering resources, it might be worth fixing this in run-task as a hacky solution until we do.

The proposal here is for run_task.py to pass the list of artifacts we're expecting into the run-task script (possibly only if a certain attribute exists to preserve backwards compatibility), then the run-task script will check these locations and raise an exception if any of them don't exist.
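
A sketch of the run-task side (the environment variable and its format are made up):

import os
import sys

def verify_artifacts(paths):
    # Fail the task if any artifact the payload declared is missing
    # once the command has finished.
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        print(f"error: missing artifacts: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)

# e.g., populated by the transforms:
# verify_artifacts(os.environ["RUN_TASK_ARTIFACT_PATHS"].split(";"))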

DONTBUILD keyword is ignored on `github-push` events

The DONTBUILD keyword is a feature we implemented in taskgraph: if a commit message contains DONTBUILD, then taskgraph doesn't schedule any Taskcluster tasks. It works pretty well on Mercurial, but last week I found it doesn't work at all on GitHub. See this push of firefox-android.

Steps to reproduce

  1. Follow the alternative installation steps https://github.com/taskcluster/taskgraph#installation
  2. touch FOO && git add FOO && git commit -m 'Add foo DONTBUILD'
  3. export TASKCLUSTER_ROOT_URL='https://firefox-ci-tc.services.mozilla.com' && export TASK_ID='dummyTask'
  4. taskgraph decision \
       --pushlog-id='0' \
       --pushdate='0' \
       --project='taskgraph' \
       --owner='[email protected]' \
       --level='1' \
       --base-repository='https://github.com/taskcluster/taskgraph' \
       --base-ref='main' \
       --base-rev='HEAD~1' \
       --head-repository='https://github.com/taskcluster/taskgraph' \
       --head-ref='main' \
       --head-rev='HEAD' \
       --repository-type="git" \
       --tasks-for='github-push' \
       --message="Add foo DONTBUILD"

Expected results

The script runs successfully and, more importantly, filter_target_tasks should prune all tasks:

2023-02-20 15:13:21,469 - INFO - Filter filter_target_tasks pruned 9 tasks (0 remain)

Actual results

The script fails because it schedules tasks; more importantly, filter_target_tasks doesn't filter out every task.

2023-02-20 15:09:24,135 - INFO - Filter filter_target_tasks pruned 6 tasks (3 remain)

Side notes

I recommend searching for the string DONTBUILD; it will highlight where the fix should happen. The fix is likely a one-liner, but we can use this opportunity to add more unit tests.

After the fix, the script will still fail for another reason. Let's handle that other reason in #190.

Support Git repositories in `files_changed.py` and `skip-unless-changed` optimization strategy

This strategy raises a RuntimeError if used on a Git repo:

raise RuntimeError(

This is because files_changed.py queries hg.m.o to retrieve the files. There's no reason it couldn't query e.g. GitHub for this information as well, though I'm unsure whether there's an API for pushes...

Alternatively maybe this information could be found locally by comparing the head REF to the origin branch and git diff-tree or something similar.

A bit of thought is needed here, but it should be doable for most cases.
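
A sketch of the local-comparison approach (the helper is hypothetical):

import subprocess

def files_changed_git(base_rev, head_rev, cwd=None):
    # List files touched on head since it diverged from base; the
    # three-dot form diffs against the merge-base.
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_rev}...{head_rev}"],
        capture_output=True, text=True, check=True, cwd=cwd,
    )
    return [line for line in result.stdout.splitlines() if line]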

`.DS_Store` files should be ignored by git

Found while reviewing #240. When writing code on macOS, Finder generates .DS_Store files to cache some metadata about the files in the current folder. This is a binary file that is not meant to be shared. We can easily ignore it via taskgraph/.gitignore, which currently begins with (lines 1 to 3 at a0dd5d3):

# Editor
*~
*.dir-locals.el # Emacs directory variable files.

It should just be a matter of adding .DS_Store to this file.

Mechanism for automatically fetching secrets

When a task needs to use a secret, we need to create a custom wrapper script that first downloads the secret from the secrets service, sets it up, and then invokes the regular command. This makes creating tasks that need secrets inconvenient. There's no way to just "declare" that the secret is needed in the task definition and have everything just work.

I propose we introduce a new run-task based mechanism for handling secrets. At a high level it would work like this:

  1. A task declares which secrets it needs in a top-level secrets key.
  2. The run_task.py transforms would massage this into a format we can easily stuff into an environment variable.
  3. The run-task script reads said env and fetches the required secrets before proceeding with the task

There is precedent here as this is exactly how fetches work.

To start, the schema for defining the secrets should support both environment variables and files. Maybe something like:

secrets:
    - secret: myproject/secret
      key: api_token
      env: API_TOKEN
    - secret: myproject/other/secret
      file: /builds/worker/secret

The above definition would:
A) store the value of api_token in the myproject/secret secret into the API_TOKEN env
B) write the entirety of the myproject/other/secret secret wholesale into a file at /builds/worker/secret
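
A sketch of the run-task side (assuming the task runs with taskcluster-proxy enabled and a new-style root URL; the helpers are hypothetical):

import json
import os
import urllib.request

def fetch_secret(name):
    # Read a secret through taskcluster-proxy; the task must be granted
    # the matching secrets:get:<name> scope.
    proxy = os.environ["TASKCLUSTER_PROXY_URL"].rstrip("/")
    with urllib.request.urlopen(f"{proxy}/api/secrets/v1/secret/{name}") as r:
        return json.load(r)["secret"]

def setup_secrets(declarations):
    # Apply the schema above: 'env' entries export one key from the
    # secret, 'file' entries write the whole secret to disk.
    for decl in declarations:
        secret = fetch_secret(decl["secret"])
        if "env" in decl:
            os.environ[decl["env"]] = secret[decl["key"]]
        elif "file" in decl:
            with open(decl["file"], "w") as f:
                json.dump(secret, f)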

Tie docker images to Taskgraph releases

There are currently several docker images that are generated in Taskgraph's CI and then shared to consumers (either via index or upload to Dockerhub). These are:

  • decision
  • decision-mobile
  • index-task
  • image_builder
  • fetch (?)

(I don't believe fetch is being used outside of Taskgraph, but it's a good candidate to expose so projects don't need to re-define it)

The problem is that these images get generated / released at arbitrary points in time. So it's not clear which versions of Taskgraph they were generated from. It also means that they contain different versions of code from the ones present in the taskgraph library consumers are using. For example, run-task gets baked into the decision images. This means that the Decision task will run with a different version of run-task than the one tasks are using.

I think instead of releasing these images at arbitrary points in time, we should tie their release to actual Taskgraph releases. This way we could specify, e.g:

image: mozillareleases/taskgraph-decision:1.5.1

in the .taskcluster.yml. If we decide to do this, we'll likely want to create some tasks that upload the images to GCP image store automatically and have them run in a release graph.

Some food for thought.

Support `single_dep` and `multi_dep` functionality

The multi_dep and single_dep loaders (and related transforms) are some of the most widely copied Taskgraph logic from project to project. We've mentioned upstreaming them into Taskgraph countless times! However, over the years I've come to realize that these loaders are unnecessary and overly complicated (most recently while working on mozilla-releng/mozilla-taskgraph#7 in conjunction with the linked firefox-android PR).

Essentially they're doing too much at once: building upstream artifacts, copying attributes, resolving keys. The root issue is that they add a lot of complexity right up front, which every later transform then needs to deal with. Instead we should add complexity little by little, and only as needed.

I'm still formulating thoughts here, but I think my rough plan is to:

  1. Add transforms for adding dependencies given a set of kinds.
  2. Add a utility file for deriving upstream-artifacts from a task's dependencies.
  3. Add transforms for copying attributes from a primary-dep.
  4. Push more logic into the scriptworker payload builders.

I believe with the above pieces, we'll be able to completely obsolete the single_dep and multi_dep loaders and replace them with a much simpler and easier-to-follow setup.

Checkout in `run-task` fails in certain scenarios

Failure reference: https://firefox-ci-tc.services.mozilla.com/tasks/at-bWwQKQ4q_3RUfPgE45w/runs/0/logs/public/logs/live.log#L28

Example:

$ git clone https://github.com/mozilla-rally/rally-core-addon
Cloning into 'rally-core-addon'...
remote: Enumerating objects: 5038, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 5038 (delta 19), reused 4 (delta 2), pack-reused 5006
Receiving objects: 100% (5038/5038), 10.50 MiB | 7.72 MiB/s, done.
Resolving deltas: 100% (3153/3153), done.

$ cd rally-core-addon 

$ git fetch --no-tags https://github.com/mozilla-rally/rally-core-addon release
From https://github.com/mozilla-rally/rally-core-addon
 * branch            release    -> FETCH_HEAD

$ git checkout -f -B release release
fatal: 'release' is not a commit and a branch 'release' cannot be created from it

Improve performance of `test_util_vcs.py`

This test takes a while, and since we run it across multiple Python versions, it is the main reason why the unit task takes so long to run. It would be nice to figure out why it's so slow and see if we can speed it up at all.

Rename 'job*' keys in transform loader to 'task'

The fact that we use 'job' everywhere in the TransformLoader (jobs, job-defaults, jobs-from) is an artifact from the old buildbot / tbpl days. It's confusing, since task is very much the standard terminology in Taskcluster land. Let's rename these keys and deprecate the term job.

One path forward here (sketched after the list) might be to:

  1. Land a patch that supports both job and task, but if job is used, log a DeprecationWarning.
  2. Remove support for job in a future major release.
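
A sketch of step 1 (the function name is made up):

import warnings

def normalize_loader_config(config):
    # Accept both spellings; warn when the deprecated one is used.
    renames = (
        ("jobs", "tasks"),
        ("job-defaults", "task-defaults"),
        ("jobs-from", "tasks-from"),
    )
    for old, new in renames:
        if old in config:
            warnings.warn(
                f"'{old}' is deprecated; use '{new}' instead.",
                DeprecationWarning,
            )
            config[new] = config.pop(old)
    return config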

Allow fetch tasks to be generated by embedded entries in other tasks

For simple use cases, embedding fetch definitions into tasks could be much nicer than the indirection of having a separate kind for them. It can also be nice to embed them where only one entry in a kind needs them -- it puts the fetch with the thing that actually needs it, which makes it more obvious what's going on, and lowers the maintenance burden a small amount.

I've put together a prototype for how this could work in these patches:

It's a bit hacky at the moment, but it does prove that we can generate unrelated tasks (e.g. a fetch) while processing tasks from another kind (e.g. build). This may seem very strange, and it certainly breaks new ground in taskgraph. When I look at things through the eyes of someone building a project (in this case MozillaVPN), the tasks I really care about are build, test, etc. -- things that produce artifacts that are of value to developers or users. fetch, on the other hand, is an implementation detail. There are certainly good reasons for fetches existing in a separate kind at times (most obviously, when a single fetch is used by many downstream tasks) -- but that shouldn't be strictly necessary.

Obviously the linked patches are not remotely landable. There's hardcoding that has to be fixed, and most importantly, the generation of the TransformsSequence would need to be formalized (especially the fetch_only_job_transfoms part, where it's taking only part of the job transforms -- I have some ideas on how to clean that up).

We can also just close this as undesirable if it's going too far or a bad idea for some reason.

Allow duplicated dependencies in task definitions

Imported from: https://bugzilla.mozilla.org/show_bug.cgi?id=1574189

nalexander writes:

This has now bitten me twice, so it's time to file. The dependencies: block is a map from string alias to string job name. The TC API rejects repeated dependencies in the submitted task definition; for example:

[task 2019-08-15T04:34:27.531Z] Schema Validation Failed!
[task 2019-08-15T04:34:27.531Z] Rejecting Schema: https://schemas.taskcluster.net/queue/v1/create-task-request.json#
[task 2019-08-15T04:34:27.531Z] Errors:
[task 2019-08-15T04:34:27.531Z]   * data.dependencies should NOT have duplicate items (items ## 1 and 0 are identical)
[task 2019-08-15T04:34:27.531Z] 
[task 2019-08-15T04:34:27.531Z] ---
[task 2019-08-15T04:34:27.531Z] 
[task 2019-08-15T04:34:27.531Z] * method:     createTask
[task 2019-08-15T04:34:27.531Z] * errorCode:  InputValidationError
[task 2019-08-15T04:34:27.531Z] * statusCode: 400
[task 2019-08-15T04:34:27.531Z] * time:       2019-08-15T04:34:27.496Z
[task 2019-08-15T04:34:27.535Z] 
[task 2019-08-15T04:34:27.535Z] Schema Validation Failed!
[task 2019-08-15T04:34:27.535Z] Rejecting Schema: https://schemas.taskcluster.net/queue/v1/create-task-request.json#
[task 2019-08-15T04:34:27.535Z] Errors:
[task 2019-08-15T04:34:27.535Z]   * data.dependencies should NOT have duplicate items (items ## 7 and 6 are identical)
[task 2019-08-15T04:34:27.536Z] 
[task 2019-08-15T04:34:27.536Z] ---
[task 2019-08-15T04:34:27.536Z] 
[task 2019-08-15T04:34:27.536Z] * method:     createTask
[task 2019-08-15T04:34:27.536Z] * errorCode:  InputValidationError
[task 2019-08-15T04:34:27.536Z] * statusCode: 400
[task 2019-08-15T04:34:27.536Z] * time:       2019-08-15T04:34:27.501Z
[task 2019-08-15T04:34:27.613Z] 
[task 2019-08-15T04:34:27.613Z] Schema Validation Failed!
[task 2019-08-15T04:34:27.613Z] Rejecting Schema: https://schemas.taskcluster.net/queue/v1/create-task-request.json#
[task 2019-08-15T04:34:27.613Z] Errors:
[task 2019-08-15T04:34:27.613Z]   * data.dependencies should NOT have duplicate items (items ## 6 and 5 are identical)
[task 2019-08-15T04:34:27.613Z] 
[task 2019-08-15T04:34:27.613Z] ---
[task 2019-08-15T04:34:27.613Z] 
[task 2019-08-15T04:34:27.613Z] * method:     createTask
[task 2019-08-15T04:34:27.613Z] * errorCode:  InputValidationError
[task 2019-08-15T04:34:27.613Z] * statusCode: 400
[task 2019-08-15T04:34:27.613Z] * time:       2019-08-15T04:34:27.579Z

Fine, fair enough. But it's legitimately useful to be able to give the same target task name multiple aliases; for example, I want to refer to a Linux build for extracting one of the test archives, but there may already be a reference to that Linux build for extracting the package under test. Right now I need to conditionally change my alias so as to not have a repeated dependency.

Can we make the dependency handling code know that the target dependencies are a set, and manage multiple aliases to the same task?
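
For what it's worth, since dependencies maps alias to label, deduplication could happen when the submitted definition is built. A sketch:

def deduped_dependencies(dependencies):
    # 'dependencies' maps alias -> label, so multiple aliases may point
    # at the same task; the submitted definition needs each task only once.
    return sorted(set(dependencies.values()))

deps = {"build": "build-linux64/opt", "tests-archive": "build-linux64/opt"}
assert deduped_dependencies(deps) == ["build-linux64/opt"]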

Cannot run `pip-compile-multi` anymore: Could not find a version that matches importlib-resources<3.5,>=3.0,>=5.4

Steps to reproduce

Run:

docker run -t -v "$PWD:/src" -w /src python:3.6 bash -cx "pip install pip-compile-multi && pip-compile-multi --generate-hashes base --generate-hashes dev --generate-hashes test --allow-unsafe"

Expected result

All requirement files get bumped.

Actual result

Only base.txt does. The script fails to handle test.txt (which also blocks dev.txt). It errors out this way:

Finding the best candidates:
  found candidate alabaster==0.7.12 (constraint was >=0.7,<0.8)
  [...]
  found candidate importlib-metadata==4.8.3 (constraint was >=0.12,>=4.8.3)
Could not find a version that matches importlib-resources<3.5,>=3.0,>=5.4 (from sphinx-book-theme==0.2.0->-r requirements/test.in (line 11))
Tried: 0.1.0, 0.1.0, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.8, 1.0, 1.0.1, 1.0.1, 1.0.2, 1.0.2, 1.1.0, 1.1.0, 1.2.0, 1.2.0, 1.3.0, 1.3.0, 1.3.1, 1.3.1, 1.4.0, 1.4.0, 1.5.0, 1.5.0, 2.0.0, 2.0.0, 2.0.1, 2.0.1, 3.0.0, 3.0.0, 3.1.0, 3.1.0, 3.1.1, 3.1.1, 3.2.0, 3.2.0, 3.2.1, 3.2.1, 3.3.0, 3.3.0, 3.3.1, 3.3.1, 4.0.0, 4.0.0, 4.1.0, 4.1.0, 4.1.1, 4.1.1, 5.0.0, 5.0.0, 5.0.2, 5.0.2, 5.0.3, 5.0.3, 5.0.4, 5.0.4, 5.0.5, 5.0.5, 5.0.6, 5.0.6, 5.0.7, 5.0.7, 5.1.0, 5.1.0, 5.1.1, 5.1.1, 5.1.2, 5.1.2, 5.1.3, 5.1.3, 5.1.4, 5.1.4, 5.2.0, 5.2.0, 5.2.1, 5.2.1, 5.2.2, 5.2.2, 5.2.3, 5.2.3, 5.3.0, 5.3.0, 5.4.0, 5.4.0
There are incompatible versions in the resolved dependencies:
  importlib-resources<3.5,>=3.0 (from sphinx-book-theme==0.2.0->-r requirements/test.in (line 11))
  importlib-resources>=5.4 (from virtualenv==20.16.5->tox==3.26.0->-r requirements/test.in (line 13))

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pipcompilemulti/cli_v1.py", line 26, in cli
    recompile()
  File "/usr/local/lib/python3.6/site-packages/pipcompilemulti/actions.py", line 31, in recompile
    compile_topologically(env_confs, deduplicator)
  File "/usr/local/lib/python3.6/site-packages/pipcompilemulti/actions.py", line 38, in compile_topologically
    if env.maybe_create_lockfile():
  File "/usr/local/lib/python3.6/site-packages/pipcompilemulti/environment.py", line 51, in maybe_create_lockfile
    self.create_lockfile()
  File "/usr/local/lib/python3.6/site-packages/pipcompilemulti/environment.py", line 80, in create_lockfile
    raise RuntimeError("Failed to pip-compile {0}".format(self.infile))
RuntimeError: Failed to pip-compile requirements/test.in

This is likely a regression introduced in #83.

Additional notes

It hasn't blocked me so far since I've been able to bump base.txt anyway. I just filed this ticket in order to keep a record of that bug.

Fix false assumptions in 'build_docker_worker_payload'

We call util.docker.parse_volumes in the build_docker_worker_payload function
https://github.com/taskcluster/taskgraph/blob/main/src/taskgraph/transforms/task.py#L342

This in turn makes several false assumptions, namely that the root dir is taskcluster/ci and that there exists a kind called docker-image:
https://github.com/taskcluster/taskgraph/blob/main/src/taskgraph/util/docker.py#L305

Neither of these things are necessarily true. Rather than re-parsing the docker image's kind.yml file, we should store the path to the Dockerfile as an attribute (or tag) on the docker image task. Then we should move the check from the first link to a "verification" in util/verify.py. This way we'll be able to access the path to the Dockerfile directly from the dependency.

~/.hgrc can break some tests

I happened to have an ancient entry in my .hgrc:

[paths] 
review = https://reviewboard-hg.mozilla.org/autoreview 

...and found that it causes a few tests to fail:

test/test_util_vcs.py::test_remote_name_no_remote[hg] FAILED                                          [ 88%]
test/test_util_vcs.py::test_remote_name[hg] FAILED                                                    [ 88%]
test/test_util_vcs.py::test_all_remote_names[hg] FAILED                                               [ 89%]
test/test_util_vcs.py::test_remote_name_many_remotes[hg] FAILED  

In this case I should've removed this from my .hgrc a long time ago -- but it seems like tests should ideally not depend on this externality.
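
One common fix (a sketch, assuming the tests use pytest, as the output above suggests) is to isolate the tests from the user's Mercurial config via HGRCPATH:

import os

import pytest

@pytest.fixture(autouse=True)
def isolate_hgrc(monkeypatch):
    # Point Mercurial at an empty config so ~/.hgrc can't leak into
    # test behavior.
    monkeypatch.setenv("HGRCPATH", os.devnull)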

Replace voluptuous with a faster schema validator

Imported from: https://bugzilla.mozilla.org/show_bug.cgi?id=1652123

When profiling taskgraph in Gecko, we discovered that voluptuous is responsible for a significant performance hit to Taskgraph. So much so that we disable schema validation entirely when taskgraph.fast is set. We also avoid using voluptuous to set default values, and instead set defaults via a transform.

There are many more performant validation libraries out there. In another project, I once successfully replaced voluptuous with validx (but we should investigate all options).

Generate kinds concurrently

Currently we generate each kind one after the other, which can take a while (we're approaching 5 min in Gecko now). Instead, we should generate the tasks for kinds in parallel.

We'll have to take kind-dependencies into account and come up with a way to synchronize workers such that we only generate a kind once all of its dependencies have been generated.
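
A sketch of one way to do that with futures keyed by kind name (assuming kinds arrive in topological order, as they already must, and that each kind exposes name and config as today):

from concurrent.futures import ThreadPoolExecutor

def generate_kinds(kinds_in_topological_order, load_tasks):
    # Run load_tasks for each kind as soon as all of the kinds it
    # depends on have finished generating.
    with ThreadPoolExecutor() as executor:
        futures = {}

        def generate(kind):
            for dep in kind.config.get("kind-dependencies", []):
                futures[dep].result()  # submitted earlier; safe to wait on
            return load_tasks(kind)

        for kind in kinds_in_topological_order:
            futures[kind.name] = executor.submit(generate, kind)
        return {name: f.result() for name, f in futures.items()}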

Decision task checks for all missing scopes before task submission

Imported from: https://bugzilla.mozilla.org/show_bug.cgi?id=1416858

Aki writes:
Currently, the decision task builds the task graph in memory in several steps, then submits the tasks to the queue. The first task submission to fail with scope errors will kill the decision task with an error message. However, when adding a number of new tasks and scopes, it's possible that the missing scopes listed in the error message are only a subset of the scopes needed for submitting the complete graph.

It would be great to have the decision task calculate all the scopes required for the task graph, and determine if it has sufficient scopes, before submitting any tasks. Then when we file scopes bugs, we'd be able to request the full set of missing scopes, rather than the currently failing subset.
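
A sketch of the core check (a toy scope-satisfaction test; real satisfaction also involves role expansion, which this ignores):

def missing_scopes(task_graph, decision_scopes):
    # Union every task's scopes across the graph and report everything
    # missing at once, before any createTask call is made.
    required = set()
    for task in task_graph.tasks.values():
        required.update(task.task.get("scopes", []))

    def satisfied(scope):
        return any(
            scope == have or (have.endswith("*") and scope.startswith(have[:-1]))
            for have in decision_scopes
        )

    return sorted(s for s in required if not satisfied(s))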
