Node.js Core CI Reliability

This repo is used for tracking flaky tests on the Node.js CI and fixing them.

Current status: work in progress. Please go to the issue tracker to discuss!

Updating this repo

Updates should be merged as soon as possible. We can revert or modify afterwards. This repo is mostly for coordination so we need to move fast and reduce the noise.

The Goal

Make the CI green again.

The Definition of Green

  • A green CI run is a run with a SUCCESS status. UNSTABLE does not count as green.

  • At any given time, the green rate is calculated over the last 100 runs as follows:

    SUCCESS / (100 - RUNNING - ABORTED)
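The formula can be sketched in a few lines of Python. This is only an illustration: the function name `green_rate` and the `window` parameter are assumptions and do not come from this repo's tooling.

```python
def green_rate(success: int, running: int, aborted: int, window: int = 100) -> float:
    """Return SUCCESS / (window - RUNNING - ABORTED) as a fraction.

    Hypothetical helper: RUNNING and ABORTED runs are excluded from the
    denominator, so UNSTABLE and FAILURE runs both count against the rate.
    """
    finished = window - running - aborted
    if finished <= 0:
        raise ValueError("no finished runs in the window")
    return success / finished


# The 2018-06-04 15:00 entry in CI Health History:
# 0 RUNNING, 9 SUCCESS, 10 ABORTED out of the last 100 runs.
print(f"{green_rate(success=9, running=0, aborted=10):.2%}")  # 10.00%
```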
    

CI Health History

See https://nodejs-ci-health.mmarchini.me/#/job-summary

UTC Time          RUNNING  SUCCESS  UNSTABLE  ABORTED  FAILURE  Green Rate
2018-06-01 20:00        1        1        15       11       72       1.13%
2018-06-03 11:36        3        6        21       10       60       6.89%
2018-06-04 15:00        0        9        26       10       55      10.00%
2018-06-15 17:42        1       27         4       17       51      32.93%
2018-06-24 18:11        0       27         2        8       63      29.35%
2018-07-08 19:40        1       35         2        4       58      36.84%
2018-07-18 20:46        2       38         4        5       51      40.86%
2018-07-24 22:30        2       46         3        4       45      48.94%
2018-08-01 19:11        4       17         2        2       75      18.09%
2018-08-14 15:42        5       22         0       14       59      27.16%
2018-08-22 13:22        2       29         4        9       56      32.58%
2018-10-31 13:28        0       40        13        4       43      41.67%
2018-11-19 10:32        0       48         8        5       39      50.53%
2018-12-08 20:37        2       18         4        3       73      18.95%

Handling Failed CI runs

Flaky Tests

TODO: automate all of this in ncu-ci

Identifying Flaky Tests

When checking the CI results of a PR, if there are one or more failed tests (with not ok as the TAP result):

  1. If the failed test is not related to the PR (it does not touch the modified code path), search for the test name in the issue tracker of this repo. If there is an existing issue, add a reply there using the reproduction template, and open a pull request updating flakes.json.
  2. If there are no existing issues about the test, run the CI again. If the failure disappears in the next run, it is a potential flake. See "When Discovering a Potential New Flake on the CI" for what to do next.
  3. If the failure reproduces in the next run, it is likely related to the PR. Try to debug the failure, and do not re-run the CI without code changes in the next 24 hours.
  4. If the cause of the failure still cannot be identified 24 hours later and the code has not been changed, start a CI run and see if the failure disappears. Go back to step 3 if the failure still reproduces, and go to step 2 if the failure disappears.
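The steps above can be read as a small decision procedure. The sketch below is one way to write it down, assuming the failed test has already been judged unrelated to the PR. The names `NextStep` and `triage_failed_test` are hypothetical and are not part of ncu-ci.

```python
from enum import Enum
from typing import Optional


class NextStep(Enum):
    REPORT_ON_EXISTING_ISSUE = "reply on the existing issue and update flakes.json"
    RERUN_CI = "run the CI again"
    REPORT_NEW_FLAKE = "handle as a potential new flake"
    DEBUG_THEN_RERUN = "debug; re-run after 24 hours if the code is unchanged"


def triage_failed_test(has_existing_issue: bool,
                       rerun_failed: Optional[bool] = None) -> NextStep:
    """Map the state of a failed, PR-unrelated test to the next action.

    rerun_failed is None until the CI has been run again, then True if the
    failure reproduced and False if it disappeared.
    """
    if has_existing_issue:           # step 1
        return NextStep.REPORT_ON_EXISTING_ISSUE
    if rerun_failed is None:         # step 2, before the re-run
        return NextStep.RERUN_CI
    if rerun_failed:                 # steps 3 and 4
        return NextStep.DEBUG_THEN_RERUN
    return NextStep.REPORT_NEW_FLAKE  # step 2, failure disappeared


print(triage_failed_test(has_existing_issue=False, rerun_failed=False).value)
```

Calling the function again after each new CI run covers the loop between steps 3 and 4: a reproduced failure keeps returning `DEBUG_THEN_RERUN`, and a failure that disappears returns `REPORT_NEW_FLAKE`.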

When Discovering a Potential New Flake on the CI

  1. Open an issue in this repo using the flake issue template:
    • Title should be Investigate path/under/the/test/directory/without/extension, for example Investigate async-hooks/test-zlib.zlib-binding.deflate.
  2. Add the Flaky Test label and relevant subsystem labels (TODO: create useful labels).
  3. Open a pull request updating flakes.json.
  4. Notify the subsystem team related to the flake.
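The issue title convention from step 1 can be derived mechanically from a test file path. The following is a minimal sketch: the helper name `flake_issue_title` is hypothetical, and it assumes paths are given relative to the repository root with a leading test/ directory.

```python
from pathlib import PurePosixPath


def flake_issue_title(test_file: str) -> str:
    """Build "Investigate <path under test/ without extension>".

    Assumes a repository-relative POSIX path such as "test/<dir>/<name>.js".
    """
    path = PurePosixPath(test_file)
    if path.parts and path.parts[0] == "test":
        # Drop the leading test/ directory.
        path = PurePosixPath(*path.parts[1:])
    # with_suffix("") removes only the final extension, so dots inside the
    # test name are preserved.
    return f"Investigate {path.with_suffix('')}"


print(flake_issue_title("test/async-hooks/test-zlib.zlib-binding.deflate.js"))
# Investigate async-hooks/test-zlib.zlib-binding.deflate
```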

Infrastructure failures

When the CI run fails because:

  • There are network connection issues
  • Tests fail with ENOSPC (No space left on device)
  • The CI machine has trouble pulling source code from the repository

Do the following:

  1. Search this repo for the error message to see whether there is an open issue about it.
  2. If there is an existing issue, wait until the problem gets fixed.
  3. If there are no similar issues, open a new one with the build infra issue template.
  4. Add label Build Infra.
  5. Notify the @nodejs/build-infra team in the issue.
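As a rough aid for spotting the three failure causes listed above in a CI log, a pattern-based check might look like the sketch below. The function name and every pattern in it are illustrative assumptions; real logs may use different messages, so a match is only a hint to go and search the issue tracker.

```python
import re
from typing import Optional

# Illustrative patterns only (assumptions, not taken from this repo).
INFRA_PATTERNS = {
    "network connection": re.compile(r"ECONNRESET|ETIMEDOUT"),
    "disk space": re.compile(r"ENOSPC|No space left on device", re.IGNORECASE),
    "source checkout": re.compile(r"fatal: unable to access"),
}


def classify_infra_failure(log: str) -> Optional[str]:
    """Return the first matching infrastructure category, or None."""
    for category, pattern in INFRA_PATTERNS.items():
        if pattern.search(log):
            return category
    return None


print(classify_infra_failure("Error: ENOSPC: no space left on device, write"))
# disk space
```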

Build File Failures

When the CI run of a PR that does not touch the build files ends with build failures (e.g. the run ends before the test runner has a chance to run):

  1. Search this repo for the error message, using the lines that contain keywords such as fatal or error.
  2. If there is a similar issue, add a reply there using the reproduction template.
  3. If there are no similar issues, open a new one with the build file issue template.
  4. Add label Build Files.
  5. Notify the @nodejs/build-files team in the issue.
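To find lines worth searching for, a build log can be filtered by the keywords mentioned in step 1. This is a minimal sketch; the function name `candidate_search_lines` and the `limit` parameter are assumptions, and only the keywords fatal and error come from the text.

```python
import re
from typing import List

# Whole-word, case-insensitive match for the keywords named in step 1.
KEYWORDS = re.compile(r"\b(fatal|error)\b", re.IGNORECASE)


def candidate_search_lines(log: str, limit: int = 5) -> List[str]:
    """Return up to `limit` distinct log lines that mention fatal or error."""
    seen = set()
    matches = []
    for line in log.splitlines():
        text = line.strip()
        if text and text not in seen and KEYWORDS.search(text):
            seen.add(text)
            matches.append(text)
            if len(matches) == limit:
                break
    return matches


sample_log = """make[1]: Entering directory 'out'
../src/example.cc:12:3: error: 'foo' was not declared in this scope
make[1]: *** [example.o] Error 1
"""
for line in candidate_search_lines(sample_log):
    print(line)
```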

TODO

  • Settle on the flake database schema
  • Read the flake database in ncu-ci so people can quickly tell if a failure is a flake
  • Automate the report process in ncu-ci
  • Migrate existing issues in nodejs/node and nodejs/build, and close outdated ones
  • Automate CI health history tracking
