aicoe-aiops / ocp-ci-analysis

Developing AI tools for developers by leveraging the data made openly available by OpenShift and Kubernetes CI platforms.

Home Page: https://old.operate-first.cloud/data-science/ai4ci/

License: GNU General Public License v3.0

Python 0.12% Makefile 0.01% Jupyter Notebook 99.87% Shell 0.01%

ocp-ci-analysis's People

Contributors

aakankshaduggal, amsaparov, antter, cdolfi, codificat, csibbitt, dependabot[bot], durandom, fridex, goern, harshad16, hemajv, humairak, isabelizimm, martinpovolny, michaelclifford, oindrillac, sankbad, sesheta, shreyanand, suppathak, tumido


ocp-ci-analysis's Issues

Additional EDA on TestGrid Data set

To close issue #15 and build upon the initial EDA work done in #16, there are a number of additional questions that we would like answered about the TestGrid dataset. Specifically:

  • How comparable are the testgrids?
  • How do we analyze them in aggregate to learn from their combined behavior?
  • How many/which tests do they all have in common?
  • Are their time series dates comparable?
  • Are there sub-groups that should only be compared with one another?
  • Is looking at the grid matrices independent of test names a valid approach for issue identification?
  • What is the expected behavior of a test over time across multiple jobs?
  • How does the entire test platform/specific tests perform on a given day?
  • How does the entire test platform behavior evolve over time?
  • Is there sufficient data here for useful ML approaches?
  • Can we develop some meaningful alerting/problem identification with the results of the above questions?

Acceptance Criteria:

  • Notebook that addresses the questions above (a minimal dashboard-comparison sketch follows below)
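To make the "compare dashboards in aggregate" question concrete, here is a minimal sketch of pulling several dashboards and tallying their overall tab health. It assumes the public TestGrid summary endpoint (`https://testgrid.k8s.io/{dashboard}/summary`) returns a JSON object keyed by tab name with an `overall_status` field; treat the endpoint shape and the example dashboard names as assumptions to verify against the live service.

```python
# A minimal sketch (not from the repo): compare overall tab health across
# several Red Hat TestGrid dashboards via the public summary endpoint.
# The endpoint shape and dashboard names below are assumptions to verify.
import requests
from collections import Counter

DASHBOARDS = [
    "redhat-openshift-ocp-release-4.6-informing",  # hypothetical example names
    "redhat-openshift-ocp-release-4.6-blocking",
]

for dashboard in DASHBOARDS:
    resp = requests.get(f"https://testgrid.k8s.io/{dashboard}/summary", timeout=30)
    resp.raise_for_status()
    tabs = resp.json()  # assumed: dict keyed by tab name
    statuses = Counter(tab.get("overall_status", "UNKNOWN") for tab in tabs.values())
    print(dashboard, dict(statuses))
```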

Add data collection step from sippy to this repo.

As a data scientist I would like a simple way to collect new data from OpenShift CI using sippy, so that we can include updated datasets in our analysis process.

Acceptance Criteria:

  • a script that can achieve the following:

```
# install go, then:
git clone https://github.com/openshift/sippy
cd sippy
make
mkdir /tmp/sippy
# fetch a fresh copy of the raw data from testgrid
./sippy --fetch-data /tmp/sippy --release 4.6
# perform the analysis on the raw data
./sippy --local-data /tmp/sippy --release 4.6 -o json > /tmp/sippy.json
```



Aakanksha up to speed with project

As a data scientist and contributor to this project I need to have a strong hands-on understanding of all the work done to date as well as knowledge of how to improve upon and extend the existing work.

PLEASE READ: This issue should be used as a template. Please make a copy of it and replace <NAME> with your name when creating the new issue.

Acceptance Criteria:

  • Use ocp-ci-analysis:latest image on https://jupyterhub-opf-jupyterhub.apps.cnv.massopen.cloud/ and successfully run every notebook in the notebooks directory.
  • Submit at least 1 Issue/PR fixing a bug, fixing graph formatting, clarifying an unclear notebook section, or adding a small additional data analysis to a notebook. (Look for something to improve as you go through the existing work 😃)
  • Familiarize yourself with the following 3 resources:
    * Sippy Repo and Dashboard for an example of metrics and TestGrid data analysis.
    * TestGrid Repo and Dashboard to familiarize yourself with our initial data source.
    * Prow and google cloud storage to see the underlying CI data informing these higher levels of abstraction.

Understand the testgrid ecosystem

Have a look at https://github.com/GoogleCloudPlatform/testgrid and the video linked from there.
Testgrid already applies some logic to the test runs, like identifying boards with flaky tests, or boards without tests reporting in. You can also report some additional metrics like test coverage.

  1. Is there a central definition for the data uploaded to the Red Hat TestGrid: test status, name, infra (gws)?
  2. Is there some tooling to download the data from TestGrid in Python? (A minimal sketch follows below.)
  3. Is there some prior work on analyzing TestGrid data at scale? Blog posts?
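On question 2, I am not aware of an official Python client, but the TestGrid frontend serves its grid data as JSON, so plain `requests` is usually enough. A minimal sketch, assuming the `table` endpoint with a run-length-encoded `statuses` field and hypothetical dashboard/tab names; confirm the layout against a live response.

```python
# Minimal sketch: fetch one TestGrid tab as JSON and expand the run-length
# encoded statuses into a flat list per test. Endpoint layout is an assumption.
import requests

DASHBOARD = "redhat-openshift-ocp-release-4.6-informing"  # hypothetical example
TAB = "periodic-ci-openshift-release-master-ci-4.6-e2e-gcp"  # hypothetical example

url = f"https://testgrid.k8s.io/{DASHBOARD}/table"
resp = requests.get(url, params={"tab": TAB, "grid": "old"}, timeout=30)
resp.raise_for_status()
table = resp.json()

grid = {}
for test in table.get("tests", []):
    # "statuses" is assumed to be run-length encoded: [{"count": n, "value": v}, ...]
    flat = []
    for run in test.get("statuses", []):
        flat.extend([run["value"]] * run["count"])
    grid[test["name"]] = flat

print(f"{len(grid)} tests, up to {max(map(len, grid.values()), default=0)} runs each")
```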

Short video walkthrough for EDA

As a potential Data Science Contributor I want to quickly understand how to work with the data, and to review and reproduce the EDA that has already been done.

Acceptance Criteria:

  • Short video on data and data access

Initial EDA on google cloud storage log data.

There is a fair amount of semi-structured log (text) data that gets generated by the CI process. There are likely valuable insights in this data that could be leveraged by SMEs if there were an automated way of reducing the total amount of logs that had to be reviewed. As a data scientist, I would like to understand the nature of this data and how best to access it in a data-science-friendly format, so that I can contribute to the development of automated ML methods for analyzing it.

Acceptance Criteria:

  • EDA notebook for the log data found here (see the access sketch below)
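As a starting point for that EDA, the Prow artifacts live in a public Google Cloud Storage bucket and can be listed anonymously with the `google-cloud-storage` client. The bucket name and prefix below are my assumptions about where the OpenShift CI logs live; verify them against the gcsweb links shown in the CI results.

```python
# Minimal sketch: list a few log objects from the (assumed) public OpenShift CI
# bucket without credentials. Bucket name and prefix are assumptions to verify.
from google.cloud import storage

client = storage.Client.create_anonymous_client()
bucket_name = "origin-ci-test"   # assumed public Prow artifacts bucket
prefix = "logs/"                 # assumed top-level prefix for periodic job logs

# List a handful of objects; the main job logs typically end in "build-log.txt".
for blob in client.list_blobs(bucket_name, prefix=prefix, max_results=20):
    print(blob.name, blob.size)
```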

Spike: Identify TestGrid Metrics/KPI's - PR Performance

As Data Scientists, our first task is to convert the raw data generated by the CI processes into meaningful KPIs/Metrics/Features (numbers) that we can track and use to describe the state or behavior of the CI process over time. One of the key elements of the CI process is the PRs that are being tested. The ability to understand the potential behavior of these code changes is critical.

PR KPIs could be things like "Number of commits before merge", "Diff size", "PR complexity", etc. As data scientists, we must admit that we are not currently subject matter experts in CI monitoring and do not know the best metrics to track for monitoring CI to support developers. As such, we need to perform a research spike and look for example KPIs used in the industry that we could collect and monitor from the TestGrid data.

Acceptance Criteria:

  • Open an issue with one new PR performance KPI. The issue must include a link to the resource used to discover the metric, an explanation of why it would be useful to track, and a brief outline describing how we could generate it from our existing data sources.

Identifying flaky tests in TestGrid data

A flaky test exhibits both passing and failing results on the same code. It therefore takes a lot of developer effort to manually determine whether a new failure is a flaky result or a legitimate failure. Hence, we are interested in identifying failures due to flaky tests in TestGrid data using data-driven methods.

Acceptance criteria:

  • Notebook on identifying flaky tests in TestGrid data (a toy heuristic sketch follows below)
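As one possible starting point (not the notebook itself), a simple heuristic is to count how often a test's status flips between pass and fail across consecutive runs of the same job; tests with many flips and a middling pass rate are flake candidates. A sketch over a TestGrid-style status vector (1 = pass, 0 = fail), with an arbitrary flip-rate threshold:

```python
# Toy flakiness heuristic: flag tests whose pass/fail sequence flips often.
# The 0/1 encoding and the 0.3 threshold are illustrative assumptions.
import numpy as np

def flip_rate(statuses):
    """Fraction of consecutive run pairs where the result changed."""
    s = np.asarray(statuses)
    if len(s) < 2:
        return 0.0
    return float(np.mean(s[1:] != s[:-1]))

def looks_flaky(statuses, threshold=0.3):
    s = np.asarray(statuses)
    return flip_rate(s) >= threshold and 0.0 < s.mean() < 1.0

print(looks_flaky([1, 1, 0, 1, 0, 1, 1, 0, 1]))  # True: alternates frequently
print(looks_flaky([1, 1, 1, 1, 0, 0, 0, 0, 0]))  # False: looks like a regression
```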

Missing commit IDs for OpenShift testgrids

For the k8s testgrids we have a commit ID for each run, marked in the red circle in the screenshot below:
[screenshot omitted]

But for the OpenShift testgrids we don't have commit IDs, as marked in the red circle in the image below:
[screenshot omitted]

We can collect a lot of metadata if a commit ID is available for each run. From those commit IDs we can create different features that might be used in the data analysis. Some of those features are:

  1. What type of file is changed in a commit? Test cases that fail due to changes in config files are very likely to be flaky.
  2. A test case that has failed on a git revision that changed a file which was recently changed by more than two authors is highly likely to be a real failure.
  3. A test case that has failed on a git revision where many source code files were changed is highly likely to be a real failure.

Acceptance criteria:

  • Communication with the appropriate team to include commit IDs in TestGrid.

Sanket up to speed with project

As a data scientist and contributor to this project I need to have a strong hands-on understanding of all the work done to date as well as knowledge of how to improve upon and extend the existing work.

Acceptance Criteria:

  • Use ocp-ci-analysis:latest image on https://jupyterhub-opf-jupyterhub.apps.cnv.massopen.cloud/ and successfully run every notebook in the notebooks directory.
  • Submit at least 1 Issue/PR fixing a bug, fixing graph formatting, clarifying an unclear notebook section, or adding a small additional data analysis to a notebook. (Look for something to improve as you go through the existing work 😃)
  • Familiarize yourself with the following 3 resources:
    * Sippy Repo and Dashboard for an example of metrics and TestGrid data analysis.
    * TestGrid Repo and Dashboard to familiarize yourself with our initial data source.
    * Prow and google cloud storage to see the underlying CI data informing these higher levels of abstraction.

ML Request: Implement a Predictive Test Selection Tool

In an effort to leverage the CI data available to us and improve the kubernetes development process through machine learning, we should look into the development of a predictive test selection tool that can be used to identify a limited number of tests that are most likely to find a regression for a given code change.

Please see this blog post from Facebook engineering outlining their approach.

As noted in the blog, the "system automatically develops a test selection strategy by learning from a large data set of historical code changes and test outcomes," which should be feasible for us given the data we have access to in this project.
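For orientation only, and not the approach from the blog post, here is a minimal sketch of what such a learned selector could look like: train a classifier on historical (code change, test) pairs with simple features and rank tests by predicted failure probability for a new change. Every feature name and the synthetic data are hypothetical placeholders.

```python
# Toy predictive test selection sketch with hypothetical features and fake data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Hypothetical features per (code change, test) pair:
#   [files_changed, change_touches_test_deps, historical_failure_rate]
X = rng.random((500, 3))
y = (0.6 * X[:, 1] + 0.4 * X[:, 2] + 0.1 * rng.random(500) > 0.6).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# Rank a handful of candidate tests for one new change and keep the top-k.
candidates = rng.random((10, 3))
scores = model.predict_proba(candidates)[:, 1]
top_k = np.argsort(scores)[::-1][:3]
print("run these tests first:", top_k, scores[top_k].round(2))
```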

Include the job run URL in the correlation results

As an end user investigating the underlying issue represented by the highly correlated failure sets, I would like to also be provided with the job run URL for the instances where these failures occurred, so that I can see more details about the failures and determine a root cause.

Acceptance Criteria:

  • Add job run URLs as an additional column to the output of Inittial_EDA.ipynb

Review flakiness detection in testgrid

We want to know if there is an opportunity to use ML techniques to improve upon the existing Flake detection tool currently being used by testgrid. To answer that question we first have to identify how the current Flake Detection tool is implemented.

Acceptance Criteria:

  • Explanation of Flake detection implementation.

Write overarching project doc

As a data scientist I want to make sure that this project is well defined, so that all stakeholders agree on the work to be done.

Acceptance Criteria

  • Project Document Agreed upon by all stakeholders

Catalog the existing Research Papers/Articles for flaky test detection.

Is your feature request related to a problem? Please describe.
There is a lot of research done on flaky test detection. We want to catalog/collect the existing research work. We can explore these research papers/articles in the future.

Describe the solution you'd like
A markdown document with a short summary of each research paper.

Review Sippy Analysis Output

As a data scientist I want to list what analysis output is generated by Sippy to determine if it could be recreated in a notebook environment.

Acceptance Criteria:

  • Jupyter Notebook that recreates values generated by Sippy or an explanation why it can't be done.

Complete Sippy Notebook EDA

The Sippy EDA notebook currently only looks at a portion of the available data set. As a data scientist and contributor to this project, I would like a full explanation of the aggregated CI data that I have access to, so that I do not have to repeat the discovery phase myself.

Acceptance Criteria:

  • EDA notebook is complete, including an exploratory section for each section of the Sippy data sample.

Spike: Identify TestGrid Metrics/KPIs - Test Performance

As Data Scientists, our first task is to convert the raw data generated by the CI processes into meaningful KPIs/Metrics/Features (numbers) that we can track and use to describe the state or behavior of the CI process over time. One of the key elements of the CI process is the tests. Understanding the health and behavior of these test suites is critical.

Test KPI's could be things like "Test Pass Rate", "Test Run Rate", "Number of Correlated Failures with Test", etc. Sippy is currently quantifying these types of metrics and might be a good place to start looking for examples. But, as data scientists, we must admit that we are not currently subject matter experts in CI monitoring and do not know the best metrics to track for monitoring CI platform test health. As such, we need to perform a research spike and look for example KPI's used in the industry that we could collect and monitor from the TestGrid Data.

Acceptance Criteria:

  • Create a markdown document of KPIs. Each entry must include a link to the resource used to discover the metric, an explanation of why it would be useful to track, and a brief outline describing how we could generate it from our existing data sources. (A toy pass-rate computation sketch follows below.)
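To make one of these concrete, a test pass rate can be computed directly from a TestGrid-style grid once it is in a DataFrame. A minimal sketch, assuming rows are tests, columns are runs, and cells are 1 for pass / 0 for fail (an illustrative encoding, not TestGrid's native one):

```python
# Toy KPI sketch: per-test and platform-wide pass rates from a pass/fail matrix.
# The 1/0 encoding and the tiny example grid are illustrative assumptions.
import pandas as pd

grid = pd.DataFrame(
    {"run_1": [1, 1, 0], "run_2": [1, 0, 0], "run_3": [1, 1, 1]},
    index=["test_a", "test_b", "test_c"],
)

per_test_pass_rate = grid.mean(axis=1)   # KPI: pass rate per test
overall_pass_rate = grid.values.mean()   # KPI: platform-wide pass rate
print(per_test_pass_rate)
print(f"overall pass rate: {overall_pass_rate:.2f}")
```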

Rename Project

Please add your suggestions for a new project name to this issue. We'll decide/vote on it at our next sprint meeting.

List of Potential OpenShift CI Data ML Projects

There are a number of potential avenues of investigation for providing ML or automated analysis to the OpenShift CI data. After an initial review of existing work the three ideas that have been presented to date are:

  • Identify canary failures
  • Analyze job runs with a large number of test failures
  • Look for correlation patterns in test failures

That said, I'm sure there are many more potential projects that could be pursued with CI data that could benefit OpenShift. Please use this issue as a forum to list and discuss these potential projects.

Documentation of different cell labels in the testgrid

Is your feature request related to a problem? Please describe.
There are different cell labels in the testgrids, for example: green cell, red cell, red cell with an 'F' annotation, purple cell, and cell with an 'R' annotation.

We want to understand the meaning of each of these cells and the logic behind each annotation or color.

Describe the solution you'd like
A Google doc and markdown with a description of all cell types and the logic behind their annotations.

ML Request: Implement a probabilistic flakiness score for tests

Develop and implement a probabilistic flakiness score for each test, as outlined in this article from Facebook engineering. It provides a reliable real-time metric that can give insight into the health of individual tests in a CI pipeline and tell engineers where to focus their efforts when updating tests.
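The article describes a trained probabilistic model; as a much simpler baseline to prototype against (explicitly not the article's method), one could score each test with the posterior mean failure rate under a Beta-Bernoulli model of its recent runs, which naturally hedges for tests with little history. A sketch with an arbitrary Beta(1, 1) prior:

```python
# Toy baseline (not the method from the article): Beta-Bernoulli failure score.
# score = posterior mean failure probability given a test's recent pass/fail history.
def flakiness_score(failures, runs, prior_fail=1.0, prior_pass=1.0):
    """Posterior mean of the failure rate under a Beta(prior_fail, prior_pass) prior."""
    return (failures + prior_fail) / (runs + prior_fail + prior_pass)

# A test that failed 3 of its last 40 runs vs. one that failed its only recorded run.
print(round(flakiness_score(3, 40), 3))  # ~0.095: lots of history, close to the observed 3/40
print(round(flakiness_score(1, 1), 3))   # ~0.667, not 1.0: little history, pulled toward the prior
```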

Remove dependency on "nbimporter" as it is no longer maintained and breaks pre-commit checks

Describe the bug

The nbimporter package is used throughout the repo to import functions from other notebooks, but it is no longer supported by its developers and breaks the pre-commit check. Recommend removing it and accessing shared functions another way.

To Reproduce

Steps to reproduce the behavior:

  1. Open a new notebook
  2. Import nbimporter
  3. Import function from adjacent notebook
  4. Run git add <new-notebook>
  5. Run pre-commit
  6. See error: F401 'nbimporter' imported but unused from flake8-nb

Expected behavior

pre-commit does not produce any F401 errors


Additional context

From the repo's readme:
[screenshot of the relevant README note omitted]

Collect a fixed train/test/validate data set for TestGrid

As a data scientist, it's important to have a fixed, immutable dataset to work with while developing, evaluating, and validating our initial models. Since the TestGrid data updates every day, there is potential for poor reproducibility of experiments if we don't maintain a fixed experimental data set before applying our methods to the live data.

Acceptance Criteria:

  • The maximum available Red Hat TestGrid data at the date of collection.

  • Stored and accessible in Ceph (or other public hosting); see the upload sketch below.
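Since Ceph exposes an S3-compatible API, a dated snapshot can be uploaded with `boto3` pointed at the Ceph endpoint. A minimal sketch; the endpoint URL, bucket name, and key layout are placeholder assumptions, and credentials are read from the environment:

```python
# Minimal sketch: push a dated, immutable TestGrid snapshot to S3-compatible
# storage (e.g. Ceph). Endpoint, bucket, and key layout are placeholder assumptions.
import datetime
import json
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],          # e.g. the Ceph RGW endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

snapshot = {"collected": datetime.date.today().isoformat(), "grids": {}}  # placeholder data
key = f"ocp-ci-analysis/testgrid/{snapshot['collected']}/testgrid.json"

s3.put_object(
    Bucket=os.environ.get("S3_BUCKET", "ai4ci"),          # placeholder bucket name
    Key=key,
    Body=json.dumps(snapshot).encode("utf-8"),
)
print(f"uploaded snapshot to {key}")
```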

Write EDA Notebook based on available Sippy Data.

Beyond the failure correlations already started, there may be other features in the sippy data that could be used for additional analysis. In order to do our data science due diligence, we will create a notebook going through this dataset.

Initial EDA

At the onset of the project, as a data scientist I would like to examine the type of data we will be working with, as well as provide some minor insights around correlated tests.

Acceptance Criteria:

  • EDA notebook that explores sippydata.json

  • Find highly correlated failure sets in sippydata.json (a minimal correlation sketch follows below)
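For the correlation part, one straightforward approach is to build a runs-by-tests failure indicator matrix and inspect pairwise correlations between test columns; highly correlated pairs are candidate "failure sets". A minimal sketch on made-up data (no assumptions are made here about the actual structure of sippydata.json):

```python
# Toy sketch: find pairs of tests whose failures are highly correlated.
# The indicator matrix below is made up; in practice it would be built from the data set.
import numpy as np
import pandas as pd

# rows = job runs, columns = tests, 1 = test failed in that run
failures = pd.DataFrame(
    {
        "test_a": [1, 0, 1, 1, 0, 1],
        "test_b": [1, 0, 1, 1, 0, 1],   # always fails together with test_a
        "test_c": [0, 1, 0, 0, 1, 0],
    }
)

corr = failures.corr()
# keep only the upper triangle so each pair appears once and self-pairs are dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])   # candidate highly correlated failure sets
```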

Links to resources

Vertical white column: usually means "install failed". That's because we can't run tests if the installer didn't complete, and test grid omits squares. Almost every "infrastructure" flake related to the process of CI will be in this category - if there is any green in the column, odds are the problem is either in the test or in the cluster, not in the CI cluster or the job itself (very rarely will this be something network related).

Rows with red interspersed with green: almost always a flaky test, but sometimes a core bug across multiple tests. You can guesstimate the frequency by counting the red squares and then visually estimating how many runs showed up over that interval (the boxes are regular, so I usually just hold up two fingers for ~20 results and see how many fall into that range). If you look at this page frequently, you can also narrow down the day when the flake was introduced, because it'll start flaking at some point.

Rows with solid red chunks: almost always a regression either in the test or the product. If you see it on test-grid, it usually means it's post merge, so either someone force merged (bad!) or the test behaves differently between PR and release jobs (for instance, auth and whether it's on quay).

Row with solid red chunk and white to the right: a new test was added that is failing when run in the release job. In the picture below, that's the storage test we added that worked in PR but didn't work against the older RHCoS image

Repeating vertical red bars: If you see a set of rows that all fail together on the same runs, that usually means a subsystem has a bug. Previously, we were seeing that on quota, so every 5-10 runs all of the quota tests would fail in a given run because the kube-apiserver stopped handling quota and all tests would fail.

Failure waterfall: If you see a meandering line moving from bottom to top, right to left, this almost always means "core control plane (cluster infra) flake during the run". This is because the sort order of the grid prioritizes failed runs, so you can see different tests hitting the flake each time. e2e tests are run in random order, so the same test is unlikely to fail twice if the tests run at different times on each run. Also, in e2e we re-run up to a limited number of failures at the end of the run to see whether these are reproducible failures or just flaky tests. If the test passes the second time we record it here (red square), but the run itself is allowed to pass if the limit is low.

In the picture below, the line is caused by the kube-apiserver doing a rolling restart after the e2e tests are started (which shouldn't be happening) but if graceful restart is working correctly, the tests shouldn't fail (the point of graceful restart is to drain all short lived requests before we stop accepting new connections). It impacts different tests each time, and some tests are more impacted than others.
-- Clayton

[screenshot omitted]

Relative link for the readme.md file in project-doc.md

Is your feature request related to a problem? Please describe.
Currently, we keep a copy of the readme.md file at docs/publish/project-doc.md. Hence, whenever we update the readme.md file we also need to update docs/publish/project-doc.md.

Describe the solution you'd like
Use a relative link in one of the markdown files to avoid two different copies of the same file.

Acceptance criteria
A relative link to the readme.md file in docs/publish/project-doc.md

Milestone 1: EDA Notebook and project doc on operate first website

We want to make sure that the work we are doing for the OCP CI data analysis is easy to follow and interact with so that we can get more contributions from other Data Scientists.

Acceptance Criteria:

Include well-written and polished versions of the following on operate-first.github.io:

  • Project Document: Outlining project goal
  • Rendered notebook on initial testgrid EDA
  • Rendered notebook on in-depth testgrid EDA
  • Polish all content, focused on ease of use by new contributors

Oindrilla up to speed with project

As a data scientist and contributor to this project I need to have a strong hands-on understanding of all the work done to date as well as knowledge of how to improve upon and extend the existing work.

PLEASE READ: This issue should be used as a template. Please make a copy of it and replace <NAME> with your name when creating the new issue.

Acceptance Criteria:

  • Use ocp-ci-analysis:latest image on https://jupyterhub-opf-jupyterhub.apps.cnv.massopen.cloud/ and successfully run every notebook in the notebooks directory.
  • Submit at least 1 Issue/PR fixing a bug, fixing graph formatting, clarifying an unclear notebook section, or adding a small additional data analysis to a notebook. (Look for something to improve as you go through the existing work 😃)
  • Familiarize yourself with the following 3 resources:
    * Sippy Repo and Dashboard for an example of metrics and TestGrid data analysis.
    * TestGrid Repo and Dashboard to familiarize yourself with our initial data source.
    * Prow and google cloud storage to see the underlying CI data informing these higher levels of abstraction.

<NAME> up to speed with project

As a data scientist and contributor to this project I need to have a strong hands-on understanding of all the work done to date as well as knowledge of how to improve upon and extend the existing work.

PLEASE READ: This issue should be used as a template. Please make a copy of it and replace <NAME> with your name when creating the new issue.

Acceptance Criteria:

  • Use Openshift CI Analysis Notebook Image on https://jupyterhub-opf-jupyterhub.apps.smaug.na.operate-first.cloud/ and successfully run every notebook in the notebooks directory.
  • Submit at least 1 Issue/PR fixing a bug, fixing graph formatting, clarifying an unclear notebook section, or adding a small additional data analysis to a notebook. (Look for something to improve as you go through the existing work 😃)
  • Familiarize yourself with the following 3 resources:
    * Sippy Repo and Dashboard for an example of metrics and TestGrid data analysis.
    * TestGrid Repo and Dashboard to familiarize yourself with our initial data source.
    * Prow and google cloud storage to see the underlying CI data informing these higher levels of abstraction.

Documentation: Continuous Integration Artifacts From a Data Science Perspective

As a Data Scientist interested in applying my machine learning expertise to the problem of developing intelligent CI/CD tools, I would like clear and concise documentation explaining the CI/CD process, giving special attention to the data types and artifacts (logs, metrics, bug reports, code diffs, etc.) generated by these development processes and how these data artifacts relate to each other, so that there is a lower barrier to entry to making meaningful contributions in this domain.

My assumption is that the average data scientist has little experience with the inner workings of large-scale application development infrastructure. This lack of domain expertise could be a major blocker to contributions. I want to make sure that we have a simple to understand, well vetted (accurate), and singular "anatomy of the Kubernetes/OpenShift CI process" documented that contributors can reference when developing new tools.

This should also address the need in our planning document for an "Anatomy of Kubernetes/OpenShift CI Data".

Acceptance Criteria:

Include `Testgrid_flakiness_detection.ipynb` notebook on OperateFirst website.

Is your feature request related to a problem? Please describe.
We want the flakiness detection notebook on the Operate First website so that we get more contributions/feedback from other Data Scientists.

Describe the solution you'd like
Rendered Testgrid_flakiness_detection.ipynb notebook on the Operate First website

Spike: Research Existing AIOps Features/Offerings

In an effort to drive open source AIOps for CI/CD, we want to ensure that we have a complete and up-to-date understanding of what features are available, being developed, and considered state-of-the-art, both in industry and in the open source community. To start, we will do a research spike identifying the existing offerings by leading AIOps service providers.

Acceptance Criteria:

  • Open an issue outlining an opensource alternative to an offering provided for each leading AIOps service provider.

Some existing providers

Karan up to speed with project

As a data scientist and contributor to this project I need to have a strong hands-on understanding of all the work done to date as well as knowledge of how to improve upon and extend the existing work.

Acceptance Criteria:

  • Use ocp-ci-analysis:latest image on https://jupyterhub-opf-jupyterhub.apps.cnv.massopen.cloud/ and successfully run every notebook in the notebooks directory.
  • Submit at least 1 Issue/PR fixing a bug, fixing graph formatting, clarifying an unclear notebook section, or adding a small additional data analysis to a notebook. (Look for something to improve as you go through the existing work 😃)
  • Familiarize yourself with the following 3 resources:

Downloading GitHub metadata from commit IDs

Is your feature request related to a problem? Please describe.
In TestGrid, we have a commit ID for each run. We want to download the GitHub metadata for these commit IDs.

Acceptance criteria
Download the following metadata for each commit ID (a minimal API sketch follows below):

  1. What files are changed in a particular commit
  2. Author of the commit
  3. For each changed file: how many times that file has been changed previously
  4. For each changed file: how many authors have edited that file
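A minimal sketch of items 1 and 2 using the GitHub REST API commits endpoint with `requests`; the repository and SHA are placeholders, items 3 and 4 would need additional per-file history queries, and a token (read here from an optional environment variable) is advisable to avoid rate limits.

```python
# Minimal sketch: pull changed files and author for one commit via the GitHub REST API.
# Repo and SHA are placeholders; items 3 and 4 would need extra per-file history calls.
import os
import requests

OWNER, REPO = "openshift", "origin"      # placeholder repository
SHA = "<commit-sha>"                     # placeholder commit ID taken from TestGrid

headers = {"Accept": "application/vnd.github+json"}
token = os.environ.get("GITHUB_TOKEN")
if token:
    headers["Authorization"] = f"token {token}"

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{SHA}",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
commit = resp.json()

print("author:", commit["commit"]["author"]["name"])                # item 2
print("changed files:", [f["filename"] for f in commit["files"]])   # item 1
```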

Hema up to speed with project

As a data scientist and contributor to this project I need to have a strong hands-on understanding of all the work done to date as well as knowledge of how to improve upon and extend the existing work.

PLEASE READ: This issue should be used as a template. Please make a copy of it and replace <NAME> with your name when creating the new issue.

Acceptance Criteria:

  • Use ocp-ci-analysis:latest image on https://jupyterhub-opf-jupyterhub.apps.cnv.massopen.cloud/ and successfully run every notebook in the notebooks directory.
  • Submit at least 1 Issue/PR fixing a bug, fixing graph formatting, clarifying an unclear notebook section, or adding a small additional data analysis to a notebook. (Look for something to improve as you go through the existing work 😃)
  • Familiarize yourself with the following 3 resources:
    * Sippy Repo and Dashboard for an example of metrics and TestGrid data analysis.
    * TestGrid Repo and Dashboard to familiarize yourself with our initial data source.
    * Prow and google cloud storage to see the underlying CI data informing these higher levels of abstraction.

Find a place to publicly host data sets

Although the data is public, we will want an additional location to store our interim datasets as we analyze them, as well as to keep immutable test/train/validation sets that do not change daily and are independent of the source service's availability.

This should be done with a Ceph bucket hosted on the MOC-ODH environment.

Acceptance Criteria:

  • Publicly hosted object storage on MOC
