
ci-chat-bot's Introduction

ci-chat-bot (aka cluster-bot)

The @cluster-bot is a Slack app that lets users launch and test OpenShift clusters from any existing or custom-built release.

The app is currently running in the Red Hat internal Slack workspace. To begin using the cluster-bot, select the + in the Apps section to browse apps, search for cluster-bot, and then select its tile. You'll be placed in a new channel with the app, ready to begin launching clusters!

To see the available commands, type help.
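
For example, a few commands that appear in the issues below (illustrative only; run help for the full, current list):

launch 4.10 aws
build openshift/machine-api-operator#502
test upgrade quay.io/openshift-release-dev/ocp-release:4.2.9 registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-05-213858
done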

For any questions, concerns, comments, etc., please reach out in the #forum-ocp-crt channel.

ci-chat-bot's People

Contributors

2uasimojo, alexnpavel, alvaroaleman, ardaguclu, bparees, bradmwilliams, cgwalters, danilo-gemoli, dcbw, droslean, emilienm, harche, hasbro17, hongkailiu, hoxhaeris, jmguzik, jupierce, omertuc, openshift-ci[bot], openshift-merge-bot[bot], openshift-merge-robot, osherdp, petr-muller, rvanderp3, smarterclayton, smg247, stevekuznetsov, vrutkovs, wangke19, wking

ci-chat-bot's Issues

`build` release image not accessible to regular api.ci user

I asked the slack bot to create a release for me:

build openshift/machine-api-operator#502,openshift/cluster-autoscaler-operator#133

And it did: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-assemble/18/build-log.txt

However, I do not have access to the release with my api.ci user:

$ oc adm release info --registry-config /home/notstack/pullsecret.json registry.svc.ci.openshift.org/ci-ln-gt0wv3b/release:latest 
error: unable to read image registry.svc.ci.openshift.org/ci-ln-gt0wv3b/release:latest: unauthorized: authentication required

I'm not the author of the PRs, but I also don't see any user being granted access to that namespace. I'd expect the requester (via cluster-bot) to have access, or even better, anyone with valid api.ci creds...

`build` isn't working again

I asked the bot:

build openshift/machine-config-operator#1972,openshift/machine-api-operator#632,openshift/installer#3929

It's launching release-openshift-origin-installer-launch-gcp: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1291766374934253568

The bot responded to me with job build openshift/machine-config-operator#1972,openshift/machine-api-operator#632,openshift/installer#3929 failed; however, that job is still running with no log output.

unable to launch 4.1 cluster

Used the launch 4.1 aws command two different times; both failed:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/2118

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/2120

Looks like the bot is using the wrong image ref format:

error: could not resolve inputs: could not determine inputs for step [release:latest]: ImageStreamImport.image.openshift.io "release-import" is invalid: spec.images[0].from.name: Invalid value: "registry.svc.ci.openshift.org/ocp/[email protected]": invalid reference format

Support OpenShift 4 cluster with https://kubernetes.io/docs/concepts/storage/volumes/#cephfs

It would be nice to have at least one OpenShift cluster-bot configuration with a persistent volume using the ReadWriteMany access mode. Alternatively, it could use ReadWriteOnce but support parallel read/write operations between a few pods, e.g. a https://kubernetes.io/docs/concepts/storage/volumes/#cephfs or https://kubernetes.io/docs/concepts/storage/volumes/#nfs PV.

Actual state
I checked all the standard cluster-bot configurations, and all of them use PVs with the ReadWriteOnce access mode, without support for simultaneous read/write between pods:

launch -> pv https://kubernetes.io/docs/concepts/storage/volumes/#gcepersistentdisk

launch 4.10 aws -> pv https://kubernetes.io/docs/concepts/storage/volumes/#awselasticblockstore

launch 4.10 azure -> pv https://kubernetes.io/docs/concepts/storage/volumes/#azuredisk

launch 4.10 openstack -> pv https://kubernetes.io/docs/concepts/storage/volumes/#cinder

launch 4.10 vsphere -> pv https://kubernetes.io/docs/concepts/storage/volumes/#vspherevolume

launch 4.10 metal -> Failed to create cluster, or cluster isn't pingable
launch 4.10 hypershift -> Failed to create cluster

launch 4.10 ovirt -> csi
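
For reference, a minimal sketch of the kind of PersistentVolumeClaim such a configuration would need, written with the Kubernetes Go API types (the csi-cephfs storage class name is an assumption; any CephFS- or NFS-backed class would do):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func sharedClaim() *corev1.PersistentVolumeClaim {
	storageClass := "csi-cephfs" // assumption: a CephFS-backed StorageClass
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "shared-data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// The point of the request: RWX so several pods can read and write at once.
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteMany},
			StorageClassName: &storageClass,
			// A spec.resources.requests.storage entry would also be needed
			// when actually creating the claim.
		},
	}
}

func main() {
	fmt.Println(sharedClaim().Spec.AccessModes) // [ReadWriteMany]
}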

Manual prolongation of cluster lifetime

Not wanting to comment on a closed issue (#35), I'm opening a new one instead.

It would be really great to be able to manually prolong a cluster's lifetime, as was already proposed at #35 (comment).

I know we all launch clusters and forget about them sometimes, so I'm not proposing an extended lifetime at cluster creation, but rather the possibility to prolong a given cluster when it's near its expiration time (say, 30 minutes before). Maybe cluster-bot could be modified so that it notifies a cluster's requester 30 minutes before expiration and lets them type "prolong" or "extend" to add another hour of lifetime on top. Maybe this could be repeated up to 3 times (so the total possible prolongation would be 3 hours).
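
A minimal sketch in Go of what that flow could look like (hypothetical cluster type and notify callback; none of these names exist in cluster-bot today):

package main

import (
	"fmt"
	"time"
)

// Hypothetical state the bot would track per requested cluster.
type cluster struct {
	requester  string
	expiresAt  time.Time
	extensions int
	warned     bool
}

const (
	warnBefore    = 30 * time.Minute
	extensionStep = time.Hour
	maxExtensions = 3
)

// maybeWarn sends the pre-expiration prompt once the cluster is within
// warnBefore of its expiry, and only once per expiry.
func maybeWarn(c *cluster, notify func(user, msg string)) {
	if c.warned || time.Until(c.expiresAt) > warnBefore {
		return
	}
	c.warned = true
	notify(c.requester, "your cluster expires in ~30 minutes; reply 'extend' to add another hour (up to 3 times)")
}

// extend adds one hour of lifetime, refusing after maxExtensions.
func extend(c *cluster) error {
	if c.extensions >= maxExtensions {
		return fmt.Errorf("cluster already extended %d times", maxExtensions)
	}
	c.extensions++
	c.expiresAt = c.expiresAt.Add(extensionStep)
	c.warned = false // allow a fresh warning before the new expiry
	return nil
}

func main() {
	c := &cluster{requester: "UCUUVUMRA", expiresAt: time.Now().Add(20 * time.Minute)}
	maybeWarn(c, func(user, msg string) { fmt.Printf("to %s: %s\n", user, msg) })
	if err := extend(c); err != nil {
		fmt.Println(err)
	}
	fmt.Println("new expiry:", c.expiresAt)
}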

No way to destroy a cluster

Cluster bot clusters have a pre-determined expiration time, but it would be great to be able to tear down a cluster when you're done testing, or if you accidentally deployed a new cluster, for example.

password appears empty when cluster is not ready

From my perspective, it makes no sense to show this message:

(screenshot: cluster-bot credentials message with an empty password)

It should only be shown once it is confirmed there is no error (cluster has been correctly launched). Otherwise, it can be confusing.

Consider re-ordering launch defaults or providing additional launch selectors

xref (how image resolution happens today): https://github.com/openshift/ci-chat-bot/blob/master/manager.go#L695-L707. With the current logic, plain X.Y version requests always resolve to nightly builds first, then CI builds, then GA releases.

As a user who primarily works with 'released' clusters/versions, having the launch command (e.g. inputting 4.10) default to deploying a 'nightly' build is frustrating/less than desired for my use cases. I would therefore like to see the defaults re-ordered (so that GA releases are evaluated first, then nightlies, then CI builds); however, this ordering may be less desirable for other parts of our organization.

Thus I would like to propose that the launch command be altered to take an additional argument specifying the release status type of the cluster you want to deploy, or the source of your image reference. Ex: launch 4.10 release|ga or launch 4.10 nightly.
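
A rough sketch in Go of how the defaults could be parameterized by such an argument (illustrative only, not the bot's actual resolution code):

package main

import "fmt"

// releaseType is the hypothetical extra launch argument proposed above.
type releaseType string

const (
	typeGA      releaseType = "ga"
	typeNightly releaseType = "nightly"
	typeCI      releaseType = "ci"
)

// resolutionOrder returns the order in which release streams would be tried
// for a plain X.Y version. The default case mirrors today's behaviour:
// nightly first, then CI, then GA.
func resolutionOrder(requested releaseType) []releaseType {
	switch requested {
	case typeGA:
		return []releaseType{typeGA, typeNightly, typeCI}
	case typeCI:
		return []releaseType{typeCI, typeNightly, typeGA}
	default:
		return []releaseType{typeNightly, typeCI, typeGA}
	}
}

func main() {
	fmt.Println(resolutionOrder(typeGA))      // launch 4.10 ga
	fmt.Println(resolutionOrder(typeNightly)) // launch 4.10 nightly
	fmt.Println(resolutionOrder(""))          // launch 4.10 (today's default)
}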

Build no longer working?

I tried to kick off a build, but I can't find the job in Prow, and I never heard back from cluster-bot.

Please allow geography selection

Love this bot; I can't tell you how useful I find it in my day-to-day work.

Would it be possible to add a feature to specify the location while launching the cluster? e.g.

launch 4.5 india
launch 4.5 germany

It would be really helpful to reduce the latency when using the bot from the other side of the world.

Thanks again for this awesome bot :)

Cluster bot indicated failure prior to cluster launch finishing

Cluster bot told me my cluster failed to launch, but then later it succeeded:

From slack:

cluster-botAPP [8:58 AM]
your cluster failed to launch: pod never became available: timed out waiting for the condition (see https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/2082 for details)

new messages
cluster-botAPP [9:10 AM]
Your cluster is ready, it will be shut down automatically in ~108 minutes.

Support team clusters

I think we should have semi-persistent "team" clusters that can last longer than the default ones (but are still periodically flushed and reprovisioned), along with a workflow where the creds can be easily shared among a team.

Even within the team, there'd still be an "owner" who might be, e.g., actively testing their operator, but they'd be able to release it, turn the CVO back on, and hopefully leave things in a clean state for the next user. Or, quite commonly, two members of a team might be concurrently working on a PR.

Another (quite common for me) use case is "I just want to run some read-only commands (as kube:admin) against a recent-ish 4.6 cluster".

Setup:

  • A team is defined as one of the github team aliases we have today, e.g. @openshift/openshift-team-red-hat-coreos
  • We synchronize membership of the GH team to Slack groups

UX, talking to @cluster-bot

<user> teamcluster request
<bot> cluster https://console.team-redhat-coreos.devcluster.openshift.com/ is currently available
<bot> (kubeconfig attachment)
<bot> use `teamcluster done` when you're done

Scenario where cluster is in use:

<user> teamcluster request
<bot> cluster https://console.team-redhat-coreos.devcluster.openshift.com/ is in use by @miabbott
<bot> use `teamcluster ping` to send a message to @miabbott requesting current credentials
<user> teamcluster ping

Bot to @miabbott:

<bot> user @cgwalters has requested access, reply "yes" to grant, "no <reason>" to indicate refusal
<miabbott> <yes or no>

Alternative flow for immediate grant of "readonly" access:

<user> teamcluster request ro
<bot> cluster https://console.team-redhat-coreos.devcluster.openshift.com/ is in use by @miabbott
<bot> (kubeconfig attachment)
<bot> You requested read-only access, please avoid changing the cluster

(The bot also sends a message to @miabbott like: <bot> User @cgwalters requested readonly access)

Mergeable PR reported as needing to be rebased

It looks like there is a bug with the usage of the Mergeable flag here:

ci-chat-bot/manager.go

Lines 945 to 947 in 9dac707

if !pr.Mergeable {
return nil, fmt.Errorf("pull request %s needs to be rebased to branch %s", spec, pr.Base.Ref)
}

This value can be not only true or false but also null (cf. https://docs.github.com/en/rest/reference/pulls#get-a-pull-request) while GitHub is refreshing the flag. It happens to us quite often that a PR is reported as needing a rebase although it is actually mergeable; repeating the cluster-bot command several times in a row ultimately makes it work.

So, would it be possible to either add a small retry mechanism when Mergeable is null, or alternatively allow building from non-mergeable PRs?
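
A small sketch in Go of the retry idea, modelling Mergeable as *bool so the null state is visible (hypothetical types, not the bot's actual code):

package main

import (
	"fmt"
	"time"
)

// pullRequest mirrors only the relevant part of the GitHub API response,
// with Mergeable as *bool so "null / still computing" differs from false.
type pullRequest struct {
	Mergeable *bool
	BaseRef   string
}

// checkMergeable retries while GitHub reports mergeable as null, and only
// fails once the flag has settled to false.
func checkMergeable(fetch func(spec string) (*pullRequest, error), spec string, attempts int) error {
	for i := 0; i < attempts; i++ {
		pr, err := fetch(spec)
		if err != nil {
			return err
		}
		if pr.Mergeable == nil {
			time.Sleep(2 * time.Second) // GitHub is still computing the flag
			continue
		}
		if !*pr.Mergeable {
			return fmt.Errorf("pull request %s needs to be rebased to branch %s", spec, pr.BaseRef)
		}
		return nil
	}
	return fmt.Errorf("mergeability of %s is still unknown, try again shortly", spec)
}

func main() {
	calls := 0
	fetch := func(spec string) (*pullRequest, error) {
		calls++
		if calls == 1 {
			return &pullRequest{BaseRef: "master"}, nil // first response: mergeable is null
		}
		mergeable := true
		return &pullRequest{Mergeable: &mergeable, BaseRef: "master"}, nil
	}
	fmt.Println(checkMergeable(fetch, "openshift/installer#3929", 3))
}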

FAQ should have information about how to request help

It would be nice if the FAQ had an entry about which Slack channel(s) to post in when cluster-bot has an issue that is not covered in the current FAQ.

E.g., if cluster-bot fails to make a cluster because of some quota or permissions issue, where should we go to report that? What if cluster-bot is having an outage of some sort, who should we tell?

`build` feature not working

@cluster-bot help says:

build from - Create a new release image from one or more pull requests. The successful build location will be sent to you when it completes and then preserved for 12 hours.

I asked it to build me a release using the command below:

build openshift/machine-api-operator#502,openshift/cluster-autoscaler-operator#133

But it reported:

cluster-botAPP  2:21 PM
configuration error, unable to find prow job matching job-env=aws,job-type=build

Did I do something wrong?

rosa create {version} {duration} started having problems this morning (CET)

Hi,

I use Cluster Bot on Slack a lot, with the following command/message:

rosa create {version} {duration}

example: rosa create 4.14.9 6h

Since this morning (CET) I get the following response from Cluster Bot:

rosa create 4.14.12 6h

Failed to create cluster: Failed to run command: exit status 1

After that I tried another rosa create command, changing the version:

rosa create 4.14.9 6h

you have already requested a cluster; 4 minutes have elapsed

And I need to do:

done

Cluster mzms4-nur3s-ynf successfully marked for deletion

So it seems that the first create with 4.14.12 did something, because when I call done it answers with a cluster name (mzms4-nur3s-ynf), even though it had responded with "Failed to create cluster: Failed to run command: exit status 1".

And so I cannot create any more ROSA clusters with Cluster Bot since this morning (CET).

Why?

Thank you

4.4 Cluster Failed to launch

Hey there,

Nightly, 4.4, and 4.3 clusters are failing to launch with the following raw logs.

To reproduce, simply send launch 4.4 to the cluster bot.

2020/05/19 15:50:17 ci-operator version v20200519-e987a65
2020/05/19 15:50:17 No source defined
time="2020-05-19T15:50:18Z" level=info msg="Parsed kubeconfig context: ci/api-ci-openshift-org:443"
2020/05/19 15:50:18 Resolved release:latest registry.svc.ci.openshift.org/ocp/release@sha256:3dbd4e60c26b694488f01eeb3d8ed2bdfaee5a79b3b0c3ea83368767d2ce0cb2
2020/05/19 15:50:18 Using namespace https://console.svc.ci.openshift.org/k8s/cluster/projects/ci-ln-69fwwhk
2020/05/19 15:50:18 Running [release-inputs], [release:latest], [images], launch
2020/05/19 15:50:18 Creating namespace ci-ln-69fwwhk
2020/05/19 15:50:18 Setting up pipeline imagestream for the test
2020/05/19 15:50:18 Created secret launch-cluster-profile
2020/05/19 15:50:18 Created secret pull-secret
2020/05/19 15:50:18 Tagged shared images from ocp/4.4:${component}, images will be pullable from registry.svc.ci.openshift.org/ci-ln-69fwwhk/stable:${component}
2020/05/19 15:50:25 Importing release image latest
2020/05/19 16:05:47 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/05/19 16:05:48 Ran for 15m30s
error: some steps failed:
  * could not run steps: step [release:latest] failed: the following tags from the release could not be imported to stable after five minutes: aws-machine-controllers, azure-machine-controllers, baremetal-installer, baremetal-machine-controllers, baremetal-operator, baremetal-runtimecfg, cli, cli-artifacts, cloud-credential-operator, cluster-authentication-operator, cluster-autoscaler, cluster-autoscaler-operator, cluster-bootstrap, cluster-config-operator, cluster-csi-snapshot-controller-operator, cluster-dns-operator, cluster-etcd-operator, cluster-image-registry-operator, cluster-ingress-operator, cluster-kube-apiserver-operator, cluster-kube-controller-manager-operator, cluster-kube-scheduler-operator, cluster-kube-storage-version-migrator-operator, cluster-machine-approver, cluster-monitoring-operator, cluster-network-operator, cluster-node-tuned, cluster-node-tuning-operator, cluster-openshift-apiserver-operator, cluster-openshift-controller-manager-operator, cluster-policy-controller, cluster-samples-operator, cluster-storage-operator, cluster-svcat-apiserver-operator, cluster-svcat-controller-manager-operator, cluster-update-keys, cluster-version-operator, configmap-reloader, console, console-operator, container-networking-plugins, coredns, csi-snapshot-controller, deployer, docker-builder, docker-registry, etcd, gcp-machine-controllers, grafana, haproxy-router, hyperkube, insights-operator, installer, installer-artifacts, ironic, ironic-hardware-inventory-recorder, ironic-inspector, ironic-ipa-downloader, ironic-machine-os-downloader, ironic-static-ip-manager, jenkins, jenkins-agent-maven, jenkins-agent-nodejs, k8s-prometheus-adapter, keepalived-ipfailover, kube-client-agent, kube-etcd-signer-server, kube-proxy, kube-rbac-proxy, kube-state-metrics, kube-storage-version-migrator, kuryr-cni, kuryr-controller, libvirt-machine-controllers, local-storage-static-provisioner, machine-api-operator, machine-config-operator, machine-os-content, mdns-publisher, multus-admission-controller, multus-cni, oauth-server, operator-registry, ovirt-machine-controllers, ovn-kubernetes, pod, prom-label-proxy, prometheus, prometheus-alertmanager, prometheus-config-reloader, prometheus-node-exporter, prometheus-operator, sdn, service-ca-operator, service-catalog, telemeter, tests, thanos
2020/05/19 16:05:48 could not load result reporting options: open : no such file or directory

done did not prevent user from being notified later

I0406 16:13:37.929985       1 manager.go:999] Job "chat-bot-2020-04-06-161337.8588" requested by user "UCUUVUMRA" with mode launch prow job release-openshift-origin-installer-launch-aws - params=, inputs=[]main.JobInput{main.JobInput{Image:"registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-06-134001", Version:"4.5.0-0.nightly-2020-04-06-134001", Refs:[]apiv1.Refs(nil)}}
I0406 16:13:37.930037       1 manager.go:1066] Job "chat-bot-2020-04-06-161337.8588" starting cluster for "UCUUVUMRA"
I0406 16:13:37.944350       1 prow.go:548] Job "chat-bot-2020-04-06-161337.8588" started a prow job that will create pods in namespace ci-ln-l6278sk
I0406 16:14:07.959956       1 prow.go:655] Job "chat-bot-2020-04-06-161337.8588" waiting for setup container in pod ci-ln-l6278sk/launch-aws to complete
I0406 16:18:34.583638       1 manager.go:1169] another worker is already running for chat-bot-2020-04-06-161337.8588
I0406 16:19:37.670571       1 manager.go:1095] user "UCUUVUMRA" requests job "chat-bot-2020-04-06-161337.8588" to be terminated
E0406 16:19:37.672201       1 manager.go:1097] unable to terminate running cluster chat-bot-2020-04-06-161337.8588: projects.project.openshift.io "ci-ln-l6278sk" is forbidden: User "system:serviceaccount:ci:ci-chat-bot" cannot delete projects.project.openshift.io in the namespace "ci-ln-l6278sk": no RBAC policy matched
E0406 16:19:37.672201       1 manager.go:1097] unable to terminate running cluster chat-bot-2020-04-06-161337.8588: projects.project.openshift.io "ci-ln-l6278sk" is forbidden: User "system:serviceaccount:ci:ci-chat-bot" cannot delete projects.project.openshift.io in the namespace "ci-ln-l6278sk": no RBAC policy matched
E0406 16:19:37.672201       1 manager.go:1097] unable to terminate running cluster chat-bot-2020-04-06-161337.8588: projects.project.openshift.io "ci-ln-l6278sk" is forbidden: User "system:serviceaccount:ci:ci-chat-bot" cannot delete projects.project.openshift.io in the namespace "ci-ln-l6278sk": no RBAC policy matched
E0406 16:19:37.672201       1 manager.go:1097] unable to terminate running cluster chat-bot-2020-04-06-161337.8588: projects.project.openshift.io "ci-ln-l6278sk" is forbidden: User "system:serviceaccount:ci:ci-chat-bot" cannot delete projects.project.openshift.io in the namespace "ci-ln-l6278sk": no RBAC policy matched
I0406 16:19:37.961509       1 manager.go:1176] Job "chat-bot-2020-04-06-161337.8588" aborted due to detecting completion: pod never became available: job is complete
I0406 16:19:37.961571       1 manager.go:1203] Job "chat-bot-2020-04-06-161337.8588" complete, notify "UCUUVUMRA"
I0406 16:30:38.790379       1 prow.go:548] Job "chat-bot-2020-04-06-161337.8588" started a prow job that will create pods in namespace ci-ln-l6278sk
I0406 16:30:38.803962       1 prow.go:655] Job "chat-bot-2020-04-06-161337.8588" waiting for setup container in pod ci-ln-l6278sk/launch-aws to complete
I0406 16:36:40.831475       1 manager.go:1169] another worker is already running for chat-bot-2020-04-06-161337.8588
I0406 16:42:43.031123       1 manager.go:1169] another worker is already running for chat-bot-2020-04-06-161337.8588
I0406 16:48:45.281802       1 manager.go:1169] another worker is already running for chat-bot-2020-04-06-161337.8588
I0406 16:54:48.005749       1 manager.go:1169] another worker is already running for chat-bot-2020-04-06-161337.8588
I0406 16:55:23.816570       1 prow.go:708] Job "chat-bot-2020-04-06-161337.8588" waiting for kubeconfig from pod ci-ln-l6278sk/launch-aws
I0406 16:55:54.558084       1 manager.go:1203] Job "chat-bot-2020-04-06-161337.8588" complete, notify "UCUUVUMRA"
I0406 17:06:52.742700       1 prow.go:548] Job "chat-bot-2020-04-06-161337.8588" started a prow job that will create pods in namespace ci-ln-l6278sk
I0406 17:06:52.751793       1 prow.go:655] Job "chat-bot-2020-04-06-161337.8588" waiting for setup container in pod ci-ln-l6278sk/launch-aws to complete
I0406 17:06:52.757276       1 prow.go:708] Job "chat-bot-2020-04-06-161337.8588" waiting for kubeconfig from pod ci-ln-l6278sk/launch-aws
I0406 17:18:57.270600       1 prow.go:548] Job "chat-bot-2020-04-06-161337.8588" started a prow job that will create pods in namespace ci-ln-l6278sk
I0406 17:18:57.282919       1 prow.go:655] Job "chat-bot-2020-04-06-161337.8588" waiting for setup container in pod ci-ln-l6278sk/launch-aws to complete
I0406 17:18:57.289395       1 prow.go:708] Job "chat-bot-2020-04-06-161337.8588" waiting for kubeconfig from pod ci-ln-l6278sk/launch-aws

Cluster Bot Failing to start temp AWS cluster

Sending launch nightly to the cluster bot returns:

unable to find release image stream: imagestreams.image.openshift.io "release" is forbidden: User "system:serviceaccount:ci:ci-chat-bot" cannot get resource "imagestreams" in API group "image.openshift.io" in the namespace "ocp"

Extend cluster uptime

Currently, the default uptime is 2 hours, but for certain testing I would need to extend that by X minutes/hours. It would be nice to have a command to extend the lifecycle of the cluster by a certain amount of time.

Another idea is to specify the uptime up front, e.g. launch 3h. That way users couldn't indefinitely extend a cluster the way they could with an extend command on an already running cluster.

Bot keeps sending messages about clusters that were flagged for shutdown

Hi! I have an idea for a new feature: filter out (i.e. do not send) messages about failing jobs for clusters on which the done command was already called.

Example:
  • I ask for a cluster launch.
  • I see it fails because of the AWS provider.
  • I use the done command to flag it for shutdown so I'm allowed to provision a cluster from a different provider.
  • I ask for a different cluster: launch nightly gcp.
  • I keep getting errors like "your cluster failed to launch: job failed, see logs (logs)" related to the first cluster, which is already marked for shutdown.

Behavior I would expect:
No more error messages about clusters manually marked for shutdown (they are irrelevant, and I have to check each job failure manually until the second cluster gets provisioned).
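
A minimal sketch in Go of that expected behavior (hypothetical per-request bookkeeping, not the bot's actual data structures):

package main

import "fmt"

// request is hypothetical per-cluster bookkeeping: once done is called,
// the request is marked and later failure notifications for it are dropped.
type request struct {
	name              string
	requester         string
	markedForDeletion bool
}

// shouldNotify suppresses failure messages for clusters the user already gave up on.
func shouldNotify(r *request) bool {
	return !r.markedForDeletion
}

func main() {
	first := &request{name: "launch nightly aws", requester: "U123", markedForDeletion: true}
	second := &request{name: "launch nightly gcp", requester: "U123"}
	for _, r := range []*request{first, second} {
		if shouldNotify(r) {
			fmt.Printf("notify %s: cluster %q failed to launch\n", r.requester, r.name)
		} else {
			fmt.Printf("suppress failure message for %q (already marked for shutdown)\n", r.name)
		}
	}
}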

Links to custom payloads are being misinterpreted

QE has been running test upgrade quay.io/openshift-release-dev/ocp-release:4.2.9 registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-05-213858 a few times, and the chat bot didn't interpret the Slack message correctly.

A Prow job was created, but the release image param was '<http://quay.io/openshift-release-dev/ocp-release:4.2.9|quay.io/openshift-release-dev/ocp-release:4.2.9> ' instead of just quay.io/openshift-release-dev/ocp-release:4.2.9.
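
One possible fix, sketched in Go: strip Slack's <url|display> link markup back to what the user typed before parsing the command (illustrative only, not the bot's current parsing code):

package main

import (
	"fmt"
	"regexp"
)

// slackLink matches Slack's automatic link markup: <URL|display text> or <URL>.
var slackLink = regexp.MustCompile(`<([^|>]+)\|([^>]+)>|<([^>]+)>`)

// stripSlackLinks replaces link markup with the display text (or the bare URL),
// which is what the user actually typed.
func stripSlackLinks(msg string) string {
	return slackLink.ReplaceAllStringFunc(msg, func(m string) string {
		parts := slackLink.FindStringSubmatch(m)
		if parts[2] != "" {
			return parts[2] // <url|text> -> text
		}
		return parts[3] // <url> -> url
	})
}

func main() {
	in := "test upgrade <http://quay.io/openshift-release-dev/ocp-release:4.2.9|quay.io/openshift-release-dev/ocp-release:4.2.9> registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-05-213858"
	fmt.Println(stripSlackLinks(in))
}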
