cryostatio / cryostat-operator

A Kubernetes Operator to facilitate the setup and management of Cryostat.
Home Page: https://cryostat.io
License: Apache License 2.0
container-jfr uses an environment variable to set its logging level. There should be a way to set this variable via the operator's ContainerJFR CRs.
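As a sketch of what this could look like (the field name and JSON shape below are assumptions for illustration, not the actual CRD API), the CR spec could gain a log-level field that the reconciler then maps onto the container's logging environment variable:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical ContainerJFR CR spec fragment; the real CRD has more fields.
type ContainerJFRSpec struct {
	Minimal  bool   `json:"minimal"`
	LogLevel string `json:"logLevel,omitempty"` // e.g. "ALL", "INFO", "WARNING" (values assumed)
}

// specJSON renders the spec as it might appear in a CR's JSON form.
func specJSON(minimal bool, logLevel string) (string, error) {
	b, err := json.Marshal(ContainerJFRSpec{Minimal: minimal, LogLevel: logLevel})
	return string(b), err
}

func main() {
	out, _ := specJSON(false, "INFO")
	fmt.Println(out)
}
```

The reconciler would copy spec.logLevel into the deployment's container env under whatever variable name container-jfr actually reads.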
We should detect whether the operator is running on OpenShift or Kubernetes and it should run properly in both scenarios. This will mainly mean using Ingress for Kubernetes and Route for OpenShift.
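A minimal sketch of the detection side, assuming the decision is made by checking whether the route.openshift.io API group is served; the real operator would obtain this list from a discovery client rather than a plain slice:

```go
package main

import "fmt"

// isOpenShift reports whether the served API groups include the OpenShift
// Route group. On a match, the operator would create Routes; otherwise,
// Ingresses. The group list stands in for a discovery client response.
func isOpenShift(apiGroups []string) bool {
	for _, g := range apiGroups {
		if g == "route.openshift.io" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isOpenShift([]string{"apps", "route.openshift.io"}))
	fmt.Println(isOpenShift([]string{"apps", "networking.k8s.io"}))
}
```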
We should add a community logo which is visible in the Operator Marketplace and various related pages.
Currently, the Grafana dashboard image deployed by the Operator is a vanilla upstream Grafana image, which the Operator (or various scripts like rh-jmc-team/container-jfr's smoketest.sh) needs to configure by performing various HTTP requests to create the default dashboard. This configuration is not actually customizable at deploy time, only at Operator build time, so there is no reason to perform it at deploy time. It can and should instead be performed at image creation time, using a simple Dockerfile based on the upstream vanilla Grafana image that includes the dashboard.json. The Operator can then deploy this Grafana container within the ContainerJFR pod instead, no longer carry its own in-source copy of the dashboard definition JSON string (pkg/controller/grafana/dashboard.go), and no longer need to perform additional HTTP requests against the dashboard to check its health and configure it.
This same image-build-time configuration may also apply to the configuration for linking the jfr-datasource into the Grafana image, but maybe not.
This customized Grafana image should probably be split out into a new repo like rh-jmc-team/grafana-dashboard. This way it can be easily (re)built on its own and consumed by various deployments, including the Operator.
When certificates are close to expiry, cert-manager will automatically renew them. We should detect this and redeploy Container JFR and update the routes that use TLS re-encryption.
I see this in the operator's logs after a make deploy sample_app2 using crc 1.4.0:
{"level":"error","ts":1579791293.0731778,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"service-controller","request":"default/containerjfr","error":"FlightRecorder.rhjmc.redhat.com \"containerjfr\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"rhjmc.redhat.com/v1alpha1\", \"kind\":\"FlightRecorder\", \"metadata\":map[string]interface {}{\"creationTimestamp\":\"2020-01-23T14:54:53Z\", \"generation\":1, \"labels\":map[string]interface {}{\"app\":\"containerjfr\"}, \"name\":\"containerjfr\", \"namespace\":\"default\", \"ownerReferences\":[]interface {}{map[string]interface {}{\"apiVersion\":\"v1\", \"blockOwnerDeletion\":true, \"controller\":true, \"kind\":\"Service\", \"name\":\"containerjfr\", \"uid\":\"3370babf-3df0-11ea-9a0b-52fdfc072182\"}}, \"uid\":\"4fee9a64-3df0-11ea-9a0b-52fdfc072182\"}, \"spec\":map[string]interface {}{\"recordingActive\":false}}: validation failure list:\nspec.port in body is required\nspec.recordingRequests in body is required","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/andrew/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
Right now the Container JFR controller watches ContainerJFR CRs and pods owned by a ContainerJFR CR. Since Container JFR is now created in a Deployment, the controller should watch Deployments owned by ContainerJFR CRs and not pods (which are owned by their replica set). This could be extended to watch all objects the controller creates: services, secrets, routes, etc., to re-create them if they are deleted.
Currently, if you deploy the operator without cert-manager installed, and do not explicitly disable cert-manager with the DISABLE_SERVICE_TLS environment variable, you'll see a repeated error in the operator logs:
{"level":"error","ts":1604679625.5861356,"logger":"controller_containerjfr","msg":"Could not be read","Request.Namespace":"default","Name":"containerjfr-self-signed","Kind":"*v1.Issuer","error":"no matches for kind \"Issuer\" in version \"cert-manager.io/v1\""}
Defaulting to requiring cert-manager to be explicitly disabled was done for security concerns, but we should try to improve usability here. Perhaps we can at least make it more obvious to the user that they must either install cert-manager or explicitly disable it.
This method may be useful to detect whether cert-manager is installed:
https://pkg.go.dev/k8s.io/client-go/discovery#DiscoveryClient.ServerResourcesForGroupVersion
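A sketch of how that discovery result might be interpreted before the operator attempts to create any Issuer or Certificate objects. The group/version string cert-manager.io/v1 is taken from the error above; the served list here stands in for a real discovery response such as ServerResourcesForGroupVersion:

```go
package main

import "fmt"

// certManagerInstalled reports whether the cert-manager.io/v1 group/version is
// among those served by the API server. If absent, the operator could emit a
// clear "install cert-manager or set DISABLE_SERVICE_TLS" message up front
// instead of repeatedly failing reconciliation.
func certManagerInstalled(servedGroupVersions []string) bool {
	for _, gv := range servedGroupVersions {
		if gv == "cert-manager.io/v1" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(certManagerInstalled([]string{"v1", "apps/v1"}))
	fmt.Println(certManagerInstalled([]string{"v1", "cert-manager.io/v1"}))
}
```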
Some useful functionality in Container JFR is not exposed via the operator's API. Some of this already exists in Container JFR today, and some is still a work in progress. Let's track what sorts of changes should go into the next version of our CRDs.
Confirmed:
Declined:
Some other ideas:
For example, the RH Jaeger operator has the following labels on its Jaeger instances:
app=jaeger
app.kubernetes.io/component=all-in-one
app.kubernetes.io/instance=jaeger-all-in-one-inmemory
app.kubernetes.io/managed-by=jaeger-operator
app.kubernetes.io/name=jaeger-all-in-one-inmemory
app.kubernetes.io/part-of=jaeger
pod-template-hash=856b547bf
I think the app.kubernetes.io labels should be added to the container-jfr instances.
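As a sketch (the exact label values below are suggestions, not settled conventions for this project), a helper could produce the recommended label set for each instance:

```go
package main

import "fmt"

// instanceLabels returns a suggested label set for container-jfr instances,
// mirroring the app.kubernetes.io conventions seen on the Jaeger operator.
// Component/part-of/managed-by values are assumptions for illustration.
func instanceLabels(name string) map[string]string {
	return map[string]string{
		"app":                          name,
		"app.kubernetes.io/name":       name,
		"app.kubernetes.io/instance":   name,
		"app.kubernetes.io/component":  "container-jfr",
		"app.kubernetes.io/part-of":    "container-jfr",
		"app.kubernetes.io/managed-by": "container-jfr-operator",
	}
}

func main() {
	fmt.Println(instanceLabels("containerjfr")["app.kubernetes.io/managed-by"])
}
```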
It would be nice to have a simple script that checks source files for license headers, and adds them if they're missing.
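A minimal sketch of the core check such a script would perform, with a placeholder header string; a real script would walk the source tree and rewrite files in place:

```go
package main

import (
	"fmt"
	"strings"
)

// hasLicenseHeader reports whether the file contents begin with the expected
// header, ignoring any leading blank lines.
func hasLicenseHeader(contents, header string) bool {
	return strings.HasPrefix(strings.TrimLeft(contents, "\n"), header)
}

// addLicenseHeader prepends the header if it is missing.
func addLicenseHeader(contents, header string) string {
	if hasLicenseHeader(contents, header) {
		return contents
	}
	return header + "\n" + contents
}

func main() {
	header := "// Copyright Example Authors" // placeholder header text
	fmt.Println(hasLicenseHeader("package main\n", header))
	fmt.Println(hasLicenseHeader(addLicenseHeader("package main\n", header), header))
}
```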
The container-jfr instance deployed by the operator is no longer able to start its Vert.x webserver.
We should update API.md with the new changes introduced in v1beta1, e.g. how to specify JMX credentials and how to use a template for Recordings.
Currently in grafana_controller.go, the controller uses basic auth with default credentials to access the Grafana dashboard instance and add configurations for the jfr-datasource and dashboard panels. This should be replaced with some form of access token provided by the Grafana API and added in request headers.
Along with the list of event types, the FlightRecorder CR should include event templates provided by Container JFR's API.
Requests to the containerjfr service routes frequently produce HTTP 502 Bad Gateway responses, seemingly from the OpenShift Ingress controller (not the actual application).
After authenticating with the OC token, the web client has the following error (and becomes unusable):
WebSocket error {"stack":"Error: Failed to construct 'WebSocket': The subprotocol 'base64url.bearer.authorization.containerjfr.bk1xMHd5Qk9fSVRiYWJ5WkktZHFHTGN0Y1ZuckdxbTNDcVJucUpnQndUUQ==' is invalid.\n at WebSocketSubject._connectSocket (webpack-internal:///194:3998:17)\n at WebSocketSubject._subscribe (webpack-internal:///194:4095:18)\n at WebSocketSubject.Observable._trySubscribe (webpack-internal:///194:1288:25)\n at WebSocketSubject.Subject._trySubscribe (webpack-internal:///194:1501:51)\n at WebSocketSubject.Observable.subscribe (webpack-internal:///194:1274:22)\n at SafeSubscriber.eval [as _next] (webpack-internal:///194:4219:22)\n at SafeSubscriber.__tryOrUnsub (webpack-internal:///194:1126:16)\n at SafeSubscriber.next (webpack-internal:///194:1064:22)\n at Subscriber._next (webpack-internal:///194:1010:26)\n at Subscriber.next (webpack-internal:///194:987:18)"}
The Operator should provide some way to deploy container-jfr in the minimal/headless image variant, and without Grafana/jfr-datasource deployed within the pod. This would minimize the application footprint within the cluster while still serving the requirements for starting/stopping/retrieving recordings via the Operator CRD APIs.
Liveness/readiness probes should be defined for containers
The Grafana controller watches services, and when the Container JFR pod is killed and recreated, the service does not receive an update event. This means the reconciler is never invoked to configure the new Grafana container.
This time around we're adding cert-manager as a recommended dependency. We should be able to get OLM to install this automatically if we list the cert-manager CRDs we use under the Required CRDs section in the CSV:
https://docs.openshift.com/container-platform/4.5/operators/operator_sdk/osdk-generating-csvs.html#osdk-crds-required_osdk-generating-csvs
As a follow-on to #96, there should be some way for users to specify target auth credentials for ContainerJFR to use when establishing a connection to any given target JVM. The most natural way to do this seems to be to add the credentials to the FlightRecorder CRD somehow, or perhaps to link that CRD to a Secret containing the credentials, which the Operator can then read and include in an HTTP header when performing various API actions against the ContainerJFR server.
v0.12.0 is released but requires an updated Go version as well, so perhaps v0.11.0 is the better move at the moment.
I've tried upgrading to both but I run into what appear to be go module issues, along the lines of this:
Perhaps because of/related to this:
Now that we have some unit tests, we should start looking at end-to-end testing to help detect issues such as #112 early. Controller-runtime provides an envtest package to help with this.
In diagnosing/testing the current broken Grafana setup when deploying with the Operator, I now have a setup where the ContainerJFR instance is properly able to recognize and attempt to connect to the Grafana container (see rh-jmc-team/containerjfr#278). But, when I click the "View in Grafana..." menu item of a recording, I am brought to a Grafana dashboard that has no jfr-datasource configured and no preset dashboard definition. This is seen in the Operator logs:
{"level":"error","ts":1601325018.8350742,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"grafana-controller","request":"default/containerjfr-grafana","error":"Get https://containerjfr-grafana-default.apps-crc.testing/api/health: x509: certificate signed by unknown authority","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/andrew/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
It seems the Operator's grafana-controller fails to connect to the Grafana container due to a self-signed cert issue, which results in the Operator never configuring the Grafana container's jfr-datasource or dashboard.
These two should be conceptually similar to the existing FlightRecorder and Recording controller tests, although without needing the Container JFR httptest server.
Cannot build container-jfr-operator. I'm trying to build on Windows Subsystem for Linux. I've had no issue building other pieces and parts of container-jfr.
container-jfr-operator$ make clean image BUILDER=$(which docker)
rm -rf build/_output
operator-sdk generate k8s
INFO[0000] Running deepcopy code-generation for Custom Resource group versions: [rhjmc:[v1alpha1 v1alpha2], ]
F0730 17:13:10.063606 10829 deepcopy.go:885] Hit an unsupported type invalid type for invalid type, from ./pkg/apis/rhjmc/v1alpha1.ContainerJFR
make: *** [Makefile:22: k8s] Error 255
git show HEAD
commit f6e3e25 (HEAD -> main, origin/main, origin/HEAD)
Author: Elliott Baron [email protected]
Date: Tue Jul 21 10:44:10 2020 -0400
Reorder undeploy target to better handle recordings (#92)
go version
go version go1.13.3 linux/amd64
uname -a
Linux ######## 4.19.104-microsoft-standard #1 SMP Wed Feb 19 06:37:35 UTC 2020 x86_64 GNU/Linux
docker version
Client: Docker Engine - Community
Version: 19.03.12
API version: 1.40
Go version: go1.13.10
Git commit: 48a66213fe
Built: Mon Jun 22 15:45:36 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.12
API version: 1.40 (minimum version 1.12)
Go version: go1.13.10
Git commit: 48a66213fe
Built: Mon Jun 22 15:49:27 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
This environment variable is used to set the directory where uploaded templates are saved, similar to what CONTAINER_JFR_ARCHIVE_PATH does for archived recordings.
Doing export IMAGE_TAG=quay.io/andrewazores/container-jfr-operator:0.3.0; make image && podman push $IMAGE_TAG && podman image prune && make deploy sample_app2 and checking the operator logs, I see messages like the following:
{"level":"info","ts":1580248536.2963161,"logger":"controller_flightrecorder","msg":"Reconciling FlightRecorder","Request.Namespace":"default","Request.Name":"containerjfr"}
{"level":"info","ts":1580248536.328504,"logger":"containerjfr_client","msg":"sent command","json":{"command":"is-connected","args":null}}
{"level":"error","ts":1580248536.3304381,"logger":"containerjfr_client","msg":"could not read response","message":{"command":"is-connected","args":null},"error":"websocket: control frame length > 125","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/andrew/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/rh-jmc-team/container-jfr-operator/pkg/client.(*ContainerJfrClient).syncMessage\n\tcontainer-jfr-operator/pkg/client/containerjfr_client.go:179\ngithub.com/rh-jmc-team/container-jfr-operator/pkg/client.(*ContainerJfrClient).isConnected\n\tcontainer-jfr-operator/pkg/client/containerjfr_client.go:97\ngithub.com/rh-jmc-team/container-jfr-operator/pkg/client.(*ContainerJfrClient).Connect\n\tcontainer-jfr-operator/pkg/client/containerjfr_client.go:72\ngithub.com/rh-jmc-team/container-jfr-operator/pkg/controller/flightrecorder.(*ReconcileFlightRecorder).Reconcile\n\tcontainer-jfr-operator/pkg/controller/flightrecorder/flightrecorder_controller.go:125\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
{"level":"error","ts":1580248536.3305044,"logger":"controller_flightrecorder","msg":"failed to connect to target JVM","error":"websocket: control frame length > 125","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/andrew/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/rh-jmc-team/container-jfr-operator/pkg/controller/flightrecorder.(*ReconcileFlightRecorder).Reconcile\n\tcontainer-jfr-operator/pkg/controller/flightrecorder/flightrecorder_controller.go:127\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
{"level":"error","ts":1580248536.3306093,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"flightrecorder-controller","request":"default/containerjfr","error":"websocket: control frame length > 125","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/andrew/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/andrew/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/andrew/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
This appears to occur for every WebSocket connection. I have not yet bisected to narrow down which change caused this, but I suspect it is somehow related to the recent WebSocket SubProtocol auth token handling change. I had tested that previously and seen no issues, so this may be intermittent, or related to some other change.
Roughly, the workflow would be something like:
Install operator via marketplace, targeting a specific cluster
Create ResourceKind: Container JFR, modify the yaml, inserting minimal: true|false
A complete deployment will include jfr-datasource, Grafana, and the web UI (container-jfr-web); a minimal deployment excludes these.
It appears that operator bundling functionality was added to Operator SDK in 0.15.0:
https://github.com/operator-framework/operator-sdk/blob/v0.15.0/doc/cli/operator-sdk_bundle_create.md
Perhaps this is replacing operator-courier? There may also be other useful new features for us.
The environment variables for the Grafana datasource and dashboard URLs are missing the protocol portion, which is required.
(As Developer role)
After an Administrator installs the Operator, they can create a ContainerJFR CR which triggers the creation of a ContainerJFR Pod and its associated resources (Services+Routes, PersistentVolumeClaim). The exposed URL for one of these Routes is the ContainerJFR application URL.
There seems to be no easy way for a Developer user to find this application URL from their view within the OpenShift Console. They can see the Operator in the workspace topology, but cannot see the ContainerJFR Pod or its associated resources.
Deploying some other application (ex. quay.io/andrewazores/container-jmx-docker-listener) into the workspace produces a nice node in the topology graph, with a link to the application URL.
#72 brings a number of changes to the API provided by this operator. We need to update the API.md documentation to reflect these changes.
The name of the container-jfr service should be taken from the ContainerJFR CR rather than hardcoded. There is a TODO in the flightrecorder_controller regarding this already.
The clusterserviceversion YAML file contains, and is named according to, the latest version number of the operator bundle and image. This is annoying to update and easy to make a mistake by missing a field, or making a typo, etc. This should somehow be automated away - for example, the CSV YAML could be generated from a template. The CSV could then be updated/generated/published by invoking a script or Makefile target, with the version tag provided by something like an environment variable.
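As a sketch of the templating approach (the field names below are illustrative, not the full CSV schema), the version-dependent parts could be rendered from a template, with the version supplied externally, e.g. via an environment variable in a Makefile target:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// A tiny stand-in for the CSV YAML; only the version-dependent fields shown.
const csvTemplate = `metadata:
  name: container-jfr-operator.v{{ .Version }}
spec:
  version: {{ .Version }}
`

// renderCSV fills the version into every place the CSV needs it, so a release
// bump is a single input instead of several hand-edited fields.
func renderCSV(version string) (string, error) {
	t, err := template.New("csv").Parse(csvTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, struct{ Version string }{version}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, _ := renderCSV("0.4.0")
	fmt.Print(out)
}
```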
See:
JMX connections may be secured and require user authentication. For various reasons, this authentication layer will not be implemented across the WebSocket Command Channel, so any such API calls are being re-implemented with HTTP request handlers. The JMX auth credentials will then be supplied using an optional HTTP header. Connection failures will result in a 4xx status code (currently 407) from the ContainerJFR server to indicate to clients that the request was refused due to target connection authentication, as opposed to the 401 used to signify lacking client auth from ex. an OpenShift account token.
ContainerJFR has been updated to work behind SSL proxies since cryostatio/cryostat#82 / cryostatio/cryostat#83. This should allow the Operator to configure TLS edge termination for the container-jfr routes.
Similarly to the downloadUrl already included, ContainerJFR also provides a reportUrl for autogenerated rules-based analyses. Should this URL be added to the Recording CRD?
This is a pre-emptive bug against #45. That PR updates a controller to supply the service account auth token via a query parameter. However, there is planned work at cryostatio/cryostat#100 to switch from a query parameter to a WebSocket subprotocol, so the implementation here will need to be updated to correspond to that future work item.
Following up on #97, we should store the generated RJMX credentials in a Secret for later use by the API when connecting to Container JFR. The Secret can be created along with the rest of Container JFR's resources.
We can reference this Secret in the Container definition for Container JFR using the envFrom property, as seen here:
https://kubernetes.io/docs/concepts/configuration/secret/#use-case-as-container-environment-variables
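A sketch of the resulting container definition; the Secret name below is an assumption for illustration:

```yaml
# Hypothetical pod spec fragment: each key in the Secret becomes an
# environment variable in the Container JFR container.
containers:
  - name: container-jfr
    envFrom:
      - secretRef:
          name: containerjfr-jmx-auth  # assumed Secret name
```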
The Grafana dashboard container comes pre-configured with a default admin account. This should be disabled, and some sort of secret configuration added, so that the outside world does not have admin access to the Grafana container. Relating to #50, there should be some access token that the operator service account can use for configuring the Grafana instance. This (or a similar) token, or some non-default credentials, should be attached to the Grafana instance in a way that still allows the cluster admin to get admin access to it, in case additional configuration is needed.
The service account permissions listed in the clusterserviceversion YAML are copied in from role.yaml. Similar to the concerns in #48, this is error-prone and should be automated away. Either the clusterserviceversion YAML should be generated from a template with the permissions section copied in from role.yaml, or role.yaml should be generated by extracting the permissions from the CSV YAML.
The rhjmc.redhat.com/flightrecorder label is meant to allow potential clients to query all recordings for a pod using a label selector [1]. The reference to this FlightRecorder is already present in the Recording object, so we could populate this label automatically in the Recording controller. This would save the user (or client) from applying this label themselves.
Since we use a finalizer to clean up recording resources, recording deletion is prevented until the operator removes the finalizer. If the operator pod is deleted before this can happen, trying to delete recordings will hang.
To fix this, we could add an OwnerReference to Recordings, listing either the operator's pod or the ContainerJFR custom resource as the owner, with BlockOwnerDeletion set to true. If I understand correctly, an attempt to delete the owner should attempt to delete the dependents, while keeping the owner alive to handle the finalizer.
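A sketch of what such an ownerReference might look like on a Recording, with placeholder name/UID values:

```yaml
# Hypothetical Recording metadata fragment: the ContainerJFR CR is listed as
# owner, and blockOwnerDeletion keeps the owner from being removed (under
# foreground deletion) while this dependent's finalizer is still pending.
metadata:
  ownerReferences:
    - apiVersion: rhjmc.redhat.com/v1alpha2  # assumed API version
      kind: ContainerJFR
      name: containerjfr
      uid: 00000000-0000-0000-0000-000000000000  # placeholder
      controller: true
      blockOwnerDeletion: true
```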
Since jfr-datasource is deployed within the same pod as Container JFR and Grafana, and does not need to be consumed directly by the user, we can bind over localhost and remove its service. This can be done with the quarkus.http.host system property. Currently this is hard-coded to 0.0.0.0 in the jfr-datasource Dockerfile [1]. We could make this configurable using an environment variable.
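For example, assuming Quarkus's standard environment-variable mapping of the quarkus.http.host property, the operator could set this on the jfr-datasource container:

```yaml
# Hypothetical container env fragment: bind jfr-datasource to the pod-local
# loopback interface only, so no Service is needed.
env:
  - name: QUARKUS_HTTP_HOST
    value: "127.0.0.1"
```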