open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector

License: Apache License 2.0

Shell 0.71% Makefile 1.32% Dockerfile 0.60% Go 97.22% TypeScript 0.11% Java 0.03% Python 0.01% JavaScript 0.01%
hacktoberfest kubernetes-operator opentelemetry opentelemetry-collector

opentelemetry-operator's People

Contributors

anuraaga, avadhut123pisal, bhiravabhatla, bogdandrutu, changexd, chrlic, dependabot[bot], frzifus, iblancasa, ishwarkanse, jaronoff97, jdcrouse, jpkrohling, kevinearls, kielek, kristinapathak, mat-rumian, matej-g, moh-osman3, objectiser, opentelemetrybot, pavolloffay, pureklkl, rashmichandrashekar, rsvarma95, rubenvp8510, swiatekm, tylerhelmuth, vineethreddy02, yuriolisa


opentelemetry-operator's Issues

Update to collector v0.12.0

The collector v0.12.0 is being released today. This issue is to track the update to this version, including the upgrade procedure based on the changelog.

Issue "Not the leader"

Hi,

I migrated my cluster with Velero to a new cluster, and since then the operator seems unable to do anything: it doesn't recreate the Deployment for the OpenTelemetryCollector resource I create.

I tried removing and recreating the operator deployment, but these are the logs I get:

{"level":"info","ts":1587319935.2470012,"logger":"cmd","msg":"Starting the OpenTelemetry Operator","opentelemetry-operator":"v0.0.2","opentelemetry-collector":"0.0.2","build-date":"2019-10-07T12:05:27Z","go-version":"go1.12.9","go-arch":"amd64","go-os":"linux","operator-sdk-version":"v0.10.0"}
{"level":"info","ts":1587319935.24741,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1587319935.507039,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587319936.6374328,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587319939.018057,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587319943.5538852,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587319952.259436,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587319969.6250787,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587319987.8278975,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587320004.0431714,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587320020.5508134,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587320036.8663146,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587320053.8358185,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1587320071.4925714,"logger":"leader","msg":"Not the leader. Waiting."}

Do you have any idea what could cause that?

Thank you,

Deletion of resources shouldn't fail if the webhook server isn't available

When the operator isn't up and running, removing resources will fail with:

$ kubectl delete otelcol opentelemetrycollector-sample
Error from server (InternalError): Internal error occurred: failed calling webhook "vopentelemetrycollector.kb.io": Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/validate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=30s": dial tcp 10.96.165.160:443: connect: connection refused

$ kubectl get otelcols
NAME                            MODE         VERSION   AGE
opentelemetrycollector-sample   deployment   0.10.0    3h39m

Create a tools.go

Create a tools.go with the imports for the tools we need to build the project, such as operator-sdk, kustomize, controller-gen, ...

Change calls to Update() to Patch()

Instead of doing Update() calls for the changes we detect, we should attempt to use Patch(), to reduce the number of failures caused by stale objects.

From:

	if err := params.Client.Update(ctx, updated); err != nil {
		return fmt.Errorf("failed to apply changes: %w", err)
	}

To:

	// at the beginning of the reconciliation function, store a copy of `params.Instance` as `changed`
	patch := client.MergeFrom(&params.Instance)
	if err := params.Client.Patch(ctx, &changed, patch); err != nil {
		return fmt.Errorf("failed to apply changes: %w", err)
	}

Allow volumes to be specified/mounted

When a receiver/exporter has TLS configured, it's common to have the certs mounted somewhere. As such, the operator needs to accept volume information and use it as part of the collector's deployment.
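A minimal sketch of what the spec additions could look like. The `volumes`/`volumeMounts` field names are hypothetical (mirroring the corev1 types); the final shape would be decided during review:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: with-certs
spec:
  # hypothetical fields, mirroring corev1.Volume / corev1.VolumeMount
  volumes:
    - name: certs
      secret:
        secretName: otel-collector-certs
  volumeMounts:
    - name: certs
      mountPath: /etc/otel/certs
      readOnly: true
  config: |
    # receiver/exporter TLS settings would then reference /etc/otel/certs
```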

Auto-detect platform

As a prerequisite of #32, we need to be able to determine the platform the operator is running on, similar to what we have in the Jaeger Operator.

Black-box vs. white-box testing

The Operator SDK (and most other Go projects I've seen) takes a white-box testing approach by default. Kubebuilder apparently has opinions about this and follows a black-box approach, using separate test packages (foo vs. foo_test) so that the public API gets validated as well. The problem with this is that it requires the subject to have features that are needed only by the tests, at first and potentially forever. This makes the subject more complex without a real need.

Context: #31 (comment)

It's good to note that both approaches can live side by side; this issue is mainly to decide on our preferred approach, resorting to the other one only when the preferred approach isn't feasible.

Release v0.0.3

Would love to take advantage of a couple of the newest commits, specifically the port derivation. Is a new release in the works?

Add support for "distributions"

Currently, we cannot deploy opentelemetry-collector-contrib using this operator: although the image field can be set to opentelemetry-collector-contrib:<VERSION>, the container's command name differs from opentelemetry-collector.

Here is a sample deployment that we should be able to support with the operator:

      containers:
      - image: otel/opentelemetry-collector-contrib:0.7.0
        command:
          - "/otelcontribcol"
          - "--config=/conf/otel-agent-config.yaml"
          - "--mem-ballast-size-mib=165"

We could add a field in the CRD to support a configurable command field.
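A rough sketch of what such a CR could look like, assuming a hypothetical spec.command field (the name is subject to review):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: contrib
spec:
  image: otel/opentelemetry-collector-contrib:0.7.0
  # hypothetical field overriding the container entrypoint
  command:
    - /otelcontribcol
```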

Update to collector v0.13.0

The collector v0.13.0 is being released today. This issue is to track the update to this version, including the upgrade procedure based on the changelog.

Tests for reconcilers

The reconcilers (pkg/collector/reconcile) are currently untested. This issue is to track building the unit tests for them.

Cannot create Collector : Webhook deadline exceeded

Hi,

I cannot create the simplest OpenTelemetryCollector. I get the following error when I try to create it from STDIN:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": Post https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=30s: context deadline exceeded

The Opentelemetry Operator Controller Manager is up and running in the namespace opentelemetry-operator-system:

$ kubectl get po -n opentelemetry-operator-system                 
NAME                                                         READY   STATUS    RESTARTS   AGE
opentelemetry-operator-controller-manager-56f75fbb5d-qrdst   2/2     Running   0          17m

And the logs of both containers (manager and kube-rbac-proxy) do not show any error.

I installed the required resources using:

$ kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

Is there something I am missing?

Thank you for your help!

Example CRD leads to error in collector pod: "unknown processors type 'queued_retry'"

I ran the default CRD as shown in the main readme:

$ kubectl apply -f - <<EOF
> apiVersion: opentelemetry.io/v1alpha1
> kind: OpenTelemetryCollector
> metadata:
>   name: simplest
> spec:
>   config: |
>     receivers:
>       jaeger:
> 
>     processors:
>       queued_retry:
> 
>     exporters:
>       logging:
> 
>     service:
>       pipelines:
>         traces:
>           receivers: [jaeger]
>           processors: [queued_retry]
>           exporters: [logging]
> EOF
opentelemetrycollector.opentelemetry.io/simplest created

and once the pod was available it entered a CrashloopBackoff state, with the following logs:

$ kubectl logs pod/simplest-collector-c7d4768f7-np882
{"level":"info","ts":1586977441.512723,"caller":"service/service.go:265","msg":"Starting...","NumCPU":4}
{"level":"info","ts":1586977441.512775,"caller":"service/service.go:104","msg":"Setting up own telemetry..."}
{"level":"info","ts":1586977441.512987,"caller":"service/telemetry.go:93","msg":"Serving Prometheus metrics","port":8888}
{"level":"info","ts":1586977441.513012,"caller":"service/service.go:137","msg":"Loading configuration..."}
2020/04/15 19:04:01 Cannot load configuration: unknown processor type "queued_retry"

I tried removing the processor from the config (processors are optional according to the collector docs), which then led to this error:

$ kubectl logs pod/simplest-collector-c7d4768f7-np882
{"level":"info","ts":1586977563.5065687,"caller":"service/service.go:265","msg":"Starting...","NumCPU":4}
{"level":"info","ts":1586977563.5066233,"caller":"service/service.go:104","msg":"Setting up own telemetry..."}
{"level":"info","ts":1586977563.5068412,"caller":"service/telemetry.go:93","msg":"Serving Prometheus metrics","port":8888}
{"level":"info","ts":1586977563.5068662,"caller":"service/service.go:137","msg":"Loading configuration..."}
2020/04/15 19:06:03 Cannot load configuration: must have at least one pipeline

Which is odd because there is a pipeline configured:

$ kubectl get -o yaml otelcols/simplest
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"opentelemetry.io/v1alpha1","kind":"OpenTelemetryCollector","metadata":{"annotations":{},"name":"simplest","namespace":"otel-collector"},"spec":{"config":"receivers:\n  jaeger:\n\nprocessors:\n  queued_retry:\n\nexporters:\n  logging:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [jaeger]\n      processors: [queued_retry]\n      exporters: [logging]\n"}}
    prometheus.io/path: /metrics
    prometheus.io/port: "8888"
    prometheus.io/scrape: "true"
  creationTimestamp: "2020-04-15T19:03:10Z"
  generation: 2
  name: simplest
  namespace: otel-collector
  resourceVersion: "29709"
  selfLink: /apis/opentelemetry.io/v1alpha1/namespaces/otel-collector/opentelemetrycollectors/simplest
  uid: fc1d64fa-8ee0-40e9-a687-7e83272fb409
spec:
  config: |
    receivers:
      jaeger:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: [queued_retry]
          exporters: [logging]
status:
  replicas: 0
  version: 0.0.2

Version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.0-alpha.1.527+5668dbdc6e60f5", GitCommit:"5668dbdc6e60f54d45f2022fddf8a92359cdcac5", GitTreeState:"clean", BuildDate:"2020-04-13T18:28:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.0-rc.1", GitCommit:"4d0922f", GitTreeState:"clean", BuildDate:"2020-04-13T22:15:34Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Error in operator logs when OTEL Collector CR instance is managed by Flux || GitOps

It seems the operator tries to update the DaemonSet's labels but doesn't copy the Flux label down when creating the DaemonSet.

{"level":"error","ts":1605103614.469881,"logger":"controllers.OpenTelemetryCollector","msg":"failed to reconcile daemon sets","error":"failed to reconcile the expected daemon sets: failed to apply changes: DaemonSet.apps \"main-collector\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"app.kubernetes.io/component\":\"opentelemetry-collector\", \"app.kubernetes.io/instance\":\"opentelemetry-system.main\", \"app.kubernetes.io/managed-by\":\"opentelemetry-operator\", \"app.kubernetes.io/name\":\"main-collector\", \"app.kubernetes.io/part-of\":\"opentelemetry\", \"fluxcd.io/sync-gc-mark\":\"sha256.S_o66xL9t1DMr3tS8jPOC8WO8DnOZ3mX1Rm3ZtvJS9M\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\ngithub.com/open-telemetry/opentelemetry-operator/controllers.(*OpenTelemetryCollectorReconciler).RunTasks\n\t/workspace/controllers/opentelemetrycollector_controller.go:145\ngithub.com/open-telemetry/opentelemetry-operator/controllers.(*OpenTelemetryCollectorReconciler).Reconcile\n\t/workspace/controllers/opentelemetrycollector_controller.go:134\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90"}

Rethink what "Replicas" mean in the status object

We might want to rethink what "replicas" means, especially for mode: sidecar. Do we count all currently running replicas, or the desired ones? Should we have multiple values, like the desired/current pair that a Deployment has?

Align operator version with the tooling around it

We currently get the operator's version based on git tags, which brings a v prefix to the version. Some of the tooling around the operator/kubebuilder expects numeric versions, so, we should adjust the Makefile to use the appropriate notations.

Example:

$ make bundle
...
...
FATA[0000] invalid command options: release/v0.2.0-2-gb203820 is not a valid semantic version: Invalid character(s) found in major number "release/v0" 
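One way to normalize the tag, sketched here with a hypothetical variable name (the real fix would live in the Makefile):

```shell
# Strip the "release/v" prefix so tooling that expects bare semver is satisfied.
# RAW_TAG stands in for the output of `git describe --tags`.
RAW_TAG="release/v0.2.0-2-gb203820"
VERSION="${RAW_TAG#release/v}"
echo "$VERSION"
```

The `${var#prefix}` expansion is POSIX shell, so it works in both the Makefile's default shell and bash.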

Sidecar injection using admission webhooks

Kubebuilder provides a facility to provision certs and install webhooks. We can use that when doing the sidecar injection, instead of the approach followed by the jaeger-operator.

Create webhook to trim messages from the status object

As part of #55, we added a new array of messages to the Status object. For long-lived instances, this might potentially become a big list of messages, so we might want to consider trimming it to the last N messages.

Questions:

  • should this be a webhook? This is what makes more sense to me.
  • should we add a timestamp to the message?
  • should we remove by date, or by number of messages?
  • what would be a good age/number to keep?
  • should we add another message, saying that older messages have been removed?
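Whatever the delivery mechanism ends up being, the trimming itself is a small slice operation. A minimal sketch, with a hypothetical helper name and a plain []string standing in for the real status message type:

```go
package main

import "fmt"

// trimMessages keeps only the newest max entries of a status message list,
// assuming messages are appended in chronological order.
func trimMessages(messages []string, max int) []string {
	if len(messages) <= max {
		return messages
	}
	return messages[len(messages)-max:]
}

func main() {
	msgs := []string{"m1", "m2", "m3", "m4", "m5"}
	fmt.Println(trimMessages(msgs, 3)) // keeps only the three newest messages
}
```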

Validating/Defaulter webhooks

As preparation for #45, it would be good to have a simpler, noop Validating/Defaulter webhook. This is useful to test the webhook setup and configuration, and will provide the skeleton for allowing complex CR validation in the future.

Move back to plain go tests

With the move to kubebuilder, the tests we had in the old version of the operator were converted to ginkgo, as proposed by kubebuilder.

However, this move didn't make things better; it actually made them worse in a few scenarios, like debugging or running individual tests.

This task is to convert the tests back to the plain go style.

Allow logger to be configured via CLI

The kubebuilder bootstrap code comes with a default logger configuration for development purposes, but allows flags to be bound, so that users can configure it. We should be using that.

Refresh go proxy module information on release

When releasing the opentelemetry-operator, we might want to make an HTTP request to the Go module proxy, so that it knows about the new version. This should then cause the pkg.go.dev website to refresh the Godoc for this operator.

This is the URL that we should call for new releases: https://proxy.golang.org/github.com/open-telemetry/opentelemetry-operator/@v/v0.14.0.info , replacing 0.14.0 with the released version.

More info: https://go.dev/about
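The request itself is a one-liner, sketched below; VERSION is a placeholder for the freshly tagged release:

```shell
VERSION="v0.14.0"  # placeholder; substitute the tag that was just pushed
URL="https://proxy.golang.org/github.com/open-telemetry/opentelemetry-operator/@v/${VERSION}.info"
# -fsS: fail on HTTP errors and stay quiet, but still print real errors.
curl -fsS "$URL" || echo "proxy refresh request failed"
```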

Create a service account per CR

One of the resources that the operator should create for each CR is a service account. This way, admins can grant permissions to specific instances to perform special operations. For instance, an OpenTelemetry Collector might be configured with Prometheus sd_configs, which ultimately requires the instance to be able to list services.

Similarly, the spec should allow a service account to be specified.
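A rough sketch of the two sides of this; both the generated account's name and the spec.serviceAccount field are hypothetical:

```yaml
# Operator-generated ServiceAccount (name is illustrative):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: simplest-collector
---
# Hypothetical opt-out: the CR author supplies their own account instead.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  serviceAccount: my-prepared-account
```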

Cannot install CRDs (schema issue?)

Hi @jpkrohling,

When trying to install the first YAML file containing CRD definitions I get this error:
The CustomResourceDefinition "opentelemetrycollectors.opentelemetry.io" is invalid: spec.validation.openAPIV3Schema: Invalid value: apiextensions.JSONSchemaProps{ID:"", Schema:"", Ref:(*string)(nil), Descripti...

It says at the end must only have "properties", "required" or "description" at the root if the status subresource is enabled.

I'm running Kubernetes 1.11.10

Thank you,

EDIT: Seems to be a known compatibility issue with Kubernetes 1.11 😭

Update to collector v0.14.0

The collector v0.14.0 was released yesterday. This issue is to track the update to this version, including the upgrade procedure based on the changelog.

Enable opentelemetry collector image to be supplied to operator as a CLI flag

It is possible to build custom OpenTelemetry Collectors that contain a specific subset of receivers, processors, exporters, etc. that are of interest to a user.

Although this image can be specified as part of a CR, it may be better to hide that level of detail from the CR author and instead allow the operator itself to be configured with the specific collector image to use, e.g. via a CLI flag to the operator.

Publish container images

After the migration to kubebuilder, we don't have scripts anymore that publish container images to a repository.

Build a release script

From a new release/v.* tag, the release script should:

  1. build the binaries
  2. build the container images
  3. create the github release
  4. push the images

Detect when gRPC TLS is being used by the remote endpoint

When using Jaeger provisioned by the Jaeger Operator in OpenShift, most of the servers are configured to enable TLS.

When using the OpenTelemetry Operator, the provisioned OpenTelemetry Collector doesn't configure the exporter to use TLS when talking to its counterpart, causing data not to be transmitted. On the server side, a "TLS handshake" error is shown.

We need to, by default in OpenShift, set 'CAFile' to the service-ca, which is the same default as the Jaeger Operator: https://github.com/open-telemetry/opentelemetry-collector/blob/4eca960a4eb02104694324cf161ad9ec944c44c9/config/configtls/configtls.go#L35
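On OpenShift, that could translate to an exporter configuration along these lines. The exact key name and nesting vary across collector versions, and the service-ca path shown is the usual OpenShift mount point, not something this operator guarantees:

```yaml
exporters:
  jaeger:
    # hypothetical endpoint, for illustration only
    endpoint: jaeger-collector-headless.observability.svc:14250
    # assumed key exposed by configtls; newer versions may nest it under `tls:`
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
```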
