aws-observability / aws-otel-helm-charts
AWS Distro for OpenTelemetry (ADOT) Helm Charts
Home Page: https://aws-otel.github.io/
License: Apache License 2.0
Describe the bug
Usually, the namespace in a Helm chart is set by Helm itself: the namespace name via --namespace (or simply -n), and the option to create it via --create-namespace. That way the chart does not have to "host" the namespace, because Helm takes care of managing it.
In this chart, we declare the namespace amazon-metrics explicitly and set up all resources there, which is confusing when you try to deploy the chart to a different namespace.
My proposal is to:
change .Values.adotCollector.daemonSet.namespace to .Release.Namespace, in order to use the namespace set by Helm
move the values.global.namespaceOverride value one layer up, to Values.namespaceOverride
so it will look the same as most Helm charts.
Steps to reproduce
helm install -n [NAMESPACE_NAME] [RELEASE_NAME] [REPO_NAME]/adot-exporter-for-eks-on-ec2
helm install -n monitoring adot aws-otel/adot-exporter-for-eks-on-ec2
What did you expect to see?
All resources being created in [NAMESPACE_NAME], which is monitoring in my case.
What did you see instead?
All resources being created in the amazon-metrics namespace, which is also created by the chart.
Environment
This issue is environment-agnostic
Additional context
I'm willing to fix it by forking the chart and sending the PR for my proposal. Please let me know what you think about this issue.
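The proposed change could be sketched with a small helper, along the lines many charts use (the helper name below is illustrative, not something the chart currently defines):

```yaml
{{/* _helpers.tpl (sketch): prefer an explicit override, else the release namespace */}}
{{- define "adot-exporter.namespace" -}}
{{- default .Release.Namespace .Values.namespaceOverride -}}
{{- end -}}
```

Each templated resource would then use `namespace: {{ include "adot-exporter.namespace" . }}` instead of the hardcoded value.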
Describe the bug
I tried to helm install ADOT. After I attached the CloudWatchAgentServerPolicy policy to the EKS node group, I could see the metrics and logs in CloudWatch. However, I see the log lines below when I print the collector's logs. I am not sure how this affects functionality.
kubectl logs adot-collector-daemonset-c98m8 -n amazon-metrics
E0917 19:25:08.199040 1 leaderelection.go:367] Failed to update lock: leases.coordination.k8s.io is forbidden: User "system:serviceaccount:amazon-metrics:adot-collector-sa" cannot create resource "leases" in API group "coordination.k8s.io" in the namespace "amazon-metrics"
E0917 19:25:17.959812 1 leaderelection.go:334] error initially creating leader election record: leases.coordination.k8s.io is forbidden: User "system:serviceaccount:amazon-metrics:adot-collector-sa" cannot create resource "leases" in API group "coordination.k8s.io" in the namespace "amazon-metrics"
Steps to reproduce
eks 1.21
helm install container-insights aws-observability/adot-exporter-for-eks-on-ec2 -f values.yml
Same as the default values file, with receivers and exporters updated for CloudWatch:
ampexporters:
  namespaces: ""
  endpoint: ""
  resourcetootel: false
  authenticator: "sigv4auth"
service:
  metrics:
    receivers: ["awscontainerinsightreceiver"]
    processors: ["batch/metrics"]
    exporters: ["awsemf"]
  extensions: ["health_check", "sigv4auth"]
What did you expect to see?
no failed or error logs
What did you see instead?
failed and error logs
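For reference, the leader-election errors above are usually resolved by granting lease permissions to the collector's service account. A sketch only, mirroring the names in the error message rather than the chart's actual templates:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: adot-collector-leader-election   # hypothetical name
  namespace: amazon-metrics
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "list", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: adot-collector-leader-election
  namespace: amazon-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: adot-collector-leader-election
subjects:
  - kind: ServiceAccount
    name: adot-collector-sa
    namespace: amazon-metrics
```

Whether the errors actually affect functionality depends on whether leader election is needed for the components in use.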
Hi,
I would like to suggest a new feature (multi pipeline support) for the helm chart.
Currently, only one pipeline is supported by the ADOT configuration. I have a use case where I would like to write some metrics to Prometheus and some metrics to CloudWatch.
A configuration like this would be a solution.
data:
  adot-config: |
    extensions:
      health_check:
    ...
    service:
      pipelines:
        metrics/prometheus:
          receivers:
            - prometheus
          processors:
            - batch/metrics
          exporters:
            - awsprometheusremotewrite
        metrics/cloudwatch:
          receivers:
            - awscontainerinsightreceiver
          processors:
            - batch/metrics
          exporters:
            - awsemf
However, there is currently no possibility to configure the service like that.
I deployed this to our cluster but hit IMDS errors and only partial logging. After reading many GitHub issue threads, I realized the chart ships a really old version and sets IMDSv1 by default.
I suggest modifying the default value for the aws/aws-for-fluent-bit version in values.yaml from 2.21.1 to 2.28.1. The v2.21.1 release is from Nov 2021, and there have been 18 releases since then, fixing a great number of issues and making improvements.
Additionally, setting imdsVersion to v2 by default (instead of v1) may lead to better outcomes.
This stuff is reasonably challenging to set up and validate; that is why people come to the chart. Keeping it up to date will help users be successful with it without a lot of labor tracking down defects that are already fixed in related packages.
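Until the defaults change, the values can be overridden at install time. The key names below are assumptions based on typical chart layouts; check the chart's values.yaml for the real ones:

```yaml
# values override (hypothetical key names)
fluentbit:
  image:
    tag: "2.28.1"
  imdsVersion: "v2"
```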
Hi team,
I was going over some documentation associated with this repo today:
However, I noticed that in #82, the templates for logging were removed from the helm chart, citing "stability of Logs upstream in the OTel community in 2023". I don't see any further issues or rationale behind this in the PR.
After installing the Helm chart, I don't see any container logs in CloudWatch. Is this expected of adot-collector, or is something in my configuration wrong? I've created values.yml:
---
awsRegion: "us-east-1"
clusterName: "my_cluster_name"
fluentbit:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::012345678901:role/AmazonEKSFluentBitRole"
      eks.amazonaws.com/sts-regional-endpoints: "true"
adotCollector:
  daemonSet:
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::012345678901:role/AmazonEKSOTELCollectorRole"
        eks.amazonaws.com/sts-regional-endpoints: "true"
And installed the chart:
helm install cloudwatch-container-insights aws-observability/adot-exporter-for-eks-on-ec2 -f values.yml
While the values are being merged:
helm get values cloudwatch-container-insights
USER-SUPPLIED VALUES:
adotCollector:
  daemonSet:
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::012345678901:role/AmazonEKSOTELCollectorRole
        eks.amazonaws.com/sts-regional-endpoints: "true"
awsRegion: us-east-1
clusterName: my_cluster_name
fluentbit:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::012345678901:role/AmazonEKSFluentBitRole
      eks.amazonaws.com/sts-regional-endpoints: "true"
The generated manifest does not include the annotations (I've stripped some content and left only ServiceAccount kind):
helm get manifest cloudwatch-container-insights
# Source: adot-exporter-for-eks-on-ec2/templates/adot-collector/serviceaccount.yaml
# Service account provides identity information for a user to be able to authenticate processes running in a pod.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: adot-collector-sa
  namespace: amzn-cloudwatch-metrics
---
# Source: adot-exporter-for-eks-on-ec2/templates/aws-for-fluent-bit/serviceaccount.yaml
# Service account provides identity information for a user to be able to authenticate processes running in a pod.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
One more confirmation that the annotations were not created:
kubectl get sa -n amzn-cloudwatch-metrics adot-collector-sa -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: cloudwatch-container-insights
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2022-03-23T10:44:38Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: adot-collector-sa
  namespace: amzn-cloudwatch-metrics
  resourceVersion: "418055"
  uid: 0bf1c377-eba7-4f72-9098-1f587037556f
secrets:
  - name: adot-collector-sa-token-44bnz
kubectl get sa -n amazon-cloudwatch fluent-bit -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: cloudwatch-container-insights
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2022-03-23T10:44:38Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: fluent-bit
  namespace: amazon-cloudwatch
  resourceVersion: "418054"
  uid: ec9456ab-7c4e-4b0b-910b-c5f48e76e6df
secrets:
  - name: fluent-bit-token-8sb8t
Expected results: Annotations for serviceAccount to be created in order to use IRSA.
Describe the bug
Steps to reproduce
Updated the receivers and the exporters to offload metrics and logs to CloudWatch.
helm install \
[RELEASE_NAME] [REPO_NAME]/adot-exporter-for-eks-on-ec2 \
--set clusterName=[CLUSTER_NAME] --set awsRegion=[AWS_REGION]
What did you expect to see?
FluentBit and Collector pods are both running
What did you see instead?
FluentBit pods do not deploy.
Describe the issue
Setting envFrom will help with passing in sensitive variables, such as AWS credentials or AMP credentials.
My proposal is to add {{ .Values.envFrom }} to the template, with a default value of envFrom: {} in values.yaml.
What did you expect to see?
Environment variables being populated from secret/configmap to the daemonset/sidecar
Environment
This issue is environment-agnostic
Additional context
I'm willing to fix it by forking the chart and sending the PR for my proposal. Please let me know what you think about this issue.
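The envFrom proposal above could be sketched like this (the container name and template layout are illustrative; note that envFrom is a list in the Pod spec, so [] may be a more natural default than {}):

```yaml
# values.yaml (proposed default)
envFrom: []

# daemonset.yaml template (sketch)
containers:
  - name: adot-collector-container
    {{- with .Values.envFrom }}
    envFrom:
      {{- toYaml . | nindent 6 }}
    {{- end }}
```

A user could then reference a Secret via `envFrom: [{secretRef: {name: my-aws-creds}}]` without baking credentials into the values file.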
Hi, I want to unset the CPU limit for adot-collector-container, but if I don't set the CPU limit in values.yaml, it uses the default value (200m).
helm chart version: 0.14.0
I am attaching my values.yaml file.
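A common chart pattern that would allow leaving the CPU limit unset is to emit the key only when a value is provided. This is a sketch of a possible template change, not the chart's current behavior:

```yaml
resources:
  limits:
    {{- with .Values.adotCollector.daemonSet.resources.limits.cpu }}
    cpu: {{ . | quote }}
    {{- end }}
    memory: {{ .Values.adotCollector.daemonSet.resources.limits.memory | quote }}
```

With this pattern, omitting the cpu key in values.yaml would leave the container with no CPU limit rather than the 200m default.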
While trying to use the ADOT Collector to scrape metrics from the Prometheus /metrics endpoint, I noticed that the config ignores the prometheus.io/port pod annotation.
I fixed it by updating the ampreceivers.scrapeConfigs configuration
from
- source_labels: [__address__]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $$1:$$2
  target_label: __address__
to
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $$1:$$2
  target_label: __address__
An evaluation should be made of whether the kubeVersion in the adot-exporter-for-eks-on-ec2 chart is still valid. Deprecated/removed APIs may be in use that are not compatible with newer versions of EKS.
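For illustration, the constraint lives in Chart.yaml; the range below is only an example, not a recommendation:

```yaml
# Chart.yaml (sketch)
kubeVersion: ">=1.22.0-0"
```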
I have installed the AWS OTel Collector using the Helm chart provided in this repository. I am able to send metrics to CloudWatch, and I can see logs appearing in CloudWatch Logs as well, which is a good sign that the collector is working.
The installation went fine with a couple of hiccups. I tried instrumenting a sample application, but the pod is unable to connect to the collector service on port 4317. Logs below.
My OTel HelmRelease looks like this:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: aws-otel
  namespace: infra
spec:
  releaseName: aws-otel
  interval: 5m
  chart:
    spec:
      chart: adot-exporter-for-eks-on-ec2
      sourceRef:
        kind: HelmRepository
        name: aws-otel
        namespace: infra
  values:
    nameOverride: aws-otel
    clusterName: dev
    awsRegion: "us-west-2"
    adotCollector:
      image:
        name: "aws-otel-collector"
        repository: "amazon/aws-otel-collector"
        tag: "v0.29.0"
        daemonSetPullPolicy: "IfNotPresent"
        sidecarPullPolicy: "Always"
      daemonSet:
        enabled: true
        daemonSetName: "adot-collector-daemonset"
        createNamespace: false
        namespace: "infra"
        clusterRoleName: "dataos-core-dev-adot-collector-role"
        clusterRoleBindingName: "adot-collector-role-binding"
        command:
          - "/awscollector"
          - "--config=/conf/adot-config.yaml"
        resources:
          limits:
            cpu: "200m"
            memory: "200Mi"
          requests:
            cpu: "200m"
            memory: "200Mi"
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: 0.0.0.0:4317
              http:
                endpoint: 0.0.0.0:4318
        exporters:
          awsxray:
            region: us-west-2
        processors:
          memory_limiter:
            limit_mib: 100
            check_interval: 5s
        extensions:
          sigv4auth:
            assume_role:
              arn: "arn:aws:iam::xxxxxxxxxx:role/adot-collector-sa"
              sts_region: "us-west-2"
        cwexporters:
          namespace: "ContainerInsights"
          logGroupName: "aws-otel"
          logStreamName: "InputNodeName"
          enabled: true
          dimensionRollupOption: "NoDimensionRollup"
          parseJsonEncodedAttrValues: ["Sources", "kubernetes"]
          metricDeclarations: |
            # node metrics
            - dimensions: [[NodeName, InstanceId, ClusterName]]
              metric_name_selectors:
                - node_cpu_utilization
                - node_memory_utilization
                - node_network_total_bytes
                - node_cpu_reserved_capacity
                - node_memory_reserved_capacity
                - node_number_of_running_pods
                - node_number_of_running_containers
            - dimensions: [[ClusterName]]
              metric_name_selectors:
                - node_cpu_utilization
                - node_memory_utilization
                - node_network_total_bytes
                - node_cpu_reserved_capacity
                - node_memory_reserved_capacity
                - node_number_of_running_pods
                - node_number_of_running_containers
                - node_cpu_usage_total
                - node_cpu_limit
                - node_memory_working_set
                - node_memory_limit
            # pod metrics
            - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
              metric_name_selectors:
                - pod_cpu_utilization
                - pod_memory_utilization
                - pod_network_rx_bytes
                - pod_network_tx_bytes
                - pod_cpu_utilization_over_pod_limit
                - pod_memory_utilization_over_pod_limit
            - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
              metric_name_selectors:
                - pod_cpu_reserved_capacity
                - pod_memory_reserved_capacity
            - dimensions: [[PodName, Namespace, ClusterName]]
              metric_name_selectors:
                - pod_number_of_container_restarts
            # cluster metrics
            - dimensions: [[ClusterName]]
              metric_name_selectors:
                - cluster_node_count
                - cluster_failed_node_count
            # service metrics
            - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
              metric_name_selectors:
                - service_number_of_running_pods
            # node fs metrics
            - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
              metric_name_selectors:
                - node_filesystem_utilization
            # namespace metrics
            - dimensions: [[Namespace, ClusterName], [ClusterName]]
              metric_name_selectors:
                - namespace_number_of_running_pods
        service:
          pipelines:
            traces:
              processors:
                - memory_limiter
              receivers:
                - otlp
              exporters:
                - awsxray
            metrics:
              receivers: ["awscontainerinsightreceiver"]
              processors: ["batch/metrics"]
              exporters: ["awsemf"]
          extensions: ["sigv4auth"]
Apart from the above issue, there are a couple of things I do not understand, such as the sigv4auth extension. Any help on this would be really appreciated, as I cannot find anything on the internet related to such a problem.
Hi,
Making this a general issue and not a bug report, as I did not have time to re-test. There is a large possibility I am wrong here, and I apologize if that is the case!
Does the adot-exporter-for-eks-on-ec2 chart support container metrics from EC2 hosts using Bottlerocket and containerd?
I see no containerdsock mount in values.yaml. That is in line with the "No pod metrics when using Bottlerocket for Amazon EKS" common error, and it would be in line with the experience I had with CloudWatch Container Insights.
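For comparison, collecting pod metrics on containerd hosts generally requires mounting the containerd socket into the collector DaemonSet. A sketch of what that could look like (the socket path is the common default; these keys are not currently in the chart's values.yaml):

```yaml
volumes:
  - name: containerdsock
    hostPath:
      path: /run/containerd/containerd.sock
containers:
  - name: adot-collector-container
    volumeMounts:
      - name: containerdsock
        mountPath: /run/containerd/containerd.sock
        readOnly: true
```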
Hi,
As per my understanding, this Helm chart takes care of deploying the following agents/collectors on the k8s cluster:
1.) FluentBit agent: deployed as a DaemonSet on the k8s cluster, responsible for gathering and offloading application, host, and data-plane logs into CloudWatch.
2.) OTel collector: also deployed as a DaemonSet on the k8s cluster, responsible for gathering and offloading metrics data into CloudWatch.
So I'm wondering what the role of the CloudWatch agent is in this setup. I see the following section in the values.yaml file, and these fields are referenced in the INPUT section of the configmap.yaml file:
cloudwatchAgent:
  path: "/var/log/containers/cloudwatch-agent*"
  dockerModeParser: "cwagent_firstline"
  db: "/var/fluent-bit/state/flb_cwagent.db"
  memBufLimit: "5MB"
Describe the bug
The adot-exporter-for-eks-on-ec2 Helm installation does not work with an existing namespace and service account. I'm using CDK blueprints: I create a namespace amazon-metrics, then create an IRSA named adot-collector-sa, and then deploy the chart with the following additional values:
let values: ValuesSchema = {
  awsRegion: cluster.stack.region,
  clusterName: cluster.clusterName,
  fluentbit: {
    enabled: true
  },
  serviceAccount: {
    create: false,
  },
  adotCollector: {
    daemonSet: {
      createNamespace: false,
      service: {
        metrics: {
          receivers: ["awscontainerinsightreceiver"],
          exporters: ["awsemf"],
        }
      },
      serviceAccount: {
        create: false,
      },
      cwexporters: {
        logStreamName: "EKSNode",
      }
    }
  }
};
The Helm installation fails with the errors below, which clearly show that serviceAccount: create: false does not work. I would appreciate any resolutions to this. This is a dependency for a blueprints CDK EKS add-on.
2:49:47 PM | CREATE_FAILED | Custom::AWSCDK-EKS-HelmChart | blueprintconstruct...oreksonec2037B3D69
Received response status [FAILED] from custom resource. Message returned: Error: b'Release "adot-eks-addon" does not exist. Installing it now.\nError: rendered manifests contain a resource that already exists. Unable to continue with install: ServiceAccount "adot-collector-sa" in namespace "amazon-metrics" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "adot-eks-addon"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "default"\n'
Hi Team,
Enhancement #23 enabled CloudWatch log group retention for:
But it did not enable it for the performance log group, which is used by the adot-collector for metrics. I think it needs to be added to the adot-config data in the adot-collector ConfigMap, something like:
exporters:
  awsemf:
    namespace: {{ .Values.adotCollector.daemonSet.cwexporters.namespace }}
    log_group_name: '/aws/containerinsights/{{ .Values.clusterName }}/performance'
    log_stream_name: {{ .Values.adotCollector.daemonSet.cwexporters.logStreamName }}
    log_retention: 60
As described on this page: https://aws-otel.github.io/docs/getting-started/cloudwatch-metrics
It would be great if the Helm chart could be updated with this extra option.
Thanks
My goal is to use this chart to deploy FluentBit to forward logs to CloudWatch logs as described in the docs here.
I'm setting my values to the following:
adotCollector.daemonSet.service.metrics.receivers is awscontainerinsightreceiver
adotCollector.daemonSet.service.metrics.exporters is awsemf
Note these values are slightly different from what the doc referenced above says they should be, but I believe these are the correct ones. I've tried the other ones too, of course. It's unclear to me whether the adotCollector is metrics-only or has something to do with logs as well.
I see that the collector has quite a bit of variety when it comes to components (receivers, processors, etc.), but I'm lost as to what the magic combination might be.
With the above config, the fluent-bit pods fail to start and enter CrashLoopBackOff. Running kubectl logs <fluent bit podname> yields the following, but I have no way to debug this because I can't connect to the pod to view that application-log.conf file.
Fluent Bit v1.8.9
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
Error: Configuration file contains errors. Aborting
AWS for Fluent Bit Container Image Version 2.21.1
[2022/04/08 16:52:33] [ Error] File application-log.conf
[2022/04/08 16:52:33] [ Error] Error in line 58: Key has an empty value
Questions
What should I set in the fluentBit portion of the values? I have fluentBit.enabled set to true, but nothing else seems like something I should set, and I don't see anything in the docs about that.
Regarding the prerequisites, I have bound two worker-role managed policies to my nodes: CloudWatchLogsFullAccess and CloudWatchAgentServerPolicy. Since I'm able to see the metrics pods start up and send their metrics to CloudWatch, perhaps this part is fine.
I'm deploying the FluentBit part of this chart. Here's what my pods look like:
$ kubectl get pods --all-namespaces | grep amazon
amazon-cloudwatch fluent-bit-448z6 1/1 Running 0 123m
amazon-cloudwatch fluent-bit-9s8jz 1/1 Running 0 123m
amazon-cloudwatch fluent-bit-jblg5 1/1 Running 0 123m
amazon-cloudwatch fluent-bit-ts4kg 1/1 Running 0 123m
amazon-metrics adot-collector-daemonset-2s4zj 1/1 Running 0 123m
amazon-metrics adot-collector-daemonset-9fhd7 1/1 Running 0 123m
amazon-metrics adot-collector-daemonset-g6t9m 1/1 Running 0 123m
amazon-metrics adot-collector-daemonset-qdcf2 1/1 Running 0 123m
The docs say here that 4 log groups should be created once the pods are deployed, but in the CloudWatch dashboard I only see one group, /performance, for the cluster.
What additional config do I need to see application logs? The only thing I'm doing now is setting fluentBit.enabled to true in the values.
My pods generate logs (I can see them by doing kubectl logs <pod name> at least).
The logs for a given fluent-bit pod look like this:
Fluent Bit v1.8.9
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/04/08 17:57:22] [ info] [engine] started (pid=1)
[2022/04/08 17:57:22] [ info] [storage] created root path /var/fluent-bit/state/flb-storage/
[2022/04/08 17:57:22] [ info] [storage] version=1.1.5, initializing...
[2022/04/08 17:57:22] [ info] [storage] root path '/var/fluent-bit/state/flb-storage/'
[2022/04/08 17:57:22] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/04/08 17:57:22] [ info] [storage] backlog input plugin: storage_backlog.8
[2022/04/08 17:57:22] [ info] [cmetrics] version=0.2.2
[2022/04/08 17:57:22] [ info] [input:storage_backlog:storage_backlog.8] queue memory limit: 4.8M
[2022/04/08 17:57:22] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2022/04/08 17:57:22] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2022/04/08 17:57:22] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2022/04/08 17:57:22] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
Describe the bug
In the FluentBit config, IMDS is hardcoded to v1. When using IMDSv2, the v1 endpoint no longer works and FluentBit complains.
Steps to reproduce
Deploy the Helm chart with FluentBit enabled, on a cluster where IMDSv2 is enabled.
What did you expect to see?
The helm chart working and logs appearing in CloudWatch Logs.
What did you see instead?
[error] [filter:aws:aws.2] Could not retrieve ec2 metadata from IMDS
Environment
EKS version: 1.21
EC2 workers: Managed Node Group, version 1.21, using Bottlerocket OS 1.4.2 (aws-k8s-1.21), with IMDSv2 enabled.
Additional context
Note the hardcoded v1 string:
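The Fluent Bit aws filter does expose an imds_version option, so one possible fix is to make it configurable in the chart's ConfigMap template. A sketch, with a hypothetical value key:

```
[FILTER]
    Name          aws
    Match         *
    imds_version  {{ .Values.fluentbit.imdsVersion | default "v2" }}
```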
Hi Team,
By default, CloudWatch retains the logs forever. Many of our customers have expressed interest in retaining logs for a specific duration in CloudWatch to save on storage cost. We implemented that by configuring the log_retention_days parameter in the FluentBit agent config file when installing the agent separately on the EKS cluster.
It would be great if the Helm chart could also support this capability. I can work on this enhancement; please let me know if that works for the team. Thanks.
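For reference, the cloudwatch_logs output plugin supports log_retention_days, so the chart could thread a value through its ConfigMap template. A sketch, with a hypothetical value key:

```
[OUTPUT]
    Name                cloudwatch_logs
    Match               application.*
    region              {{ .Values.awsRegion }}
    log_group_name      /aws/containerinsights/{{ .Values.clusterName }}/application
    log_retention_days  {{ .Values.fluentbit.logRetentionDays }}
```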
Describe the bug
I'd like to be able to set image pull secrets in the DaemonSets for the Helm chart.
Steps to reproduce
N/A
What did you expect to see?
Allow a Helm value to be set that specifies an image pull secret to be used.
What did you see instead?
This is just not configurable yet.
Environment
Useful for enterprise solutions
Additional context
This is useful in cases where images are stored in an Enterprise repository, and access to the images requires a docker secret.
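A common way charts expose this (a sketch; the value name is a proposal, not something the chart currently has):

```yaml
# values.yaml (proposed)
imagePullSecrets: []

# daemonset.yaml template (sketch)
spec:
  template:
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
```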
Describe the bug
Following the documentation, I am getting a failure for the ADOT collector add-on. It used to work with version 0.1.0 of the chart, but it fails with all higher versions.
Steps to reproduce
Follow the documentation for offloading metrics to Amazon CloudWatch. I can see that the instructions for the two sections are identical:
CloudWatch
CloudWatch and AMP
What did you expect to see?
Expected an option to enable CloudWatch metrics only.
What did you see instead?
It appears that the AMP exporter is active by default.
The adot collector daemonset failed with the following error message:
builder/exporters_builder.go:40 Exporter is starting... {"kind": "exporter", "name": "awsprometheusremotewrite"}
Error: cannot start exporters: invalid endpoint: "http://some.url:9411/api/prom/push"
2022/07/13 02:26:20 application run finished with error: cannot start exporters: invalid endpoint: "http://some.url:9411/api/prom/push"
Stream closed EOF for amazon-metrics/adot-collector-daemonset-n54d9 (adot-collector-container)
Environment
EKS 1.21
Add OTEL Prometheus to Helm chart for full Container Insights Prometheus functionality
Noticed this while working on #32. I believe the doc here has incorrect values. I'm fairly sure I wasn't able to get the ADOT collector started unless I used the values below instead of the ones listed.
The doc suggests that:
adotCollector.daemonSet.service.metrics.receivers is awscontainerinsight
adotCollector.daemonSet.service.metrics.exporters is awsemfexporter
I think the correct values are:
adotCollector.daemonSet.service.metrics.receivers is awscontainerinsightreceiver
adotCollector.daemonSet.service.metrics.exporters is awsemf
I installed the ADOT collector with Helm using Argo CD, but when I check the logs for the daemonset it shows the error below. When I check the clusterrole template file https://github.com/aws-observability/aws-otel-helm-charts/blob/main/charts/adot-exporter-for-eks-on-ec2/templates/adot-collector/clusterrole.yaml I do not see a permission for the "services" resource.
Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:serviceaccount:amazon-metrics:adot-collector-sa" cannot list resource "services" in API group "" at the cluster scope
W0512 17:39:48.906520 1 reflector.go:535] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Service: services is forbidden: User "system:serviceaccount:amazon-metrics:adot-collector-sa" cannot list resource "services" in API group "" at the cluster scope
Hello, I have a cluster in the eu-south-1 region, but some ContainerInsights features are still not deployed to this region, so I set up the chart to send logs and metrics to eu-west-1. The logs are sent correctly to the right region, but the metrics still go to eu-south-1. It seems you missed propagating the value of awsRegion to the awsemf exporter config.
Just append the region property to the awsemf exporter:
region: {{ .Values.awsRegion }}
Hope this helps, have a nice day.
Hello,
Since our system runs on an EKS Fargate configuration, it would be great to have a Helm chart for installing and configuring AWS OTel.
Regards,
Vincenzo.
Hi
I use this Helm chart to install the ADOT collector in EKS. I deploy the collector in daemonset mode, and I can see it started successfully. There are no errors in the collector's log. I also modified the ConfigMap so that the collector can receive logs via OTLP and export them to CloudWatch Logs. This also looks OK.
However, from my app in another pod, with the collector endpoint set to "http://cluster-node-IP:4317", nothing happens when it tries to send logs to the collector. There are no messages in the app's log, and no new messages in the collector's log either.
Then I enabled the self-diagnostics log for OpenTelemetry in my app, and I can see these two exceptions when it tries to send logs to the collector.
Exception 1 - HTTP/2 handshake error
2024-05-15T04:16:15.7280926Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{http://10.17.72.214:4317/}{Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: An HTTP/2 connection could not be established because the server did not complete the HTTP/2 handshake. ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.NetworkStream'.", DebugException="System.Net.Http.HttpRequestException: An error occurred while sending the request.")
---> System.Net.Http.HttpRequestException: An error occurred while sending the request.
---> System.IO.IOException: An HTTP/2 connection could not be established because the server did not complete the HTTP/2 handshake.
---> System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.NetworkStream'.
at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at Grpc.Net.Client.Balancer.Internal.StreamWrapper.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Http.Http2Connection.SetupAsync(CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.SetupAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConstructHttp2ConnectionAsync(Stream stream, HttpRequestMessage request, IPEndPoint remoteEndPoint, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnectionPool.ConstructHttp2ConnectionAsync(Stream stream, HttpRequestMessage request, IPEndPoint remoteEndPoint, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(QueueItem queueItem)
at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
--- End of inner exception stack trace ---
at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at OpenTelemetry.Proto.Collector.Logs.V1.LogsService.LogsServiceClient.Export(ExportLogsServiceRequest request, CallOptions options)
at OpenTelemetry.Proto.Collector.Logs.V1.LogsService.LogsServiceClient.Export(ExportLogsServiceRequest request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken)
at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcLogExportClient.SendExportRequest(ExportLogsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}
Exception 2 - Broken Pipe error
2024-05-15T07:22:39.8784780Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{http://10.17.72.214:4317/}{Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. IOException: Unable to write data to the transport connection: Broken pipe. SocketException: Broken pipe", DebugException="System.Net.Http.HttpRequestException: An error occurred while sending the request.")
---> System.Net.Http.HttpRequestException: An error occurred while sending the request.
---> System.IO.IOException: The request was aborted.
---> System.IO.IOException: Unable to write data to the transport connection: Broken pipe.
---> System.Net.Sockets.SocketException (32): Broken pipe
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at Grpc.Net.Client.Balancer.Internal.StreamWrapper.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Http.Http2Connection.FlushOutgoingBytesAsync()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.Http.Http2Connection.FlushOutgoingBytesAsync()
at System.Net.Http.Http2Connection.ProcessOutgoingFramesAsync()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.FlushOutgoingBytesAsync()
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
at System.Net.Http.Http2Connection.Http2Stream.TryEnsureHeaders()
at System.Net.Http.Http2Connection.Http2Stream.ReadResponseHeadersAsync(CancellationToken cancellationToken)
at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
--- End of inner exception stack trace ---
at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at OpenTelemetry.Proto.Collector.Trace.V1.TraceService.TraceServiceClient.Export(ExportTraceServiceRequest request, CallOptions options)
at OpenTelemetry.Proto.Collector.Trace.V1.TraceService.TraceServiceClient.Export(ExportTraceServiceRequest request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken)
at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcTraceExportClient.SendExportRequest(ExportTraceServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}
Do you know what the root cause of this problem is?
Besides installing the collector with this Helm chart, is there anything else that must be installed first?
Thank you
TP
Installing the chart is not possible anymore.
Simple installation via helm:
helm install cloudwatch-container-insights aws-observability/adot-exporter-for-eks-on-ec2 -f values.yml
will fail with:
Error: INSTALLATION FAILED: create: failed to create: Request entity too large: limit is 3145728
I think the documentation directory can be safely added to .helmignore
Update the Helm chart to use a recent collector version, a recent Kubernetes version, and SigV4 auth for prometheusremotewrite.
We are currently using the ADOT collector on our EKS 1.23 cluster to send OTLP traces to Amazon OpenSearch. We are successfully able to ingest the traces and see the service map for our application.
However, since 1.23 is nearing end of support, we are planning to move to EKS v1.24.
Docker is not supported as a container runtime on 1.24, and we plan to use containerd as our container runtime.
Currently, in the values.yaml file, I see that this Helm chart mounts "/var/lib/docker" and "/var/run/docker.sock" as volumes in the collector DaemonSet.
Since Docker is not supported on EKS version 1.24 and later, will this collector DaemonSet still function as intended?
Can someone officially verify this Helm chart's support for v1.24?
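For reference, on containerd-based nodes the Container Insights receiver typically needs the containerd socket rather than the Docker-specific paths. A minimal sketch of a values override follows; the key names under adotCollector.daemonSet are illustrative and may not match this chart's actual schema:

```yaml
# Hypothetical override: replace the Docker-specific mounts with the
# containerd socket. Key names below are illustrative, not confirmed
# chart values.
adotCollector:
  daemonSet:
    volumes:
      - name: containerdsock
        hostPath:
          path: /run/containerd/containerd.sock
    volumeMounts:
      - name: containerdsock
        mountPath: /run/containerd/containerd.sock
        readOnly: true
```

Whether this is sufficient depends on how the receiver detects the container runtime, so it should be treated as a starting point, not a verified fix.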
Describe the issue
Currently, the default (static) configuration file exports metrics to AMP, ignoring the option to deploy only to CloudWatch. If you don't use AMP and expect your metrics to be pushed only to CloudWatch, the exporter breaks and gets stuck in a CrashLoop because no AMP URL has been set.
I ended up creating my own ConfigMap and forking the whole chart codebase to make it fit.
data:
adot-config: |
extensions:
health_check:
sigv4auth:
region: us-east-1
receivers:
awscontainerinsightreceiver:
collection_interval:
container_orchestrator:
add_service_as_attribute:
prefer_full_pod_name:
add_full_pod_name_metric_label:
processors:
batch/metrics:
timeout: 60s
exporters:
awsemf:
namespace: ContainerInsights
log_group_name: '/aws/containerinsights/clou-eu-central-1/performance'
log_stream_name: InputNodeName
region: eu-central-1
resource_to_telemetry_conversion:
enabled: true
dimension_rollup_option: NoDimensionRollup
parse_json_encoded_attr_values:
- Sources
- kubernetes
metric_declarations:
# node metrics
- dimensions: [[NodeName, InstanceId, ClusterName]]
metric_name_selectors:
- node_cpu_utilization
- node_memory_utilization
- node_network_total_bytes
- node_cpu_reserved_capacity
- node_memory_reserved_capacity
- node_number_of_running_pods
- node_number_of_running_containers
- dimensions: [[ClusterName]]
metric_name_selectors:
- node_cpu_utilization
- node_memory_utilization
- node_network_total_bytes
- node_cpu_reserved_capacity
- node_memory_reserved_capacity
- node_number_of_running_pods
- node_number_of_running_containers
- node_cpu_usage_total
- node_cpu_limit
- node_memory_working_set
- node_memory_limit
# pod metrics
- dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
metric_name_selectors:
- pod_cpu_utilization
- pod_memory_utilization
- pod_network_rx_bytes
- pod_network_tx_bytes
- pod_cpu_utilization_over_pod_limit
- pod_memory_utilization_over_pod_limit
- dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
metric_name_selectors:
- pod_cpu_reserved_capacity
- pod_memory_reserved_capacity
- dimensions: [[PodName, Namespace, ClusterName]]
metric_name_selectors:
- pod_number_of_container_restarts
# cluster metrics
- dimensions: [[ClusterName]]
metric_name_selectors:
- cluster_node_count
- cluster_failed_node_count
# service metrics
- dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
metric_name_selectors:
- service_number_of_running_pods
# node fs metrics
- dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
metric_name_selectors:
- node_filesystem_utilization
# namespace metrics
- dimensions: [[Namespace, ClusterName], [ClusterName]]
metric_name_selectors:
- namespace_number_of_running_pods
service:
pipelines:
metrics:
receivers:
- awscontainerinsightreceiver
processors:
- batch/metrics
exporters:
- awsemf
extensions:
- health_check
- sigv4auth
We need to build a template which will allow users to choose the exporter destination as well as the whole pipeline.
My proposal is to:
Template the configuration file with Helm and set the destination in values, whether it's an export to AMP, to CloudWatch, or both.
What did you expect to see?
Configuration file being adapted for the Cloudwatch use-case
Environment
This issue is environment-agnostic
Additional context
I'm willing to fix it by forking the chart and sending the PR for my proposal. Please let me know what you think about this issue.
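One way the proposal above could look in practice is a destination switch in values driving the exporters section of the ConfigMap template. The values schema (adotCollector.destinations, adotCollector.ampEndpoint) and the template below are purely illustrative, not existing chart options:

```yaml
# Hypothetical values schema:
# adotCollector:
#   destinations: ["cloudwatch"]   # any of: cloudwatch, amp
#
# Sketch of the corresponding ConfigMap template logic:
exporters:
  {{- if has "cloudwatch" .Values.adotCollector.destinations }}
  awsemf:
    namespace: ContainerInsights
    region: {{ .Values.awsRegion }}
  {{- end }}
  {{- if has "amp" .Values.adotCollector.destinations }}
  prometheusremotewrite:
    endpoint: {{ required "set adotCollector.ampEndpoint when amp is enabled" .Values.adotCollector.ampEndpoint }}
    auth:
      authenticator: sigv4auth
  {{- end }}
```

The service.pipelines.metrics.exporters list would need the same conditional treatment so the pipeline only references exporters that were actually rendered.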
Describe the bug
After setting the log_retention properties below to enabled: true
and days: 60
, the retention period shown in the CloudWatch console is still Never Expire:
fluentbit:
enabled: true
image:
tag: 2.28.1
output:
applicationLog:
log_retention:
enabled: true
days: 60
dataplaneLog:
log_retention:
enabled: true
days: 60
hostLog:
log_retention:
enabled: true
days: 60
Environment
Kubernetes 1.23
fluentbit 2.28.1 (I tried 2.21.1 with the same result)
aws-otel-helm-charts 0.7.0
Have a nice day
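Until the chart-level setting works, retention can be applied directly to the affected log groups with the AWS CLI. The log group name below is a placeholder; substitute your cluster's groups:

```shell
# Workaround sketch: set retention manually on an existing log group.
# Requires AWS credentials with logs:PutRetentionPolicy; the group name
# is a placeholder.
aws logs put-retention-policy \
  --log-group-name /aws/containerinsights/MY-CLUSTER/application \
  --retention-in-days 60
```

This only affects log groups that already exist, so it has to be re-run for any group the chart creates later.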
fluentbit and fargateLog already contain an enabled attribute, which can be set to true or false.
Please add the same "enabled" attribute to the adotCollector section as well, so one can decide what should be installed for each of the three components.
Describe the bug
Those pod attributes that the otel-exporter definitely needs to function, such as env
, volume
, command
and so on, shouldn't need to be set via values; they should instead be placed directly in the templates/
directory.
The current approach is inconvenient when you want to set your own environment variable, such as AWS SDK credentials: you then have to list in your values file all the other environment variables declared in the default values file, otherwise they will be gone. The same applies to volumes. Generally, the values file could be shorter and such inconveniences avoided.
My proposal is to:
Move env
, volume
, command
and other default attributes from values.yaml
to templates
, for both the daemonset and the sidecar.
Steps to reproduce
containersName: "adot-collector-container"
env:
- name: "AWS_REGION"
valueFrom:
secretKeyRef:
name: "adot-aws-credentials"
key: "AWS_REGION"
- name: "AWS_ACCESS_KEY_ID"
valueFrom:
secretKeyRef:
name: "adot-aws-credentials"
key: "AWS_ACCESS_KEY_ID"
- name: "AWS_SECRET_ACCESS_KEY"
valueFrom:
secretKeyRef:
name: "adot-aws-credentials"
key: "AWS_SECRET_ACCESS_KEY"
command: ...
helm install -n [NAMESPACE_NAME] [RELEASE_NAME] [REPO_NAME]/adot-exporter-for-eks-on-ec2
helm install -n monitoring adot aws-otel/adot-exporter-for-eks-on-ec2
What did you expect to see?
Declared variables set for the daemonset/sidecar together with the default variables, such as K8S_POD_NAME
What did you see instead?
All default variables are gone.
Environment
This issue is environment-agnostic
Additional context
I'm willing to fix it by forking the chart and sending the PR for my proposal. Please let me know what you think about this issue.
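A sketch of what the proposed template change could look like: the required defaults live in the template itself, and user-supplied entries are appended rather than replacing them. The extraEnv value name is hypothetical:

```yaml
# templates/daemonset.yaml (sketch): defaults are hard-coded in the
# template; entries from a hypothetical .Values.adotCollector.daemonSet.extraEnv
# are appended instead of overwriting the defaults.
env:
  - name: K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  {{- with .Values.adotCollector.daemonSet.extraEnv }}
  {{- toYaml . | nindent 2 }}
  {{- end }}
```

With this shape, the reproduction case above would only need the three AWS credential variables under extraEnv, and K8S_POD_NAME and friends would survive.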
After deploying the chart to a second cluster, it crashes with the following error:
2023/05/15 11:14:35 ADOT Collector version: v0.28.0
2023/05/15 11:14:35 found no extra config, skip it, err: open /opt/aws/aws-otel-collector/etc/extracfg.txt: no such file or directory
SDK 2023/05/15 11:14:35 WARN falling back to IMDSv1: operation error ec2imds: getToken, http response error StatusCode: 403, request to EC2 IMDS failed
Error: invalid configuration: extensions::sigv4auth: could not retrieve credential provider: failed to refresh cached credentials, unexpected empty EC2 IMDS role list
2023/05/15 11:14:35 application run finished with error: invalid configuration: extensions::sigv4auth: could not retrieve credential provider: failed to refresh cached credentials, unexpected empty EC2 IMDS role list
Is there a way to increase logging for the sigv4auth extension?
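One possible explanation, judging from the "falling back to IMDSv1 ... 403" warning: the node's IMDSv2 hop limit is 1, so requests from pods (one extra network hop) are rejected and the role list comes back empty. This is a guess, not a confirmed diagnosis. Raising the hop limit on the affected instance can be tested with the AWS CLI; the instance ID is a placeholder:

```shell
# Possible workaround sketch: allow IMDSv2 responses to reach pods by
# raising the hop limit (instance ID is a placeholder).
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2 \
  --http-tokens required
```

Using IRSA for the collector's service account avoids the instance metadata path entirely and is usually the cleaner fix.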
As per AWS documentation on dockershim deprecation: "Amazon EKS AMIs that are officially published will have containerd as the only runtime starting with version 1.23. This is targeted for end of the second quarter of 2022"
My customer has tested the aws-otel-helm-charts solution on a cluster that uses only containerd, and the solution stopped working, probably because of mounts to Docker-specific host paths.
The customer is asking if/when the solution will be adjusted to use only containerd.
Hi, how can I specify the configuration for Prometheus Remote Write?
Like here : https://github.com/aws-samples/amazon-eks-observability-demo/blob/main/observability/resources/adot-configmap.yaml
adot-collector-config: |
exporters:
awsprometheusremotewrite:
# replace this with your endpoint
endpoint: "${APS_REMOTE_WRITE_ENDPOINT}"
# replace this with your region
aws_auth:
region: "${APS_REGION}"
service: "aps"
namespace: "adot"
awsxray:
region: "${AWS_REGION}"
logging:
loglevel: debug
extensions:
health_check:
pprof:
endpoint: :1888
zpages:
endpoint: :55679
service:
extensions: [pprof, zpages, health_check]
pipelines:
traces:
receivers: [otlp]
exporters: [awsxray]
metrics:
receivers: [prometheus]
exporters: [logging, awsprometheusremotewrite]
In order to make IRSA possible, we need a dedicated annotations field for the fluent-bit and adot-collector-sa ServiceAccounts.
Currently, only the following is available in values.yaml:
serviceAccount:
create: true
annotations: {}
name: ""
Please provide a serviceAccount section for fluent-bit, including annotations, and add annotations to the already existing adot-collector-sa ServiceAccount section.
Currently it is only possible to attach an IAM policy to the worker node instance profile.
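A sketch of the requested values shape, with a fluentbit.serviceAccount block mirroring the existing one; both the exact key paths and the role ARNs are placeholders, only the eks.amazonaws.com/role-arn annotation key is the standard IRSA mechanism:

```yaml
# Hypothetical values layout for IRSA; role ARNs are placeholders.
adotCollector:
  daemonSet:
    serviceAccount:
      create: true
      name: adot-collector-sa
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/adot-collector
fluentbit:
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/fluent-bit
```

With annotations templated onto both ServiceAccounts, pods can assume per-component IAM roles instead of relying on the worker node profile.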