aws-samples / amazon-cloudwatch-container-insights

CloudWatch Agent Dockerfile and K8s YAML templates for CloudWatch Container Insights.

License: MIT No Attribution

Dockerfile 9.74% Shell 90.26%

amazon-cloudwatch-container-insights's Introduction

Amazon CloudWatch Container Insights

CloudWatch Agent Dockerfile and K8s YAML templates for CloudWatch Container Insights.
For more information about deploying Container Insights, see the deployment documentation.

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

amazon-cloudwatch-container-insights's People

Contributors

asakermohd, cassius7, chadpatel, dahu33, dchappa, drewzhang13, gubupt, haojhcwa, hdj630, jaypolanco, jefchien, khanhntd, klwntsingh, leandrodamascena, movence, ngoyal16, okankoamz, otterley, pettitwesley, pie-r, pingleig, pxaws, riita10069, rothgar, saxypandabear, sethamazon, sky333999, yimuniao, zhihonl, zhoufang-joe

amazon-cloudwatch-container-insights's Issues

CloudWatch Agent Not Working for G4DN Nodes

We are using the cwagent-kubernetes-monitoring and fluentd K8s YAML templates to capture logs in CloudWatch for CPU and GPU nodes in our EKS cluster. Our configuration works well for CPU nodes, but for GPU nodes (G4DN series) the cloudwatch-agent and fluentd-cloudwatch DaemonSets do not schedule their pods when a new GPU node is added to the pool. Has anyone else seen this issue? IAM permissions are fine: other images are downloaded from ECR, permissions allow read for all ECR resources, and the other containers are deployed as expected.

Images we are using:

fluentd-kubernetes-daemonset:v1.7.3-debian-cloudwatch-1.0
cloudwatch-agent:1.245315.0

Cloudwatch-agent pods keep CrashLoopBackOff in Private EKS Cluster

Summary

The cloudwatch-agent pods keep entering CrashLoopBackOff in a private EKS cluster.
What do I need to do to resolve this issue?
ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance

# kubectl get pod -n amazon-cloudwatch
NAME                       READY   STATUS             RESTARTS   AGE
cloudwatch-agent-2vhfl     0/1     CrashLoopBackOff   10         33m
cloudwatch-agent-mqdrr     0/1     CrashLoopBackOff   10         33m
cloudwatch-agent-x228l     0/1     CrashLoopBackOff   10         33m
fluentd-cloudwatch-ck6jp   1/1     Running            0          33m
fluentd-cloudwatch-ft72n   1/1     Running            0          33m
fluentd-cloudwatch-t6n5p   1/1     Running            0          33m
# kubectl logs -f pod/cloudwatch-agent-mqdrr -n amazon-cloudwatch
2020/08/26 07:22:42 I! 2020/08/26 07:22:39 E! ec2metadata is not available
2020/08/26 07:22:39 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
2020/08/26 07:22:40 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get http://169.254.170.2/v2/metadata: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020/08/26 07:22:41 W! retry [1/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get http://169.254.170.2/v2/metadata: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020/08/26 07:22:42 W! retry [2/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get http://169.254.170.2/v2/metadata: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020/08/26 07:22:42 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get http://169.254.170.2/v2/metadata: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
I! Detected the instance is OnPrem
2020/08/26 07:22:42 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
2020/08/26 07:22:42 Reading json config file path: /etc/cwagentconfig/..2020_08_26_06_51_18.480101466/cwagentconfig.json ...
2020/08/26 07:22:42 Find symbolic link /etc/cwagentconfig/..data
2020/08/26 07:22:42 Find symbolic link /etc/cwagentconfig/cwagentconfig.json
2020/08/26 07:22:42 Reading json config file path: /etc/cwagentconfig/cwagentconfig.json ...
Valid Json input schema.
Got Home directory: /root
No csm configuration found.
No metric configuration found.
Configuration validation first phase succeeded

2020/08/26 07:22:42 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2020/08/26 07:22:42 I! AmazonCloudWatchAgent Version 1.245315.0.
2020-08-26T07:22:42Z I! will use file based credentials provider
2020-08-26T07:22:42Z I! Starting AmazonCloudWatchAgent (version 1.245315.0)
2020-08-26T07:22:42Z I! Loaded outputs: cloudwatchlogs
2020-08-26T07:22:42Z I! Loaded inputs: cadvisor k8sapiserver
2020-08-26T07:22:42Z I! Tags enabled:
2020-08-26T07:22:42Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-0-2-150.ap-south-1.compute.internal", Flush Interval:1s
2020-08-26T07:22:42Z I! k8sapiserver Switch New Leader: ip-10-0-3-218.ap-south-1.compute.internal
2020-08-26T07:23:02Z E! ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance
Environment

I used the Quick Start Setup for Container Insights on Amazon EKS.
I created the cluster by following the private-eks-cluster GitHub project.

# kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-eks-2ba888", GitCommit:"2ba888155c7f8093a1bc06e3336333fbdb27b3da", GitTreeState:"clean", BuildDate:"2020-07-17T18:48:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

cloudwatch-agent daemonset pods keep crashing in cluster with concourse deployed.

This issue is based on AWS support case ID 6901474011. I was advised to create an issue here to follow up on progress.

The cloudwatch-agent pods crash because Concourse workers start containers themselves, which causes a panic in the CloudWatch agent. The error printed is:

panic: metric being merged has conflict in fields, src: {map[container_memory_cache:6590095360 container_memory_failcnt:0 container_memory_mapped_file:7602176 container_memory_max_usage:7279558656 container_memory_rss:257601536 container_memory_swap:0 container_memory_usage:7102160896 container_memory_working_set:1936506880] map[ContainerId:f13df96d6d02d3a0db9dd7c9fc0df7f1071728a1250ebaeb4e8fa6afab9d3e44 ContainerName:yad-helm-concourse-cicd-worker K8sPodName:yad-helm-concourse-cicd-worker-0 Namespace:concourse PodId:a16f894b-5f28-459a-a5b5-ad5e379d86a3 Timestamp:1585550390956] Container}, dest: {map[container_memory_cache:0 container_memory_failcnt:0 container_memory_mapped_file:0 container_memory_max_usage:0 container_memory_rss:0 container_memory_swap:0 container_memory_usage:0 container_memory_working_set:0] map[ContainerId:f13df96d6d02d3a0db9dd7c9fc0df7f1071728a1250ebaeb4e8fa6afab9d3e44 ContainerName:yad-helm-concourse-cicd-worker K8sPodName:yad-helm-concourse-cicd-worker-0 Namespace:concourse PodId:a16f894b-5f28-459a-a5b5-ad5e379d86a3 Timestamp:1585550391456] Container}

goroutine 21 [running]:
github.com/influxdata/telegraf/plugins/inputs/cadvisor/extractors.(*CAdvisorMetric).Merge(0xc00000d640, 0xc00000d6c0)
	/local/p4clients/pkgbuild-bwz9v/workspace/src/CWAgent/src/github.com/influxdata/telegraf/plugins/inputs/cadvisor/extractors/extractor.go:70 +0x2a4
github.com/influxdata/telegraf/plugins/inputs/cadvisor.mergeMetrics(0xc000ab4c00, 0x21, 0x40, 0x1f, 0x0, 0x0)
	/local/p4clients/pkgbuild-bwz9v/workspace/src/CWAgent/src/github.com/influxdata/telegraf/plugins/inputs/cadvisor/merger.go:15 +0x1df
github.com/influxdata/telegraf/plugins/inputs/cadvisor.processContainers(0xc0000d3d40, 0x32, 0x32, 0x1, 0xc0002e3c19, 0x3, 0x32, 0x0, 0x0)
	/local/p4clients/pkgbuild-bwz9v/workspace/src/CWAgent/src/github.com/influxdata/telegraf/plugins/inputs/cadvisor/container_info_processor.go:60 +0x4e5
github.com/influxdata/telegraf/plugins/inputs/cadvisor.(*Cadvisor).Gather(0xc00050ee40, 0x25dc340, 0xc0002f8a40, 0x0, 0x0)
	/local/p4clients/pkgbuild-bwz9v/workspace/src/CWAgent/src/github.com/influxdata/telegraf/plugins/inputs/cadvisor/cadvisor_linux.go:71 +0x1cb
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0xc00009e240, 0xc00007e1c0, 0xc0002f8a40)
	/local/p4clients/pkgbuild-bwz9v/workspace/src/CWAgent/src/github.com/influxdata/telegraf/agent/agent.go:186 +0x49
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
	/local/p4clients/pkgbuild-bwz9v/workspace/src/CWAgent/src/github.com/influxdata/telegraf/agent/agent.go:185 +0xc7

Support told me a fix would be available in the next release, but they couldn't tell me when that would be. Can anyone here tell me when the fix will be ready?

CloudWatch Agent sidecar not collecting EMF logs

Hi, my team has been following the examples and tutorials for using EMF within Fargate tasks, but we've been having issues. Our application currently emits logs in EMF and works fine when running in Lambda.

We've installed the sidecar container as recommended by this repo.

We observed that no EMF logs were being collected. When we enable the agent's debug mode, we see the metrics buffer remain at 0 while the application container is outputting EMF logs. Is there any additional configuration we have to do in order to get the EMF logs to the agent?

We did find this article (very similar to the other linked above): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Generation_CloudWatch_Agent.html, and it mentions the AWS_EMF_AGENT_ENDPOINT environment variable. I believe the Python repo https://github.com/awslabs/aws-embedded-metrics-python supports this variable. However, we designed our own implementation of EMF. Is this something we need to consider supporting?

Thank you in advance!

Applying multiline log parsing

I've tried to enable multiline log parsing with this solution, but haven't been able to get multi-line logs concatenated into a single event. I've followed the instructions at these links (https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace, https://docs.fluentbit.io/manual/pipeline/inputs/tail#multiline-support), but neither works with this solution.

How do you enable multiline log parsing with this solution for any kind of log, regardless of whether it has a timestamp?

I look forward to hearing from you soon.
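For reference, the approach in the linked Fluent Bit multiline-stacktrace documentation is to declare a custom multiline parser in the parsers file and reference it from the tail input. A minimal sketch follows; the rule regexes are illustrative only and must be adapted to your actual log format:

[MULTILINE_PARSER]
        name          multiline-custom
        type          regex
        flush_timeout 1000
        # A new record starts with a leading date; indented lines continue the previous record.
        rule          "start_state"   "/^\d{4}-\d{2}-\d{2}/"   "cont"
        rule          "cont"          "/^\s+/"                 "cont"

The parser is then attached to the tail input with multiline.parser multiline-custom.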

Pod logs not getting pushed with FluentD with various errors

I have several jobs running whose logs were previously being pushed to CloudWatch, and I was able to access them by job name.
Recently, however, I only see the cloudwatch and fluentd pod logs, without the actual logs I need for my jobs.

In some cases, I get this error on fluentd:

unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get \"http://169.254.170.2/v2/metadata\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n

And in others I get this on cloudwatch agent:

access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get \"http://169.254.170.2/v2/metadata\": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.\nNo csm configuration found.\nNo metric configuration found.\nConfiguration validation first phase succeeded\

I am new to this and do not fully understand how things work. Can someone please help me understand what might be going wrong? Is this a version-upgrade issue or something else?

Thanks!

Unable to pull latest image for cwagent-fluent-bit-quickstart.yaml

fluentd with invalid line found file="/var/log/dmesg" message

I just deployed fluentd on EKS with the command

and I got the following errors

$ kubectl logs -n amazon-cloudwatch   fluentd-cloudwatch-4lcnh 

2020-06-16 16:04:00 +0000 [warn]: #0 [in_tail_dmesg] invalid line found file="/var/log/dmesg" line="[    0.320007] ACPI: Using IOAPIC for interrupt routing" error="invalid time format: value = [ 0.320007] ACPI:, error_class = ArgumentError, error = string doesn't match"
2020-06-16 16:04:00 +0000 [warn]: #0 [in_tail_dmesg] invalid line found file="/var/log/dmesg" line="[    0.320018] PCI: Using host bridge windows from ACPI; if necessary, use \"pci=nocrs\" and report a bug" error="invalid time format: value = [ 0.320018] PCI:, error_class = ArgumentError, error = string doesn't match"
2020-06-16 16:04:00 +0000 [warn]: #0 [in_tail_dmesg] invalid line found file="/var/log/dmesg" line="[    0.328763] ACPI: Enabled 16 GPEs in block 00 to 0F" error="invalid time format: value = [ 0.328763] ACPI:, error_class = ArgumentError, error = string doesn't match"
2020-06-16 16:04:00 +0000 [warn]: #0 [in_tail_dmesg] invalid line found file="/var/log/dmesg" line="[    0.336023] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])" error="invalid time format: value = [ 0.336023] ACPI:, error_class = ArgumentError, error = string doesn't match"
2020-06-16 16:04:00 +0000 [warn]: #0 [in_tail_dmesg] invalid line found file="/var/log/dmesg" line="[    0.340006] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI]" error="invalid time format: value = [ 0.340006] acpi, error_class = ArgumentError, error = string doesn't match"

IMDSv2 should be default and version should be configurable

Following the AWS docs, I end up with a Fluent Bit that uses IMDSv1, whereas AWS's recommendation is to use only IMDSv2.

So the configmap fluent-bit-cluster-info should have an extra field to choose v1 vs. v2, and the default should be v2, not v1. However, it doesn't seem like a default is possible; the user would have to choose one or the other. A Helm chart would make this much easier. Any chance of that?

BTW, aws/aws-for-fluent-bit#177 discusses the issue I am having. In case others hit this issue: if you shell into one of the fluent-bit containers and run /fluent-bit/bin/fluent-bit --version, you will get the Fluent Bit version, but this is not the same as the Docker image version, which you can find in the /AWS_FOR_FLUENT_BIT_VERSION file in the Docker image:

bash-4.2# ls /
AWS_FOR_FLUENT_BIT_VERSION  boot  ecs		 etc	     home  lib64  media  opt   root  sbin  sys	usr
bin			    dev   entrypoint.sh  fluent-bit  lib   local  mnt	 proc  run   srv   tmp	var
bash-4.2# more AWS_FOR_FLUENT_BIT_VERSION
2.28.5
bash-4.2# 

AWS CloudWatch agent deployment failure on ROSA

Hi team, I have a ROSA cluster deployed in region us-east-2, created by following https://console.redhat.com/openshift/create/rosa/welcome

I am trying to set up the CloudWatch agent to collect metrics by following:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html

Before installation, we attached the CloudWatchAgentServerPolicy to each EC2 worker node.

But during the DaemonSet deployment we got this error:

[telegraf] Error running agent: could not initialize processor ec2tagger: ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance

Detailed log:

2021/11/19 10:04:19 I! 2021/11/19 10:04:16 E! ec2metadata is not available
2021/11/19 10:04:16 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
2021/11/19 10:04:17 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/11/19 10:04:18 W! retry [1/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/11/19 10:04:19 W! retry [2/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/11/19 10:04:19 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
I! Detected the instance is OnPrem
2021/11/19 10:04:19 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
2021/11/19 10:04:19 Reading json config file path: /etc/cwagentconfig/..2021_11_19_09_40_24.899664046/cwagentconfig.json ...
2021/11/19 10:04:19 Find symbolic link /etc/cwagentconfig/..data
2021/11/19 10:04:19 Find symbolic link /etc/cwagentconfig/cwagentconfig.json
2021/11/19 10:04:19 Reading json config file path: /etc/cwagentconfig/cwagentconfig.json ...
Valid Json input schema.
Got Home directory: /root
No csm configuration found.
No metric configuration found.
Configuration validation first phase succeeded

2021/11/19 10:04:19 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2021-11-19T10:04:19Z I! Starting AmazonCloudWatchAgent 1.247348.0
2021-11-19T10:04:19Z I! Loaded inputs: cadvisor k8sapiserver
2021-11-19T10:04:19Z I! Loaded aggregators:
2021-11-19T10:04:19Z I! Loaded processors: ec2tagger k8sdecorator
2021-11-19T10:04:19Z I! Loaded outputs: cloudwatchlogs
2021-11-19T10:04:19Z I! Tags enabled:
2021-11-19T10:04:19Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-0-137-118.us-east-2.compute.internal", Flush Interval:1s
2021-11-19T10:04:19Z I! [logagent] starting
2021-11-19T10:04:19Z I! [logagent] found plugin cloudwatchlogs is a log backend
2021-11-19T10:04:28Z E! [processors.ec2tagger] ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance
2021-11-19T10:04:28Z E! [telegraf] Error running agent: could not initialize processor ec2tagger: ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance

Here is my cluster info:

yuri9doo86@loaclhost aws % ..//rosa describe cluster -c cherryrosatest
Name:                       cherryrosatest
ID:                         1ogfk2f7f49ah56deg0neq1qks1cgbs2
External ID:                b136164e-766b-4ab0-a71e-7b4415ed4663
OpenShift Version:          4.9.5
Channel Group:              stable
DNS:                        cherryrosatest.3y19.p1.openshiftapps.com
AWS Account:               XXXXXX
API URL:                    https://api.cherryrosatest.3y19.p1.openshiftapps.com:6443
Console URL:                https://console-openshift-console.apps.cherryrosatest.3y19.p1.openshiftapps.com
Region:                     us-east-2
Multi-AZ:                   false
Nodes:
 - Control plane:           3
 - Infra:                   2
 - Compute (Autoscaled):    4-10
Network:
 - Service CIDR:            172.30.0.0/16
 - Machine CIDR:            10.0.0.0/16
 - Pod CIDR:                10.128.0.0/14
 - Host Prefix:             /23
STS Role ARN:               arn:aws:iam::675801125365:role/ManagedOpenShift-Installer-Role
Support Role ARN:           arn:aws:iam::675801125365:role/ManagedOpenShift-Support-Role
Instance IAM Roles:
 - Control plane:           arn:aws:iam::675801125365:role/ManagedOpenShift-ControlPlane-Role
 - Worker:                  arn:aws:iam::675801125365:role/ManagedOpenShift-Worker-Role
Operator IAM Roles:
 - arn:aws:iam::675801125365:role/cherryrosatest-s9a5-openshift-machine-api-aws-cloud-credentials
 - arn:aws:iam::675801125365:role/cherryrosatest-s9a5-openshift-cloud-credential-operator-cloud-cr
 - arn:aws:iam::675801125365:role/cherryrosatest-s9a5-openshift-image-registry-installer-cloud-cre
 - arn:aws:iam::675801125365:role/cherryrosatest-s9a5-openshift-ingress-operator-cloud-credentials
 - arn:aws:iam::675801125365:role/cherryrosatest-s9a5-openshift-cluster-csi-drivers-ebs-cloud-cred
State:                      ready
Private:                    No
Created:                    Nov 16 2021 05:56:28 UTC
Details Page:               https://console.redhat.com/openshift/details/s/20zKtEuEok2WNe2NVsKQwtEF7Fh
OIDC Endpoint URL:          https://rh-oidc.s3.us-east-1.amazonaws.com/1ogfk2f7f49ah56deg0neq1qks1cgbs2

Is CloudWatch Prometheus agent an open source project?

  1. I am curious whether this is open source? I haven't found any information yet.
  2. It seems the CloudWatch Prometheus agent replaces Prometheus and its configuration is compatible with Prometheus. Why not create a Prometheus -> CloudWatch sidecar that converts the metrics to the CloudWatch format directly? It wouldn't even need an agent; Kubernetes users could keep using Prometheus and easily adapt to CloudWatch. The current design seems a little heavier and is nibbling away at Prometheus.

Add nodeSelector to prevent scheduled pods on Windows worker nodes

Multi-OS worker nodes are becoming common on EKS.

When deploying the DaemonSet, since no nodeSelector is set, the pods will be scheduled on Windows nodes and fail.

A solution idea is to add a nodeSelector based on the operating system, e.g.:

      nodeSelector:
        beta.kubernetes.io/os: linux   
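For context, the selector sits under the DaemonSet's pod template spec; on clusters where the beta label is deprecated, kubernetes.io/os is the current equivalent. A minimal sketch:

spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: linux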

CloudWatch Prometheus agent can't read metrics export

The CloudWatch Prometheus agent is unable to read or determine the type of the exported metric data:

2020-08-03T22:00:05Z D! [97/500] Unsupported Prometheus metric: kamailio_sl_200_replies with type:
2020-08-03T22:00:05Z D! [98/500] Unsupported Prometheus metric: kamailio_sl_202_replies with type:
2020-08-03T22:00:05Z D! [99/500] Unsupported Prometheus metric: kamailio_sl_2xx_replies with type:
2020-08-03T22:00:05Z D! [100/500] Unsupported Prometheus metric: kamailio_sl_300_replies with type:
2020-08-03T22:00:05Z D! [101/500] Unsupported Prometheus metric: kamailio_sl_301_replies with type:
2020-08-03T22:00:05Z D! [102/500] Unsupported Prometheus metric: kamailio_sl_302_replies with type:
2020-08-03T22:00:05Z D! [103/500] Unsupported Prometheus metric: kamailio_sl_3xx_replies with type:
2020-08-03T22:00:05Z D! [104/500] Unsupported Prometheus metric: kamailio_sl_400_replies with type:
2020-08-03T22:00:05Z D! [105/500] Unsupported Prometheus metric: kamailio_sl_401_replies with type:
2020-08-03T22:00:05Z D! [106/500] Unsupported Prometheus metric: kamailio_sl_403_replies with type:
2020-08-03T22:00:05Z D! [107/500] Unsupported Prometheus metric: kamailio_sl_404_replies with type:
2020-08-03T22:00:05Z D! [108/500] Unsupported Prometheus metric: kamailio_sl_407_replies with type:
2020-08-03T22:00:05Z D! [109/500] Unsupported Prometheus metric: kamailio_sl_408_replies with type:
2020-08-03T22:00:05Z D! [110/500] Unsupported Prometheus metric: kamailio_sl_483_replies with type:
2020-08-03T22:00:05Z D! [111/500] Unsupported Prometheus metric: kamailio_sl_4xx_replies with type:
2020-08-03T22:00:05Z D! [112/500] Unsupported Prometheus metric: kamailio_sl_500_replies with type:
2020-08-03T22:00:05Z D! [113/500] Unsupported Prometheus metric: kamailio_sl_5xx_replies with type:

The metric data is exported in counter format with no labels:

kamailio_sl_200_replies 12 1596492670513
kamailio_sl_202_replies 0 1596492670513
kamailio_sl_2xx_replies 0 1596492670513
kamailio_sl_300_replies 0 1596492670513
kamailio_sl_301_replies 0 1596492670513
kamailio_sl_302_replies 0 1596492670513
kamailio_sl_3xx_replies 0 1596492670513
kamailio_sl_400_replies 0 1596492670513
kamailio_sl_401_replies 0 1596492670513
kamailio_sl_403_replies 0 1596492670513
kamailio_sl_404_replies 0 1596492670513
kamailio_sl_407_replies 0 1596492670513
kamailio_sl_408_replies 0 1596492670513
kamailio_sl_483_replies 0 1596492670513
kamailio_sl_4xx_replies 0 1596492670513

The exported data can be read by a standalone Prometheus; I used the latest Docker image, prometheus:v2.19.3.

Is this export format not valid for the current version of the CloudWatch Prometheus agent?

Failed to pull image "busybox"

After the last fluentd-cloudwatch update, I have this in the logs.

kubectl get events -A | grep -i "error|warning|failed" | grep fluentd-cloudwatch

Failed to pull image "busybox": rpc error: code = Unknown desc = Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

amazon-cloudwatch 12m Warning Failed pod/fluentd-cloudwatch-h8q6w Error: ErrImagePull
amazon-cloudwatch 12m Warning Failed pod/fluentd-cloudwatch-h8q6w Error: ImagePullBackOff
amazon-cloudwatch 19m Warning Failed pod/fluentd-cloudwatch-qd8hh Failed to pull image "busybox": rpc error: code = Unknown desc = Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

fluentd-cloudwatch can no longer start.

Could you include a variable to choose between a public AWS ECR repository and a private AWS ECR repository?
If the variable is not set, default to the busybox Docker Hub repository.
If the variable is set, use the repository provided in the variable.
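As a stopgap (assuming the failing pull is the manifest's busybox init container image), the image could be pointed at the Docker Official Images mirror on ECR Public instead of Docker Hub; a sketch, with the mirror path to be verified before relying on it:

      initContainers:
      - name: copy-fluentd-config
        image: public.ecr.aws/docker/library/busybox:latest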

Usage for an on-premises Kubernetes cluster.

Hi,

I tried using this as per the AWS documentation, but it seems that it only runs on AWS instances.
Is there any way to use this on physical nodes running an RKE Kubernetes cluster to send metrics to CloudWatch Container Insights?

Kind regards,

Eric V.

fluent-bit daemonset scheduled on fargate nodes

Following https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs-FluentBit.html#Container-Insights-FluentBit-setup

I ran into the issue that the provided fluent-bit DaemonSet definition tolerates all NoSchedule taints and gets scheduled on all nodes, including Fargate. The pods scheduled on Fargate naturally fail, since DaemonSets are not supported there.

I solved the issue by removing all tolerations from the DaemonSet spec.

As stated, removing the tolerations works, but not understanding why the tolerations are there in the first place prompted me to open this ticket.

Fargate sidecar not working

I found issue #8 and I'm having a similar issue.

I tried using the config provided in https://github.com/aws-samples/amazon-cloudwatch-container-insights/blob/master/ecs-task-definition-templates/deployment-mode/sidecar/cwagent-emf/cwagent-emf-fargate.json as a sidecar for a Fargate container.

This returns the same "ec2metadata is not available" error for me using the latest image:

2020-04-09 21:13:482020/04/09 19:13:48 E! Error: no inputs found, did you provide a valid config file?
2020-04-09 21:13:482020/04/09 19:13:48 I! AmazonCloudWatchAgent Version 1.230621.0.
2020-04-09 21:13:48I! Detected the instance is ECS
2020-04-09 21:13:482020/04/09 19:13:48 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
2020-04-09 21:13:48/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
2020-04-09 21:13:48Cannot access /etc/cwagentconfig: lstat /etc/cwagentconfig: no such file or directory2020/04/09 19:13:48 Reading json config from from environment variable CW_CONFIG_CONTENT.
2020-04-09 21:13:48Valid Json input schema.
2020-04-09 21:13:48I! detect region from ecs
2020-04-09 21:13:48No csm configuration found.
2020-04-09 21:13:48No metric configuration found.
2020-04-09 21:13:48Configuration validation first phase succeeded
2020-04-09 21:13:48
2020-04-09 21:13:482020/04/09 19:13:48 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2020-04-09 21:13:482020/04/09 19:13:48 I! 2020/04/09 19:13:48 E! ec2metadata is not available

My Fargate task also uses the X-Ray sidecar without any problems, and is running in a public VPC.

I'm using Fargate platform version 1.3.

INFO: Collecting Logs on Bottlerocket AMI

Bottlerocket log collection

No host logs on Bottlerocket

The Bottlerocket AMIs are meant to be a very stripped down container OS. Consequently, there are fewer log types to collect.

I've reached out to the Bottlerocket team and they said that all logs go to journald on Bottlerocket.

On Bottlerocket, the /aws/containerinsights/Cluster_Name/host log group will not be populated because the /var/log/dmesg, /var/log/secure, and /var/log/messages files do not exist on Bottlerocket.

dmesg logs can be obtained with journalctl -k or journalctl --dmesg:

bash-5.1# journalctl -k
May 19 19:21:55 localhost kernel: Linux version 5.15.108 (builder@buildkitsandbox) (x86_64-bottlerocket-linux-gnu-gcc (Buildroot 2022.11.1) 11.3.0, GNU ld (GNU Binutils) 2.38) #1 SMP Tue May 9 23:54:36 UTC 2023
May 19 19:21:55 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt3)/vmlinuz console=tty0 console=ttyS0,115200n8 net.ifnames=0 netdog.default-interface=eth0:dhcp4,dhcp6? quiet bootconfig root=/dev/dm-0 rootwait ro raid=noautodetect random.trust_cpu=on selinux=1 enforcing=1 "dm-mod.create=root,,,ro,0 1884160 verity 1 PARTUUID=9b48037e-3bab-4072-8daf-082d8fae9f5e/PARTNROFF=1 PARTUUID=9b48037e-3bab-4072-8daf-082d8fae9f5e/PARTNROFF=2 4096 4096 235520 1 sha256 f12f74a243a23635effccde072228bc55a9f06c4c6e587001619bff01f6f8a16 80248401eae3104a1d94e19d04da0bc2b90289d599845063bdf6b330c422e713 2 restart_on_corruption ignore_zero_blocks" -- systemd.log_target=journal-or-kmsg systemd.log_color=0 systemd.show_status=true
May 19 19:21:55 localhost kernel: KASLR enabled
May 19 19:21:55 localhost kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 19 19:21:55 localhost kernel: x86/fpu: Supporting XSAVE

However, Fluent Bit appears to only be able to collect systemd unit journald logs. The docs say that it can filter by systemd unit file only: https://docs.fluentbit.io/manual/pipeline/inputs/systemd

And when I tried an input with no filters, no logs were collected:

[INPUT]
        Name                systemd
        Tag                 systemd.all.*
        DB                  /var/fluent-bit/state/systemd.db
        Read_From_Tail      Off

Here's the contents of /var/log on my node:

[ec2-user@admin]$ sudo sheltie
bash-5.1# ls /var/log
aws-routed-eni    containers  dmesg  journal  kdump  pods  private  support

The logs in aws-routed-eni might be interesting to some users and could be collected with a Tail input: https://docs.fluentbit.io/manual/pipeline/inputs/tail

bash-5.1# cd aws-routed-eni/
bash-5.1# ls
egress-v4-plugin.log  ipamd.log  plugin.log

Notes:

  • There is a dmesg directory on my node in /var/log, but it's empty. I've reached out to the Bottlerocket team and they said that all logs go to journald on Bottlerocket. As noted above, you can use journalctl to obtain them.
  • Make sure you use the admin container if you choose to poke around the filesystem. I found that some directories were not visible without it.

Pod Logs will still be collected

Pod logs can still be collected without any change in experience.

Dataplane logs will still be collected

Kubelet and Containerd logs can still be collected without any change in experience.

[INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        Systemd_Filter      _SYSTEMD_UNIT=containerd.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

Too high cpu request for fluentbit daemonset

As per the example provided at https://github.com/aws-samples/amazon-cloudwatch-container-insights/blob/master/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml, Fluent Bit requests 500m of CPU inside EKS.

    resources:
        limits:
          memory: 200Mi
        requests:
          cpu: 500m
          memory: 100Mi

According to the Fluent Bit documentation and statistics provided by AWS, Fluent Bit is much less CPU- and memory-intensive than Fluentd, so why does this example request more CPU than the Fluentd counterparts did in the past? I would like to know how to decide how much CPU/memory to allocate to the DaemonSet pod, because with the above configuration it is almost impossible for me to use a t3.medium or t3.large instance: most of its CPU is consumed by the DaemonSet pod, leaving little to no room for real workload pods.
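For comparison, a reduced request might look like the sketch below; the values are illustrative and should be sized against the log volume you actually observe:

    resources:
        limits:
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 100Mi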

wrong parser for dmesg logs

The dmesg logs have this format:

[    4.124564] AVX2 version of gcm_enc/dec engaged.
[    4.129423] AES CTR mode by8 optimization enabled
[    4.356031] RPC: Registered named UNIX socket transport module.
[    4.361383] RPC: Registered udp transport module.
[    4.365930] RPC: Registered tcp transport module.
[    4.370478] RPC: Registered tcp NFSv4.1 backchannel transport module.

That is not compliant with the current parser:

[PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S
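A parser matching the bracketed-uptime format shown above could look roughly like this (the regex is illustrative; the kernel timestamp is seconds since boot, so no Time_Key is mapped):

[PARSER]
        Name                dmesg
        Format              regex
        Regex               ^\[ *(?<uptime>[0-9]+\.[0-9]+)\] (?<message>.*)$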

Update registry to use ECR public

I noticed when deploying this that the manifests point to Docker Hub instead of ECR Public. Is there a reason for that? Given Docker Hub pull limits and the number of times this agent is pulled, it is probably better to default the image path to the one located here: https://gallery.ecr.aws/cloudwatch-agent/cloudwatch-agent

I can PR the change, but searching the repo I couldn't quite figure out where to make it.

When containerd is used as the container runtime, Fluentd cannot forward logs.

The Fluentd sample configuration assumes Docker-format logs, so it should be changed to support logs when the runtime is containerd.

Docker Log

{
 "log":"2022/12/17 08:03:23 {message}",
 "stream":"stdout", 
 "time":"2021-12-16T23:32:06.226543453Z" 
}

Containerd Log

2021-12-16T23:32:06.226543453Z+09:00 stderr F 2022/12/17 08:03:23 {message}

Docker logs are in JSON format, but containerd logs are single-line text.
Therefore, the @type json parser cannot parse the containerd log.

log

fluentd-cloudwatch-zzcbx fluentd-cloudwatch 2023-01-15 13:10:57 +0000 [warn]: #0 [in_tail_fluentd_logs] pattern not matched: "2023-01-15T13:10:38.87435193Z stdout P 2023-01-15 13:10:38 +0000 [warn]: #0 [in_tail_fluentd_logs] pattern not matched: \"2023-01-15T13:10:17.714317364Z stdout P 2023-01-15 13:10:17 +0000 [warn]: #0 [in_tail_fluentd_logs] pattern not matched: \\\"2023-01-15T13:10:08.118074592Z stdout P 2023-01-15 13:10:08 +0000 [warn]: #0 [in_tail_fluentd_logs] pattern not matched: \\\\\\\"2023-01-15T13:10:02.387905607Z stdout P
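For reference, the containerd (CRI) log line shown above can be parsed in Fluentd with a regexp parser along these lines; this is a sketch, not the configuration shipped in this repository, and the time field may need a matching time_format for your timestamps:

<parse>
  @type regexp
  expression /^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.*)$/
</parse>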

Agent not sending any metrics due k8sapiserver OnStoppedLeading

Hi,

we recently had an issue where the CloudWatch agent on our EKS cluster stopped sending any metrics (Fluentd worked properly). We had a look at the logs of the pod on the leader node and found this:

kubectl logs -n amazon-cloudwatch cloudwatch-agent-h875s 2020-03-14T01:50:50Z I! k8sapiserver OnStoppedLeading: ip-10-71-217-46.eu-central-1.compute.internal

We didn't change anything in this DaemonSet, so the question is: how could this happen, and how can we prevent this behaviour?

After we killed the pod, the new pod started with
2020-03-17T14:12:48Z I! k8sapiserver OnStartedLeading: ip-10-71-217-46.eu-central-1.compute.internal

the metrics were sent to CloudWatch.

Regards,
kirnberger

Workaround for pods not being able to access EC2 IMDS

In my EKS cluster I have disabled Instance Metadata Service (IMDS) v1 and set the IMDS hop limit to 1 to prevent pods from accessing the IMDS. This prevents the cloudwatch-agent DaemonSet from starting, since it relies on the EC2 IMDS.

2020/11/12 15:56:28 I! 2020/11/12 15:56:25 E! ec2metadata is not available
2020/11/12 15:56:25 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
2020/11/12 15:56:26 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Is there a config item or workaround to use this agent without allowing pods to access the EC2 IMDS?
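For reference, one commonly used mitigation (assuming that re-exposing IMDSv2 to pods on the node is acceptable in your environment) is to raise the hop limit on the worker nodes so the containerized agent can reach IMDSv2; the instance ID below is a placeholder:

aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-tokens required \
    --http-put-response-hop-limit 2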

Override Container Insights Metrics

Can you please post the default configuration for the metrics section of the config json so that we can override it? The default is more than we need and the costs are prohibitive.

    {
      "agent": {
        "region": "${var.aws_region}"
      },
      "metrics": {
        ???
      },
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "${var.cluster_name}",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }

It would be nice if there were an exclude option, as that would be easier to configure. I don't need disk or network stats, for example.

0/6 nodes are available: 1 Too many pods, 5 node(s) didn't match Pod's node affinity/selector.

Hi,

I have followed the Container Insights quick start tutorial.

I used the following command:

ClusterName='clusterName'
LogRegion='logRegion'
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off'|| FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's|{{cluster_name}}|'${ClusterName}'|;s|{{region_name}}|'${LogRegion}'|;s|{{http_server_toggle}}|"'${FluentBitHttpServer}'"|;s|{{http_server_port}}|"'${FluentBitHttpPort}'"|;s|{{read_from_head}}|"'${FluentBitReadFromHead}'"|;s|{{read_from_tail}}|"'${FluentBitReadFromTail}'"|' | kubectl apply -f -

It applied successfully; however, the pods are erroring out with the following:

0/6 nodes are available: 1 Too many pods, 5 node(s) didn't match Pod's node affinity/selector.

What am I doing wrong? For reference, I am using EKS with Fargate. I read online about removing the tolerations; however, I'm unsure whether this would fix the issue. I also thought about adding a nodeSelector key, but I'm unsure where exactly to put it in the downloaded .yaml file.

Your assistance would be great!

Thank you.

Unable to Kustomize Container Insights Quickstart Manifest

Hi,

With reference to https://github.com/aws-samples/amazon-cloudwatch-container-insights/blob/master/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml (I'm referencing the latest version, but this seems to affect all releases of this Manifest) I'm trying to patch changes to this Manifest using Kubernetes default tooling (kubectl) and a Kustomization file.

Per the instructions for deploying this Manifest (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html), the file contains eight {{ and }} sequences; their presence breaks the YAML parser within kubectl, meaning I cannot use kubectl kustomize to patch changes into this Manifest without first replacing the {{ and }} sequences using sed.

I appreciate this may be by design, but would it be possible to update the delimiters within this file to allow direct manipulation of it using Kustomize?


Kustomize Version: v4.5.7
Kustomize Error: error: yaml: invalid map key: map[string]interface {}{"cluster_name":""}

Override container insights metrics log group

Context: I'm running cloudwatch agent for container insights on EKS.

Right now, Container Insights metrics are stored in the following CloudWatch Logs log group:

  • /aws/containerinsights/{{eks_cluster_name}}/performance

For purely cosmetic reasons, I would like to change this default to something else.

I found in the official documentation how to customize this for log directives that deal with logs_collected:

log_group_name – Optional. Specifies what to use as the log group name in CloudWatch Logs. As part of the name, you can use {instance_id}, {hostname}, {local_hostname}, and {ip_address} as variables within the name. {hostname} retrieves the hostname from the EC2 metadata, and {local_hostname} uses the hostname from the network configuration file.

Unfortunately, that directive does not seem to work for metrics, nor could I find any examples of it anywhere else.

Any hints on how (and if) it's done?
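For context, the log_group_name directive quoted above applies to entries under logs_collected, for example as in the sketch below (the file path and group name are illustrative); it does not, by itself, rename the Container Insights performance log group:

"logs": {
  "logs_collected": {
    "files": {
      "collect_list": [
        {
          "file_path": "/var/log/messages",
          "log_group_name": "custom-group/{hostname}"
        }
      ]
    }
  }
}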

Is init container still needed?

According to the example yaml, we need to use an init container as a workaround for some issue in the fluentd image.

# Because the image's entrypoint requires to write on /fluentd/etc but we mount configmap there which is read-only,
# this initContainers workaround or other is needed.
# See https://github.com/fluent/fluentd-kubernetes-daemonset/issues/90
initContainers:
  - name: copy-fluentd-config
    image: busybox
    command: ['sh', '-c', 'cp /config-volume/..data/* /fluentd/etc']
    volumeMounts:
      - name: config-volume
        mountPath: /config-volume
      - name: fluentdconf
        mountPath: /fluentd/etc

However, we are not observing any problems without this init container, and based on this issue I think it may have been fixed long ago: fluent/fluentd-kubernetes-daemonset#161

Is this init container still needed? If not, can it be removed from the example yaml?

Missing AWS_EMF_AGENT_ENDPOINT environment variable in example ECS task definition for EMF

I'm deploying my app on ECS and using EMF to publish structured logs and metrics. I'm using the Java client library and the example task definition, but I cannot get my logs to be received by the CloudWatch agent.

After some investigation, I found I'm missing an AWS_EMF_AGENT_ENDPOINT environment variable in the definition. This variable is included in the AWS documentation for using EMF with the CloudWatch agent, so I think it should be included in both examples here as well (EC2 and Fargate) to avoid confusing customers.
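For illustration, the variable would go into the application container's environment in the task definition, roughly as below; the endpoint value is an assumption that the agent sidecar listens on its default EMF port and shares localhost with the application container (as in awsvpc/Fargate networking):

"environment": [
  {
    "name": "AWS_EMF_AGENT_ENDPOINT",
    "value": "tcp://127.0.0.1:25888"
  }
]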

In the DaemonSet yaml the Fluent Bit pod has blanket tolerations, so EKS tries to schedule it on Fargate nodes

Actual

I've installed CloudWatch Container Insights per the DaemonSet QuickStart.

This creates 2 DaemonSets, 1 for the CloudWatch agent and 1 for Fluent Bit. I notice that Kubernetes also tries to schedule the Fluent Bit DaemonSet pods on (pre-existing) Fargate nodes, which obviously does not work (the pods stay pending forever). Interestingly, no pods are created on Fargate nodes for the CloudWatch DaemonSet (which is good).

Expected

Fluent Bit pods are not scheduled on Fargate nodes (as DaemonSets are not supported on Fargate anyway, these pods stay pending forever).

If I want to run FluentBit in Fargate, I'll use a sidecar container for it.

Root cause

At the bottom of DaemonSet QuickStart there's the pod spec for FluentBit that has these tolerations:

      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"

The last 2 tolerations have no key and are thus blanket tolerations, triggering the scheduling on (pre-existing) Fargate nodes.

Note that the pod spec for the CloudWatch agent in the same yaml doesn't have any tolerations.

Clarify usage of ContainerInsights & Prometheus on EKS

Hi, I am using cloudwatch-agent to collect logs and stats for Container Insights. I also want to use the Prometheus integration, but from the samples it seems I have to create an extra deployment just for the Prometheus integration. Is that correct? Can't I run the Prometheus integration as part of my existing DaemonSet? It is unclear to me how to proceed in this scenario, and I would really welcome more clarity.

How to enable multiline parsing on Java stack traces?

Hi team,

Does anyone know the right way of handling Java stack traces?

I have tried the built-in multiline.parser java from the Fluent Bit documentation, but it didn't work. Java stack traces are still being pushed to our OpenSearch index line by line.

I have also logged an AWS Support ticket but they said they don't support it.

Let me know if you have any ideas. Thanks!
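For reference, the built-in Java multiline parser is applied on the tail input rather than as a separate filter; a sketch is below (the tag and path are illustrative, and whether it matches still depends on the actual stack-trace layout):

[INPUT]
        Name              tail
        Tag               application.*
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri, java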

Unable to build the sample Dockerfile for multiple architectures

using this sample:
https://github.com/aws-samples/amazon-cloudwatch-container-insights/blob/master/cloudwatch-agent-dockerfile/Dockerfile

When building it with this command:
docker buildx build --platform linux/amd64,linux/arm64/v8 .

I'm getting this error:
#13 117.7 dpkg: error processing archive amazon-cloudwatch-agent.deb (--install):
#13 117.7 package architecture (amd64) does not match system (arm64)
#13 117.7 Errors were encountered while processing:
#13 117.7 amazon-cloudwatch-agent.deb

Could you assist in resolving it?
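A common pattern for multi-arch builds is to select the package by the TARGETARCH build argument that buildx provides; a sketch is below, where the download URL layout is an assumption and should be checked against the actual Dockerfile:

ARG TARGETARCH
RUN curl -fsSL -o amazon-cloudwatch-agent.deb \
        "https://amazoncloudwatch-agent.s3.amazonaws.com/debian/${TARGETARCH}/latest/amazon-cloudwatch-agent.deb" \
    && dpkg -i amazon-cloudwatch-agent.deb \
    && rm amazon-cloudwatch-agent.deb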

AWS CloudWatch Agent failing on OCP 4.6 cluster

I am trying to set up the AWS CloudWatch agent to collect OCP 4.6 cluster metrics using this link: https://docs.amazonaws.cn/en_us/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-metrics.html

but the pods are failing with the following errors:

[ec2-user@ip-10-0-2-100 ~]$ oc logs cloudwatch-agent-57htk
2021/02/02 21:17:21 I! 2021/02/02 21:17:18 E! ec2metadata is not available
2021/02/02 21:17:18 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
2021/02/02 21:17:19 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/02/02 21:17:20 W! retry [1/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/02/02 21:17:21 W! retry [2/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/02/02 21:17:21 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
I! Detected the instance is OnPrem
2021/02/02 21:17:21 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
2021/02/02 21:17:21 Reading json config file path: /etc/cwagentconfig/..2021_02_02_21_14_55.157039078/cwagentconfig.json ...
2021/02/02 21:17:21 unable to scan config dir /etc/cwagentconfig with error: unable to parse json, error: invalid character '\n' in string literal
No json config files found, please provide config, exit now

2021/02/02 21:17:21 I! Return exit error: exit code=99
2021/02/02 21:17:21 I! there is no json configuration when running translator

Add Helm Chart support

Is Helm chart support on the roadmap?

I am planning to add Container Insights for Amazon EKS as a plugin for AWS CDK:
aws/aws-cdk#7160

It would be great if we could publish a Container Insights Helm chart to the aws/eks-charts repository. Then in AWS CDK we could simply addChart() to install Container Insights for Amazon EKS clusters.
