newrelic / nri-kubernetes


New Relic integration for Kubernetes

Home Page: https://docs.newrelic.com/docs/integrations/kubernetes-integration/get-started/introduction-kubernetes-integration

License: Apache License 2.0


nri-kubernetes's Introduction


New Relic integration for Kubernetes

New Relic's Kubernetes integration gives you full observability into the health and performance of your environment, whether you run Kubernetes on-premises or in the cloud. It gives you visibility into Kubernetes namespaces, deployments, replica sets, nodes, pods, and containers. Metrics are collected from several sources:

  • The kube-state-metrics service provides information about the state of Kubernetes objects such as namespaces, replica sets, deployments, and pods (when they are not in the running state)
  • The kubelet /stats/summary endpoint provides information about network, errors, memory, and CPU usage
  • The kubelet /pods endpoint provides information about the state of running pods and containers
  • The cAdvisor /metrics/cadvisor endpoint provides data that is missing from the previous sources
  • The /metrics endpoints of the control plane components: etcd, controllerManager, apiServer, and scheduler

Check out our documentation to learn how to install and configure the integration, what metrics are captured, and how to query them.

Table of contents

Installation

Start by checking the compatibility and requirements and then follow the installation steps.

For troubleshooting, see Not seeing data or Error messages.

Helm chart

You can install this chart using nri-bundle, located in the helm-charts repository, or directly from this repository by adding the following Helm repository:

helm repo add nri-kubernetes https://newrelic.github.io/nri-kubernetes
helm upgrade --install newrelic-infrastructure nri-kubernetes/newrelic-infrastructure -f your-custom-values.yaml

For further information on the configuration needed for the chart, read the chart's README.

Usage

Learn how to find and use data and review the description of all captured data.

Running the integration against a static data set

Development

Run e2e Tests

  • See e2e/README.md for more details regarding running e2e tests.

Tests

For running unit tests, run

make test

Run local development environment

We use Minikube and Tilt to spawn a local environment that will reload after any change to the charts or the integration code.

Make sure you have these tools installed:

Create a values-local.yaml file from the values-local.yaml.sample using a valid license key and your cluster name.

Start the local environment:

make local-env-start

Notice that local images are built and pushed to the Docker daemon running inside the minikube cluster, since we run eval $(minikube docker-env) before launching Tilt.

Note: when running the local dev environment against a Kubernetes cluster older than v1.21, you will need to remove the apiVersion templating for the CronJob resource and manually set apiVersion: batch/v1beta1. This is because Tilt uses helm template, and helm template doesn't render capabilities: helm/helm#3377.

Running OpenShift locally using CodeReady Containers

  • See OpenShift.md for more details on running local OpenShift environments.

Support

Should you need assistance with New Relic products, you are in good hands with several support diagnostic tools and support channels.

New Relic offers NRDiag, a client-side diagnostic utility that automatically detects common problems with New Relic agents. If NRDiag detects a problem, it suggests troubleshooting steps. NRDiag can also automatically attach troubleshooting data to a New Relic Support ticket.

If the issue has been confirmed as a bug or is a feature request, file a GitHub issue.

Support Channels

Privacy

At New Relic we take your privacy and the security of your information seriously, and are committed to protecting your information. We must emphasize the importance of not sharing personal data in public forums, and ask all users to scrub logs and diagnostic information for sensitive information, whether personal, proprietary, or otherwise.

We define “Personal Data” as any information relating to an identified or identifiable individual, including, for example, your name, phone number, post code or zip code, Device ID, IP address, and email address.

For more information, review New Relic’s General Data Privacy Notice.

Contribute

We encourage your contributions to improve this project! Keep in mind that when you submit your pull request, you'll need to sign the CLA via the click-through using CLA-Assistant. You only have to sign the CLA one time per project.

If you have any questions, or to execute our corporate CLA (which is required if your contribution is on behalf of a company), drop us an email at [email protected].

A note about vulnerabilities

As noted in our security policy, New Relic is committed to the privacy and security of our customers and their data. We believe that providing coordinated disclosure by security researchers and engaging with the security community are important means to achieve our security goals.

If you believe you have found a security vulnerability in this project or any of New Relic's products or websites, we welcome and greatly appreciate you reporting it to New Relic through our bug bounty program.

If you would like to contribute to this project, review these guidelines.

To all contributors, we thank you! Without your contribution, this project would not be what it is today.

License

nri-kubernetes is licensed under the Apache 2.0 License.


nri-kubernetes's Issues

Support for Horizontal Pod Scaling based on custom metrics

Summary

A fork of an AWS CloudWatch adapter already exists.

Desired Behavior

This adapter allows you to scale your Kubernetes deployment using the Horizontal Pod Autoscaler (HPA) with metrics from New Relic.

Possible Solution

Additional context

Windows k8s monitoring, no data, 1.21.0 alpha

Description

Tried the 1.21 alpha. Pods start and run, and we can see the nodes in inventory, but they are not sending back any data.

Steps to Reproduce

Run the 1.21.0 alpha for Windows

Expected Behavior

Data in new relic console

Relevant Logs / Console output

nrwindows.log

Pod output is attached.

Your Environment

Rancher-provisioned cluster running kube 1.17.5. We have New Relic in place and working on the Linux nodes, and KSM is running there.

Additional context

We used the same service account, namespace, and KSM as the existing Linux version.

Move k8s integration Tests

https://fsi-build.pdx.vm.datanerd.us/job/k8s-integration-nri-kubernetes/

This is the most difficult job to migrate.

We are currently using a three-cluster configuration hosted in AWS.

The idea is first to check whether it is possible to move the tests to a "k8s cluster in Docker" that is spun up and destroyed each time the tests run. Some GitHub Actions provide this; otherwise we can check https://kind.sigs.k8s.io/ or, as a last resort, minikube.

This would simplify the workflow, reduce costs, and make it possible to run the tests whenever the pipeline runs.

Should we add support for 1.19?

Support environment variables for setting scrape interval

Is your feature request related to a problem? Please describe.

I'm always frustrated when trying to set the interval for the Windows agent.

Feature Description

I would like to set the interval in nri-kubernetes-definition.yml or nri-kubernetes-definition-windows.yml through an environment variable for the container, like NRIA_LICENSE_KEY.

Describe Alternatives

Being able to edit the files inside the image.

Additional context

Trying to change the interval for calling New Relic.

Priority

Really Want it

Check the correct version is pointed in deployment ymls before releasing

Description

The release pipeline (#50) does not check that the version being released is the one used in the deployment YAMLs. This can cause inconsistencies with the versioning in the S3 bucket.

Expected Behavior

A deployment named newrelic-infrastructure-k8s-1.2.3.yml is expected to deploy version 1.2.3.

PID 1 should be newrelic-infra-service


Description

You are also overriding the base image's CMD /usr/bin/newrelic-infra-service with /usr/bin/newrelic-infra.

This might lead to unexpected behaviour, like broken shutdown-cause reporting or hot-config-reload misbehaviour.

Expected Behavior

Tini should be the entrypoint (see #21) and then CMD /usr/bin/newrelic-infra-service should run right after.

[Repolinter] Open Source Policy Issues

Repolinter Report

🤖This issue was automatically generated by repolinter-action, developed by the Open Source and Developer Advocacy team at New Relic. This issue will be automatically updated or closed when changes are pushed. If you have any problems with this tool, please feel free to open a GitHub issue or give us a ping in #help-opensource.

This Repolinter run generated the following results:

❗ Error: 0, ❌ Fail: 0, ⚠️ Warn: 0, ✅ Pass: 7, Ignored: 0, Total: 7

Passed #

Click to see rules

license-file-exists #

Found file (LICENSE). New Relic requires that all open source projects have an associated license contained within the project. This license must be permissive (e.g. non-viral or copyleft), and we recommend Apache 2.0 for most use cases. For more information please visit https://docs.google.com/document/d/1vML4aY_czsY0URu2yiP3xLAKYufNrKsc7o4kjuegpDw/edit.

readme-file-exists #

Found file (README.md). New Relic requires a README file in all projects. This README should give a general overview of the project, and should point to additional resources (security, contributing, etc.) where developers and users can learn further. For more information please visit https://github.com/newrelic/open-by-default.

readme-starts-with-community-plus-header #

The first 5 lines contain all of the requested patterns. (README.md). The README of a community plus project should have a community plus header at the start of the README. If you already have a community plus header and this rule is failing, your header may be out of date, and you should update your header with the suggested one below. For more information please visit https://opensource.newrelic.com/oss-category/.

readme-contains-link-to-security-policy #

Contains a link to the security policy for this repository (README.md). New Relic recommends putting a link to the open source security policy for your project (https://github.com/newrelic/<repo-name>/security/policy or ../../security/policy) in the README. For an example of this, please see the "a note about vulnerabilities" section of the Open By Default repository. For more information please visit https://nerdlife.datanerd.us/new-relic/security-guidelines-for-publishing-source-code.

readme-contains-discuss-topic #

Contains a link to the appropriate discuss.newrelic.com topic (README.md). New Relic recommends directly linking your appropriate discuss.newrelic.com topic in the README, allowing developers an alternate method of getting support. For more information please visit https://nerdlife.datanerd.us/new-relic/security-guidelines-for-publishing-source-code.

code-of-conduct-should-not-exist-here #

New Relic has moved the CODE_OF_CONDUCT file to a centralized location where it is referenced automatically by every repository in the New Relic organization. Because of this change, any other CODE_OF_CONDUCT file in a repository is now redundant and should be removed. Note that you will need to adjust any links to the local CODE_OF_CONDUCT file in your documentation to point to the central file (README and CONTRIBUTING will probably have links that need updating). For more information please visit https://docs.google.com/document/d/1y644Pwi82kasNP5VPVjDV8rsmkBKclQVHFkz8pwRUtE/view. Did not find a file matching the specified patterns. All files passed this test.

third-party-notices-file-exists #

Found file (THIRD_PARTY_NOTICES.md). A THIRD_PARTY_NOTICES.md file can be present in your repository to grant attribution to all dependencies being used by this project. This document is necessary if you are using third-party source code in your project, with the exception of code referenced outside the project's compiled/bundled binary (ex. some Java projects require modules to be pre-installed in the classpath, outside the project binary and therefore outside the scope of the THIRD_PARTY_NOTICES). Please review your project's dependencies and create a THIRD_PARTY_NOTICES.md file if necessary. For JavaScript projects, you can generate this file using the oss-cli. For more information please visit https://docs.google.com/document/d/1y644Pwi82kasNP5VPVjDV8rsmkBKclQVHFkz8pwRUtE/view.

Add support for node condition metrics from KSM

We can scrape these from KSM.

Notice that the list can vary among vendors, and since it may also change over time, we could be forced to fetch everything and create metrics like "condition.abc=value".

Notice that the value can be either true, false, or unknown. We can stick with those or map them to numerical values such as 0/1/2.

Since we are collapsing 3 Prometheus metrics into 1, we need to deal with possible conflicts caused by bad input data.
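The status-to-number mapping floated above could be sketched like this (a minimal sketch; the false=0/true=1/unknown=2 assignment and the function name are illustrative assumptions, since the issue does not fix an order):

```go
package main

import "fmt"

// conditionValue maps a node condition status, as reported by
// kube-state-metrics, to a numeric value. The assignment false=0, true=1,
// unknown=2 is an assumption for illustration only.
func conditionValue(status string) int {
	switch status {
	case "false":
		return 0
	case "true":
		return 1
	default: // "unknown" or any unexpected input
		return 2
	}
}

func main() {
	for _, s := range []string{"true", "false", "unknown"} {
		fmt.Printf("condition.Ready=%d (from %q)\n", conditionValue(s), s)
	}
}
```

Folding unexpected inputs into the "unknown" bucket is one way to absorb the bad-data conflicts mentioned above without dropping samples.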

Support GKE v1.19

We need to test that our Kubernetes integration supports GKE v1.18 and v1.19; we currently support up to version 1.17 (check the docs).

Tasks to be done:

Put default configuration in /etc/newrelic rather than /var/db/newrelic

Currently, the Dockerfile copies the integration config to /var/db/newrelic-infra/integrations.d/nri-kubernetes-config.yml:

ADD nri-kubernetes-config.yml.sample /var/db/newrelic-infra/integrations.d/nri-kubernetes-config.yml

This makes it hard for the user to override this file, especially when the helm chart is in use. This file could instead be dropped in /etc/newrelic-infra/integrations.d/, where the user is allowed to create (and therefore override) files with the integrations_config property of the helm chart.

Assess compatibility with GKE

For the time being, we believe our Kubernetes monitoring might be compatible with GKE environments, but it has not been thoroughly tested. It would be good to have more insight into this, acknowledge the limitations, and fix any issues that arise in order to offer official support for this cloud provider.

Fix rounding issue for allocatableCPUCores

Description

It appears we are rounding the allocatableCPUCores attribute, and we should not be. I've attached the results in Query Builder and the raw JSON returned by NRDB. It's probably worth sanity-checking related attributes to ensure we're not rounding others as well.


Expected Behavior

I would expect the allocatableCPUCores attribute to reflect what is returned by the kubectl output.

Your Environment

EKS v1.17.12-eks-7684af
New Relic Infra: newrelic/infrastructure-k8s:2.2.0

Non-integer values for CADVISOR_PORT break the integration

Description

At the moment, the integration only supports a plain port number in the CADVISOR_PORT environment variable, as opposed to a full connection URI as the Kubernetes standard dictates.

Adding support for an arbitrarily located cAdvisor can, however, be difficult with the current architecture of the integration.

Expected Behavior

The integration should support the full format of the CADVISOR_PORT env variable or, at least, ignore it if it cannot be handled, rather than blindly appending it to the URL without sanitization:

if port := os.Getenv("CADVISOR_PORT"); port != "" {
	// We force a call to the standalone cAdvisor because k8s < 1.7.6 does not have the /metrics/cadvisor kubelet endpoint.
	e.Scheme = "http"
	e.Host = fmt.Sprintf("%s:%s", c.nodeIP, port)
}
Endpoint logic is not working properly for API_SERVER_ENDPOINT_URL

The logic is currently not working properly: the secureEndpoint is set and never modified.

  • One option would be to change this line in WithEndpointURL:
    component.Endpoint = *url
    to
    component.SecureEndpoint = *url

This should not break anything; it should also work with http endpoints thanks to

component.UseServiceAccountAuthentication = (strings.ToLower(url.Scheme) == "https")
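Put together, the proposed change could look like the following self-contained sketch (the component type and the withEndpointURL function are simplified stand-ins, not the integration's actual types):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// component is a simplified stand-in for the integration's endpoint config.
type component struct {
	SecureEndpoint                  url.URL
	UseServiceAccountAuthentication bool
}

// withEndpointURL applies the user-provided URL to SecureEndpoint (instead
// of Endpoint), enabling service-account authentication only for https.
func withEndpointURL(raw string, c *component) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	c.SecureEndpoint = *u
	c.UseServiceAccountAuthentication = strings.ToLower(u.Scheme) == "https"
	return nil
}

func main() {
	var c component
	if err := withEndpointURL("https://localhost:6443", &c); err != nil {
		panic(err)
	}
	fmt.Println(c.SecureEndpoint.Host, c.UseServiceAccountAuthentication)
}
```

Because the auth flag is derived from the scheme, plain-http endpoints keep working without service-account authentication.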

CRI-O Node Not Reporting

The environment has a 4-node, bare-metal configuration. Two of the nodes are currently running Docker and two have been migrated to CRI-O due to Docker support being sunset (k8s blog). The most recently configured node is reporting, but the first node is not reporting to New Relic.

time="2020-12-17T23:47:38Z" level=error msg="Integration command failed" error="exit status 1" instance=nri-kubernetes integration=com.newrelic.kubernetes prefix=integration/com.newrelic.kubernetes stderr="time=\"2020-12-17T23:47:38Z\" level=warning msg=\"Environment variable NRIA_CACHE_PATH is not set, using default /tmp/nri-kubernetes.json\"\ntime=\"2020-12-17T23:47:38Z\" level=panic msg=\"No data was populated\"\ntime=\"2020-12-17T23:47:38Z\" level=fatal msg=\"No data was populated\"\n" working-dir=/var/db/newrelic-infra/newrelic-integrations

Steps to Reproduce

The steps are straightforward: build out a new control-plane (master) node using CRI-O and add it to the cluster.

Pod Description

...@...:~$ k -n monitoring describe pod nri-bundle-newrelic-infrastructure-9jptt
Name:         nri-bundle-newrelic-infrastructure-9jptt
Namespace:    monitoring
Priority:     0
Node:         <redacted>
Start Time:   Thu, 17 Dec 2020 08:47:57 -0800
Labels:       app=newrelic-infrastructure
              controller-revision-hash=6d9d744546
              mode=privileged
              pod-template-generation=1
              release=nri-bundle
Annotations:  <none>
Status:       Running
IP:           <redacted>
IPs:
  IP:           <redacted>
Controlled By:  DaemonSet/nri-bundle-newrelic-infrastructure
Containers:
  newrelic-infrastructure:
    Container ID:   cri-o://91c8abb18a130b1431c952117f388084422094cd890215a4eca0de9f87d1c55f
    Image:          newrelic/infrastructure-k8s:1.26.6
    Image ID:       docker.io/newrelic/infrastructure-k8s@sha256:e88b843bf175408c9ab846f483763326cefd122247650043738255da812ce53a
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 17 Dec 2020 08:48:30 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  300M
    Requests:
      cpu:     100m
      memory:  150M
    Environment:
      NRIA_LICENSE_KEY:              <set to the key 'license' in secret 'nri-bundle-newrelic-infrastructure-config'>  Optional: false
      CLUSTER_NAME:                  <redacted>
      NRK8S_NODE_NAME:                (v1:spec.nodeName)
      NRIA_DISPLAY_NAME:              (v1:spec.nodeName)
      NRIA_CUSTOM_ATTRIBUTES:        {"clusterName":"$(CLUSTER_NAME)"}
      NRIA_PASSTHROUGH_ENVIRONMENT:  KUBERNETES_SERVICE_HOST,KUBERNETES_SERVICE_PORT,CLUSTER_NAME,CADVISOR_PORT,NRK8S_NODE_NAME,KUBE_STATE_METRICS_URL,KUBE_STATE_METRICS_POD_LABEL,TIMEOUT,ETCD_TLS_SECRET_NAME,ETCD_TLS_SECRET_NAMESPACE,API_SERVER_SECURE_PORT,KUBE_STATE_METRICS_SCHEME,KUBE_STATE_METRICS_PORT,SCHEDULER_ENDPOINT_URL,ETCD_ENDPOINT_URL,CONTROLLER_MANAGER_ENDPOINT_URL,API_SERVER_ENDPOINT_URL,DISABLE_KUBE_STATE_METRICS,DISCOVERY_CACHE_TTL
    Mounts:
      /dev from dev (rw)
      /host from host-volume (ro)
      /var/log from log (rw)
      /var/run/docker.sock from host-docker-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nri-bundle-newrelic-infrastructure-token-cgg6w (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  host-docker-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:
  log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  host-volume:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  nri-bundle-newrelic-infrastructure-token-cgg6w:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nri-bundle-newrelic-infrastructure-token-cgg6w
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:          <none>

Your Environment

...@...:~$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:57:36Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

...@...:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Load Balancer IP missing from K8sServiceSample

Description

Our docs show that we collect the loadBalancerIP for Services; however, I do not see this data present in the K8sServiceSample.


Expected Behavior

I'm expecting to see a Load Balancer IP for my NGINX Ingress controller service.

Steps to Reproduce

N/A

Your Environment

EKS v1.17.12-eks-7684af
New Relic Infra: newrelic/infrastructure-k8s:2.2.0

[Fargate] Update nri-kubernetes binary to be compatible to fargate

The aim of this task is to change the nri-kubernetes binary so that, if manually injected, it does not break the experience.

  • The data coming from Fargate should be labeled something like "serverless", and the data coming from EC2 something like "standard" (check whether nri-docker already does the same for ECS Fargate and use the same approach).
    (We are currently already tagging the data with this info.)

Investigate whether it is possible to merge the nodes into "fake" ones corresponding to each Fargate namespace

  • We need an option to enable/disable reporting the nodes (likely NRIA_IS_FORWARD_ONLY). Does it make sense to create an additional image, or should we simply inject the new environment variables?
  • Clarify what happens to the UI with no node being reported

Wrong containerId with systemd-managed containers

With the systemd cgroup driver, Docker container IDs have a different format than "generic" Docker IDs, so the integration reports container identifiers that don't match those from other components.

Example:
with systemd: /docker/ae17ce6dcd2f27905cedf80609044290eccd98115b4e1ded08fcf6852cf939ae/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod13118b761000f8fe2c4662d5f32d9532.slice/crio-ebccdd64bb3ef5dfa9d9b167cb5e30f9b696c2694fb7e0783af5575c28be3d1b.scope

without systemd: /docker/d44b560aba016229fd4f87a33bf81e8eaf6c81932a0623530456e8f80f9675ad/kubepods/besteffort/pod6edbcc6c66e4b5af53005f91bf0bc1fd/7588a02459ef3166ba043c5a605c9ce65e4dd250d7ee40428a28d806c4116e97

Expected Behavior

Container IDs reported in K8s events with the systemd driver should equal the IDs coming from Docker without systemd.

We need to modify the container-ID parsing code to account for this difference.
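A sketch of parsing code that handles both cgroup-path formats from the example above (the regex and the parseContainerID name are illustrative, not the integration's actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// containerIDRe matches a 64-character hex container ID at the end of a
// cgroup path, optionally wrapped in a systemd scope unit name such as
// "crio-<id>.scope" or "docker-<id>.scope".
var containerIDRe = regexp.MustCompile(`(?:[a-z]+-)?([0-9a-f]{64})(?:\.scope)?$`)

// parseContainerID extracts the bare container ID regardless of whether the
// systemd or the cgroupfs driver produced the path.
func parseContainerID(cgroupPath string) (string, bool) {
	m := containerIDRe.FindStringSubmatch(cgroupPath)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Tail segments of the two example paths from this issue.
	systemd := "crio-ebccdd64bb3ef5dfa9d9b167cb5e30f9b696c2694fb7e0783af5575c28be3d1b.scope"
	plain := "7588a02459ef3166ba043c5a605c9ce65e4dd250d7ee40428a28d806c4116e97"
	for _, p := range []string{systemd, plain} {
		id, ok := parseContainerID(p)
		fmt.Println(ok, id)
	}
}
```

Normalizing to the bare 64-hex ID at parse time would let events from systemd-driver nodes line up with the IDs other components report.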

1.20/1.21 Support

Matrix for compat testing:

Kubernetes versions:

  • 1.20
  • 1.21

Projects:

[Fargate] Automatically inject the sidecar

This is the most complex feature to be added; we could split it into two or three parts, and each point should be parallelizable.

Manual injection of the sidecar would not work in many scenarios, so we need a way to inject such a sidecar automatically.

  • Find a solution to inject the container only into Fargate namespaces, following the user's label selection.
  • Check annotation possibly specifying not to add a sidecar moved to newrelic/newrelic-infra-operator#42
  • Find a solution to create and manage certificates
  • Inject secrets needed to configure the agent in each fargate managed namespace for the license
  • Investigate how to let the container use a working service account (likely by creating a rolebinding for each existing SA) in the least intrusive way possible. We would like to remove the old bindings once they are no longer used
  • Support dryrun Moved to newrelic/newrelic-infra-operator#2

Keep in mind possible permission escalation: we do not want to reduce the permissions of any SA the customer is using.

If the injection fails, the user should still be able to carry on deploying; however, the issue should be visible somewhere.

[Fargate] Document Automatic injection

  • Specify the requirements and the possibility to do it manually if needed
  • Clearly point to the required permissions added when injecting the sidecar
  • Explain how to garbage collect resources
  • Explain that the solution, when failing, would NOT be a blocker for deployment
  • Explain in the FAQ the hybrid solution

[Fargate] Document Manual injection

In case users do not want the sidecar injected automatically, there should be clear documentation showing how to do it manually.

  • The outcome should be identical to the solution with the automatic injection on.
  • Explain in the FAQ the hybrid solution

Moreover, this could also be useful to double-check the user experience, considering that the only difference should be how the pod was injected.

Need to pull resourcequota metrics from KSM

Is your feature request related to a problem? Please describe.

No

Feature Description

We need to get resource quota metrics from KSM. This is useful for ops teams to monitor resource usage against the enforced limits.

Describe Alternatives

We would need to use Prometheus and then export to NR.

Priority

Really Want

k8s-stg/night: Automate nightly builds for nri-kubernetes

We should automate the generation of nightly builds for nri-kubernetes.
These builds should include:

  • The latest pre-release of nri-kubernetes
  • The latest pre-release of all of the integrations that are present in the bundle

In order to achieve this, we can follow two approaches:

  1. Implement nightly builds on nri-kubernetes repo, using newrelic/infrastructure-bundle-nightly as a base
  2. Deprecate infrastructure-k8s and bundle nri-kubernetes in the infrastructure bundle, thus reusing the logic already present there

Capture the K8s node ready status condition

Feature Description

As an engineer responsible for the health of Kubernetes clusters, I need to make sure the cluster worker nodes are healthy so that they can schedule containers. If too many are NotReady, it is a strong indication that the cluster is not healthy.

Additional context

I would like to capture the k8s node status as an additional attribute in the K8sNodeSample event type (see attached screenshot).

It looks like the Ready status.condition can be True or False:

k get no ip-10-215-133-9.us-east-2.compute.internal -o json | jq -r '.status.conditions[] | select( .type == "Ready") | .status'
True

Priority

Really Want

[Fargate] Monitor control plane in EKS/Fargate

Investigate what can be scraped regarding the control plane, clarifying whether the current implementation of nri-kubernetes is capable of doing so.

  • In case we cannot scrape it, modify nri-kubernetes to do so. Keep in mind we want to avoid scraping duplicate data

Then once possible:

  • Modify manifest/chart configuration to help retrieving such data in Fargate
  • Check if is there any overlap in the hybrid solution.

Research most common conditions for Node Status

The objective is to create a list covering all the vendors we support (keeping in mind that conditions can change between k8s versions).

We could possibly include extra conditions added by common tools, e.g. Node Problem Detector.

This is useful for Docs, FSI, Alerting (and us)

Allow namespacing discovery operations

Is your feature request related to a problem? Please describe.

The performance impact of non-namespaced resource listings scales linearly with the size of a K8s cluster, being particularly heavy in the etcd component of the control plane.

Currently, we need to perform several listings in order to discover the location of some deployments, notably KSM. We could make this discovery lighter in the API server by limiting resource listing to a single namespace.

Feature Description

A new config option, ksmNamespace, could be added. We could prepopulate this variable in the helm chart from the namespace where the solution is deployed (when the ksm dependency is enabled), and also let the user override it.

This would require changes in the Kubernetes interface (src/client/client.go:22) and all its usages.

CRI-O based Nodes Log Container Runtime Errors

The environment has a 4 node, bare-metal configuration. Two of the nodes are currently running Docker and two have been migrated to CRI-O due to Docker support being sunset starting in Kubernetes version 1.20 (k8s blog).

Every 10 minutes

On the nodes running CRI-O as the container engine, the following log messages are printed every ten minutes.

time="2020-12-18T00:18:18Z" level=error msg="debug error" component=Agent error="unable to determine open descriptor count for agent: open /host/proc/7/fd: no such file or directory"
...
time="2020-12-18T00:28:18Z" level=error msg="debug error" component=Agent error="unable to determine open descriptor count for agent: open /host/proc/7/fd: no such file or directory"

Every 20 Seconds

time="2020-12-18T00:28:18Z" level=warning msg="instantiating docker sampler process decorator" component="Metrics Process" error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time="2020-12-18T00:28:38Z" level=warning msg="instantiating docker sampler process decorator" component="Metrics Process" error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time="2020-12-18T00:28:58Z" level=warning msg="instantiating docker sampler process decorator" component="Metrics Process" error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

Expected Behavior

  1. Log messages should provide actionable details so that someone debugging can determine how to address the issue.
  2. The messages should not occur at all.

Steps to Reproduce

Run CRI-O as your default container engine in a bare-metal environment.

Your Environment

...@...:~$ k -n monitoring describe pod nri-bundle-newrelic-infrastructure-9jptt
Name:         nri-bundle-newrelic-infrastructure-9jptt
Namespace:    monitoring
Priority:     0
Node:         <redacted>
Start Time:   Thu, 17 Dec 2020 08:47:57 -0800
Labels:       app=newrelic-infrastructure
              controller-revision-hash=6d9d744546
              mode=privileged
              pod-template-generation=1
              release=nri-bundle
Annotations:  <none>
Status:       Running
IP:           <redacted>
IPs:
  IP:           <redacted>
Controlled By:  DaemonSet/nri-bundle-newrelic-infrastructure
Containers:
  newrelic-infrastructure:
    Container ID:   cri-o://91c8abb18a130b1431c952117f388084422094cd890215a4eca0de9f87d1c55f
    Image:          newrelic/infrastructure-k8s:1.26.6
    Image ID:       docker.io/newrelic/infrastructure-k8s@sha256:e88b843bf175408c9ab846f483763326cefd122247650043738255da812ce53a
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 17 Dec 2020 08:48:30 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  300M
    Requests:
      cpu:     100m
      memory:  150M
    Environment:
      NRIA_LICENSE_KEY:              <set to the key 'license' in secret 'nri-bundle-newrelic-infrastructure-config'>  Optional: false
      CLUSTER_NAME:                  <redacted>
      NRK8S_NODE_NAME:                (v1:spec.nodeName)
      NRIA_DISPLAY_NAME:              (v1:spec.nodeName)
      NRIA_CUSTOM_ATTRIBUTES:        {"clusterName":"$(CLUSTER_NAME)"}
      NRIA_PASSTHROUGH_ENVIRONMENT:  KUBERNETES_SERVICE_HOST,KUBERNETES_SERVICE_PORT,CLUSTER_NAME,CADVISOR_PORT,NRK8S_NODE_NAME,KUBE_STATE_METRICS_URL,KUBE_STATE_METRICS_POD_LABEL,TIMEOUT,ETCD_TLS_SECRET_NAME,ETCD_TLS_SECRET_NAMESPACE,API_SERVER_SECURE_PORT,KUBE_STATE_METRICS_SCHEME,KUBE_STATE_METRICS_PORT,SCHEDULER_ENDPOINT_URL,ETCD_ENDPOINT_URL,CONTROLLER_MANAGER_ENDPOINT_URL,API_SERVER_ENDPOINT_URL,DISABLE_KUBE_STATE_METRICS,DISCOVERY_CACHE_TTL
    Mounts:
      /dev from dev (rw)
      /host from host-volume (ro)
      /var/log from log (rw)
      /var/run/docker.sock from host-docker-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nri-bundle-newrelic-infrastructure-token-cgg6w (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  host-docker-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:
  log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  host-volume:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  nri-bundle-newrelic-infrastructure-token-cgg6w:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nri-bundle-newrelic-infrastructure-token-cgg6w
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:          <none>

...@...:~$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:57:36Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

...@...:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

setting APIServerEndpointURL does not change the endpoint being scraped

Description

Setting API_SERVER_ENDPOINT_URL configures the insecure endpoint, but when the integration scrapes the API server it still starts with the default secure endpoint, localhost:443, and only falls back to the endpoint set in API_SERVER_ENDPOINT_URL if that first attempt fails.
If the first attempt against localhost:443 does not fail because that endpoint responds with something (this happened in an OpenShift 4.6 CodeReady container), the fallback mechanism never gets activated and the endpoint that was configured never gets scraped.

Expected Behavior

When I set API_SERVER_ENDPOINT_URL, that endpoint should be scraped directly, without going through a fallback mechanism.
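The expected selection logic can be sketched as follows. This is a minimal, hypothetical helper (`apiServerEndpoints` and the default chain are assumptions, not the integration's real code): a user-configured URL should be the only endpoint tried, and the localhost defaults should apply only when nothing is configured.

```go
package main

import "fmt"

// apiServerEndpoints returns, in order, the endpoints the scraper should
// try. Expected behavior: a configured API_SERVER_ENDPOINT_URL wins
// unconditionally, with no fallback chain, so the endpoint the user set
// is always the one scraped.
func apiServerEndpoints(configuredURL string) []string {
	if configuredURL != "" {
		// Configured endpoint is tried first and alone.
		return []string{configuredURL}
	}
	// No override configured: probe defaults (this chain is an
	// assumption for illustration, not the integration's actual one).
	return []string{"https://localhost:443", "http://localhost:8080"}
}

func main() {
	fmt.Println(apiServerEndpoints("http://localhost:8080"))
	fmt.Println(apiServerEndpoints(""))
}
```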

Troubleshooting or NR Diag results

Steps to Reproduce

Set API_SERVER_ENDPOINT_URL, enable verbose logging, and search the logs for the api-server job: they show that localhost:443 is being scraped in the first place.

Your Environment

CodeReady Containers, OpenShift 4.6

Additional context

Multiple KSM supported

Is your feature request related to a problem? Please describe.

Currently, if the node where KSM is running goes down, or the KSM pod itself goes down, a data gap is experienced that lasts discoveryCacheTTL seconds.

This is mainly caused by the leader election process used to avoid scraping the same data multiple times.

Feature Description

We should be able to support multiple instances of KSM, or at least discover the single instance more quickly.

Currently the only way to mitigate the issue is to reduce discoveryCacheTTL.

Describe Alternatives

We could either:

  • support multiple KSM instances without duplicating data
  • instead of polling for pods with the KSM label, watch for events and react immediately rather than waiting for discoveryCacheTTL
  • using the KSM Service instead of discovering individual pods could also help mitigate the issue

Review DistributedKubeStateMetrics flag

Priority

[Must Have]

[Fargate] Benchmark Load on API

  • On Fargate, nodes are not capable of scraping themselves, so they have to use a unique entry point (the proxy).
    This could lead to issues; please investigate whether that is a concern and what the possible workarounds are (cache, proxy, ...).

  • Could scraping KSM be an issue from the memory/CPU point of view?

Inconsistent Errors in Different Environments

Hello,

We are running the limited release New Relic Kubernetes integration for Windows on our EKS clusters in AWS across several environments.

Description

In several of our environments it works cleanly. In another environment, on one cluster in a given account, it seems to work on one of the two Windows nodes but not the other. On another cluster in the same New Relic account and environment, it seems not to work on any of the 5 Windows nodes.

Steps to Reproduce

Simply deploy the DaemonSet.

Expected Behavior

No error messages or failures.

Relevant Logs / Console output

When the integration fails we receive these errors consistently:

time="2020-08-07T10:46:41-07:00" level=error msg="Integration command failed" error="exit status 1" instance=nri-kubernetes integration=com.newrelic.kubernetes prefix=integration/com.newrelic.kubernetes stderr="time="2020-08-07T10:46:31-07:00" level=warning msg="Cache file (c:\\var\\cache\\nr-kubernetes\\infra-sdk-cache.json) is older than 1m0s, skipping loading from disk."\ntime="2020-08-07T10:46:41-07:00" level=panic msg="No data was populated"\ntime="2020-08-07T10:46:41-07:00" level=fatal msg="No data was populated"\n" working-dir="C:\Program Files\New Relic\newrelic-infra\newrelic-integrations"

Your Environment

Running Kubernetes 1.16 on EKS.

Additional context

We would like to troubleshoot the issue and narrow down how we can address it.

Thanks!

High Memory Usage on Node that Kube-State-Metrics is Deployed to

Description

We are deploying NRI using the NRI Bundle Helm chart, which deploys kube-state-metrics and NRI with nri-kubernetes. By default the chart has the memory limit set to 300Mi, which we have increased to 500Mi. Unfortunately even that isn't enough: the NRI pod running on the node that kube-state-metrics is running on keeps using much more memory and ends up getting OOM killed.

You can see here the difference between two running instances, where the high-memory pod is on the node that kube-state-metrics is running on.

newrelic-bundle-newrelic-infrastructure-l2x86               4m           27Mi
newrelic-bundle-newrelic-infrastructure-l8lkv               435m         444Mi

This shows they are running on the same node and the NRI pod keeps getting OOM killed.

newrelic-bundle-newrelic-infrastructure-l8lkv               1/1     Running     1349       6d      10.182.2.31       stg-kw3-c1-09   <none>           <none>
newrelic-bundle-kube-state-metrics-6bdb969776-zrmwd         1/1     Running     0          6d      192.168.108.12    stg-kw3-c1-09   <none>           <none>

Expected Behavior

Either not using so much memory, or a different way to run the NRI and kube-state-metrics pods on targeted nodes. (This would mean an issue on the chart repo, so if this ends up just needing an architecture change I can log an issue there.)

Your Environment

Image: newrelic/infrastructure-k8s:1.26.1
Nodes: 52
K8s Version: v1.17.9
