
This is a place for various problem detectors running on the Kubernetes nodes.

License: Apache License 2.0

Makefile 2.27% Shell 3.43% Go 93.43% Dockerfile 0.76% Batchfile 0.06% PowerShell 0.05%

node-problem-detector's Introduction

node-problem-detector


node-problem-detector aims to make various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems, and reports them to the apiserver. node-problem-detector can run either as a DaemonSet or standalone. It currently runs as a Kubernetes addon enabled by default in GKE clusters, and it is also enabled by default in AKS as part of the AKS Linux Extension.

Background

There are many node problems that could affect the pods running on the node, such as:

  • Infrastructure daemon issues: ntp service down;
  • Hardware issues: Bad CPU, memory or disk;
  • Kernel issues: Kernel deadlock, corrupted file system;
  • Container runtime issues: Unresponsive runtime daemon;
  • ...

Currently, these problems are invisible to the upstream layers in the cluster management stack, so Kubernetes will continue scheduling pods to the bad nodes.

To solve this problem, we introduced this new daemon node-problem-detector to collect node problems from various daemons and make them visible to the upstream layers. Once upstream layers have visibility to those problems, we can discuss the remedy system.

Problem API

node-problem-detector uses Event and NodeCondition to report problems to the apiserver.

  • NodeCondition: A permanent problem that makes the node unavailable for pods should be reported as a NodeCondition.
  • Event: A temporary problem that has limited impact on pods but is informative should be reported as an Event.
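Once node-problem-detector is running, you can inspect what it reports with kubectl; a quick sketch (<node-name> is a placeholder):

# Permanent problems surface as conditions in the node status:
kubectl get node <node-name> -o jsonpath='{.status.conditions}'
# Temporary problems surface as events attached to the node:
kubectl get events --field-selector involvedObject.kind=Node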

Problem Daemon

A problem daemon is a sub-daemon of node-problem-detector. It monitors specific kinds of node problems and reports them to node-problem-detector.

A problem daemon could be:

  • A tiny daemon designed for dedicated Kubernetes use-cases.
  • An existing node health monitoring daemon integrated with node-problem-detector.

Currently, each problem daemon runs as a goroutine in the node-problem-detector binary. In the future, we'll separate node-problem-detector and the problem daemons into different containers and compose them with a pod specification.

Each category of problem daemon can be disabled at compile time by setting the corresponding build tags. If a category is disabled, all of its build dependencies, global variables, and background goroutines are trimmed out of the compiled executable.

List of supported problem daemon types:

| Problem Daemon Type | NodeCondition | Description | Configs | Disabling Build Tag |
|---|---|---|---|---|
| SystemLogMonitor | KernelDeadlock, ReadonlyFilesystem, FrequentKubeletRestart, FrequentDockerRestart, FrequentContainerdRestart | A system log monitor monitors system logs and reports problems and metrics according to predefined rules. | filelog, kmsg, kernel, abrt, systemd | disable_system_log_monitor |
| SystemStatsMonitor | None (could be added in the future) | A system stats monitor for node-problem-detector to collect various health-related system stats as metrics. See the proposal here. | system-stats-monitor | disable_system_stats_monitor |
| CustomPluginMonitor | On-demand (according to user configuration); existing example: NTPProblem | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user-defined check scripts. See the proposal here. | example | disable_custom_plugin_monitor |
| HealthChecker | KubeletUnhealthy, ContainerRuntimeUnhealthy | A health checker for node-problem-detector to check kubelet and container runtime health. | kubelet, docker, containerd | |

Exporter

An exporter is a component of node-problem-detector. It reports node problems and/or metrics to certain backends. Some of them can be disabled at compile-time using a build tag. List of supported exporters:

| Exporter | Description | Disabling Build Tag |
|---|---|---|
| Kubernetes exporter | Reports node problems to the Kubernetes API server: temporary problems are reported as Events, and permanent problems are reported as Node Conditions. | |
| Prometheus exporter | Reports node problems and metrics locally as Prometheus metrics. | |
| Stackdriver exporter | Reports node problems and metrics to the Stackdriver Monitoring API. | disable_stackdriver_exporter |

Usage

Flags

  • --version: Print current version of node-problem-detector.
  • --hostname-override: A customized node name used by node-problem-detector to update conditions and emit events. node-problem-detector gets the node name first from hostname-override, then from the NODE_NAME environment variable, and finally falls back to os.Hostname.

For System Log Monitor

  • --config.system-log-monitor: List of paths to system log monitor configuration files, comma-separated, e.g. config/kernel-monitor.json. Node problem detector will start a separate log monitor for each configuration. You can use different log monitors to monitor different system logs.
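For example, to run separate monitors for the kernel log and the docker log (a sketch; config/docker-monitor.json ships with the repo, but adjust the paths to your layout):

# One log monitor per config file, comma-separated:
node-problem-detector --config.system-log-monitor=config/kernel-monitor.json,config/docker-monitor.json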

For System Stats Monitor

  • --config.system-stats-monitor: List of paths to system stats monitor config files, comma-separated, e.g. config/system-stats-monitor.json. Node problem detector will start a separate system stats monitor for each configuration. You can use different system stats monitors to monitor different problem-related system stats.

For Custom Plugin Monitor

  • --config.custom-plugin-monitor: List of paths to custom plugin monitor config files, comma-separated, e.g. config/custom-plugin-monitor.json. Node problem detector will start a separate custom plugin monitor for each configuration. You can use different custom plugin monitors to monitor different node problems.
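As a rough sketch of what such a config looks like (modeled on config/custom-plugin-monitor.json; the field values and the check_ntp.sh path are illustrative, so treat the shipped examples as authoritative):

cat > my-plugin-monitor.json <<'EOF'
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s"
  },
  "source": "ntp-custom-plugin-monitor",
  "conditions": [
    {
      "type": "NTPProblem",
      "reason": "NTPIsUp",
      "message": "ntp service is up"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "NTPProblem",
      "reason": "NTPIsDown",
      "path": "./config/plugin/check_ntp.sh",
      "timeout": "3s"
    }
  ]
}
EOF
# The check script signals its result via exit code (0 = healthy, non-zero = problem):
node-problem-detector --config.custom-plugin-monitor=my-plugin-monitor.json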

For Health Checkers

Health checkers are configured as custom plugins, using the config/health-checker-*.json config files.
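For example, to enable the kubelet health checker (a sketch; health-checker-kubelet.json is one of the health-checker-*.json files shipped under config/):

node-problem-detector --config.custom-plugin-monitor=config/health-checker-kubelet.json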

For Kubernetes exporter

  • --enable-k8s-exporter: Enables reporting to the Kubernetes API server; defaults to true.
  • --apiserver-override: A URI used to customize how node-problem-detector connects to the apiserver. This is ignored if --enable-k8s-exporter is false. The format is the same as the source flag of Heapster. For example, to run without auth, use the following config:
    http://APISERVER_IP:APISERVER_PORT?inClusterConfig=false
    
    Refer to heapster docs for a complete list of available options.
  • --address: The address to bind the node problem detector server.
  • --port: The port to bind the node problem detector server. Use 0 to disable.

For Prometheus exporter

  • --prometheus-address: The address to bind the Prometheus scrape endpoint; defaults to 127.0.0.1.
  • --prometheus-port: The port to bind the Prometheus scrape endpoint; defaults to 20257. Use 0 to disable.
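Once running, you can verify the endpoint with a plain HTTP GET, e.g.:

# Scrape NPD's Prometheus metrics on the node (using the defaults above):
curl http://127.0.0.1:20257/metrics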

For Stackdriver exporter

Deprecated Flags

  • --system-log-monitors: List of paths to system log monitor config files, comma-separated. This option is deprecated, replaced by --config.system-log-monitor, and will be removed. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.

  • --custom-plugin-monitors: List of paths to custom plugin monitor config files, comma-separated. This option is deprecated, replaced by --config.custom-plugin-monitor, and will be removed. NPD will panic if both --custom-plugin-monitors and --config.custom-plugin-monitor are set.

Build Image

  • Install development dependencies for libsystemd and the ARM GCC toolchain

    • Debian/Ubuntu: apt install libsystemd-dev gcc-aarch64-linux-gnu
  • git clone git@github.com:kubernetes/node-problem-detector.git

  • Run make in the top directory. It will:

    • Build the binary.
    • Build the docker image. The binary and config/ are copied into the docker image.

If you do not need certain categories of problem daemons, you can disable them at compilation time. This is the best way to keep the node-problem-detector runtime compact, without unnecessary code (e.g. global variables, goroutines, etc.). You can do so by setting the BUILD_TAGS environment variable before running make. For example:

BUILD_TAGS="disable_custom_plugin_monitor disable_system_stats_monitor" make

The above command compiles node-problem-detector without the Custom Plugin Monitor and the System Stats Monitor. Check out the Problem Daemon section to see how to disable each problem daemon at compilation time.

Push Image

make push uploads the docker image to a registry. By default, the image will be uploaded to staging-k8s.gcr.io. It's easy to modify the Makefile to push the image to another registry.
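For example, a minimal sketch (assuming the Makefile exposes the registry as a REGISTRY variable; check the Makefile for the actual variable name):

# Push the image to a custom registry (variable name is an assumption):
REGISTRY=gcr.io/my-project make push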

Installation

The easiest way to install node-problem-detector into your cluster is to use the Helm chart:

helm repo add deliveryhero https://charts.deliveryhero.io/
helm install --generate-name deliveryhero/node-problem-detector

Alternatively, to install node-problem-detector manually:

  1. Edit node-problem-detector.yaml to fit your environment. Set log volume to your system log directory (used by SystemLogMonitor). You can use a ConfigMap to overwrite the config directory inside the pod.

  2. Edit node-problem-detector-config.yaml to configure node-problem-detector.

  3. Edit rbac.yaml to fit your environment.

  4. Create the ServiceAccount and ClusterRoleBinding with kubectl create -f rbac.yaml.

  5. Create the ConfigMap with kubectl create -f node-problem-detector-config.yaml.

  6. Create the DaemonSet with kubectl create -f node-problem-detector.yaml.
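Putting steps 4-6 together, a minimal sketch (the label selector is an assumption; check the manifest for the actual labels):

kubectl create -f rbac.yaml
kubectl create -f node-problem-detector-config.yaml
kubectl create -f node-problem-detector.yaml
# Verify the DaemonSet pods come up:
kubectl get pods -l app=node-problem-detector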

Start Standalone

To run node-problem-detector standalone, set inClusterConfig to false and teach node-problem-detector how to access the apiserver with apiserver-override.

To run node-problem-detector standalone with an insecure apiserver connection:

node-problem-detector --apiserver-override=http://APISERVER_IP:APISERVER_INSECURE_PORT?inClusterConfig=false

For more scenarios, see here.

Windows

Node Problem Detector has preliminary support for Windows. Most of the functionality has not been tested, but the filelog plugin works.

Follow Issue #461 for the development status of Windows support.

Development

To develop NPD on Windows you'll need to set up your Windows machine for Go development and install the necessary build tools.

# Run these commands in the node-problem-detector directory.

# Build in MINGW64 Window
make clean ENABLE_JOURNALD=0 build-binaries

# Test in MINGW64 Window
make test

# Run with containerd log monitoring enabled in Command Prompt. (Assumes containerd is installed.)
%CD%\output\windows_amd64\bin\node-problem-detector.exe --logtostderr --enable-k8s-exporter=false --config.system-log-monitor=%CD%\config\windows-containerd-monitor-filelog.json --config.system-stats-monitor=config\windows-system-stats-monitor.json

# Configure NPD to run as a Windows Service
sc.exe create NodeProblemDetector binpath= "%CD%\node-problem-detector.exe [FLAGS]" start= demand
sc.exe failure NodeProblemDetector reset= 0 actions= restart/10000
sc.exe start NodeProblemDetector

Try It Out

You can try node-problem-detector in a running cluster by injecting messages into the logs that node-problem-detector is watching. For example, let's assume node-problem-detector is using KernelMonitor. On your workstation, run kubectl get events -w. On the node, run sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg". You should then see the KernelOops event.

When adding new rules or developing node-problem-detector, it is probably easier to test on your local workstation in standalone mode. For the API server, an easy way is to use kubectl proxy to make a running cluster's API server available locally. You will get some errors because your local workstation is not recognized by the API server, but you should still be able to test your new rules regardless.

For example, to test KernelMonitor rules:

  1. make (build node-problem-detector locally)
  2. kubectl proxy --port=8080 (make a running cluster's API server available locally)
  3. Update KernelMonitor's logPath to your local kernel log directory. For example, on some Linux systems, it is /run/log/journal instead of /var/log/journal.
  4. ./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --config.system-log-monitor=config/kernel-monitor.json --config.system-stats-monitor=config/system-stats-monitor.json --port=20256 --prometheus-port=20257 (or point to any API server address:port and Prometheus port)
  5. sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
  6. You can see KernelOops event in the node-problem-detector log.
  7. sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"
  8. You can see DockerHung event and condition in the node-problem-detector log.
  9. You can see DockerHung condition at http://127.0.0.1:20256/conditions.
  10. You can see disk-related system metrics in Prometheus format at http://127.0.0.1:20257/metrics.

Note:

  • You can see more rule examples under test/kernel_log_generator/problems.
  • For KernelMonitor message injection, all messages should have the kernel: prefix (note that there is a space after the colon); alternatively, use generator.sh.
  • To inject other logs into journald like systemd logs, use echo 'Some systemd message' | systemd-cat -t systemd.

Dependency Management

node-problem-detector uses Go modules to manage dependencies, so building it requires Go 1.11+. It still uses vendoring; see the Kubernetes go modules KEP for the design decisions. To add a new dependency, update go.mod and run go mod vendor.
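For example, a typical dependency bump looks like this (a sketch; the module path and version are placeholders):

# Add or bump a dependency (placeholder module path/version):
go get example.com/some/module@v1.2.3
# Sync go.mod/go.sum and refresh the vendor/ directory:
go mod tidy
go mod vendor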

Remedy Systems

A remedy system is a process or set of processes designed to attempt to remedy problems detected by node-problem-detector. Remedy systems observe events and/or node conditions emitted by node-problem-detector and take action to return the Kubernetes cluster to a healthy state. The following remedy systems exist:

  • Draino automatically drains Kubernetes nodes based on labels and node conditions. Nodes that match all of the supplied labels and any of the supplied node conditions will be prevented from accepting new pods (aka 'cordoned') immediately, and drained after a configurable time. Draino can be used in conjunction with the Cluster Autoscaler to automatically terminate drained nodes. Refer to this issue for an example production use case for Draino.
  • Descheduler strategy RemovePodsViolatingNodeTaints evicts pods violating NoSchedule taints on nodes. The k8s scheduler's TaintNodesByCondition feature must be enabled. The Cluster Autoscaler can be used to automatically terminate drained nodes.
  • mediK8S is an umbrella project for an automatic remediation system built on the Node Health Check Operator (NHC), which monitors node conditions and delegates remediation to external remediators using the Remediation API. Poison Pill is a remediator that will reboot the node and make sure all stateful workloads are rescheduled. NHC supports conditionally remediating only when the cluster has enough healthy capacity, and manually pausing any action to minimize cluster disruption.
  • MachineHealthChecks of Cluster API are responsible for remediating unhealthy Machines.

Testing

NPD is tested via unit tests, NPD e2e tests, Kubernetes e2e tests and Kubernetes nodes e2e tests. Prow handles the pre-submit tests and CI tests.

CI test results can be found below:

  1. Unit tests
  2. NPD e2e tests
  3. Kubernetes e2e tests
  4. Kubernetes nodes e2e tests

Running tests

Unit tests are run via make test.

See NPD e2e test documentation for how to set up and run NPD e2e tests.

Problem Maker

Problem maker is a program used in NPD e2e tests to generate/simulate node problems. It is ONLY intended to be used by NPD e2e tests. Please do NOT run it on your workstation, as it could cause real node problems.

Compatibility

Node problem detector's architecture has been fairly stable. Recent versions (v0.8.13+) should work with any supported Kubernetes version.

node-problem-detector's People

Contributors

abansal4032, actions-user, acumino, adohe, andyxning, dchen1107, euank, gkganesh126, grosser, hakman, hercynium, jeremyje, jsenon, k8s-ci-robot, karan, linxiulei, martinforreal, max-rocket-internet, mcshooter, mmiranda96, mx-psi, raghu-nandan-bs, random-liu, rramkumar1, stmcginnis, testwill, thockin, vteratipally, wangzhen127, zyecho


node-problem-detector's Issues

Write Readme.md for KernelMonitor

Write Readme.md for KernelMonitor to demonstrate:

  • Motivation of KernelMonitor
  • Usage of current KernelMonitor
  • Future plan of KernelMonitor.

/cc @kubernetes/node-problem-detector-maintainers

Now I cannot run it in Kubernetes 1.5.3

Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message


28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-log" (spec.Name: "log") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-localtime" (spec.Name: "localtime") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id e65759efc680; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id e65759efc680
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id a7e78dfbe63a; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id a7e78dfbe63a
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 10s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"

27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id 6d3ac4da1b6e; Security:[seccomp=unconfined]
27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id 6d3ac4da1b6e
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 20s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"


NPD sending too many node-status updates in scale tests

We ran a 4k-node scalability test and observed that fluentd gets OOM-killed frequently.
However, NPD seems to send too many patch node-status requests (~3k qps out of ~11k total qps) because of it.
Ref: kubernetes/kubernetes#47344 (comment) kubernetes/kubernetes#47865 (comment)

Why is it reporting OOM events when the kubelet already seems to do that?
Can we make NPD report OOMs only for system processes and let the kubelet alone take care of k8s containers (by modifying NPD's 'OOMKilling' rule)?

cc @Random-Liu @gmarek @kubernetes/sig-node-bugs

We should add a real-case e2e test.

We already have a NPD e2e test in kubernetes repo: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node_problem_detector.go.

However, the e2e test doesn't actually test the NPD deployed in the cluster. It only:

  • Deploys an NPD pod with a test config.
  • Generates a test log entry in the test log file.
  • Checks whether the test node condition is set.

Although this test is still necessary, we should add a new test that really injects logs into the kernel log, or even triggers a kernel problem, and tests the real NPD deployed in the cluster.

The test is disruptive, so we may want to add a [Feature] tag and create a dedicated Jenkins job for it.

/cc @dchen1107

can't update node condition

Hi, I deployed NPD yesterday in my k8s 1.6 cluster, and now I find it can't update node conditions. Below is its log:

E0418 09:48:32.826374       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "cs55": there is a meaningful conflict (firstResourceVersion: "881081", currentResourceVersion: "881106"):
 diff1={"metadata":{"resourceVersion":"881106"},"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-04-18T01:47:32Z","type":"KernelDeadlock"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","lastTransitionTime":"2017-04-17T09:10:43Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}

I wonder why this happens? @Random-Liu thx

Support SSL certificates

Our k8s API server is running with a self-signed certificate to enable SSL. I am wondering what's the best way to specify a root CA when running node-problem-detector.

Generalize kernel monitor.

Discussed with @apatil.

Kernel monitor was initially introduced to monitor kernel log and detect kernel issues.

However, it could in fact be extended to monitor other logs, such as docker logs, systemd logs, etc., by adding new translators. Currently this is already doable, but not very intuitive because:

  1. All files, types and functions are named as kernel xxx.
  2. Translator is not configurable.

We should refactor the code to make it easier and more intuitive to extend kernel monitor:

  • Change kernelmonitor to logmonitor. We'll only use log monitor to monitor kernel log for K8s, but it should be easy for other users to reconfigure and extend it to monitor other logs.
  • Extend the configuration to make translator and log source configurable after #41 landed, including:
    • Make the journald log filter configurable.
    • Make the translate function configurable.

/cc @kubernetes/sig-node

node-problem-detector:v0.4.0 - failed to update node conditions: Timeout

Hello,

I am testing out the node-problem-detector cluster addon and am seeing this error on only 1 node:

  • failed to update node conditions: Timeout: request did not complete within allowed duration

This cluster has had issues in the past due to some etcd3 nodes becoming unhealthy, and flanneld being unable to connect to etcd3 at one point, causing kube-apiserver to fail to respond and hence kubelet, kube-controller-manager, kube-proxy, and kube-scheduler to log various API timeout errors.

After restarting everything, the rest of the nodes are healthy & both etcd3 + kube-apiserver appear to be contactable by the kube-system services. I then started npd-0.4.0 using the above manifest and am seeing only 1 node with this issue. The rest are sometimes logging what appears to be the race condition for updating node status exactly as in #108.

I've checked the health of the 5-node etcd cluster and only 1 of 5 is in an unhealthy state. Requests from kubectl return most of the time; however, I am seeing some timeouts from some kubectl describe node XXX commands.

Relevant log sections below:

/var/log/node-problem-detector.log:

[... BEGIN LOG ...]

I0617 05:08:14.386945       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/var/log/journal
Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:
kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, a
non-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNe
tDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL po
inter dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Patt
ern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ second
s\.}]}
I0617 05:08:14.387062       7 log_watchers.go:40] Use log watcher of plugin "journald"
I0617 05:08:14.387253       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (
.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:00
01-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d
+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more tha
n \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition:
 Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Typ
e:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:Dock
erHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0617 05:08:14.387286       7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0617 05:08:14.387923       7 log_monitor.go:81] Start log monitor
E0617 05:08:14.387946       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/lo
g/kern.log: no such file or directory
I0617 05:08:14.387952       7 log_monitor.go:81] Start log monitor
I0617 05:08:14.389906       7 log_watcher.go:69] Start watching journald
I0617 05:08:14.389947       7 problem_detector.go:74] Problem detector started
I0617 05:08:14.390046       7 log_monitor.go:173] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2017-06-17 05:08:14.390031295 +0000 UTC Reason:Ker
nelHasNoDeadlock Message:kernel has no deadlock}]
E0617 06:41:07.988201       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2163685", currentResourceVersion: "2163729"):
 diff1={"metadata":{"resourceVersion":"2163729"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T06:41
:01Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:10:02.768550       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2181151", currentResourceVersion: "2181185"):
 diff1={"metadata":{"resourceVersion":"2181185"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T08:10
:02Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:40:02.392305       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:32.394968       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:52.688694       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 10:19:47.886026       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2206191", currentResourceVersion: "2206223"):
 diff1={"metadata":{"resourceVersion":"2206223"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:47Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:20:03.868248       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2206223", currentResourceVersion: "2206276"):
 diff1={"metadata":{"resourceVersion":"2206276"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:27:14.278874       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2207596", currentResourceVersion: "2207627"):
 diff1={"metadata":{"resourceVersion":"2207627"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:11Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:49:49.999965       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:04:58.654339       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2214802", currentResourceVersion: "2214843"):
 diff1={"metadata":{"resourceVersion":"2214843"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:13:25.349869       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:13:27.448321       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2216376", currentResourceVersion: "2216452"):
 diff1={"metadata":{"resourceVersion":"2216452"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:25Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:27:54.707593       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:03.676434       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:11.687756       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2219181", currentResourceVersion: "2219220"):
 diff1={"metadata":{"resourceVersion":"2219220"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:27:56Z","type":"KernelDeadlock"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:28:07Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:28:11.687756       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2219181", currentResourceVersion: "2219220"):
 diff1={"metadata":{"resourceVersion":"2219220"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:27:56Z","type":"KernelDeadlock"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:28:07Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:22.216998       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220568", currentResourceVersion: "2220603"):
 diff1={"metadata":{"resourceVersion":"2220603"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:21Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:39.515617       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:36:44.370694       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220826", currentResourceVersion: "2220882"):
 diff1={"metadata":{"resourceVersion":"2220882"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:43Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:43:04.549531       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:52:29.011584       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:55:55.741212       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost

[...SNIP...]

E0617 20:48:43.758632       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:08.649485       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:36.742814       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:06.745291       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:51:18.310623       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:25.507013       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:35.682252       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:40.731491       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2329653", currentResourceVersion: "2329819"):
 diff1={"metadata":{"resourceVersion":"2329819"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"N
odeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnkn
own","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","statu
s":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","ty
pe":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:51:39Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:52:05.903239       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:52:35.941987       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:05.990698       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:35.993274       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:05.995654       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:35.998026       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:06.009227       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:36.011974       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:56:01.937423       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:56:12.384973       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:56:42.390952       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:12.393356       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:42.395943       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:12.405949       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:22.411682       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:33.305720       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:34.805954       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2330223", currentResourceVersion: "2330380"):
 diff1={"metadata":{"resourceVersion":"2330380"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"N
odeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnkn
own","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:58:33Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:59:13.391379       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration

[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]
[...SNIP...]

E0617 21:22:43.790298       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:13.792892       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:43.795469       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:13.798194       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:43.800739       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:13.803335       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:43.805904       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:13.808357       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:43.810985       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:13.813560       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:43.816150       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:13.886384       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:43.889035       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:13.891666       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:43.894478       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:13.897056       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:43.922019       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:13.947162       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:43.949919       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:13.952527       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:43.955028       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:13.957574       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:43.960186       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:13.962807       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:43.965335       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:13.967982       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:43.971220       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration

[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]

Bump the version to v0.3 in makefile, manifest and readme

After adding the standalone mode in #49, I created and pushed a new docker image gcr.io/google_containers/node-problem-detector:v0.3 which includes these changes for use by kubemark. However, this version bump has to be reflected in:

  1. version tag inside the Makefile
  2. image inside the node-problem-detector.yaml
  3. readme, to indicate that v0.3 is the latest version and/or suggested for use?

@Random-Liu Hope this push doesn't disturb any plans you might be having for npd release schedule. If yes, it's not difficult to revert this change, as I just did it for use in kubemark.

NodeProblemDetector tests failing on AWS

c.nodeName, err = os.Hostname()
assumes that the hostname is the same as the cloud provider instance name. This is false on AWS. NPD needs to go through CurrentNodeName to get the cloud-provider-agnostic hostname.

I'm disabling this test on AWS for now, but we're marching towards full support for AWS. Please treat this as if it had broken a build when it was checked in originally.

Write Readme.md for NodeProblemDetector

Write Readme.md for NodeProblemDetector to demonstrate:

  • Motivation and scope of NodeProblemDetector
  • Usage of current NodeProblemDetector
  • Future plan of NodeProblemDetector.

/cc @kubernetes/node-problem-detector-maintainers

NPD Kubernetes 1.6 Planning

NPD (node problem detector) was introduced in Kubernetes 1.3 as a default add-on in GCE clusters.

At that time, it mainly targeted the default GCE Kubernetes setup. However, as time went by, some limitations were found, such as journald support, an authentication issue, and a scalability issue, which affected the adoption of NPD in many other environments.

In Kubernetes 1.6, we plan to invest some time to improve NPD, make it production ready, and roll it out in GKE.

Here are the working items and priorities:

  • [P0] Journald support. Many important OS distros are using systemd now, such as GCI, CoreOS, CentOS etc. This is essential for NPD adoption. (Issue: #14, PR: #39, #33, @adohe)
  • [P0] Apiserver client option override. By default, NPD is running as DaemonSet and use InClusterConfig to access apiserver. However, this does not work when service account is not available. (Issue: #27, #21). We should make the apiserver client option configurable, so that user can customize it based on their cluster setup. This is prerequisite of Standalone mode (PR: #49, @andyxning)
  • [P0] Standalone mode. Make it possible to run NPD standalone, possibly as a systemd service. DaemonSet is easy to deploy and manage. However, docker still stops all containers when it's dead (live-restore is still in validation). Because of this, NPD may not be able to detect problems when docker is unresponsive. (Issue: #76)
  • [P1] Integrate NPD with K8s e2e framework. NPD is already running in e2e cluster, but the information it collects is not well-surfaced from the test framework. We should make it visible by failing the test or collecting via a dashboard (Issue: kubernetes/kubernetes#30811).
  • [P1] Scalability and performance. #85
    • Some known performance issue needs to be fixed in NPD, such as reduce apiserver access (#37), and improve log parsing efficiency. (#79, #84)
    • More benchmark to verify the performance of NPD. Both benchmark for NPD resource usage and apiserver load introduced by NPD (#50, @shyamjvs, #85).
  • [P2] Formalize the Project. Formalize the process of the project, including:
    • #66 Add change log. (#45) [P2]
    • Define release process. (#67) [P2]
    • Add pre/post submit e2e test. (#43) [P3]
  • [P2] Docker problem detection. Although kernel monitor could be extended to monitor other logs, it still needs some code change to achieve that. We should cleanup the code to make it easier to monitor other logs and add clear documentation for it. (#44) (#88) (PR: #88, #92, #94)
  • [P3] 3rd party problem daemon integration. Kernel monitor is designed to detect known kernel problems with minimum overhead, it is not expected to be a comprehensive solution. NPD should be extensible to integrate with more small problem daemons or more mature solution. (#35)

Note that only P0s are release blockers.

@dchen1107 @fabioy @ajitak
/cc @kubernetes/sig-node-misc

Release instructions are needed.

Forked from #63 (comment).
We need release instructions to standardize the release process. The instructions should cover:

  • How to build release packages
    • Build docker image.
    • Build tar ball for standalone.
    • ...
  • How to cut release
    • Version a release
    • Update CHANGELOG.md
    • Create branch/tag
    • Create github release.
    • ...
  • How to update kubernetes
    • Update e2e test
    • Update addon pods in cluster
    • ...

This has relatively lower priority than the features; will file the doc before cutting the release.

Use chroot instead of relying on a systemd base image.

We want NPD to support journald logs even when it's running as a DaemonSet inside a container.

However, it's hard to consume the host journal log inside a container. Previously, we used a fedora base image and relied on the journald library inside the base image to understand and read the host journald log. However, there are several limitations:

  1. The fedora base image is pretty big, and there are a lot of things in it we don't actually need.
  2. The journald version inside the container may mismatch the host's, which sometimes causes problems. E.g. the npd image doesn't work with the new GCI image now; see also the user bug report #114.

A more generic solution may be to mount the host / inside the container and chroot into it. In this way, we could use the journald on the host directly, which eliminates the problems above.

$ docker run -it --privileged -v /:/rootfs --entrypoint=/bin/bash gcr.io/google_containers/node-problem-detector:v0.4.0
[root@07fb0f31a048 /]# chroot /rootfs
sh-4.3# journalctl -k
-- Logs begin at Sun 2017-06-11 12:12:02 UTC, end at Mon 2017-06-12 18:59:06 UTC. --
Jun 12 18:19:54 e2e-test-lantaol-master kernel: device veth43d8449 entered promiscuous mode
Jun 12 18:19:54 e2e-test-lantaol-master kernel: IPv6: ADDRCONF(NETDEV_UP): veth43d8449: link is not ready
Jun 12 18:19:54 e2e-test-lantaol-master kernel: eth0: renamed from veth75b3ac0
...

I haven't thought through the potential security problems yet, but since NPD is a privileged DaemonSet, it's reasonable to grant it host fs access.

failed to build node-problem-detector

  • go version go1.7 darwin/amd64
  • godep v74 (darwin/amd64/go1.7)

make failed with output:

CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/github.com/coreos/go-systemd/sdjournal/functions.go:19:2: no buildable Go source files in /Users/sandflee/goproject/src/k8s.io/node-problem-detector/vendor/github.com/coreos/pkg/dlopen
godep: go exit status 1

"unregister_netdevice" isn't necessarily a KernelDeadlock

I have a node running CoreOS 1221.0.0 with kernel version 4.8.6-coreos.

The node-problem-detector marked it with "KernelDeadlock True Sun, 04 Dec 2016 18:56:20 -0800 Wed, 16 Nov 2016 00:03:33 -0800 UnregisterNetDeviceIssue unregister_netdevice: waiting for lo to become free. Usage count = 1".

If I check my kernel log, I see the following:

$ dmesg -T | grep -i unregister_netdevice -C 3
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered blocking state
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered forwarding state
[Wed Nov 16 08:02:19 2016] IPv6: eth0: IPv6 duplicate address fe80::42:aff:fe02:1206 detected!
[Wed Nov 16 08:03:33 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Wed Nov 16 08:14:35 2016] vethafecb94: renamed from eth0
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state

Clearly, the node managed to continue to perform operations after printing that message. In addition, pods continue to function just fine and there aren't any long-term issues for me on this node.

I know that what counts as a deadlock is configurable, but perhaps the default configuration shouldn't include this, or the check for it should be more advanced, since as-is it could be quite confusing.

Add RBAC in the example yaml file and document

The default Kubernetes setup removed ABAC support and only uses RBAC now.
kubernetes/kubernetes#39092

This is fine for the NPD in kube-system, which has enough permission. However, the NPD setup in the example and Readme.md may not have enough.

We should add documentation and example to demonstrate how to deploy NPD in a RBAC K8s cluster.
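
A minimal sketch of what such an example could contain, assuming NPD runs under a dedicated ServiceAccount and needs to update node conditions and create events (names and the exact verb list are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-problem-detector
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/status"]
  verbs: ["get", "patch", "update"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system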

Performance benchmark and optimization

We've got some data from #2.

However, we've made a lot of changes since then. We should run a benchmark to measure the CPU and memory usage, and optimize the kernel monitor's log parsing code if the resource usage is too high.

An accurate and acceptable number is important for NPD rollout.

/cc @dchen1107

Standalone NPD Support

For better reliability, in Kubernetes 1.6, we plan to add standalone NPD support to run NPD as a system daemon on each node.

To achieve this, we need to:

  • Add an apiserver-override option to make it possible to use a customized client config instead of InClusterConfig. #49
  • Make NPD wait for: #79
    • The apiserver to come up. #19
    • The kubelet to register the node.
      This is necessary because, in standalone mode, NPD may be deployed before the apiserver and kubelet are functioning. It should wait for them to come up.
  • NPD should host a simple http server for readiness checks and health monitoring (see the sketch after this list). #83
    • Required: Add /health for health monitoring.
    • Optional: Add /pprof for performance debugging.
    • Optional: Add a debug endpoint to list internal state.
    • Optional: Add a grpc problem report endpoint.
  • Resource Limit
    • Get benchmark data about the NPD resource usage #85
    • ~~Adjust NodeAllocatable according to the resource usage.~~
  • Add standalone NPD into K8s GCI cloud init script. kubernetes/kubernetes#40206
    • Package the NPD binary as a tarball and upload it to Google Storage. #71, #74
    • Download and setup NPD binary in K8s GCI cloud init script.
  • NPD should have a specific user account with the necessary permissions in the K8s cluster, similar to system:kube-proxy.
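
A minimal sketch of the /health endpoint mentioned above, using plain net/http (the handler path and port are illustrative, not the final flags):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Liveness/readiness probe target: returns 200 as long as the process
	// is serving. A real implementation would also consult the internal
	// state of the problem daemons before reporting healthy.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":20256", nil))
}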

does this tool work outside Google?

Panic crash:

kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                READY     STATUS             RESTARTS   AGE       NODE
default       node-problem-detector-0kgkw         0/1       CrashLoopBackOff   3          1m        192.168.78.15
default       node-problem-detector-ar3tk         0/1       CrashLoopBackOff   3          1m        192.168.78.16

kubectl logs node-problem-detector-0kgkw
I0623 01:02:18.560287       1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
I0623 01:02:18.560413       1 kernel_monitor.go:93] Got system boot time: 2016-06-17 17:51:02.560408109 +0000 UTC
panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

goroutine 1 [running]:
panic(0x15fc280, 0xc8204fc000)
    /usr/local/go/src/runtime/panic.go:464 +0x3e6
k8s.io/node-problem-detector/pkg/problemclient.NewClientOrDie(0x0, 0x0)
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemclient/problem_client.go:56 +0x132
k8s.io/node-problem-detector/pkg/problemdetector.NewProblemDetector(0x7faa4f155140, 0xc8202a6900, 0x0, 0x0)
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemdetector/problem_detector.go:45 +0x36
main.main()
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/node_problem_detector.go:33 +0x56

DaemonSet NPD creates /var/log/journal on non-journald node.

Ideally, we should not use the journald config on a non-systemd node. But if we do, NPD is expected to error out.

However, for some unknown reason, NPD currently creates a /var/log/journal directory and keeps going. Related code: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/logwatchers/journald/log_watcher.go#L141

This should not matter much, because nothing will actually write any log there, so NPD will just hang and not consume more resources. But we should still fix this.

@kubernetes/node-problem-detector-maintainers

A systemd service for testing.

After introducing a new pattern, we usually need to verify it.

However, it's hard to trigger a real problem in a given service, so we usually have to inject logs to verify it.

It's hard to inject logs into a given systemd service. We should have a test systemd service that does nothing but generate specified log lines. We could then let NPD parse the log of the test service to verify a newly introduced pattern.
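
A minimal sketch of such a test service (the unit name and message are illustrative; the message here matches the DockerHung pattern from kernel-monitor.json, and systemd forwards the service's stdout to the journal by default):

# /etc/systemd/system/npd-test-logger.service
[Unit]
Description=Emit a fixed log line for NPD pattern verification

[Service]
Type=oneshot
ExecStart=/bin/echo "task docker:1234 blocked for more than 120 seconds."

Each `systemctl start npd-test-logger` then writes exactly one matching line into the journal under the test service's unit, where a journald log watcher pointed at that unit can pick it up.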

"startPattern" is fragile and wrong on newer kernels

Broken out from here

Currently, the config has a default of "startPattern": "Initializing cgroup subsys cpuset",

This pattern is meant to detect a node's boot process. Prior to the 4.5 kernel, this message was typically printed during boot of a node. After 4.5 however, due to this change, it is quite unlikely for that message to appear.

Furthermore, there's rarely a reason to detect whether a message is for the current boot in such a fragile way.

With the kern.log reader, every message is for the current boot because kern.log is usually rotated such that each file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before that, etc). (EDIT: I'm wrong about this for gci at least)

With journald, the boot id is annotated in messages, and so it can accurately be correlated with the current boot id (see the "_BOOT_ID" record in journald messages).

With a kmsg reader, all messages will only be the current boot because kmsg is not persistent.

startPattern isn't useful in any of those cases. I think each kernel log parsing plugin should be responsible for doing the right thing itself.

Feature request for a "hollow"-node-problem-detector having an empty list of conditions and rules inside kernel monitor config

As part of the effort to make kubemark testing mimic real clusters as closely as possible, we are planning to add a "hollow-node-problem-detector" container inside hollow-node, alongside the existing "hollow-kubelet" and "hollow-kubeproxy" containers. For this, we need a node-problem-detector image whose kernel monitoring config has the conditions and rules set to empty lists. Once it is tested to work fine in kubemark, this image should be pushed to gcr.io/google-containers.
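
A sketch of what the emptied-out kernel monitor config could look like (the key names mirror the parsed config quoted in the logs elsewhere in this document; treat the exact file as illustrative):

{
  "plugin": "filelog",
  "pluginConfig": {
    "timestamp": "^.{15}",
    "message": "kernel: \\[.*\\] (.*)",
    "timestampFormat": "Jan _2 15:04:05"
  },
  "logPath": "/var/log/kern.log",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [],
  "rules": []
}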

@kubernetes/sig-scalability @wojtek-t @gmarek @Random-Liu

OOMKilling not triggering with recent kernels

With Linux 4.9, and perhaps earlier versions, the OOM kill log line has a trailing

, shmem-rss:\\d+kB

that the built-in OOMKilling pattern does not match. Should it be made optional in the built-in regex, are admins expected to edit their configuration, or something else?
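
If the choice is to make it optional, the rule's pattern could be extended with an optional group, e.g. (a sketch against the OOMKilling pattern quoted in the logs elsewhere in this document):

"pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB(, shmem-rss:\\d+kB)?"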

status doesn't change when injecting log messages

I cannot get the node problem detector to change a node status by injecting messages.

I am using kubernetes 1.5.2, Ubuntu 16.04, kernel 4.4.0-51-generic.

I run NPD as a daemonset. I have attempted to get this to work with NPD versions 0.3.0 and 0.4.0. I start NPD with the default command, using /config/kernel-monitor.json, because my nodes use journald.

I have /dev/kmsg mounted into the pod, and I echo expressions matching the regexs in the kernel-monitor.json to /dev/kmsg on the node. I can view the fake logs I've echoed to /dev/kmsg in the pod.

Steps to reproduce:

# as root on the node where your pod is running
echo "task umount.aufs:123 blocked for more than 120 seconds." >> /dev/kmsg
# I have verified that these logs show up in journalctl -k

# this should match the following permanent condition in /config/kernel-monitor.json
#	{
#		"type": "permanent",
#		"condition": "KernelDeadlock",
#		"reason": "AUFSUmountHung",
#		"pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
#	},

# check the node status of the node where you ran this on
kubectl get node <node>
# status will still be Ready

# for further detail examine the json
kubectl get node <node> -o json | jq .status.conditions
# you will see that the KernelDeadlock condition is still "False"

# I would expect the general status to change to "KernelDeadlock"

If I am not testing this properly, could you please give a detailed breakdown of how to verify that the node problem detector is working properly for kernel logs AND docker logs?

I have also reproduced this behavior using a custom docker_monitor.json and having the systemd docker service write to the journald docker logs. I have still been unsuccessful in getting the node status to change.

does not automatically create /var/log/journal

Hi,

  1. great initiative. Here is my experience on RHEL/CentOS 7

  2. Apparently node-problem-detector is not able to work with /run/log/journal/ (unfortunately) - correct? If so, that seems like a big limitation, no?

  3. The issue is that even if it cannot use that path, it also fails to start, because it depends on the existence of "/var/log/journal". Given that it cannot use /run/log/journal, is it reasonable to expect it to create "/var/log/journal" on each node?

  4. "/var/log/kern.log" is missing on my OS. I guess it's no longer in use, or does it need to be configured?

Here are the logs:

I0621 14:32:46.173199       7 log_watchers.go:40] Use log watcher of plugin "journald"
I0621 14:32:46.186164       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0621 14:32:46.186204       7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0621 14:32:46.186926       7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186945       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor.json": failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0621 14:32:46.186951       7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186958       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/log/kern.log: no such file or directory
F0621 14:32:46.186964       7 node_problem_detector.go:88] Problem detector failed with error: no log montior is successfully setup
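
Judging from the "failed to stat the log path" error above, a possible workaround (paths illustrative) is to point the journald monitor config at the volatile journal location, which is where RHEL/CentOS 7 keeps the journal unless persistent journaling is enabled:

"logPath": "/run/log/journal"

and mount that host directory into the NPD pod:

volumeMounts:
- name: journal
  mountPath: /run/log/journal
  readOnly: true
volumes:
- name: journal
  hostPath:
    path: /run/log/journal

Alternatively, enabling persistent journaling on the host (mkdir -p /var/log/journal && systemd-tmpfiles --create --prefix /var/log/journal, as described in the journald documentation) makes the default /var/log/journal path valid.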

node-problem-detector does not work if I change kubelet hostname to node ip.

I am running k8s 1.3.0 with the docker image node-problem-detector:v0.1.

I changed kubelet --hostname-override to the node ip on each minion:

root@SZX1000116607:~# cat /etc/default/kubelet
KUBELET_OPTS='--hostname-override=10.22.109.119 --api-servers=http://10.22.109.119:8080,http://10.22.69.237:8080,http://10.22.117.82:8080 --pod-infra-container-image=xxxxx/kubernetes/pause:latest --cluster-dns=192.168.1.1  --cluster-domain=test1 --low-diskspace-threshold-mb=2048 --cert-dir=/var/run/kubelet --allow-privileged=true'

When I start node-problem-detector as a daemon set, I get a lot of error messages:

2016-07-11T07:49:17.310422203Z I0711 07:49:17.309761       1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
2016-07-11T07:49:17.310510490Z I0711 07:49:17.310026       1 kernel_monitor.go:93] Got system boot time: 2016-07-08 01:40:54.310019188 +0000 UTC
2016-07-11T07:49:17.311878174Z I0711 07:49:17.311436       1 kernel_monitor.go:102] Start kernel monitor
2016-07-11T07:49:17.311907663Z I0711 07:49:17.311654       1 kernel_log_watcher.go:173] unable to parse line: "", can't find timestamp prefix "kernel: [" in line ""
2016-07-11T07:49:17.311926515Z I0711 07:49:17.311696       1 kernel_log_watcher.go:110] Start watching kernel log
2016-07-11T07:49:17.311942118Z I0711 07:49:17.311720       1 problem_detector.go:60] Problem detector started
2016-07-11T07:49:17.314020808Z 2016/07/11 07:49:17 Seeked /log/kern.log - &{Offset:0 Whence:0}
2016-07-11T07:49:18.355974460Z E0711 07:49:18.355712       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:19.314395255Z E0711 07:49:19.314110       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:20.314302138Z E0711 07:49:20.313982       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:21.315151897Z E0711 07:49:21.314849       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:22.313977681Z E0711 07:49:22.313834       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:23.313906286Z E0711 07:49:23.313619       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:24.314428608Z E0711 07:49:24.314141       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:25.316626549Z E0711 07:49:25.316326       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:26.314346471Z E0711 07:49:26.314142       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:27.313901894Z E0711 07:49:27.313759       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:28.314356686Z E0711 07:49:28.314198       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:29.314747334Z E0711 07:49:29.314450       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:30.314562756Z E0711 07:49:30.314235       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found

It seems node-problem-detector still uses the original OS hostname to access the node, instead of the overridden hostname that is registered in etcd.

/wangyumi

How to hook up third-party daemons?

I'm an ABRT developer and I would love to create a problem daemon reporting problems detected by ABRT to node-problem-detector.

ABRT's architecture is similar to node-problem-detector's - there are agents reporting detected problems to abrtd. An ABRT agent is either a tiny daemon watching logs (or systemd-journal) or a language error handler (Python sys.excepthook, Ruby at_exit callback, /proc/sys/kernel/core_pattern, Node.js uncaughtException event handler, Java JNI agent).

I've created a docker image that is capable of detecting kernel oopses, vmcores and core files on a host:
https://github.com/jfilak/docker-abrt/tree/atomic_minimal

(It should be possible to detect uncaught [Python, Ruby, Java] exceptions in the future)

ABRT provides several ways of reporting the detected problems to users - e-mail, FTP|SCP upload, D-Bus signal, Bugzilla bug, micro-Report, systemd-journal catalog message - and it is trivial to add another report destination.

The Design Doc defines a "Problem Report Interface", but I've failed to find out how to register a new problem daemon with node-problem-detector, or how to use the "Problem Report Interface" from a third-party daemon.
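
One possible integration path is the custom plugin mechanism, where NPD periodically invokes a user-defined check script and turns its result into an event or condition. A minimal sketch of such a check for ABRT (the script name is hypothetical, and the exit-code convention shown here, 0 = OK and 1 = problem, should be confirmed against the custom plugin proposal):

#!/bin/bash
# check_abrt.sh - report a problem if abrtd has recorded any crashes.
if [ "$(abrt-cli list 2>/dev/null | wc -l)" -gt 0 ]; then
  echo "ABRT detected crashes on this node"
  exit 1
fi
echo "no ABRT-detected crashes"
exit 0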

Unable to build code, missing godep

I cloned the repo and ran:

$ make node-problem-detector
CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/gopkg.in/fsnotify.v1/inotify.go:19:2: cannot find package "golang.org/x/sys/unix" in any of:
    /home/decarr/go/src/k8s.io/node-problem-detector/vendor/golang.org/x/sys/unix (vendor tree)
    /usr/local/go/src/golang.org/x/sys/unix (from $GOROOT)
    /home/decarr/go/src/k8s.io/node-problem-detector/Godeps/_workspace/src/golang.org/x/sys/unix (from $GOPATH)
    /home/decarr/go/src/golang.org/x/sys/unix
godep: go exit status 1
Makefile:9: recipe for target 'node-problem-detector' failed
make: *** [node-problem-detector] Error 1

Looks like this is missing a godep in master:
https://github.com/kubernetes/node-problem-detector/tree/master/vendor
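
Until the missing package is vendored, a likely local workaround (assuming a standard godep/GOPATH workflow of that era) is to fetch it into the GOPATH, which is on the error's search path:

go get golang.org/x/sys/unix

or to restore all declared dependencies with godep restore before running make.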

apiserver-override flag in standalone mode not working in kubemark

So I created NPD containers inside the hollow-nodes of kubemark (just for clarity: these are actually pods), passing the address of the kubemark master and the kubeconfig file through the override flag as follows:

"name": "hollow-node-problem-detector",
"image": "gcr.io/google_containers/node-problem-detector:v0.3",
"env": [
						{
							"name": "NODE_NAME",
							"valueFrom": {
								"fieldRef": {
									"fieldPath": "metadata.name"
								}
							}
						}
],
"command": [
						"/node-problem-detector",
						"--kernel-monitor=/config/kernel-monitor.json",
						"--apiserver-override=https://104.198.41.48:443?inClusterConfig=false&auth=/kubeconfig/npd_kubeconfig",
						"--alsologtostderr",
						"1>>/var/logs/npd_$(NODE_NAME).log 2>&1"
],
"volumeMounts": ....

I even checked inside the container, using kubectl exec, that the right kubeconfig file is present and the NODE_NAME env var is defined. Here's the kubeconfig:

apiVersion: v1
kind: Config
users:
- name: node-problem-detector
  user:
    token: 8qt8RLdwIKeZQ0QeVrSWg4BrqK3Cs8H8
clusters:
- name: kubemark
  cluster:
    insecure-skip-tls-verify: true
    server: https://104.198.41.48
contexts:
- context:
    cluster: kubemark
    user: node-problem-detector
  name: kubemark-npd-context
current-context: kubemark-npd-context

Despite this, NPD seems to communicate with the wrong master (the master of the real cluster underneath the kubemark cluster). It is probably still using the InClusterConfig from the real node underneath the hollow-node, as that seems to be the only way it could get hold of the wrong master's IP address.

cc @Random-Liu @andyxning

Node-Problem-Detector should Patch NodeStatus not Update

Problem:
kubelet, node-controller, and now the node-problem-detector all update the Node.Status field.

Normally this is not a problem: if changes happen rapidly, the write fails on a resource-version mismatch and everything is OK.

When a new field is added to Node Status, however:

  • Since kubelet and node-controller are in the same repository they are recompiled with the new field and Status updates continue to operate normally.
  • However, since node-problem-detector is in a separate repository and has not been recompiled with the new version, its Update calls end up squashing the new (unknown) field, resetting it to nil.

Suggested Solution:
node-problem-detector should do a patch instead of an update to prevent wiping out fields it is not aware of (a sketch follows below). Incidentally, kubelet and node-controller should do this as well, but that is not as critical since they are in the same repository and new fields are normally added there first.

CC @kubernetes/sig-node @kubernetes/sig-api-machinery @Random-Liu @bgrant0607
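
A minimal sketch of the suggested change, assuming the client-go API of that era (method signatures changed across releases; treat this as illustrative, not the exact NPD code path):

package problemclient

import (
	"encoding/json"

	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchNodeConditions sends only the conditions NPD owns. A strategic merge
// patch merges the conditions list by condition type and leaves fields the
// client does not know about untouched, unlike UpdateStatus, which writes
// the whole (possibly stale) status object back.
func patchNodeConditions(client kubernetes.Interface, nodeName string, conditions []v1.NodeCondition) error {
	patch, err := json.Marshal(map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": conditions,
		},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(nodeName, types.StrategicMergePatchType, patch, "status")
	return err
}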

Now I am running the daemonset on all nodes, but how do I verify that it is useful?

What happens when a node goes wrong after creating the daemonset? Does it set the node to unschedulable?

It is a bit difficult to simulate these problems:

Hardware issues: Bad cpu, memory or disk;
Kernel issues: Kernel deadlock, corrupted file system;
Container runtime issues: Unresponsive runtime daemon;
...
