
Healthcheck library for OpenResty to validate upstream service status

Home Page: https://kong.github.io/lua-resty-healthcheck/topics/README.md.html

License: Apache License 2.0

Languages: Lua 99.55%, Makefile 0.45%

lua-resty-healthcheck's Introduction

lua-resty-healthcheck


A health check library for OpenResty.

Synopsis

http {
    lua_shared_dict test_shm 8m;
    lua_shared_dict my_worker_events 8m;
    init_worker_by_lua_block {

        local we = require "resty.worker.events"
        local ok, err = we.configure({
            shm = "my_worker_events",
            interval = 0.1
        })
        if not ok then
            ngx.log(ngx.ERR, "failed to configure worker events: ", err)
            return
        end

        local healthcheck = require("resty.healthcheck")
        local checker = healthcheck.new({
            name = "testing",
            shm_name = "test_shm",
            checks = {
                active = {
                    type = "https",
                    http_path = "/status",
                    healthy  = {
                        interval = 2,
                        successes = 1,
                    },
                    unhealthy  = {
                        interval = 1,
                        http_failures = 2,
                    }
                },
            }
        })

        local ok, err = checker:add_target("127.0.0.1", 8080, "example.com", false)

        local handler = function(target, eventname, sourcename, pid)
            ngx.log(ngx.DEBUG,"Event from: ", sourcename)
            if eventname == checker.events.remove then
                -- a target was removed
                ngx.log(ngx.DEBUG,"Target removed: ",
                    target.ip, ":", target.port, " ", target.hostname)
            elseif eventname == checker.events.healthy then
                -- target changed state, or was added
                ngx.log(ngx.DEBUG,"Target switched to healthy: ",
                    target.ip, ":", target.port, " ", target.hostname)
            elseif eventname == checker.events.unhealthy then
                -- target changed state, or was added
                ngx.log(ngx.DEBUG,"Target switched to unhealthy: ",
                    target.ip, ":", target.port, " ", target.hostname)
            else
                -- unknown event
            end
        end
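
        -- NOTE: to actually receive the checker's events, the handler above still
        -- needs to be registered with the worker-events library, for example
        -- (shown here as an assumption; not part of the original snippet):
        we.register(handler)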
    }
}

Description

This library supports performing active and passive health checks on arbitrary hosts.

Control of the library happens via its programmatic API. Consumption of its events happens via the lua-resty-worker-events library.

Targets are added using checker:add_target(ip, port, hostname, is_healthy). Changes in status ("healthy" or "unhealthy") are broadcast via worker-events.

Active checks are executed in the background based on the specified timer intervals.

For passive health checks, the library receives explicit notifications via its programmatic API using functions such as checker:report_http_status(host, port, status).
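
For illustration, a minimal sketch of passive reporting against the checker created in the Synopsis above (the ip, port, hostname argument order and the "passive" flag are assumed from the 1.0.0+ API; see the LDoc documentation for the exact signatures):

-- count a failure and a success for the target, according to the configured http_statuses
checker:report_http_status("127.0.0.1", 8080, "example.com", 500, "passive")
checker:report_http_status("127.0.0.1", 8080, "example.com", 200, "passive")

-- query the current status of the target
local healthy, err = checker:get_target_status("127.0.0.1", 8080, "example.com")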

See the online LDoc documentation for the complete API.

History

Versioning is strictly based on Semantic Versioning.

Releasing new versions:

  • update the changelog below (PRs should be merged including a changelog entry)
  • based on the changelog, determine the new SemVer version
  • create a new rockspec
  • render the docs using ldoc (don't do this within PRs)
  • commit as "release x.x.x" (do not include rockspec revision)
  • tag the commit with "x.x.x" (do not include rockspec revision)
  • push commit and tag
  • upload rock to luarocks: luarocks upload rockspecs/[name] --api-key=abc

3.0.1 (22-Dec-2023)

  • Fix: fix delayed clear logic when multiple healthcheckers are started #146

3.0.0 (12-Oct-2023)

  • Perf: optimize by localizing some functions #92 (backport)
  • Fix: Generate fresh default http_statuses within new() #83 (backport)

2.0.0 (22-Sep-2020)

Note: Changes in this version have been discarded from current & future development. Below you can see its changelog, but be aware that these changes might not be present in 3.y.z unless they are explicitly stated in 3.y.z, 1.6.3 or previous releases. Read more at: release 3.0.0 (#142) and chore(*): realign master branch to 3.0.0 release (#144)

  • BREAKING: fallback for deprecated top-level field type is now removed (deprecated since 0.5.0) #56
  • BREAKING: Bump the lua-resty-worker-events dependency to 2.0.0. This makes many of the APIs in this library asynchronous, as the worker events post and post_local will no longer automatically call poll on a running worker; for more information, see: https://github.com/Kong/lua-resty-worker-events#200-16-september-2020
  • BREAKING: tcp_failures can no longer be 0 on http(s) checks (unless http(s)_failures are also set to 0) #55
  • feature: Added support for https_sni #49
  • fix: properly log line numbers by using tail calls #29
  • fix: when not providing a hostname, use IP #48
  • fix: makefile; make install
  • feature: added a status version field #54
  • feature: add headers for probe request #54
  • fix: exit early when reloading during a probe #47
  • fix: prevent target-list from being nil, due to async behaviour #44
  • fix: replace timer and node-wide locks with resty-timer, to prevent interval skips #59
  • change: added additional logging on posting events #25
  • fix: do not run out of timers during init/init_worker when adding a vast amount of targets #57
  • fix: do not call on the module table, but use a method for locks. Also in #57

1.6.3 (06-Sep-2023)

  • Feature: Added support for https_sni #49 (backport)
  • Fix: Use OpenResty API for mTLS #99 (backport)

1.6.2 (17-Nov-2022)

  • Fix: avoid raising worker events for new targets that were marked for delayed removal, i.e. targets that already exist in memory only need the removal flag cleared when added back. #122

1.6.1 (25-Jul-2022)

  • Fix: improvements to ensure the proper securing of shared resources to avoid race conditions and clearly report failure states. #112, #113, #114.
  • Fix: reduce the frequency of checking for unused targets, reducing the number of locks created. #116
  • Fix: accept any lua-resty-events 0.1.x release. #118

1.6.0 (27-Jun-2022)

  • Feature: introduce support for the lua-resty-events module in addition to lua-resty-worker-events. With this addition, the lua-resty-healthcheck luarocks package no longer requires a specific event-sharing module, but you are still required to provide either lua-resty-worker-events or lua-resty-events. #105
  • Change: if available, lua-resty-healthcheck now uses string.buffer, LuaJIT's new serialization API. If it is unavailable, lua-resty-healthcheck falls back to cjson. #109

1.5.3 (14-Nov-2022)

  • Fix: avoid raising worker events for new targets that were marked for delayed removal, i.e. targets that already exist in memory only need the removal flag cleared when added back. #121

1.5.2 (07-Jul-2022)

  • Better handling of resty.lock failure modes, adding more checks to ensure the lock is held before running critical code, and improving the decision whether a function should be retried after a timeout trying to acquire a lock. #113
  • Increased logging for locked function failures. #114
  • The cleanup frequency of deleted targets was lowered, cutting the number of created locks in a short period. #116

1.5.1 (23-Mar-2022)

  • Fix: avoid breaking active health checks when adding or removing targets. #93

1.5.0 (09-Feb-2022)

  • New option checks.active.headers supports one or more lists of values indexed by header name. #87
  • Introduce the delayed_clear() function, used to remove addresses after a time interval. This function may be used when an address is being removed but may be added again before the interval expires, keeping its health status. #88

1.4.3 (31-Mar-2022)

  • Fix: avoid breaking active health checks when adding or removing targets. #100

1.4.2 (29-Jun-2021)

  • Fix: prevent new active checks being scheduled while a health check is running. #72
  • Fix: remove event watcher when stopping an active health check. #74; fixes Kong issue #7406

1.4.1 (17-Feb-2021)

  • Fix: make sure that a single worker will actively check hosts' statuses. #67

1.4.0 (07-Jan-2021)

  • Use a single timer to actively health check targets. This reduces the number of timers used by health checkers, as they used to use two timers by each target. #62

1.3.0 (17-Jun-2020)

  • Adds mTLS support to active healthchecks. This feature can be used by adding the fields ssl_cert and ssl_key, with a certificate and key respectively, when creating a new healthcheck object. #41

1.2.0 (13-Feb-2020)

  • Adds set_all_target_statuses_for_hostname, which sets the status of all targets with a given hostname at once.

1.1.2 (19-Dec-2019)

  • Fix: when ngx.sleep API is not available (e.g. in the log phase) it is not possible to lock using lua-resty-lock and any function that needs exclusive access would fail. This fix adds a retry method that starts a new light thread, which has access to ngx.sleep, to lock the critical path. #37;

1.1.1 (14-Nov-2019)

  • Fix: fail when it is not possible to get exclusive access to the list of targets. This fix prevents workers from reaching an inconsistent state. #34

1.1.0 (30-Sep-2019)

  • Add support for setting the custom Host header to be used for active checks.
  • Fix: log error on SSL Handshake failure #28;

1.0.0 (05-Jul-2019)

  • BREAKING: all API functions related to hosts require a hostname argument now. This way different hostnames listening on the same IP and ports combination do not have an effect on each other.
  • Fix: fix reporting active TCP probe successes #20; fixes issue #19

0.6.1 (04-Apr-2019)

  • Fix: set up event callback only after target list is loaded #18; fixes Kong issue #4453

0.6.0 (26-Sep-2018)

  • Introduce checks.active.https_verify_certificate field. It is true by default; setting it to false disables certificate verification in active healthchecks over HTTPS.

0.5.0 (25-Jul-2018)

  • Add support for https -- thanks @gaetanfl for the PR!
  • Introduce separate checks.active.type and checks.passive.type fields; the top-level type field is still supported as a fallback but is now deprecated.

0.4.2 (23-May-2018)

  • Fix Host header in active healthchecks

0.4.1 (21-May-2018)

  • Fix internal management of healthcheck counters

0.4.0 (20-Mar-2018)

  • Correct setting of defaults in http_statuses
  • Type and bounds checking to checks table

0.3.0 (18-Dec-2017)

  • Disable individual checks by setting their counters to 0

0.2.0 (30-Nov-2017)

  • Adds set_target_status

0.1.0 (27-Nov-2017) Initial release

  • Initial upload

Copyright and License

Copyright 2017-2022 Kong Inc.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


lua-resty-healthcheck's Issues

http 1.0 is out-of-date?

Hi,
I'm just a Kong user and know little about Lua.
I'm just wondering about line 1050 of the code:
local request = ("GET %s HTTP/1.0\r\n%sHost: %s\r\n\r\n"):format(path, headers, hostheader or hostname or ip)
Could such health check requests use HTTP/1.1?
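
For reference, a hypothetical sketch of what an HTTP/1.1 probe request could look like (not the library's current code); since HTTP/1.1 defaults to keep-alive, an explicit Connection: close header would also be needed so the probe connection is not left open:

local request = ("GET %s HTTP/1.1\r\nConnection: close\r\n%sHost: %s\r\n\r\n"):format(path, headers, hostheader or hostname or ip)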

Timeout Errors and Unhealthy Upstreams during Health Checks

We are experiencing frequent timeout errors during the health checks of our services. While the health APIs work fine when invoked directly from the nodes, we encounter issues during the health checks performed by Kong's lua-resty-healthcheck library.

The timeout errors are logged as follows:

Unhealthy TIMEOUT increment (10/3) for 'my-service.my-domain.com(10.123.321.234:443)', context: ngx.timer
Failed to receive status line from 'my-service.my-domain.com(10.123.321.234:443)': timeout, context: ngx.timer
Failed SSL handshake with 'my-service.my-domain.com(10.123.321.234:443)': handshake failed, context: ngx.timer

It is important to note that this issue affects specific upstreams, and only one or two pods at a time experience this problem. The upstreams remain in an unhealthy state and do not recover automatically. The issue is resolved temporarily by restarting the affected Kong pod, which sets the upstream to a healthy state again.

Upon investigating the code used by Kong's lua-resty-healthcheck library, it appears that the health check query is performed using HTTP/1.0. The relevant code snippet is as follows:

local request = ("GET %s HTTP/1.0\r\n%sHost: %s\r\n\r\n"):format(path, headers, hostheader or hostname or ip)

Considering this, we suspect that the timeouts might be related to the usage of HTTP/1.0 instead of HTTP/1.1. We believe that updating the health check query to use HTTP/1.1 might help mitigate these timeout errors.

We kindly request to make the necessary changes to the lua-resty-healthcheck library to use HTTP/1.1 for health checks. This update should help improve the reliability of the health checks and prevent the upstreams from getting stuck in an unhealthy state.

checker:get_target_status fails to get the expected result

Thank you for providing such a great project. I encountered some problems while using it.

Testing according to the test case, with the following code:

 location = /t {
        content_by_lua_block {
            local we = require "resty.worker.events"
            assert(we.configure{ shm = "my_worker_events", interval = 0.1 })
            local healthcheck = require("resty.healthcheck")
            local checker = healthcheck.new({
                name = "testing",
                shm_name = "test_shm",
                checks = {
                    active = {
                        http_path = "/status",
                        healthy  = {
                            interval = 999, -- we don't want active checks
                            successes = 1,
                        },
                        unhealthy  = {
                            interval = 999, -- we don't want active checks
                            tcp_failures = 1,
                            http_failures = 1,
                        }
                    },
                    passive = {
                        healthy  = {
                            successes = 1,
                        },
                        unhealthy  = {
                            tcp_failures = 1,
                            http_failures = 1,
                        }
                    }
                }
            })
            ngx.sleep(0.1) -- wait for initial timers to run once
            local ok, err = checker:add_target("127.0.0.1", 8088, nil, true)
            ngx.say(checker:get_target_status("127.0.0.1", 8088))  -- true
            checker:report_tcp_failure("127.0.0.1", 8088)
            ngx.say(checker:get_target_status("127.0.0.1", 8088))  -- false
            checker:report_success("127.0.0.1", 8088)
            ngx.say(checker:get_target_status("127.0.0.1", 8088))  -- true
        }
    }

The result of execution is:

curl http://127.0.0.1:8085/t
false
false
true

Why can't I reproduce the same results as your test case?
In other words, after executing checker:add_target, the result of executing checker:get_target_status for the first time is false.
I can confirm that the corresponding endpoint exists, as follows:

curl -I   http://127.0.0.1:8088/status
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Date: Mon, 04 Mar 2024 15:53:43 GMT
Content-Length: 15

Please tell me if there is anything wrong here; thanks.
My OpenResty version is 1.13.

Exception!!! 'report_tcp_success()' unimplemented in healthcheck.lua

2019/05/10 12:23:29 [error] 12056#0: *195028 lua user thread aborted: runtime error: /usr/local/share/lua/5.1/resty/healthcheck.lua:737: attempt to call method 'report_tcp_success' (a nil value)
stack traceback:
coroutine 0:
/usr/local/share/lua/5.1/resty/healthcheck.lua: in function 'run_single_check'
/usr/local/share/lua/5.1/resty/healthcheck.lua:798: in function </usr/local/share/lua/5.1/resty/healthcheck.lua:794>
coroutine 1:
[C]: in function 'connect'
/usr/local/share/lua/5.1/resty/healthcheck.lua:726: in function 'run_single_check'
/usr/local/share/lua/5.1/resty/healthcheck.lua:798: in function 'run_work_package'
/usr/local/share/lua/5.1/resty/healthcheck.lua:826: in function 'active_check_targets'
/usr/local/share/lua/5.1/resty/healthcheck.lua:907: in function </usr/local/share/lua/5.1/resty/healthcheck.lua:871>, context: ngx.timer, client: 127.0.0.1, server: 127.0.0.1:8001
2019/05/10 12:23:29 [error] 12056#0: *195028 lua user thread aborted: runtime error: /usr/local/share/lua/5.1/resty/healthcheck.lua:737: attempt to call method 'report_tcp_success' (a nil value)
stack traceback:
coroutine 0:
/usr/local/share/lua/5.1/resty/healthcheck.lua: in function 'run_single_check'
/usr/local/share/lua/5.1/resty/healthcheck.lua:798: in function </usr/local/share/lua/5.1/resty/healthcheck.lua:794>
coroutine 1:
[C]: in function 'connect'
/usr/local/share/lua/5.1/resty/healthcheck.lua:726: in function 'run_single_check'
/usr/local/share/lua/5.1/resty/healthcheck.lua:798: in function 'run_work_package'
/usr/local/share/lua/5.1/resty/healthcheck.lua:826: in function 'active_check_targets'
/usr/local/share/lua/5.1/resty/healthcheck.lua:907: in function </usr/local/share/lua/5.1/resty/healthcheck.lua:871>, context: ngx.timer, client: 127.0.0.1, server: 127.0.0.1:8001
2019/05/10 12:23:29 [error] 12056#0: *195028 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/resty/healthcheck.lua:737: attempt to call method 'report_tcp_success' (a nil value)
stack traceback:
coroutine 0:
/usr/local/share/lua/5.1/resty/healthcheck.lua: in function 'run_single_check'
/usr/local/share/lua/5.1/resty/healthcheck.lua:798: in function 'run_work_package'
/usr/local/share/lua/5.1/resty/healthcheck.lua:826: in function 'active_check_targets'
/usr/local/share/lua/5.1/resty/healthcheck.lua:907: in function </usr/local/share/lua/5.1/resty/healthcheck.lua:871>, context: ngx.timer, client: 127.0.0.1, server: 127.0.0.1:8001

Do you need to fill in the domain name in tcp mode?

lua-resty-healthcheck 1.0.0-1
lua-resty-worker-events 1.0.0-1

example:

if h.mode == "tcp" then
    h.http_domain = nil
    h.http_path = nil
end

error log

[error] 67293#3894067: *16 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/resty/healthcheck.lua:1310: table index is nil
stack traceback:
coroutine 0:
        /usr/local/share/lua/5.1/resty/healthcheck.lua: in function 'fn'
        /usr/local/share/lua/5.1/resty/healthcheck.lua:206: in function 'locking_target_list'
        /usr/local/share/lua/5.1/resty/healthcheck.lua:1296: in function 'new'

If the domain name is set to a placeholder it works fine; is this design unreasonable?

The targets table is nil when calling get_target after adding a target.

At v1.2.0, I got errors like:
[error] 46#46: 14188959 failed to run balancer_by_lua: /opt/app/test_proj/deps/share/lua/5.1/resty/healthcheck.lua:247: attempt to index field 'targets' (a nil value)
stack traceback:
/opt/app/test_proj/deps/share/lua/5.1/resty/healthcheck.lua:247: in function 'get_target'
/opt/app/test_proj/deps/share/lua/5.1/resty/healthcheck.lua:424: in function 'get_target_status'
/opt/app/test_proj/lua/ins_breaker/http/balancer.lua:334: in function 'load_balancer'
/opt/app/test_proj/lua/circuit_breaker.lua:259: in function 'http_balancer_phase'
balancer_by_lua:2: in main chunk while connecting to upstream, client: 10.23.178.7

Because of the function locking_target_list, the timer delays adding targets, so calling get_target_status to get the status of a node before it has been added produces an error.
I want to add a wait_add_target_list to store the target before it is actually added to the target_list, and to store the target's default is_healthy status.

a success will reset _all three_ failure counters doesn't work

You said "a success will reset all three failure counters" at the beginning, but it doesn't work.
Suppose http_failures is set to three: if there are two failures, then a success, but then another failure, the target will be marked as failed.
I think the problem is that the conditional statement in incr_counter, "(health_mode == "healthy" and target.healthy) or (health_mode == "unhealthy" and not target.healthy)", is wrong.

When multiple targets have the same IP:PORT, active healthcheck results for one impact them all

When routing to multiple separate applications with the same IP:PORT for ingress (as in an HA proxy), I notice that healthcheck results for the original targets seem to be ignored as I add new ones.

This impacts us heavily, as most of our APIs are deployed to a kubernetes cluster, all of which has the same IP:PORT for ingress (the host header is used to determine which specific service to send traffic to at the cluster's ingress router).

Request: add logic to the balancer to track each target's status as a hostname:port object, or hostname:ip:port object, instead of just ip:port, so that separate targets can have separate statuses.

Active health check stops working randomly

I have Kong installed on an OpenShift cluster. My upstreams are external servers (not internal OpenShift nodes).
I have configured both an active and a passive health check. The active health check is used only for targets which are unhealthy. It is set to check at a 10s interval.

After some time (randomly), active checks stop working. When targets are marked unhealthy by passive checks, they remain unhealthy until I reload/restart the entire Kong instance.

Here is how I have healthcheck config

  Healthchecks:
    Active:
      Concurrency:  1
      Healthy:
        http_statuses:
          200
          302
        Interval:                0
        Successes:               1
      http_path:                 /v1/management/health/simple
      https_verify_certificate:  false
      Timeout:                   3
      Type:                      https
      Unhealthy:
        http_failures:  0
        Interval:      10
        tcp_failures:  0
        Timeouts:      0
    Passive:
      Healthy:
        Successes:  1
      Unhealthy:
        http_failures:  1
        http_statuses:
          429
          500
          503
        tcp_failures:  1
        Timeouts:      1

As per the logs below, it seems the health checker stopped and started immediately, but it did not really perform active probes after this restart; it happens randomly.

2021/11/04 16:30:45 [debug] 25#0: *1125536 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) healthchecker stopped
2021/11/04 16:30:45 [debug] 24#0: *1125506 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) Got initial target list (0 targets)
2021/11/04 16:30:45 [debug] 24#0: *1125506 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) active check flagged as active
2021/11/04 16:30:45 [debug] 24#0: *1125506 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) starting timer to check active checks
2021/11/04 16:30:45 [debug] 24#0: *1125506 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) Healthchecker started!
2021/11/04 16:30:45 [debug] 25#0: *1125536 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) Got initial target list (2 targets)
2021/11/04 16:30:45 [debug] 25#0: *1125536 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) Got initial status healthy <ip> <ip>:<port>
2021/11/04 16:30:45 [debug] 25#0: *1125536 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) active check flagged as active
2021/11/04 16:30:45 [debug] 25#0: *1125536 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) Healthchecker started!
2021/11/04 16:30:45 [debug] 25#0: *1125536 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx) adding an existing target: <ip> <ip>:<port> (ign
oring)
2021/11/04 16:30:45 [debug] 24#0: *1125506 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=lua-resty-healthcheck [0dc6f45b-8f8d-40d2-a504-473544ee190b:<upstream xxxxxxxxxxxxx], event=clear, 
pid=24, data=table: 0x7f3487367af0
2021/11/04 16:30:45 [debug] 24#0: *1125506 [lua] healthcheck.lua:1126: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:g<upstream xxxxxxxxxxxxx) event: local cache cleared

I don't have steps to reproduce since it happens randomly

checker:event_handler fails on a "remove" event

When the self.targets table is populated, the code uses hostname or ip in case the user did not provide a hostname. However, when the same table is accessed for a "remove" event, the code simply refers to target_found.hostname, which results in an attempt to access a nil key.

Adding: or target_found.ip to that key should suffice.
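
A hypothetical sketch of that suggestion (variable names follow the description above; the rest is illustrative, not the library's actual code):

-- fall back to the IP when the target was added without a hostname
local key = target_found.hostname or target_found.ip
-- ...and use `key` wherever the "remove" branch currently uses target_found.hostname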

too many pending timers

Hi, I'm using the master branch and encountered this error:

...
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:46 [error] 6083#0: *417577 [lua] healthcheck.lua:18: add_target(): failed to add target: too many pending timers, context: init_worker_by_lua*
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
2020/04/01 16:24:48 [alert] 6083#0: 256 lua_max_running_timers are not enough
...

Is there a limit for how many targets I can add? Is it possible to add more than 2000 upstream servers?

It seems to be related to locking_target_list when calling add_target:

local _, terr = ngx.timer.at(0, run_fn_locked_target_list, self, fn)

failed to release lock

2020/12/15 14:44:37 [error] 123#0: *60053758 [lua] healthcheck.lua:1104: log(): [healthcheck] (1bdcdd5c-ecbc-4cb3-b271-c5c4a3e03f56:ae-app-56.www.hba.main) failed to release lock 'lua-resty-healthcheck:1bdcdd5c-ecbc-4cb3-b271-c5c4a3e03f56:ae-app-56.www.hba.main:target_list_lock': unlocked, context: ngx.timer

TCP health check is not making the upstream target unhealthy.

To reproduce this, I created a random service and enabled a TCP healthcheck on it.
Then I added some random IP and port which don't exist. Even though I don't see any timeout errors in the logs, the status of the upstream is never set to UNHEALTHY.

Kong version I am using: kong:2.1.3-centos

Below is my service configuration:
{ "client_certificate": null, "created_at": 1598938985, "id": "a250bcbe-3934-4825-89ab-81411ee95969", "tags": null, "name": "upstream_javaapigw", "algorithm": "round-robin", "hash_on_header": null, "hash_fallback_header": null, "host_header": null, "hash_on_cookie": null, "healthchecks": { "threshold": 100, "active": { "unhealthy": { "http_statuses": [ 429, 404, 500, 501, 502, 503, 504, 505 ], "tcp_failures": 2, "timeouts": 2, "http_failures": 1, "interval": 3 }, "type": "tcp", "http_path": "/", "timeout": 1, "healthy": { "successes": 5, "interval": 1, "http_statuses": [ 200, 302 ] }, "https_sni": null, "https_verify_certificate": true, "concurrency": 10 }, "passive": { "unhealthy": { "http_failures": 1, "http_statuses": [ 429, 500, 503 ], "tcp_failures": 1, "timeouts": 1 }, "healthy": { "http_statuses": [ 200, 201, 202, 203, 204, 205, 206, 207, 208, 226, 300, 301, 302, 303, 304, 305, 306, 307, 308 ], "successes": 0 }, "type": "tcp" } }, "hash_on_cookie_path": "/", "hash_on": "none", "hash_fallback": "none", "slots": 10000 }

timer failure: attempt to call a number value

Today I started deploying my OpenResty project, which uses lua-resty-healthcheck version 2.0.0-1.

I found many error logs, printed by my app code and caused by this line in the GitHub project. The code snippet around it is also shown below.

--- Get the current status of the target.
-- @param ip IP address of the target being checked.
-- @param port the port being checked against.
-- @param hostname the hostname of the target being checked.
-- @return `true` if healthy, `false` if unhealthy, or `nil + error` on failure.
function checker:get_target_status(ip, port, hostname)

  local target = get_target(self, ip, port, hostname)
  if not target then
    return nil, "target not found"
  end
  return target.internal_health == "healthy"
      or target.internal_health == "mostly_healthy"

end

It seemed that the function get_target failed, so I dug into the code and found that self.targets was nil when get_target_status was called. But the root cause was still unknown.

I spent almost the whole afternoon dealing with it. Finally I found an abnormal error in my Nginx error logs. It said timer failure: attempt to call a number value.

I read the source code again and found a snippet (screenshot omitted) that may be related to it.

I suspect that the pcall(args[1] ...) here is a bug. When ngx_timer_at was called, args[1] would always be the delay parameter, so the call always failed; the Lua thread then failed to get the target_list from the nginx shm, and self.targets ended up empty.

BTW, I first found this error on an Nginx server which has 24 workers, and I reproduced it locally on another Nginx server which has 5 workers. This error seems to occur on Nginx servers with more than one worker.
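
For reference, the usual ngx.timer.at callback pattern looks like this (a generic sketch, not the library's code; some_function and some_argument are illustrative names):

local function callback(premature, fn, arg)
    -- the callback receives `premature` first, followed by the extra arguments
    -- that were passed to ngx.timer.at after the delay
    if premature then
        return
    end
    return fn(arg)
end

local ok, err = ngx.timer.at(0, callback, some_function, some_argument)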

Passive healthcheck bug

file: resty/healthcheck.lua
function: incr_counter

description:
The buggy code is here:
if (health_mode == "healthy" and target.healthy) or
   (health_mode == "unhealthy" and not target.healthy) then
  -- No need to count successes when healthy or failures when unhealthy
  return true
end

When I configure a passive healthcheck without an active healthcheck, if the failure counter of a target is not zero and its current status is healthy, this bug means a successful request cannot clear the failure counter.

I resolved this bug in this way:

local nokCounter = self.shm:get(get_shm_key(self.TARGET_NOKS, ip, port))
if (health_mode == "healthy" and target.healthy and (not nokCounter or nokCounter == 0)) or
   (health_mode == "unhealthy" and not target.healthy) then
  -- No need to count successes when healthy or failures when unhealthy
  return true
end

Are there any plans to use version 2.0 in Kong?

As I understand it, version 2.0 of lua-resty-healthcheck uses a shared dict to create the health checker timer in only one worker process (a system-wide timer). The previous version, 1.6.2, creates a dedicated health checker timer per worker process. In some cases this causes too much stress on upstream services when we run a large number of Kong pods.

Is there any plan for Kong to upgrade to version 2.0 to make use of Kong node-level timers?

Log Level of active succeeding health checks?

Does it make sense for successful active health-checks to log with warn and not info or debug?

2019/01/13 06:03:09 [warn] 34#0: *272118 [lua] healthcheck.lua:989: log(): [healthcheck] (test_upstream) healthy SUCCESS increment (1/3) for 10.xxx.xxx.xxx:443, context: ngx.timer

I think for unhealthy it makes sense to log a warn/error, but when things are working correctly, is there a reason to write to the log when running Kong in notice mode? (And since these are warn, they all flood my terminal view, since I point Kong's logs to stdout.)

Testing using the Kong 1.0 release.

check return codes of lua-resty-worker-events calls

Check return codes and log any errors in worker_events calls.

Note that this requires a bump on the lua-resty-worker-events dependency, as the return codes are different between lua-resty-worker-events 0.x and 1.x.

suggestion: make the periodic lock time configurable (change the hard-coded 0.001)

local ok, err = self.shm:add(key, true, interval - 0.001)

I was testing lua-resty-healthcheck and found that sometimes there is only one check between two intervals (only one worker starts the health checker). This is because the checker runs too fast: the time consumed is less than 0.001s. So I suggest making the shm key expiry time configurable, as sketched below.
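
A hypothetical sketch of the suggestion (lock_margin is an illustrative option name, not an existing one):

-- subtract a configurable margin instead of the hard-coded 0.001
local margin = opts.lock_margin or 0.001
local ok, err = self.shm:add(key, true, interval - margin)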

init_worker_by_lua error: /usr/local/openresty/lualib/resty/lock.lua:153: API disabled in the context of init_worker_by_lua*

Environment

OS: Ubuntu 18.04.3 LTS with all updates
Lua: 5.1.5
lua-resty-healthcheck 1.1.0-1 via luarocks
Nginx version: openresty/1.15.8.2 (./configure -j2 --with-pcre-jit --with-ipv6 --add-module=../ngx_lua_ipc --with-http_sub_module --with-http_v2_module)

Error

When restarting or reloading an instance configured with resty-healthcheck, the following error occurs most of the time:

Nov 19 19:01:09 host.tld nginx[1959]: 2019/11/19 19:01:07 [error] 1973#0: init_worker_by_lua error: /usr/local/openresty/lualib/resty/lock.lua:153: API disabled in the context of init_worker_by_lua*
Nov 19 19:01:09 host.tld nginx[1959]: stack traceback:
Nov 19 19:01:09 host.tld nginx[1959]:         [C]: in function 'sleep'
Nov 19 19:01:09 host.tld nginx[1959]:         /usr/local/openresty/lualib/resty/lock.lua:153: in function 'lock'
Nov 19 19:01:09 host.tld nginx[1959]:         /usr/local/share/lua/5.1/resty/healthcheck.lua:195: in function 'locking_target_list'
Nov 19 19:01:09 host.tld nginx[1959]:         /usr/local/share/lua/5.1/resty/healthcheck.lua:1307: in function 'new'
Nov 19 19:01:09 host.tld nginx[1959]:         /usr/local/share/lua/5.1/custommodule.lua:24293: in function 'init_worker'
Nov 19 19:01:09 host.tld nginx[1959]:         init_worker_by_lua:2: in main chunk

Snippets

nginx.conf:

http {
    [...]
    init_by_lua_block {
      cm = require 'custommodule'
      cm.init()
    }
    init_worker_by_lua_block {
      cm.init_worker()
    }
    [...]
}

custommodule.lua:

local _M = {}
local origins = {}
local healthchecker = {}

[...]

function _M.init_worker()
  -- Attention: Both modules must be required in init_worker().
  -- If one of them is required in global module space (e.g. Nginx init), resty.worker.events fails to manage its internal _callback list
  local we = require "resty.worker.events"
  local hc = require "resty.healthcheck"

  -- worker events
  -- add dummy handler
  local handler = function(target, eventname, sourcename, pid)
  end
  we.register(handler)

  -- configure worker events
  local ok, err = we.configure{ shm = "worker_events", interval = 0.1 }
  if not ok then
    ngx.log(ngx.ERR, "Failed to configure worker events: ", err)
    return
  end

-- 8< --
-- All code below is repeated for each handled domain.
-- At failing instance: 3x
-- >8 --

  -- upstream health checks
  -- domain: domain1.tld
  -- init checker
  local checker = hc.new({
    name = "domain1.tld",
    shm_name = "healthchecks",
    type = "https",
    checks = {
      active = {
        http_path = '/',
        healthy = {
          interval = 10,
          successes = 1,
        },
        unhealthy = {
          interval = 5,
          http_statuses = { 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410,
                            410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420,
                            420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430,
                            431, 451,
                            500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510,
                            511 },
          tcp_failures = 2,
          http_failures = 2,
          timeouts = 2,
        }
      },
      passive = {
        healthy  = {
          successes = 1,
        },
        unhealthy  = {
          http_statuses = { 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410,
                            410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420,
                            420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430,
                            431, 451,
                            500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510,
                            511 },
          tcp_failures = 2,
          http_failures = 2,
          timeouts = 2,
        }
      }
    }
  })
  -- clear data to avoid broken synchronisation between health checker and balancer
  -- on nginx reload the healthchecker keeps its state and the balancer loses it
  -- checker:clear()
  -- ^^ not necessary anymore because of the workaround below

  -- add event handler for checker
  local handler = function(target, eventname, sourcename, pid)
    if not target then
      return
    end

    local domain = target.hostname
    local origin_host = target.ip
    if eventname == checker.events.remove then
      -- a target was removed
      local ok, err = pcall(function () origins[domain][1]:delete(origin_host) end)
      if not ok then
        ngx.log(ngx.WARN, "Deleting balancer node ", origin_host, " from domain: ", domain, " failed: ", err)
      else
        ngx.log(ngx.DEBUG, "Balancer node ", origin_host, " of domain: ", domain, " deleted")
      end
    elseif eventname == checker.events.healthy then
      -- target changed state, or was added
      local ok, err = pcall(function () origins[domain][1]:set(origin_host, 1) end)
      if not ok then
        ngx.log(ngx.WARN, "Setting balancer node ", origin_host, " for domain: ", domain, " failed: ", err)
      else
        ngx.log(ngx.DEBUG, "Balancer node ", origin_host, " for domain: ", domain, " added")
      end
    elseif eventname ==  checker.events.unhealthy then
      -- target changed state, or was added
      local ok, err = pcall(function () origins[domain][1]:delete(origin_host) end)
      if not ok then
        ngx.log(ngx.WARN, "Balancer delete for domain: ", domain, " failed: ", err)
      else
        ngx.log(ngx.DEBUG, "Balancer node ", origin_host, " of domain: ", domain, " deleted")
      end
    end
  end
  we.register(handler)

  -- add origin nodes
  -- special handling for nginx reload
  -- the health checker keeps state from the previous instance, the balancer does not.
  -- we use the previously known state to resend healthy and unhealthy events to resync balancer node states
  -- this avoids the also-possible use of checker:clear to reset the internal health states. The advantage of this
  -- approach is that we don't lose the knowledge about unhealthy hosts.
  local healthy, err = checker:get_target_status("192.0.2.1", 443)
  if not err then
    checker:set_target_status("192.0.2.1", 443, healthy)
  end
  -- add new nodes in case of newly configured origins after reload or just a normal (re)start of nginx
  local ok, err = checker:add_target("192.0.2.1", 443, "domain1.tld", true)
  if err then
    ngx.log(ngx.ERR, 'Error adding target 192.0.2.1 to healthchecker domain1.tld: ', err)
  end

  -- store in module wide register
  healthchecker["domain1.tld"] = checker

  [...]
end

Question

I'm not sure if I'm just holding it wrong or if there is a bigger architectural problem buried here, because @spacewander mentioned in openresty/lua-nginx-module#1210 that:

lua-resty-lock uses ngx.sleep internally, and ngx.sleep could not be used in init_by_worker* phase.

Should we check the version of lua-resty-events?

In 1.6.2 there is a line:

local RESTY_EVENTS_VER = [[^0\.1\.\d+$]]

It says that the library only supports 0.1.x version of lua-resty-events.

Is this necessary? If lua-resty-events bumps its version, this will break.

Question: How could we install this package using opm?

I'm pretty new to the OpenResty/Lua world. As I understand it, this library is distributed via LuaRocks, whereas OpenResty recommends opm as its package manager. How should I go about installing this library to try it on OpenResty? Thanks.
