Problem: It's not clear how hax should handle EP requests if there is no confd

typeundefined commented on August 25, 2024

Problem: It's not clear how hax should handle EP requests if there is no confd

from cortx-hare.

Comments (26)

typeundefined commented on August 25, 2024

closed

from cortx-hare.

typeundefined commented on August 25, 2024

This ticket has become somewhat outdated. hax is now able to construct a correct EP reply even if the confd process is actually down (thanks to the updated confd health-check scripts for Consul).

Closing the ticket.

from cortx-hare.

max-seagate commented on August 25, 2024

How should hax process the entrypoint request if the confd service is down and hence no leader node can be found in Consul?

It doesn't matter whether confd is down or not. By the time the entrypoint reply is received, the state of the confds may already have changed.
What matters is that confds are present in the cluster configuration. You should send confds in the list regardless of their online/non-online state.
If there is a use case to start m0_halon_interface with 0 confds in the cluster I can add a special case to m0_halon_interface.
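
To illustrate the point above, here is a minimal sketch of building the confd list for an entrypoint reply straight from the cluster configuration, with no filtering on health. `ConfdEntry`, `confds_for_reply`, and the fid and endpoint values are all hypothetical names and made-up data for illustration; they are not taken from hax or Mero.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConfdEntry:
    # Hypothetical record; not a real hax or Mero type.
    fid: str
    endpoint: str
    online: bool


def confds_for_reply(configured: List[ConfdEntry]) -> List[ConfdEntry]:
    """Return every configured confd, whatever its current health."""
    # Deliberately no filtering on `online`: the state may change before
    # the requester even reads the reply.
    return list(configured)


# Made-up values, for illustration only.
cluster = [
    ConfdEntry("0x7300000000000001:0x1", "10.0.0.1@tcp:12345:44:101", online=False),
    ConfdEntry("0x7300000000000001:0x2", "10.0.0.2@tcp:12345:44:101", online=True),
]
reply_confds = confds_for_reply(cluster)
print(len(reply_confds))
```

The offline confd stays in the list, which is the behaviour the comment asks for.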

from cortx-hare.

andriytk commented on August 25, 2024

For example, SNS-repair/rebalance management logic.

Consul has its own leader elected by Raft. It enables Consul to work and to serve Hare, but Hare's own logic is not related to it in any way.

"RC Leader" term seems Ok to me, as of now.

from cortx-hare.

vvv commented on August 25, 2024

mentioned in issue #91

from cortx-hare.

vvv commented on August 25, 2024

RC is the same as RC in Halon. [...] It is Hare's "brain" that executes its logic.

What logic? Can you be more specific?

Is “RC” an umbrella term for numerous Consul watches, which Hare configures for Mero cluster?

It would be less confusing if we called this thing by its true name: the Consul leader. Halon terminology does not map directly to Hare design. The artificial “RC” alias is confusing and unnecessary; it hinders understanding.

Update: These questions are off-topic and should be discussed separately.

from cortx-hare.

andriytk commented on August 25, 2024

RC is the same as RC in Halon. There can be only one instance of RC running in the cluster; we call it the "leader". It is Hare's "brain" that executes its logic.

Mero's principal RM service, which runs in one of the confd processes, should also be the only one in the cluster. So we decided to co-locate it with the RC leader.

Now, what happens if, say, the confd process dies for some reason? We should select a new principal RM service among the remaining confd processes, right? This selection is similar to the RC leader election, so we simply combine the two: the RC leader's session depends on the confd health check, so if confd fails, the session fails and a new RC leader election starts.
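
The coupling described above can be sketched with a toy in-memory model. This is not Consul and not Hare code: `ToyLeaderLock` and the node names are hypothetical, and the model only assumes what the comment states, namely that a failing confd check invalidates the leader's session and frees the leader key for the next candidate.

```python
from typing import Dict, Optional


class ToyLeaderLock:
    """In-memory stand-in for a session-guarded leader key (not Consul itself)."""

    def __init__(self) -> None:
        self.healthy: Dict[str, bool] = {}  # node -> is its confd check passing?
        self.leader: Optional[str] = None

    def set_health(self, node: str, passing: bool) -> None:
        self.healthy[node] = passing
        # A failed check invalidates the holder's session, releasing the lock.
        if not passing and self.leader == node:
            self.leader = None

    def try_acquire(self, node: str) -> bool:
        # Only a node with a passing confd check may take the leader key.
        if self.leader is None and self.healthy.get(node, False):
            self.leader = node
        return self.leader == node


lock = ToyLeaderLock()
lock.set_health("node-1", True)
lock.set_health("node-2", True)
first = lock.try_acquire("node-1")   # node-1 becomes the RC leader
lock.set_health("node-1", False)     # its confd dies, the session is invalidated
second = lock.try_acquire("node-2")  # node-2 wins the new election
print(lock.leader)
```

On a single node there is no second candidate, so after the confd check fails the leader key simply stays empty.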

See also Leader Election and the RC section in the Design Highlights gdoc.

Hope this helps.

from cortx-hare.

vvv commented on August 25, 2024

@andriy.tkachuk What is the “RC leader”? The “confd” m0d that is first to become online?

from cortx-hare.

andriytk commented on August 25, 2024

It is an interesting kind of chicken-and-egg problem: the RC leader is needed for the EP reply (we need to know the principal RM service for it), but the RC leader cannot be elected without the m0d-confd process running.

How about the following solution: initially, before the m0d-confd process is started, we use the /tmp/confd file to simulate confd presence and to be able to respond to EP requests. As soon as the m0d-confd process is started, the /tmp/confd file is removed and the health checker starts checking the real m0d-confd process. Does it make sense?

from cortx-hare.

vvv commented on August 25, 2024

changed title from Problem: It's not clear how {-should hax handle entrypoint req-}s if there is no confd to Problem: It's not clear how {+hax should handle EP request+}s if there is no confd

from cortx-hare.

vvv commented on August 25, 2024

if I send entrypoint request reply with rc = EAGAIN the request is repeated without a delay. From the logs it looks like a DDoS attack

Hmm... OK, we've got several options here:

1. Reply to the pesky m0d with a poisonous entrypoint reply. (“Greetings, overeager entrypoint requester! We are happy to inform you that YOU and you alone are both the principal RM and the only confd of this cluster. Go ahead and use this information [and die in agony]. Happy crashing!”)
2. If hax discovers that the RC leader is offline, it loiters a bit before replying with EAGAIN error code. m0d won't send another entrypoint request until it hears from hax.
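
A rough sketch of the "loiter before EAGAIN" option, under stated assumptions: `reply_code_after_wait`, `get_leader`, `max_wait` and `poll_interval` are hypothetical names and defaults, not hax APIs. The sleep function is injectable so the logic can be exercised without real delays.

```python
import errno
import time
from typing import Callable, Optional


def reply_code_after_wait(
    get_leader: Callable[[], Optional[str]],
    max_wait: float = 5.0,
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return 0 once a leader is known, or EAGAIN after waiting up to max_wait."""
    waited = 0.0
    while True:
        if get_leader() is not None:
            return 0
        if waited >= max_wait:
            # Holding the reply back throttles the requester's retries.
            return errno.EAGAIN
        sleep(poll_interval)
        waited += poll_interval


slept = []
rc = reply_code_after_wait(lambda: None, max_wait=1.0, poll_interval=0.5,
                           sleep=slept.append)
print(rc == errno.EAGAIN, slept)
```

Whether it is acceptable to block the thread that runs the entrypoint callback for this long is a separate question that this sketch does not answer.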

from cortx-hare.

typeundefined commented on August 25, 2024

@vvv BTW if I send entrypoint request reply with rc = EAGAIN the request is repeated without a delay. From the logs it looks like a DDoS attack :-)

2019-08-02 15:16:10,051 [DEBUG] {Dummy-2} Started processing entrypoint request from remote eps = '10.230.164.213@tcp:12345:45:1', process_fid = 0x7200000000000001:0x0
2019-08-02 15:16:10,053 [DEBUG] {Dummy-2} Starting new HTTP connection (1): 127.0.0.1:8500
2019-08-02 15:16:10,054 [DEBUG] {Dummy-2} http://127.0.0.1:8500 "GET /v1/kv/leader HTTP/1.1" 200 124
2019-08-02 15:16:10,055 [ERROR] {Dummy-2} Failed to get the data from Consul. Replying with EAGAIN error code.
Traceback (most recent call last):
  File "/home/720599/projects/hare/hax/hax/halink.py", line 93, in _entrypoint_request_cb
    sess = prov.get_leader_session()
  File "/home/720599/projects/hare/hax/hax/util.py", line 51, in get_leader_session
    raise HAConsistencyException('Could not get the leader from Consul')
hax.exception.HAConsistencyException: Could not get the leader from Consul
In m0_ha_entrypoint_reply_send
2019-08-02 15:16:10,055 [DEBUG] {Dummy-2} Reply sent
In entrypoint_request_cb
Module loaded? 1
Here - 1
Here - 2
2019-08-02 15:16:10,061 [DEBUG] {Dummy-2} Started processing entrypoint request from remote eps = '10.230.164.213@tcp:12345:45:1', process_fid = 0x7200000000000001:0x0
2019-08-02 15:16:10,064 [DEBUG] {Dummy-2} Starting new HTTP connection (1): 127.0.0.1:8500
2019-08-02 15:16:10,065 [DEBUG] {Dummy-2} http://127.0.0.1:8500 "GET /v1/kv/leader HTTP/1.1" 200 124
2019-08-02 15:16:10,065 [ERROR] {Dummy-2} Failed to get the data from Consul. Replying with EAGAIN error code.
Traceback (most recent call last):
  File "/home/720599/projects/hare/hax/hax/halink.py", line 93, in _entrypoint_request_cb
    sess = prov.get_leader_session()
  File "/home/720599/projects/hare/hax/hax/util.py", line 51, in get_leader_session
    raise HAConsistencyException('Could not get the leader from Consul')
hax.exception.HAConsistencyException: Could not get the leader from Consul
In m0_ha_entrypoint_reply_send
2019-08-02 15:16:10,066 [DEBUG] {Dummy-2} Reply sent
In entrypoint_request_cb
Module loaded? 1
Here - 1
Here - 2
2019-08-02 15:16:10,072 [DEBUG] {Dummy-2} Started processing entrypoint request from remote eps = '10.230.164.213@tcp:12345:45:1', process_fid = 0x7200000000000001:0x0
2019-08-02 15:16:10,074 [DEBUG] {Dummy-2} Starting new HTTP connection (1): 127.0.0.1:8500
2019-08-02 15:16:10,075 [DEBUG] {Dummy-2} http://127.0.0.1:8500 "GET /v1/kv/leader HTTP/1.1" 200 124
2019-08-02 15:16:10,076 [ERROR] {Dummy-2} Failed to get the data from Consul. Replying with EAGAIN error code.
Traceback (most recent call last):
  File "/home/720599/projects/hare/hax/hax/halink.py", line 93, in _entrypoint_request_cb
    sess = prov.get_leader_session()
  File "/home/720599/projects/hare/hax/hax/util.py", line 51, in get_leader_session
    raise HAConsistencyException('Could not get the leader from Consul')
hax.exception.HAConsistencyException: Could not get the leader from Consul
In m0_ha_entrypoint_reply_send
2019-08-02 15:16:10,076 [DEBUG] {Dummy-2} Reply sent

from cortx-hare.

vvv commented on August 25, 2024

For the EES product we will depend on Pacemaker to handle availability issues. The solution here, AFAIU, is to make the Consul health check handler report confd's failure to Pacemaker and let the latter do the failover.

cc @nikita.danilov

from cortx-hare.

vvv commented on August 25, 2024

The simplest thing to do in this situation is for hax to send EAGAIN error back to the entrypoint requester (m0d).

I don't know how m0d will behave once it receives this error though. Keep sending entrypoint requests? Give up?

from cortx-hare.

vvv commented on August 25, 2024

But this doesn't address the initial issue.

Right, I have created a separate issue (#65) for the self-entrypoint problem.

from cortx-hare.

typeundefined commented on August 25, 2024

But this doesn't address the initial issue.
Let's imagine that:

1. hax receives an entrypoint request (no matter from whom),
2. and at the same moment the leader node goes down and, for some reason, Consul doesn't have a new leader for some time (in my case, forever, since this is a singlenode setup).
3. But we've got the pending entrypoint request.

Should I reply to it?
What will the reply look like?

from cortx-hare.

vvv commented on August 25, 2024

@konstantin.nekrasov As a workaround, hax may send a hard-coded reply to an entrypoint request that comes from itself, so as not to contact Consul for information that it (hax) doesn't use anyway.

from cortx-hare.

typeundefined commented on August 25, 2024

But this is not a question for hax to decide. It just receives an entrypoint request (even one triggered by the very fact of launching hax).

from cortx-hare.

vvv commented on August 25, 2024

cc @max-seagate.medved

from cortx-hare.

vvv commented on August 25, 2024

the entrypoint request from "myself" comes into hax

This design is bad. Only users of Mero configuration (those who have confc cache) require entrypoint information.

hax does not cache Mero configuration, it is a mere bridge. For hax to send entrypoint request to itself is wrong.

from cortx-hare.

typeundefined commented on August 25, 2024

OK, but my question was narrower:

0. I remove the /tmp/confd file so that Consul learns that the confd service is dead.
1. I start hax.
2. Because the launched process is linked with libmero, the entrypoint request from "myself" comes into hax.
3. hax is unable to get the leader from Consul (I guess because I have a singlenode configuration).
4. According to the current logic, when no leader is found, a Python exception is thrown and no entrypoint request reply is sent back.
5. Libmero waits for the reply forever. This happens in Mero's thread, which has acquired the GIL, so no Python thread switching happens. As a side effect, Ctrl-C is not handled (since it can be handled from the main thread only).

In general this situation doesn't look sane, and I just want to understand how to construct an entrypoint request reply which says "Oops, we've got a problem here, this node is not functional".
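
One way to avoid the hang in the steps above is to make sure the callback sends exactly one reply on every path, including the failure path. The sketch below is only an illustration: `handle_entrypoint_request`, `NoLeaderError` and the `send_reply(rc, leader)` signature are hypothetical and do not reflect the real hax or libmero interfaces.

```python
import errno
from typing import Callable, List, Optional, Tuple


class NoLeaderError(Exception):
    """Hypothetical stand-in for hax's 'could not get the leader' failure."""


def handle_entrypoint_request(
    get_leader: Callable[[], str],
    send_reply: Callable[[int, Optional[str]], None],
) -> None:
    """Look up the leader and always send exactly one reply."""
    try:
        leader = get_leader()
    except NoLeaderError:
        # Without this branch the requester would wait for a reply forever.
        send_reply(errno.EAGAIN, None)
        return
    send_reply(0, leader)


def _no_leader() -> str:
    raise NoLeaderError("Could not get the leader from Consul")


replies: List[Tuple[int, Optional[str]]] = []
handle_entrypoint_request(_no_leader, lambda rc, leader: replies.append((rc, leader)))
handle_entrypoint_request(lambda: "node-1", lambda rc, leader: replies.append((rc, leader)))
print(replies)
```

The error code used here is EAGAIN only because that is the code discussed elsewhere in this thread; which code the requester handles best is not settled by this sketch.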

from cortx-hare.

vvv commented on August 25, 2024

changed the description

from cortx-hare.

vvv commented on August 25, 2024

changed title from {-hax hangs when no leader can be found in Consul-} to {+Problem: It's not clear how should hax handle entrypoint reqs if there is no confd+}

from cortx-hare.

vvv commented on August 25, 2024

My guess is that as soon as m0d learns that there is no quorum of confds, it will clear its confc cache and resend entrypoint request to hax. @konstantin.nekrasov, do your observations support this theory?

from cortx-hare.

vvv commented on August 25, 2024

Confd services are being health-checked by Consul. Current implementation of this check is primitive: confd is considered healthy (by Consul) iff /tmp/confd file exists.

Removal of /tmp/confd makes Consul believe that the confd has failed. Consul's check handler sends HTTP POST request to hax.
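
For reference, a file-existence check like the one described above could look roughly as follows. This is a sketch, not the actual Hare check script: `confd_check` is a hypothetical name. It relies on the convention Consul uses for script checks, where exit code 0 means passing, 1 means warning, and any other code means critical.

```python
import os
import tempfile

# Consul script checks: 0 = passing, 1 = warning, any other code = critical.
PASSING, CRITICAL = 0, 2


def confd_check(marker_path: str = "/tmp/confd") -> int:
    """Exit code for a file-existence health check."""
    return PASSING if os.path.exists(marker_path) else CRITICAL


# Exercise the check against a temporary marker file instead of /tmp/confd.
with tempfile.TemporaryDirectory() as tmp_dir:
    marker = os.path.join(tmp_dir, "confd")
    open(marker, "w").close()
    alive = confd_check(marker)
    os.remove(marker)
    dead = confd_check(marker)
print(alive, dead)
```

A real check script would end with `sys.exit(confd_check())` so that Consul sees the exit code.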

When hax receives this HTTP request, it should create a HA state update notification message (m0_ha_msg with m0_ha_msg_nvec payload) and send it to all connected m0d-s. I think that the payload should be “Process object such-and-such and all its children objects — services and sdevs — have become M0_NC_FAILED”; you may check Halon code to know for sure.

When m0d receives such HA state update and learns from it that a confd has failed, it (m0d) checks if there is still a quorum of confd services in the cluster. Singlenode configuration, which you experiment on, has one confd; when this confd is gone, there will be no quorum.

Now what happens when there is no quorum of confd-s?.. I don't know the answer, I can only guess. It is better to check Mero code.

from cortx-hare.

vvv commented on August 25, 2024

changed the description

from cortx-hare.
