What happened? I'm running a 2 core node cluster on fly.io my

Nodes can't communicate in static cluster (emqx/emqx)

Comments (32)

Rotario commented on May 30, 2024 1

Yeah, of course. Send it over and I can try to run it. I hope it's the last thing to sort out before I start trying it in production.

from emqx.

Rotario commented on May 30, 2024 1

Thanks for your help @zmstone !

** Cannot get connection id for node '[email protected]'
error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1098}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

Rotario commented on May 30, 2024 1

Ah OK, brilliant, thank you! So I should set the node names to the IP, not to the FQDN.

id commented on May 30, 2024

Yes, nxdomain in the logs indicates that it's a DNS issue.

Rotario commented on May 30, 2024

Cool thanks I'll close this and investigate

Rotario commented on May 30, 2024

Hi - I still have this issue. I've installed dnsutils on the production server (fly.io), and when I run
nslookup xxxx
the IP is resolved correctly.
However, I still get the error intermittently. I suspect EMQX somehow doesn't handle AAAA records properly.

Fly is pretty good at resolving these hostnames correctly, so I think it might be an issue with EMQX.

2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032192+00:00 [error] event=connect_to_remote_server, [email protected], port=5369, reason=nxdomain
2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032304+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.11531.1>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2712; neighbours:

kjellwinblad commented on May 30, 2024

Hi,

Did you test with a version that includes the AAAA record fix, #12467? I don't think that fix is included in version 5.4.1, as it was only recently merged into the master branch.

Rotario commented on May 30, 2024

I'm running from a Docker image. Can I just copy the 5.5 Dockerfile and replace the EMQX version with master?

ieQu1 commented on May 30, 2024

Not sure if the above PR will help:

It only affects the DNS discovery strategy.
It only adds a parameter to the configuration schema and doesn't change anything in the underlying code.

ieQu1 commented on May 30, 2024

Can you post the full response from nslookup? If you don't want to expose the data publicly, my GPG key is https://keyserver.ubuntu.com/pks/lookup?search=488654DF3FED6FDE&fingerprint=on&op=index .

Rotario commented on May 30, 2024

Sure - from VM1

root@6e82e56f2e3038:/opt/emqx# nslookup 2874de3c1675d8.vm.emqx.internal
Server:		fdaa::3
Address:	fdaa::3#53

Name:	2874de3c1675d8.vm.emqx.internal
Address: fdaa:0:xxxx:xxxx:xxxx:4935:2

from VM2

root@2874de3c1675d8:/opt/emqx# nslookup 6e82e56f2e3038.vm.emqx.internal
Server:		fdaa::3
Address:	fdaa::3#53

Name:	6e82e56f2e3038.vm.emqx.internal
Address: fdaa:0:xxxx:xxx:xx:xxxx:e24:2
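
One possible way to reconcile a working nslookup with an nxdomain in the EMQX log is that the two lookups ask for different record types: nslookup returns an AAAA answer here, while a lookup restricted to IPv4 would find nothing for a name that only has AAAA records. The following is a minimal Python sketch of that check and assumes nothing about EMQX internals; `resolve_family`, `compare_families`, and the injectable `resolver` parameter are illustrative names, not part of any EMQX tooling.

```python
import socket


def resolve_family(host, family, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses `host` resolves to for one address family.

    An empty list means the lookup failed for that family.
    """
    try:
        infos = resolver(host, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def compare_families(host, resolver=socket.getaddrinfo):
    """Resolve `host` separately over IPv4 (A records) and IPv6 (AAAA records)."""
    return {
        "A": resolve_family(host, socket.AF_INET, resolver),
        "AAAA": resolve_family(host, socket.AF_INET6, resolver),
    }
```

Calling `compare_families("2874de3c1675d8.vm.emqx.internal")` on each VM would show whether the name has only AAAA records. If `"A"` comes back empty while `"AAAA"` does not, any component that resolves over IPv4 only could fail even though nslookup succeeds.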

ieQu1 commented on May 30, 2024

If I understand correctly, the problem is intermittent, since the nodes can communicate after restart. I assume that the IP addresses don't change. This suggests a temporary problem with the name resolution.

What distro are you running? Does it have a local DNS cache, like systemd-resolved? Also, what is the TTL for the AAAA record?

Rotario commented on May 30, 2024

Hi, thanks for your reply. It seems to be erroring pretty consistently now. I don't know whether a config change might have broken it.
I haven't changed any of the cluster settings from those above.

The distro is the Fly machine Firecracker - I don't know how to check the DNS cache. nslookup seems to resolve it fine. I'll check the TTL.

Rotario commented on May 30, 2024

I'm going to wait for the DNS discovery AAAA feature to be released and try that

Rotario commented on May 30, 2024

Hi - I've got the master branch working now on the cloud provider

  EMQX_CLUSTER__DNS__NAME = "emqx.internal"
  EMQX_CLUSTER__DNS__RECORD_TYPE = "aaaa"
  EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "dns"

And I'm still getting nxdomain issues - I think it's now interpreting this IPv6 address as a domain name? Does it need to be bracketed?

2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952226+00:00 [error] event=connect_to_remote_server, peer=emqx@fdaa:0:xxxxxxxxxxx:2, port=5369, reason=nxdomain
2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952315+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.12092.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,961}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2158.0>], message_queue_len: 0, messages: [], links: [<0.2164.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2615; neighbours:

It did seem to be working fine
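
On the bracketing question: the text after `@` in the logged peer name is an IPv6 literal rather than a hostname, so a lookup that treats it as a DNS name would be expected to fail. Here is a small Python sketch that tells the two cases apart using only the standard `ipaddress` module. `split_node_name` and `host_kind` are hypothetical helpers, and whether EMQX expects brackets is not something this sketch answers.

```python
import ipaddress


def split_node_name(node):
    """Split an Erlang-style node name 'name@host' at the first '@'."""
    name, sep, host = node.partition("@")
    if not sep or not name or not host:
        raise ValueError(f"not a name@host node name: {node!r}")
    return name, host


def host_kind(node):
    """Classify the host part of a node name as 'ipv4', 'ipv6', or 'hostname'."""
    _, host = split_node_name(node)
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return "hostname"  # not an IP literal, so it would need DNS resolution
    return "ipv6" if addr.version == 6 else "ipv4"
```

A caller could use `host_kind` to decide whether to resolve the host part through DNS or to connect to it directly as an address.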

Rotario commented on May 30, 2024

I've changed the node name from emqx@ipv6_of_machine to emqx@FQDN of machine,
and now I'm getting:

Cannot get connection id for node '[email protected]'

Rotario commented on May 30, 2024

Update:

I found the RPC settings and set it to listen only on IPv6, on ::

Now the nxdomain issues seem to be fixed. The error I get now is the one below.
Any ideas, please? It seems connected to the EMQX_HOST and EMQX_NODE__NAME settings, but I've played around with them and can't get this error to stop.

2024-02-09T17:03:56Z app[6e82e56f2e3038] lhr [info]2024-02-09T17:03:56.166004+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-09T17:04:02Z app[2874de3c1675d8] lhr [info]2024-02-09T17:04:02.986175+00:00 [error] ** Cannot get connection id for node '[email protected]'

Rotario commented on May 30, 2024

The cluster seems to be working fine, with messages and subscriptions passing through transparently, though the "Cannot get connection id" error is still being logged.

ieQu1 commented on May 30, 2024

I recall that the "Cannot get connection id" message originates inside the Erlang runtime... I'll have to take a deep dive into the Erlang code to find the precise conditions that trigger it and what the implications of that error are.

Rotario commented on May 30, 2024

It seems related to nodes trying to connect to themselves, which makes sense from the logs.
Maybe the automatic DNS discovery mechanism with IPv6 makes a node try to connect to itself? Perhaps some comparison that should return true, to stop a node from connecting to itself, doesn't compare IPv6 addresses properly?

Searching on GitHub, I can't find Node.connect anywhere, so I can't find the place where nodes connect to each other.

2024-02-18T18:09:01Z app[6e82e56f2e3038] lhr [info]2024-02-18T18:09:01.339236+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-18T18:09:08Z app[2874de3c1675d8] lhr [info]2024-02-18T18:09:08.114719+00:00 [error] ** Cannot get connection id for node '[email protected]'

bitwalker/libcluster#70
https://github.com/mrluc/peerage/pull/17/files

zmstone commented on May 30, 2024

Found this fix (in the links @Rotario shared above): erlang/otp#1870. It was merged long ago, so EMQX 5.4.1 should already have the fix in place.
Maybe there are other call paths that it doesn't cover.

zmstone commented on May 30, 2024

Hi again @Rotario
"Cannot get connection id" has happened before in IPv4 networks too, but I never managed to find the cause.
If you are open to running some debug tests, I can provide a patch, plus steps to install it, to help the investigation.
The patch will try to print the stacktrace when a node tries to self-connect.

zmstone commented on May 30, 2024

Hi @Rotario

Here is the beam file in a zipped dir: net_kernel.zip
Sha256sum of the beam file (not the zip): 44551fcb70c1d58a9e2ab8430a968d65de37552e1e5d7012dcf0cf6ddbefba0a
Code diff: zmstone/otp@41dcd06

Extract the file net_kernel.beam, and mount it to replace /opt/emqx/lib/kernel-8.5.4/ebin/net_kernel.beam
Or commit a new docker tag with this file replaced in the container.

This patch adds another log line after "Cannot get connection id" which should include more error context as well as the stacktrace.
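
Before mounting the replacement file, it may be worth checking it against the checksum quoted above. This is a minimal Python sketch using only the standard library; the `sha256_of` and `matches_expected` helper names are illustrative, and the path passed in is whatever location the beam file was extracted to.

```python
import hashlib

# Checksum quoted above for net_kernel.beam (the beam file, not the zip).
EXPECTED_SHA256 = "44551fcb70c1d58a9e2ab8430a968d65de37552e1e5d7012dcf0cf6ddbefba0a"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected checksum (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("net_kernel.beam")` should return `True` for an intact download, assuming the file was extracted into the current directory.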

zmstone commented on May 30, 2024

Thank you @Rotario.
The log narrowed it down a bit.
Now we know that it's logged when accepting a new connection from the Erlang distribution listener.
We might need to trace/debug the connection initiator side, but I need to dig further into the code to find where to trace.

In the meantime, could you help test with this patch? dist-debug.zip.
It includes more debug logs; see the commits here: https://github.com/zmstone/otp/commits/log-stacktrace-when-cannot-get-connection-id-happens/

Also would like to ask:

Does it happen if you start a single node?
How often do you see this logged?

Example (normal) logs from the patch:

v5.4.1([email protected])1> 110989263 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
{"MD5 connection from ~p~n",['[email protected]']}
{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
110990781 '[email protected]':do_mark_pending(<0.2016.0>,'[email protected]','[email protected]',55966662589) -> ok
{dist_util,<0.3250.0>,send_status,'[email protected]',ok}
110991845 '[email protected]':send: 'N' challenge=780365816 creation=4
110994512 '[email protected]':recv_reply: challenge=631707742 digest=[6,93,82,221,42,134,231,21,140,166,246,
                                        161,49,183,233,112]
110994717 '[email protected]':sum = <<6,93,82,221,42,134,231,21,140,166,246,161,49,183,233,112>>
110994893 '[email protected]':send_ack: digest=<<25,210,44,53,117,211,215,113,115,210,118,160,255,201,32,218>>
{dist_util,<0.3250.0>,accept_connection,'[email protected]'}
110995605 '[email protected]':setnode: node='[email protected]' port=#Port<0.18> flags=55966662589(normal) creation=4

zmstone commented on May 30, 2024

Some more information for you to troubleshoot the network:
An EMQX node resolves the peer's port from the node name; in your case, both nodes should be listening on port 4370.
(A node named emqx1@hostname listens on port 4371, and so on.)
If node [email protected] tries to connect to 6e82e56f2e3038.vm.emqx.internal:4370 but ends up resolving 6e82e56f2e3038 to its own IP, or if the connection is looped back to itself due to misconfigured TCP routing/port forwarding, this might happen.
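
The name-to-port rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this comment (no numeric suffix gives 4370, a suffix of 1 gives 4371, and so on). The exact parsing EMQX performs is an assumption here, and `dist_port` is a hypothetical helper.

```python
import re

BASE_DIST_PORT = 4370  # port quoted above for a node whose name part has no numeric suffix


def dist_port(node, base=BASE_DIST_PORT):
    """Derive the distribution port from the trailing digits of the name part.

    'emqx@host' -> base, 'emqx1@host' -> base + 1, following the rule described
    above. How EMQX itself parses the name is an assumption in this sketch.
    """
    name = node.split("@", 1)[0]
    match = re.search(r"(\d+)$", name)
    offset = int(match.group(1)) if match else 0
    return base + offset
```

Under this rule, two nodes that are both named `emqx@...` listen on the same port number, so a name that resolves to the local address would reach the node's own listener.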

Rotario commented on May 30, 2024

Thanks Zaiming @zmstone
Yeah, I read about the port discovery. I hope the resolved IPs are consistently correct; whenever I've manually run nslookup on each node, the IPs resolve to the correct ones.

**Does it happen if you start a single node?**

I just ran one node. That single node still complains that it can't get a connection id for itself. Could this be due to the new IPv6 autodiscovery?

**How often do you see this logged?**

~Every 14s

Here's the log

135174674 '[email protected]':send_name: 'N' node='[email protected]' creation=4             
135175366 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
{"MD5 connection from ~p~n",['[email protected]']}                          
{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
135180949 '[email protected]':do_mark_pending(<0.2062.0>,'[email protected]','[email protected]',559666
                                                                                                                                                                
{dist_util,<0.3087.0>,send_status,'[email protected]',nok}                      
{dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}                  
2024-02-20T10:07:19.818709+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-20T10:07:19.818859+00:00 [error] error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts
ernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{
line,241}]}]

zmstone commented on May 30, 2024

For reference. This is what happens if I self-connect with a different name:

v5.4.1([email protected])1>  net_kernel:connect_node(node()). % self node is `[email protected]` so it should always return `true` -- as expected.
true
v5.4.1([email protected])2>  net_kernel:connect_node('[email protected]'). % local.host is added as a loopback address
20025380 '[email protected]':send_name: 'N' node='[email protected]' creation=4
20025933 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
{"MD5 connection from ~p~n",['[email protected]']}
{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
2024-02-20T10:20:05.500277+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-20T10:20:05.500549+00:00 [error] error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,1123}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1200}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
20027524 '[email protected]':do_mark_pending(<0.2016.0>,'[email protected]','[email protected]',55966662589) -> nok_pending
{dist_util,<0.3196.0>,send_status,'[email protected]',nok}
{dist_util,<0.3194.0>,recv_status,'[email protected]',"nok"}

This line looks suspicious in your logs: {dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}

Something in the code is trying to connect to the name 'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2' every 14-15 seconds.

zmstone commented on May 30, 2024

@Rotario I guess fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038's address?
Is emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2 by any chance configured in the static discovery node list?
Could you share the output of this command (in the container): emqx eval 'application:get_all_env(ekka)'

Rotario commented on May 30, 2024

Ok could it maybe just be a hangover from switching discovery mechanisms?

emqx eval 'application:get_all_env(ekka)' gives

root@6e82e56f2e3038:/opt/emqx# emqx eval 'application:get_all_env(ekka)'
643702 '[email protected]':send_name: 'N' node='[email protected]' creation=4
{dist_util,<0.93.0>,recv_status,'[email protected]',"ok"}
656404 '[email protected]':recv: 'N' node="[email protected]", challenge=3526690536 creation=4
657095 '[email protected]':send_reply: challenge=785686415 digest=<<190,232,253,141,252,169,249,220,242,
                                         38,215,173,131,214,231,93>>
659050 '[email protected]':recv_ack: digest=[19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50]
659608 '[email protected]':sum = <<19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50>>
660181 '[email protected]':setnode: node='[email protected]' port=#Port<0.6> flags=55966662588(hidden) creation=4
[{proto_dist,inet6_tcp},
 {cluster_discovery,{dns,[{name,"emqx.internal"},{type,aaaa}]}},
 {{callback,stop},fun emqx_machine_boot:stop_apps/0},
 {cluster_name,emqxcl},
 {{callback,start},fun emqx_machine_boot:ensure_apps_started/0}]

I'll destroy the instances and volumes and recreate from scratch

Rotario commented on May 30, 2024

I've noticed the new nodes aren't discovering each other either; I have to run emqx ctl cluster join <node>. This isn't an issue right now, and I can explore it later.

NOTE: There's no static discovery list. I think maybe the hostname is being resolved to an IP and that's being used as the node name? Which is obviously not how the nodes are configured.

Yeah, you're right: fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038.

More logs from new clean nodes

2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321599835 '[email protected]':send_name: 'N' node='[email protected]' creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321600674 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{"MD5 connection from ~p~n",['[email protected]']}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321606087 '[email protected]':do_mark_pending(<0.2062.0>,'[email protected]','[email protected]',55966662589) -> nok_pending
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3602.0>,send_status,'[email protected]',nok}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3600.0>,recv_status,'emqx@fdaa:0:e26a:a7b:172:ce9:3967:2',"nok"}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196529+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196691+00:00 [error] error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

zmstone commented on May 30, 2024

Ah, OK.
Now I get it.
The node name should not use an FQDN as the host part if you use the DNS discovery strategy.
This is because a DNS lookup returns a list of IP addresses, so the nodes will always connect to peers using emqx@IP.

You'll need to either use the static/manual strategy for node discovery,
or use DNS but assign static IPs to the containers (so the nodes get a static name like emqx@IPv6).
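
To make the mismatch concrete, here is a minimal Python sketch of the behaviour described above. It is not EMQX's discovery code: `peers_from_dns` and `peers_to_connect` are hypothetical helpers, the plain string comparison used to skip the local node is an assumption, and the addresses come from the IPv6 documentation range.

```python
def peers_from_dns(name_part, resolved_ips):
    """Build peer node names as described above: name_part@IP for each resolved address."""
    return [f"{name_part}@{ip}" for ip in resolved_ips]


def peers_to_connect(own_node, name_part, resolved_ips):
    """Drop the local node from the discovered peers by comparing node names.

    The comparison is a plain string match (an assumption made for this sketch),
    so it only filters the local node when that node is itself named name_part@IP.
    """
    return [peer for peer in peers_from_dns(name_part, resolved_ips) if peer != own_node]


ips = ["2001:db8::1", "2001:db8::2"]

# Node named with an FQDN: its own address stays in the list, so it would try
# to connect to itself under a different name.
fqdn_named = peers_to_connect("emqx@node-a.example.internal", "emqx", ips)

# Node named with its IP: its own entry is recognised and skipped.
ip_named = peers_to_connect("emqx@2001:db8::1", "emqx", ips)
```

In the first case the node's own address survives the filter, which would match the self-connection under the emqx@IP name seen in the logs; in the second case only the other node remains.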

zmstone commented on May 30, 2024

@Rotario Thank you.
This debug session resolved a long-standing mystery for us.
