Comments (32)
Yeah of course, send it over and I can try to run it. It's the last thing I hope before I start trying it in production
from emqx.
Thanks for your help @zmstone !
** Cannot get connection id for node '[email protected]'
error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1098}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
from emqx.
Ah ok brill thank you! And set node names to the ip not to the fqdn. Brill thank you
from emqx.
Yes, nxdomain in the logs indicates that it's a DNS issue.
from emqx.
Hi - I've still got this issue. I've installed dnsutils on the production server (fly.io), and when I run nslookup xxxx the IP is resolved correctly.
However, I'm still getting this issue sometimes. I suspect EMQX somehow doesn't support AAAA records properly.
Fly is pretty good at resolving these hostnames correctly, so I think it might be an issue with EMQX.
2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032192+00:00 [error] event=connect_to_remote_server, [email protected], port=5369, reason=nxdomain
2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032304+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.11531.1>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2712; neighbours:
from emqx.
Hi,
Did you test with a version that includes the AAAA record fix, #12467? I don't think that fix is included in version 5.4.1, as it was only recently merged into the master branch.
from emqx.
Not sure if the above PR will help:
- It only affects the DNS discovery strategy
- It only adds a parameter to the configuration schema, and doesn't change anything in the underlying code.
from emqx.
Can you post full response of nslookup? My GPG key is https://keyserver.ubuntu.com/pks/lookup?search=488654DF3FED6FDE&fingerprint=on&op=index , if you don't want to expose the data publicly.
from emqx.
Sure - from VM1
root@6e82e56f2e3038:/opt/emqx# nslookup 2874de3c1675d8.vm.emqx.internal
Server: fdaa::3
Address: fdaa::3#53
Name: 2874de3c1675d8.vm.emqx.internal
Address: fdaa:0:xxxx:xxxx:xxxx:4935:2
from VM2
root@2874de3c1675d8:/opt/emqx# nslookup 6e82e56f2e3038.vm.emqx.internal
Server: fdaa::3
Address: fdaa::3#53
Name: 6e82e56f2e3038.vm.emqx.internal
Address: fdaa:0:xxxx:xxx:xx:xxxx:e24:2
from emqx.
If I understand correctly, the problem is intermittent, since the nodes can communicate after restart. I assume that the IP addresses don't change. This suggests a temporary problem with the name resolution.
What distro are you running? Does it have a local DNS cache, like systemd-resolved? Also, what is the TTL for the AAAA record?
from emqx.
Hi, thanks for your reply. It seems to be erroring pretty consistently now. I don't know if maybe a config change killed it?
I've not changed any of the cluster settings from those above
The distro is the Fly machine firecracker. I don't know how to check the DNS cache. nslookup seems to resolve it fine. I'll check the TTL.
from emqx.
I'm going to wait for the DNS discovery AAAA feature to be released and try that
from emqx.
Hi - I've got the master branch working now on the cloud provider
EMQX_CLUSTER__DNS__NAME = "emqx.internal"
EMQX_CLUSTER__DNS__RECORD_TYPE = "aaaa"
EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
EMQX_CLUSTER__DISCOVERY_STRATEGY = "dns"
And I'm still getting nxdomain issues. I think it's now interpreting this IPv6 address as a domain name? Does it need to be bracketed?
2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952226+00:00 [error] event=connect_to_remote_server, peer=emqx@fdaa:0:xxxxxxxxxxx:2, port=5369, reason=nxdomain
2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952315+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.12092.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,961}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2158.0>], message_queue_len: 0, messages: [], links: [<0.2164.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2615; neighbours:
It did seem to be working fine
from emqx.
I've changed the node name from emqx@ipv6_of_machine to emqx@FQDN of machine, and now I'm getting:
Cannot get connection id for node '[email protected]'
from emqx.
Update:
I found the RPC settings, and set it to listen only on ipv6 and ::
Now it seems like the nxdomain issues are fixed. The error I get now is this.
Any ideas please? It seems connected to the settings EMQX_HOST and EMQX_NODE__NAME, but I've played around with them and can't get this error to stop.
2024-02-09T17:03:56Z app[6e82e56f2e3038] lhr [info]2024-02-09T17:03:56.166004+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-09T17:04:02Z app[2874de3c1675d8] lhr [info]2024-02-09T17:04:02.986175+00:00 [error] ** Cannot get connection id for node '[email protected]'
from emqx.
The cluster seems to be working fine, with messages and subscriptions passing through transparently, though the "Cannot get connection id" error is still being logged.
from emqx.
I recall that the "Cannot get connection id" message originates inside the Erlang runtime. I'll have to take a deep dive into the Erlang code to find the precise conditions that trigger it, and what the implications of that error are.
from emqx.
It seems related to nodes trying to connect to themselves, which makes sense from the logs.
Maybe the auto DNS discovery mechanism with IPv6 makes a node try to connect to itself? Maybe some compare function that should return true to stop a node connecting to itself doesn't compare IPv6 addresses properly, or something similar?
Searching on GitHub I can't find Node.connect anywhere, so I can't find the place where nodes connect to each other.
2024-02-18T18:09:01Z app[6e82e56f2e3038] lhr [info]2024-02-18T18:09:01.339236+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-18T18:09:08Z app[2874de3c1675d8] lhr [info]2024-02-18T18:09:08.114719+00:00 [error] ** Cannot get connection id for node '[email protected]'
bitwalker/libcluster#70
https://github.com/mrluc/peerage/pull/17/files
from emqx.
Found this fix (in the links @Rotario shared above): erlang/otp#1870. It was merged long ago, so EMQX 5.4.1 should have the fix in place.
Maybe there are other call paths not covered.
from emqx.
Hi again @Rotario
"Cannot get connection id" has happened before in ipv4 network too, but I never managed to find the cause.
If you are open to running some debug tests, I can provide a patch and the steps to install it to help the investigation.
The patch will try to print the stacktrace when a node tries to self-connect.
from emqx.
Hi @Rotario
Here is the beam file in a zipped dir: net_kernel.zip
Sha256sum of the beam file (not the zip): 44551fcb70c1d58a9e2ab8430a968d65de37552e1e5d7012dcf0cf6ddbefba0a
Code diff: zmstone/otp@41dcd06
Extract the file net_kernel.beam, and mount it to replace /opt/emqx/lib/kernel-8.5.4/ebin/net_kernel.beam
Or commit a new docker tag with this file replaced in the container.
This patch adds another log line after "Cannot get connection id" which should include more error context as well as the stacktrace.
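Before mounting the file, the checksum quoted above can be checked locally. The following is only a minimal sketch in Python, assuming the extracted file is available on disk; the helper names `sha256_of` and `matches_expected` are hypothetical and not part of EMQX or OTP.

```python
import hashlib
from pathlib import Path

# Checksum quoted in the comment above (for the beam file, not the zip).
EXPECTED_SHA256 = "44551fcb70c1d58a9e2ab8430a968d65de37552e1e5d7012dcf0cf6ddbefba0a"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected checksum (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("net_kernel.beam")` should return True for the untampered file; the path here is just wherever the zip was extracted.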
from emqx.
Thank you @Rotario.
The log narrowed it down a bit.
Now we know that it's logged when accepting a new connection from Erlang distribution listener.
We might need to trace/debug the connection initiator side, but I need to dig more in the code to find where to trace.
In the meantime, could you help test with this patch? dist-debug.zip.
It includes more debug logs; see the commits here: https://github.com/zmstone/otp/commits/log-stacktrace-when-cannot-get-connection-id-happens/
Also would like to ask:
- Does it happen if you start a single node?
- How often do you see this logged?
Example (normal) logs from the patch:
v5.4.1([email protected])1> 110989263 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
{"MD5 connection from ~p~n",['[email protected]']}
{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
110990781 '[email protected]':do_mark_pending(<0.2016.0>,'[email protected]','[email protected]',55966662589) -> ok
{dist_util,<0.3250.0>,send_status,'[email protected]',ok}
110991845 '[email protected]':send: 'N' challenge=780365816 creation=4
110994512 '[email protected]':recv_reply: challenge=631707742 digest=[6,93,82,221,42,134,231,21,140,166,246,
161,49,183,233,112]
110994717 '[email protected]':sum = <<6,93,82,221,42,134,231,21,140,166,246,161,49,183,233,112>>
110994893 '[email protected]':send_ack: digest=<<25,210,44,53,117,211,215,113,115,210,118,160,255,201,32,218>>
{dist_util,<0.3250.0>,accept_connection,'[email protected]'}
110995605 '[email protected]':setnode: node='[email protected]' port=#Port<0.18> flags=55966662589(normal) creation=4
from emqx.
Some more information for you to troubleshoot the network:
An EMQX node resolves the peer's port from the node name. In your case, both nodes should be listening on port 4370.
(A node named emqx1@hostname listens on port 4371, and so on.)
If node [email protected] tries to connect to 6e82e56f2e3038.vm.emqx.internal:4370 but ends up resolving 6e82e56f2e3038 to its own IP, or if the connection is looped back to itself due to misconfigured TCP routing or port forwarding, this might happen.
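To make the port rule above concrete, here is a rough Python sketch. It assumes that "and so on" means the trailing number in the name part is added to a base port of 4370; that reading, and the function name `dist_port`, are assumptions for illustration, not EMQX's actual code.

```python
import re

# Assumed base port, taken from the example in the comment above.
BASE_PORT = 4370


def dist_port(node_name, base_port=BASE_PORT):
    """Guess the listening port for a node name such as 'emqx1@hostname'.

    Assumption: the trailing digits of the part before '@' are an offset
    added to the base port; a name without trailing digits gets offset 0.
    """
    name_part = node_name.split("@", 1)[0]
    match = re.search(r"(\d+)$", name_part)
    offset = int(match.group(1)) if match else 0
    return base_port + offset
```

Under these assumptions, `dist_port("emqx@hostname")` gives 4370 and `dist_port("emqx1@hostname")` gives 4371, matching the two cases mentioned in the comment.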
from emqx.
Thanks Zaiming @zmstone
Yeah, I read about the port discovery. I hope the resolved IPs are consistently correct; whenever I've manually run nslookup on each node, the IPs resolve to the correct ones.
Does it happen if you start a single node?
I just ran one node. That single node still complains that it can't get a connection id for itself. Could this be due to the new IPv6 autodiscovery?
How often do you see this logged?
~Every 14s
Here's the log
135174674 '[email protected]':send_name: 'N' node='[email protected]' creation=4
135175366 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
{"MD5 connection from ~p~n",['[email protected]']}
{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
135180949 '[email protected]':do_mark_pending(<0.2062.0>,'[email protected]','[email protected]',559666
{dist_util,<0.3087.0>,send_status,'[email protected]',nok}
{dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}
2024-02-20T10:07:19.818709+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-20T10:07:19.818859+00:00 [error] error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts
ernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{
line,241}]}]
from emqx.
For reference. This is what happens if I self-connect with a different name:
v5.4.1([email protected])1> net_kernel:connect_node(node()). % self node is `[email protected]` so it should always return `true` -- as expected.
true
v5.4.1([email protected])2> net_kernel:connect_node('[email protected]'). % local.host is added as a loopback address
20025380 '[email protected]':send_name: 'N' node='[email protected]' creation=4
20025933 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
{"MD5 connection from ~p~n",['[email protected]']}
{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
2024-02-20T10:20:05.500277+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-20T10:20:05.500549+00:00 [error] error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,1123}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1200}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
20027524 '[email protected]':do_mark_pending(<0.2016.0>,'[email protected]','[email protected]',55966662589) -> nok_pending
{dist_util,<0.3196.0>,send_status,'[email protected]',nok}
{dist_util,<0.3194.0>,recv_status,'[email protected]',"nok"}
This line looks suspicious in your logs: {dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}
Some place in the code is trying to connect to this name 'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2' every 14 to 15 seconds.
from emqx.
@Rotario I guess fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038's address?
Is emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2 by any chance configured in the static discovery node list?
Could you share the output of this command (in the container): emqx eval 'application:get_all_env(ekka)'
from emqx.
Ok could it maybe just be a hangover from switching discovery mechanisms?
emqx eval 'application:get_all_env(ekka)' gives
root@6e82e56f2e3038:/opt/emqx# emqx eval 'application:get_all_env(ekka)'
643702 '[email protected]':send_name: 'N' node='[email protected]' creation=4
{dist_util,<0.93.0>,recv_status,'[email protected]',"ok"}
656404 '[email protected]':recv: 'N' node="[email protected]", challenge=3526690536 creation=4
657095 '[email protected]':send_reply: challenge=785686415 digest=<<190,232,253,141,252,169,249,220,242,
38,215,173,131,214,231,93>>
659050 '[email protected]':recv_ack: digest=[19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50]
659608 '[email protected]':sum = <<19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50>>
660181 '[email protected]':setnode: node='[email protected]' port=#Port<0.6> flags=55966662588(hidden) creation=4
[{proto_dist,inet6_tcp},
{cluster_discovery,{dns,[{name,"emqx.internal"},{type,aaaa}]}},
{{callback,stop},fun emqx_machine_boot:stop_apps/0},
{cluster_name,emqxcl},
{{callback,start},fun emqx_machine_boot:ensure_apps_started/0}]
I'll destroy the instances and volumes and recreate from scratch
from emqx.
I've noticed the new nodes aren't discovering each other either. I have to run emqx ctl cluster join <node>. This isn't an issue right now; I can explore it later.
NOTE: There's no static discovery list. I think maybe the hostname is being resolved to an IP and that's being used as the node name, which obviously isn't what the nodes are configured as.
Yeah, you're right: fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038.
More logs from new clean nodes
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321599835 '[email protected]':send_name: 'N' node='[email protected]' creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321600674 '[email protected]':recv_name: 'N' name="[email protected]" creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{"MD5 connection from ~p~n",['[email protected]']}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{net_kernel,accept_pending,'[email protected]','[email protected]',undefined,[]}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321606087 '[email protected]':do_mark_pending(<0.2062.0>,'[email protected]','[email protected]',55966662589) -> nok_pending
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3602.0>,send_status,'[email protected]',nok}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3600.0>,recv_status,'emqx@fdaa:0:e26a:a7b:172:ce9:3967:2',"nok"}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196529+00:00 [error] ** Cannot get connection id for node '[email protected]'
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196691+00:00 [error] error:badarg, [{erts_internal,new_connection,['[email protected]'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
from emqx.
Ah. ok.
Now I get it.
The node name should not use an FQDN as the host part if you use the DNS discovery strategy.
This is because a DNS lookup returns a list of IP addresses, so the nodes will always try to connect to peers as emqx@IP.
You'll need to either use the static/manual strategy for node discovery, or use DNS but assign static IPs to the containers (so each node gets a stable name like emqx@IPv6).
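The mismatch described above can be illustrated with a small Python sketch. It only models the naming logic (name part plus resolved IP) and is not how EMQX or ekka implements discovery; the function names are hypothetical, and the address and hostname are the ones that appear in the logs earlier in this thread.

```python
import ipaddress


def discovered_node_names(name_part, addresses):
    """Build the peer names DNS discovery would try: name_part@IP for each record."""
    return [f"{name_part}@{addr}" for addr in addresses]


def host_is_ip_literal(node_name):
    """True if the host part of 'name@host' parses as an IPv4 or IPv6 address."""
    host = node_name.split("@", 1)[1]
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def own_name_is_discoverable(own_node_name, addresses):
    """True if this node's own name is among the names built from the DNS records."""
    name_part = own_node_name.split("@", 1)[0]
    return own_node_name in discovered_node_names(name_part, addresses)


# Address and hostname as they appear in the logs above.
records = ["fdaa:0:e26a:a7b:8f:ff70:e24:2"]
fqdn_name = "emqx@6e82e56f2e3038.vm.emqx.internal"
ip_name = "emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2"
```

With an FQDN-based name, `own_name_is_discoverable(fqdn_name, records)` is False, so a node would not recognise the emqx@IP name as itself; with an IP-based name it is True. Note that this sketch compares names as plain strings, so differently formatted spellings of the same IPv6 address would not match.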
from emqx.
@Rotario Thank you.
This debug session resolved a long-standing mystery for us.
from emqx.