Comments (19)
Hi @leobesen,
To help narrow down the scope of the investigation, could you share which features you have enabled?
For example, a rule engine that processes JSON payloads, or rules that make use of the jq functions?
from emqx.
To help us narrow the scope of the investigation, could you describe which functions you use and how traffic flows through EMQX?
For example:
"Clients connect to EMQX over TCP and publish messages with a JSON payload. The messages are filtered by a rule engine rule such as .... and are persisted in Kafka over a TCP/TLS connection. Messages are compressed with zip....
We also call the EMQX REST API with user/password auth from time to time ...."
Are you using any of the functions below?
- QUIC transport
- encode/decode json
- auth with SASL
- persistent session on disk
- compress/decompress with snappy
- jq rules
- TLS
- password auth
Do you get a core dump file or a crash dump file? Could you enable core dumps?
from emqx.
Hello @qzhuyan and @zmstone! I work with @leobesen and will be following this issue as well.
Our implementation is quite simple actually, pretty much just a bridge between devices and our backend servers.
QUIC transport = no
encode/decode json = no
auth with SASL = Built-in(no hash) auth and mongo(SHA-256 no salt)
persistent session on disk = no
compress/decompress with snappy = no
jq rules = no rules used
No coredump file or crashdump file found with the default EMQX logs.
from emqx.
The IoT devices connect through TCP, our clients can visualize live data through WS, and the backend connects through TCP as well. There is no compression or encoding, just plain text/JSON due to the nature of the project. TCP connections can stay alive for days or weeks.
The endpoint /api/v5/status is called every 10 seconds for monitoring.
from emqx.
Thanks @vitorkogut. Is this a newly detected issue since you upgraded to a new EMQX version?
I suspect MongoDB password hashing with bcrypt.
from emqx.
We had this problem with 5.1.6, then we updated our hml server to 5.3.2 and got the same error in the exact same way, 30 days and crash.
The mongo connection isn't using bcrypt
from emqx.
Do you have anything in the EMQX logs?
To proceed, I think we need the core dump file. Could you try to enable core dumps?
ulimit -c unlimited
sysctl -w kernel.core_pattern=core
Note: if you use systemd, make sure you set this in the EMQX systemd unit file:
LimitCORE=infinity
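Once the service has been restarted with these settings, it can be worth confirming that the running EMQX process actually picked up the new limit. The following is a minimal Python sketch, not something shipped with EMQX: it parses the `Max core file size` row of `/proc/<pid>/limits` on Linux. The helper names `parse_core_limit` and `core_limit_for_pid` are hypothetical, and the PID of the EMQX process has to be supplied by the reader.

```python
from pathlib import Path


def parse_core_limit(limits_text: str) -> tuple[str, str]:
    """Return (soft, hard) core file size limits from /proc/<pid>/limits text."""
    prefix = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # The remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(prefix):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max core file size' row found")


def core_limit_for_pid(pid: int) -> tuple[str, str]:
    """Read the core limits of a running process (Linux only)."""
    return parse_core_limit(Path(f"/proc/{pid}/limits").read_text())
```

Calling `core_limit_for_pid(<emqx pid>)` and getting `('unlimited', 'unlimited')` back would suggest the limit is in place; a soft limit of `0` would mean no core file will be written on the next crash.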
It would be best if you could help reproduce this with minimal steps using our load generator:
https://github.com/emqx/emqtt-bench
or
https://github.com/emqx/emqttb
from emqx.
Log:
Dec 31 21:18:19 ip-X-X-X-X bash[329193]: EMQX 5.1.6 is running now!
Jan 31 22:59:11 ip-X-X-X-X bash[329193]: free(): double free detected in tcache 2
Jan 31 22:59:11 ip-X-X-X-X bash[329478]: [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Jan 31 22:59:11 ip-X-X-X-X systemd[1]: emqx.service: Main process exited, code=killed, status=6/ABRT
Jan 31 22:59:11 ip-X-X-X-X bash[329477]: [os_mon] memory supervisor port (memsup): Erlang has closed
Jan 31 22:59:11 ip-X-X-X-X systemd[1]: emqx.service: Failed with result 'signal'.
Is there any downside to enabling core dumps? Just to be sure.
I don't think the load generator will be of any help in this case. We've had really high loads and very low ones, but the crash always comes after 30 days. The logs go from Dec 31 to Jan 31, and here's another example:
from emqx.
EMQX log
2024-01-26T14:28:06.478178-03:00 [warning] msg: start_resource_failed, mfa: emqx_resource_manager:start_resource/2, line: 492, id: <<"emqx_authn_mongodb:1">>, reason: {start_pool_failed,<<"emqx_authn_mongodb:1">>,{connect_failed,econnrefused}}
2024-01-26T14:28:06.478263-03:00 [error] crasher: initial call: supervisor:mc_pool_sup/1, pid: <0.21779.207>, registered_name: mc_pool_sup, exit: {{connect_failed,econnrefused},[{gen_server,decode_msg,9,[{file,"gen_server.erl"},{line,909}]},{proc_li>
2024-01-26T14:28:10.854868-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.144.103.249:19701, clientid: 00-04-A3-E6-52-F6, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-52-F>
2024-01-26T14:28:11.103436-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.138.171.172:4204, clientid: 00-04-A3-E6-48-98, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-48-98>
2024-01-26T14:28:12.876833-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.221.77.2:30892, clientid: 00-04-A3-E6-2D-DD, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-2D-DD">>
2024-01-26T14:28:13.120204-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.95.23.227:1036, clientid: 00-04-A3-E5-BB-4E, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-BB-4E">>
2024-01-26T14:28:16.811515-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X50.128.34:1166, clientid: 00-04-A3-E5-BB-0F, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-BB-0F">>>
2024-01-26T14:28:19.565423-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.69.237.225:2357, clientid: 00-04-A3-E5-D3-B4, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-D3-B4">
2024-01-26T14:28:19.882256-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.27.70.36:1487, clientid: 00-04-A3-E6-2E-14, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-2E-14">>>
2024-01-26T14:28:19.999531-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.110.162.194:3825, clientid: 00-04-A3-E5-DF-7D, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-DF-7D>
2024-01-26T14:28:20.701453-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.6.83.215:1431, clientid: 00-04-A3-E6-10-BC, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-10-BC">>>
2024-01-26T14:28:21.423672-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.67.144.55:1101, clientid: 00-04-A3-E6-1D-5E, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-1D-5E">>
2024-01-26T14:28:22.927516-03:00 [warning] msg: alarm_is_deactivated, mfa: emqx_alarm:do_actions/3, line: 424, name: <<"emqx_authn_mongodb:1">>
2024-01-28T01:27:11.464561-03:00 [warning] msg: unexpected_api_access, mfa: emqx_dashboard_not_found:init/2, line: 25, request: #{bindings => #{},body_length => 0,cert => undefined,has_body => false,headers => #{<<"accept-encoding">> => <<"gzip">>,<>
2024-01-28T02:12:32.270193-03:00 [error] supervisor: 'esockd_connection_sup - <0.2747.0>', errorContext: connection_shutdown, reason: #{hint => invalid_proto_name,parsed_length => 256,remaining_bytes_length => 1}, offender: [{pid,<0.4552.218>},{name>
2024-01-29T15:01:40.251201-03:00 [error] msg: session_stepdown_request_exception, mfa: emqx_cm:request_stepdown/3, line: 470, action: discard, reason: normal, stacktrace: [{emqx_ws_connection,call,3,[{file,"emqx_ws_connection.erl"},{line,192}]},{emq>
2024-01-30T05:24:09.595858-03:00 [error] supervisor: 'esockd_connection_sup - <0.2747.0>', errorContext: connection_shutdown, reason: #{hint => invalid_proto_name,parsed_length => 257,remaining_bytes_length => 1}, offender: [{pid,<0.12415.233>},{nam>
2024-01-30T21:59:26.485019-03:00 [error] supervisor: 'esockd_connection_sup - <0.2747.0>', errorContext: connection_shutdown, reason: #{hint => invalid_proto_name,parsed_length => 256,remaining_bytes_length => 1}, offender: [{pid,<0.16764.238>},{nam>
2024-01-31T01:25:51.094557-03:00 [warning] msg: unexpected_api_access, mfa: emqx_dashboard_not_found:init/2, line: 25, request: #{bindings => #{},body_length => 0,cert => undefined,has_body => false,headers => #{<<"accept-encoding">> => <<"gzip">>,<>
2024-02-01T09:42:37.246183-03:00 [error] crasher: initial call: cowboy_clear:connection_process/4, pid: <0.12929.3>, registered_name: [], error: {{case_clause,{error,timeout}},[{cowboy_websocket,websocket_send_close,2,[{file,"cowboy_websocket.erl"},>
2024-02-01T09:42:37.267175-03:00 [error] Ranch listener 'ws:default' had connection process started with cowboy_clear:start_link/4 at <0.12929.3> exit with reason: {{case_clause,{error,timeout}},[{cowboy_websocket,websocket_send_close,2,[{file,"cowb>
Nothing was logged at the time of the crash, Jan 31 22:59:11.
from emqx.
Do you have a monitoring system? Was memory usage high when the issue happened?
"I don't think the load generator will be of any help in this case. We've had really high loads and very low ones, but the crash always comes after 30 days. The logs go from Dec 31 to Jan 31 ..."
If we don't use the load generator to trigger the fault, do you have any other idea for reproducing the problem faster? Or do we wait another 30 days?
from emqx.
We do. The crash was at 22:59 and the load was normal, with just a spike afterwards because of the restart.
Unfortunately, I really don't think we can speed up the process. It's totally random (other than this 30-day pattern). We tested load and number of clients, and it doesn't seem to be repeatable.
Could Erlang be the problem in this case?
from emqx.
Our prod broker machine is only running the emqx service. The load is quite low:
from emqx.
@leobesen could you check the system logs (/var/log/syslog, or similar)? Maybe there is some cron job that is triggering this?
from emqx.
@id here are the full logs from the journal:
Jan 31 22:53:59 ip-********* sshd[402874]: error: kex_exchange_identification: Connection closed by remote host
Jan 31 22:53:59 ip-********* sshd[402874]: Connection closed by 205.210.31.15 port 55425
Jan 31 22:56:38 ip-********* sshd[402876]: Received disconnect from 134.122.88.182 port 60424:11: Bye Bye [preauth]
Jan 31 22:56:38 ip-********* sshd[402876]: Disconnected from authenticating user root 134.122.88.182 port 60424 [preauth]
Jan 31 22:57:35 ip-********* sshd[402879]: Received disconnect from 134.122.88.182 port 52898:11: Bye Bye [preauth]
Jan 31 22:57:35 ip-********* sshd[402879]: Disconnected from authenticating user root 134.122.88.182 port 52898 [preauth]
Jan 31 22:58:22 ip-********* sshd[402881]: Received disconnect from 43.156.39.228 port 48912:11: Bye Bye [preauth]
Jan 31 22:58:22 ip-********* sshd[402881]: Disconnected from authenticating user root 43.156.39.228 port 48912 [preauth]
Jan 31 22:58:32 ip-********* sshd[402883]: Received disconnect from 134.122.88.182 port 39864:11: Bye Bye [preauth]
Jan 31 22:58:32 ip-********* sshd[402883]: Disconnected from authenticating user root 134.122.88.182 port 39864 [preauth]
Jan 31 22:59:11 ip-********* bash[329193]: free(): double free detected in tcache 2
Jan 31 22:59:11 ip-********* bash[329478]: [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Jan 31 22:59:11 ip-********* systemd[1]: emqx.service: Main process exited, code=killed, status=6/ABRT
Jan 31 22:59:11 ip-********* bash[329477]: [os_mon] memory supervisor port (memsup): Erlang has closed
Jan 31 22:59:11 ip-********* systemd[1]: emqx.service: Failed with result 'signal'.
Jan 31 22:59:11 ip-********* systemd[1]: emqx.service: Consumed 2d 20h 2min 54.409s CPU time.
Jan 31 22:59:30 ip-********* sshd[402885]: Received disconnect from 43.156.39.228 port 40408:11: Bye Bye [preauth]
Jan 31 22:59:30 ip-********* sshd[402885]: Disconnected from authenticating user root 43.156.39.228 port 40408 [preauth]
Jan 31 22:59:33 ip-********* sshd[402887]: Received disconnect from 134.122.88.182 port 36118:11: Bye Bye [preauth]
Jan 31 22:59:33 ip-********* sshd[402887]: Disconnected from authenticating user root 134.122.88.182 port 36118 [preauth]
Jan 31 23:00:36 ip-********* sshd[402889]: Received disconnect from 134.122.88.182 port 37464:11: Bye Bye [preauth]
Jan 31 23:00:36 ip-********* sshd[402889]: Disconnected from authenticating user root 134.122.88.182 port 37464 [preauth]
Jan 31 23:00:39 ip-********* sshd[402891]: Received disconnect from 43.156.39.228 port 43570:11: Bye Bye [preauth]
Jan 31 23:00:39 ip-********* sshd[402891]: Disconnected from authenticating user root 43.156.39.228 port 43570 [preauth]
Jan 31 23:01:12 ip-********* systemd[1]: emqx.service: Scheduled restart job, restart counter is at 3.
Jan 31 23:01:12 ip-********* systemd[1]: Stopped emqx.service - emqx daemon.
Jan 31 23:01:12 ip-********* systemd[1]: emqx.service: Consumed 2d 20h 2min 54.409s CPU time.
Jan 31 23:01:12 ip-********* systemd[1]: Started emqx.service - emqx daemon.
Jan 31 23:01:14 ip-********* emqx[403128]: EXEC: /opt/emqx/erts-13.2.2/bin/erlexec -noshell -noinput +Bd -boot /opt/emqx/releases/5.1.6/start -boot_var RELEASE_LIB />
Jan 31 23:01:15 ip-********* bash[402893]: Listener tcp:default on 0.0.0.0:1883 started.
Jan 31 23:01:15 ip-********* bash[402893]: Listener ws:default on 0.0.0.0:80 started.
Jan 31 23:01:16 ip-********* bash[402893]: Listener http:dashboard on :18083 started.
Jan 31 23:01:17 ip-********* bash[402893]: EMQX 5.1.6 is running now!
from emqx.
@vitorkogut sorry for the long silence.
We tried to reproduce this but no luck so far. Did this happen recently in March as well?
from emqx.
@id, yes, the same thing happened.
Mar 03 00:54:22 bash[402893]: free(): double free detected in tcache 2
Mar 03 00:54:22 bash[403178]: [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Mar 03 00:54:22 systemd[1]: emqx.service: Main process exited, code=killed, status=6/ABRT
Mar 03 00:54:22 bash[403177]: [os_mon] memory supervisor port (memsup): Erlang has closed
Mar 03 00:54:22 systemd[1]: emqx.service: Failed with result 'signal'.
Mar 03 00:54:22 systemd[1]: emqx.service: Consumed 2d 18h 1min 27.706s CPU time.
Mar 03 00:56:22 systemd[1]: emqx.service: Scheduled restart job, restart counter is at 4.
Mar 03 00:56:22 systemd[1]: Stopped emqx.service - emqx daemon.
Mar 03 00:56:22 systemd[1]: emqx.service: Consumed 2d 18h 1min 27.706s CPU time.
Mar 03 00:56:22 systemd[1]: Started emqx.service - emqx daemon.
Mar 03 00:56:24 emqx[484532]: EXEC: /opt/emqx/erts-13.2.2/bin/erlexec -noshell -noinput +Bd -boot /opt/emqx/releases/5.1.6/start -boot_var RELEASE_LIB /opt/emqx/lib -bo>
Mar 03 00:56:26 bash[484297]: Listener tcp:default on 0.0.0.0:1883 started.
Mar 03 00:56:26 bash[484297]: Listener ws:default on 0.0.0.0:80 started.
Mar 03 00:56:26 bash[484297]: Listener http:dashboard on :18083 started.
Mar 03 00:56:28 bash[484297]: EMQX 5.1.6 is running now!
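The two start/crash pairs quoted in this thread can be compared directly. The sketch below is illustrative only: it takes the "is running now!" and "double free detected" timestamps from the journal excerpts and computes how long each run lasted. The journal lines carry no year, so the years used here (2023 and 2024) are an assumption based on the dates in the EMQX log, and the `uptime` helper is a hypothetical name.

```python
from datetime import datetime, timedelta

# Journal timestamps have no year, so one is prepended (assumed, see above).
# %b matches English month abbreviations under the default C locale.
FMT = "%Y %b %d %H:%M:%S"


def uptime(started: str, crashed: str) -> timedelta:
    """Return the time between a start and a crash timestamp."""
    return datetime.strptime(crashed, FMT) - datetime.strptime(started, FMT)


# (start, crash) pairs copied from the journal excerpts in this thread.
runs = [
    ("2023 Dec 31 21:18:19", "2024 Jan 31 22:59:11"),
    ("2024 Jan 31 23:01:17", "2024 Mar 03 00:54:22"),
]

for started, crashed in runs:
    print(f"{started} -> {crashed}: {uptime(started, crashed)}")
# prints:
# 2023 Dec 31 21:18:19 -> 2024 Jan 31 22:59:11: 31 days, 1:40:52
# 2024 Jan 31 23:01:17 -> 2024 Mar 03 00:54:22: 31 days, 1:53:05
```

Under those assumed years, both runs lasted a little over 31 days, within about 12 minutes of each other, which suggests the crash tracks uptime rather than load.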
from emqx.
@qzhuyan do you have any idea how else we could troubleshoot this?
from emqx.
We may need to reproduce this issue first:
Debian 12 on ARM, plus configs and messages similar to those in @leobesen's environment.
from emqx.
I got core dumps on arm64 everywhere with OTP 26.2. I am moving away from arm64.
from emqx.