Comments (19)

zmstone commented on May 30, 2024

Hi @leobesen,
To help narrow down the investigation scope, could you share which features you have enabled?
E.g. a rule engine rule that processes JSON payloads, or use of the jq functions?

from emqx.

qzhuyan commented on May 30, 2024

To help us narrow the scope of the investigation, could you describe which features you use and how the traffic flows through EMQX?

For example:
"Clients connect to EMQX over TCP and publish messages with JSON payloads; the messages are filtered by a rule engine rule such as ..., and they are persisted to Kafka over a TCP/TLS connection; messages are compressed with zip ...

We also call the EMQX REST API with user/password auth from time to time ..."

Are you using any of the features below?

  • QUIC transport
  • encode/decode json
  • auth with SASL
  • persistent session on disk
  • compress/decompress with snappy
  • jq rules
  • TLS
  • password auth

Do you get a core dump file or a crash dump file? Could you enable core dumps?

vitorkogut commented on May 30, 2024

Hello @qzhuyan and @zmstone! I work with @leobesen and will be following this issue as well.

Our implementation is quite simple actually, pretty much just a bridge between devices and our backend servers.

QUIC transport = no

encode/decode json = no

auth with SASL = Built-in (no hash) auth and MongoDB (SHA-256, no salt)
image

persistent session on disk = no

compress/decompress with snappy = no

jq rules = no rules used

No coredump file or crashdump file found with the default EMQX logs.

vitorkogut commented on May 30, 2024

The IoT devices connect through TCP, and our clients can visualize live data through WS; the backend connects through TCP as well. There is no compression or encoding, just plain text/JSON, due to the nature of the project. TCP connections can stay open for days or weeks.

The endpoint /api/v5/status is called every 10 seconds for monitoring.

qzhuyan commented on May 30, 2024

Thanks @vitorkogut. Is this a newly detected issue since you upgraded to a new EMQX version?

I suspect the MongoDB password hashing with bcrypt.

vitorkogut commented on May 30, 2024

We had this problem with 5.1.6, then we updated our hml server to 5.3.2 and got the same error in the exact same way, 30 days and crash.

The mongo connection isn't using bcrypt

image

qzhuyan commented on May 30, 2024

Do you have anything in the EMQX logs?

To proceed, I think we need the core dump file. Could you try to enable core dumps?

ulimit -c unlimited
sysctl -w kernel.core_pattern=core

Note: if you use systemd, make sure you set this in the EMQX systemd unit file:

LimitCORE=infinity
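
To double-check that the limit actually reached the running broker, one option is to read the process's limits from /proc after a restart. A minimal Python sketch, assuming a Linux host; the helper names `core_limit` and `core_limit_for_pid` and the sample text are made up for illustration and are not part of EMQX:

```python
from pathlib import Path


def core_limit(limits_text: str) -> tuple[str, str]:
    """Return the (soft, hard) 'Max core file size' values from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max core file size"):
            # Columns: limit name (4 words), soft limit, hard limit, units.
            fields = line.split()
            return fields[4], fields[5]
    raise ValueError("no 'Max core file size' line found")


def core_limit_for_pid(pid: int) -> tuple[str, str]:
    """Read the limits of a live process (Linux only)."""
    return core_limit(Path(f"/proc/{pid}/limits").read_text())


# Illustrative sample in the /proc/<pid>/limits layout, not taken from the reporter's host.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max core file size        0                    unlimited            bytes
"""

if __name__ == "__main__":
    soft, hard = core_limit(SAMPLE)
    print(f"soft={soft} hard={hard}")
```

A soft limit of 0 means no core file is written when the process aborts; with LimitCORE=infinity in effect, both columns should read "unlimited" for the EMQX process.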

It would be best if you could help reproduce this with minimal steps using our load generator:

https://github.com/emqx/emqtt-bench
or
https://github.com/emqx/emqttb

vitorkogut commented on May 30, 2024

Log:

Dec 31 21:18:19 ip-X-X-X-X bash[329193]: EMQX 5.1.6 is running now!  
Jan 31 22:59:11 ip-X-X-X-X bash[329193]: free(): double free detected in tcache 2  
Jan 31 22:59:11 ip-X-X-X-X bash[329478]: [os_mon] cpu supervisor port (cpu_sup): Erlang has closed  
Jan 31 22:59:11 ip-X-X-X-X systemd[1]: emqx.service: Main process exited, code=killed, status=6/ABRT  
Jan 31 22:59:11 ip-X-X-X-X bash[329477]: [os_mon] memory supervisor port (memsup): Erlang has closed  
Jan 31 22:59:11 ip-X-X-X-X systemd[1]: emqx.service: Failed with result 'signal'.  

Is there any downside to enabling core dumps? Just to be sure.

I don't think the load generator will be of any help in this case. We've had really high loads and very low ones, but it's always after 30 days. The logs go from Dec 31 -> Jan 31, and here's another example:

Dec 04 -> Jan 04
image

leobesen commented on May 30, 2024

EMQX log

2024-01-26T14:28:06.478178-03:00 [warning] msg: start_resource_failed, mfa: emqx_resource_manager:start_resource/2, line: 492, id: <<"emqx_authn_mongodb:1">>, reason: {start_pool_failed,<<"emqx_authn_mongodb:1">>,{connect_failed,econnrefused}}
2024-01-26T14:28:06.478263-03:00 [error] crasher: initial call: supervisor:mc_pool_sup/1, pid: <0.21779.207>, registered_name: mc_pool_sup, exit: {{connect_failed,econnrefused},[{gen_server,decode_msg,9,[{file,"gen_server.erl"},{line,909}]},{proc_li>
2024-01-26T14:28:10.854868-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.144.103.249:19701, clientid: 00-04-A3-E6-52-F6, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-52-F>
2024-01-26T14:28:11.103436-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.138.171.172:4204, clientid: 00-04-A3-E6-48-98, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-48-98>
2024-01-26T14:28:12.876833-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.221.77.2:30892, clientid: 00-04-A3-E6-2D-DD, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-2D-DD">>
2024-01-26T14:28:13.120204-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.95.23.227:1036, clientid: 00-04-A3-E5-BB-4E, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-BB-4E">>
2024-01-26T14:28:16.811515-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X50.128.34:1166, clientid: 00-04-A3-E5-BB-0F, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-BB-0F">>>
2024-01-26T14:28:19.565423-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.69.237.225:2357, clientid: 00-04-A3-E5-D3-B4, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-D3-B4">
2024-01-26T14:28:19.882256-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.27.70.36:1487, clientid: 00-04-A3-E6-2E-14, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-2E-14">>>
2024-01-26T14:28:19.999531-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.110.162.194:3825, clientid: 00-04-A3-E5-DF-7D, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E5-DF-7D>
2024-01-26T14:28:20.701453-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.6.83.215:1431, clientid: 00-04-A3-E6-10-BC, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-10-BC">>>
2024-01-26T14:28:21.423672-03:00 [error] msg: mongodb_query_failed, mfa: emqx_authn_mongodb:authenticate/2, line: 176, peername: X.67.144.55:1101, clientid: 00-04-A3-E6-1D-5E, collection: <<"brokers">>, filter: #{username => <<"00-04-A3-E6-1D-5E">>
2024-01-26T14:28:22.927516-03:00 [warning] msg: alarm_is_deactivated, mfa: emqx_alarm:do_actions/3, line: 424, name: <<"emqx_authn_mongodb:1">>
2024-01-28T01:27:11.464561-03:00 [warning] msg: unexpected_api_access, mfa: emqx_dashboard_not_found:init/2, line: 25, request: #{bindings => #{},body_length => 0,cert => undefined,has_body => false,headers => #{<<"accept-encoding">> => <<"gzip">>,<>
2024-01-28T02:12:32.270193-03:00 [error] supervisor: 'esockd_connection_sup - <0.2747.0>', errorContext: connection_shutdown, reason: #{hint => invalid_proto_name,parsed_length => 256,remaining_bytes_length => 1}, offender: [{pid,<0.4552.218>},{name>
2024-01-29T15:01:40.251201-03:00 [error] msg: session_stepdown_request_exception, mfa: emqx_cm:request_stepdown/3, line: 470, action: discard, reason: normal, stacktrace: [{emqx_ws_connection,call,3,[{file,"emqx_ws_connection.erl"},{line,192}]},{emq>
2024-01-30T05:24:09.595858-03:00 [error] supervisor: 'esockd_connection_sup - <0.2747.0>', errorContext: connection_shutdown, reason: #{hint => invalid_proto_name,parsed_length => 257,remaining_bytes_length => 1}, offender: [{pid,<0.12415.233>},{nam>
2024-01-30T21:59:26.485019-03:00 [error] supervisor: 'esockd_connection_sup - <0.2747.0>', errorContext: connection_shutdown, reason: #{hint => invalid_proto_name,parsed_length => 256,remaining_bytes_length => 1}, offender: [{pid,<0.16764.238>},{nam>
2024-01-31T01:25:51.094557-03:00 [warning] msg: unexpected_api_access, mfa: emqx_dashboard_not_found:init/2, line: 25, request: #{bindings => #{},body_length => 0,cert => undefined,has_body => false,headers => #{<<"accept-encoding">> => <<"gzip">>,<>
2024-02-01T09:42:37.246183-03:00 [error] crasher: initial call: cowboy_clear:connection_process/4, pid: <0.12929.3>, registered_name: [], error: {{case_clause,{error,timeout}},[{cowboy_websocket,websocket_send_close,2,[{file,"cowboy_websocket.erl"},>
2024-02-01T09:42:37.267175-03:00 [error] Ranch listener 'ws:default' had connection process started with cowboy_clear:start_link/4 at <0.12929.3> exit with reason: {{case_clause,{error,timeout}},[{cowboy_websocket,websocket_send_close,2,[{file,"cowb>

Nothing was logged around the time of the crash (Jan 31 22:59:11).

qzhuyan commented on May 30, 2024

Do you have a monitoring system? Was the memory usage high when the issue happened?

"I don't think the load generator will be of any help in this case. We've had really high loads and very low ones, but it's always after 30 days. The logs go from Dec 31 -> Jan 31, and here's another example:"

If we don't use the load generator to trigger the fault, do you have any other idea to speed up reproducing the problem? Or do we wait another 30 days?

vitorkogut commented on May 30, 2024

We do. The crash was at 22:59; the load was normal, with just a spike afterwards because of the restart.

image

Unfortunately, I really don't think we can speed up the process. It's totally random (other than this 30-day pattern): we tested load and number of clients, and it doesn't seem to be repeatable.

Could Erlang be the problem in this case?

leobesen commented on May 30, 2024

Our prod broker machine is only running the emqx service. The load is quite low:
image

id commented on May 30, 2024

@leobesen could you check the system logs (/var/log/syslog, or similar)? Maybe there is a cron job that is triggering this?

vitorkogut commented on May 30, 2024

@id here are the full logs from the journal:

Jan 31 22:53:59 ip-********* sshd[402874]: error: kex_exchange_identification: Connection closed by remote host
Jan 31 22:53:59 ip-********* sshd[402874]: Connection closed by 205.210.31.15 port 55425
Jan 31 22:56:38 ip-********* sshd[402876]: Received disconnect from 134.122.88.182 port 60424:11: Bye Bye [preauth]
Jan 31 22:56:38 ip-********* sshd[402876]: Disconnected from authenticating user root 134.122.88.182 port 60424 [preauth]
Jan 31 22:57:35 ip-********* sshd[402879]: Received disconnect from 134.122.88.182 port 52898:11: Bye Bye [preauth]
Jan 31 22:57:35 ip-********* sshd[402879]: Disconnected from authenticating user root 134.122.88.182 port 52898 [preauth]
Jan 31 22:58:22 ip-********* sshd[402881]: Received disconnect from 43.156.39.228 port 48912:11: Bye Bye [preauth]
Jan 31 22:58:22 ip-********* sshd[402881]: Disconnected from authenticating user root 43.156.39.228 port 48912 [preauth]
Jan 31 22:58:32 ip-********* sshd[402883]: Received disconnect from 134.122.88.182 port 39864:11: Bye Bye [preauth]
Jan 31 22:58:32 ip-********* sshd[402883]: Disconnected from authenticating user root 134.122.88.182 port 39864 [preauth]

Jan 31 22:59:11 ip-********* bash[329193]: free(): double free detected in tcache 2 

Jan 31 22:59:11 ip-********* bash[329478]: [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Jan 31 22:59:11 ip-********* systemd[1]: emqx.service: Main process exited, code=killed, status=6/ABRT
Jan 31 22:59:11 ip-********* bash[329477]: [os_mon] memory supervisor port (memsup): Erlang has closed
Jan 31 22:59:11 ip-********* systemd[1]: emqx.service: Failed with result 'signal'.
Jan 31 22:59:11 ip-********* systemd[1]: emqx.service: Consumed 2d 20h 2min 54.409s CPU time.
Jan 31 22:59:30 ip-********* sshd[402885]: Received disconnect from 43.156.39.228 port 40408:11: Bye Bye [preauth]
Jan 31 22:59:30 ip-********* sshd[402885]: Disconnected from authenticating user root 43.156.39.228 port 40408 [preauth]
Jan 31 22:59:33 ip-********* sshd[402887]: Received disconnect from 134.122.88.182 port 36118:11: Bye Bye [preauth]
Jan 31 22:59:33 ip-********* sshd[402887]: Disconnected from authenticating user root 134.122.88.182 port 36118 [preauth]
Jan 31 23:00:36 ip-********* sshd[402889]: Received disconnect from 134.122.88.182 port 37464:11: Bye Bye [preauth]
Jan 31 23:00:36 ip-********* sshd[402889]: Disconnected from authenticating user root 134.122.88.182 port 37464 [preauth]
Jan 31 23:00:39 ip-********* sshd[402891]: Received disconnect from 43.156.39.228 port 43570:11: Bye Bye [preauth]
Jan 31 23:00:39 ip-********* sshd[402891]: Disconnected from authenticating user root 43.156.39.228 port 43570 [preauth]
Jan 31 23:01:12 ip-********* systemd[1]: emqx.service: Scheduled restart job, restart counter is at 3.
Jan 31 23:01:12 ip-********* systemd[1]: Stopped emqx.service - emqx daemon.
Jan 31 23:01:12 ip-********* systemd[1]: emqx.service: Consumed 2d 20h 2min 54.409s CPU time.
Jan 31 23:01:12 ip-********* systemd[1]: Started emqx.service - emqx daemon.
Jan 31 23:01:14 ip-********* emqx[403128]: EXEC: /opt/emqx/erts-13.2.2/bin/erlexec -noshell -noinput +Bd -boot /opt/emqx/releases/5.1.6/start -boot_var RELEASE_LIB />
Jan 31 23:01:15 ip-********* bash[402893]: Listener tcp:default on 0.0.0.0:1883 started.
Jan 31 23:01:15 ip-********* bash[402893]: Listener ws:default on 0.0.0.0:80 started.
Jan 31 23:01:16 ip-********* bash[402893]: Listener http:dashboard on :18083 started.
Jan 31 23:01:17 ip-********* bash[402893]: EMQX 5.1.6 is running now!

id commented on May 30, 2024

@vitorkogut sorry for the long silence.

We tried to reproduce this but no luck so far. Did this happen recently in March as well?

leobesen commented on May 30, 2024

@id, yes, the same thing happened.

Mar 03 00:54:22 bash[402893]: free(): double free detected in tcache 2
Mar 03 00:54:22 bash[403178]: [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Mar 03 00:54:22 systemd[1]: emqx.service: Main process exited, code=killed, status=6/ABRT
Mar 03 00:54:22 bash[403177]: [os_mon] memory supervisor port (memsup): Erlang has closed
Mar 03 00:54:22 systemd[1]: emqx.service: Failed with result 'signal'.
Mar 03 00:54:22 systemd[1]: emqx.service: Consumed 2d 18h 1min 27.706s CPU time.
Mar 03 00:56:22 systemd[1]: emqx.service: Scheduled restart job, restart counter is at 4.
Mar 03 00:56:22 systemd[1]: Stopped emqx.service - emqx daemon.
Mar 03 00:56:22 systemd[1]: emqx.service: Consumed 2d 18h 1min 27.706s CPU time.
Mar 03 00:56:22 systemd[1]: Started emqx.service - emqx daemon.
Mar 03 00:56:24 emqx[484532]: EXEC: /opt/emqx/erts-13.2.2/bin/erlexec -noshell -noinput +Bd -boot /opt/emqx/releases/5.1.6/start -boot_var RELEASE_LIB /opt/emqx/lib -bo>
Mar 03 00:56:26 bash[484297]: Listener tcp:default on 0.0.0.0:1883 started.
Mar 03 00:56:26 bash[484297]: Listener ws:default on 0.0.0.0:80 started.
Mar 03 00:56:26 bash[484297]: Listener http:dashboard on :18083 started.
Mar 03 00:56:28 bash[484297]: EMQX 5.1.6 is running now!
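
Putting the journal timestamps quoted in this thread side by side gives the uptime before each abort. A minimal Python sketch; journald lines carry no year, so the years below (2023 for Dec 31, 2024 for the rest) are an assumption based on the 2024 timestamps in the EMQX log, and the `RUNS`/`uptime` names are only for illustration:

```python
from datetime import datetime, timedelta

FMT = "%Y %b %d %H:%M:%S"

# (time of "EMQX 5.1.6 is running now!", time of "double free detected"), keyed by process.
# The year prefix is an assumption; the journal lines themselves omit it.
RUNS = {
    "bash[329193]": ("2023 Dec 31 21:18:19", "2024 Jan 31 22:59:11"),
    "bash[402893]": ("2024 Jan 31 23:01:17", "2024 Mar 03 00:54:22"),
}


def uptime(start: str, crash: str) -> timedelta:
    """Wall-clock time between broker start and the abort."""
    return datetime.strptime(crash, FMT) - datetime.strptime(start, FMT)


UPTIMES = {proc: uptime(start, crash) for proc, (start, crash) in RUNS.items()}

if __name__ == "__main__":
    for proc, delta in UPTIMES.items():
        print(proc, delta)
```

Under those assumptions both runs lasted a little over 31 days (31 days 1:40:52 and 31 days 1:53:05), which may point to something tied to uptime rather than to the calendar date or the load.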

id commented on May 30, 2024

@qzhuyan do you have any idea how else we could troubleshoot this?

zmstone commented on May 30, 2024

We may need to reproduce this issue first: Debian 12 on ARM, plus configs and messages similar to those in @leobesen's environment.

qzhuyan commented on May 30, 2024

I got core dumps on arm64 everywhere with OTP 26.2. I am moving away from arm64.
