
Comments (26)

ioolkos commented on August 18, 2024

@tinto82 thanks for reporting!
So now we seem to have a second data point for vernemq/vernemq#715

I understand this happens completely randomly? (That is, you don't have any observations about how and when it happens, or what context could make it more likely, etc.)

@dergraf more ideas how to address this? (I marked it as a bug in #715)

tinto82 commented on August 18, 2024

Hi ioolkos,

yes, it happens randomly.
Is there something I can do to help you understand the problem? Some data to retrieve or some tests to run?

Thank you,
Mauro

ioolkos commented on August 18, 2024

@tinto82 thanks for offering your help!
We'll let you know as soon as we have some more hypotheses/ideas.

tinto82 commented on August 18, 2024

@ioolkos OK, I'll wait for news. Thank you.

tinto82 commented on August 18, 2024

I'm adding more information to give better context.
In the 6 months of operation before July we didn't have any problems. The number of subscribers is growing slowly and steadily: we now get roughly 400-450 new subscribers every month, and as of July the total was around 2000.
From our experience it seems the problem occurs after some days of uptime. The liveness check we added with the last deploy is there to restart a pod whenever it gets stuck.
We also have a test environment, very similar to the production one (where we have the problem); in this environment the nodes are preemptible, so they restart within 24 hours. In the test environment we tried with 28000 subscribers, publishing between 500 and 2000 messages per second, and similar problems did not occur.

tinto82 commented on August 18, 2024

Hi,
in the last 14 days our 3 VerneMQ pods have restarted 15 times, with pod lifetimes ranging from 9 hours to almost 12 days.
I hope that helps.

dergraf commented on August 18, 2024

@tinto82 Could be related, but the listener.vmq.default should be configured with a specific IP:PORT that is reachable by the other pods instead of using 0.0.0.0.
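
For illustration, a hedged sketch of what that listener setting could look like in vernemq.conf (the IP below is a placeholder for the pod's own address, and 44053 is assumed here as VerneMQ's usual clustering port; adjust both to your deployment):

listener.vmq.default = 10.8.1.12:44053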

Can you exclude that the container went out of memory and terminated for that reason?

tinto82 commented on August 18, 2024

@dergraf on startup the configuration is updated and the IP 0.0.0.0 is replaced with the pod's own IP.
I can exclude a memory problem; e.g. yesterday a pod restarted while its memory usage was roughly 400 MB, whereas the pod memory limit is 3 GB.

The log error is:

14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
14:55:56.243 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
14:55:56.243 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
14:55:56.239 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}

tinto82 commented on August 18, 2024

@dergraf apart from the IP 0.0.0.0, do you notice anything else strange in the configuration?
Could subscriber/publisher behavior be a cause of the problem? Some devices connect to and disconnect from VerneMQ very often: could this behavior affect the stability of the application?

dergraf commented on August 18, 2024

Well, it depends: do your subscribers/publishers always use a random client id when connecting? If they do, this could certainly become an issue.

larshesel commented on August 18, 2024

Hi @tinto82, do your clients use clean_session=false, clean_session=true, or a mix? Do you use allow_multiple_sessions=on?

Given the error line 14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]}, this looks like the same issue as vernemq/vernemq#715 and vernemq/vernemq#547, so I'm trying to figure out what those three issues have in common.

So far the only common denominator is that you run Docker, but it's hard to understand why that would play a role...

tinto82 commented on August 18, 2024

Hi @dergraf, every device has its own specific client id.

tinto82 commented on August 18, 2024

Hi @larshesel, clients use clean_session=true and we don't specify allow_multiple_sessions in our configuration; the default value is off, right?
Thank you
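
For clarity, a minimal sketch of the client setup described here, using Python with paho-mqtt 1.x; the broker address, credentials, client id and topic are hypothetical placeholders:

import paho.mqtt.client as mqtt

# Fixed, device-specific client id (placeholder modeled on the ids visible in the logs above)
CLIENT_ID = "aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1"

def on_connect(client, userdata, flags, rc):
    # Subscribe only after the broker has acknowledged the connection
    client.subscribe("aaaa/bbbbbbbbbbbb/#", qos=1)

client = mqtt.Client(client_id=CLIENT_ID, clean_session=True)   # clean_session=True, as stated above
client.on_connect = on_connect
client.username_pw_set("aaabbbccc", "secret")                   # placeholder credentials
client.connect("vernemq.example.internal", 1883, keepalive=60)  # placeholder broker address
client.loop_forever()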

larshesel commented on August 18, 2024

Yes, that is correct - it defaults to off - so my suspicion that allow_multiple_sessions had something to do with it was wrong :) Need to form another hypothesis.

tinto82 commented on August 18, 2024

Hi everybody,
we noticed that some clients sometimes misbehave in a way we are still investigating: they repeatedly reconnect every second with the same client_id, without subscribing to any topic.
Alongside these repeated connections we find the following warning in the logs:

08:55:07.113 [warning] can't migrate the queue[<0.4193.61>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/eee/ffffffff/hhh/k">>} due to no responsible remote node found

that I have reported in issue vernemq/vernemq#862.
Can this be related to the "stuck pods" problem?

Thank you

tinto82 commented on August 18, 2024

Hi,
today a client started making connections at a higher rate than in the cases seen so far (roughly 12 connects/second).
As soon as these connections began, the VerneMQ restarts started as well.
The logs are below:

09:25:17.939 [warning] can't migrate the queue[<0.27547.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:24.371 [warning] can't migrate the queue[<0.27793.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:27.845 [warning] can't migrate the queue[<0.27958.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:33.914 [warning] can't migrate the queue[<0.28176.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:37.726 [warning] can't migrate the queue[<0.28283.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:26:01.844 [warning] can't migrate the queue[<0.29285.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:01.133 [warning] can't migrate the queue[<0.31830.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:02.441 [warning] can't migrate the queue[<0.31887.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:09.074 [warning] can't migrate the queue[<0.32172.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:13.133 [warning] can't migrate the queue[<0.32374.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:19.119 [warning] can't migrate the queue[<0.32613.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:20.325 [warning] can't migrate the queue[<0.32659.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.590 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.591 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.592 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:24.121 [warning] can't migrate the queue[<0.54.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:40.013 [warning] can't migrate the queue[<0.734.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:53.927 [warning] can't migrate the queue[<0.1330.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:56.353 [warning] can't migrate the queue[<0.1449.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
09:28:19.384 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 4 keys
09:28:29.381 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:29.508 [warning] client {[],<<"aaaa/rrrrrrrrrrrr/ccc/DD000015/eee/1">>} with username <<"aaabbbccc">> stopped due to keepalive expired
09:28:31.254 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.277 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.291 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
09:28:31.292 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
09:28:31.292 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
09:28:31.293 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
09:28:31.293 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
09:28:31.294 [error] Supervisor plumtree_sup had child plumtree_broadcast started with plumtree_broadcast:start_link() at <0.224.0> exit with reason {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in context child_terminated
09:28:39.391 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:41.400 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:49.393 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 5 keys
09:28:51.313 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 2 keys
09:28:59.383 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys
09:29:01.313 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys

We are starting to correlate the reconnections with the VerneMQ restarts.

tinto82 commented on August 18, 2024

Hi,
we have also reproduced the problem in a test environment similar to the production one.
We developed a test client that opens connections one after another, without stopping and without closing the previous connection (see the sketch below). The test caused 2 pods to restart within a few minutes.
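
A minimal sketch of what this test client does, in Python with paho-mqtt 1.x (the broker address is a placeholder); the key point is that it keeps opening new connections with the same client id and never disconnects the previous ones:

import time
import paho.mqtt.client as mqtt

BROKER = "vernemq.example.internal"   # placeholder broker address
CLIENT_ID = "stuck-test-client"       # one fixed client id reused for every connection

clients = []
while True:
    c = mqtt.Client(client_id=CLIENT_ID, clean_session=True)
    c.connect(BROKER, 1883, keepalive=60)
    c.loop_start()          # keep this connection's network loop running...
    clients.append(c)       # ...and never call disconnect() on the earlier clients
    time.sleep(0.1)         # roughly 10 connection attempts per second

The loop never calls disconnect(), so the older sockets stay open on the client side while new CONNECTs keep arriving with the same client id.
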
I hope this can help your analysis.

ioolkos commented on August 18, 2024

Hi @tinto82 thanks. I can easily reproduce the migration warnings, even for a single node broker. It's more challenging to trigger the plumtree errors. But we're working on it.

larshesel commented on August 18, 2024

I've been able to reproduce the migration warnings as well, but not yet the plumtree crash. In your test scenario where you made those pods restart, was that due to memory issues, or did they also exhibit the plumtree crash in the logs?

tinto82 commented on August 18, 2024

Hi @larshesel,
we see the plumtree crash in our test environment as well.
I can also add some new information we found in recent days. Sometimes a client misbehaves: it opens a connection to VerneMQ and subscribes, but also spawns another thread and tries to open a second connection. We are trying to understand this strange behavior.
We have also seen this problem: when the client tries to open a new connection, we would expect the previous one to be closed and the new one to be created. In fact the previous connection is not closed and the new one is rejected, so the client enters a loop trying to create the new connection and making far too many calls.
We observed this using Wireshark. If it helps you understand the problem, we can share the Wireshark captures with you privately.

tinto82 commented on August 18, 2024

Hi @larshesel,

has the latest information we provided been helpful?

larshesel commented on August 18, 2024

Hi @tinto82, I've been away a while - I hope to take a look again at this soon.

tinto82 commented on August 18, 2024

Hi @larshesel, can I do something else to help you understand the problem?

larshesel commented on August 18, 2024

The problem is the plumtree crash you see above. We've added some additional logging for this case in the newly released VerneMQ 1.6.0 - would you be able to test with that release? It will log errors containing the line "Known issue, please report here https://github.com/erlio/vernemq/issues/715", plus some hopefully detailed enough information to help us get to the bottom of this.

larshesel commented on August 18, 2024

Also, the change in plumtree means that VerneMQ will try to repair the data causing this (after logging it), so it will just log the message and should continue working.

larshesel commented on August 18, 2024

Closing this due to inactivity - feel free to reopen or open a new issue if the problem persists with VerneMQ 1.9.0.
