Comments (26)
@tinto82 thanks for reporting!
So now we seem to have a second data point for vernemq/vernemq#715
I understand this happens completely randomly? (That is, you don't have any observations about how and when it happens, or what context makes it more likely, etc.)
@dergraf any more ideas on how to address this? (I marked it as a bug in #715)
from docker-vernemq.
Hi ioolkos,
yes, it happens randomly.
Is there something I can do to help you understand the problem? Some data to retrieve or some tests to run?
Thank you,
Mauro
@tinto82 thanks for offering your help!
We'll let you know as soon as we have more hypotheses/ideas.
@ioolkos OK, I'll wait for news. Thank you
I'll add some more information for context.
In 6 months of operation before July we didn't have any problems. The number of subscribers is rising slowly and steadily: we now gain roughly 400-450 new subscribers per month, and in July the total was around 2000.
In our experience the problem seems to occur after some days of uptime. The liveness check we added with the last deploy restarts a pod whenever it gets stuck.
We also have a test environment very similar to the production one (where we have the problem); in this environment the nodes are preemptible, so they restart within 24 hours. In the test environment we tried 28000 subscribers publishing between 500 and 2000 messages per second, and similar problems did not occur.
Hi,
in the last 14 days our 3 VerneMQ pods restarted 15 times, with pod lifetimes between 9 hours and almost 12 days.
Hope that helps.
@tinto82 Could be related, but listener.vmq.default should be configured with a specific IP:PORT that is reachable by the other pods, instead of 0.0.0.0.
Can you exclude that the container ran out of memory and was terminated for that reason?
@dergraf the configuration is updated on startup and 0.0.0.0 is replaced with the pod's IP.
I can rule out a memory problem: yesterday, for example, a pod restarted while its memory usage was roughly 400 MB, against a pod memory limit of 3 GB.
The log error is:
14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
14:55:56.243 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
14:55:56.243 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
14:55:56.239 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
@dergraf apart from the 0.0.0.0 IP, do you notice anything else strange in the configuration?
Could subscriber/publisher behavior be a cause of the problem? Some devices connect to and disconnect from VerneMQ very often: could this behavior affect the stability of the application?
Well, it depends: do your subscribers/publishers always use a random client id when connecting? If they do, this could certainly become an issue.
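To make the client-id point concrete, here is a minimal Python sketch (illustrative only, not part of VerneMQ or the reporters' code) contrasting a random per-connect id with a stable id derived from a device serial. With a stable id, a reconnect takes over the old session instead of piling up new ones:

```python
import uuid

def stable_client_id(device_serial: str, prefix: str = "dev") -> str:
    """Derive a deterministic MQTT client id from a device serial.

    Reconnecting with the same id lets the broker take over the old
    session instead of accumulating new ones. Names here are
    illustrative, not part of any VerneMQ API.
    """
    # uuid5 is deterministic for the same namespace + name
    return f"{prefix}-{uuid.uuid5(uuid.NAMESPACE_DNS, device_serial)}"

def random_client_id(prefix: str = "dev") -> str:
    """A random id, by contrast, differs on every connect."""
    return f"{prefix}-{uuid.uuid4()}"
```

With random ids every reconnect looks like a brand-new client to the broker, so dead sessions can accumulate instead of being taken over.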
Hi @tinto82, do your clients use clean_session=false, clean_session=true, or a mix? Do you use allow_multiple_sessions=on?
Given the 14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]}, error,
this looks like the same issue as vernemq/vernemq#715 and vernemq/vernemq#547, so I'm trying to figure out what those three issues have in common.
So far the only common denominator is that you all run Docker, but it's hard to understand why that would play a role...
Hi @dergraf, every device has a specific client id.
Hi @larshesel, clients use clean_session=true and we don't specify allow_multiple_sessions in our configuration; the default value is off, right?
Thank you
Yes, that is correct - it defaults to off - so my suspicion that allow_multiple_sessions had something to do with it was wrong :) Need to form another hypothesis.
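For reference, a sketch of how this setting would look in the configuration. The env-var form follows the docker-vernemq convention of prefixing config keys with DOCKER_VERNEMQ_; treat the exact spelling as an assumption to verify against your image version:

```
## vernemq.conf (excerpt) — allow_multiple_sessions defaults to off
allow_multiple_sessions = off

## equivalent via the docker-vernemq env-var mapping:
## DOCKER_VERNEMQ_ALLOW_MULTIPLE_SESSIONS=off
```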
Hi everybody,
we noticed that some clients sometimes misbehave in a way we are still investigating: they repeatedly connect, every second, with the same client_id and without subscribing to any topic.
Alongside these repeated connections we find this warning in the logs:
08:55:07.113 [warning] can't migrate the queue[<0.4193.61>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/eee/ffffffff/hhh/k">>} due to no responsible remote node found
that I have reported in issue vernemq/vernemq#862.
Could this be related to the stuck-pods problem?
Thank you
Hi,
today a client started connecting at a higher frequency than in the cases seen so far (roughly 12 connects/second).
As soon as these connections began, we started seeing VerneMQ restarts.
Logs below:
09:25:17.939 [warning] can't migrate the queue[<0.27547.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:24.371 [warning] can't migrate the queue[<0.27793.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:27.845 [warning] can't migrate the queue[<0.27958.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:33.914 [warning] can't migrate the queue[<0.28176.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:37.726 [warning] can't migrate the queue[<0.28283.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:26:01.844 [warning] can't migrate the queue[<0.29285.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:01.133 [warning] can't migrate the queue[<0.31830.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:02.441 [warning] can't migrate the queue[<0.31887.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:09.074 [warning] can't migrate the queue[<0.32172.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:13.133 [warning] can't migrate the queue[<0.32374.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:19.119 [warning] can't migrate the queue[<0.32613.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:20.325 [warning] can't migrate the queue[<0.32659.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.590 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.591 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.592 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:24.121 [warning] can't migrate the queue[<0.54.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:40.013 [warning] can't migrate the queue[<0.734.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:53.927 [warning] can't migrate the queue[<0.1330.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:56.353 [warning] can't migrate the queue[<0.1449.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
09:28:19.384 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 4 keys
09:28:29.381 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:29.508 [warning] client {[],<<"aaaa/rrrrrrrrrrrr/ccc/DD000015/eee/1">>} with username <<"aaabbbccc">> stopped due to keepalive expired
09:28:31.254 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.277 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.291 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
09:28:31.292 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
09:28:31.292 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
09:28:31.293 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
09:28:31.293 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
09:28:31.294 [error] Supervisor plumtree_sup had child plumtree_broadcast started with plumtree_broadcast:start_link() at <0.224.0> exit with reason {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in context child_terminated
09:28:39.391 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:41.400 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:49.393 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 5 keys
09:28:51.313 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 2 keys
09:28:59.383 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys
09:29:01.313 [info] completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys
We are beginning to correlate these reconnections with the VerneMQ restarts.
Hi,
we have also reproduced the problem in a test environment similar to the production one.
We developed a test client that opens connections one after another, without stopping and without closing the previous connection. The test made 2 pods restart within a few minutes.
I hope this helps your analysis.
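For readers who want to try a similar pattern, here is a minimal Python sketch of such a misbehaving client, built with only the standard library. The packet layout follows the MQTT 3.1.1 CONNECT format; the host/port arguments and the `hammer` helper are illustrative assumptions, not the reporters' actual test client:

```python
import socket
import struct

def mqtt_connect_packet(client_id: str, clean_session: bool = True) -> bytes:
    """Build a minimal MQTT 3.1.1 CONNECT packet (no auth, no will)."""
    proto = b"\x00\x04MQTT\x04"                       # protocol name + level 4
    flags = bytes([0x02 if clean_session else 0x00])  # clean-session flag only
    keepalive = struct.pack(">H", 60)                 # 60-second keepalive
    cid = client_id.encode()
    payload = struct.pack(">H", len(cid)) + cid       # length-prefixed client id
    body = proto + flags + keepalive + payload
    assert len(body) < 128                            # single-byte remaining length
    return bytes([0x10, len(body)]) + body            # 0x10 = CONNECT fixed header

def hammer(host: str, port: int, client_id: str, n: int) -> list:
    """Open n connections with the SAME client id without closing the
    previous ones, mimicking the misbehaving device described above."""
    socks = []
    for _ in range(n):
        s = socket.create_connection((host, port), timeout=5)
        s.sendall(mqtt_connect_packet(client_id))
        socks.append(s)   # deliberately leaked: earlier sockets stay open
    return socks
```

Each takeover forces the broker to migrate or discard the previous session's queue, which is exactly the code path the migration warnings above come from.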
Hi @tinto82 thanks. I can easily reproduce the migration warnings, even for a single node broker. It's more challenging to trigger the plumtree errors. But we're working on it.
I've been able to reproduce the migration warnings as well, but not yet the plumtree crash. In your test scenario where you made those pods restart, was that due to memory issues, or did they also exhibit the plumtree crash in the logs?
Hi @larshesel,
in our test environment we also see the plumtree crash.
I can add some new info we found in recent days. Sometimes a client misbehaves: it connects to VerneMQ and subscribes, but also spawns another thread and tries to open a second connection. We are trying to understand this strange behavior.
We have also seen this problem: when the client opens a new connection, we would expect the previous one to be closed and the new one accepted. Instead, the previous connection is not closed and the new one is rejected, so the client enters a loop, retrying the connection and making far too many calls.
We observed this with Wireshark. If it would help you understand the problem, we can share the Wireshark captures with you privately.
Hi @larshesel,
has the latest information we provided been helpful?
Hi @tinto82, I've been away a while - I hope to take a look again at this soon.
Hi @larshesel, can I do anything else to help you understand the problem?
The problem is the plumtree crash you see above. We've added some additional logging for this case in the newly released VerneMQ 1.6.0 - would you be able to test with that release? It will log errors with the following line: Known issue, please report here https://github.com/erlio/vernemq/issues/715
plus some hopefully detailed enough information to help us get to the bottom of this.
The change in plumtree also means that VerneMQ will try to repair the data causing this (after logging it), so it will just log the message and should continue working.
Closing this due to inactivity - feel free to reopen, or open a new issue if the problem persists with VerneMQ 1.9.0.