
RabbitMQ nodedown about rabbitmq HOT 56 CLOSED

sky-big avatar sky-big commented on September 24, 2024
RabbitMQ nodedown

from rabbitmq.

Comments (56)

sky-big avatar sky-big commented on September 24, 2024

Check the RabbitMQ logs. There should be log entries from the moment the node went down.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

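The strange part is that the logs show nothing at all. Whatever I can see comes from commands such as rabbitmqctl. The real problem is that I can't read Erlang; otherwise I would dig in myself and find out where it goes wrong. epmd -names shows everything as normal. Can you tell me where the code entry point is, so I can study it? Or, if it's easier, the entry point in RabbitMQ that the rabbitmqctl code ends up calling?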

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

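There should be logs. There are two log files, one ending in .log and one ending in _sasl.log. If you want to read the code: your problem is about a node going down, so you could look at rabbit_node.erl, but I doubt you will find anything there. I suspect a network problem that left the nodes unable to reach each other.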

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

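The file ending in .log shows nothing abnormal, only connections being closed normally. The file ending in sasl.log contains nothing but entries like this:
=SUPERVISOR REPORT==== 12-Jun-2017::13:39:13 ===
Supervisor: {<0.17645.33>,amqp_channel_sup_sup}
Context: shutdown_error
Reason: shutdown
Offender: [{nb_children,1}, {name,channel_sup}, {mfargs, {amqp_channel_sup,start_link, [direct,<0.17643.33>, <<"<[email protected]>">>]}}, {restart_type,temporary}, {shutdown,brutal_kill}, {child_type,supervisor}]
I asked mk from the official team and got no definite conclusion either. If this were a network problem, then rabbitmqctl status run on the machine itself should at least show the details of the local node, but it doesn't display anything useful either. It suggests three possibilities: 1. hostname mismatch; 2. ttl; 3. incorrect cookie. The second is clearly impossible. As for the third, .erlang.cookie has never been touched. As for the first, the hostname is in /etc/hosts, which only contains:
127.0.0.1 xhy-205-2.linux.com
192.18.205.2 xhy-205-2
In ordinary cluster mode, messages are fetched by pulling, right? If so, could it be that the IP address was changed when rabbitmq registered?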

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

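Someone else ran into exactly the same situation before, with no errors in any log. I told him it was the first case as well: the hostname was changed at some point, which made that node lose contact. So it should be a hostname problem.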

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

Isn't the hostname read once, at first startup? Is the registration with epmd really updated in real time?

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

[root@xhy-205-2 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 xhy-205-2.linux.com

192.18.205.2 xhy-205-2
192.18.205.3 xhy-205-3
192.18.205.4 xhy-205-4
192.18.205.5 xhy-205-5
192.18.205.6 xhy-205-6
192.18.205.7 xhy-205-7

192.18.209.2 xhy-209-2
192.18.209.3 xhy-209-3
One more question: in RabbitMQ, when node A and node B communicate, does A pull from B, or does B push to A?

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

The two nodes are connected to each other; each side has a connection established to the other.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

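Which version of RabbitMQ are you using? That other person's situation was the same as yours, so maybe try upgrading RabbitMQ. Still, I suspect that shortly before your node went down, some other program changed the hostname and the node was lost as a result.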

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

RabbitMQ 3.5.7. Is there anything wrong with the hostnames above? Or, how can I see which hostname RabbitMQ is actually using?

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

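This is production, with more than 20 machines. To convince my manager to upgrade I need solid evidence. Without it, he won't approve an upgrade.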

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

Our cluster setup: keepalived, rabbitmq, and haproxy are all installed on the same server. keepalived provides hot standby for haproxy, and haproxy distributes and load-balances all requests.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

My thinking is this: if I could print the hostname that the rabbit node registered in epmd, I could pin down where the problem is.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

Run epmd -names on the machine and take a look.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

[root@fdfs2-s2 /]# epmd -d -names
epmd: up and running on port 4369 with data:
name rabbit at port 25672
I have already run this command. It succeeds every time, even while the node is down.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

/usr/lib64/erlang/erts-8.2/bin/beam.smp -W w -A 64 -P 1048576 -K true -B i -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.7/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@fdfs2-s2 -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/[email protected]"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/[email protected]"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.5.7/sbin/../plugins" -rabbit plugins_expand_dir "/data/rabbitmq/mnesia/rabbit@fdfs2-s2-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/data/rabbitmq/mnesia/rabbit@fdfs2-s2" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
Apart from inet_dist_listen_max, which I know is the listening port, I can't see where epmd gets started.
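A note on the command line above: nothing in it launches epmd by name. As far as I know, the Erlang runtime starts epmd on its own when a node is started with -sname or -name and no epmd is running yet. The sketch below is only an illustration in Python; the helper name parse_beam_args and the shortened SAMPLE command line are assumptions made for this example. It pulls out the flags that relate to what epmd -names prints: the node name (the part before the @ is what epmd lists) and the distribution port.

```python
def parse_beam_args(cmdline: str) -> dict:
    """Pull the node name and distribution port range out of a beam.smp command line."""
    tokens = cmdline.split()
    found = {}
    for i, tok in enumerate(tokens[:-1]):
        if tok in ("-sname", "-name"):
            # The value after -sname/-name is the full node name, e.g. rabbit@host.
            found["name_type"] = tok
            found["node"] = tokens[i + 1]
        elif tok in ("inet_dist_listen_min", "inet_dist_listen_max"):
            # These kernel parameters bound the port used for node-to-node traffic.
            found[tok] = int(tokens[i + 1])
    return found


# Shortened copy of the command line from the post (hypothetical abbreviation).
SAMPLE = (
    "beam.smp -noshell -noinput -s rabbit boot -sname rabbit@fdfs2-s2 "
    "-kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672"
)

info = parse_beam_args(SAMPLE)
print(info)
```

With min and max both set to 25672, the node can only listen on that one port, which matches the "name rabbit at port 25672" line from epmd.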

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

Did the failed node crash outright, or did it just lose contact with the cluster? If the node is still running, log in to it and run nodes().

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

epmd is started on every machine that runs a rabbit node.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

epmd is essentially the DNS for Erlang nodes: it maps node names to the ports they listen on.
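To make the DNS comparison concrete, here is a rough Python sketch of the lookup that epmd -names performs, based on my understanding of the epmd protocol (a NAMES_REQ with request code 110, answered with a 4-byte port number followed by text lines). The function names are made up for this example, and query_epmd is defined but not called here because it needs a running epmd. Note that the reply carries only a name and a port per node, so the hostname the node believes it has cannot be read from epmd this way.

```python
import socket
import struct

EPMD_PORT = 4369  # default epmd port, as shown in the epmd -names output
NAMES_REQ = 110   # request code 'n' in the epmd protocol


def build_names_request() -> bytes:
    """Frame a NAMES_REQ: 2-byte big-endian length, then the 1-byte request code."""
    return struct.pack(">HB", 1, NAMES_REQ)


def parse_names_response(data: bytes) -> dict:
    """Parse the reply: 4-byte epmd port, then lines 'name <node> at port <port>'."""
    (epmd_port,) = struct.unpack(">I", data[:4])
    nodes = {}
    for line in data[4:].decode("ascii", errors="replace").splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "name" and parts[2:4] == ["at", "port"]:
            nodes[parts[1]] = int(parts[4])
    return {"epmd_port": epmd_port, "nodes": nodes}


def query_epmd(host: str = "127.0.0.1", timeout: float = 2.0) -> dict:
    """Ask a running epmd for its registered names (epmd closes the socket when done)."""
    with socket.create_connection((host, EPMD_PORT), timeout=timeout) as sock:
        sock.sendall(build_names_request())
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_names_response(b"".join(chunks))


# Offline demonstration with a reply shaped like the epmd -names output in this thread.
sample = struct.pack(">I", 4369) + b"name test at port 33735\nname rabbit at port 25672\n"
print(parse_names_response(sample))
```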

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

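It's just that rabbitmqctl status no longer works and reports nodedown, and the application servers report that fetching messages failed. The messages are actually still there; they just can't be fetched. How do I run nodes().?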

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

[root@fdfs2-s2 bin]# ^C
[root@fdfs2-s2 bin]# erl -sname test@localhost -setcookie PUUSPJHCJRAMFVIEHLGI
Erlang/OTP 19 [erts-8.2] [source-fbd2db2] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.2 (abort with ^G)
(test@localhost)1> nodes().
[]
(test@localhost)2>

[root@fdfs2-s2 ~]# epmd -names
epmd: up and running on port 4369 with data:
name test at port 33735
name rabbit at port 25672

What is wrong with doing it this way?

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

[root@rabbit rabbitmq]# rabbitmqctl join_cluster rabbit@mqmaster
Clustering node rabbit@rabbit with rabbit@mqmaster ...
Error: unable to connect to nodes [rabbit@mqmaster]: nodedown
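I think it's your hostname. Alternatively, don't use the hostname at all and use the IP directly, that is, name your rabbit node in the form <rabbit name>@<IP address> (the IP replaces your hostname). I think that should solve your problem, because it skips hostname resolution entirely.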

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

So you mean I need to edit the settings in rabbitmq-server and change -sname directly to rabbitmq@ip?

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

Yes.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

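Does it make any difference whether .erlang.cookie ends with a line break or not? For RabbitMQ, would the hashed value come out different? I asked the official team, and they said it has no effect.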

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

You could first bring up a test cluster this way and see whether any node still goes down.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

The cookie only has to be identical across the nodes, so it shouldn't matter.
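For comparing cookie files copied from two machines, a small Python sketch like the one below can help. It assumes, in line with the official answer quoted above, that a trailing line break is not significant, so it strips trailing whitespace before comparing. The function names and the cookie values are placeholders invented for this example.

```python
def normalize_cookie(raw: bytes) -> str:
    """Decode cookie file contents and drop trailing line breaks and spaces."""
    return raw.decode("ascii").rstrip("\r\n ")


def same_cookie(a: bytes, b: bytes) -> bool:
    """Compare two cookie file contents, ignoring trailing whitespace (an assumption)."""
    return normalize_cookie(a) == normalize_cookie(b)


# Placeholder values, not real cookies.
print(same_cookie(b"EXAMPLECOOKIE\n", b"EXAMPLECOOKIE"))  # differ only by a line break
print(same_cookie(b"EXAMPLECOOKIE", b"OTHERCOOKIE"))      # genuinely different
```

In practice the two byte strings would come from reading .erlang.cookie on each node in binary mode.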

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

[root@fdfs2-s2 bin]# cat /var/log/rabbitmq/startup_log
ERROR: epmd error for host 10.100.157.197}: nxdomain (non-existing domain)

This is the error reported after I modified rabbitmq-env:
else
RABBITMQ_NAME_TYPE=-sname
[ "x" = "x$HOSTNAME" ] && HOSTNAME=`env hostname`
[ "x" = "x$NODENAME" ] && [email protected]}

[ "x" = "x$NODENAME" ] && NODENAME=rabbit@${HOSTNAME%%.*}
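One detail in the last line of that fragment: the shell expansion ${HOSTNAME%%.*} removes the longest suffix that starts with a dot, so only the text before the first dot survives. The Python sketch below mimics that behaviour to show what it does to a fully qualified hostname and to a dotted IP address. The function names are invented for this example, and it does not claim to reproduce the rest of rabbitmq-env.

```python
def strip_domain(hostname: str) -> str:
    """Mimic the shell expansion ${HOSTNAME%%.*}: drop everything from the first dot."""
    return hostname.split(".", 1)[0]


def default_nodename(hostname: str) -> str:
    """Build the default short node name the way the last script line does."""
    return "rabbit@" + strip_domain(hostname)


for host in ("xhy-205-2.linux.com", "fdfs2-s2", "10.100.157.197"):
    print(host, "->", default_nodename(host))
```

A dotted IP address run through this expansion keeps only its first octet, which suggests that the short-name (-sname) path is not meant to carry IP addresses.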

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

Or could I set RABBITMQ_NODE_IP_ADDRESS so that it listens only on a specific IP? If so, keepalived probably could not be used anymore.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

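That is probably rabbitmq-env enforcing that the node name must use the hostname. Take a look at the makefile. You could also go straight to the start_rabbitmq_server() function in rabbitmq-server, change the node name there, and try starting it.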

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

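Same as with that other person: my suspicion is that at some moment another application briefly changed the hostname, so that a heartbeat inside the rabbit cluster could not find that node, and the rabbit node went down.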

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

Did it work for him after switching to the IP address?

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

I only gave him the solution. I don't know how he resolved it in the end.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

[root@xhy-205-2 ~]# ls -slt /etc/hosts
4 -rw-r--r-- 1 root root 382 Aug 26 2016 /etc/hosts

I checked the modification time of this hosts file, and it doesn't look like it was ever modified. Well, that's awkward.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

What about the machine's own hostname? Did that change?

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

How do I get that hostname? It only counts if it is the one Erlang sees. But how do I get it from inside Erlang?

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

Could this be caused by keepalived giving two nodes the same virtual IP?
bond0 Link encap:Ethernet HWaddr 24:6E:96:24:67:64
inet addr:192.18.205.2 Bcast:172.18.205.255 Mask:255.255.255.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:74662923893 errors:0 dropped:0 overruns:0 frame:0
TX packets:72946533560 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:45029744249942 (40.9 TiB) TX bytes:53250385229498 (48.4 TiB)

bond0:0 Link encap:Ethernet HWaddr 24:6E:96:24:67:64
inet addr:192.18.205.200 Bcast:0.0.0.0 Mask:255.255.255.255
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

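The rabbit node name on your machine has the form rabbit@<this machine's hostname>, right? If so, then a change to the machine's hostname by some other program would produce exactly the problem you are seeing. I think that is what happened. You could ask on RabbitMQ's GitHub or mailing list. (Also try adding the IP for the local hostname to /etc/hosts; that might solve it.)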

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

The local hostname is already there, in the reply I gave you above. Every machine has the same entries:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 xhy-205-2.linux.com

192.18.205.2 xhy-205-2
192.18.205.3 xhy-205-3
192.18.205.4 xhy-205-4
192.18.205.5 xhy-205-5
192.18.205.6 xhy-205-6
192.18.205.7 xhy-205-7

192.18.209.2 xhy-209-2
192.18.209.3 xhy-209-3
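One way to read the listing above is to ask, for each name, which address it appears under. The Python sketch below does that on an abbreviated copy of the file; parse_hosts is a name invented for this example, and real resolution also depends on resolver configuration that this sketch ignores. On this data the short name xhy-205-2 appears under the LAN address, while the fully qualified xhy-205-2.linux.com appears under 127.0.0.1. Whether that difference has anything to do with the nodedown is not established in this thread.

```python
def parse_hosts(text: str) -> dict:
    """Map each hostname or alias to the list of addresses it appears under."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, []).append(address)
    return table


# Abbreviated copy of the /etc/hosts content posted above.
HOSTS = """\
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 xhy-205-2.linux.com

192.18.205.2 xhy-205-2
192.18.205.3 xhy-205-3
"""

table = parse_hosts(HOSTS)
for name in ("xhy-205-2", "xhy-205-2.linux.com"):
    print(name, "->", table.get(name))
```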

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

It looks like the command is node()., not nodes().

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

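nodes(). lists the other nodes this node is currently connected to, which in a healthy cluster means the other cluster members; node(). returns the name of the local node.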

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

That's not right, then. nodes(). doesn't show me all the nodes in the cluster.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

My mistake. I was running it in the wrong environment.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

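I found that in production I can see one of the other nodes. It seems this only works through rabbitmqctl eval 'nodes().'. When I start my own node with the same cookie, it doesn't work.
How awkward.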

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

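I just asked Karl from RabbitMQ, who told me that rabbitmqctl eval "erlang:get_cookie()." returns the cookie currently in use. I don't know whether it still works while the node is reported as nodedown.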

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

erlang:get_cookie() does work; it exists specifically to return the cookie.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

In Erlang, how do I get a node that I started myself to join the cluster as well?

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

erlang:ping(). A return value of pong means success.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

That doesn't seem to work.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

I tested it. To connect to the running rabbitmq node, you can use net_adm:ping('rabbit@hostname'). This command puts the two erl nodes in the same cluster.

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

That's the one...

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

Is there a good way to dissolve the current cluster and split it into standalone nodes?

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

Doesn't RabbitMQ have a command for removing a node from a cluster? I think the data is lost, though.

from rabbitmq.

chutian52 avatar chutian52 commented on September 24, 2024

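Strictly speaking, losing the data is not a big deal. The key question is whether all the queues are either kept or completely cleared.
Either reset or force_reset would be fine.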

from rabbitmq.

sky-big avatar sky-big commented on September 24, 2024

For that, look at the rabbitmqctl commands.

from rabbitmq.
