kronosnet's Introduction

#
# Copyright (C) 2010-2024 Red Hat, Inc.  All rights reserved.
#
# Author: Fabio M. Di Nitto <[email protected]>
#
# This software licensed under GPL-2.0+
#

Upstream resources
------------------

https://github.com/kronosnet/kronosnet/
https://ci.kronosnet.org/
https://projects.clusterlabs.org/project/board/86/ (TODO list and activities tracking)
https://drive.google.com/drive/folders/0B_zxAPgZTkM_TklfYzN6a2FYUFE?resourcekey=0-Cfr5D94rZ8LVbeMPGjxbdg&usp=sharing (google shared drive)
https://lists.kronosnet.org/mailman3/postorius/lists/users.lists.kronosnet.org/
https://lists.kronosnet.org/mailman3/postorius/lists/devel.lists.kronosnet.org/
https://lists.kronosnet.org/mailman3/postorius/lists/commits.lists.kronosnet.org/
https://kronosnet.org/ (web 0.1 style)
IRC: #kronosnet on Libera.Chat

Architecture
------------

Please refer to the google shared drive Presentations directory for
diagrams and fancy schemas

Running knet on FreeBSD
-----------------------

knet requires big socket buffers and you need to set:
kern.ipc.maxsockbuf=18388608
in /etc/sysctl.conf or knet will fail to run.

For FreeBSD 12 (or lower), knet also requires:
net.inet.sctp.blackhole=1
in /etc/sysctl.conf or knet will fail to work with SCTP.
This sysctl is obsolete in FreeBSD 13.

libnozzle requires if_tap.ko loaded in the kernel.

Please avoid using ifconfig_DEFAULT in /etc/rc.conf to use
DHCP for all interfaces or the dhclient will interfere with
libnozzle interface management, causing errors on some
operations such as "ifconfig tap down".
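
As a rough sketch of the socket buffer requirement above, the following script compares a value against the minimum quoted in this section. The function name maxsockbuf_ok and the variable names are made up for illustration; only the sysctl name and the number come from the text, and the live check only runs on a FreeBSD host.

```shell
#!/bin/sh
# Minimum value for kern.ipc.maxsockbuf quoted in the section above.
REQUIRED_MAXSOCKBUF=18388608

# Hypothetical helper: succeeds when the given value meets the minimum.
maxsockbuf_ok() {
    [ "$1" -ge "$REQUIRED_MAXSOCKBUF" ]
}

# Only query the live sysctl on FreeBSD, where it exists.
if [ "$(uname -s)" = "FreeBSD" ]; then
    current=$(sysctl -n kern.ipc.maxsockbuf)
    if maxsockbuf_ok "$current"; then
        echo "kern.ipc.maxsockbuf=$current is large enough for knet"
    else
        echo "kern.ipc.maxsockbuf=$current is too small; set it to $REQUIRED_MAXSOCKBUF in /etc/sysctl.conf" >&2
    fi
fi
```

The comparison is kept in its own function so it can be exercised on any system, not only on FreeBSD.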


Rust Bindings
-------------

Rust bindings for libknet and libnozzle are part of this
source tree, but are included here mainly to keep all of the
kronosnet APIs in one place and to ensure that everything is kept
up-to-date and properly tested in our CI system.

The correct place to get the Rust crates for libknet and libnozzle
is still crates.io as it would be for other crates. These will be
updated when we issue a new release of knet.

https://crates.io/crates/knet-bindings
https://crates.io/crates/nozzle-bindings

Of course, if you want to try any new features in the APIs that
may not yet have been released, then you can try these sources, but
please keep in touch with us via email or IRC if you do so.

kronosnet's People

Contributors

cglosner, chrissie-c, digimer, fabbione, fabian-gruenbichler, jfriesse, jnpkrn, jonesmz, kraj, mbaldessari, miz-take, ppiao, simon3z, thomaslamprecht, wferi, yuanren10

kronosnet's Issues

No project description

Now that 1.0 is out (\o/) the lack of any substantial project description on https://kronosnet.org/ (and consequently, in the packages) is getting embarrassing. "VPNs on steroids", "Multipoint-to-Multipoint VPN daemon" and even "Please refer to the not-yet-existing documentation for further information" clearly don't cut it. It would be possible to extract keywords from the HA Summit material, but I'd very much appreciate a well-rounded and to-the-point paragraph from the principal authors to serve as a description across the Internet.

isolated corosync node after an ingress flood, with "rx: Source host x not reachable yet" loop

libknet version : v1.11
corosync version: 3.0.2

Hi,
I'm trying to reproduce a bug reported by Proxmox users with corosync 3/knet, and I'm able to reproduce a case where a corosync process gets stuck and needs to be restarted to join the cluster again.

The test setup is a small 3-node cluster.
I launch an iperf from node2 to node3, saturating the rx link of node3.
This produces a lot of retransmits and other log messages.
When I stop the flood, node3 still doesn't see the other nodes until a full corosync restart on node3.

Here is an extract from node3; the logs loop with this sequence:

Sep 17 13:58:43 kvmformation3 corosync[293354]:   [TOTEM ] entering GATHER state from 9(merge during operational state).
Sep 17 13:58:43 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:43 kvmformation3 corosync[293354]:   [TOTEM ] entering GATHER state from 8(foreign message in gather state).
Sep 17 13:58:43 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:43 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:43 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:44 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] entering GATHER state from 0(consensus timeout).
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] Creating commit token because I am the rep.
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] Saving state aru 4 high seq received 4
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [MAIN  ] Storing new sequence id for ring 1888
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] entering COMMIT state.
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] got commit token
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] entering RECOVERY state.
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] TRANS [0] member 3:
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] position [0] member 3:
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] previous ring seq 1884 rep 3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] aru 4 high delivered 4 received flag 1
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] Did not need to originate any messages in recovery.
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] got commit token
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] Sending initial ORF token
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] Resetting old ring state
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] recovery to regular 1-0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] waiting_trans_ack changed to 1
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [SYNC  ] call init for locally known services
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] entering OPERATIONAL state.
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] A new membership (3:6280) was formed. Members
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [SYNC  ] enter sync process
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [SYNC  ] Committing synchronization for corosync configuration map access
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [CMAP  ] Not first sync -> no action
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [CPG   ] downlist left_list: 0 received
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [CPG   ] got joinlist message from node 0x3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [CPG   ] my downlist: members(old:1 left:0)
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [CPG   ] joinlist_messages[0] group:pve_dcdb_v1\x00, ip:r(0) ip(10.59.100.233) , pid:2179
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [CPG   ] joinlist_messages[1] group:pve_kvstore_v1\x00, ip:r(0) ip(10.59.100.233) , pid:2179
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] Sending nodelist callback. ring_id = 3/6280
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] got nodeinfo message from cluster node 3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 3 flags: 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] total_votes=1, expected_votes=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] node 1 state=2, votes=1, expected=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] node 2 state=2, votes=1, expected=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] node 3 state=1, votes=1, expected=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] got nodeinfo message from cluster node 3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] total_votes=1, expected_votes=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] node 1 state=2, votes=1, expected=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] node 2 state=2, votes=1, expected=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] node 3 state=1, votes=1, expected=3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [QUORUM] Members[1]: 3
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [QUORUM] sending quorum notification to (nil), length = 52
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [VOTEQ ] Sending quorum callback, quorate = 0
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 17 13:58:45 kvmformation3 corosync[293354]:   [TOTEM ] waiting_trans_ack changed to 0
Sep 17 13:58:46 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:46 kvmformation3 corosync[293354]:   [TOTEM ] entering GATHER state from 9(merge during operational state).
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [TOTEM ] entering GATHER state from 8(foreign message in gather state).
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:47 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:48 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:48 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:48 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:48 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:48 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:48 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:48 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] entering GATHER state from 0(consensus timeout).
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] Creating commit token because I am the rep.
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] Saving state aru 4 high seq received 4
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [MAIN  ] Storing new sequence id for ring 188c
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] entering COMMIT state.
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] got commit token
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] entering RECOVERY state.
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] TRANS [0] member 3:
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] position [0] member 3:
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] previous ring seq 1888 rep 3
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] aru 4 high delivered 4 received flag 1
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] Did not need to originate any messages in recovery.
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] got commit token
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] Sending initial ORF token
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] Resetting old ring state
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] recovery to regular 1-0
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] waiting_trans_ack changed to 1
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [SYNC  ] call init for locally known services
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] entering OPERATIONAL state.
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [TOTEM ] A new membership (3:6284) was formed. Members
Sep 17 13:58:49 kvmformation3 corosync[293354]:   [SYNC  ] enter sync process

I have also seen the case where node2 and node3 can see each other and node1 is isolated,
and restarting corosync on node3 fixes the problem for node1
(with the same kind of logs on node3).

logs of node3

Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] entering GATHER state from 11(merge during join).
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] Creating commit token because I am the rep.
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] Saving state aru 9 high seq received 9
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [MAIN  ] Storing new sequence id for ring 1504
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] entering COMMIT state.
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] got commit token
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] entering RECOVERY state.
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] TRANS [0] member 3:
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] position [0] member 3:
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] previous ring seq 14fc rep 3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] aru 9 high delivered 9 received flag 1
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] Did not need to originate any messages in recovery.
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] got commit token
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] Sending initial ORF token
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] Resetting old ring state
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] recovery to regular 1-0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] waiting_trans_ack changed to 1
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [SYNC  ] call init for locally known services
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] entering OPERATIONAL state.
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] A new membership (3:5380) was formed. Members
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [SYNC  ] enter sync process
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [SYNC  ] Committing synchronization for corosync configuration map access
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [CMAP  ] Not first sync -> no action
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [CPG   ] downlist left_list: 0 received
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [CPG   ] got joinlist message from node 0x3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [CPG   ] my downlist: members(old:1 left:0)
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [CPG   ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.59.100.233) , pid:2179
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] Sending nodelist callback. ring_id = 3/5380
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] got nodeinfo message from cluster node 3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 3 flags: 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] total_votes=1, expected_votes=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] node 1 state=2, votes=1, expected=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] node 2 state=2, votes=1, expected=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] node 3 state=1, votes=1, expected=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] got nodeinfo message from cluster node 3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] total_votes=1, expected_votes=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] node 1 state=2, votes=1, expected=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] node 2 state=2, votes=1, expected=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] node 3 state=1, votes=1, expected=3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [QUORUM] Members[1]: 3
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [QUORUM] sending quorum notification to (nil), length = 52
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [VOTEQ ] Sending quorum callback, quorate = 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] waiting_trans_ack changed to 0
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] entering GATHER state from 7(foreign message in operational state).
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [TOTEM ] entering GATHER state from 8(foreign message in gather state).
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:10 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet
Sep 17 13:42:11 kvmformation3 corosync[221308]:   [KNET  ] rx: Source host 1 not reachable yet

[Question] About corosync with sctp protocol on RHEL8.1.

Hi All,

We tried to check the behavior of corosync with sctp protocol on RHEL8.1.

We loaded the sctp module with modprobe on RHEL8.1 and tried to build a two-node cluster, but communication between the two nodes could not be established.

Looking at the next issue, it seems to work if you install kronosnet version 1.14, but will it work if you install 1.14 on RHEL8.1?

Or is it still not working due to a problem with the RHEL8.1 kernel or sctp module?

[root@rh81-test01 ~]# uname -a
Linux rh81-test01 4.18.0-147.5.1.el8_1.x86_64 #1 SMP Tue Jan 14 15:50:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@rh81-test01 ~]# modinfo sctp
filename:       /lib/modules/4.18.0-147.5.1.el8_1.x86_64/kernel/net/sctp/sctp.ko.xz
license:        GPL
description:    Support for the SCTP protocol (RFC2960)
author:         Linux Kernel SCTP developers <[email protected]>
alias:          net-pf-10-proto-132
alias:          net-pf-2-proto-132
rhelversion:    8.1
srcversion:     63097D7A766DE886A7E64F6
depends:        libcrc32c
intree:         Y
name:           sctp
vermagic:       4.18.0-147.5.1.el8_1.x86_64 SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         Red Hat Enterprise Linux kernel signing key
(snip)

Best Regards,
Hideo Yamauchi.

knet v1.13 (df4bdef) can't find libcrypto on FreeBSD

The current state of kronosnet v1.3 (f9c0a59) under FreeBSD is that the respective port in our ports tree cannot be built against the base system's OpenSSL on 12.0-RELEASE, 12.1-PRERELEASE or 13.0-CURRENT. We can build it on 11.3-RELEASE.

11.3-RELEASE has OpenSSL 1.0.2s;
12.0-RELEASE and 12.1-PRERELEASE have OpenSSL 1.1.1a;
13.0-CURRENT has OpenSSL 1.1.1d.

In order to make kronosnet build and work, a few conditional patches must be applied. There's a recent PR on FreeBSD's Bugzilla that updates the old version to the current v1.13 (df4bdef) and revisits these patches. It also patches it to build against LibreSSL.

All patches, old and new alike, change the KNET_OPTION_DEFINES and PKG_CHECK_MODULES invocations in configure.ac, which is where the build should find libcrypto, determine its version, and set either BUILDCRYPTOOPENSSL10 or BUILDCRYPTOOPENSSL11 accordingly.
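
To illustrate the version-to-define mapping described above, here is a minimal sketch. It is not the project's configure.ac logic; the function name openssl_define_for is hypothetical, and the only grounded inputs are the two define names and the OpenSSL versions quoted in this report.

```shell
#!/bin/sh
# Illustrative only: map an OpenSSL version string to the define the
# build is expected to select. NOT the actual configure.ac code.
openssl_define_for() {
    case "$1" in
        1.0.*) echo "BUILDCRYPTOOPENSSL10" ;;
        1.1.*) echo "BUILDCRYPTOOPENSSL11" ;;
        *)     echo "unsupported" ; return 1 ;;
    esac
}

# OpenSSL versions shipped in the FreeBSD releases listed in this report.
for v in 1.0.2s 1.1.1a 1.1.1d; do
    echo "$v -> $(openssl_define_for "$v")"
done
```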

During a quick chat with @fabbione on IRC we checked that the current CI builds are working fine without extra patches, but the machines building kronosnet's source have an extra package installed for OpenSSL; they are FreeBSD 12.0-RELEASE and a 13.0-CURRENT. We can verify the build logs here for 12.0, and here for 13.0.

FreeBSD base system already has OpenSSL and its resources are under /lib, /usr/bin, /usr/lib, and /usr/include (with no extra pkg-config .pc file). We actually do not need to install any extra package or port to have libcrypto. If we do it, the prefix of all resources will change to /usr/local.

To specify that we want to use the OpenSSL from our base system we override the following:

openssl_CFLAGS='-I/usr/include' openssl_LIBS='-lcrypto'

Reproducing the failure on finding/identifying libcrypto's version can be done using Poudriere or in a clean FreeBSD Jail. A list of all packages to be installed as dependencies to build kronosnet is available here.

The patched builds can be checked here for FreeBSD 11.3-RELEASE, and here for 12.0-RELEASE; these are the current official production releases. Sadly, only conditional patches have been made so far, and I could not send a request upstream. I would be happy to help troubleshoot this further should you need any extra input.

Should you want to reproduce the build failures:

  1. get and install FreeBSD 11.3-RELEASE (or any newer version of it);
  2. use pkg to install the list of packages available here;
  3. fetch kronosnet v1.13;
  4. change into its source directory, after decompressing it;
  5. run /bin/sh autogen.sh (or, if you prefer, /usr/local/bin/autoreconf -f -i);
  6. execute /bin/sh configure.

Here you should be presented with an error saying that OpenSSL's libcrypto can't be found! But, as we want to be able to build on FreeBSD, after running autogen.sh you can proceed with:

env openssl_CFLAGS='-I/usr/include' openssl_LIBS='-lcrypto' /bin/sh configure
make

RFE: is it possible to start making github releases?🤔

Creating a GitHub release entry sends an email notification to everyone who has set Watch->Releases for the repository in the web UI.
A gh release can contain additional comments (like a changelog) or additional assets such as release tarballs (by default it contains only the assets from the git tag); none of these parts are mandatory.
In the simplest variant the gh release can be empty, because the subject of the notification email already contains the git tag name.

I'm asking because my automation process uses those email notifications by trying to make preliminary automated upgrades of building packages, which allows saving some time on maintaining packaging procedures.
Other people would probably like to be informed immediately about new releases as well.

Documentation and examples for generating gh releases:
https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository
https://cli.github.com/manual/gh_release_upload/
https://github.com/marketplace/actions/github-release
https://pgjones.dev/blog/trusted-plublishing-2023/
jbms/sphinx-immaterial#281 (comment)
tox target to publish on pypi and make gh release https://github.com/jaraco/skeleton/blob/928e9a86d61d3a660948bcba7689f90216cc8243/tox.ini#L42-L58

api/abi future breakage - best practise

Hey Feri,

I am about to start breaking API/ABI in master. I can't remember the best practices on "when" to bump the library soname from 1.0 to 2.0. Was that at the end of the development or early in the process? What is your recommendation here? Clearly nobody should be shipping master so from a packaging perspective it shouldn't be a big issue but I am thinking about upstream in general.

thanks for your input
Fabio

Project style guide

A guide for contributors that indicates such things as:

Spaces vs Tabs
Braces always or not required for one-line if statements
Minimum supported C dialect (e.g. C89, C99, C11, so on)
Variables declared at top of function, or closest to usage
Indentation levels inside of if-statements, loops, switch case, etc.
Starting curly-brace on same line or next line for function signatures, if statements, loops, so on.
Variable and function naming conventions, e.g. snake_case vs camelCase

possible regression when faced with storm of cpg_mcast_joined messages

Background information:

We develop and heavily rely on a distributed config file system implemented on top of corosync/libcpg/.. 2.x and are currently evaluating Corosync 3.x with knet. Our file system has an internal state machine that is responsible for keeping all nodes' view of the contents in sync.

On initial startup, the following sequences happen:

  1. cpg_confchg callback gets called
  2. node sends a sync request to all other members of cpg (cpg_mcast_joined)
  3. once all nodes have replied with their state, next phase begins

and

  1. cpg_deliver callback gets called
  2. if it is a sync request, send own state to all other members of cpg (cpg_mcast_joined)

(leaving out parts like flushing local sync state if a new request arrives, and how to decide which sync request is the latest/current one).

On a cluster cold start (e.g., after a power outage and subsequent simultaneous start of all nodes, or if corosync gets started on all nodes at the same time after some service outage) this leads to a lot of messages being sent via cpg_mcast_joined to various partitions of the full cluster, since the confchg callback gets called once for each member of the newly establishing CPG.

We retry cpg_mcast_joined up to 100 times in case it returns CS_ERR_TRY_AGAIN, with a short delay between attempts, before finally giving up.
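For readers who want to picture the retry policy described above, here is a minimal sketch. It does not use libcpg directly: the send call is passed in as a hypothetical callback (sketch_send_fn), the pause between attempts is another hypothetical callback, and the two result codes are local stand-ins for CS_OK and CS_ERR_TRY_AGAIN (the latter quoted as 6 in this report). Our real code differs in the details.

```c
#include <stddef.h>

/* Local stand-ins for the corosync result codes; assumptions of this sketch. */
enum { SKETCH_CS_OK = 1, SKETCH_CS_ERR_TRY_AGAIN = 6 };

/* Hypothetical callback wrapping the real send call (e.g. cpg_mcast_joined). */
typedef int (*sketch_send_fn)(void *ctx);

/*
 * Call send_cb until it stops returning TRY_AGAIN or max_attempts is reached.
 * pause_cb (may be NULL) is invoked between attempts to provide the delay.
 * Returns the last result; *retries (may be NULL) receives the number of
 * attempts that ended in TRY_AGAIN.
 */
int sketch_send_with_retry(sketch_send_fn send_cb, void *ctx,
			   unsigned int max_attempts,
			   void (*pause_cb)(void), unsigned int *retries)
{
	int res = SKETCH_CS_ERR_TRY_AGAIN;
	unsigned int attempt;
	unsigned int retried = 0;

	for (attempt = 0; attempt < max_attempts; attempt++) {
		res = send_cb(ctx);
		if (res != SKETCH_CS_ERR_TRY_AGAIN) {
			break;
		}
		retried++;
		/* only pause when another attempt will follow */
		if (pause_cb && attempt + 1 < max_attempts) {
			pause_cb();
		}
	}
	if (retries) {
		*retries = retried;
	}
	return res;
}

/* Test double: returns TRY_AGAIN a fixed number of times, then succeeds. */
int sketch_flaky_send(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return SKETCH_CS_ERR_TRY_AGAIN;
	}
	return SKETCH_CS_OK;
}
```

With max_attempts set to 100, a sender that keeps answering TRY_AGAIN makes the helper give up after 100 retries, which matches the "send retries" counters reported further down.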

With udp(u) on either Corosync 2.x or Corosync 3.x, this seems to work fine. With knet, cpg_mcast_joined starts to return 6/CS_ERR_TRY_AGAIN much sooner, links start flapping, corosync takes up a whole CPU core, and this initial phase does not complete successfully.

I reduced this initial sequence to some sample code and included run output and corosync logs from our test setup, which consists of:
6 physical nodes running Proxmox VE (E5-2620 v3/v4), interconnected with 100Gbit using Mellanox Connect X-4
each running 5 VMs with two vcores and 3GB of RAM, Debian Stretch with backported Corosync/libqb/knet packages (3.0.1-2 / 1.0.3-2 / 1.8-1)

Note that this exact behaviour of link flapping etc. only occurs with big clusters, but also with smaller clusters it is visible that knet causes retries much sooner than udp(u) does, e.g. a test run with 5x3 VMs and ./cpgtest 4 4095 goes through without any retry attempts needed with udpu, but ends up with the following on knet:

callbacks:
        confchg: 15
        totem_confchg: 2
        deliver: 875
initial messages:
        sent: 15
        send retries: 5
        received: 192
response messages:
        sent: 682
        send retries: 101
        received: 683

Sample code and big cluster results:
https://gist.github.com/Fabian-Gruenbichler/64c81f258774406a24e42cd67ba2966d

Function pointers stored in knet_transport_ops need better return types

For reference, this is the declaration / definition of the struct knet_transport_ops

typedef struct knet_transport_ops {
/*
 * transport generic information
 */
	const char *transport_name;
	const uint8_t transport_id;
	uint32_t transport_mtu_overhead;
/*
 * transport init must allocate the new transport
 * and perform all internal initializations
 * (threads, lists, etc).
 */
	int (*transport_init)(knet_handle_t knet_h);
/*
 * transport free must releases _all_ resources
 * allocated by tranport_init
 */
	int (*transport_free)(knet_handle_t knet_h);

/*
 * link operations should take care of all the
 * sockets and epoll management for a given link/transport set
 * transport_link_disable should return err = -1 and errno = EBUSY
 * if listener is still in use, and any other errno in case
 * the link cannot be disabled.
 *
 * set_config/clear_config are invoked in global write lock context
 */
	int (*transport_link_set_config)(knet_handle_t knet_h, struct knet_link *link);
	int (*transport_link_clear_config)(knet_handle_t knet_h, struct knet_link *link);

/*
 * transport callback for incoming dynamic connections
 * this is called in global read lock context
 */
	int (*transport_link_dyn_connect)(knet_handle_t knet_h, int sockfd, struct knet_link *link);

/*
 * per transport error handling of recvmmsg
 * (see _handle_recv_from_links comments for details)
 */

/*
 * transport_rx_sock_error is invoked when recvmmsg returns <= 0
 *
 * transport_rx_sock_error is invoked with both global_rdlock
 */

	int (*transport_rx_sock_error)(knet_handle_t knet_h, int sockfd, int recv_err, int recv_errno);

/*
 * transport_tx_sock_error is invoked with global_rwlock and
 * it's invoked when sendto or sendmmsg returns =< 0
 *
 * it should return:
 * -1 on internal error
 *  0 ignore error and continue
 *  1 retry
 *    any sleep or wait action should happen inside the transport code
 */
	int (*transport_tx_sock_error)(knet_handle_t knet_h, int sockfd, int recv_err, int recv_errno);

/*
 * this function is called on _every_ received packet
 * to verify if the packet is data or internal protocol error handling
 *
 * it should return:
 * -1 on error
 *  0 packet is not data and we should continue the packet process loop
 *  1 packet is not data and we should STOP the packet process loop
 *  2 packet is data and should be parsed as such
 *
 * transport_rx_is_data is invoked with both global_rwlock
 * and fd_tracker read lock (from RX thread)
 */
	int (*transport_rx_is_data)(knet_handle_t knet_h, int sockfd, struct knet_mmsghdr *msg);
} knet_transport_ops_t;

I'll go in order of each item:

  • int (*transport_init)(knet_handle_t knet_h);
  • int (*transport_free)(knet_handle_t knet_h);

What does this int represent? Success / failure?
Using a C99 bool, or an enum with success / failure values, such as

enum knet_bool
{
	SUCCESS = 0,
	FAILURE = 1
};

Would provide a significant amount of clarity here.

  • int (*transport_link_set_config)(knet_handle_t knet_h, struct knet_link *link);
  • int (*transport_link_clear_config)(knet_handle_t knet_h, struct knet_link *link);

These are documented with:

 * link operations should take care of all the
 * sockets and epoll management for a given link/transport set
 * transport_link_disable should return err = -1 and errno = EBUSY
 * if listener is still in use, and any other errno in case
 * the link cannot be disabled.

What is "err" in this context? It's not a function parameter? Some global variable?

What other return values are valid?

  • int (*transport_tx_sock_error)(knet_handle_t knet_h, int sockfd, int recv_err, int recv_errno);
  • int (*transport_rx_is_data)(knet_handle_t knet_h, int sockfd, struct knet_mmsghdr *msg);

These should return an enum, not an int. Especially since the possible values are already enumerated in the comments.
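To make the suggestion concrete, here is a rough sketch of what such enums could look like. The type and constant names are hypothetical (nothing like them exists in knet as far as this issue knows); only the numeric values and their meanings are taken from the comments in the struct quoted above.

```c
/* Hypothetical names; values mirror the transport_rx_is_data comment. */
typedef enum sketch_rx_verdict {
	SKETCH_RX_ERROR         = -1, /* error */
	SKETCH_RX_NOT_DATA_CONT =  0, /* not data, continue the packet process loop */
	SKETCH_RX_NOT_DATA_STOP =  1, /* not data, STOP the packet process loop */
	SKETCH_RX_IS_DATA       =  2  /* data, parse as such */
} sketch_rx_verdict_t;

/* Hypothetical names; values mirror the transport_tx_sock_error comment. */
typedef enum sketch_tx_verdict {
	SKETCH_TX_INTERNAL_ERROR = -1, /* internal error */
	SKETCH_TX_IGNORE         =  0, /* ignore error and continue */
	SKETCH_TX_RETRY          =  1  /* retry */
} sketch_tx_verdict_t;

/*
 * The struct member could then be declared along these lines:
 *   sketch_rx_verdict_t (*transport_rx_is_data)(knet_handle_t knet_h,
 *                                               int sockfd,
 *                                               struct knet_mmsghdr *msg);
 */

/* Small helper showing that the values are now self-describing. */
const char *sketch_rx_verdict_name(sketch_rx_verdict_t v)
{
	switch (v) {
	case SKETCH_RX_ERROR:         return "error";
	case SKETCH_RX_NOT_DATA_CONT: return "not data, continue loop";
	case SKETCH_RX_NOT_DATA_STOP: return "not data, stop loop";
	case SKETCH_RX_IS_DATA:       return "data";
	}
	return "unknown";
}
```

Because the numeric values stay the same, existing transports returning plain ints would keep their behaviour while new code gains readable names.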

build error!

[root@compute1 kronosnet-main]# ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal -I m4
build-aux/git-version-gen: WARNING: .gitarchivever doesn't contain valid version tag
build-aux/git-version-gen: ERROR: Can't find valid version. Please use valid git repository, released tarball or version tagged archive
configure.ac:22: error: AC_INIT should be called with package and version arguments
/usr/share/aclocal-1.16/init.m4:29: AM_INIT_AUTOMAKE is expanded from...
configure.ac:22: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
autoreconf: aclocal failed with exit status: 1
[root@compute1 kronosnet-main]#

[Question] On the display of corosync-cfgtool.

Hi All,

Configure two nodes with the following settings.

totem {
    version: 2
    cluster_name: testcluster
    transport: knet
}

nodelist {
    node {
        ring0_addr: 192.168.106.185
        ring1_addr: 192.168.107.185
        nodeid: 1
        name: rh74-01
    }
    node {
        ring0_addr: 192.168.106.186
        ring1_addr: 192.168.107.186
        nodeid: 2
        name: rh74-02
    }
}
(snip)

When I run corosync-cfgtool on these two nodes, it is displayed as follows.

[root@rh74-kro1 ~]# corosync-cfgtool -s
Printing link status.
Local node ID 1
LINK ID 0
        addr    = 192.168.106.185
        status:
                node 0: link enabled:1  link connected:1
                node 1: link enabled:1  link connected:1
LINK ID 1
        addr    = 192.168.107.185
        status:
                node 0: link enabled:0  link connected:1
                node 1: link enabled:1  link connected:1

[root@rh74-kro2 ~]# corosync-cfgtool -s                                                                                                                                                                     
Printing link status.
Local node ID 2
LINK ID 0
        addr    = 192.168.106.186
        status:
                node 0: link enabled:1  link connected:1
                node 1: link enabled:1  link connected:1
LINK ID 1
        addr    = 192.168.107.186
        status:
                node 0: link enabled:1  link connected:1
                node 1: link enabled:0  link connected:1

I understand that the display of "connected" shows Heartbeat (ping) communication.
As for "enabled", from reading the source code it appears to be 1 for the local node only if loopback can be set.

Is the following basically the specification for the "enabled" display?

  1. For other nodes: 1
  2. For the local node: 1 only when loopback can be set

I am also concerned that, with this specification, the following error is logged.

Jan 11 13:11:08 rh74-kro2 corosync[8992]: [TOTEM ] knet_link_set_config failed: Invalid argument (22)

Since this message always appears for the second interface, wouldn't it be better to change this?

Best Regards,
Hideo Yamauchi.

corosync segfault

Not sure it's related to my other problem, but I'm also able to reproduce a segfault sometimes.
(Proxmox users have also reported them: https://bugzilla.proxmox.com/show_bug.cgi?id=2326)
I was able to reproduce it tonight, also using iperf from node2 to node3;
after some time corosync on node3 segfaults.

fdata.gz

core.corosync.0.c55a1abe64634566b5a111658686da55.66057.1568764963000000.lz4.gz (https://github.com/kronosnet/kronosnet/files/3624596/core.corosync.0.c55a1abe64634566b5a111658686da55.66057.1568764963000000.lz4.gz)

gdb-bt.txt

doxyxml breaks cross-compilation

The current in-house manpage generation breaks cross-compilation. The cross branch presents a not particularly elegant, but working solution to that. Another way to handle the problem was mentioned in #127: provide a switch to disable documentation generation (maybe even separate targets for building the documentation only). This would re-enable cross compilation and facilitate separate packaging of architecture-dependent and architecture-independent components as well. Opinions?

Error waiting for packet: Success

To quote the classics, such messages don't belong anywhere, except maybe in Windows 95. Still, one finds it in a reproducible build report of Kronosnet 1.15 (on i386):

[knet]: [debug] pmtud: Starting PMTUD for host: 1 link: 0
[knet]: [debug] pmtud: Unable to send pmtu packet (sendto): 12 Cannot allocate memory
[knet]: [info] link: host: 1 link: 0 is down
[knet]: [info] host: host: 1 (passive) best link: 0 (pri: 0)
[knet]: [WARNING] host: host: 1 has no active links
Error waiting for packet: Success

I guess it's just an unexpected corner case in the error handling logic, but I can't investigate it right now.
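One common way such a message comes about in C programs, offered here only as a guess and not verified against the knet test code, is a failure path that returns -1 without setting errno while the caller prints strerror(errno) anyway. On glibc, strerror(0) is "Success". The function names below are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Formats "<prefix>: <strerror(err)>" into buf, much like perror() prints. */
int sketch_format_error(char *buf, size_t len, const char *prefix, int err)
{
	return snprintf(buf, len, "%s: %s", prefix, strerror(err));
}

/* Hypothetical wait routine: reports a timeout as -1 WITHOUT setting errno. */
int sketch_wait_for_packet(int timed_out)
{
	errno = 0;
	if (timed_out) {
		return -1; /* errno stays 0, so the caller prints "Success" on glibc */
	}
	return 0;
}

/* Variant that tells the caller why it failed. */
int sketch_wait_for_packet_fixed(int timed_out)
{
	errno = 0;
	if (timed_out) {
		errno = ETIMEDOUT;
		return -1;
	}
	return 0;
}
```

If this guess is right, setting errno (or printing a dedicated timeout message) on that path would be enough to get rid of the confusing output.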

still other compile error on i686

knet_bench.c:649:107: error: format ‘%zu’ expects argument of type ‘size_t’, but argument 6 has type ‘uint64_t {aka long long unsigned int}’ [-Werror=format=]
printf("Execution time: %8.4f secs Average speed: %8.4f MB/sec %8.4f pckts/sec (size: %u total: %zu)\n", time_diff_sec, average_rx_mbytes, average_rx_pkts, current_pckt_size, rx_pkts);
^
knet_bench.c: In function ‘send_perf_data_by_size’:
knet_bench.c:799:67: error: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘uint64_t {aka long long unsigned int}’ [-Werror=format=]
printf("Testing with %u packet size. Total bytes to transfer: %zu (%zu packets)\n", packetsize, perf_by_size_size, total_pkts_to_tx);
^
knet_bench.c:799:72: error: format ‘%zu’ expects argument of type ‘size_t’, but argument 4 has type ‘uint64_t {aka long long unsigned int}’ [-Werror=format=]
printf("Testing with %u packet size. Total bytes to transfer: %zu (%zu packets)\n", packetsize, perf_by_size_size, total_pkts_to_tx);
^
knet_bench.c: In function ‘send_perf_data_by_time’:
knet_bench.c:866:51: error: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘uint64_t {aka long long unsigned int}’ [-Werror=format=]
printf("Testing with %u bytes packet size for %zu seconds.\n", packetsize, perf_by_time_secs);
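For what it's worth, the usual portable way to print a uint64_t is the PRIu64 macro from <inttypes.h>, since %zu is only correct for size_t (which is 32 bits wide on i686). A minimal sketch with a hypothetical helper name, not the actual knet_bench.c change:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Formats a packet size and a 64-bit total the same way on 32-bit and
 * 64-bit targets. PRIu64 expands to the right conversion for uint64_t.
 */
int sketch_format_total(char *buf, size_t len,
			unsigned int pkt_size, uint64_t total)
{
	return snprintf(buf, len, "(size: %u total: %" PRIu64 ")",
			pkt_size, total);
}
```

The same substitution (%zu replaced by %" PRIu64 ") would apply to each of the printf calls flagged above, assuming the variables stay uint64_t.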

plugin loading path order

Hey Feri,

I have just been hit by an ordering issue when doing testing.

Basically the system had an old set of plugins installed in /usr/lib.... and it turns out that while I was executing some make check with new plugins, the checks were loading the system installed knet plugins instead of the newly built one.

this needs to be fixed or we need to have knet Buildconflict knet in the event we need to change internal API/ABI :-)

Doubts about log_msg

I've got a couple of problems with this code:

kronosnet/libknet/logging.c

Lines 219 to 224 in 72751d7

* if we get an EINVAL and locking is initialized, then
* we are getting a real error and we need to stop
*/
err = pthread_rwlock_tryrdlock(&knet_h->global_rwlock);
if ((err == EAGAIN) && (knet_h->lock_init_done))
return;

  1. The comment mentions EINVAL, but the code handles EAGAIN, and the purpose isn't obvious.

  2. The documentation of pthread_rwlock_rdlock() states:

    Results are undefined if any of these functions are called with an uninitialized read-write lock.

    Some variants mention EINVAL as a possible error code in such cases, though.

  3. The check for lock_init_done suggests that this function is meant to be called before lock initialization. And it really can be in knet_handle_new_ex().

Are these issues non-issues for some reason or another?

1.9: build fails

Making all in kronosnetd
make[2]: Entering directory '/home/tkloczko/rpmbuild/BUILD/kronosnet-1.9/kronosnetd'
gcc -DHAVE_CONFIG_H -I. -I..  -I../libnozzle -I../libknet  -O3 -ggdb3 -Werror -Wall -Wextra -Wno-unused-parameter  -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -flto -c -o kronosnetd-cfg.o `test -f 'cfg.c' || echo './'`cfg.c
In file included from cfg.c:16:
cfg.h:22:2: error: unknown type name ‘tap_t’
   22 |  tap_t tap;
      |  ^~~~~
cfg.c: In function ‘knet_get_iface’:
cfg.c:25:15: error: implicit declaration of function ‘tap_get_name’ [-Werror=implicit-function-declaration]
   25 |   if (!strcmp(tap_get_name(knet_iface->cfg_eth.tap), name)) {
      |               ^~~~~~~~~~~~
cfg.c:25:15: error: passing argument 1 of ‘strcmp’ makes pointer from integer without a cast [-Werror=int-conversion]
   25 |   if (!strcmp(tap_get_name(knet_iface->cfg_eth.tap), name)) {
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |               |
      |               int
In file included from cfg.c:14:
/usr/include/string.h:136:32: note: expected ‘const char *’ but argument is of type ‘int’
  136 | extern int strcmp (const char *__s1, const char *__s2)
      |                    ~~~~~~~~~~~~^~~~
cc1: all warnings being treated as errors
make[2]: *** [Makefile:595: kronosnetd-cfg.o] Error 1

pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery

Hi,

Hopefully I'm in the right place.

We have moved from corosync version 2 to version 3, which uses kronosnet. On version 2 we had no issues; with version 3 we can't get 2 nodes to talk to each other.

We use infiniband in connected mode, and the mtu is then 65520.

ibp4s0d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520

When trying to start corosync, both nodes start spamming the syslog after only a second with:

Jul 29 14:48:17 kvm1 corosync[4039025]: [KNET ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
Jul 29 14:48:17 kvm1 corosync[4039025]: [KNET ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
Jul 29 14:48:17 kvm1 corosync[4039025]: [KNET ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
Jul 29 14:48:17 kvm1 corosync[4039025]: [KNET ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
Jul 29 14:48:17 kvm1 corosync[4039025]: [KNET ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
Jul 29 14:48:18 kvm1 corosync[4039025]: [KNET ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
Jul 29 14:48:18 kvm1 corosync[4039025]: [KNET ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.

any help would be appreciated.

best regards
Kevin M

connection issue with sctp

I'm testing a 3 node cluster with corosync knet_transport sctp and for some reason this always ends up in a very unstable state. When switching back to the default udp the cluster immediately gets back into a stable state.

Current setup is a virtualised 3 node cluster running Proxmox as an OS (libknet 1.13), the cluster network is directly connected to a linux bridge with no firewall in between.

Starting one host after the other, with sctp enabled, works most of the time, but if I restart one of the nodes it can't join the cluster anymore. There isn't a consistent error message in the log, I got different ones while testing without changing anything in my cluster.
Some examples:

corosync[19558]:   [KNET  ] heartbeat: Unable to send ping (sock: 32) packet (sendto): 32 Broken pipe. recorded src ip: 10.10.90.1 src port: 5405 dst ip: 10.10.90.3 dst port: 5405
corosync[2032]:   [KNET  ] sctp: SCTP connect on 31 to 10.10.90.2 port 5405 failed: Protocol not available
corosync[2032]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 453 to 65397

Again, if the cluster is freshly started one after the other everything seems fine, problems start when for some reason you need to reboot a host or change some configuration in corosync which can't be applied live.

logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: testnode2tm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.90.2
  }
  node {
    name: testnode3tm
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.90.3
  }
  node {
    name: testnode4tm
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.90.4
  }
  node {
    name: testnodetm1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.90.1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 18
  interface {
    linknumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Just for reference, the 4th (10.10.90.4) node was never started while testing, to see what knet sctp logs when a node isn't reachable at all.

website: no support for TLSv1.2 (and other problems)

Trying to connect to the project page from Debian buster (current stable):

$ curl https://kronosnet.org
curl: (35) error:1425F102:SSL routines:ssl_choose_client_version:unsupported protocol

I think this is because the OpenSSL system default enforces TLS 1.2 as a minimum. The SSL Labs checker gives the website an F grade for an unrelated reason, but protocol support is a serious issue there as well. Please consider correcting these.

Header guard macros may conflict with other projects

For example

kronosnet/libknet/transports.h has the headerguard TRANSPORTS_H

This is really quite generic and there's a high likelihood of TRANSPORTS_H having been used in a header of another library that may someday be used in conjunction with kronosnet.

I recommend prefixing or postfixing headerguards with a UUID, such as

TRANSPORTS_H_46C5B652_73C7_489C_C35A_8E4A1BC02509

That way, the probability of a conflict is reduced to essentially zero.
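A minimal sketch of the effect, with both "headers" pasted into one translation unit for brevity. The struct names are hypothetical placeholders for real header contents.

```c
/* transports.h from kronosnet, using the UUID-suffixed guard suggested above */
#ifndef TRANSPORTS_H_46C5B652_73C7_489C_C35A_8E4A1BC02509
#define TRANSPORTS_H_46C5B652_73C7_489C_C35A_8E4A1BC02509

/* hypothetical stand-in for the header's real declarations */
struct sketch_knet_transport {
	int id;
};

#endif /* TRANSPORTS_H_46C5B652_73C7_489C_C35A_8E4A1BC02509 */

/* transports.h from some other library, using the generic guard name */
#ifndef TRANSPORTS_H
#define TRANSPORTS_H

/* hypothetical stand-in for the other library's declarations */
struct sketch_other_transport {
	int id;
};

#endif /* TRANSPORTS_H */
```

Had both headers used TRANSPORTS_H, whichever was included second would have been skipped silently and its declarations would be missing.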

Multiple Coverity scan defects

During a code audit of kronosnet (https://bugs.launchpad.net/ubuntu/+source/kronosnet/+bug/1811139), the Ubuntu security team ran a Coverity scan analysis of kronosnet as packaged for the current Ubuntu development release (https://launchpad.net/ubuntu/+source/kronosnet/1.8-2) - this identified a number of possible buffer overruns, and similar issues which could be encountered under various error conditions or in normal operation - the full output can be seen in the attached file. It is suggested that these defects be analyzed and addressed and that ongoing analysis be done by the kronosnet project using the free scan.coverity.com service.
coverity.txt

Tests occasionally time out waiting for pongs

Running make -j16 check in Kronosnet 1.14's libknet/tests in a loop occasionally fails. I struggle to reproduce it, but I happened to see it on ARM64 without root privileges: api_knet_send_sync_test (twice) or api_knet_send_compress_test (once) timed out; the logs show that "pongs" arrived sporadically or not at all in these cases (full logs attached):

Test knet_send_sync with dst_host_filter returning too many host_ids_entries
[knet]: [ERROR] transport: Failed to set socket buffer via option 8 to value 8388608: capped at 425984
[knet]: [ERROR] transport: Continuing regardless, as the handle is not privileged. Expect poor performance!
[knet]: [ERROR] transport: Failed to set socket buffer via option 7 to value 8388608: capped at 425984
[knet]: [ERROR] transport: Continuing regardless, as the handle is not privileged. Expect poor performance!
[knet]: [debug] transport: FREEBIND enabled on socket: 20
[knet]: [debug] transport: PMTUDISC enabled on socket: 20
[knet]: [debug] udp: IP_RECVERR enabled on socket: 20
[knet]: [debug] link: Configuring default access lists for host: 1 link: 0 socket: 20
[knet]: [debug] link: host: 1 link: 0 is configured
[knet]: [info] host: host: 1 (passive) best link: 0 (pri: 0)
[knet]: [WARNING] host: host: 1 has no active links
waiting host 1 to be reachable for 10 more seconds
waiting host 1 to be reachable for 9 more seconds
waiting host 1 to be reachable for 8 more seconds
waiting host 1 to be reachable for 7 more seconds
waiting host 1 to be reachable for 6 more seconds
waiting host 1 to be reachable for 5 more seconds
waiting host 1 to be reachable for 4 more seconds
waiting host 1 to be reachable for 3 more seconds
waiting host 1 to be reachable for 2 more seconds
waiting host 1 to be reachable for 1 more seconds
timeout waiting for host to be reachable[knet]: [debug] link: host: 1 link: 0 is disabled
[knet]: [debug] link: host: 1 link: 0 config has been wiped
[knet]: [debug] dstcache: Unable to find host: 1
Test knet_send with none and valid data
[...]
[knet]: [debug] link: Configuring default access lists for host: 1 link: 0 socket: 20
[knet]: [debug] link: host: 1 link: 0 is configured
[knet]: [debug] handle: Data forwarding is enabled
[knet]: [info] host: host: 1 (passive) best link: 0 (pri: 0)
[knet]: [WARNING] host: host: 1 has no active links
waiting host 1 to be reachable for 10 more seconds
waiting host 1 to be reachable for 9 more seconds
waiting host 1 to be reachable for 8 more seconds
waiting host 1 to be reachable for 7 more seconds
waiting host 1 to be reachable for 6 more seconds
waiting host 1 to be reachable for 5 more seconds
[knet]: [debug] rx: host: 1 link: 0 received pong: 1
waiting host 1 to be reachable for 4 more seconds
[knet]: [debug] rx: host: 1 link: 0 received pong: 2
waiting host 1 to be reachable for 3 more seconds
[knet]: [debug] rx: host: 1 link: 0 received pong: 3
waiting host 1 to be reachable for 2 more seconds
[knet]: [debug] rx: host: 1 link: 0 received pong: 4
waiting host 1 to be reachable for 1 more seconds
timeout waiting for host to be reachable[knet]: [debug] rx: host: 1 link: 0 received pong: 5
[knet]: [debug] link: host: 1 link: 0 is disabled
[knet]: [debug] link: host: 1 link: 0 config has been wiped
[knet]: [debug] dstcache: Unable to find host: 1
Test knet_send_sync with dst_host_filter returning too many host_ids_entries
[knet]: [ERROR] transport: Failed to set socket buffer via option 8 to value 8388608: capped at 425984
[knet]: [ERROR] transport: Continuing regardless, as the handle is not privileged. Expect poor performance!
[knet]: [ERROR] transport: Failed to set socket buffer via option 7 to value 8388608: capped at 425984
[knet]: [ERROR] transport: Continuing regardless, as the handle is not privileged. Expect poor performance!
[knet]: [debug] transport: FREEBIND enabled on socket: 20
[knet]: [debug] transport: PMTUDISC enabled on socket: 20
[knet]: [debug] udp: IP_RECVERR enabled on socket: 20
[knet]: [debug] link: Configuring default access lists for host: 1 link: 0 socket: 20
[knet]: [debug] link: host: 1 link: 0 is configured
waiting host 1 to be reachable for 10 more seconds
[knet]: [info] host: host: 1 (passive) best link: 0 (pri: 0)
[knet]: [WARNING] host: host: 1 has no active links
[knet]: [debug] rx: host: 1 link: 0 received pong: 1
waiting host 1 to be reachable for 9 more seconds
waiting host 1 to be reachable for 8 more seconds
waiting host 1 to be reachable for 7 more seconds
waiting host 1 to be reachable for 6 more seconds
waiting host 1 to be reachable for 5 more seconds
[knet]: [debug] rx: host: 1 link: 0 received pong: 2
waiting host 1 to be reachable for 4 more seconds
waiting host 1 to be reachable for 3 more seconds
waiting host 1 to be reachable for 2 more seconds
waiting host 1 to be reachable for 1 more seconds
timeout waiting for host to be reachable[knet]: [debug] link: host: 1 link: 0 is disabled
[knet]: [debug] link: host: 1 link: 0 config has been wiped
[knet]: [debug] dstcache: Unable to find host: 1

I've got a log from the official build daemon as well, which happens to have an extra [info] and a [WARNING] at its end:

Test knet_send_sync with dst_host_filter returning too many host_ids_entries
[knet]: [ERROR] transport: Failed to set socket buffer via option 8 to value 8388608: capped at 425984
[knet]: [ERROR] transport: Continuing regardless, as the handle is not privileged. Expect poor performance!
[knet]: [ERROR] transport: Failed to set socket buffer via option 7 to value 8388608: capped at 425984
[knet]: [ERROR] transport: Continuing regardless, as the handle is not privileged. Expect poor performance!
[knet]: [debug] transport: FREEBIND enabled on socket: 20
[knet]: [debug] transport: PMTUDISC enabled on socket: 20
[knet]: [debug] udp: IP_RECVERR enabled on socket: 20
[knet]: [debug] link: Configuring default access lists for host: 1 link: 0 socket: 20
[knet]: [debug] link: host: 1 link: 0 is configured
waiting host 1 to be reachable for 10 more seconds
[knet]: [info] host: host: 1 (passive) best link: 0 (pri: 0)
[knet]: [WARNING] host: host: 1 has no active links
waiting host 1 to be reachable for 9 more seconds
waiting host 1 to be reachable for 8 more seconds
waiting host 1 to be reachable for 7 more seconds
waiting host 1 to be reachable for 6 more seconds
waiting host 1 to be reachable for 5 more seconds
[knet]: [debug] rx: host: 1 link: 0 received pong: 1
waiting host 1 to be reachable for 4 more seconds
waiting host 1 to be reachable for 3 more seconds
waiting host 1 to be reachable for 2 more seconds
waiting host 1 to be reachable for 1 more seconds
timeout waiting for host to be reachable[knet]: [debug] link: host: 1 link: 0 is disabled
[knet]: [debug] link: host: 1 link: 0 config has been wiped
[knet]: [info] host: host: 1 (passive) best link: 0 (pri: 0)
[knet]: [WARNING] host: host: 1 has no active links

1.log
2.log
3.log
4.log

Issues with cross compiling

Hey Feri,

I need some specific Debian help here. Just filing the issue to open a discussion.

https://ci.kronosnet.org/view/knet/job/knet-build-crosscompile-armhf/

started failing a few weeks ago with pkg-config unable to find nss for armhf.

10:11:08 ii libnss3-dev:armhf 2:3.85-1 armhf Development files for the Network Security Service libraries

Oddly enough if I install libnss3-dev:amd64 the cross compilation continues just fine.

That said, libnss3-dev is the only one in our BuildRequires that can only be installed in either amd64 or armhf version. All the other libs have -dev available for both (that might be just masking another problem).

Any idea what has changed? Should we just use the amd64 -dev and move on?

unsigned integer underflow causing large timeouts in pthread_cond_timedwait

For the past few days, @itihas and I have been investigating an issue with corosync locking up on a cluster we manage. There are multiple ways it manifests, one of which is the corosync process becoming unresponsive. Running corosync-quorumtool returns could not initialize CMAP service. Running corosync-cmapctl gets stuck. Other nodes are unable to reach this node, and mark it as down.
Looking at the core dump of the unresponsive corosync process, we came across a call to pthread_cond_timedwait with unusual arguments in one of the threads.
https://github.com/kronosnet/kronosnet/blob/v1.13/libknet/threads_pmtud.c#L312

Here, the timeout passed is large.

(gdb) print (struct timespec) *0x7f0e301d1e00
$25 = {tv_sec = 13141446444, tv_nsec = 120981050}

13141446444 seconds is over 400 years. Tracing this back, tv_sec gets its value from pong_timeout_adj_tmp at https://github.com/kronosnet/kronosnet/blob/v1.13/libknet/threads_pmtud.c#L292

if (knet_h->crypto_instance) {
	/*
	 * crypto, under pressure, is a royal PITA
	 */
	pong_timeout_adj_tmp = dst_link->pong_timeout_adj * dst_link->pmtud_crypto_timeout_multiplier;
} else {
	pong_timeout_adj_tmp = dst_link->pong_timeout_adj;
}

ts.tv_sec += pong_timeout_adj_tmp / 1000000;
ts.tv_nsec += (((pong_timeout_adj_tmp) % 1000000) * 1000);
while (ts.tv_nsec > 1000000000) {
	ts.tv_sec += 1;
	ts.tv_nsec -= 1000000000;
}
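For illustration only, the same microsecond-to-timespec arithmetic can be sketched with an upper bound on the timeout, so that a corrupted value is rejected instead of producing a wait of centuries. This is not the knet code: `timespec_add_usec` and `MAX_WAIT_USEC` are hypothetical names, and the one-hour ceiling is an arbitrary assumption.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L
#define USEC_PER_SEC 1000000ULL

/* Hypothetical ceiling for a single wait (one hour); not a knet constant. */
#define MAX_WAIT_USEC (3600ULL * USEC_PER_SEC)

/*
 * Add timeout_usec to *ts, keeping tv_nsec in [0, NSEC_PER_SEC).
 * Assumes ts->tv_nsec is already normalized on entry.
 * Returns 0 on success, or -1 (leaving *ts untouched) when the
 * timeout is larger than MAX_WAIT_USEC.
 */
static int timespec_add_usec(struct timespec *ts, uint64_t timeout_usec)
{
	if (timeout_usec > MAX_WAIT_USEC) {
		return -1;
	}

	ts->tv_sec += (time_t)(timeout_usec / USEC_PER_SEC);
	ts->tv_nsec += (long)((timeout_usec % USEC_PER_SEC) * 1000ULL);

	/* At most one carry is needed because both addends are below one second. */
	if (ts->tv_nsec >= NSEC_PER_SEC) {
		ts->tv_sec += 1;
		ts->tv_nsec -= NSEC_PER_SEC;
	}
	return 0;
}
```

A caller would treat the -1 return as a signal that the computed timeout is implausible and fall back to some default instead of sleeping on it.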

In the dst_link struct, pong_timeout_adj is 15384728329515354 and pmtud_crypto_timeout_multiplier is 2. pong_timeout_adj is set at https://github.com/kronosnet/kronosnet/blob/v1.13/libknet/threads_heartbeat.c#L180

dst_link->pong_timeout_adj = (dst_link->pong_timeout * dst_link->pong_timeout_backoff) + (dst_link->status.latency * KNET_LINK_PONG_TIMEOUT_LAT_MUL);

pong_timeout = 27000000
pong_timeout_backoff = 1
status.latency = 7692364151257677
KNET_LINK_PONG_TIMEOUT_LAT_MUL = 2

latency is set at https://github.com/kronosnet/kronosnet/blob/v1.13/libknet/threads_rx.c#L625

src_link->status.latency =
	((src_link->status.latency * src_link->latency_exp) +
	((latency_last / 1000llu) *
		(src_link->latency_fix - src_link->latency_exp))) /
			src_link->latency_fix;

Here, latency_exp = 4294965888. This value appears to be set at https://github.com/kronosnet/kronosnet/blob/v1.13/libknet/links.c#L230

link->latency_exp = KNET_LINK_DEFAULT_PING_PRECISION - \
    ((link->ping_interval * KNET_LINK_DEFAULT_PING_PRECISION) / 8000000);

Corosync sets ping_interval to token_timeout/4. We had set token to 5000 and token_coefficient to 1000 on our 51 node cluster, which would make token_timeout 5000+1000*(49) = 54000 ms. This in turn makes ping_interval = 13500000 us.

KNET_LINK_DEFAULT_PING_PRECISION = 2048
ping_interval = token_timeout / (pong_count*2)
pong_count = 2
token_timeout = 54000000

Putting these values in, we can see that latency_exp, which is an unsigned int, gets set to -1408, i.e. it wraps around to 4294965888.
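The arithmetic can be reproduced in isolation. The sketch below copies the two expressions quoted from links.c and threads_rx.c into standalone functions. The function names are made up for this example, and the integer widths (32-bit unsigned for latency_exp and latency_fix, 64-bit for the latency values) are assumptions based on the description above, not taken from the knet headers.

```c
#include <stdint.h>

/* KNET_LINK_DEFAULT_PING_PRECISION, value as given in the report. */
#define PING_PRECISION 2048u

/* Same expression as in links.c, truncated to an unsigned 32-bit result. */
static uint32_t latency_exp_for(uint64_t ping_interval_usec)
{
	return (uint32_t)(PING_PRECISION -
			  (ping_interval_usec * PING_PRECISION) / 8000000u);
}

/* One update of the latency average, following the threads_rx.c expression. */
static uint64_t latency_step(uint64_t latency, uint64_t latency_last_nsec,
			     uint32_t latency_exp, uint32_t latency_fix)
{
	return ((latency * latency_exp) +
		((latency_last_nsec / 1000u) * (latency_fix - latency_exp))) /
	       latency_fix;
}
```

With a ping_interval of 1000000 us, latency_exp_for() gives 1792 and a latency of 1000 stays at 1000 after one step with a 1000000 ns sample. With the reported 13500000 us it gives 4294965888, and the same step turns a latency of 1000 into 2097153000, which is how the average runs away after only a few pongs.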

Granted our token and token_coefficient are probably larger than they should be, but even with the defaults of 1000ms and 650ms a cluster having 47 nodes will run into this issue. Given this is a problem with both how knet doesn't sanitize latency_exp and how corosync autoconfigures ping_interval from token_timeout, we're reporting this issue on both projects.
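One possible shape for the missing sanity check, sketched under assumptions: `latency_exp_checked` is a hypothetical helper, not the fix that went into knet, and rejecting the value outright (rather than clamping it) is just one of several reasonable policies.

```c
#include <stdint.h>

#define PING_PRECISION    2048u     /* KNET_LINK_DEFAULT_PING_PRECISION per the report */
#define PING_INTERVAL_DIV 8000000u  /* divisor used in the links.c expression */

/*
 * Hypothetical guard: compute latency_exp only when the subtraction
 * cannot wrap. Returns 0 and stores the result in *out, or -1 when
 * ping_interval_usec is too large for the formula.
 */
static int latency_exp_checked(uint64_t ping_interval_usec, uint32_t *out)
{
	uint64_t scaled;

	/* Reject inputs whose multiplication would overflow 64 bits. */
	if (ping_interval_usec > UINT64_MAX / PING_PRECISION) {
		return -1;
	}

	scaled = (ping_interval_usec * PING_PRECISION) / PING_INTERVAL_DIV;
	if (scaled > PING_PRECISION) {
		return -1; /* PING_PRECISION - scaled would wrap around */
	}

	*out = (uint32_t)(PING_PRECISION - scaled);
	return 0;
}
```

Under this check the largest accepted ping_interval is the one where the scaled value equals the precision (8000000 us with the constants above), which yields a latency_exp of 0.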

corosync issue:
corosync/corosync#525

Some clang compiler errors

libtool: compile: clang -DHAVE_CONFIG_H -I. -I.. -I/usr/include/nss -I/usr/include/nspr -g -O2 -fPIC -DPIC -O3 -ggdb3 -Wall -Wshadow -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-align -Wbad-function-cast -Wmissing-format-attribute -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -Waddress -Woverflow -Wparentheses -Wsequence-point -Wswitch -Wuninitialized -Wunused-function -Wunused-result -Wunused-value -Wunused-variable -MT libknet_la-threads_common.lo -MD -MP -MF .deps/libknet_la-threads_common.Tpo -c threads_common.c -o libknet_la-threads_common.o >/dev/null 2>&1
mv -f .deps/libknet_la-compat.Tpo .deps/libknet_la-compat.Plo
transport_udp.c:313:16: error: cast from 'unsigned char *' to 'struct
sock_extended_err *' increases required alignment from 1 to 4
[-Werror,-Wcast-align]
...sock_err = (struct sock_extended_err *)CMSG_DATA(cmsg);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
transport_udp.c:324:17: error: cast from 'struct sockaddr *' to 'struct
sockaddr_storage *' increases required alignment from 2 to 8
[-Werror,-Wcast-align]
...origin = (struct sockaddr_storage *)SO_EE_OFFENDER(sock_err);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
make[3]: *** [libknet_la-transport_udp.lo] Error 1
make[3]: *** Waiting for unfinished jobs....
libtool: compile: clang -DHAVE_CONFIG_H -I. -I.. -I/usr/include/nss -I/usr/include/nspr -g -O2 -fPIC -DPIC -O3 -ggdb3 -Wall -Wshadow -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-align -Wbad-function-cast -Wmissing-format-attribute -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -Waddress -Woverflow -Wparentheses -Wsequence-point -Wswitch -Wuninitialized -Wunused-function -Wunused-result -Wunused-value -Wunused-variable -MT libknet_la-threads_dsthandler.lo -MD -MP -MF .deps/libknet_la-threads_dsthandler.Tpo -c threads_dsthandler.c -o libknet_la-threads_dsthandler.o >/dev/null 2>&1
libtool: compile: clang -DHAVE_CONFIG_H -I. -I.. -I/usr/include/nss -I/usr/include/nspr -g -O2 -fPIC -DPIC -O3 -ggdb3 -Wall -Wshadow -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-align -Wbad-function-cast -Wmissing-format-attribute -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -Waddress -Woverflow -Wparentheses -Wsequence-point -Wswitch -Wuninitialized -Wunused-function -Wunused-result -Wunused-value -Wunused-variable -MT libknet_la-crypto.lo -MD -MP -MF .deps/libknet_la-crypto.Tpo -c crypto.c -o libknet_la-crypto.o >/dev/null 2>&1
handle.c:1084:15: error: comparison of unsigned expression < 0 is always false
[-Werror,-Wtautological-compare]
if ((enabled < 0) || (enabled > 1)) {
~~~~~~~ ^ ~

You can see the full context here: https://travis-ci.org/jonesmz/kronosnet/jobs/206453102

Note: Things are a little wonky since Amazon-S3 had some performance problems yesterday.
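The two kinds of diagnostics in the log above (-Wcast-align and -Wtautological-compare) can be illustrated with a small standalone sketch. It is not the change that was made in knet: `struct err_record` is a made-up stand-in for a structure carried in a byte buffer, and both function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a structure delivered inside a byte buffer (hypothetical layout). */
struct err_record {
	uint32_t code;
	uint32_t info;
};

/*
 * Copy the record out of the byte buffer instead of casting the pointer.
 * memcpy has no alignment requirement on its source, so the compiler
 * has nothing to warn about and the access is safe on strict-alignment CPUs.
 */
static struct err_record read_err_record(const unsigned char *buf)
{
	struct err_record rec;

	memcpy(&rec, buf, sizeof(rec));
	return rec;
}

/* For an unsigned flag only the upper bound can ever fail. */
static int flag_is_valid(unsigned int enabled)
{
	return enabled <= 1;
}
```

Whether a copy or an explicitly aligned buffer is preferable depends on how hot the code path is; the copy is the simpler of the two.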

Suspicion to infinite loop problem?

Hi,
I suspect there is a bug in your code. I saw 100% CPU usage from the corosync process, which uses your library. It happened when I disabled a network interface with ifconfig (ifconfig eth0 down). I did not have debugging symbols, so I tried to guess where exactly the bug was.
It looked like an endless loop in which sendto() was called. sendto() returned -1 and the errno value was 113 (I am not 100% sure about this).
I think the infinite loop is here.
I think that src_link->transport_connected is true, one of the sendto calls returns -1, transport_tx_sock_error() returns KNET_TRANSPORT_SOCK_ERROR_RETRY, and then the goto jumps back to the beginning.

I was not able to verify that the bug is really here. Could you please check this?
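As an illustration of the suspected pattern, a retry-on-error loop only terminates if the number of retries is bounded. The sketch below is generic and does not use the knet transport API: `send_fn`, `send_with_bounded_retry` and `always_unreachable` are hypothetical names, and treating EHOSTUNREACH (errno 113 on Linux) as retryable is an assumption that mirrors the report.

```c
#include <errno.h>

/* Hypothetical send callback: returns bytes sent, or -1 with errno set. */
typedef long (*send_fn)(void *ctx);

/*
 * Call fn until it succeeds, fails with an errno that is not worth
 * retrying, or max_retries retries have been used up.
 */
static long send_with_bounded_retry(send_fn fn, void *ctx,
				    unsigned int max_retries)
{
	unsigned int attempt;
	long ret = -1;

	for (attempt = 0; attempt <= max_retries; attempt++) {
		ret = fn(ctx);
		if (ret >= 0) {
			return ret;
		}
		if (errno != EAGAIN && errno != EINTR && errno != EHOSTUNREACH) {
			break; /* permanent error, give up immediately */
		}
	}
	return ret; /* still -1; errno describes the last failure */
}

/* Demo callback: always fails with EHOSTUNREACH and counts its calls in *ctx. */
static long always_unreachable(void *ctx)
{
	unsigned int *calls = ctx;

	(*calls)++;
	errno = EHOSTUNREACH;
	return -1;
}
```

With max_retries set to 3, the demo callback is invoked four times and the caller gets -1 back, instead of spinning for as long as the interface stays down.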

[crypto] port to openssl3

Hey Feri,

I need some help on $subject.

CI recently spotted that Debian experimental has openssl 3.0 packages (some old devel snapshot).

knet uses HMAC (https://www.openssl.org/docs/manmaster/man3/HMAC.html) that has been deprecated in 3.0. We currently have a workaround in place to continue build / running:

b5702c1

(this is also available in stable1-proposed, queued for the next release).

clearly this is not ideal for the long term, so I started working to port knet to the new API and I have the code working in a PoC:

https://fabbione.net/ssl3.c.txt

The issue is that the current openssl package in Debian experimental is missing at least <openssl/mac.h> and therefore I can't really land the final code in knet.

Do you think you could be so kind as to contact the openssl team and ask for a more recent snapshot with all the files?

I am sure they have it under control, and this is far from being urgent. I just want to be more pro-active to get this going.

Thanks
Fabio

gcc 5 compilation failure

On F22 compilation fails (gcc-5.0.0-0.17.fc22.x86_64):

In file included from link.c:19:0:
link.c: In function '_link_updown':
link.c:44:6: error: format '%u' expects argument of type 'unsigned int', but argument 7 has type 'int' [-Werror=format=]
      "Unable to update link status for host: %u link: %u enabled: %u connected: %u)",
      ^
logging.h:28:42: note: in definition of macro 'log_debug'
  log_msg(knet_h, subsys, KNET_LOG_DEBUG, fmt, ##args)
                                          ^
link.c:44:6: error: format '%u' expects argument of type 'unsigned int', but argument 8 has type 'int' [-Werror=format=]
      "Unable to update link status for host: %u link: %u enabled: %u connected: %u)",
      ^
logging.h:28:42: note: in definition of macro 'log_debug'
  log_msg(knet_h, subsys, KNET_LOG_DEBUG, fmt, ##args)
                                          ^
link.c: In function 'knet_link_set_timeout':
link.c:598:5: error: format '%d' expects argument of type 'int', but argument 9 has type 'unsigned int' [-Werror=format=]
     "host: %u link: %u timeout update - interval: %llu timeout: %llu precision: %d",
     ^
logging.h:28:42: note: in definition of macro 'log_debug'
  log_msg(knet_h, subsys, KNET_LOG_DEBUG, fmt, ##args)
                                          ^
cc1: all warnings being treated as errors
Makefile:767: recipe for target 'libknet_la-link.lo' failed
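The errors above come from format specifiers that do not match the signedness of their arguments. A minimal sketch of a matching call follows. `format_link_status` is a hypothetical helper, and the choice of which arguments are signed is an assumption made for the example, not the actual types used in link.c.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a link status line with specifiers that match the argument
 * types: %u for the unsigned identifiers, %d for the signed flags.
 * Returns the snprintf() result.
 */
static int format_link_status(char *buf, size_t len,
			      unsigned int host_id, unsigned int link_id,
			      int enabled, int connected)
{
	return snprintf(buf, len,
			"host: %u link: %u enabled: %d connected: %d",
			host_id, link_id, enabled, connected);
}
```

The alternative is to keep the specifier and cast the argument explicitly. Either way the format string and the argument list have to agree once -Wformat is promoted to an error.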

Build error when compiling for Ubuntu Precise

The command "make -j" exited with 2.
0.14s$ sudo make install
make install-recursive
make[1]: Entering directory `/home/travis/build/jonesmz/kronosnet'
Making install in init
make[2]: Entering directory `/home/travis/build/jonesmz/kronosnet/init'
make[3]: Entering directory `/home/travis/build/jonesmz/kronosnet/init'
make[3]: Nothing to be done for `install-exec-am'.
test -z "" || /bin/mkdir -p ""
test -z "" || /bin/mkdir -p ""
make[3]: Leaving directory `/home/travis/build/jonesmz/kronosnet/init'
make[2]: Leaving directory `/home/travis/build/jonesmz/kronosnet/init'
Making install in libtap
make[2]: Entering directory `/home/travis/build/jonesmz/kronosnet/libtap'
make[3]: Entering directory `/home/travis/build/jonesmz/kronosnet/libtap'
test -z "/usr/lib" || /bin/mkdir -p "/usr/lib"
test -z "/usr/include" || /bin/mkdir -p "/usr/include"
test -z "" || /bin/mkdir -p ""
make[3]: Leaving directory `/home/travis/build/jonesmz/kronosnet/libtap'
make[2]: Leaving directory `/home/travis/build/jonesmz/kronosnet/libtap'
Making install in libknet
make[2]: Entering directory `/home/travis/build/jonesmz/kronosnet/libknet'
Making install in .
make[3]: Entering directory `/home/travis/build/jonesmz/kronosnet/libknet'
/bin/bash ../libtool --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/usr/include/nss -I/usr/include/nspr -std=gnu99 -g -O2 -fPIC -DPIC -O3 -ggdb3 -Wall -Wshadow -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-align -Wbad-function-cast -Wmissing-format-attribute -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -Waddress -Wcpp -Woverflow -Wparentheses -Wsequence-point -Wswitch -Wuninitialized -Wunused-but-set-variable -Wunused-function -Wunused-result -Wunused-value -Wunused-variable -MT libknet_la-transport_sctp.lo -MD -MP -MF .deps/libknet_la-transport_sctp.Tpo -c -o libknet_la-transport_sctp.lo `test -f 'transport_sctp.c' || echo './'`transport_sctp.c
libtool: compile: gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/usr/include/nss -I/usr/include/nspr -std=gnu99 -g -O2 -fPIC -DPIC -O3 -ggdb3 -Wall -Wshadow -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-align -Wbad-function-cast -Wmissing-format-attribute -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -Waddress -Wcpp -Woverflow -Wparentheses -Wsequence-point -Wswitch -Wuninitialized -Wunused-but-set-variable -Wunused-function -Wunused-result -Wunused-value -Wunused-variable -MT libknet_la-transport_sctp.lo -MD -MP -MF .deps/libknet_la-transport_sctp.Tpo -c transport_sctp.c -fPIC -DPIC -o .libs/libknet_la-transport_sctp.o
transport_sctp.c: In function ‘_close_connect_socket’:
transport_sctp.c:95:74: error: declaration of ‘link’ shadows a global declaration [-Werror=shadow]
transport_sctp.c: In function ‘sctp_link_listener_start’:
transport_sctp.c:901:98: error: declaration of ‘link’ shadows a global declaration [-Werror=shadow]
transport_sctp.c: In function ‘sctp_link_listener_stop’:
transport_sctp.c:1002:76: error: declaration of ‘link’ shadows a global declaration [-Werror=shadow]
transport_sctp.c: In function ‘sctp_transport_link_set_config’:
transport_sctp.c:1096:83: error: declaration of ‘link’ shadows a global declaration [-Werror=shadow]
transport_sctp.c: In function ‘sctp_transport_link_clear_config’:
transport_sctp.c:1154:85: error: declaration of ‘link’ shadows a global declaration [-Werror=shadow]
cc1: all warnings being treated as errors
make[3]: *** [libknet_la-transport_sctp.lo] Error 1
make[3]: Leaving directory `/home/travis/build/jonesmz/kronosnet/libknet'
make[2]: *** [install-recursive] Error 1
make[2]: Leaving directory `/home/travis/build/jonesmz/kronosnet/libknet'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/home/travis/build/jonesmz/kronosnet'
make: *** [install] Error 2
The command "sudo make install" exited with 2.
Done. Your build exited with 1.

debian valgrind missing fix for https://bugs.kde.org/show_bug.cgi?id=381289

Hi Feri,

could you please tell the Debian valgrind maintainer that the current combination of glibc/valgrind is triggering this known upstream valgrind issue: https://bugs.kde.org/show_bug.cgi?id=381289 ?

simple reproducer is here: https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/1726711

I originally spotted this one on ubuntu and debian experimental. It looks like the new glibc that triggers the issue in valgrind has now made it all the way to unstable/testing (see our CI logs for example).

Fix already exists upstream, just needs a simple backport.

Thanks
Fabio

another differents segfaults (libknet 1.12)

Hi,

some proxmox users still see corosync segfaults (these seem different from the link bug)

Here is a coredump backtrace (libknet 1.12).

This looks like some other coredumps we have here:
https://bugzilla.proxmox.com/show_bug.cgi?id=2326

The common error is
" group_name = 0x7fe26e160857 <error: Cannot access memory at address 0x7fe26e160857>"

root@kvmformation1:~# gdb /usr/sbin/corosync core
GNU gdb (Debian 8.2.1-2+b1) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/corosync...Reading symbols from /usr/lib/debug/.build-id/c2/fb2926f76a842314c5b489de886b3943c51f52.debug...done.
done.
[New LWP 41758]
[New LWP 41770]
[New LWP 41768]
[New LWP 41773]
[New LWP 41767]
[New LWP 41772]
[New LWP 41771]
[New LWP 41774]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/corosync -f'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memcmp_avx2_movbe () at ../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S:183
183	../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S: No such file or directory.
[Current thread is 1 (Thread 0x7fe26b907c00 (LWP 41758))]
(gdb) bt full
#0  __memcmp_avx2_movbe () at ../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S:183
No locals.
#1  0x0000557216e9fb64 in group_matches (iov_len=1, iovec=<synthetic pointer>, adjust_iovec=<synthetic pointer>, group_b_cnt=1, groups_b=0x5572197da660) at totempg.c:495
        group_len = 0x7fe268287955
        group_name = 0x7fe26e160857 <error: Cannot access memory at address 0x7fe26e160857>
        i = <optimized out>
        j = <optimized out>
        group_len = <optimized out>
        group_name = <optimized out>
        i = <optimized out>
        j = <optimized out>
        __PRETTY_FUNCTION__ = <optimized out>
#2  app_deliver_fn (endian_conversion_required=0, msg_len=<optimized out>, msg=0x7fe268287955, nodeid=3) at totempg.c:546
        stripped_iovec = <optimized out>
        adjust_iovec = <optimized out>
        list = 0x5572197da640
        aligned_iovec = <optimized out>
        instance = 0x5572197da620
        iovec = <synthetic pointer>
        instance = <optimized out>
        stripped_iovec = <optimized out>
        adjust_iovec = <optimized out>
        iovec = <optimized out>
        list = <optimized out>
        aligned_iovec = <optimized out>
#3  totempg_deliver_fn (nodeid=3, msg=0x5572197ebc8d, msg_len=<optimized out>, endian_conversion_required=0) at totempg.c:676
        mcast = 0x5572197ebc8d
        msg_lens = 0x7ffe5c1f13f8
        i = 4
        assembly = <optimized out>
        header = "\000\000\000\000\000\000\005\000v\002-\002$\002z\002u\002u\002r\002(\002t\002x\002u\002{\002v\002r\002u\002(\002(\002u\002r\002v\002r\002b\002x\002h\002\374\001\373\001\375\001\372\001\371\001\374\001\374\001\374\001", '\000' <repeats 25472 times>...
        msg_count = 5
        continuation = <optimized out>
        start = 0
        data = 0x5572197ebc8d ""
        datasize = <optimized out>
        iov_delv = {iov_base = 0x7fe268287955, iov_len = <optimized out>}
        __PRETTY_FUNCTION__ = "totempg_deliver_fn"
        __FUNCTION__ = "totempg_deliver_fn"
#4  0x0000557216e975e6 in messages_deliver_to_app (instance=instance@entry=0x55721759bc00, skip=skip@entry=0, end_point=<optimized out>) at totemsrp.c:4208
        ptr = 0x7fe26a0442e0
        sort_queue_item_p = 0x7fe26a0442e0
        i = <optimized out>
        res = 0
        mcast_in = <optimized out>
        mcast_header = {header = {magic = 49264, version = 3 '\003', type = 1 '\001', encapsulated = 2 '\002', nodeid = 3, target_nodeid = 0}, system_from = {nodeid = 3}, seq = 393517, this_seqno = 100977, ring_id = {rep = 1, 
            seq = 2050076}, node_id = 0, guarantee = 0}
        range = <optimized out>
        endian_conversion_required = 0
        my_high_delivered_stored = <optimized out>
        __FUNCTION__ = "messages_deliver_to_app"
        __PRETTY_FUNCTION__ = "messages_deliver_to_app"
#5  0x0000557216e980e4 in message_handler_mcast (instance=0x55721759bc00, msg=<optimized out>, msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:4329
        sort_queue_item = {mcast = 0x5572197ebc60, msg_len = 3755}
        sort_queue = <optimized out>
        mcast_header = {header = {magic = 49264, version = 3 '\003', type = 1 '\001', encapsulated = 2 '\002', nodeid = 3, target_nodeid = 0}, system_from = {nodeid = 3}, seq = 393517, this_seqno = 100977, ring_id = {rep = 1, 
            seq = 2050076}, node_id = 0, guarantee = 0}
        __PRETTY_FUNCTION__ = "message_handler_mcast"
        __FUNCTION__ = "message_handler_mcast"
#6  0x0000557216ea2459 in data_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x5572175faf00) at totemknet.c:715
        instance = <optimized out>
        msg_hdr = {msg_name = 0x7ffe5c2015c0, msg_namelen = 0, msg_iov = 0x7ffe5c201570, msg_iovlen = 1, msg_control = 0x0, msg_controllen = 0, msg_flags = 0}
        iov_recv = {iov_base = 0x5572175faf78, iov_len = 65536}
        system_from = {ss_family = 64907, 
          __ss_padding = ":\000\000\000\000\000\361\353\270#\000\000\000\000\310\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\002", '\000' <repeats 15 times>, "\060\026 \\\376\177\000\000\366_\300k\342\177\000\000h\250V\027rU\000\000\020\312\177\031rU\000\000\200\254X\027rU\000\000\321\342\251k\342\177\000\000\213\375:\000\000\000\000", __ss_align = 13137072968565283584}
        msg_len = <optimized out>
        truncated_packet = 0
        __FUNCTION__ = "data_deliver_fn"
#7  0x00007fe26baa10af in ?? () from /usr/lib/x86_64-linux-gnu/libqb.so.0
No symbol table info available.
#8  0x00007fe26baa0c8d in qb_loop_run () from /usr/lib/x86_64-linux-gnu/libqb.so.0
No symbol table info available.
#9  0x0000557216e6c0f5 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1564
        error_string = 0x557216f51540 <error_string_response> "Successfully read main configuration file '/etc/corosync/corosync.conf'."
        totem_config = {version = 2, interfaces = 0x557217577960, orig_interfaces = 0x0, node_id = 2, clear_node_high_bit = 0, knet_pmtud_interval = 30, 
          private_key = "\021J\364,\213\273\033S\022f\256\006\232C\373o\264:q \306\303^\267\265A\307\063\320\342\237$\201\300\362TBTM\033\270x\264\025t\v\357\354\370\363<wI\366G\016\064=FU\267\017\213\275\223\241\334\024>5\026\346λ\252~\203\247}\375_g\001YV\373\347VZ\036\236\326\364p\210u\241m?\363\071\v\250\213\352\231Cw5.Q\203\375\r\303\307ԩ\312,\243(P&Y\005x\264", '\000' <repeats 3967 times>, private_key_len = 128, token_timeout = 2300, token_warning = 75, 
          token_retransmit_timeout = 547, token_hold_timeout = 427, token_retransmits_before_loss_const = 4, join_timeout = 50, send_join_timeout = 0, consensus_timeout = 2760, merge_timeout = 200, downcheck_timeout = 1000, 
          fail_to_recv_const = 2500, seqno_unchanged_const = 30, link_mode = "passive", '\000' <repeats 56 times>, totem_logging_configuration = {log_printf = 0x557216e822a0 <_logsys_log_printf>, log_level_security = 4, 
            log_level_error = 3, log_level_warning = 4, log_level_notice = 5, log_level_debug = 7, log_level_trace = 8, log_subsys_id = 15}, net_mtu = 65491, threads = 0, heartbeat_failures_allowed = 0, max_network_delay = 50, 
          window_size = 50, max_messages = 17, vsf_type = 0x0, broadcast_use = 0, crypto_model = 0x557217576660 "nss", crypto_cipher_type = 0x557217576620 "aes256", crypto_hash_type = 0x557217576640 "sha256", 
          knet_compression_model = 0x557217589fd0 "none", knet_compression_threshold = 0, knet_compression_level = 0, transport_number = TOTEM_TRANSPORT_KNET, miss_count_const = 5, ip_version = TOTEM_IP_VERSION_4, 
          block_unlisted_ips = 1, totem_memb_ring_id_create_or_load = 0x557216e82a10 <corosync_ring_id_create_or_load>, totem_memb_ring_id_store = 0x557216e82890 <corosync_ring_id_store>}
        res = <optimized out>
        ch = <optimized out>
        background = <optimized out>
        sched_rr = <optimized out>
        prio = <optimized out>
        testonly = <optimized out>
        move_to_root_cgroup = <optimized out>
        stat_out = {st_dev = 64768, st_ino = 264293, st_nlink = 2, st_mode = 16877, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec = 1569524446, tv_nsec = 732683325}, 
          st_mtim = {tv_sec = 1569501051, tv_nsec = 912452066}, st_ctim = {tv_sec = 1569501051, tv_nsec = 912452066}, __glibc_reserved = {0, 0, 0}}
        flock_err = COROSYNC_DONE_EXIT
        totem_config_warnings = 16
        scheduler_pause_timeout_data = {totem_config = 0x7ffe5c201860, handle = 4175510717761323015, tv_prev = 3865995425596872, max_tv_diff = 1840000000}
        tmpli = <optimized out>
        ep = 0x0
        tmp_str = 0x0
        log_subsys_id_totem = <optimized out>
        __func__ = <optimized out>

Configure travis-ci for the official kronosnet repository

Pursuant to the travis-ci configuration file that I contributed to this project a few weeks ago, a project developer needs to register for a free account with Travis CI and configure it to build the kronosnet project.

Additional optional features include modifying the gh-pages to display the current build status, configuring Travis CI to verify pull requests, and so on.

Git archive fails to build with obscure error

Download the ZIP archive of a branch from GitHub, unzip -x and build it. It fails with:

[...]
if [ -z "$dirty" ]; then sed -i -e "s#%glo.*dirty.*##g" kronosnet.spec-t; fi
sed: can't read kronosnet.spec-t: No such file or directory
Makefile:993: recipe for target 'kronosnet.spec' failed
make[2]: *** [kronosnet.spec] Error 2
make[2]: Leaving directory '/home/wferi/ha/kronosnet/kronosnet-configure-hardening2'
Makefile:550: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/wferi/ha/kronosnet/kronosnet-configure-hardening2'
Makefile:460: recipe for target 'all' failed
make: *** [all] Error 2

It isn't very obvious that echo 1.1 >.tarball-version fixes this.
It would be nicer to provide at least a clear error in such cases, or even better let the build finish (maybe without the spec file).

kronosnet tries to unconditionally build with -fstack-protector which is not available on all targets

kronosnet fails to build from source on Debian alpha, hppa and ia64 targets as the build system tries to use the -fstack-protector flag unconditionally which is not available on these targets and consequently generates a warning which fails the build when built with -Werror:

/bin/bash ../libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I..   -Wdate-time -D_FORTIFY_SOURCE=2 -O3 -ggdb3 -Werror -Wall -Wextra  -fPIC -DPIC -pie -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fexceptions -D_GLIBCXX_ASSERTIONS -Wl,-z,now -fstack-clash-protection -Wno-unused-parameter -pthread -I/usr/include/libnl3 -I/usr/include/libnl3 -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -specs=/usr/share/dpkg/pie-compile.specs -Wformat -Werror=format-security -c -o libnozzle_la-libnozzle.lo `test -f 'libnozzle.c' || echo './'`libnozzle.c
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I.. -Wdate-time -D_FORTIFY_SOURCE=2 -O3 -ggdb3 -Werror -Wall -Wextra -fPIC -DPIC -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fexceptions -D_GLIBCXX_ASSERTIONS -Wl,-z,now -fstack-clash-protection -Wno-unused-parameter -pthread -I/usr/include/libnl3 -I/usr/include/libnl3 -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -specs=/usr/share/dpkg/pie-compile.specs -Wformat -Werror=format-security -c libnozzle.c  -fPIC -DPIC -o .libs/libnozzle_la-libnozzle.o
cc1: error: ‘-fstack-protector’ not supported for this target [-Werror]
cc1: all warnings being treated as errors

Full log: https://buildd.debian.org/status/fetch.php?pkg=kronosnet&arch=alpha&ver=1.23-1&stamp=1638252618&raw=0

See also this issue for a possible fix: Yubico/libfido2#443

stats gathering can cause PMTUd to never complete

We discovered this issue while debugging @illustris's environment.

In an environment with many nodes and many links, PMTUd can take a long time to complete (on the order of many minutes).
If any monitoring system gathers stats at an interval shorter than a PMTUd run, PMTUd will be restarted on each data collection and never complete.

The restart is triggered because stats gathering needs a write lock (reverse locking for performance).

Clearly this is a major operational issue on big knet clusters.

I have been working on a set of patches to switch stats reading / collection from write lock to read lock by using a combination of 2 mutexes.

One global mutex is used to protect the handle stats, and one per-link mutex to protect the per link stats.
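The two-mutex layout described above could look roughly like the sketch below. It is not the code from the patches: all `demo_` names and the stats fields are hypothetical, and the global handle lock that the patches still take for reading is left out to keep the example short.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_LINKS 8

struct demo_link_stats {
	uint64_t tx_packets;
	uint64_t rx_packets;
};

struct demo_handle_stats {
	uint64_t tx_crypt_packets;
};

struct demo_handle {
	pthread_mutex_t handle_stats_mutex;               /* protects handle_stats */
	struct demo_handle_stats handle_stats;
	pthread_mutex_t link_stats_mutex[DEMO_MAX_LINKS]; /* one mutex per link */
	struct demo_link_stats link_stats[DEMO_MAX_LINKS];
};

/* Zero the stats and initialize every mutex. Returns 0 on success, -1 on error. */
static int demo_handle_init(struct demo_handle *h)
{
	unsigned int i;

	memset(h, 0, sizeof(*h));
	if (pthread_mutex_init(&h->handle_stats_mutex, NULL) != 0) {
		return -1;
	}
	for (i = 0; i < DEMO_MAX_LINKS; i++) {
		if (pthread_mutex_init(&h->link_stats_mutex[i], NULL) != 0) {
			return -1;
		}
	}
	return 0;
}

/* Data path: update one link's counters under that link's mutex only. */
static void demo_link_count_tx(struct demo_handle *h, unsigned int link_id)
{
	pthread_mutex_lock(&h->link_stats_mutex[link_id]);
	h->link_stats[link_id].tx_packets++;
	pthread_mutex_unlock(&h->link_stats_mutex[link_id]);
}

/* Monitoring path: take a consistent snapshot without a handle-wide write lock. */
static int demo_link_get_stats(struct demo_handle *h, unsigned int link_id,
			       struct demo_link_stats *out)
{
	if (link_id >= DEMO_MAX_LINKS) {
		return -1;
	}
	pthread_mutex_lock(&h->link_stats_mutex[link_id]);
	*out = h->link_stats[link_id];
	pthread_mutex_unlock(&h->link_stats_mutex[link_id]);
	return 0;
}
```

Because a reader only ever holds the mutex of the link it is copying, a monitoring tool polling the stats contends with the traffic on that one link and no longer forces other work on the handle to restart.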

On my test environment with 2 nodes and one link, I notice less than a 0.1% performance hit (which is perfectly reasonable).

The patches have not been tested at scale, but I assume that the more nodes and links there are, the lower the lock contention on a per-link basis. Contention on the handle stats mutex should be very low (there are not many stats to collect across threads).

I have created PR #286 for master so that @chrissie-c can start a more extensive review of the changes and of the performance hit on different environments.

There is also a new temporary branch: https://github.com/kronosnet/kronosnet/tree/stable1-pmtud that contains the same patches, backported to stable1.

I would appreciate it if more people than just me and chrissie could extensively test those changes, as they are delicate in some areas.

pacemaker error!

Seeing that you are a corosync contributor: I have a problem where pacemaker can't start. Can you help me take a look?
crm status
error: Could not connect to launcher: Connection refused
crm_mon: Connection to cluster failed: Connection refused
ERROR: status: crm_mon (rc=102):
error: Could not connect to the CIB manager: Transport endpoint is not con>

[7273] (mcp_read_config) info: Could not connect to Corosync CMAP: CS_ERR_LIBRARY (retrying in 2s) | rc=2
Failed to initialize the cmap API. Error CS_ERR_LIBRARY
sbd -d /dev/disk/by-id/scsi-360014051fdfce608104470e930c688e2 create

tail -f /var/log/pacemaker/pacemaker.log
Oct 28 13:46:55 node-2 pacemakerd [12941] (crm_log_init) info: Changed active directory to /var/lib/pacemaker/cores
Oct 28 13:46:55 node-2 pacemakerd [12941] (ipc_post_disconnect) info: Disconnected from launcher IPC API
Oct 28 13:46:55 node-2 pacemakerd [12941] (mcp_read_config) info: Could not connect to Corosync CMAP: CS_ERR_LIBRARY (retrying in 1s) | rc=2
Oct 28 13:46:56 node-2 pacemakerd [12941] (mcp_read_config) info: Could not connect to Corosync CMAP: CS_ERR_LIBRARY (retrying in 2s) | rc=2
Oct 28 13:46:58 node-2 pacemakerd [12941] (mcp_read_config) info: Could not connect to Corosync CMAP: CS_ERR_LIBRARY (retrying in 3s) | rc=2
Oct 28 13:47:01 node-2 pacemakerd [12941] (mcp_read_config) info: Could not connect to Corosync CMAP: CS_ERR_LIBRARY (retrying in 4s) | rc=2
Oct 28 13:47:05 node-2 pacemakerd [12941] (mcp_read_config) info: Could not connect to Corosync CMAP: CS_ERR_LIBRARY (retrying in 5s) | rc=2
Oct 28 13:47:10 node-2 pacemakerd [12941] (mcp_read_config) crit: Could not connect to Corosync CMAP: CS_ERR_LIBRARY | rc=2
Oct 28 13:47:10 node-2 pacemakerd [12941] (crm_exit) info: Exiting pacemakerd | with status 69

My configuration file is as follows:
vim /etc/corosync/corosync.conf

# Please read the corosync.conf.5 manual page

totem {
version: 2

# Set name of the cluster
cluster_name: ExampleCluster
secauth: off
# crypto_cipher and crypto_hash: Used for mutual node authentication.
# If you choose to enable this, then do remember to create a shared
# secret with "corosync-keygen".
# enabling crypto_cipher, requires also enabling of crypto_hash.
# crypto works only with knet transport
crypto_cipher: none
crypto_hash: none
#transport:udpu

}
interface {
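ringnumber: 0 # ring number; if the host has multiple NICs, this keeps heartbeat traffic from converging
bindnetaddr: 60.60.60.0 # heartbeat network segment; corosync automatically determines which IP address configured on the local NICs belongs to this network and uses that interface to carry multicast heartbeat messages
mcastaddr: 226.94.1.1 # multicast address for heartbeat messages (must be identical on all nodes)
mcastport: 5405 # multicast port
ttl: 1 # only send multicast packets with a TTL of 1, to prevent loops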
}
logging {
# Log the source file and line where messages are being
# generated. When in doubt, leave off. Potentially useful for
# debugging.
fileline: off
# Log to standard error. When in doubt, set to yes. Useful when
# running in the foreground (when invoking "corosync -f")
to_stderr: yes
# Log to a log file. When set to "no", the "logfile" option
# must not be set.
to_logfile: yes
logfile: /var/log/cluster/corosync.log
# Log to the system log daemon. When in doubt, set to yes.
to_syslog: yes
# Log debug messages (very verbose). When in doubt, leave off.
debug: off
# Log messages with time stamps. When in doubt, set to hires (or on)
#timestamp: hires
logger_subsys {
subsys: QUORUM
debug: off
}
}

quorum {
# Enable and configure quorum subsystem (default: off)
# see also corosync.conf.5 and votequorum.5
provider: corosync_votequorum
}

nodelist {
# Change/uncomment/add node sections to match cluster configuration

node {
	# Hostname of the node
	name: node-1
	# Cluster membership node identifier
	nodeid: 1
	# Address of first link
	ring0_addr: node-1
	# When knet transport is used it's possible to define up to 8 links
	ring1_addr: 60.60.60.84
}
node {
	# Hostname of the node
	name: node-2
	# Cluster membership node identifier
	nodeid: 2
	# Address of first link
	ring0_addr: node-2
	# When knet transport is used it's possible to define up to 8 links
	ring1_addr: 60.60.60.119
}
# ...
service {
var: 0
name: pacemaker
}

}
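
The bindnetaddr comment in the interface block above says that corosync picks the local address that belongs to the configured network. A minimal sketch of that kind of membership check for IPv4 follows. The helper name addr_in_network and the /24 prefix length are assumptions made for illustration: the configuration does not state a netmask, and this is not corosync's own code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a corosync function.
 * Returns 1 if addr lies inside network/prefix_len, 0 if it does not,
 * and -1 if either string is not a valid IPv4 address or the prefix
 * length is outside 0..32.
 */
int addr_in_network(const char *addr, const char *network, int prefix_len)
{
	struct in_addr a, n;
	uint32_t mask;

	assert(addr != NULL && network != NULL);

	if (prefix_len < 0 || prefix_len > 32)
		return -1;
	if (inet_pton(AF_INET, addr, &a) != 1 ||
	    inet_pton(AF_INET, network, &n) != 1)
		return -1;

	/* Shifting a 32-bit value by 32 is undefined, so handle /0 separately. */
	mask = (prefix_len == 0) ? 0 : (0xFFFFFFFFu << (32 - prefix_len));

	/* Compare the network parts in host byte order. */
	return ((ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask)) ? 1 : 0;
}
```

Assuming a /24 network, addr_in_network("60.60.60.84", "60.60.60.0", 24) returns 1, so the ring1_addr of node-1 would fall inside the bindnetaddr network shown above.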

kronosnet build fails on x86_64 running Ubuntu 16.04.

Hi,

I am trying to build kronosnet from a clone of this repository. After cloning the code, I issued the following commands in order:

cd kronosnet
./autogen.sh
./configure
make

I am building this with gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609. This gives me the following errors (apparently C99 mode is not being set):

libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -I/usr/include/nss -I/usr/include/nspr -g -O2 -fPIC -DPIC -O3 -ggdb3 -Wall -Wshadow -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-align -Wbad-function-cast -Wmissing-format-attribute -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -Waddress -Wcpp -Woverflow -Wparentheses -Wsequence-point -Wswitch -Wuninitialized -Wunused-but-set-variable -Wunused-function -Wunused-result -Wunused-value -Wunused-variable -MT libknet_la-transport_sctp.lo -MD -MP -MF .deps/libknet_la-transport_sctp.Tpo -c transport_sctp.c -fPIC -DPIC -o .libs/libknet_la-transport_sctp.o
transport_sctp.c:684:2: error: unknown field 'handle_allocate' specified in initializer
.handle_allocate = sctp_handle_allocate,
^
transport_sctp.c:684:21: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.handle_allocate = sctp_handle_allocate,
^
transport_sctp.c:684:21: note: (near initialization for 'sctp_transport_ops.transport_name')
transport_sctp.c:685:2: error: unknown field 'handle_free' specified in initializer
.handle_free = sctp_handle_free,
^
transport_sctp.c:685:17: error: initialization makes integer from pointer without a cast [-Werror=int-conversion]
.handle_free = sctp_handle_free,
^
transport_sctp.c:685:17: note: (near initialization for 'sctp_transport_ops.transport_mtu_overhead')
transport_sctp.c:685:17: error: initializer element is not computable at load time
transport_sctp.c:685:17: note: (near initialization for 'sctp_transport_ops.transport_mtu_overhead')
transport_sctp.c:686:2: error: unknown field 'handle_fd_eof' specified in initializer
.handle_fd_eof = sctp_handle_fd_eof,
^
transport_sctp.c:686:19: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.handle_fd_eof = sctp_handle_fd_eof,
^
transport_sctp.c:686:19: note: (near initialization for 'sctp_transport_ops.transport_init')
transport_sctp.c:687:2: error: unknown field 'handle_fd_notification' specified in initializer
.handle_fd_notification = sctp_handle_fd_notification,
^
transport_sctp.c:687:28: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.handle_fd_notification = sctp_handle_fd_notification,
^
transport_sctp.c:687:28: note: (near initialization for 'sctp_transport_ops.transport_free')
transport_sctp.c:689:2: error: unknown field 'link_allocate' specified in initializer
.link_allocate = sctp_link_allocate,
^
transport_sctp.c:689:19: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.link_allocate = sctp_link_allocate,
^
transport_sctp.c:689:19: note: (near initialization for 'sctp_transport_ops.transport_link_set_config')
transport_sctp.c:690:2: error: unknown field 'link_listener_start' specified in initializer
.link_listener_start = sctp_link_listener_start,
^
transport_sctp.c:690:25: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.link_listener_start = sctp_link_listener_start,
^
transport_sctp.c:690:25: note: (near initialization for 'sctp_transport_ops.transport_link_clear_config')
transport_sctp.c:691:2: error: unknown field 'link_free' specified in initializer
.link_free = sctp_link_free,
^
transport_sctp.c:691:15: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.link_free = sctp_link_free,
^
transport_sctp.c:691:15: note: (near initialization for 'sctp_transport_ops.transport_rx_sock_error')
transport_sctp.c:692:2: error: unknown field 'link_get_mtu_overhead' specified in initializer
.link_get_mtu_overhead = sctp_link_get_mtu_overhead,
^
transport_sctp.c:692:27: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.link_get_mtu_overhead = sctp_link_get_mtu_overhead,
^
transport_sctp.c:692:27: note: (near initialization for 'sctp_transport_ops.transport_rx_is_data')
cc1: all warnings being treated as errors
Makefile:680: recipe for target 'libknet_la-transport_sctp.lo' failed
make[3]: *** [libknet_la-transport_sctp.lo] Error 1
make[3]: Leaving directory '/root/kronosnet/libknet'
Makefile:755: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/root/kronosnet/libknet'
Makefile:514: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/root/kronosnet'
Makefile:424: recipe for target 'all' failed
make: *** [all] Error 2

Have you seen similar issues lately? I tried passing "-std=c99/gnu99" and "-Dtypeof=typeof" options to gcc but that did not help.

Thanks,
Atul.
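
Each "unknown field" error in the log above comes with a note such as "near initialization for 'sctp_transport_ops.transport_name'". That suggests the struct gcc sees declares fields named transport_*, while transport_sctp.c initializes fields named handle_* and link_*. If so, the problem may be a struct definition and an initializer that do not match, rather than the C standard mode (gcc 5 defaults to gnu11, which accepts designated initializers). A minimal sketch of a matching pair follows; the trimmed-down struct and the demo_* and run_init names are hypothetical and only borrow three field names from the log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down ops table; field names taken from the
 * "near initialization" notes in the build log. */
struct transport_ops {
	const char *transport_name;
	int transport_mtu_overhead;
	int (*transport_init)(void);
};

/* Made-up init hook for the demo table. */
int demo_init(void)
{
	return 0;
}

/* Compiles: every designator names a field the struct declares. */
const struct transport_ops demo_ops = {
	.transport_name = "DEMO",
	.transport_mtu_overhead = 28,
	.transport_init = demo_init,
};

/*
 * Adding a designator for a field the struct does not declare, e.g.
 *     .handle_allocate = demo_init,
 * makes gcc report "unknown field 'handle_allocate' specified in
 * initializer", the same diagnostic as in the log.
 */

/* Calls the init hook through the table. */
int run_init(const struct transport_ops *ops)
{
	assert(ops != NULL && ops->transport_init != NULL);
	return ops->transport_init();
}
```

Under that reading, checking that the header defining the ops struct and transport_sctp.c come from the same revision of the tree would be a reasonable first step.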

new openssl in debian unstable causes make check-memcheck failures

Hey Feri,

just for your information, last week the CI nodes got a new version of openssl:

https://ci.kronosnet.org/job/update-all-apt-unstable/296/update-all-apt-unstable=debian-unstable-x86-64/console

08:12:10 Unpacking libssl1.1:amd64 (1.1.1c-1) over (1.1.1b-2) ...

that is causing havoc with valgrind:

https://ci.kronosnet.org/view/knet/job/knet-build-all-nonvoting/1108/knet-build-all-nonvoting=debian-unstable-x86-64/console

I am not sure whether those are openssl issues or whether we should just ignore them for the time being.

Please let me know how I should proceed from an upstream perspective.

Thanks
Fabio
