milannic / libevent_paxos
xinput set-button-map 11 1 0 3
License: MIT License
I've tried 1 proxy and 1 server on 1 node, and it works fine. However, when I run the experiment on 3 separate servers (bug00, bug01, bug02), the secondary nodes cannot start successfully. The experiment works fine on the main branch.
Will look into it. The following is the output from the secondary nodes.
real mode is opened.
1423898857.572204:CONSENSUS MODULE : Network Layer Initialization Failed.
1423898857.572239:PROXY : Cannot Initialize Consensus Component.
After we successfully deploy the 3 nodes (the clients received correct responses), we kill the leader and restart it in recovery mode instead of leader mode. However, the 2nd node unexpectedly exits after reconnecting to node 1.
In certain cases where node 2 is elected as the new leader, clients don't get a response.
In the first phase of leader election, in function leader_election_proposer_do(), there is a step where we need to write our own record to the database. However, the write fails because my_node->db_p is NULL for some reason.
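Until the root cause is found, a guard before the write would at least turn the failure into a log line. The following is only a sketch: `struct sketch_node`, `struct sketch_db`, and `check_db_ready` are hypothetical stand-ins, and only the field name db_p is taken from the description above.

```c
#include <stddef.h>
#include <stdio.h>

struct sketch_db;                 /* opaque stand-in for the real database handle */

struct sketch_node {              /* hypothetical, not the repo's node struct */
    int node_id;
    struct sketch_db *db_p;
};

/* Returns 0 when the database handle is usable, -1 otherwise. */
int check_db_ready(const struct sketch_node *node) {
    if (node == NULL || node->db_p == NULL) {
        fprintf(stderr, "node %d: db_p is NULL, skipping election record write\n",
                node ? node->node_id : -1);
        return -1;
    }
    return 0;
}
```

Calling this at the top of the election step would show in the logs whether db_p was never set or was cleared later.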
On the new code branch with leader election, the program segfaults right after it starts.
Settings:
Using mongoose server, on my laptop.
[start the three mongoose servers and the three nodes ---> sleep 5 ---> client1: ab -n 100 -c 1 ---> CRIU dumps the server.out (i.e. the proxy-consensus process) of Node 2 ---> kill -SIGCONT the server.out of Node 2 (after the dump, CRIU leaves server.out stopped, so we have to wake it up by hand) ---> client2: ab -n 100 -c 1 ---> sleep 5 ---> kill all, the end].
From the logs of Node 2, we can see that it didn't receive or handle any requests after the dump. And if you add 'sleep 5' after the 'kill -SIGCONT' above, Node 2 reports:
1425565861.252845:Node 2 Haven't Heard From The Leader
1425565861.252908:Node 2 Lost Connection With The Leader
1425565861.252923:Node 2 Will Start A Leader Election
According to Paxos Made Practical, the first phase of leader election is proposing a new view. The current implementation skips this phase, so only two nodes take part in leader election. The restarted node is locked out because its view id does not match. Since two nodes can still reach consensus, I'll skip this problem for now.
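For reference, the acceptance rule of the skipped view-proposal phase can be sketched roughly as follows. This is a minimal sketch, assuming a view id is a (counter, proposer id) pair compared lexicographically, which is how I read Paxos Made Practical. `sketch_view_id`, `view_id_cmp`, and `on_view_change_proposal` are hypothetical names, not identifiers from this repo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view id: ordered by counter first, then by proposer id. */
typedef struct {
    uint32_t counter;
    uint32_t proposer_id;
} sketch_view_id;

/* Returns -1, 0, or 1 like strcmp. */
static int view_id_cmp(sketch_view_id a, sketch_view_id b) {
    if (a.counter != b.counter) return a.counter < b.counter ? -1 : 1;
    if (a.proposer_id != b.proposer_id) return a.proposer_id < b.proposer_id ? -1 : 1;
    return 0;
}

/* Accept a proposed view only if it is newer than anything seen so far. */
bool on_view_change_proposal(sketch_view_id *highest_seen, sketch_view_id proposed) {
    assert(highest_seen != NULL);
    if (view_id_cmp(proposed, *highest_seen) <= 0) {
        return false;             /* stale or duplicate proposal: reject */
    }
    *highest_seen = proposed;     /* remember the newest view we agreed to */
    return true;
}
```

Under this rule, a restarted node holding an old view id would adopt the newer view on the first proposal it sees instead of staying locked out.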
When a leader election begins on a node, leader_election_proposer_do() sends several Paxos phase 1 (prepare) messages to the other nodes. However, none of the nodes receives these leader election messages, and none of them gets a response back from the others.
In function replica_on_read(), the node begins to read the input buffer when the content size exceeds SYS_MSG_HEADER_SIZE.
It first uses
evbuffer_copyout(input,buf,SYS_MSG_HEADER_SIZE);
int data_size = buf->data_size;
to peek at the message header. Then it reads the buffer message by message.
However, for some unknown reason, the messages in the input buffer get corrupted. In this case, data_size is set to an extremely large number, and the node can't read any incoming messages at all. The following log snippet shows how the messages pile up.
1424226169.440681:Node 1 Received Consensus Message
1424226170.441590:Enter Consensus Communication Module.
1424226170.441603:There Is 80 Bytes Data In The Buffer In Total.
1424226170.441608:data_size is 4093640872.
1424226171.431029:Connection refused (2)
1424226171.442083:Enter Consensus Communication Module.
1424226171.442097:There Is 160 Bytes Data In The Buffer In Total.
1424226171.442103:data_size is 4093640872.
1424226172.443214:Enter Consensus Communication Module.
1424226172.443227:There Is 240 Bytes Data In The Buffer In Total.
1424226172.443233:data_size is 4093640872.
1424226173.432619:Connection refused (2)
1424226173.443750:Enter Consensus Communication Module.
1424226173.443764:There Is 320 Bytes Data In The Buffer In Total.
1424226173.443769:data_size is 4093640872.
1424226174.444884:Enter Consensus Communication Module.
1424226174.444903:There Is 400 Bytes Data In The Buffer In Total.
1424226174.444909:data_size is 4093640872.
1424226175.035923:A New Connection Is Established.
1424226175.433731:Connected to Node 2
1424226175.445026:Enter Consensus Communication Module.
1424226175.445040:There Is 480 Bytes Data In The Buffer In Total.
1424226175.445047:data_size is 4093640872.
1424226176.446142:Enter Consensus Communication Module.
1424226176.446156:There Is 560 Bytes Data In The Buffer In Total.
1424226176.446162:data_size is 4093640872.
1424226177.447262:Enter Consensus Communication Module.
1424226177.447276:There Is 640 Bytes Data In The Buffer In Total.
1424226177.447283:data_size is 4093640872.
1424226178.448353:Enter Consensus Communication Module.
1424226178.448367:There Is 720 Bytes Data In The Buffer In Total.
1424226178.448373:data_size is 4093640872.
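One defensive option is to bound-check the length field right after peeking the header, so a corrupted header is rejected instead of making the reader wait for bytes that never arrive. Below is a minimal sketch on a plain byte buffer (no libevent), assuming a header made of a type field plus a 32-bit data_size. `sketch_hdr`, `SKETCH_MAX_PAYLOAD`, and `peek_frame` are hypothetical, and the real header layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {                       /* hypothetical header layout */
    uint32_t msg_type;
    uint32_t data_size;                /* payload length in bytes */
} sketch_hdr;

#define SKETCH_HDR_SIZE    sizeof(sketch_hdr)
#define SKETCH_MAX_PAYLOAD 65536u      /* assumed upper bound for one message */

typedef enum { FRAME_NEED_MORE, FRAME_READY, FRAME_CORRUPT } frame_status;

/* Inspect the first bytes of buf without consuming them. */
frame_status peek_frame(const unsigned char *buf, size_t avail, size_t *frame_len) {
    sketch_hdr hdr;
    assert(buf != NULL && frame_len != NULL);
    if (avail < SKETCH_HDR_SIZE) return FRAME_NEED_MORE;
    memcpy(&hdr, buf, sizeof hdr);     /* copy out to avoid unaligned access */
    if (hdr.data_size > SKETCH_MAX_PAYLOAD) return FRAME_CORRUPT;
    if (avail < SKETCH_HDR_SIZE + hdr.data_size) return FRAME_NEED_MORE;
    *frame_len = SKETCH_HDR_SIZE + hdr.data_size;
    return FRAME_READY;
}
```

With the value from the log above, 4093640872, `peek_frame` returns `FRAME_CORRUPT`, and the caller could then drop the connection or drain the buffer. This only contains the damage; it does not explain where the corruption comes from.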
After an intensive two weeks of debugging, the leader election module works (although it's still fragile). Basically, I first set up 3 nodes and send normal requests to check that the client correctly receives the responses. Then I kill the leader (bug00). Then I have the client send requests to bug01 (the new leader is bug02) to see whether it still receives responses.
The following is the client-side output. (Stderr output such as "write to fake read" has been removed.)
[1] 16:49:39 [SUCCESS] bug00.cs.columbia.edu
[1] 16:49:45 [SUCCESS] bug01.cs.columbia.edu
[2] 16:49:45 [SUCCESS] bug02.cs.columbia.edu
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 128.59.17.171 (be patient).....done
Server Software: Apache/2.4.10
Server Hostname: 128.59.17.171
Server Port: 9000
Document Path: /
Document Length: 45 bytes
Concurrency Level: 10
Time taken for tests: 0.018 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 2890 bytes
HTML transferred: 450 bytes
Requests per second: 550.06 #/sec
Time per request: 18.180 ms
Time per request: 1.818 [ms] (mean, across all concurrent requests)
Transfer rate: 155.24 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.1 1 1
Processing: 16 17 0.3 17 17
Waiting: 9 10 0.8 10 11
Total: 17 17 0.4 17 18
Percentage of the requests served within a certain time (ms)
50% 17
66% 18
75% 18
80% 18
90% 18
95% 18
98% 18
99% 18
100% 18 (longest request)
[1] 16:49:55 [SUCCESS] bug00.cs.columbia.edu
Restart Proxy
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 128.59.17.172 (be patient).....done
Server Software: Apache/2.4.10
Server Hostname: 128.59.17.172
Server Port: 9001
Document Path: /
Document Length: 45 bytes
Concurrency Level: 10
Time taken for tests: 0.022 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 2890 bytes
HTML transferred: 450 bytes
Requests per second: 456.75 #/sec
Time per request: 21.894 ms
Time per request: 2.189 [ms] (mean, across all concurrent requests)
Transfer rate: 128.91 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.1 1 1
Processing: 20 21 0.3 21 21
Waiting: 18 18 0.1 18 18
Total: 21 21 0.3 21 22
Percentage of the requests served within a certain time (ms)
50% 21
66% 21
75% 22
80% 22
90% 22
95% 22
98% 22
99% 22
100% 22 (longest request)
The current implementation uses uint32_t in many places, for example node_id_t, content, etc. However, many of these are initialized with the value -1, which causes many small logic problems. I'm not sure if this is intended. I've changed some of them to int64_t. This is just a temporary hack. Ideally, we should avoid using -1 altogether.
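To make the pitfall concrete: storing -1 in a uint32_t yields 4294967295, so a later check against -1 on a widened value silently fails. The sketch below uses a named sentinel instead. `sketch_node_id`, `SKETCH_NODE_ID_NONE`, and the helper functions are hypothetical and only illustrate the idea.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t sketch_node_id;               /* stand-in for node_id_t */
#define SKETCH_NODE_ID_NONE UINT32_MAX         /* explicit "unset" sentinel */

/* Pitfall: once widened to int64_t the value is 4294967295, not -1. */
bool widened_equals_minus_one(sketch_node_id id) {
    int64_t wide = id;
    return wide == -1;
}

/* Safer: compare against the named sentinel in the original type. */
bool node_id_is_unset(sketch_node_id id) {
    return id == SKETCH_NODE_ID_NONE;
}
```

For an id initialized with -1, `widened_equals_minus_one` returns false while `node_id_is_unset` returns true, which is the kind of small logic problem that shows up when only part of the code is switched to int64_t.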