
milannic / libevent_paxos


xinput set-button-map 11 1 0 3

License: MIT License

Makefile 3.02% C 85.51% HTML 0.83% JavaScript 0.46% CSS 0.28% Perl 3.58% Shell 5.31% PHP 0.02% Python 0.50% C++ 0.48%

libevent_paxos's People

Contributors

milannic, ruigulala, wanzc12345

libevent_paxos's Issues

Secondary Nodes Can't Start

I've tried 1 proxy and 1 server on 1 node, and it works fine. However, when I run the experiment on 3 separate servers (bug00, bug01, bug02), the secondary nodes cannot start successfully. The experiments work fine on the main branch.
Will look into it. The following is the output from the secondary nodes.
real mode is opened.
1423898857.572204:CONSENSUS MODULE : Network Layer Initialization Failed.
1423898857.572239:PROXY : Cannot Initialize Consensus Component.

Can't Write to Database

In the first phase of leader election, in function leader_election_proposer_do(), there is a step where we need to write our own record to the database. However, the write fails because my_node->db_p is NULL for some reason.

Node 2 cannot continue to participate in the PAXOS transaction after CRIU dumping

Using the mongoose server, on my laptop.

[start the three mongoose servers and the three nodes ---> sleep 5 ---> client1: ab -n 100 -c 1 ---> CRIU dumps the server.out (i.e. the proxy-consensus process) of Node 2 ---> kill -SIGCONT the server.out of Node 2 (after the dump, CRIU leaves server.out stopped, so we have to wake it up by hand) ---> client2: ab -n 100 -c 1 ---> sleep 5 ---> kill all, the end].

The logs of Node 2 show that it did not receive or handle any requests after the dump. If you add 'sleep 5' after the 'kill -SIGCONT' above, Node 2 reports:
1425565861.252845:Node 2 Haven't Heard From The Leader
1425565861.252908:Node 2 Lost Connection With The Leader
1425565861.252923:Node 2 Will Start A Leader Election

Missing View Proposing Phase

According to Paxos Made Practical, the first phase of leader election is proposing a new view. The current implementation skips this phase. As a result, only two nodes take part in leader election under the current implementation: the restarted node is shut out because its view id does not match. Since two nodes can still reach consensus, I'll skip this problem for now.
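The reasoning above can be sketched in C as follows. This is only an illustration under stated assumptions: `majority()` and `accept_message()` are hypothetical helpers, not functions from the repository, and the real view check may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest number of nodes that forms a majority of group_size. */
static uint32_t majority(uint32_t group_size)
{
    return group_size / 2 + 1;
}

/*
 * Hypothetical receiver-side check: a message is only processed when it
 * carries the same view id as the local node. A restarted node that never
 * learned the new view fails this check on every message.
 */
static bool accept_message(uint32_t local_view_id, uint32_t msg_view_id)
{
    return local_view_id == msg_view_id;
}
```

With a group of three, `majority(3)` is 2, which is why the two remaining nodes can still make progress while the restarted node stays locked out.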

Nodes Can't Receive Leader Election Message

When a leader election event begins on a node, in function leader_election_proposer_do(), the node sends out several Paxos phase 1 messages (prepare messages) to the other nodes. However, none of the nodes receive these leader election messages, and none of the nodes get a response back from the other nodes.

Input Buffer Corruption

In function replica_on_read(), the node begins to read the input buffer once the buffered content exceeds SYS_MSG_HEADER_SIZE bytes.
It first uses
evbuffer_copyout(input,buf,SYS_MSG_HEADER_SIZE);
int data_size = buf->data_size;
to peek at the message header. Then it reads the buffer message by message.
However, for some unknown reason, the messages in the input buffer get corrupted. In this case, data_size is set to an extremely large number, and the node can't read any incoming messages at all. The following log snippet shows how the messages pile up.

1424226169.440681:Node 1 Received Consensus Message
1424226170.441590:Enter Consensus Communication Module.
1424226170.441603:There Is 80 Bytes Data In The Buffer In Total.
1424226170.441608:data_size is 4093640872.
1424226171.431029:Connection refused (2)
1424226171.442083:Enter Consensus Communication Module.
1424226171.442097:There Is 160 Bytes Data In The Buffer In Total.
1424226171.442103:data_size is 4093640872.
1424226172.443214:Enter Consensus Communication Module.
1424226172.443227:There Is 240 Bytes Data In The Buffer In Total.
1424226172.443233:data_size is 4093640872.
1424226173.432619:Connection refused (2)
1424226173.443750:Enter Consensus Communication Module.
1424226173.443764:There Is 320 Bytes Data In The Buffer In Total.
1424226173.443769:data_size is 4093640872.
1424226174.444884:Enter Consensus Communication Module.
1424226174.444903:There Is 400 Bytes Data In The Buffer In Total.
1424226174.444909:data_size is 4093640872.
1424226175.035923:A New Connection Is Established.
1424226175.433731:Connected to Node 2
1424226175.445026:Enter Consensus Communication Module.
1424226175.445040:There Is 480 Bytes Data In The Buffer In Total.
1424226175.445047:data_size is 4093640872.
1424226176.446142:Enter Consensus Communication Module.
1424226176.446156:There Is 560 Bytes Data In The Buffer In Total.
1424226176.446162:data_size is 4093640872.
1424226177.447262:Enter Consensus Communication Module.
1424226177.447276:There Is 640 Bytes Data In The Buffer In Total.
1424226177.447283:data_size is 4093640872.
1424226178.448353:Enter Consensus Communication Module.
1424226178.448367:There Is 720 Bytes Data In The Buffer In Total.
1424226178.448373:data_size is 4093640872.
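In the log, the buffer grows by 80 bytes per second while data_size stays at the same huge value, so the reader keeps waiting for a message that will never be complete. One possible mitigation is to validate the length field before waiting for the payload. The sketch below works on a plain byte buffer rather than an evbuffer; `msg_header`, `MAX_PAYLOAD_SIZE`, and `peek_frame()` are assumptions for illustration, and the real header layout in the repository may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout; the project's real header may differ. */
typedef struct {
    uint32_t type;
    uint32_t data_size;   /* payload length in bytes */
} msg_header;

#define MSG_HEADER_SIZE  sizeof(msg_header)
#define MAX_PAYLOAD_SIZE (1u << 20)   /* assumed upper bound: 1 MiB */

typedef enum { FRAME_READY, FRAME_INCOMPLETE, FRAME_CORRUPT } frame_status;

/*
 * Inspect the front of a receive buffer without consuming anything.
 * On FRAME_READY, *frame_len holds the total length (header + payload).
 */
static frame_status peek_frame(const unsigned char *buf, size_t len,
                               size_t *frame_len)
{
    msg_header hdr;

    assert(buf != NULL && frame_len != NULL);
    if (len < MSG_HEADER_SIZE)
        return FRAME_INCOMPLETE;

    /* Copy the header out instead of casting, to avoid unaligned reads. */
    memcpy(&hdr, buf, MSG_HEADER_SIZE);

    /* An implausible length means the stream is out of sync or corrupted. */
    if (hdr.data_size > MAX_PAYLOAD_SIZE)
        return FRAME_CORRUPT;

    *frame_len = MSG_HEADER_SIZE + hdr.data_size;
    return len >= *frame_len ? FRAME_READY : FRAME_INCOMPLETE;
}
```

On FRAME_CORRUPT the caller could, for example, discard the buffered bytes with evbuffer_drain() or close and re-establish the connection, instead of stalling. This only contains the symptom; it does not explain why the header was corrupted in the first place.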

Leader Election Works!

After two weeks of intensive debugging, the leader election module works (although it's still fragile). Basically, I first set up 3 nodes and send normal requests to check that the client correctly receives the response. Then I kill the leader (bug00). Then I have the client send requests to bug01 (the new leader is bug02) to see whether it receives the response again.
The following is the client-side output. (Stderr output such as "write to fake read" has been removed.)
[1] 16:49:39 [SUCCESS] bug00.cs.columbia.edu

[1] 16:49:45 [SUCCESS] bug01.cs.columbia.edu
[2] 16:49:45 [SUCCESS] bug02.cs.columbia.edu

This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 128.59.17.171 (be patient).....done

Server Software: Apache/2.4.10
Server Hostname: 128.59.17.171
Server Port: 9000

Document Path: /
Document Length: 45 bytes

Concurrency Level: 10
Time taken for tests: 0.018 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 2890 bytes
HTML transferred: 450 bytes
Requests per second: 550.06 #/sec
Time per request: 18.180 ms
Time per request: 1.818 [ms] (mean, across all concurrent requests)
Transfer rate: 155.24 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.1 1 1
Processing: 16 17 0.3 17 17
Waiting: 9 10 0.8 10 11
Total: 17 17 0.4 17 18

Percentage of the requests served within a certain time (ms)
50% 17
66% 18
75% 18
80% 18
90% 18
95% 18
98% 18
99% 18
100% 18 (longest request)

[1] 16:49:55 [SUCCESS] bug00.cs.columbia.edu
Restart Proxy

This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 128.59.17.172 (be patient).....done

Server Software: Apache/2.4.10
Server Hostname: 128.59.17.172
Server Port: 9001

Document Path: /
Document Length: 45 bytes

Concurrency Level: 10
Time taken for tests: 0.022 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 2890 bytes
HTML transferred: 450 bytes
Requests per second: 456.75 #/sec
Time per request: 21.894 ms
Time per request: 2.189 [ms] (mean, across all concurrent requests)
Transfer rate: 128.91 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.1 1 1
Processing: 20 21 0.3 21 21
Waiting: 18 18 0.1 18 18
Total: 21 21 0.3 21 22

Percentage of the requests served within a certain time (ms)
50% 21
66% 21
75% 22
80% 22
90% 22
95% 22
98% 22
99% 22
100% 22 (longest request)

Inconsistency between type and logic

The current implementation uses uint32_t in many places, for example node_id_t, content, etc. However, many of these variables are initialized with the value -1, which an unsigned type cannot represent. This causes many small logic problems. I'm not sure if this is intended. I've changed some of them to int64_t, but this is just a temporary hack. Ideally, we should avoid using -1 altogether.
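To make the pitfall concrete: assigning -1 to a uint32_t wraps to UINT32_MAX, so a check such as `id >= 0` is always true and `id == -1` only works through implicit conversion. A minimal sketch of an alternative that keeps the unsigned type is shown below; `NODE_ID_NONE`, `node_id_is_set()`, and `node_id_is_valid()` are hypothetical names, not identifiers from the repository.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t node_id_t;   /* unsigned, as described in the issue */

/* Hypothetical named sentinel that replaces the bare -1. */
#define NODE_ID_NONE ((node_id_t)UINT32_MAX)

/* What "initialize with -1" actually stores in an unsigned 32-bit variable. */
static node_id_t legacy_unset_id(void)
{
    return (node_id_t)-1;   /* wraps to UINT32_MAX */
}

static bool node_id_is_set(node_id_t id)
{
    return id != NODE_ID_NONE;
}

/* Range check that also rejects the sentinel, since it exceeds any group size. */
static bool node_id_is_valid(node_id_t id, uint32_t group_size)
{
    return id < group_size;
}
```

Comparing against a named constant keeps the existing 32-bit layout, whereas widening to int64_t changes the size of every structure or message that embeds the field.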
