robustirc / robustirc

RobustIRC - an IRC network without netsplits, implemented in Go using the Raft consensus algorithm

Home Page: https://robustirc.net/

License: BSD 3-Clause "New" or "Revised" License

Languages: Go 96.19%, HTML 3.10%, Dockerfile 0.32%, Makefile 0.25%, Shell 0.14%
Topics: go, golang, irc, raft, raft-consensus-algorithm

Contributors: danielkim802, merovius, slymas, stapelberg

robustirc's Issues

Investigate why NickServ’s RECOVER doesn’t communicate nick changes properly

From rami’s session:

PRIVMSG NickServ :<privacy filtered>
1428334505393258445.1   2015-04-06 15:35:05 +00:00   :NickServ!services@services NOTICE rami_ :Ghost with your nick has been killed.
1428334505407942600.1   2015-04-06 15:35:05 +00:00   :rami_!rami@robust/0x13d2769f2cfd31dd NICK :rami

In my session:

1428334458460164545.1   2015-04-06 15:34:18 +00:00   :rami_!rami@robust/0x13d2769f2cfd31dd JOIN :#chaos-hd
1428334469592251793.0   2015-04-06 15:34:29 +00:00   PING keepalive
1428334469592251793.1   2015-04-06 15:34:29 +00:00   :robustirc.net PONG keepalive
1428334505654522390.1   2015-04-06 15:35:05 +00:00   :ChanServ!services@services MODE #chaos-hd +o rami

It seems like the NICK message coming from services is not relayed properly, and possibly not processed properly.

Look into making crypto/tls faster

With the throughput program from robustirc/benchmark, we can easily create a scenario in which the bottleneck is crypto/cipher/(*gcm).mul (use throughput -sessions=250 -channels=1, leading to at least 500 HTTPS connections).

Looking at go/src/crypto/cipher/gcm.go, the function does multiplication in GF(2^128). According to http://en.wikipedia.org/wiki/Galois/Counter_Mode, the currently fastest implementation which does not use hardware acceleration is a timing-resistant bitslicing implementation, see also http://www.chesworkshop.org/ches2009/presentations/01_Session_1/CHES2009_ekasper.pdf

It seems non-trivial to replace Go’s mul() (which in itself is relatively short) with the bitslicing implementation (which seems rather long, but perhaps I’m looking at the wrong thing).

A further optimization would be to use AES-NI together with the carry-less multiplication instruction (PCLMULQDQ), see https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf. This should lead to another 2x speed-up on CPUs which support it.

A workaround until this is done could be to use stunnel, which uses OpenSSL, which has AES-NI support. Placing stunnel in front of RobustIRC for incoming connections is doable, but that still leaves us with the outgoing connections that are not accelerated.

Run a clean benchmark

I.e. with 3 servers + a separate server for the bridge + a separate server for the benchmark program.

Perhaps we can spin up a couple of rackspace VMs to do this.

Track and expose network latency statistics

It’s useful to have a good idea of the latency between nodes to configure the various raft timeouts properly.

We should track and expose (in the web interface, and possibly one of the STATS IRC commands later) the average latency from each node to all others.

Make rate limiting exponential

Currently, the cooloff for rate limiting is static, but I think it’d be a better experience if it ramped up slowly. I.e., it starts at 2^0 = 1ms, then goes to 2^1 = 2ms, 2^2 = 4ms, 2^3 = 8ms, etc. until it reaches PostMessageCooloff. That way, the common case (regular users) is faster than now, while the worst case (spam bots) still gets throttled eventually.

transport: monitor http errors

Currently we only monitor the round-trip time, but we should also monitor how many errors of a certain type we receive (i.e. HTTP status codes and network errors).

Investigate a mechanism to run a RobustIRC canary infrastructure before network migrations

Before big networks migrate, they’ll want to make sure the performance of RobustIRC is good enough to handle their traffic. It’d be good to have a way to send regular IRC traffic into RobustIRC. Possibly we can come up with a server program that can be linked into a regular IRC network which then translates commands from server-to-server (from IRC network) to client-to-server (for RobustIRC).

Implement the MAP command

The MAP command should return the configured peers and when they were last successfully contacted.

Before implementing this, we need to make sure that the result is the same when replaying a snapshot, i.e. we should probably handle the peerschanged raft messages and persist them in the ircstore. Note that only the number of peers needs to stay the same (so that the message/reply ids are consistent between runs), their last successful contact timestamp doesn’t matter.

We should force a PING/PONG exchange before accepting a session

Currently, sending a NICK message is enough to get a valid session. This is bad as this allows senders with spoofed IP addresses to participate in our network. We should not send the 001 reply until we have seen one valid PING/PONG exchange, i.e. we send PING with a random challenge, the client answers with a corresponding PONG.

UnrealIRCd does this as well, so see telnet irc.twice-irc.de 6667 for details ;).
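A minimal sketch of the challenge part, assuming a hex-encoded random token and a simple two-field PONG line. The function names are hypothetical, and real IRC message parsing (prefixes, extra parameters) is more involved than what is shown here.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// newChallenge returns a random token to send to the client as "PING :<token>".
func newChallenge() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// validPong reports whether line is a PONG that carries the expected challenge.
// Only the simple form "PONG :<token>" or "PONG <token>" is handled here.
func validPong(line, challenge string) bool {
	fields := strings.Fields(line)
	if len(fields) != 2 || !strings.EqualFold(fields[0], "PONG") {
		return false
	}
	got := strings.TrimPrefix(fields[1], ":")
	return subtle.ConstantTimeCompare([]byte(got), []byte(challenge)) == 1
}

func main() {
	challenge, err := newChallenge()
	if err != nil {
		panic(err)
	}
	fmt.Println("server sends: PING :" + challenge)
	fmt.Println("matching PONG accepted:", validPong("PONG :"+challenge, challenge))
	fmt.Println("wrong PONG accepted:", validPong("PONG :guess", challenge))
}
```

A client with a spoofed source address never sees the PING and therefore cannot produce the matching PONG, so the 001 reply would be withheld.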

Figure out how the ircOutput slice should be influenced by compaction

Currently, we only ever grow the ircOutput slice.

ircserver.ClearState must kill all running GetMessages requests (and forbid new ones), then reset ircOutput, then allow new GetMessages requests.

I’m not entirely sure when this should happen with regard to taking snapshots, see also the comment in robustirc.go about deleting entries from ircstore. I suppose we can either:

  1. Delete specific entries from ircOutput when deleting them from ircstore.
  2. Delete ircOutput entirely and reprocess a snapshot.

Not sure yet which one is more expensive.

Fix issues related to the 2015-04-02 14:32 CEST deadlock

Symptom: clients disconnected after a while, /debug/pprof/goroutine reveals that more than 70 goroutines are stuck on acquiring the applyMu mutex.

From the log on alp:

I0402 12:32:55.480121       1 api.go:97] Proxying request ("/robustirc/v1/0x13ced24bfa07d924/message") to leader "ridcully.robustirc.net:60667"
I0402 12:32:55.494351       1 api.go:97] Proxying request ("/robustirc/v1/0x13cad43f066bfec3/message") to leader "ridcully.robustirc.net:60667"
I0402 12:32:55.508024       1 robustirc.go:358] Apply(msg.Type=irc_from_client)
I0402 12:32:55.510380       1 server.go:1775] http: panic serving 172.17.42.1:40364: Assumption violated: current time 1427977975510311504 is older than the timestamp of the last processed message (1427977975515627970)
goroutine 3075187 [running]:
net/http.func·011()
        /home/michael/go/src/net/http/server.go:1130 +0xbb
github.com/robustirc/robustirc/ircserver.(*IRCServer).NewRobustMessage(0xc208354000, 0x2, 0x13ced089c4d38cbf, 0x0, 0xc221be8d80, 0xe, 0x0)
        /home/michael/gocode/src/github.com/robustirc/robustirc/ircserver/ircserver.go:239 +0x247
main.handlePostMessage(0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0, 0xc211856f60, 0x1, 0x1)
        /home/michael/gocode/src/github.com/robustirc/robustirc/api.go:184 +0x717
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc21ab040c0, 0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0)
        /home/michael/gocode/src/github.com/julienschmidt/httprouter/router.go:293 +0x18e
net/http.(*ServeMux).ServeHTTP(0xc20800ca80, 0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0)
        /home/michael/go/src/net/http/server.go:1541 +0x17d
net/http.serverHandler.ServeHTTP(0xc21bbc1500, 0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0)
        /home/michael/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc21f1695e0)
        /home/michael/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
        /home/michael/go/src/net/http/server.go:1751 +0x35e
I0402 12:32:55.613561       1 robustirc.go:358] Apply(msg.Type=irc_from_client)

There are two problems to be addressed:

  1. panic() calls in HTTP handlers should lead to RobustIRC exiting. That fixes the deadlock itself.
  2. Possibly we should try to avoid calling NewRobustMessage on non-leaders in the first place. Still thinking about that.

don’t handle GetMessages until all messages are replayed when recovering from snapshot

A user reported that some messages were received twice when we did a rolling restart of the network. My suspicion is that the bridge sent GetMessages with a lastseen that was newer than what the server had in the outputstream, so as the server recovered, the bridge would get new messages.

We should verify that this can currently happen and then come up with a fix.

list doesn't seem to show all channels sometimes

Yesterday, I was invited (by someone messaging me) to join #zhn-kino on robustirc.
Yet, issuing /list didn't show the channel.
I was nevertheless able to connect to it just a few minutes later, and issuing /list again a quarter hour later showed the channel…
The screenshot below shows the output of the weechat "status" window for the two calls to /list.
[Screenshot: 2015-04-03 23-44-03]

How can we catch messages of death?

In case an IRC message crashes RobustIRC, that’s a serious problem, because it will crash the entire network at once.

I’m not yet sure how we could best catch messages of death. Three ideas come to mind:

  1. Before calling node.Apply() on the leader, already apply the message, so that only the leader will crash, but the message will not have been persisted, hence other nodes will still be alive. Ideally, in the HTTP handler, we’d even return an HTTP 404 Not Found status code, so that the client will not retry. The problem with this approach is that it might violate the raft ordering, I think. Perhaps this needs to be solved on the raft level instead?
  2. Perhaps we can remove entries out of the raft log, if all nodes do so in a consistent way? I.e., when a node realizes (somehow) that it crashes when restoring the log when starting, it could remove the latest entry from the log and retry. That way, all nodes would automatically remove the offending entry.
  3. The most costly way is to duplicate the entire IRC server state on each incoming message, apply the message on the state copy and see whether it worked there. This may be feasible (or a last resort) in case we restrict it to messages other than the well-tested, trivial ones such as PRIVMSG. The bulk of traffic should be PRIVMSG.
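The third idea can be sketched with a toy state type: the message is applied to a copy, and a panic is turned into an error via recover. The state type, its apply method, and the empty message that triggers the panic are stand-ins, not the actual IRC server state.

```go
package main

import "fmt"

// state is a toy stand-in for the IRC server state (hypothetical).
type state struct {
	nicks map[string]bool
}

// clone returns a deep copy of s.
func (s *state) clone() *state {
	c := &state{nicks: make(map[string]bool, len(s.nicks))}
	for k, v := range s.nicks {
		c.nicks[k] = v
	}
	return c
}

// apply is a toy message handler. It panics on an empty message to
// simulate a message of death.
func (s *state) apply(msg string) {
	if msg == "" {
		panic("empty message")
	}
	s.nicks[msg] = true
}

// tryApply applies msg to a copy of s and returns an error if processing
// panics. The original state s is never modified.
func tryApply(s *state, msg string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("message of death: %v", r)
		}
	}()
	s.clone().apply(msg)
	return nil
}

func main() {
	s := &state{nicks: make(map[string]bool)}
	fmt.Println("harmless message:", tryApply(s, "alice"))
	fmt.Println("bad message:", tryApply(s, ""))
}
```

Only messages that pass tryApply would then be handed to raft. The cost is one full state copy per checked message, which is why restricting the check to less-tested commands matters.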

Implement a test program for verifying robustness

I think such a program could work the following way:

  1. It uses robustirc-localnet to bring up a new local network.
  2. It directly uses github.com/robustirc/bridge/robustsession instead of talking to the bridge so that it can call GetMessages without specifying lastseen, thereby getting the entire IRC output log and not just new messages. This is important for verifying that old messages are restored properly from snapshots/the log.
  3. It connects to IRC and then does all of this at the same time:
    • after a random time interval in range [0, 1s] it sends a message to IRC
    • after a different random time interval in range [0, 1s] it kills a random number of servers.
  4. After killing the server(s), it should make a copy of their state directory so that one can reproduce potential problems.
  5. Then, it should restart the server(s) and verify that the network becomes consistent again and that every server returns the same IRC log as response to the GetMessages request.
  6. Goto 3

Before doing releases, we could then leave this running for a day or so to make sure that the new release is as robust as the old one.

In a later version, it’d be cool to use fault injection to simulate other failures, such as a full disk, or a failing write.

irc: failed to parse command "PRIVMSG"

• Steps to reproduce:

  1. Run robustirc-localnet
  2. Start weechat instance, connect to localhost
  3. weechat: /join #test
  4. Open second terminal, telnet localhost 6667
  5. telnet: NICK foobar
  6. telnet: JOIN #test
  7. telnet: PRIVMSG #test foobar

• What happens:
You receive in the weechat buffer for the server the following:
22:03:01 localhost =!= | irc: too few arguments received from IRC server for command "privmsg"
| (received: 3 arguments, expected: at least 4)
22:03:01 localhost =!= | irc: failed to parse command "PRIVMSG" (please report to developers):
22:03:01 localhost =!= | :foobar!robust@robust/0x13b9119ff29dc0b0 PRIVMSG #test

• What should happen:
Maybe you should get these messages in the telnet window? Certainly not every user in a channel needs to be alerted that one of them is misbehaving.

make leveldb compression configurable

It uses snappy by default, but for large networks it may be worthwhile to exchange more disk usage for lower CPU usage and hence more messages/s. From the code perspective, this is as simple as using:

    db, err := leveldb.OpenFile(dir, &opt.Options{
        Compression: opt.NoCompression,
    })

But we should make that a flag (as for small networks compression is good) and figure out + document whether one can switch from no compression to compression, or whether leveldb doesn’t support that.

implement a config file

This config file should contain e.g.:

  • throttling delay (currently in a flag)
  • session timeout (currently not configurable)
  • ACLs (who can be an irc operator)
  • (network-wide bans) (not yet implemented)

The config file will be stored in raft in its entirety, i.e. not just mutations.

A new tool can then be used like this:

robustirc-editconfig -network=robustirc.net -network_password=…

It calls the /getconfig API method, starts $EDITOR, calls /putconfig if the editor successfully terminates.

In case there is an edit conflict (i.e. the revision number in the file is not the revision number of the most recent config entry), the leader will send back an HTTP 400 Bad Request, which will be reported by robustirc-editconfig, and the user will need to try again.

In case we want to change the config via IRC (can be implemented later), we should add a bot that does that.

With regard to the config file format, we decided to use TOML (https://github.com/toml-lang/toml), and https://github.com/BurntSushi/toml seems like a good Go package for TOML.

implement throttling

To prevent flooding, single users should not be able to send more than e.g. 2 messages/s.

Canary mode: download raft-/irc-log from existing network, diff

To test how a new version (or a development version) would change what users see in a live network, we could add a canary feature: robustirc -canary_from=robustirc.net -network_password=<secret> -canary_report=/tmp/report.html

RobustIRC would then call a new handler, say, /getlog, which would send the current raft log (= input) and output IRC messages. For each message, it would then compare the existing output with the output that the local version would produce, and generate a diff if they differ.
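The comparison step might be sketched like this. The logEntry and mismatch types and the process callback are hypothetical; the format returned by the proposed /getlog handler is not specified above, so the sketch assumes one input message with its list of output lines.

```go
package main

import "fmt"

// logEntry pairs one input message with the output recorded on the live
// network (hypothetical format).
type logEntry struct {
	Input  string
	Output []string
}

// mismatch describes an input for which the local output differs.
type mismatch struct {
	Index     int
	Input     string
	Want, Got []string
}

// equal reports whether a and b contain the same lines in the same order.
func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// canaryDiff runs process on every input and collects the entries whose
// local output differs from the recorded output.
func canaryDiff(entries []logEntry, process func(string) []string) []mismatch {
	var diffs []mismatch
	for i, e := range entries {
		got := process(e.Input)
		if !equal(e.Output, got) {
			diffs = append(diffs, mismatch{Index: i, Input: e.Input, Want: e.Output, Got: got})
		}
	}
	return diffs
}

func main() {
	entries := []logEntry{
		{Input: "PING keepalive", Output: []string{"PONG keepalive"}},
		{Input: "JOIN #test", Output: []string{"JOIN :#test"}},
	}
	// Toy local implementation that only knows about PING.
	process := func(in string) []string {
		if in == "PING keepalive" {
			return []string{"PONG keepalive"}
		}
		return nil
	}
	for _, d := range canaryDiff(entries, process) {
		fmt.Printf("entry %d (%q): want %q, got %q\n", d.Index, d.Input, d.Want, d.Got)
	}
}
```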
