robustirc / robustirc

RobustIRC - an IRC network without netsplits, implemented in Go using the Raft consensus algorithm

Home Page: https://robustirc.net/

License: BSD 3-Clause "New" or "Revised" License

HTML 3.10% Go 96.19% Shell 0.14% Makefile 0.25% Dockerfile 0.32%

go golang irc raft raft-consensus-algorithm

robustirc's Introduction

Please see http://robustirc.net/docs.html for documentation.

robustirc's People

stapelberg dopuskh3 aftran njnuwjq fosterswiss slymas iq-scm danielkim802

robustirc's Issues

call PrivacyFilter in canary mode and irclog

Investigate why NickServ’s RECOVER doesn’t communicate nick changes properly

From rami’s session:

PRIVMSG NickServ :<privacy filtered>
1428334505393258445.1   2015-04-06 15:35:05 +00:00   :NickServ!services@services NOTICE rami_ :Ghost with your nick has been killed.
1428334505407942600.1   2015-04-06 15:35:05 +00:00   :rami_!rami@robust/0x13d2769f2cfd31dd NICK :rami

In my session:

1428334458460164545.1   2015-04-06 15:34:18 +00:00   :rami_!rami@robust/0x13d2769f2cfd31dd JOIN :#chaos-hd
1428334469592251793.0   2015-04-06 15:34:29 +00:00   PING keepalive
1428334469592251793.1   2015-04-06 15:34:29 +00:00   :robustirc.net PONG keepalive
1428334505654522390.1   2015-04-06 15:35:05 +00:00   :ChanServ!services@services MODE #chaos-hd +o rami

It seems like the NICK message coming from services is not relayed properly, and possibly not processed properly.

Look into making crypto/tls faster

With the throughput program from robustirc/benchmark, we can easily create a scenario in which the bottleneck is crypto/cipher/(*gcm).mul (use throughput -sessions=250 -channels=1, leading to at least 500 HTTPS connections).

Looking at go/src/crypto/cipher/gcm.go, the function does multiplication in GF(2^128). According to http://en.wikipedia.org/wiki/Galois/Counter_Mode, the currently fastest implementation that does not use hardware acceleration is a timing-resistant bitslicing implementation; see also http://www.chesworkshop.org/ches2009/presentations/01_Session_1/CHES2009_ekasper.pdf

It seems non-trivial to replace Go’s mul() (which in itself is relatively short) with the bitslicing implementation (which seems rather long, but perhaps I’m looking at the wrong thing).

A further optimization would be to use AES-NI, see https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf. This should lead to another 2x speed-up on CPUs which support it.

A workaround until this is done could be to use stunnel, which uses OpenSSL, which has AES-NI support. Placing stunnel in front of RobustIRC for incoming connections is doable, but that still leaves us with the outgoing connections that are not accelerated.

Localnet should check whether the certificate is (about to become) expired and generate a new one

Run a clean benchmark

I.e. with 3 servers + a separate server for the bridge + a separate server for the benchmark program.

Perhaps we can spin up a couple of rackspace VMs to do this.

implement pagination for the raft log on the status page

Otherwise, loading it takes minutes once the network has seen a couple thousand messages.
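
A minimal sketch of the index arithmetic such pagination could use. The function name `pageBounds`, the 0-based page number and the page size of 1000 are assumptions for illustration; none of them come from the RobustIRC code.

```go
package main

import "fmt"

// pageBounds returns the half-open index range [start, end) of the entries
// shown on the given 0-based page. Invalid or out-of-range pages yield an
// empty range.
func pageBounds(total, perPage, page int) (start, end int) {
	if total <= 0 || perPage <= 0 || page < 0 {
		return 0, 0
	}
	start = page * perPage
	if start >= total {
		return total, total
	}
	end = start + perPage
	if end > total {
		end = total
	}
	return start, end
}

func main() {
	entries := make([]string, 2500) // stand-in for the raft log entries
	start, end := pageBounds(len(entries), 1000, 2)
	fmt.Printf("page 2 shows entries [%d, %d)\n", start, end)
}
```

The status page handler would then render only `entries[start:end]` and link to the neighbouring pages.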

Implement the OPER and KILL command

See https://tools.ietf.org/html/rfc2812#section-3.1.4 for OPER

The username should be the command sender’s nickname, and the password should be the network password (for now; finer-grained ACLs are something for later).

See https://tools.ietf.org/html/rfc2812#section-3.7.1 for KILL

Track and expose network latency statistics

It’s useful to have a good idea of the latency between nodes to configure the various raft timeouts properly.

We should track and expose (in the web interface, and possibly one of the STATS IRC commands later) the average latency from each node to all others.
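
A minimal sketch of how per-peer averages could be tracked, assuming round-trip times are measured elsewhere and fed in. The type `latencyStats` and its methods are hypothetical names, not existing RobustIRC code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// latencyStats keeps a running average of round-trip times per peer.
type latencyStats struct {
	mu    sync.Mutex
	sum   map[string]time.Duration
	count map[string]int64
}

func newLatencyStats() *latencyStats {
	return &latencyStats{
		sum:   make(map[string]time.Duration),
		count: make(map[string]int64),
	}
}

// observe records one measured round-trip time to peer.
func (s *latencyStats) observe(peer string, rtt time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sum[peer] += rtt
	s.count[peer]++
}

// average returns the mean round-trip time to peer, or 0 if nothing has
// been observed for that peer yet.
func (s *latencyStats) average(peer string) time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := s.count[peer]
	if n == 0 {
		return 0
	}
	return s.sum[peer] / time.Duration(n)
}

func main() {
	s := newLatencyStats()
	s.observe("ridcully.robustirc.net:60667", 12*time.Millisecond)
	s.observe("ridcully.robustirc.net:60667", 18*time.Millisecond)
	fmt.Println(s.average("ridcully.robustirc.net:60667"))
}
```

A plain mean is the simplest choice; a windowed or exponentially weighted average may reflect current conditions better when tuning raft timeouts.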

Make rate limiting exponential

Currently, the cooloff for rate-limiting is static, but I think it’d be a better experience if it slowly ramped up. I.e., it starts at 2^0 = 1ms, then goes to 2^1 = 2ms, 2^2 = 4ms, 2^3 = 8ms, etc. until it arrives at PostMessageCooloff. That way, the common case (regular users) is faster than now, while the worst case (spam bots) still gets throttled eventually.
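
The ramp-up could be sketched as follows. The 500ms value for `postMessageCooloff` and the idea of counting consecutive rate-limited messages in `n` are assumptions for this example; when the counter resets is left open here.

```go
package main

import (
	"fmt"
	"time"
)

// postMessageCooloff is the upper bound for the cooloff. The value is an
// assumption for this sketch.
const postMessageCooloff = 500 * time.Millisecond

// cooloff returns the delay after n consecutive rate-limited messages:
// 2^n milliseconds, capped at postMessageCooloff.
func cooloff(n int) time.Duration {
	if n < 0 {
		n = 0
	}
	// Clamp large exponents early so the shift below cannot overflow.
	if n > 30 {
		return postMessageCooloff
	}
	d := time.Duration(1<<uint(n)) * time.Millisecond
	if d > postMessageCooloff {
		return postMessageCooloff
	}
	return d
}

func main() {
	for n := 0; n <= 10; n++ {
		fmt.Printf("n=%d: wait %v\n", n, cooloff(n))
	}
}
```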

Implement authentication at services using PASS

Implement KNOCK

Refactor ircserver/ircserver_test.go so that message comparisons are simpler

The code suffers from a lot of duplication right now. We should add a test helper which ensures two slices of *irc.Message are equal.
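
One possible shape for such a helper, as a sketch: it compares messages by their string form and assumes the message type implements fmt.Stringer. The name `diffMessages` and the stand-in type `msg` are hypothetical.

```go
package main

import "fmt"

// diffMessages compares two message slices by their string form and returns
// a description of the first difference, or "" if the slices are equal.
func diffMessages[T fmt.Stringer](got, want []T) string {
	if len(got) != len(want) {
		return fmt.Sprintf("got %d messages, want %d", len(got), len(want))
	}
	for i := range got {
		if g, w := got[i].String(), want[i].String(); g != w {
			return fmt.Sprintf("message %d: got %q, want %q", i, g, w)
		}
	}
	return ""
}

// msg is a stand-in for the real message type, used only to make this
// sketch runnable on its own.
type msg string

func (m msg) String() string { return string(m) }

func main() {
	got := []msg{":robustirc.net PONG keepalive"}
	want := []msg{":robustirc.net PONG keepalive"}
	if diff := diffMessages(got, want); diff != "" {
		fmt.Println("mismatch:", diff)
		return
	}
	fmt.Println("messages match")
}
```

In a test, the call site shrinks to `if diff := diffMessages(got, want); diff != "" { t.Fatal(diff) }`.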

implement rudimentary server-to-server features so that connecting services (anope?) is possible

transport: monitor http errors

Currently we only monitor the round-trip time, but we should also monitor how many errors of a certain type we receive (i.e. HTTP status codes and network errors).
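
A minimal sketch of counting errors by class, assuming the transport reports the HTTP status code and the error of each request. The key scheme ("network", "http_<code>") and all names are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// classify maps the outcome of one HTTP request to a counter key:
// "network" for transport-level errors, "http_<code>" for non-2xx status
// codes, and "" for successful requests.
func classify(statusCode int, err error) string {
	if err != nil {
		return "network"
	}
	if statusCode < 200 || statusCode > 299 {
		return fmt.Sprintf("http_%d", statusCode)
	}
	return ""
}

// errorCounter counts failed requests per error class.
type errorCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// observe records the outcome of one request; successes are not counted.
func (c *errorCounter) observe(statusCode int, err error) {
	key := classify(statusCode, err)
	if key == "" {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts == nil {
		c.counts = make(map[string]int)
	}
	c.counts[key]++
}

// count returns how many failures of the given class were observed.
func (c *errorCounter) count(key string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[key]
}

func main() {
	var c errorCounter
	c.observe(200, nil)
	c.observe(503, nil)
	c.observe(0, errors.New("connection refused"))
	fmt.Println(c.count("http_503"), c.count("network"))
}
```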

Investigate a mechanism to run a RobustIRC canary infrastructure before network migrations

Before big networks migrate, they’ll want to make sure the performance of RobustIRC is good enough to handle their traffic. It’d be good to have a way to send regular IRC traffic into RobustIRC. Possibly we can come up with a server program that can be linked into a regular IRC network which then translates commands from server-to-server (from IRC network) to client-to-server (for RobustIRC).

Implement the TOPIC command

This is the first thing which requires RobustIRC to have the concept of channels :).

Implement the MAP command

The MAP command should return the configured peers and when they were last successfully contacted.

Before implementing this, we need to make sure that the result is the same when replaying a snapshot, i.e. we should probably handle the peerschanged raft messages and persist them in the ircstore. Note that only the number of peers needs to stay the same (so that the message/reply ids are consistent between runs); their last successful contact timestamp doesn’t matter.

cannot connect when nickname is already in use

At least not with irssi, it’s just stuck :|.

Figure out why nodes are so slow during compaction

We often have leadership flaps during compaction because heartbeats cannot be done within the allocated timeframe (e.g. 2s).

Analyze compaction effectiveness, i.e. figure out for which messages we still need to implement compaction

Make updater check network health before hitting /quit

Otherwise it’s too easy to bring down the entire network by accident (as happened this morning).

We should force a PING/PONG exchange before accepting a session

Currently, sending a NICK message is enough to get a valid session. This is bad as this allows senders with spoofed IP addresses to participate in our network. We should not send the 001 reply until we have seen one valid PING/PONG exchange, i.e. we send PING with a random challenge, the client answers with a corresponding PONG.

UnrealIRCd does this as well, so see telnet irc.twice-irc.de 6667 for details ;).
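
The challenge part of such an exchange could look like the sketch below. The 8-byte challenge length, the hex encoding and the function names are assumptions; wiring this into session setup and delaying the 001 reply is not shown.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newChallenge returns a random token to send as the PING parameter.
func newChallenge() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// validPong reports whether the parameter of the client's PONG matches the
// challenge that was sent in the PING.
func validPong(challenge, pongParam string) bool {
	return subtle.ConstantTimeCompare([]byte(challenge), []byte(pongParam)) == 1
}

func main() {
	challenge, err := newChallenge()
	if err != nil {
		fmt.Println("could not generate challenge:", err)
		return
	}
	fmt.Printf("PING :%s\n", challenge)
	// A well-behaved client echoes the challenge back.
	fmt.Println("session accepted:", validPong(challenge, challenge))
}
```

A sender with a spoofed IP address never sees the PING and therefore cannot produce the matching PONG.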

Figure out how the ircOutput slice should be influenced by compaction

Currently, we only ever grow the ircOutput slice.

ircserver.ClearState must kill all running GetMessages requests (and forbid new ones), then reset ircOutput, then allow new GetMessages requests.

I’m not entirely sure when this should happen with regards to taking snapshots, see also the comment in robustirc.go about deleting entries from ircstore. I suppose we can either:

- Delete specific entries from ircOutput when deleting them from ircstore.
- Delete ircOutput entirely and reprocess a snapshot.

Not sure yet which one is more expensive.

implement NS as alias for PRIVMSG NickServ etc.

killed users will only be killed on their next activity

This is due to SendMessages being called with session==killer, not session==killed. We’ll need to refactor this. Let’s use OperServ’s KILL feature for now.

implement the INVITE command

Fix issues related to the 2015-04-02 14:32 CEST deadlock

Symptom: clients disconnected after a while, and /debug/pprof/goroutine reveals that > 70 goroutines are stuck on acquiring the applyMu mutex.

From the log on alp:

I0402 12:32:55.480121       1 api.go:97] Proxying request ("/robustirc/v1/0x13ced24bfa07d924/message") to leader "ridcully.robustirc.net:60667"
I0402 12:32:55.494351       1 api.go:97] Proxying request ("/robustirc/v1/0x13cad43f066bfec3/message") to leader "ridcully.robustirc.net:60667"
I0402 12:32:55.508024       1 robustirc.go:358] Apply(msg.Type=irc_from_client)
I0402 12:32:55.510380       1 server.go:1775] http: panic serving 172.17.42.1:40364: Assumption violated: current time 1427977975510311504 is older than the timestamp of the last processed message (1427977975515627970)
goroutine 3075187 [running]:
net/http.func·011()
        /home/michael/go/src/net/http/server.go:1130 +0xbb
github.com/robustirc/robustirc/ircserver.(*IRCServer).NewRobustMessage(0xc208354000, 0x2, 0x13ced089c4d38cbf, 0x0, 0xc221be8d80, 0xe, 0x0)
        /home/michael/gocode/src/github.com/robustirc/robustirc/ircserver/ircserver.go:239 +0x247
main.handlePostMessage(0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0, 0xc211856f60, 0x1, 0x1)
        /home/michael/gocode/src/github.com/robustirc/robustirc/api.go:184 +0x717
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc21ab040c0, 0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0)
        /home/michael/gocode/src/github.com/julienschmidt/httprouter/router.go:293 +0x18e
net/http.(*ServeMux).ServeHTTP(0xc20800ca80, 0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0)
        /home/michael/go/src/net/http/server.go:1541 +0x17d
net/http.serverHandler.ServeHTTP(0xc21bbc1500, 0x7fe9102a19c8, 0xc220823d60, 0xc2194c2dd0)
        /home/michael/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc21f1695e0)
        /home/michael/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
        /home/michael/go/src/net/http/server.go:1751 +0x35e
I0402 12:32:55.613561       1 robustirc.go:358] Apply(msg.Type=irc_from_client)

There are two problems to be addressed:

- panic() calls in HTTP handlers should lead to RobustIRC exiting. That fixes the deadlock itself.
- Possibly we should try to avoid calling NewRobustMessage on non-leaders in the first place. Still thinking about that.
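
For the first point, a wrapper along these lines might work. By default net/http recovers a panic in a handler and keeps the server running, which is how a mutex held by the panicking goroutine can stay locked. The names `exitOnPanic`, `exit` and `boom` are hypothetical, and this is a sketch rather than the fix that was applied.

```go
package main

import (
	"log"
	"net/http"
	"os"
)

// exit terminates the process. It is a variable so that tests (and the demo
// in main) can replace it.
var exit = os.Exit

// exitOnPanic wraps h so that a panic in the handler terminates the process
// instead of being recovered by net/http.
func exitOnPanic(h http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("panic in HTTP handler: %v", err)
				exit(1)
			}
		}()
		h.ServeHTTP(w, r)
	})
}

// boom is a demo handler that always panics.
var boom = http.HandlerFunc(func(http.ResponseWriter, *http.Request) {
	panic("assumption violated")
})

func main() {
	// For the demo, report the exit code instead of terminating.
	exit = func(code int) { log.Printf("would exit with code %d", code) }
	exitOnPanic(boom).ServeHTTP(nil, nil)
}
```

In a real server the wrapper would be applied once around the router, e.g. `http.Handle("/", exitOnPanic(router))`, so that a supervisor can restart the node with clean state.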

set up and document monitoring

Currently there is some ad-hoc monitoring running on my workstation. We should document the setup and make it more persistent. Also, we should consider running https://github.com/docker-infra/container_exporter on all robustirc nodes so that we get system-level stats (RAM usage, disk space, …)

Can we give up raft leadership on SIGTERM?

That way, shutting down a node would be a bit more graceful. Perhaps this is not worth it, if the heartbeat intervals are short enough.

add a test for the message of death detection

I think it does not actually work, based on what I have seen this morning.

don’t handle GetMessages until all messages are replayed when recovering from snapshot

A user reported that some messages were received twice when we did a rolling restart of the network. My suspicion is that the bridge sent GetMessages with a lastseen that was newer than what the server had in the outputstream, so as the server recovered, the bridge would get new messages.

We should verify that this can currently happen and then come up with a fix.

Find a way to make compaction not happen at the same time on all nodes

get rid of data races

list doesn't seem to show all channels sometimes

Yesterday, I was invited (by someone messaging me) to join #zhn-kino on robustirc.
Yet, issuing /list didn't show the channel.
I was nevertheless able to connect to it just a few minutes later, and issuing /list again a quarter hour later showed the channel…
The screenshot below shows the output of the weechat "status" window for the two calls to /list.

How can we catch messages of death?

In case an IRC message crashes RobustIRC, that’s a serious problem, because it will crash the entire network at once.

I’m not yet sure how we could best catch messages of death. A few ideas come to mind:

- Before calling node.Apply() on the leader, already apply the message, so that only the leader will crash, but the message will not have been persisted, hence other nodes will still be alive. Ideally, in the HTTP handler, we’d even return an HTTP 404 Not Found status code, so that the client will not retry. The problem with this approach is that it might violate the raft ordering, I think. Perhaps this needs to be solved on the raft level instead?
- Perhaps we can remove entries from the raft log, if all nodes do so in a consistent way? I.e., when a node realizes (somehow) that it crashes when restoring the log when starting, it could remove the latest entry from the log and retry. That way, all nodes would automatically remove the offending entry.
- The most costly way is to duplicate the entire IRC server state on each incoming message, apply the message to the state copy and see whether it worked there. This may be feasible (or a last resort) if we restrict it to messages other than well-tested, trivial ones like PRIVMSG, since the bulk of traffic should be PRIVMSG.

throttle remote hosts when specifying wrong -network_password

…to prevent brute-forcing.

Investigate why NickServ’s GHOST doesn’t work

Increase statement coverage for the ircserver package to (almost) 100%

We should have extensive tests and run them frequently. The few tests we currently have are frequently broken, and no one notices because they are never run.

Implement a test program for verifying robustness

I think such a program could work the following way:

1. It uses robustirc-localnet to bring up a new local network.
2. It directly uses github.com/robustirc/bridge/robustsession instead of talking to the bridge so that it can call GetMessages without specifying lastseen, thereby getting the entire IRC output log and not just new messages. This is important for verifying that old messages are restored properly from snapshots/the log.
3. It connects to IRC and then does all of this at the same time:
   - after a random time interval in range [0, 1s] it sends a message to IRC
   - after a different random time interval in range [0, 1s] it kills a random number of servers.
4. After killing the server(s), it should make a copy of their state directory so that one can reproduce potential problems.
5. Then, it should restart the server(s) and verify that the network becomes consistent again and that every server returns the same IRC log as response to the GetMessages request.
6. Goto 3

Before doing releases, we could then leave this running for a day or so to make sure that the new release is as robust as the old one.

In a later version, it’d be cool to use fault injection to simulate other failures, such as a full disk, or a failing write.

irc: failed to parse command "PRIVMSG"

• Steps to reproduce:

Run robustirc-localnet
Start weechat instance, connect to localhost
weechat: /join #test
Open second terminal, telnet localhost 6667
telnet: NICK foobar
telnet: JOIN #test
telnet: PRIVMSG #test foobar

• What happens:
You receive in the weechat buffer for the server the following:
22:03:01 localhost =!= | irc: too few arguments received from IRC server for command "privmsg"
| (received: 3 arguments, expected: at least 4)
22:03:01 localhost =!= | irc: failed to parse command "PRIVMSG" (please report to developers):
22:03:01 localhost =!= | :foobar!robust@robust/0x13b9119ff29dc0b0 PRIVMSG #test

• What should happen:
Maybe you should get these messages in the telnet window? Certainly not every user in a channel needs to be alerted that one of them is misbehaving.

-log_total_bytes seems ineffective

The robustirc directory on alp is 2.5G by now. Need to figure out why old logs are not being deleted.

make leveldb compression configurable

It uses snappy by default, but for large networks it may be worthwhile to trade more disk usage for lower CPU usage and hence more messages/s. From the code perspective, this is as simple as using:

    db, err := leveldb.OpenFile(dir, &opt.Options{
        Compression: opt.NoCompression,
    })

But we should make that a flag (as for small networks compression is good) and figure out + document whether one can switch from no compression to compression, or whether leveldb doesn’t support that.

implement a config file

This config file should contain e.g.:

throttling delay (currently in a flag)
session timeout (currently not configurable)
ACLs (who can be an irc operator)
(network-wide bans) (not yet implemented)

The config file will be stored in raft in its entirety, i.e. not just mutations.

A new tool can then be used like this:

robustirc-editconfig -network=robustirc.net -network_password=…

It calls the /getconfig API method, starts $EDITOR, calls /putconfig if the editor successfully terminates.

In case there is an edit conflict (i.e. the revision number in the file is not the revision number of the most recent config entry), the leader will send back an HTTP 400 Bad Request, which will be reported by robustirc-editconfig, and the user will need to try again.
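
The revision check could be sketched like this. The `configStore` type, its fields and the `put` method are hypothetical; in RobustIRC the accepted config would additionally have to go through raft, which is not shown.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// configStore holds the most recent config and its revision number.
type configStore struct {
	mu       sync.Mutex
	revision uint64
	contents string
}

// put stores contents if baseRevision is the revision the edit started from.
// It returns the HTTP status code that /putconfig would send back.
func (s *configStore) put(baseRevision uint64, contents string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if baseRevision != s.revision {
		// Somebody else changed the config in the meantime: edit conflict.
		return http.StatusBadRequest
	}
	s.revision++
	s.contents = contents
	return http.StatusOK
}

func main() {
	s := &configStore{}
	fmt.Println(s.put(0, "first edit"))       // accepted, revision becomes 1
	fmt.Println(s.put(0, "conflicting edit")) // rejected, based on stale revision 0
}
```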

In case we want to change the config via IRC (can be implemented later), we should add a bot that does that.

With regard to the config file format, we decided to use TOML (https://github.com/toml-lang/toml), and https://github.com/BurntSushi/toml seems like a good Go package for TOML.

-singlenode should only work when there is no existing state

(as a safeguard)

implement throttling

To prevent flooding, single users should not be able to send more than e.g. 2 messages/s.

Implement +n (no external messages)

Otherwise people can send messages to channels even after they’re kicked, which is not expected by users.

Canary mode: download raft-/irc-log from existing network, diff

To test how a new version (or a development version) would change what users see in a live network, we could add a canary feature: robustirc -canary_from=robustirc.net -network_password=<secret> -canary_report=/tmp/report.html

RobustIRC would then call a new handler, say, /getlog, which would send the current raft log (= input) and output IRC messages. For each message, it would then compare the existing output with the output that the local version would produce, and generate a diff if they differ.
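
The comparison step could be sketched as follows. The `logEntry` shape, the `process` callback standing in for the local server version, and the report line format are all assumptions; fetching the log via /getlog and writing the HTML report are not shown.

```go
package main

import "fmt"

// logEntry pairs one raft log input with the IRC output that the live
// network produced for it. The shape is hypothetical.
type logEntry struct {
	Input  string
	Output []string
}

// equal reports whether a and b contain the same strings in the same order.
func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// canaryDiff replays every input through process (the local version) and
// returns one report line per entry whose output differs from the live one.
func canaryDiff(entries []logEntry, process func(input string) []string) []string {
	var report []string
	for i, e := range entries {
		got := process(e.Input)
		if !equal(e.Output, got) {
			report = append(report,
				fmt.Sprintf("entry %d (%q): live=%q local=%q", i, e.Input, e.Output, got))
		}
	}
	return report
}

func main() {
	entries := []logEntry{
		{Input: "PING keepalive", Output: []string{":robustirc.net PONG keepalive"}},
	}
	// local is a stand-in for the server version under test.
	local := func(input string) []string {
		return []string{":robustirc.net PONG keepalive"}
	}
	fmt.Println(len(canaryDiff(entries, local)), "differences")
}
```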
