ipfs-cluster / ipfs-cluster
Pinset orchestration for IPFS
Home Page: https://ipfscluster.io
License: Other
Participants:
We itemized all the pieces we need to build, and figured out next steps.
These are all the components that we need to define or build. We need to:
Links:
We need to create use case scenarios
We need to do the following
2016-07-06 17:00Z
It needs to look like it's the IPFS API regarding response format too.
Here are a few questions I have had at various points while working out how ipfs-cluster works.
What exactly are peers communicating to each other?
What is the division of labor between an ordinary peer and the cluster leader? What extra work does the cluster leader do?
What is the purpose of bootstrapping in ipfs-cluster-service? Is this the way for a single node to begin its own cluster?
What is the comment "// The only way I could make this work" in ipfs-cluster/main.go's init function referring to?
What is the purpose of the --leave flag, and why can a node still be considered part of the cluster when its ipfs-cluster-service process is no longer running? Why is --leave not the default? It sounds like a node that does not leave on shutdown can lead to problems, so what is the beneficial use case of keeping it in the cluster?
From context clues it seems like ipfs-cluster hopes, in the future, to provide pluggable options for consensus, ipfs connections and monitors. Is this the case? If so, what is the purpose of having these options?
ipfshttp appears to communicate with an ipfs daemon listening on a local port through an HTTP API. Does the ipfs daemon have other methods of communication and it's just a matter of cluster not implementing clients for them? I suspect this is the case because the ipfsconn folder seems to abstract ipfs connections away from an HTTP API; however, the service binary directly calls ipfshttp, so maybe the ipfsconn/ipfshttp hierarchy was intended to mean something else.
Thanks, and more to follow!
The coreos/etcd rafthttp Raft HTTP implementation (https://github.com/coreos/etcd/tree/master/rafthttp or https://godoc.org/github.com/coreos/etcd/rafthttp) can use TLS to secure communication, which in turn is capable of mandating that the client be authenticated, allowing you to specify a CA to validate client certificates (https://godoc.org/crypto/tls#Config). We could use this to our advantage to add authentication to ipfs-cluster by having the cluster accept a configured certificate and a certificate authority that it uses to keep the cluster tight. Kubernetes or similar would be responsible for the CA and certificate generation.
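As a rough illustration of that setup, here is a minimal Go sketch of a mutually-authenticated TLS configuration using only the standard library. The file names are illustrative and nothing here is ipfs-cluster's actual API; certificate provisioning is assumed to happen externally (e.g. via Kubernetes).

```go
// Sketch: a TLS config that only accepts clients whose certificates are
// signed by a cluster-wide CA. File paths and names are illustrative.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"
)

func clusterTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	caPEM, err := ioutil.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,                           // CA used to validate client certificates
		ClientAuth:   tls.RequireAndVerifyClientCert, // reject unauthenticated peers
		RootCAs:      pool,                           // validate servers against the same CA
	}, nil
}

func main() {
	cfg, err := clusterTLSConfig("peer.crt", "peer.key", "cluster-ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // would be passed to the rafthttp transport / HTTP server as appropriate
}
```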
Especially since those calls are concatenated with a status call.
It would be great to have an option for auto-pinning objects that are added/pinned with the ipfs daemon.
Steps:
A Pin should not need to be pinned on every cluster member. We should be able to say that a pin needs to be pinned on 2 or 3 cluster members.
We will start with a general replication factor for all pins, then maybe transition to replication factor per-pin.
These are thoughts for the first approach.
A replication factor of -1 means pin everywhere. If the replication factor is larger than the number of cluster peers, it is treated as being that large.
We need a PeerMonitor component which is able to decide, when a pin request arrives, which peer comes next. The decision should be based on pluggable modules: for a start, we will provide one which attempts to evenly distribute the pins, although it should easily support other metrics like disk space etc.
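A minimal sketch of both ideas above (capping the replication factor and evenly distributing pins), using hypothetical helper functions rather than the real ipfs-cluster components:

```go
// Sketch of the two ideas above: normalizing the replication factor and an
// allocation strategy that evenly distributes pins. Types are hypothetical,
// not the real ipfs-cluster components.
package allocator

import "sort"

// effectiveReplication maps the configured factor onto the cluster size:
// -1 means "pin everywhere", and anything larger than the cluster is capped.
func effectiveReplication(factor, clusterSize int) int {
	if factor == -1 || factor > clusterSize {
		return clusterSize
	}
	return factor
}

// allocateEvenly picks the peers with the fewest pins so load stays balanced.
// pinCounts maps peer ID -> number of pins the peer currently holds.
func allocateEvenly(pinCounts map[string]int, wanted int) []string {
	peers := make([]string, 0, len(pinCounts))
	for p := range pinCounts {
		peers = append(peers, p)
	}
	sort.Slice(peers, func(i, j int) bool {
		return pinCounts[peers[i]] < pinCounts[peers[j]]
	})
	if wanted > len(peers) {
		wanted = len(peers)
	}
	return peers[:wanted]
}
```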
Every commit log entry asking to Pin something must be tagged with the peers which are in charge. The Pin Tracker will receive the task and, if it is itself tagged on a pin, it will pin. Otherwise it will store the pin and mark it as remote.
If the PinTracker receives a Pin which is already known, it should unpin if it is no longer tagged among the hosts in charge of pinning. Somewhere in the pipeline we should probably detect re-pinnings and avoid changing pinning peers unnecessarily.
Unpinning works as usual, removing the pin only where it is pinned.
The peer monitor should detect hosts which are down (or hosts whose ipfs daemon is down). Upon a certain time threshold (say 5 minutes, configurable), it should check the status for pins assigned to that host and re-pin them to new hosts.
The peer monitor should also receive updates from the peer manager and make sure that there are no pins assigned to hosts that are no longer in the cluster.
For the moment there is no re-balancing when a node comes back online.
This assumes there is a single peer monitor for the whole cluster. While monitoring the local ipfs daemon could be done by each peer (and triggering rebalances for that), if all nodes watch each other this will cause havoc when triggering rebalances. The Raft cluster leader should probably be in charge then, but this conflicts with being completely abstracted from the consensus algorithm below. If we had a non-leader-based consensus we could assume a distributed lottery to select someone. It makes no sense to re-implement code to choose a peer from the cluster when Raft already has it all. Also, running the rebalance process in the Raft leader saves a redirection for every new pin request.
We need to attack ipfs-cluster-ctl to provide more human-readable outputs as the API formats become more stable. status should probably show succinctly which pins are under-replicated or which peers are in error, one line per pin.
This is a quick rundown of a first-ever user trying the ipfscluster tooling. Take it with a pile of salt: I noted down everything I ran into, including a bunch of problems that are obviously just polish not meant to be there yet, and other stuff which is probably meant to be experimental or a shortcut for now. I noted everything I ran into to give feedback on what was intuitive and what wasn't, first reactions, etc. My proper review is just beginning -- this is just a first stab at playing with it.
Already, I think some work can be done on the connectivity side of things. Here are the basic points. Maybe these can be extracted to separate issues later, but keeping them here to retain the context.
- ipfs members ls or something like it would show connectivity status, particularly both: (the ipfs swarm command. In reality, that should be a libp2p thing, so maybe we can lift that command out of go-ipfs and into go-libp2p, so that ipfscluster may take advantage of it). #15
- ipfs swarm connect <multiaddr>. #16
- Also, some docs on what log levels or log modules I should listen to to figure things out would be good. E.g. if I want to debug connectivity, or the consensus stuff, or the interactions with the ipfs connector, what level should I --loglevel. May be good to support per-module logging (I think go-ipfs has this with an ENV var or something, I don't recall. It's useful to isolate a module and hear its debug output only). #18 #19
- ipfscluster: go get -u -- why should I mess up my system for you? #20
- $(pwd)/.ipfs-cluster instead of $(whoami)/.ipfs-cluster, because in server-side installations (more typical for clusters), user directories are not the typical place stuff is installed/stored. This is departing from the convention of go-ipfs and IPFS_PATH, but I think that's fine. #22
> ipfscluster-server init
error loading configuration: open /Users/earth/.ipfs-cluster/server.json: no such file or directory
- ipfscluster-server init --config .ipfs-cluster/server.json did not work; took me a bit to realize the global flag had to be before the subcommand name.
- ipfscluster-server --config .ipfs-cluster/server.json init worked as before (back to failing the same way as above).
- mkdir .ipfs-cluster is not enough. Still the same error.
- touch .ipfs-cluster/server.json gets further, but crashes the process:
panic: multihash too short. must be > 3 bytes #22
goroutine 1 [running]:
panic(0x682b60, 0xc420074b20)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/ipfs/ipfs-cluster.NewMapPinTracker(0xc4201ef130, 0xc42016baa0)
/Users/earth/go/src/github.com/ipfs/ipfs-cluster/map_pin_tracker.go:45 +0x370
main.main()
/Users/earth/go/src/github.com/ipfs/ipfs-cluster/ipfscluster-server/main.go:134 +0x221
- echo '{}' > .ipfs-cluster/server.json did not work. Same error.
- init is an option, not a command! (ipfscluster-server -init, not ipfscluster-server init). That took a while.
- I'm used to git init, ipfs init, etc., and to "commands" being in subcommand notation, not options. (I know golang flags isn't good about this.)
- -f to overwrite. Having to rm manually is annoying, and it's less automation friendly. #21
ipfscluster
- ipfscluster has a short command listing, nice.
- COMMANDS section in -h.
- ipfscluster member lacks ipfscluster member add. #23
- ipfscluster tool to do everything (launch the ipfscluster-server, launch ipfs daemon, a local one or a global one).
```js
{
  // ...
  "ipfs_api_addr": "127.0.0.1",
  "ipfs_api_port": 9095,
  "ipfs_addr": "127.0.0.1",
  "ipfs_port": 5001 // this is an ipfs_api, so hard to distinguish from ipfs_api_port.
                    // maybe it should be "ipfs_node_port". #22
  // ...
}
```
"ipfs_cluster_api": "/ip4/127.0.0.1/tcp/9095/http",
"ipfs_node_api": "/ip4/127.0.0.1/tcp/5001/http",
"node"
. (becauser the cluster one is a node, and the cluster one is really the cluster api)."underlying_ipfs_node_api"
is more clear, but long. and "underlying"
is not that good of a word. maybe we should clarify the relationship between the cluster (and the ipfs-node it represents) and the sub ipfs-node. maybe "parent/child" works for this, because it works with the tree recursive structure?
"cluster_node_api": "/ip4/127.0.0.1/tcp/9095/http",
"child_node_api": "/ip4/127.0.0.1/tcp/5001/http",
// or
"parent_cluster_api": "/ip4/127.0.0.1/tcp/9095/http",
"child_node_api": "/ip4/127.0.0.1/tcp/5001/http",
cluster_peers
. but now i'm not sure how.
- /ip4/127.0.0.1/tcp/9095/http? or /ip4/127.0.0.1/tcp/9095? or what? (maybe /ip4/127.0.0.1/tcp/9095/ipfs/Qmfoobarbazpk...). "/ip4/%s/tcp/%s/ipfs/%s", config.api_addr, config.api_port, config.id.
- ipfscluster id similar to ipfs id. #21
- /p2p protocol prefix, not /ipfs, would be less confusing here.
- config.cluster_addr, config.cluster_port that we want for this.
- ipfscluster. Let's do it manually for now. #21
- ipfscluster member ls sweet.
// ipfscluster
Error 500: leader unknown or not existing yet
---
// ipfscluster-server logs
15:14:24.627 INFO cluster: pinning:QmcskskhwkUFh1vvZbGFhBJhVMvzg6Hx44niysaoiiQGVt cluster.go:275
15:14:24.627 ERROR libp2p-rpc: leader unknown or not existing yet client.go:125
15:14:24.627 ERROR cluster: sending error response: 500: leader unknown or not existing yet rest_api.go:396
ipfscluster members ls shows (2, 2, 3), instead of (3, 3, 3).
ipfscluster members ls shows (3, 3, 3)
> ipfscluster pin add <cid>
Request accepted
---
// ipfscluster-server logs
15:21:08.581 ERROR libp2p-raf: QmbGvizLZHVWto8ZWU2tbkNcV6W92G6AggKdPfx5gFbLZz: Pipeline error: EOF transport.go:716
// is this bad? o/
15:24:19.916 INFO cluster: pinning:QmcskskhwkUFh1vvZbGFhBJhVMvzg6Hx44niysaoiiQGVt cluster.go:275
15:24:19.963 INFO cluster: pin commited to global state: QmcskskhwkUFh1vvZbGFhBJhVMvzg6Hx44niysaoiiQGVt consensus.go:267
15:24:20.348 INFO cluster: IPFS object is already pinned: QmcskskhwkUFh1vvZbGFhBJhVMvzg6Hx44niysaoiiQGVt ipfs_http_connector.go:205
- ipfscluster pin ls: #2 and #3 have it, #1 does not. Probably that pipeline error, got disconnected... but ipfscluster members ls still shows 3 for everyone, though I think #1 disconnected.
- On #1, ipfscluster members ls shows just 1. OK, need to reconnect #2 and #3.
- 2016/12/30 18:30:23 [INFO] snapshot: Creating new snapshot at /home/jbenet/.ipfs-cluster/data/snapshots/421-8-1483140623423.tmp took a while.
- ipfscluster members ls shows (3, 2). Probably from #3 before it panicked. OK, restart everything.
- ipfs pin ls <cid> ... stuck in 2 machines. Looks like it's iterating over the entire damn pinset, hanging the machine...
- ipfs refs local | grep <cid> shows it in #3 (where it was added), but not on #2 nor on #1. Looks like the cluster server knows about the pin, but it did not translate to the child ipfs node I had running in those machines... so it did not pin it.
- ipfscluster-server logs should maybe show whether it found + can connect to the child ipfs node.
> ipfscluster status
cid: QmcskskhwkUFh1vvZbGFhBJhVMvzg6Hx44niysaoiiQGVt
status:
QmTHEzZHGTSiVFFM2h3TgFCSsp2Ecq82U6heAxK7jJRijF: (#2)
ipfs: pinning
QmUmQ2DRe2keGN8meXXLWjUgGbyiBLPWJFXGi4c2kfDGJb: (#1)
ipfs: pin_error
QmbGvizLZHVWto8ZWU2tbkNcV6W92G6AggKdPfx5gFbLZz: (#3)
ipfs: pinned
- (ipfs ping).
- ipfscluster status now shows pinning -> pin_error on #2.
- Not sure whether it stays in pin_error forever, or whether the cluster will try to get the node to repin.
18:46:41.588 WARNI cluster: IPFS unsuccessful: 500: Path 'QmcskskhwkUFh1vvZbGFhBJhVMvzg6Hx44ed' not pinned
- ipfscluster status panicked. Then #2 panicked: https://gist.github.com/jbenet/e04c59731ce33a3522603efa7a22f3d3
- ipfscluster members ls shows (3, 3, 3).
- ipfs pin ls <cid> shows the pin on all 3! \o/ Yay.
- (I ran ipfs pin rm <cid> in the child manually, hoping the cluster will notice the pin fail).
- ipfs refs local | grep <cid> shows the pin. Yay!
- ipfscluster pin ls shows the second pin, but no longer the first... the 2nd contains the 1st, but these should not be coalesced.
pin A, pin B, unpin B, and I expect pin A to remain; whether or not B contains A is irrelevant.
Added C, and now pin ls shows both B and C.
Me when the pins succeeded:
[0] Notes on go packaging. (TL;DR: use gx-go. This is the expanded "why use gx-go".) Warning: this is a contrarian view with respect to the Go language, and a standard, sane view from the perspective of package management, version control, and secure open source. Go packaging is designed for monolithic codebases (a well-tended sequoia), not open source (a haphazard, expansive brush forest). Go was designed at Google, baking into the language many of the software engineering practices of Google. In general this is a great thing. In the cases where open source != how Google develops, it is not. Google has a single, huge tree of code, with atomic safe updates. You cannot merge something into the tree if ANYTHING across all (most) of Google fails to compile/errors. Open source is fundamentally different. There is no such atomic-safe-update gating. We cannot assume other people's systems are set up like ours, or that they want to update their tree to the version we require (running go get -u for the user may be harmful to them). Or that we know who is depending on our code (lots of private code may depend on our package). Or that whoever is running a package we depend on won't screw everything up by moving something or breaking an API. Go uses location addressing for package identification... not just inside a single diligent org (which works really well) but in the broader internet (which can fail catastrophically). Despite years of heated arguments on this, the Go team has not yet understood this is a real problem (they washed their hands of it by having an external committee handle it). But that's OK, because we are the people who use hash-linking to securely address everything. Let's use it to our advantage!
When there is a cluster with, let's say, two peers and one or both of them get shut down for some reason, do the peers stay remembered and reconnect automatically on startup?
I suppose they are automatically saved into the service.json file in the cluster peers section?
Thanks !
This wraps documentation in general:
Currently ipfs-cluster-service only handles SIGINT (as in ctrl-c). Killing the process with kill produces a dirty shutdown.
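A minimal sketch of what handling SIGTERM in addition to SIGINT could look like, using only the standard library; the shutdown call is a stand-in for the real cleanup path:

```go
// Sketch: trapping both SIGINT and SIGTERM so that a plain `kill` also
// triggers a clean shutdown. Standard library only.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)

	// ... start the cluster service here ...

	s := <-sigCh
	log.Printf("received %s, shutting down cleanly", s)
	// cluster.Shutdown() // stand-in for the real cleanup path
}
```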
- go get -u gx: use $(shell which gx) or download to a local install.
- All IPFS nodes associated to cluster nodes should have connectivity among themselves. We should trigger swarm connect commands to all other known nodes.
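One possible way to trigger these connects is through the IPFS daemon's HTTP API (/api/v0/swarm/connect). The sketch below assumes the API listens on 127.0.0.1:5001 and uses placeholder peer multiaddrs; depending on the go-ipfs version, the endpoint may need to be called with GET instead of POST.

```go
// Sketch: asking the local IPFS daemon to connect to the other cluster peers'
// IPFS nodes via its HTTP API. Addresses below are placeholders.
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func swarmConnect(apiBase string, addrs []string) {
	for _, addr := range addrs {
		u := fmt.Sprintf("%s/api/v0/swarm/connect?arg=%s", apiBase, url.QueryEscape(addr))
		resp, err := http.Post(u, "", nil)
		if err != nil {
			log.Printf("connect to %s failed: %s", addr, err)
			continue
		}
		resp.Body.Close()
		log.Printf("swarm connect %s: %s", addr, resp.Status)
	}
}

func main() {
	peers := []string{ // placeholder multiaddrs of the other peers' IPFS nodes
		"/ip4/10.0.0.2/tcp/4001/ipfs/QmPeerTwo",
		"/ip4/10.0.0.3/tcp/4001/ipfs/QmPeerThree",
	}
	swarmConnect("http://127.0.0.1:5001", peers)
}
```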
I'm supposed to have 1.5.2 and it gives me
./ipfs-update --version
ipfs-update version 1.5.1
:(
I think this happens because the monitor only throws alerts for the current set of peers, so it will not complain about a peer that has been removed from the cluster. It should not be like this. Remember to add a test.
Move components to subpackages. Make sure they log to different facilities.
bleve is like Elasticsearch, but 100% Go.
It's mature and many projects use it.
For IPFS it would provide very powerful search.
It's also very easy to integrate.
dgraph got it in in just a week, for example.
Please at least consider this and discuss.
Currently Cluster can allocate content to a number of peers but will not detect failures and re-allocate in that case.
Allocation is based on metrics which are regularly pushed to the Leader. If the last metric from a peer has expired or is invalid, the peer is not considered an available allocation when pinning. When re-pinning content, this situation is also detected and a new allocation will be found, so that part is done.
The idea is then to give the PeerMonitor the task of producing Alerts on a channel which the main Cluster component listens on. When the PeerMonitor detects that a peer is down (because, e.g., its last metric has expired), it sends an alert. Cluster will then find which CIDs are allocated to the problematic peer and re-trigger Pin operations for each.
This implies PeerMonitors should be made aware of the current clusterPeers (or be aware themselves via the RPC API to the pinManager).
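A rough sketch of that alert flow, with hypothetical stand-in types (not the real ipfs-cluster components):

```go
// Sketch of the alert flow described above: the monitor pushes alerts on a
// channel and the main cluster component listens and re-triggers pins.
// All types here are simplified stand-ins.
package cluster

// Alert signals that a peer looks down (e.g. its last metric expired).
type Alert struct {
	Peer string
}

type PeerMonitor interface {
	Alerts() <-chan Alert
}

type Cluster struct {
	monitor PeerMonitor
	// cidsFor returns the CIDs currently allocated to a peer (stand-in).
	cidsFor func(peer string) []string
	// pin re-triggers a pin so a new allocation is found (stand-in).
	pin func(cid string)
}

// watchAlerts runs in its own goroutine and reacts to monitor alerts.
func (c *Cluster) watchAlerts(done <-chan struct{}) {
	for {
		select {
		case a := <-c.monitor.Alerts():
			for _, cid := range c.cidsFor(a.Peer) {
				c.pin(cid) // re-pinning will skip the peer whose metric expired
			}
		case <-done:
			return
		}
	}
}
```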
It should be possible to add and remove peers from the Cluster while the cluster is running.
Bonus points for auto-discovery of cluster members from a given seed.
Need an easy way to list all libp2p nodes involved in cluster (members and IPFS) and see what's connected to what (ideally everything is connected to everything).
Travis tests fail sometimes in random places (usually tests around replication). It has usually proven useful to increase delays, but we should really look closer into it.
One way I'd help with IPFS' adoption would be to lend 100-500GB of my spare hard disk space. I'd like to simply be able to start up an IPFS software piece and instruct it to be in "lending mode" -- I don't care what gets hosted on my machine if it helps the network, to put it bluntly.
EDIT: I can't see a way to apply the user-story label here.
I figured we should put a reading-list together to cover a lot of the concepts relevant to cluster. LINKS ONLY PLEASE, don't add files.
We'll want to touch on:
I thought up a possible architecture for a Reed-Solomon (or other erasure coding algorithm) layer on top of IPFS. My notes are here. @hsanjuan informed me that ipfs-cluster was already a tool that was planned, and that this architecture could slot into ipfs-cluster, so I should raise an issue here. One of the key points of the system is that there would be IPFS nodes that provide IPFS files that they do not have locally, but instead have to generate by accessing other files from the IPFS network and re-combining them.
Users should be able to start an ipfs-cluster node and have it join a pinning ring, that is, an existing set of nodes. These nodes would be archiving some interesting material for the participants. The newcomer should have an easy way to join the effort.
For this to work:
Current state and considerations:
The main key here is to understand what is the trust model in a pinning ring, how a pinning ring member gets trusted and loses the trust, and who can take those actions.
It would be helpful to point to the prebuilt binaries on dist.ipfs.io in the install section of the README, etc.
There are a number of pain points if the consensus state format changes upon an upgrade.
Currently, persistence is obtained via Raft snapshots, which are loaded on boot and written on shutdown (at least). The Raft snapshot format comes from the go-libp2p-raft FSMSnapshot implementation, which is just a serialization of the state using msgpack.
If the state changes, loading the snapshot is likely to break. Also, this format is unreadable to the user and hard to work with.
A few thoughts about tackling this:
- State (a migration). This is potentially tricky on a large cluster; see the sketch below.
- go-libp2p-consensus rollbacks are not very specific, and this would work only because it's the way go-libp2p-raft does it at the moment.
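One possible direction, sketched with hypothetical types: wrap the serialized state in a version number so an old snapshot can be detected and migrated before it is loaded. The state shape and migration step here are purely illustrative.

```go
// Sketch: a versioned state wrapper so a future ipfs-cluster can detect an
// old snapshot and run migrations before loading it. Types are hypothetical.
package state

import "errors"

const currentVersion = 2

type versionedState struct {
	Version int
	Pins    map[string][]string // cid -> allocated peers (illustrative shape)
}

// upgrade migrates older snapshot formats to the current one, one step at a time.
func upgrade(s *versionedState) error {
	for s.Version < currentVersion {
		switch s.Version {
		case 1:
			// example migration: pretend v1 had no allocations, so default them
			for cid := range s.Pins {
				if s.Pins[cid] == nil {
					s.Pins[cid] = []string{} // empty slice meaning "pin everywhere"
				}
			}
			s.Version = 2
		default:
			return errors.New("unknown state version")
		}
	}
	return nil
}
```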
At Climate Mirror, we deal with a large amount of data on a few, unfortunately centralized servers. However, a new initiative, Our Data Our Hands (ourdataourhands.org), hopes to shift the burden of storing data to everybody and their grandmothers' computers. We hope to accomplish this by providing Docker containers that, once started, join a cluster of similar peers and contribute storage space. We will also sell pre-rolled hardware with large hard drives that can help stabilize the network.
In order to accomplish this, we need a few things:
Some of these are dreamy (threshold signatures to keep us from creating a massive botnet under one person's control) but others are fairly important, like 1, 3, 5, and 6.
Currently it just shows IDs of cluster members. It should:
This implies making some components PeerAware and relaying any changes to the peer sets. Need to investigate how Raft behaves when altering the peers.
Some thoughts about it.
Currently we let the allocation strategy dictate where content should be pinned, even when it comes from an intercepted /add request in the ipfs proxy. That means it could be allocated somewhere else, and thus content might need to be transferred one more time than if one of those allocations were the peer on which it was added.
If we forced one of the allocations to be the same peer where it was added:
Note that this can also be part of a pinning strategy in an allocator, where candidate peers are allocated content if they are already pinning it.
They set the configuration key and then configuration is saved upon exiting, thus becoming permanent.
At some point, move ipfs-clusterd (that should be the tool name) into its own repo (ipfs/go-ipfs-clusterd). This doesn't have to be now, but let's keep this repo for prototyping and general discussion. ipfs-cluster will mean a few different repos.
Currently Pin/Unpin in the main component call RPC on the Leader() directly, but they should not be part of the specifics of the consensus protocol below.
LogPin and LogUnpin in the consensus component should call the RPC on the Leader instead.
Given a single existing cluster member, a new cluster node should be able to set itself up, retrieve and connect to all members of the cluster.
Note the trickiness of this:
ipfscluster-server
- -config PATH and not -config string (not possible with vanilla flag, I think)
- make ipfscluster-server -init a subcommand too: ipfscluster-server init
ipfscluster
- ipfscluster id
I read in the meeting notes that you're asking for use cases. Here is my very personal use case and wishlist for ipfs-cluster. I'm not 100% sure if this doesn't go beyond the scope of ipfs-cluster, but I'll just write it down here anyway.
I want to replace any distributed filesystem I currently have in use with ipfs. I'm using XtreemFS (because it works well over WAN) and have been using AndrewFS, GlusterFS and HDFS (without WANdisco) in the past.
My first use case is to store home directories in ipfs clusters and these are the features that I would really like to have:
So how would I use the above?
I would have a company-wide cluster where everyone can access their home directories from everywhere. The cluster would have sub-clusters which represent the different sites. Those would be connected over WAN. Inside each sub-cluster there would be nodes which are locally connected over LAN. I don't want to just have dedicated machines building up the cluster, but also each and every "client", which is why the client-side quota limit is important in my opinion.
I hope this is the same vision that you have for ipfs-cluster. I think it's a pretty common use case for a distributed filesystem.
Need to automatically bring up a real cluster, in a real cloud/hardware environment, with real IPFS, and perform a number of standard cluster workloads.
Any measures extracted from these tests can be used as future reference regarding the performance of the Cluster.
We'll need these other NEW repos:
ipfs
libp2p
may need others.
I need to add a captain's log.
$(pwd)/.ipfs-cluster -- I don't like this. It's relative, prone to user error, and departs from IPFS convention. If anything, configurations live in /etc/ on any standard system. Local configs are usually in .config/<app-name> these days.
I prefer the JavaScript style because that's the usual convention with JSON. It is also related to how API responses are formatted. Look:
Handle panics/errors when configuration is really invalid (empty node ID etc).
Rename ipfs_port to ipfs_node_port and so on.
Use multiaddress format like:
```
"ipfs_cluster_api": "/ip4/127.0.0.1/tcp/9095/http",
"ipfs_node_api": "/ip4/127.0.0.1/tcp/5001/http",
```
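If the configuration moves to multiaddresses, the values still have to be resolved into host:port pairs for the HTTP clients. A minimal stdlib-only sketch of that step (a real implementation would more likely use the go-multiaddr library):

```go
// Sketch: turning a /ip4/<host>/tcp/<port>/http multiaddr from the config
// into a plain host:port for an HTTP client. A real implementation would
// likely use go-multiaddr instead of string handling.
package config

import (
	"fmt"
	"strings"
)

func hostPortFromMultiaddr(maddr string) (string, error) {
	parts := strings.Split(strings.Trim(maddr, "/"), "/")
	// expected shape: ip4 <host> tcp <port> [http]
	if len(parts) < 4 || parts[0] != "ip4" || parts[2] != "tcp" {
		return "", fmt.Errorf("unsupported multiaddr: %s", maddr)
	}
	return parts[1] + ":" + parts[3], nil
}

// e.g. hostPortFromMultiaddr("/ip4/127.0.0.1/tcp/5001/http") -> "127.0.0.1:5001"
```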
CLI apps should be tested by sharness, at least to check that they are not utterly broken.
- Also, ^C on one node, then ^C on the other hangs; it looks like it's trapping the exit signal and waiting for the other members to respond, so it's stuck. Can't kill it, only kill -9.
ctrl-c in one node causes some errors. Raft is for sure going to complain about this (as it should). Gotta investigate if there are further problems.
Participants:
@whyrusleeping
@hermanjunge
@jbenet
@christianlundkvist
daemon startup:
Api endpoints:
Tasks:
Here's some feedback from a user session. We tried it a bit -- got stuck, then we moved to something else, but at least we got some info.
- install.sh script: maybe install-local.sh, if install.sh is misleading
- ctl := ipfs-cluster-ctl
- service := ipfs-cluster-service
- ctl id
- ctl status may want to be ctl pin status
- ctl peers ls looks good
- ctl pin: pin (commits the pin to the whole cluster), pin complete (the pin has actually completed on enough of the cluster to count as consensus)
- ctl peers ls on the leader takes a while
- ctl peers ls does not say what address it is using to connect to each peer.
It's very inefficient now.
Let users download and install it easily.
The replication factor feature (#46) is ready (as described in the Captain's log). This adds the possibility of adding different pinning strategies.
Currently we only have a dummy numpin Informer and a numpinalloc PinAllocator for it.
It would be really useful to have other informers, which fetch different metrics, and other allocators which implement different strategies, for example a disk-space metric and allocator.
They just need to implement the Informer interface and the Allocator interface respectively. The existing examples show how this is done in a simple way.
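For illustration, here is a rough sketch of what a disk-space informer and a matching allocation strategy could look like. The types are simplified stand-ins, not the actual ipfs-cluster Informer/Allocator interfaces, which should be consulted before writing a real one.

```go
// Sketch of a disk-space informer and matching allocation strategy.
// Interfaces and types are simplified stand-ins.
package diskinformer

import "sort"

// Metric is a simplified stand-in for the metric an informer pushes.
type Metric struct {
	Name  string
	Peer  string
	Value uint64 // free bytes in this example
}

// DiskInformer reports free disk space for the local peer.
type DiskInformer struct {
	Peer      string
	FreeBytes func() uint64 // platform-specific probe, injected for the sketch
}

// GetMetric produces the "freespace" metric this informer is responsible for.
func (d DiskInformer) GetMetric() Metric {
	return Metric{Name: "freespace", Peer: d.Peer, Value: d.FreeBytes()}
}

// AllocateByFreeSpace orders candidate peers by free space, most free first,
// so pins land on the peers with the most room.
func AllocateByFreeSpace(metrics []Metric) []string {
	sort.Slice(metrics, func(i, j int) bool {
		return metrics[i].Value > metrics[j].Value
	})
	peers := make([]string, len(metrics))
	for i, m := range metrics {
		peers[i] = m.Peer
	}
	return peers
}
```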