bitwalker / libcluster

Automatic cluster formation/healing for Elixir applications

License: MIT License

libcluster's Introduction

libcluster


This library provides a mechanism for automatically forming clusters of Erlang nodes, with either static or dynamic node membership. It provides a pluggable "strategy" system, with a variety of strategies provided out of the box.

You can find the supporting documentation on HexDocs.

Features

  • Automatic cluster formation/healing
  • Choice of multiple clustering strategies out of the box:
    • Standard Distributed Erlang facilities (e.g. epmd, .hosts.erlang), which support both IP-based and DNS-based names
    • Multicast UDP gossip, using a configurable port/multicast address
    • Kubernetes, via its metadata API using a configurable label selector and node basename, or alternatively via DNS
    • Rancher, via its metadata API
  • Easy to provide your own custom clustering strategies for your specific environment.
  • Easy to provide your own distribution plumbing (i.e. something other than Distributed Erlang) by implementing a small set of callbacks. This allows libcluster to support projects like Partisan.

Installation

Add :libcluster to your dependencies in mix.exs:

defp deps do
  [{:libcluster, "~> MAJ.MIN"}]
end

You can determine the latest version by running mix hex.info libcluster in your shell, or by going to the libcluster page on Hex.pm.

Usage

Getting started with libcluster is easy: simply decide which strategy you want to use to form a cluster, define a topology, and then start the Cluster.Supervisor module in the supervision tree of an application in your Elixir system, as demonstrated below:

defmodule MyApp.App do
  use Application

  def start(_type, _args) do
    topologies = [
      example: [
        strategy: Cluster.Strategy.Epmd,
        config: [hosts: [:"[email protected]", :"[email protected]"]],
      ]
    ]
    children = [
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
      # ..other children..
    ]
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

The following section describes topology configuration in more detail.

Example Configuration

You can configure libcluster either in your Mix config file (config.exs), as shown below, or by constructing the keyword list structure manually, as shown in the previous section. Either way, you need to pass the configuration to the Cluster.Supervisor module in its start arguments. If you prefer to use Mix config files, simply use Application.get_env(:libcluster, :topologies) to get the config that Cluster.Supervisor expects.

config :libcluster,
  topologies: [
    epmd_example: [
      # The selected clustering strategy. Required.
      strategy: Cluster.Strategy.Epmd,
      # Configuration for the provided strategy. Optional.
      config: [hosts: [:"[email protected]", :"[email protected]"]],
      # The function to use for connecting nodes. The node
      # name will be appended to the argument list. Optional
      connect: {:net_kernel, :connect_node, []},
      # The function to use for disconnecting nodes. The node
      # name will be appended to the argument list. Optional
      disconnect: {:erlang, :disconnect_node, []},
      # The function to use for listing nodes.
      # This function must return a list of node names. Optional
      list_nodes: {:erlang, :nodes, [:connected]},
    ],
    # more topologies can be added ...
    gossip_example: [
      # ...
    ]
  ]
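
For example (a minimal sketch, assuming the Mix config above and the MyApp application from the Usage section):

# Fetch the topologies from the application environment at runtime
# and hand them to Cluster.Supervisor in your supervision tree.
topologies = Application.get_env(:libcluster, :topologies, [])

children = [
  {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]}
]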

Strategy Configuration

For instructions on configuring each strategy included with libcluster, please visit the docs on HexDocs, and look at the module doc for the strategy you want to use. The authoritative documentation for each strategy is kept up to date with the module implementing it.

Clustering

You have a handful of choices with regard to cluster management out of the box:

  • Cluster.Strategy.Epmd, which relies on epmd to connect to a configured set of hosts.
  • Cluster.Strategy.LocalEpmd, which relies on epmd to connect to discovered nodes on the local host.
  • Cluster.Strategy.ErlangHosts, which uses the .hosts.erlang file to determine which hosts to connect to.
  • Cluster.Strategy.Gossip, which uses multicast UDP to form a cluster between nodes gossiping a heartbeat.
  • Cluster.Strategy.Kubernetes, which uses the Kubernetes Metadata API to query nodes based on a label selector and basename.
  • Cluster.Strategy.Kubernetes.DNS, which uses DNS to join nodes under a shared headless service in a given namespace.
  • Cluster.Strategy.Rancher, which like the Kubernetes strategy, uses a metadata API to query nodes to cluster with.
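
As a concrete example, a Gossip topology looks roughly like this (a sketch; the values are illustrative, and the authoritative option list is in the Cluster.Strategy.Gossip moduledoc):

config :libcluster,
  topologies: [
    gossip_example: [
      strategy: Cluster.Strategy.Gossip,
      config: [
        # port and addresses used for the multicast heartbeat
        port: 45892,
        if_addr: "0.0.0.0",
        multicast_addr: "230.1.1.251",
        # a TTL of 1 keeps packets on the local network
        multicast_ttl: 1
      ]
    ]
  ]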

You can also define your own strategy implementation by implementing the Cluster.Strategy behaviour. This behaviour expects you to implement a start_link/1 callback, optionally overriding child_spec/1 if needed. You don't necessarily have to start a process as part of your strategy, but since it's very likely you will need to maintain some state, designing your strategy as an OTP process (e.g. GenServer) is the ideal approach; any valid OTP process will work, however. See the Cluster.Strategy module for details on the callbacks you need to implement and the arguments they receive. A skeleton of such a strategy is sketched below.
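
A minimal sketch (assuming libcluster 3.x, where start_link/1 receives a list containing a %Cluster.Strategy.State{} struct, and using a hypothetical :nodes config key):

defmodule MyApp.Strategy.Static do
  use GenServer

  @behaviour Cluster.Strategy

  alias Cluster.Strategy.State

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  def init([%State{config: config} = state]) do
    # Hypothetical option: a fixed list of nodes to connect to.
    nodes = Keyword.get(config, :nodes, [])

    # Connect using the topology's configured connect/list_nodes functions.
    Cluster.Strategy.connect_nodes(state.topology, state.connect, state.list_nodes, nodes)

    {:ok, state}
  end
end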

If you do not wish to use the default Erlang distribution protocol, you may provide an alternative means of connecting/disconnecting nodes via the connect and disconnect configuration options; if you are not using Erlang distribution, you must provide a list_nodes implementation as well. Each option takes a {module, fun, args} tuple, and the node name being targeted is appended to the args list. How to implement distribution in this way is left as an exercise for the reader, but I recommend taking a look at the Firenest project currently under development. By default, libcluster uses Distributed Erlang. A sketch of such a plumbing module follows.
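
For illustration only, a hypothetical plumbing module that simply delegates to Distributed Erlang (the module and function names are assumptions, not part of libcluster):

defmodule MyApp.Distribution do
  # libcluster appends the target node to the configured args list,
  # so connect/disconnect receive the node as their final argument.
  def connect(node), do: :net_kernel.connect_node(node)
  def disconnect(node), do: :erlang.disconnect_node(node)

  # Must return a list of node names.
  def list_nodes(), do: Node.list(:connected)
end

It would then be wired into a topology as:

connect: {MyApp.Distribution, :connect, []},
disconnect: {MyApp.Distribution, :disconnect, []},
list_nodes: {MyApp.Distribution, :list_nodes, []}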

Third-Party Strategies

The following third-party strategy implementations are known to exist, though the list is not comprehensive.

Copyright and License

Copyright (c) 2016 Paul Schoenfelder

This library is MIT licensed. See LICENSE.md for details.

libcluster's People

Contributors

alexkovalevych, alexmarold, aseigo, balena, bitwalker, bradschwartz, bryanhuntesl, dgmcguire, doughsay, dschniepp, dzhlobo, elcritch, enricoboccadifuoco, ericmj, flowerett, frekw, gazler, getong, gsmlg, hermanverschooten, ivarvong, jsonmaur, maennchen, mobileoverlord, oestrich, rlipscombe, seivan, sleipnir, wojtekmach, zorbash


libcluster's Issues

Libcluster running as a dependency for multiple apps in an umbrella

Experiencing an odd issue with libcluster. Using the EPMD strategy.
Setup:
Multiple apps under a single umbrella, with Distillery-targeted releases for certain apps within the umbrella.

Each app runs as a single systemd service and each service must connect to each other.

What we are noticing is that only one epmd process gets started on the box: not one per service, but only for the first service that gets started.

This causes issues: if service 1 goes down, service 2 is unable to connect to it when service 1 gets restarted.

I would expect that service 1 and service 2 would each have an epmd process running.

Is my understanding correct? I am really just fishing here. Let me know if you need anything from me.

k8s: Clustering apps with different base names

Say I have three Elixir/Erlang services, A (3 replicas), B (2 replicas), and C (2 replicas), in the same namespace, and I would like to cluster the A and B apps together and leave app C on its own. I do not think I can do this with the existing codebase. Please correct me otherwise.

I can do a PR for this feature if it cannot be done currently.

Any chance of a new release to hex?

Hey!

We're running into an issue on our Kubernetes cluster running the latest release on Hex (v2.3.0) that looks like it has been fixed as part of an already merged PR #38. Any chance we could get a new release on Hex that contains that PR?

PS. Thanks for all your awesome work. We're actively using (or about to use) several of your packages and they've been extremely helpful.

Unable to connect to nodes

I followed issue 54.
I have a minikube cluster with specific deployment, serviceaccount, role, etc.
And still unable to connect nodes.
Tried with hex and github versions.

k8s/deployment.yaml
kind: Deployment
metadata:
  name: phishx
  namespace: phishx-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: phishx-app
        tier: phishx-tier
    spec:
      serviceAccountName: phishx-account
      containers:
        - name: phishx
          image: phishx/release:0.5.4
          imagePullPolicy: Always
          ports:
            - containerPort: 4000
          args: ["foreground"]
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
      imagePullSecrets:
        - name: cloud.docker.com
config/prod.exs
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        kubernetes_selector: "tier=phishx-tier",
        kubernetes_node_basename: "phishx"
      ]
    ]
  ]
kubectl get pods -o wide
NAME                      READY     STATUS    RESTARTS   AGE       IP           NODE
phishx-66b67bfb59-6jrmh   1/1       Running   0          4m        172.17.0.6   minikube
phishx-66b67bfb59-csrvd   1/1       Running   0          4m        172.17.0.4   minikube
phishx-66b67bfb59-rsg2v   1/1       Running   0          4m        172.17.0.5   minikube
kubectl logs -f phishx-66b67bfb59-6jrmh
18:55:32.111 [warn] [libcluster:k8s] unable to connect to :"[email protected]"
18:55:32.126 [warn] [libcluster:k8s] unable to connect to :"[email protected]"
18:55:32.129 [warn] [libcluster:k8s] unable to connect to :"[email protected]"

mixed or hybrid cluster strategies?

hi

is it possible to mix strategies?

i.e. a core of "pets" running say in Erlang strategy and a larger number of "cattle" in Kubernetes strategy but ALL in a single cluster?

a kind of hybrid or mixed cluster

I want the stability of Erlang on say AWS or DO and Kubernetes for scaling out

Sending messages across cluster

Right now libcluster doesn't seem to be able to send messages across the cluster. From what I can tell after looking at the :publish implementation, the event is only sent to subscribers, which are processes on that same node, but not to nodes in the cluster (correct me if I'm wrong).

I am using epmd strategy. The way I'm testing it is this:

  1. Run nodes:

    $ iex --sname node1@localhost --cookie test_cookie -S mix
    $ iex --sname node2@localhost --cookie test_cookie -S mix
  2. Connect nodes:

    iex(node1@localhost)1> Cluster.Strategy.connect_nodes([:'node2@localhost'])
    iex(node2@localhost)1> Cluster.Strategy.connect_nodes([:'node1@localhost'])
  3. Subscribe both shells to cluster events:

    iex(node1@localhost)2> Cluster.Events.subscribe(self())
    iex(node2@localhost)2> Cluster.Events.subscribe(self())
  4. Publish event from within node1:

    iex(node1@localhost)3> Cluster.Events.publish({:ok, "Hello, world!"})
    
    15:04:34.703 [debug] [libcluster] [events] <== {:publish, {:ok, "Hello, world!"}}
    :ok
    
  5. Flush shell messages on both nodes:

    on node1:

    iex(node1@localhost)4> flush()
    {:ok, "Hello, world!"}
    :ok
    iex(node1@localhost)5>
    

    and on node2:

    iex(node2@localhost)3> flush()
    :ok
    

Was this done by design? If so, what is the purpose of publishing events within the same node?

At the same time nodeup and nodedown events work perfectly across the cluster.

Is there a plan to support publishing arbitrary events across the cluster? If so, I'd like to work on a PR to bring this support.

Kubernetes cluster with 'terminating' pods

Good evening. Can you help me? I'm using libcluster in Kubernetes with the Cluster.Strategy.Kubernetes strategy and libring for dynamic messaging.

config :libcluster,
  topologies: [
    feed: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [kubernetes_selector: "app=feed", kubernetes_node_basename: "feed"]
    ],
    fixture: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [kubernetes_selector: "app=fixture", kubernetes_node_basename: "fixture"]
    ],
    web: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [kubernetes_selector: "app=web", kubernetes_node_basename: "web"]
    ]
  ]

For some reason, sometimes after a node moves to the Terminating state, it is still in the cluster. Eventually, a request to such a node results in a request timeout.

Refactor to allow injecting strategies outside the builtin ones.

Seems like a lot of the code could be reused between the strategies, and even used outside them. The biggest difference I see is essentially how each strategy accesses its list of nodes, whether it's hardcoded or does a lookup.

It could be useful to make it easier to add new strategies if you could dogfood some shared code from the existing builtin ones.

Error while overloading :connect / :disconnect / :list_nodes

While overloading :connect, :disconnect, and :list_nodes, I ran into this error:

18:28:16.549 [error] GenServer #PID<0.158.0> terminating
** (RuntimeError) Elixir.ExCluster.Connection.list/1 is undefined!
    (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
    (libcluster) lib/strategy/strategy.ex:39: Cluster.Strategy.connect_nodes/4
    (libcluster) lib/strategy/gossip.ex:116: Cluster.Strategy.Gossip.handle_heartbeat/2
    (libcluster) lib/strategy/gossip.ex:90: Cluster.Strategy.Gossip.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:udp, #Port<0.4903>, {192, 168, 1, 26}, 45892, <<104, 101, 97, 114, 116, 98, 101, 97, 116, 58, 58, 131, 116, 0, 0, 0, 1, 100, 0, 4, 110, 111, 100, 101, 100, 0, 13, 110, 111, 100, 101, 48, 50, 64, 116, 104, 97, 110, 111, 115>>}
State: %Cluster.Strategy.State{config: [port: 45892, if_addr: "0.0.0.0", multicast_addr: "230.1.1.251", multicast_ttl: 1], connect: {ExCluster.Connection, :connect, []}, disconnect: {ExCluster.Connection, :disconnect, []}, list_nodes: {ExCluster.Connection, :list, [:connected]}, meta: {{230, 1, 1, 251}, 45892, #Port<0.4903>}, topology: :gossip_topo}

My understanding of the problem is that when trying to use a custom implementation, the module is not always loaded yet by the time ensure_exported! runs (in strategy.ex).
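
A sketch of the kind of fix being described (hypothetical; the actual merge request may differ):

# Inside Cluster.Strategy (sketch):
defp ensure_exported!(mod, fun, arity) do
  # Force the module to load before asking whether the function is
  # exported; function_exported?/3 does not load modules itself.
  unless Code.ensure_loaded?(mod) and function_exported?(mod, fun, arity) do
    raise "#{inspect(mod)}.#{fun}/#{arity} is undefined!"
  end
end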

You will find a sample to reproduce the problem here :
https://github.com/freandre/ex_cluster

A merge request with my fix will follow.

Thanks for your work ;)

Multiple topologies with same Strategy

Hi. I was trying to set up multiple clusters of the same nodes running the same strategy from my Mix config, but I noticed that each worker in the strategy is currently id'd using __MODULE__, making it so that you can only configure one topology per strategy in the config.exs file. Unsure if this is intended; just wanted to drop a note.

I've fallen back to some configuration hacks for the meantime.
Thanks!

Need more information on Multicast mode

Hello,

We are looking at using libcluster at our org and the network/infrastructure people need more information on how libcluster uses multicast. Do you have a whitepaper or some docs on the library and how it uses Multicast? Dense mode vs Sparse mode, does it use subscriptions with pruning, etc.

Thank you,

Stephen

tests fail

I performed a fresh checkout of master, ran the tests, and got the following failures:

  1) test connect_nodes/4 handles connect ignore (Cluster.StrategyTest)
     test/strategy_test.exs:45
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert capture_log(fn ->
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:39: Cluster.Strategy.connect_nodes/4
       test/strategy_test.exs:51: anonymous fn/2 in Cluster.StrategyTest."test connect_nodes/4 handles connect ignore"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/strategy_test.exs:49: (test)



  2) test connect_nodes/4 does not connect existing nodes again (Cluster.StrategyTest)
     test/strategy_test.exs:12
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert :ok = Strategy.connect_nodes(__MODULE__, connect, list_nodes, [Node.self()])
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:39: Cluster.Strategy.connect_nodes/4
       test/strategy_test.exs:16: (test)



  3) test disconnect_nodes/4 does not disconnect missing noded (Cluster.StrategyTest)
     test/strategy_test.exs:59
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert :ok = Strategy.disconnect_nodes(__MODULE__, disconnect, list_nodes, [Node.self()])
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:80: Cluster.Strategy.disconnect_nodes/4
       test/strategy_test.exs:63: (test)



  4) test disconnect_nodes/4 does disconnect new nodes (Cluster.StrategyTest)
     test/strategy_test.exs:68
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert capture_log(fn ->
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:80: Cluster.Strategy.disconnect_nodes/4
       test/strategy_test.exs:74: anonymous fn/1 in Cluster.StrategyTest."test disconnect_nodes/4 does disconnect new nodes"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/strategy_test.exs:72: (test)



  5) test connect_nodes/4 handles connect failure (Cluster.StrategyTest)
     test/strategy_test.exs:33
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert capture_log(fn ->
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:39: Cluster.Strategy.connect_nodes/4
       test/strategy_test.exs:39: anonymous fn/2 in Cluster.StrategyTest."test connect_nodes/4 handles connect failure"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/strategy_test.exs:37: (test)



  6) test disconnect_nodes/4 handles connect failure (Cluster.StrategyTest)
     test/strategy_test.exs:82
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert capture_log(fn ->
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:80: Cluster.Strategy.disconnect_nodes/4
       test/strategy_test.exs:88: anonymous fn/1 in Cluster.StrategyTest."test disconnect_nodes/4 handles connect failure"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/strategy_test.exs:86: (test)



  7) test disconnect_nodes/4 handles disconnect ignore (Cluster.StrategyTest)
     test/strategy_test.exs:97
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert capture_log(fn ->
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:80: Cluster.Strategy.disconnect_nodes/4
       test/strategy_test.exs:103: anonymous fn/1 in Cluster.StrategyTest."test disconnect_nodes/4 handles disconnect ignore"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/strategy_test.exs:101: (test)



  8) test connect_nodes/4 does connect new nodes (Cluster.StrategyTest)
     test/strategy_test.exs:21
     ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
     code: assert capture_log(fn ->
     stacktrace:
       (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
       (libcluster) lib/strategy/strategy.ex:39: Cluster.Strategy.connect_nodes/4
       test/strategy_test.exs:27: anonymous fn/2 in Cluster.StrategyTest."test connect_nodes/4 does connect new nodes"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/strategy_test.exs:25: (test)

.

  9) test start_link/1 calls right functions (Cluster.Strategy.KubernetesTest)
     test/kubernetes_test.exs:28
     ** (RuntimeError) failed to start child with the spec {Cluster.Strategy.Kubernetes, [topology: :name, config: [kubernetes_node_basename: nil, kubernetes_selector: "app=", kubernetes_master: "cluster.localhost", kubernetes_service_account_path: "/common/libcluster/test/fixtures/kubernetes/service_account"], connect: {Cluster.Nodes, :connect, [#PID<0.226.0>]}, disconnect: {Cluster.Nodes, :disconnect, [#PID<0.226.0>]}, list_nodes: {Cluster.Nodes, :list_nodes, [[]]}, block_startup: true]}.
     Reason: an exception was raised:
         ** (RuntimeError) Elixir.Cluster.Nodes.list_nodes/1 is undefined!
             (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
             (libcluster) lib/strategy/strategy.ex:80: Cluster.Strategy.disconnect_nodes/4
             (libcluster) lib/strategy/kubernetes.ex:97: Cluster.Strategy.Kubernetes.load/1
             (libcluster) lib/strategy/kubernetes.ex:79: Cluster.Strategy.Kubernetes.init/1
             (stdlib) gen_server.erl:365: :gen_server.init_it/2
             (stdlib) gen_server.erl:333: :gen_server.init_it/6
             (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
     code: capture_log(fn ->
     stacktrace:
       (ex_unit) lib/ex_unit/callbacks.ex:311: ExUnit.Callbacks.start_supervised!/2
       test/kubernetes_test.exs:31: anonymous fn/0 in Cluster.Strategy.KubernetesTest."test start_link/1 calls right functions"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/kubernetes_test.exs:30: (test)

....

 10) test start_link/1 calls right functions (Cluster.Strategy.EpmdTest)
     test/epmd_test.exs:12
     No message matching {:connect, :foo@bar} after 100ms.
     The process mailbox is empty.
     code: capture_log(fn ->
     stacktrace:
       test/epmd_test.exs:24: anonymous fn/0 in Cluster.Strategy.EpmdTest."test start_link/1 calls right functions"/1
       (ex_unit) lib/ex_unit/capture_log.ex:78: ExUnit.CaptureLog.capture_log/2
       test/epmd_test.exs:13: (test)

Finished in 0.4 seconds
15 tests, 10 failures

Randomized with seed 131338

Should I be performing some setup prior to testing?

[libcluster:app_name] cannot query kubernetes (unauthorized): endpoints is forbidden: User cannot list endpoints in the namespace "staging": Unknown user

I am running Kubernetes 1.9.4 on my cluster.

I have two pods: gate, which is trying to connect to coolapp.

I am using libcluster to connect my nodes.
I get the following error:

[libcluster:app_name] cannot query kubernetes (unauthorized): endpoints is forbidden: User "system:serviceaccount:staging:default" cannot list endpoints in the namespace "staging": Unknown user "system:serviceaccount:staging:default"

here is my config in gate under config/prod:

config :libcluster,
  topologies: [
    app_name: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        kubernetes_selector: "tier=backend",
        kubernetes_node_basename: System.get_env("MY_POD_NAMESPACE") || "${MY_POD_NAMESPACE}"
      ]
    ]
  ]

here is my configuration:

vm-args

## Name of the node
-name ${MY_POD_NAMESPACE}@${MY_POD_IP}
## Cookie for distributed erlang
-setcookie ${ERLANG_COOKIE}
# Enable SMP automatically based on availability
-smp auto

creating the secrets:

kubectl create secret generic erlang-config --namespace staging --from-literal=erlang-cookie=xxxxxx
kubectl create configmap vm-config --namespace staging --from-file=vm.args

gate/deployment.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gate
  namespace: staging
spec:
  replicas: 1
  revisionHistoryLimit: 1
  strategy:
      type: RollingUpdate
  template:
    metadata:
      labels:
        app: gate
        tier: backend
    spec:
      securityContext:
        runAsUser: 0
        runAsNonRoot: false
      containers:
      - name: gate
        image: gcr.io/development/gate:0.1.7
        args:
          - foreground
        ports:
        - containerPort: 80
        volumeMounts:
        - name: config-volume
          mountPath: /beamconfig
        env:
        - name: MY_POD_NAMESPACE
          value: staging
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RELEASE_CONFIG_DIR
          value: /beamconfig
        - name: ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: erlang-config
              key: erlang-cookie
      volumes:
      - name: config-volume
        configMap:
          name: vm-config

coolapp/deployment.yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: coolapp
  namespace: staging
spec:
  replicas: 1
  revisionHistoryLimit: 1
  strategy:
      type: RollingUpdate
  template:
    metadata:
      labels:
        app: coolapp
        tier: backend
    spec:
      securityContext:
        runAsUser: 0
        runAsNonRoot: false
     # volumes
      volumes:
      - name: config-volume
        configMap:
          name: vm-config
      containers:
      - name: coolapp
        image: gcr.io/development/coolapp:1.0.3
        volumeMounts:
        - name: secrets-volume
          mountPath: /secrets
          readOnly: true
        - name: config-volume
          mountPath: /beamconfig
        ports:
        - containerPort: 80
        args:
          - "foreground"
        env:
        - name: MY_POD_NAMESPACE
          value: staging
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: REPLACE_OS_VARS
          value: "true"
        - name: RELEASE_CONFIG_DIR
          value: /beamconfig
        - name: ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: erlang-config
              key: erlang-cookie
        # proxy_container
      - name: cloudsql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        command: ["/cloud_sql_proxy", "--dir=/cloudsql",
            "-instances=staging:us-central1:com-staging=tcp:5432",
            "-credential_file=/secrets/cloudsql/credentials.json"]
        volumeMounts:
          - name: cloudsql-instance-credentials
            mountPath: /secrets/cloudsql
            readOnly: true
          - name: cloudsql
            mountPath: /cloudsql
 

poison 3.0 dependency is aggressive

Is it necessary to have such an aggressive poison dependency? Being a library, libcluster should be as conservative as it can be here; otherwise it makes it difficult to use with other libraries.

Failed to use "poison" (versions 1.5.0 to 1.5.2) because
  deps/libcluster/mix.exs requires ~> 3.0
  ex_aws (versions 1.0.0 to 1.1.2) requires >= 1.2.0
  mix.exs specifies ~> 1.5

[usability] more helpful missing config message

In the case that the config is missing, the crash report/message is terrifying for new users.

It might be worthwhile to add a pre-flight check and print something more like:

libcluster -> missing config x 
21:54:49.998 [error] CRASH REPORT Process <0.374.0> with 0 neighbours crashed with reason: #{'__exception__' => true,'__struct__' => 'Elixir.Protocol.UndefinedError',description => <<>>,protocol => 'Elixir.Enumerable',value => nil} in 'Elixir.Enumerable':'impl_for!'/1 line 1
21:54:49.998 [error] Supervisor noslides_sup had child 'Elixir.Cluster.Supervisor' started with 'Elixir.Cluster.Supervisor':start_link([nil,[{name,'Elixir.NoSlides.ClusterSupervisor'}]]) at undefined exit with reason #{'__exception__' => true,'__struct__' => 'Elixir.Protocol.UndefinedError',description => <<>>,protocol => 'Elixir.Enumerable',value => nil} in 'Elixir.Enumerable':'impl_for!'/1 line 1 in context start_error
21:54:50.014 [error] Unable to start NoSlides supervisor because: {:shutdown, {:failed_to_start_child, Cluster.Supervisor, {%Protocol.UndefinedError{description: "", protocol: Enumerable, value: nil}, [{Enumerable, :impl_for!, 1, [file: 'lib/enum.ex', line: 1]}, {Enumerable, :reduce, 3, [file: 'lib/enum.ex', line: 141]}, {Enum, :reduce, 3, [file: 'lib/enum.ex', line: 3023]}, {Cluster.Supervisor, :get_configured_topologies, 1, [file: 'lib/supervisor.ex', line: 62]}, {Cluster.Supervisor, :init, 1, [file: 'lib/supervisor.ex', line: 57]}, {:supervisor, :init, 1, [file: 'supervisor.erl', line: 295]}, {:gen_server, :init_it, 2, [file: 'gen_server.erl', line: 374]}, {:gen_server, :init_it, 6, [file: 'gen_server.erl', line: 342]}]}}}
21:54:50.014 [error] CRASH REPORT Process <0.371.0> with 0 neighbours exited with reason: bad return value ok from 'Elixir.NoSlides.Application':start(normal, []) in application_master:init/4 line 138
21:54:50.014 [info] Application noslides exited with reason: bad return value ok from 'Elixir.NoSlides.Application':start(normal, [])
** (Mix) Could not start application noslides: NoSlides.Application.start(:normal, []) returned a bad value: :ok
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
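
For illustration, a sketch of such a pre-flight check (hypothetical; not libcluster's actual behavior):

# Fail fast with a readable message when :topologies is missing,
# instead of crashing later inside Enumerable.
case Application.get_env(:libcluster, :topologies) do
  topologies when is_list(topologies) ->
    topologies

  nil ->
    raise ArgumentError, "libcluster: missing :topologies configuration"
end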

libcluster v2.0.0+ not compatible with v2 config

Hi Paul,

First of all, thanks for your amazing work on many different elixir projects like Timex and exrm - all of which have been great contributions to the elixir community and which we have been very happy to work with at Lix.

Unfortunately it appears that version 2 of libcluster was put out a bit too hastily. There seems to have been a mix-up between the new way of configuring libcluster and old code remaining that uses the v1 style of config. My guess is that in the projects where you test libcluster you have remnants of old v1 configuration which allow your testing to work, but with a clean v2 config it doesn't.

The reason I say this is that I found a few different problems running version 2:

Kubernetes strategy fails to initialize:
When trying to fetch topology on L38 from opts, we get a KeyError, because what is actually passed as opts is a list with a keyword list as its only element, due to the wrapping of opts in a list on L35. Removing this wrapping solves the problem.

Kubernetes strategy fails to look up kubernetes_node_basename and kubernetes_selector:
When trying to get the app name and selector on L101 from the configuration, we get warning logs on L144 due to app_name being nil. This is because even in v2, libcluster tries to look up the configuration in the v1 format. A change which uses Keyword.fetch! to look up the app_name and selector from the config part of the state fixes this problem.

Best regards

Simon

Libcluster crash when using the `.hosts.erlang` clustering strategy

Hi @bitwalker! I'm trying to use the ErlangHosts strategy but it crashes with this stacktrace:

20:18:28.149 [info]  Application libcluster exited: Cluster.App.start(:normal, []) returned an error: shutdown: failed to start child: Cluster.Strategy.ErlangHosts
    ** (EXIT) an exception was raised:
        ** (FunctionClauseError) no function clause matching in Cluster.Strategy.ErlangHosts.configured_timeout/1
            (libcluster) lib/strategy/erlang_hosts.ex:47: Cluster.Strategy.ErlangHosts.configured_timeout([topology: :erlang_hosts, connect: {:net_kernel, :connect, []}, disconnect: {:net_kernel, :disconnect, []}, list_nodes: {:erlang, :nodes, [:connected]}, config: [timeout: 30000]])
            (libcluster) lib/strategy/erlang_hosts.ex:37: Cluster.Strategy.ErlangHosts.init/1
            (stdlib) gen_server.erl:365: :gen_server.init_it/2
            (stdlib) gen_server.erl:333: :gen_server.init_it/6

From what I can see, it could be because configured_timeout is not called with a map like %{opts: opts}, but with the opts keyword list itself.

Do you wish me to open a PR to fix this? :)

How do you want to handle further strategies?

As I saw in the latest pull request, you are not a fan of rolling strategies into the mainline libcluster repo.

We developed an EC2 strategy for libcluster and would like to share it with the community, but without a reference I am afraid that nobody will find it.

Btw thank you for your amazing contributions to the elixir community!

How to ensure that the cluster has connected before starting dependent GenServer

I may start my application like this:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    topologies = [
      example: [
        strategy: Cluster.Strategy.Epmd,
        config: [hosts: [:"[email protected]", :"[email protected]"]],
      ]
    ]
    children = [
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
      {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()}
    ]
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

The Pow.Store.Backend.MnesiaCache is a GenServer that will connect to an Mnesia cluster if there are any connected nodes that are also running Mnesia. However, since Node.list/0 will just return an empty list here, I need to make sure that libcluster has already connected before calling Node.list/0 for MnesiaCache.

What would be the best way to deal with this?

Also, thanks for this library @bitwalker!

possible race condition when ensuring exported functions.

👋

While playing around with libcluster and partisan I keep getting the following error when I kill one of the nodes and then start it again:

21:15:36.313 [error] CRASH REPORT Process <0.311.0> with 0 neighbours crashed with reason: #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"Elixir.PC.list_nodes/0 is undefined!">>} in 'Elixir.Cluster.Strategy':'ensure_exported!'/3 line 156
21:15:36.314 [error] Supervisor 'Elixir.PC.Cluster.Supervisor' had child 'Elixir.Cluster.Strategy.Gossip' started with 'Elixir.Cluster.Strategy.Gossip':start_link([#{'__struct__' => 'Elixir.Cluster.Strategy.State',config => [],connect => {'Elixir.PC',connect_node,...},...}]) at <0.311.0> exit with reason #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"Elixir.PC.list_nodes/0 is undefined!">>} in 'Elixir.Cluster.Strategy':'ensure_exported!'/3 line 156 in context child_terminated
21:15:36.314 [error] Supervisor 'Elixir.PC.Cluster.Supervisor' had child 'Elixir.Cluster.Strategy.Gossip' started with 'Elixir.Cluster.Strategy.Gossip':start_link([#{'__struct__' => 'Elixir.Cluster.Strategy.State',config => [],connect => {'Elixir.PC',connect_node,...},...}]) at <0.311.0> exit with reason reached_max_restart_intensity in context shutdown
21:15:36.314 [error] Supervisor 'Elixir.PC.Supervisor' had child 'Elixir.Cluster.Supervisor' started with 'Elixir.Cluster.Supervisor':start_link([[{example,[{strategy,'Elixir.Cluster.Strategy.Gossip'},{connect,{'Elixir.PC',connect_node,[]}},...]}],...]) at <0.305.0> exit with reason shutdown in context child_terminated
21:15:36.316 [error] gen_server <0.313.0> terminated with reason: #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"Elixir.PC.list_nodes/0 is undefined!">>} in 'Elixir.Cluster.Strategy':'ensure_exported!'/3 line 156

Seems like :libcluster tries to check that PC.list_nodes/0 is exported before the PC module is loaded?

Unable to establish cluster in Kubernetes using Kubernetes.DNS strategy.

I'm currently deploying my distillery application to a Kubernetes cluster (v1.9.7). I have set up the headless service, configured my release with the appropriate vm.args and configured my application to set up the topology and start the cluster supervisor. I have confirmed that each of the pods created by my deployment can, in fact, ping every other pod in the deployment. I have confirmed that the DNS record for the headless service correctly resolves (indeed it appears libcluster resolves the node IPs correctly as well). However, I am still getting the following error:

09:48:33.964 [warn] [libcluster:kubernetes] unable to connect to :"[email protected]"
09:48:33.965 [warn] [libcluster:kubernetes] unable to connect to :"[email protected]"
09:48:33.966 [error] GenServer #PID<0.1888.0> terminating
** (FunctionClauseError) no function clause matching in Cluster.Strategy.Kubernetes.DNS.load/1
    (libcluster) lib/strategy/kubernetes_dns.ex:55: Cluster.Strategy.Kubernetes.DNS.load({:noreply, %Cluster.Strategy.State{config: [service: "contrasting-lambkin-flywheel-remoting.flywheel.svc.cluster.local", application_name: "flywheel", polling_interval: 20000], connect: {:net_kernel, :connect_node, []}, disconnect: {:erlang, :disconnect_node, []}, list_nodes: {:erlang, :nodes, [:connected]}, meta: #MapSet<[:"[email protected]"]>, topology: :kubernetes}})
    (libcluster) lib/strategy/kubernetes_dns.ex:48: Cluster.Strategy.Kubernetes.DNS.handle_info/2
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: :timeout
09:48:33.971 [info] Application flywheel exited: shutdown
{"Kernel pid terminated",application_controller,"{application_terminated,flywheel,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,flywheel,shutdown})

It would appear that, for some reason, the VM is unable to connect to any of its peers. As I mentioned, I've verified that the network traffic is unimpeded, so I'm figuring it must simply be some configuration mistake I've made, but I cannot for the life of me determine what it is.

Allow multiple discoveries to be started

And allow the connect/disconnect callbacks to be configured. It is important to check the result of the callbacks, as returning :ignored means the node is not part of a network, and false means it cannot connect. A sketch of such a check follows.
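
A sketch (assuming the default :net_kernel.connect_node/1 plumbing; node is the target node name, an atom):

case :net_kernel.connect_node(node) do
  true -> :connected
  false -> :unable_to_connect
  # :ignored is returned when the local node is not alive,
  # i.e. it is not part of a distributed network
  :ignored -> :not_part_of_network
end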

Gossip strategy example not working

Just copy pasting the example gossip strategy config from the readme results in this:

01:02:21.787 [info] Application libcluster exited: Cluster.App.start(:normal, []) returned an error: shutdown: failed to start child: Cluster.Strategy.Gossip
    ** (EXIT) an exception was raised:
        ** (KeyError) key :topology not found in: [[topology: :gossip_example, connect: {:net_kernel, :connect, []}, disconnect: {:net_kernel, :disconnect, []}, config: [port: 45892, if_addr: {0, 0, 0, 0}, multicast_addr: {230, 1, 1, 251}, multicast_ttl: 1]]]
            (elixir) lib/keyword.ex:343: Keyword.fetch!/2
            (libcluster) lib/strategy/gossip.ex:46: Cluster.Strategy.Gossip.init/1
            (stdlib) gen_server.erl:365: :gen_server.init_it/2
            (stdlib) gen_server.erl:333: :gen_server.init_it/6
            (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

I can't figure out why the opts being passed to the gossip worker are being wrapped in an extra list like that, when this doesn't happen for the other strategies...

Headless k8 service instead of relying on K8 APIs

One of the ways we solved this was by creating a headless Service for all the Deployments or DaemonSets that should join the cluster. That way, when the application launches using <app@ip>, it can do a lookup against the service name and get the addresses of the rest of the cluster to join.

I am not sure, but my guess is (and correct me if I am wrong) that using the Kubernetes API might break requirements for some people who don't want to expose those things on the pods/containers.

Help with configuring/using Kubernetes strategy

I'm trying to understand why my topology config isn't working, and I guess this line is a little bit confusing.

app_name looks like it's the application name and not the kubernetes_node_basename

app_name = Keyword.fetch!(config, :kubernetes_node_basename)

I would be grateful if I could find an example using the strategy Elixir.Cluster.Strategy.Kubernetes with full configuration; see the sketch below.
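
For reference, a minimal sketch of such a configuration (the values are illustrative; the authoritative option list is in the Cluster.Strategy.Kubernetes moduledoc):

config :libcluster,
  topologies: [
    k8s_example: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        # pods are selected by label, and node names are expected
        # to be "<basename>@<pod ip>"
        kubernetes_selector: "app=myapp",
        kubernetes_node_basename: "myapp"
      ]
    ]
  ]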

Thanks 😄

Cluster.Strategy.Gossip in Windows

I am trying to implement Cluster.Strategy.Gossip in a Windows x64 environment for a classroom project, but I always get this error when launching the second node. Is there any known solution for implementing this strategy on Windows?

[info] Application chat exited: Chat.start(:normal, []) returned an error: shutdown: failed to start child: Cluster.Supervisor
    ** (EXIT) shutdown: failed to start child: Cluster.Strategy.Gossip
        ** (EXIT) an exception was raised:
            ** (MatchError) no match of right hand side value: {:error, :eaddrinuse}
                (libcluster) lib/strategy/gossip.ex:89: Cluster.Strategy.Gossip.init/1
                (stdlib) gen_server.erl:374: :gen_server.init_it/2
                (stdlib) gen_server.erl:342: :gen_server.init_it/6
                (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
** (Mix) Could not start application chat: Chat.start(:normal, []) returned an error: shutdown: failed to start child: Cluster.Supervisor
    ** (EXIT) shutdown: failed to start child: Cluster.Strategy.Gossip
        ** (EXIT) an exception was raised:
            ** (MatchError) no match of right hand side value: {:error, :eaddrinuse}
                (libcluster) lib/strategy/gossip.ex:89: Cluster.Strategy.Gossip.init/1
                (stdlib) gen_server.erl:374: :gen_server.init_it/2
                (stdlib) gen_server.erl:342: :gen_server.init_it/6
                (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

unable to connect to :nonode@nohost warning using Gossip strategy

libcluster version: 3.1.1

I start my server as usual: PORT=4000 iex --name [email protected] -S mix phx.server
I get this warning when I start just 1 node, but also when I have 2 nodes connected. (Screenshot omitted.)

How can I turn this warning off? I tried this in dev.exs, but no luck

config :libcluster, warn: false

I've tried rm -rf _build/ deps/ mix.lock, then mix deps.get && mix deps.compile, and restarted the server, but am still getting this. It's the infinite warning spam that is getting very annoying.

Thanks!

Kubernetes Strategy: Why do node names need to be the same?

If I missed an issue where this is discussed, I'd appreciate knowing where to read up.

If this hasn't been discussed before, I would be interested in understanding.

Why do nodes need to have the same basename? Nodes with different names can connect to one another in dist erl; all that is needed is a shared cookie, and nodes that can find other nodes will connect.

Is this just something that this library hasn't decided to support? Is there a reason this doesn't work that I'm missing?

Release with `kubernetes_ip_lookup_mode`

Hi 👋

I would like to know if there's a plan to release a new version of libcluster any time soon with support for the kubernetes_ip_lookup_mode option.

We're relying on the pod mode to only trigger a disconnect call when a pod receives a SIGKILL. Otherwise we have a graceful shutdown process handled on the pod that receives the SIGTERM.

I'm also curious to know if there's a plan to change the default value to pod, because although I agree that endpoints provides a safe default during pod initialisation, it also results in the other pods receiving a disconnect call while a pod is (potentially) being gracefully terminated.
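
For context, a sketch of opting into that mode (assuming a libcluster version that ships the option, with :endpoints as the default and :pods as the alternative):

config: [
  # look up IPs via pods rather than endpoints
  kubernetes_ip_lookup_mode: :pods,
  kubernetes_selector: "app=myapp",
  kubernetes_node_basename: "myapp"
]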

Thanks 🙏

Cannot get connection id for node in OTP 21.0

Creating issue for tracking.

Given OTP 21.0, Elixir 1.6.6 and libcluster 3.0.1:

iex --name [email protected] --cookie test -S mix phx.server
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

[info] [swarm on [email protected]] [tracker:init] started
[error]
** Cannot get connection id for node :"[email protected]"

[warn] [libcluster:db] unable to connect to :"[email protected]"
[warn] [libcluster:db] unable to connect to :"[email protected]"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex([email protected])1> 21:00:06 - info: compiled 6 files into 2 files, copied 3 in 739 ms
[info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
[info] [swarm on [email protected]] [tracker:cluster_wait] no connected nodes, proceeding without sync

Given OTP 20.3.8, Elixir 1.6.6 and libcluster 3.0.1:

iex --name [email protected] --cookie test -S mix phx.server
Erlang/OTP 20 [erts-9.3.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

[info] [swarm on [email protected]] [tracker:init] started
[info] [libcluster:db] connected to :"[email protected]"
[warn] [libcluster:db] unable to connect to :"[email protected]"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex([email protected])1> 21:47:42 - info: compiled 6 files into 2 files, copied 3 in 730 ms
[info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
[info] [swarm on [email protected]] [tracker:cluster_wait] no connected nodes, proceeding without sync

Hostname is incorrect after deploying to k8s

I am using Cluster.Strategy.Kubernetes.DNSSRV as the strategy. Everything works fine on the initial deploy, but after a few deploys I get the following error:

[backend-elixir-0] {"context":{"runtime":{"application":"libcluster","file":"lib/logger.ex","function":"error/2","line":17,"module_name":"Cluster.Logger","vm_pid":"<0.3785.0>"},"system":{"hostname":"backend-elixir-0","pid":1}},"dt":"2019-10-06T23:46:44.601190Z","event":null,"level":"error","message":"[libcluster:k8s] 'backend-elixir.backend-elixir-staging.svc.cluster.local.' : lookup against backend-elixir failed: :nxdomain"}
[backend-elixir-0] {"context":{"runtime":{"application":"libcluster","file":"lib/logger.ex","function":"error/2","line":17,"module_name":"Cluster.Logger","vm_pid":"<0.3785.0>"},"system":{"hostname":"backend-elixir-0","pid":1}},"dt":"2019-10-06T23:46:44.604630Z","event":null,"level":"error","message":"[libcluster:k8s] 'backend-elixir.backend-elixir-staging.svc.cluster.local.' : lookup against backend-elixir failed: :nxdomain"}

The only way to recover from it is to delete the mnesia cache disk for each of the pods and restart everything.

The only thing using mnesia so far is pow.

I am new to Elixir, but I am happy to gather more information if someone points me in the right direction.

Failed connection to Kubernetes API via HTTPS

@bitwalker, first, thanks for the great library.
Trying to use the Kubernetes strategy, but it keeps failing on the request to Kubernetes with this error:
2017-06-04T10:14:03.029535969Z 10:14:03.029 [error] [libcluster:examplephx] request to kubernetes failed!: {:failed_connect, [{:to_address, {'kubernetes.default.svc.cluster.local', 443}}, {:inet, [:inet], {:eoptions, {:undef, [{:ssl, :connect, ['kubernetes.default.svc.cluster.local', 443, [:binary, {:active, false}, {:ssl_imp, :new}, :inet, {:verify, :verify_none}], :infinity], []}, {:http_transport, :connect, 4, [file: 'http_transport.erl', line: 109]}, {:httpc_handler, :connect, 4, [file: 'httpc_handler.erl', line: 902]}, {:httpc_handler, :connect_and_send_first_request, 3, [file: 'httpc_handler.erl', line: 916]}, {:httpc_handler, :init, 1, [file: 'httpc_handler.erl', line: 243]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}}}]}

I have :

  • a secret containing the ERLANG_COOKIE,
  • a config map from this vm.args file:

## Name of the node
-name ${MY_BASENAME}@${MY_POD_IP}

## Cookie for distributed erlang
-setcookie ${ERLANG_COOKIE}

## Enable SMP automatically based on availability
-smp auto

  • and a replication controller running the app with three pods which is exposed to a service.

It seems that the application can't connect to an HTTPS site (HTTP works), although I have libssl-dev installed.

I'd appreciate it if you could help me figure out what I am doing wrong or what I am missing.
Thanks in advance!

Gossip if_addr is not being sanitized if configured as a binary

Using one of the provided sample configurations, the Gossip strategy will fail to start, as only the multicast_addr is being sanitized from a binary to a four-element tuple. Sample configuration:

config :libcluster,
  topologies: [
    gossip_example: [
      strategy: Elixir.Cluster.Strategy.Gossip,
      config: [
        port: 45892,
        if_addr: "0.0.0.0",
        multicast_addr: "230.1.1.251",
        multicast_ttl: 1
      ]
    ]
  ]

No function clause matching in Mix.Releases.Runtime.Control.ping/2

I apologize if this is just a dumb config issue, but I've been using libcluster since Kubernetes support was added, without issue, until today.

I am trying to deploy and keep getting the following error instead of a ping. Any thoughts?

** (FunctionClauseError) no function clause matching in Mix.Releases.Runtime.Control.ping/2    
    
    The following arguments were given to Mix.Releases.Runtime.Control.ping/2:
    
        # 1
        []
    
        # 2
        %{cookie: :totallyFakeCookie}
    
    Attempted function clauses (showing 1 out of 1):
    
        def ping(+_args+, -%{peer: peer}-)
    
    (distillery) lib/mix/lib/releases/runtime/control.ex:395: Mix.Releases.Runtime.Control.ping/2
    (distillery) lib/entry.ex:44: Mix.Releases.Runtime.Control.main/1
    (stdlib) erl_eval.erl:677: :erl_eval.do_apply/6
    (elixir) lib/code.ex:232: Code.eval_string/3

New Strategy: DNS or Consul based discovery

Would you be interested in a PR adding Consul and (m)DNS based clustering support?

Consul: we can listen to Consul services, and if a new member is added to a service, we consider it part of the Erlang cluster.

We could also generalise this using DNS: you make an SRV request to workers.my-internal.network.local, see how many records come back, and join these into the cluster. (Consul gives us such a DNS-based interface.) I think this implementation would actually generalise both the Kubernetes and Consul strategies, as Kubernetes exposes services using KubeDNS; see the lookup sketch below.
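
For illustration, a sketch of that SRV lookup using OTP's resolver (the hostname is the hypothetical one from above):

# Each SRV record comes back as {priority, weight, port, target}.
records = :inet_res.lookup('workers.my-internal.network.local', :in, :srv)

for {_priority, _weight, _port, target} <- records do
  # e.g. turn each target host into a node name to connect to
  :"myapp@#{target}"
end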

We could further generalise this to mDNS, which uses multicast UDP to find peers, similar to the current multicast discovery protocol but standardised by an RFC.

I have some free time this summer, and I would like to work on this to learn some Erlang and Elixir.
I was wondering if you have already given this thought, and if you have any suggestions in this direction.

Does the release name matter?

Hi! I ran into an interesting issue today and was wondering if I'm doing something wrong.

My application is called :server i.e. in mix.exs it says app: :server

When I set my rel/config.exs release name to :server all works fine i.e.

release :server do
...

If I change it to something else, though, Distillery complains about missing :applications e.g.

release :foo do
...

Errors I get look like this

$ MIX_ENV=prod mix release --env=prod
==> Assembling release..
==> Building release foo:0.0.1 using environment prod
==> One or more direct or transitive dependencies are missing from
    :applications or :included_applications, they will not be included
    in the release:

    :phoenix
    :phoenix_ecto
    :postgrex
...

Is this expected behavior or could I be doing something wrong? I'm using Distillery 1.5.2 and Phoenix 1.3.0.

Thanks!!

New release needed for OTP 21.0 compatibility

With the release of OTP 21.0, the latest version available on Hex (2.5.0) isn't compatible with it.

01:36:19.228 [info]  Application libcluster exited: Cluster.App.start(:normal, []) returned an error: shutdown: failed to start child: Cluster.Strategy.Epmd
    ** (EXIT) an exception was raised:
        ** (RuntimeError) net_kernel.connect/1 is undefined!
            (libcluster) lib/strategy/strategy.ex:117: Cluster.Strategy.ensure_exported!/3
            (libcluster) lib/strategy/strategy.ex:43: anonymous fn/6 in Cluster.Strategy.connect_nodes/4
            (elixir) lib/enum.ex:1899: Enum."-reduce/3-lists^foldl/2-0-"/3
            (libcluster) lib/strategy/strategy.ex:41: Cluster.Strategy.connect_nodes/4
            (libcluster) lib/strategy/epmd.ex:27: Cluster.Strategy.Epmd.start_link/1
            (stdlib) supervisor.erl:379: :supervisor.do_start_child_i/3
            (stdlib) supervisor.erl:365: :supervisor.do_start_child/2
            (stdlib) supervisor.erl:349: anonymous fn/3 in :supervisor.start_children/2
            (stdlib) supervisor.erl:1157: :supervisor.children_map/4
            (stdlib) supervisor.erl:315: :supervisor.init_children/2
            (stdlib) gen_server.erl:374: :gen_server.init_it/2
            (stdlib) gen_server.erl:342: :gen_server.init_it/6
            (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

Any chance we can get a new libcluster version deployed?

Lib cluster elixir 1.3.4 and Erlang 19.0 with Swarm

Hi Bitwalker,

I am getting the following error when trying to run swarm with libcluster.

c:\Users\tom\IdeaProjects\swarmtest>iex --name [email protected] -S mix
Eshell V8.0 (abort with ^G)

=INFO REPORT==== 11-Oct-2016::10:52:13 ===
application: logger
exited: stopped
type: temporary
** (Mix) Could not start application libcluster: Cluster.App.start(:normal, []) returned an error: shutdown: failed to start child: Cluster.Strategy.Gossip
** (EXIT) an exception was raised:
** (UndefinedFunctionError) function Logger.info/1 is undefined or private. Did you mean one of:

  * info/1
  * info/2

        (logger) Logger.info("[libcluster] [strategy:gossip] starting")
        (libcluster) lib/strategy/gossip.ex:39: Cluster.Strategy.Gossip.init/1
        (stdlib) gen_server.erl:328: :gen_server.init_it/6
        (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

I have attached the mix project in case I am doing something silly, but it all looks fine to me.

swarmtest.zip

Still fairly new to Elixir and wanting to use this project on my Pi cluster (but testing on Windows 7).

Any help appreciated!

thanks

Shifters

Umbrella app setup issues

I have a Phoenix umbrella app with a root mix.exs of:

  defp deps do
    [{:libcluster, github: "bitwalker/libcluster"}]
  end

and the App's application.ex

defmodule App.Application do
  # See https://hexdocs.pm/elixir/Application.html
  # for more information on OTP Applications
  @moduledoc false

  use Application

  def start(_type, _args) do
    topologies = Application.get_env(:libcluster, :topologies) || []

    children = [
      # Start the Ecto repository
      App.Repo,
      # Start the PubSub system
      {Phoenix.PubSub, name: App.PubSub},
      # Start a worker by calling: App.Worker.start_link(arg)
      # {App.Worker, arg}
      {Cluster.Supervisor, [topologies, [name: App.ClusterSupervisor]]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: App.Supervisor)
  end
end

Starting two instances with:

PORT=4000 iex --sname a -S mix phx.server
PORT=4001 iex --sname b -S mix phx.server

gives me:

[warn] [libcluster:app] unable to connect to :a@50
and
[warn] [libcluster:app] unable to connect to :b@50

Is there any special setup required for umbrella apps? Also, does it matter in which app (App or AppWeb) it's started?

:list_nodes error

Seeing this in my VS Code output logs (though the app seems to work fine):

Compiling with Mix env test
==> libcluster
Compiling 11 files (.ex)

== Compilation error in file lib/strategy/dns_poll.ex ==
** (CompileError) lib/strategy/dns_poll.ex:48: unknown key :list_nodes for struct Cluster.Strategy.State

[Warn  - 9:11:40 AM] could not compile dependency :libcluster, "mix compile" failed. You can recompile this dependency with "mix deps.compile libcluster", update it with "mix deps.update libcluster" or clean it with "mix deps.clean libcluster"

Tried pulling from master, with no change. Any help would be appreciated!

Very noisy logging using gossip strategy

Is this expected?

(Screenshot omitted.)

It logs output at least once a second, sometimes more. This seems a bit overly chatty to me, and would blow up our centralized logging service...

Possibility to await first Lookup

Is it possible to await the first lookup until the application is started?

I have an application that should not get into the situation where there are name conflicts in the cluster. Because of that it is important that the worker does not start until the cluster is detected.

How can I best achieve that?

Reason for disconnecting from a leaving node on Kubernetes strategy

Hey there, first of all thank you for this library 👍

While using the Kubernetes strategy we had some issues while trying to gracefully shutdown a node from the cluster, by trapping the SIGTERM signal.

When a node receives a SIGTERM, the node is removed from the list of pods in the k8s internal API, so the Kubernetes strategy disconnects the node from the cluster.

Since we are using Swarm to distribute processes around the cluster, we have a problem when the node is disconnected from the cluster (sooner than we expected): Swarm will redistribute the node's processes across the rest of the cluster while these processes are still being handled by the node. Even worse, Swarm will spawn the processes that were running in the rest of the cluster on the current node.

To fix this we used an altered version of the Kubernetes strategy which does not disconnect from removed pods, since they will eventually disconnect by themselves, but we would like to know if there is any reason for explicitly disconnecting from a removed pod, and what consequences we could face by not doing it.

How to test locally?

Hi, sorry to ask a question on the issue tracker, but not sure where else to ask.

How can I play around with this locally? If I use the Epmd strategy, then I can only start up a single Erlang VM, and it claims to connect to both hosts in the config (which doesn't make sense to me).

$ iex --cookie foo --sname node1 -S mix
11:34:55.841 [info]  [libcluster:example] connected to :node1@haswell
11:34:55.841 [info]  [libcluster:example] connected to :node2@haswell

$ iex --cookie foo --sname node2 -S mix
** (Mix) Could not start application libcluster: Cluster.App.start(:normal, []) returned an error: shutdown: failed to start child: Cluster.Strategy.Epmd
    ** (EXIT) an exception was raised:
        ** (RuntimeError) Elixir.ClusterWork.connect/1 is undefined!

But really I would like to get the Gossip strategy to work locally. I created a second IP address on en0 like so:

$ ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	ether 60:f8:1d:b3:58:d8
	inet6 fe80::148c:37c9:89e3:157d%en0 prefixlen 64 secured scopeid 0x4
	inet 192.168.1.68 netmask 0xffffff00 broadcast 192.168.1.255
	inet 192.168.1.69 netmask 0xffffff00 broadcast 192.168.1.255
	nd6 options=201<PERFORMNUD,DAD>
	media: autoselect
	status: active

And my libcluster config looks like:

config :libcluster,
  topologies: [
    gossip_example: [
      strategy: Cluster.Strategy.Gossip,
      config: [
        port: 45892,
        if_addr: {192,168,1,68},
        # if_addr: {192,168,1,69},
        multicast_addr: {228,6,7,8},
        # a TTL of 1 remains on the local network,
        # use this to change the number of jumps the
        # multicast packets will make
        multicast_ttl: 1
      ],
      connect: {ClusterWork, :connect, []}
    ]
  ]

And I start two iex sessions:

$ iex --cookie foo --sname node1 -S mix
(edit config to use second if_addr)
$ iex --cookie foo --sname node2 -S mix

And they both just heartbeat but never connect.

Thanks for the help.
