Consul fails to start (consul-helm, closed, 10 comments)

hashicorp commented on July 23, 2024
Consul fails to start

from consul-helm.

Comments (10)

jmreicha commented on July 23, 2024

Ugh, it looks like I was on an outdated version of the repo. After updating the repo and reinstalling, it is working now.

mitchellh commented on July 23, 2024

Does it never stabilize? It looks like it's starting to work; it just hasn't joined all the servers yet.

Also, PVCs might be a real issue for sure. I didn't actually realize that StatefulSets with PVCs can be started before the PVC is available (I guess we're spoiled by the environment we run in). Do you know what that looks like? Is the directory just not available yet? That might be something we have to build into an init container or something (to wait for it to be ready).
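
If the init-container idea were pursued, its job would be to block until the data directory is mounted and writable. A minimal sketch in Python, assuming the /consul/data mount path from the server pod spec and an arbitrary timeout and poll interval; wait_for_data_dir is a hypothetical name, not part of consul-helm. Note that the pod description later in this thread shows a pod with unbound claims staying unscheduled (FailedScheduling), and in that state no container, init or otherwise, runs at all.

```python
import os
import time


def wait_for_data_dir(path: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll until `path` is an existing, writable directory.

    Hypothetical init-container helper (not part of consul-helm); the default
    timeout and interval are illustrative assumptions.
    Returns True once the directory is usable, False if `timeout` elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        # The mounted volume must exist as a directory and accept writes.
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep no longer than the remaining time budget.
        time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
```

In an init container one could call wait_for_data_dir("/consul/data") and exit non-zero on False; since StatefulSet pods use restartPolicy: Always, the kubelet would then retry the init container.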

mmisztal1980 commented on July 23, 2024

@mitchellh No, it never does. Funnily enough, it used to work with the previous release of Rook (v0.7).

As I understand it, the StatefulSet starts and attempts to bind the PVCs; until that is done, the pod should report an unbound PVC issue, which should be easily visible via kubectl describe or kubectl logs.

mitchellh commented on July 23, 2024

But during that time, are the containers started?

Sorry, the easiest way to figure this out would be if you did more digging, or if I can get a reproduction. For the latter, is there an easy way for me to get a similar environment up and running?

mmisztal1980 commented on July 23, 2024

They appear to be; the logs are there.

I'm happy to do more digging. However, for you to get a repro, I'd have to provide you with Terraform files, Helm values files and k8s manifests to get a copy of my environment going, or I could simply share a kubeconfig file so that you can poke around, and leave it running for the night.

mmisztal1980 commented on July 23, 2024

@mitchellh I've cloned the repo and made some alterations:

  • values.yaml: server.storageclass: rook-ceph-block
  • server-statefulset.yaml: readinessProbe.initialDelaySeconds: 60

Results

  • All consul-* pods were up and running very quickly, but they were not Ready for about 50s.
  • The consul-server-* pods did not start immediately; they were stuck in the ContainerCreating state for approx. 50s before the 1st pod reported Running. None of the pods were Ready.
  • Once all 3 consul-server pods were up and running and the initialDelaySeconds period had passed, they started to pass the readiness check.
helm status consul
LAST DEPLOYED: Wed Sep 26 23:34:25 2018
NAMESPACE: service-discovery
STATUS: DEPLOYED

RESOURCES:
==> v1/Pod(related)
NAME             READY  STATUS   RESTARTS  AGE
consul-kktl5     1/1    Running  0         2m
consul-tbjmt     1/1    Running  0         2m
consul-x5cqq     1/1    Running  0         2m
consul-server-0  1/1    Running  0         2m
consul-server-1  1/1    Running  0         2m
consul-server-2  1/1    Running  0         2m

==> v1/ConfigMap
NAME                  DATA  AGE
consul-client-config  1     2m
consul-server-config  1     2m

==> v1/Service
NAME           TYPE       CLUSTER-IP   EXTERNAL-IP  PORT(S)                                                                  AGE
consul-dns     ClusterIP  10.3.228.30  <none>       53/TCP,53/UDP                                                            2m
consul-server  ClusterIP  None         <none>       8500/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP  2m
consul-ui      ClusterIP  10.3.98.102  <none>       80/TCP                                                                   2m

==> v1/DaemonSet
NAME    DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR  AGE
consul  3        3        3      3           3          <none>         2m

==> v1/StatefulSet
NAME           DESIRED  CURRENT  AGE
consul-server  3        3        2m

==> v1beta1/PodDisruptionBudget
NAME           MIN AVAILABLE  MAX UNAVAILABLE  ALLOWED DISRUPTIONS  AGE
consul-server  N/A            0                0                    2m

Compare to previous results

In the previous results, 0/1 means that the pod's single container is not Ready: for the pods in the Running state, the container is up but failing its readiness check.

NAME              READY     STATUS              RESTARTS   AGE
consul-56lvs      0/1       Running             0          36s
consul-jttwp      0/1       Running             0          36s
consul-qpgdn      0/1       Running             0          36s
consul-server-0   0/1       ContainerCreating   0          36s
consul-server-1   0/1       ContainerCreating   0          36s
consul-server-2   0/1       Running             0          36s
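
To list which pods are in this state, the READY column can be read straight from kubectl get pods output. A minimal sketch, assuming the default column layout with a header row (NAME, READY, STATUS, ...) and no spaces inside column values; not_ready_pods is a hypothetical helper name.

```python
from typing import List, Tuple


def not_ready_pods(table: str) -> List[Tuple[str, str, str]]:
    """Return (name, ready, status) for each pod whose ready count is below its total.

    Hypothetical helper; assumes `table` is plain `kubectl get pods` output
    with a header row and whitespace-separated columns.
    """
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header = lines[0].split()
    # Locate columns by name so an extra NAMESPACE column does not break parsing.
    name_i, ready_i, status_i = (header.index(col) for col in ("NAME", "READY", "STATUS"))
    result = []
    for line in lines[1:]:
        cols = line.split()
        ready, total = (int(n) for n in cols[ready_i].split("/"))
        if ready < total:
            result.append((cols[name_i], cols[ready_i], cols[status_i]))
    return result
```

Applied to the table above, it returns all six pods; applied to the helm status pod listing earlier, where every pod shows 1/1, it returns an empty list.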

Hypothesis (failureThreshold?)

This is what the docs say:

failureThreshold: When a Pod starts and the probe fails, Kubernetes will try failureThreshold times 
before giving up. Giving up in case of liveness probe means restarting the Pod. In case of readiness
probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.

In server-statefulset.yaml, the failure threshold is failureThreshold: 2 and the initial delay is initialDelaySeconds: 5.

Does this mean that Kubernetes started failing the checks and gave up before the PVC(s) were bound to the pods?
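
A simplified model may help test this hypothesis. As I understand the kubelet's behavior, a pod with a readiness probe starts out not Ready, probing begins after initialDelaySeconds and repeats every periodSeconds, and failureThreshold only controls how many consecutive failures flip a Ready pod back to Unready; unlike a liveness probe, a failing readiness probe neither restarts the container nor stops being run. The sketch below ignores probe timeouts and jitter, and the function names and the "leader elected at t" input are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def readiness_timeline(
    probe_passes: Callable[[float], bool],
    initial_delay: float,
    period: float,
    failure_threshold: int,
    success_threshold: int = 1,
    horizon: float = 120.0,
) -> List[Tuple[float, bool]]:
    """Simplified readiness-probe model: returns (time, ready) after each probe.

    Hypothetical sketch; ignores timeouts and jitter. Times are seconds since
    the container started.
    """
    ready = False  # a pod with a readiness probe starts out not Ready
    successes = failures = 0
    timeline = []
    t = initial_delay
    while t <= horizon:
        if probe_passes(t):
            successes, failures = successes + 1, 0
            if successes >= success_threshold:
                ready = True
        else:
            failures, successes = failures + 1, 0
            # Enough consecutive failures mark the pod Unready, but probing continues.
            if failures >= failure_threshold:
                ready = False
        timeline.append((t, ready))
        t += period
    return timeline


def first_ready(timeline: List[Tuple[float, bool]]) -> Optional[float]:
    """Time of the first probe after which the pod is Ready, or None."""
    return next((t for t, ready in timeline if ready), None)
```

With a leader elected at t = 50 s (an illustrative value, close to the ~50 s delay reported above), readiness_timeline(lambda t: t >= 50, 5, 3, 2) first reports Ready at t = 50, and with an initial delay of 60 it does so at t = 60. In this model the pod becomes Ready either way, so the probe thresholds alone would not keep it Unready once Consul has a leader.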

jmreicha commented on July 23, 2024

@mmisztal1980 I made the adjustments that you mentioned, but I am still seeing pods failing health checks and servers sitting in a Pending state.

consul             consul-dzjnx                              0/1     Running     0          4m
consul             consul-g8lmf                              0/1     Running     0          4m
consul             consul-kx8l6                              0/1     Running     0          4m
consul             consul-server-0                           0/1     Pending     0          4m
consul             consul-server-1                           0/1     Pending     0          4m
consul             consul-server-2                           0/1     Pending     0          4m

The Consul logs seem to indicate that the StatefulSet isn't binding the PVC. I am also using Rook/Ceph for storage.

pod description
Name:               consul-server-0
Namespace:          consul
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=consul
                    chart=consul-0.1.0
                    component=server
                    controller-revision-hash=consul-server-5cf54754b
                    hasDNS=true
                    release=consul
                    statefulset.kubernetes.io/pod-name=consul-server-0
Annotations:        consul.hashicorp.com/connect-inject: false
Status:             Pending
IP:
Controlled By:      StatefulSet/consul-server
Containers:
  consul:
    Image:       consul:1.2.3
    Ports:       8500/TCP, 8301/TCP, 8302/TCP, 8300/TCP, 8600/TCP, 8600/UDP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Command:
      /bin/sh
      -ec
      CONSUL_FULLNAME="consul"

      exec /bin/consul agent \
        -advertise="${POD_IP}" \
        -bind=0.0.0.0 \
        -bootstrap-expect=3 \
        -client=0.0.0.0 \
        -config-dir=/consul/config \
        -datacenter=dc1 \
        -data-dir=/consul/data \
        -domain=consul \
        -hcl="connect { enabled = true }" \
        -ui \
        -retry-join=${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
        -retry-join=${CONSUL_FULLNAME}-server-1.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
        -retry-join=${CONSUL_FULLNAME}-server-2.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
        -server

    Readiness:  exec [/bin/sh -ec curl http://127.0.0.1:8500/v1/status/leader 2>/dev/null | \
grep -E '".+"'
] delay=60s timeout=5s period=3s #success=1 #failure=2
    Environment:
      POD_IP:      (v1:status.podIP)
      NAMESPACE:  consul (v1:metadata.namespace)
    Mounts:
      /consul/config from config (rw)
      /consul/data from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lv4v8 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-consul-server-0
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      consul-server-config
    Optional:  false
  default-token-lv4v8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lv4v8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  11s (x245 over 10m)  default-scheduler  pod has unbound PersistentVolumeClaims (repeated 3 times)
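
The readiness probe in the description above runs curl against /v1/status/leader and greps for a non-empty quoted string, so a server only becomes Ready once the cluster has a leader (Consul returns "" while there is none), and with -bootstrap-expect=3 the servers wait for three of them to join before electing one. A rough Python equivalent for checking this by hand; the function names are hypothetical, and parsing the body as JSON is slightly stricter than the grep pattern.

```python
import json
import urllib.request


def leader_elected(body: str) -> bool:
    """Roughly mirror `grep -E '".+"'`: pass only for a non-empty JSON string."""
    try:
        leader = json.loads(body)
    except ValueError:
        return False
    return isinstance(leader, str) and leader != ""


def consul_ready(addr: str = "http://127.0.0.1:8500", timeout: float = 5.0) -> bool:
    """Hypothetical helper: query /v1/status/leader like the chart's curl probe."""
    try:
        with urllib.request.urlopen(f"{addr}/v1/status/leader", timeout=timeout) as resp:
            return leader_elected(resp.read().decode("utf-8", errors="replace"))
    except OSError:
        # Connection refused, timeouts and HTTP errors all count as "not ready".
        return False
```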

I checked the claims (kubectl get pvc) but don't see any claims made by Consul showing up there.
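
A StatefulSet names each claim from its volumeClaimTemplate as <template>-<statefulset>-<ordinal>, which is where data-consul-server-0 in the pod description comes from. A small sketch for spotting which expected claims are missing, assuming the existing claims come from kubectl get pvc -n consul -o name (one persistentvolumeclaim/<name> per line); the helper names are hypothetical.

```python
from typing import List, Set


def expected_claims(template: str, statefulset: str, replicas: int) -> List[str]:
    """PVC names a StatefulSet creates: <template>-<statefulset>-<ordinal>."""
    return [f"{template}-{statefulset}-{i}" for i in range(replicas)]


def claim_names(kubectl_output: str) -> Set[str]:
    """Turn `kubectl get pvc -o name` output into a set of bare claim names."""
    prefix = "persistentvolumeclaim/"
    names = set()
    for line in kubectl_output.splitlines():
        line = line.strip()
        if line:
            names.add(line[len(prefix):] if line.startswith(prefix) else line)
    return names


def missing_claims(existing: Set[str], template: str, statefulset: str, replicas: int) -> List[str]:
    """Expected claims (hypothetical check) that do not exist yet."""
    return [n for n in expected_claims(template, statefulset, replicas) if n not in existing]
```

For this chart, missing_claims(claim_names(output), "data", "consul-server", 3) would list whichever of the three server claims are absent from the namespace.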

mmisztal1980 commented on July 23, 2024

I've added a PR that makes the probe settings configurable via the chart values. That should help in our case.

mmisztal1980 commented on July 23, 2024

@mitchellh Would the above PR be satisfactory? Tweaking the probe settings seems to have fixed the issue for me and @jmreicha.

adilyse commented on July 23, 2024

As mentioned in a comment on the PR, these configuration options won't be added to values.yaml at this time. If you need further customization of the Helm chart, read more about the options here.