Comments (4)

mmontes11 commented on June 2, 2024

Hey @djakobczak1! Sorry for the late response, I was busy with the v0.0.26 and v0.0.27 releases, which notably improve Galera support.

Very accurate issue report! I've managed to reproduce it and will get back to you with a suggestion soon. 👍🏻

from mariadb-operator.

mmontes11 commented on June 2, 2024

@djakobczak1 I think I now understand the root cause of the problem and have a suggestion to solve it.

  1. operator sees that wsrep_cluster_size is now above MinClusterSize so it finishes reconciliation
  2. started pod fails during e.g. SST and restarts, now operator will not notice that

Indeed, once the cluster reaches a healthy state, reconciliation stops and the operator misses that wsrep_cluster_size has dropped below MinClusterSize.

Why?

Currently, the controller responsible for this subscribes to the events of the StatefulSet. Whenever something changes in the StatefulSet, a reconciliation is triggered to check whether the cluster is healthy, and it sets GaleraReady accordingly so that the other controller performs the recovery.

The problem is that there are situations like this one (but probably more) where the Pods reach a CrashLoopBackOff state, so nothing is changed in the StatefulSet, and therefore no reconciliations are enqueued.
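To illustrate the gap, here is a toy model in Python (not the operator's actual Go code). `ToyController`, its fields, and the dict standing in for the StatefulSet are all hypothetical, and the `>=` health comparison is an assumption. It only shows how a purely event-driven reconciler can keep a stale "ready" flag when the watched object does not change:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyController:
    """Hypothetical reconciler that only runs when the watched object changes."""

    min_cluster_size: int
    last_seen: Optional[dict] = None
    galera_ready: bool = False
    reconcile_count: int = 0

    def observe(self, statefulset: dict, wsrep_cluster_size: int) -> None:
        # An event is emitted only if the watched object differs from the
        # last version seen; otherwise no reconciliation is enqueued.
        if statefulset == self.last_seen:
            return
        self.last_seen = dict(statefulset)
        self.reconcile(wsrep_cluster_size)

    def reconcile(self, wsrep_cluster_size: int) -> None:
        self.reconcile_count += 1
        # Assumption: the cluster counts as healthy at or above the minimum.
        self.galera_ready = wsrep_cluster_size >= self.min_cluster_size


ctrl = ToyController(min_cluster_size=2)
ctrl.observe({"replicas": 3}, wsrep_cluster_size=3)  # object changed: reconciled
ctrl.observe({"replicas": 3}, wsrep_cluster_size=1)  # object unchanged: skipped
print(ctrl.reconcile_count, ctrl.galera_ready)
```

In this model the second call never reaches `reconcile`, so `galera_ready` stays `True` even though the cluster size fell below the minimum.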

How to solve it?

I've managed to trigger the recovery in the scenario you reported by requeuing reconciliations every 10s even if the cluster is healthy. I suggest we introduce a clusterMonitorInterval, by default set to 10s:

spec:
  galera:
    recovery:
      clusterMonitorInterval: 10s

This way, users can configure the interval at which the cluster is monitored, and we make sure we don't miss any event!
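As a rough sketch of the idea (in Python for brevity, not the operator's Go implementation), a reconcile step can always return a requeue delay taken from the configured interval, whether or not the cluster is healthy. The function names, the return shape, and the simple single-unit duration format below are assumptions made for illustration:

```python
import re
from datetime import timedelta

# Hypothetical mapping for simple single-unit durations such as "10s".
_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours"}


def parse_interval(text: str) -> timedelta:
    """Parse a duration like '10s' or '500ms' (single unit only)."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def reconcile(wsrep_cluster_size: int, min_cluster_size: int,
              monitor_interval: str = "10s"):
    """Return (galera_ready, requeue_after).

    The requeue delay is returned even when the cluster is healthy, so the
    next health check does not depend on a StatefulSet event arriving.
    """
    # Assumption: healthy means at or above the minimum cluster size.
    galera_ready = wsrep_cluster_size >= min_cluster_size
    return galera_ready, parse_interval(monitor_interval)


ready, requeue_after = reconcile(wsrep_cluster_size=3, min_cluster_size=2)
print(ready, requeue_after.total_seconds())
```

The trade-off is a small, constant amount of extra reconciliation work in exchange for a bounded detection delay.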

As an alternative, I thought about subscribing to the Pods' status, but there might be situations where no reconciliations are triggered for the Pods either, such as an advanced CrashLoopBackOff state.

What do you think?

djakobczak1 commented on June 2, 2024

Thank you for the detailed response.
The suggested solution with a configurable clusterMonitorInterval sounds good to me from a user perspective.

mmontes11 commented on June 2, 2024

Great, @djakobczak1! I will raise a PR soon to support this.
