Comments (7)

deniseyu commented on June 18, 2024

hi @Marc-Assmann , sorry it's taken a while for someone to get back to you.

Without being able to poke around at this environment now, it's difficult to say what exactly went wrong. there are questions that I would ask about the deployment, but I can appreciate that since this was 2 months ago, that data might not be around anymore! have you run into this problem since/is this an issue your team is still facing?

from concourse-bosh-release.

alexbakar commented on June 18, 2024

I was able to reproduce this issue several times on 5.2.0 and 5.5.0. Our setup contains two web nodes and six workers. I have analyzed the issue and this is my theory for what happens:

  • a redeployment of Concourse is triggered (on BOSH level);
  • when the draining starts, the USR2 signal triggers the retirement of the worker. A corresponding retire command is correctly sent to the TSA, which can then forward it to the ATC (both running on a web node);
  • as the whole Concourse deployment is being redeployed, it might happen that the web and worker nodes are being redeployed at the same time;
  • in some cases the web node VM (to which the retire command has been sent) might be recreated; note that the ATC on this old web node has not marked the worker as "retiring" in the DB;
  • during the recreation of the web node there seems to be no meaningful response to the calling worker node; thus the worker's drain logic doesn't react;
  • the drain logic on the worker node just waits until the timeout passes (in our case it is 12 hours). In that period new jobs are still being scheduled on this worker (and executed). After the timeout the drain script signals the worker process with TERM, QUIT, and finally KILL, and after that the VM can be recreated.
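
The drain behaviour described in the last bullet can be sketched roughly as follows. This is only an illustration of the sequence (retire signal, wait for the timeout, then TERM, QUIT, KILL), not the actual drain script from the release. The names `drain`, `send_signal`, `is_running`, `timeout_s` and `grace_s` are hypothetical.

```python
import signal
import time
from typing import Callable, List


def drain(
    send_signal: Callable[[signal.Signals], None],  # hypothetical: delivers a signal to the worker
    is_running: Callable[[], bool],                 # hypothetical: True while the worker process lives
    timeout_s: float,                               # drain timeout (12 hours in the setup above)
    grace_s: float = 10.0,                          # assumed wait between escalation signals
    poll_s: float = 1.0,
) -> List[str]:
    """Return the names of the signals sent, in order."""
    sent: List[str] = []

    def send(sig: signal.Signals) -> None:
        sent.append(sig.name)
        send_signal(sig)

    def wait_exit(limit_s: float) -> bool:
        # Poll until the worker exits or the limit passes.
        deadline = time.monotonic() + limit_s
        while is_running():
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)
        return True

    # Ask the worker to retire. If the retire request never reaches a
    # live web node, nothing below notices until the timeout expires.
    send(signal.SIGUSR2)
    if wait_exit(timeout_s):
        return sent

    # Timeout reached: escalate until the process is gone.
    for sig in (signal.SIGTERM, signal.SIGQUIT, signal.SIGKILL):
        send(sig)
        if wait_exit(grace_s):
            break
    return sent
```

The point of the sketch is that the only exit conditions are "worker process gone" and "timeout reached", so a lost retire request turns into a full-length wait.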

I think concourse/concourse#2523 might be the same.

alexbakar commented on June 18, 2024

The same issue also happens if the draining performs "landing", not "retiring". I am wondering how to get more details from the web node, for example the logs from the "old" web node (as the old VM is being destroyed the logs are gone). Any ideas?

radoslav-tomov commented on June 18, 2024

Hi,

I've debugged this and managed to reproduce it locally via docker-compose. It happens when

  • a SIGUSR1/SIGUSR2 signal has been sent to the worker process
  • the worker receives the signal, but the call from the TSA client fails

Then the workerBeacon fails with some error and is restarted.

The bottom line is that, in the case of a BOSH deployment, the drain script continues to run until it reaches the drain timeout, followed by TERM/15/QUIT/2/KILL.

radoslav-tomov commented on June 18, 2024

It seems that the behavior can be explained by the web and worker instance groups being updated in parallel. In some cases, the timing is such that a web node and a worker node are updated simultaneously, which might cause the issue if the worker node is connected to the specific web node being updated. In our case, having just 2 web nodes makes the probability of hitting this exact race condition quite high.

It can be avoided by not running updates of both instance groups in parallel.
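
As rough arithmetic for the "just 2 web nodes" remark, here is a toy model, not a measurement. It assumes each worker is registered with one web node chosen uniformly at random, that every worker happens to drain while exactly one web node is being updated, and that workers are independent. Real deployments will usually do better than this pessimistic bound. The function name is hypothetical.

```python
def p_any_worker_affected(web_nodes: int, workers: int) -> float:
    """Chance that at least one draining worker talks to the web node being updated.

    Toy model only (assumptions, not taken from Concourse or BOSH):
    - each worker is connected to a uniformly random web node,
    - every worker drains while exactly one web node is down,
    - workers are independent of each other.
    """
    p_single = 1 / web_nodes                 # one worker hits the updating web node
    return 1 - (1 - p_single) ** workers     # at least one of the workers does


# Setup from this thread: two web nodes, six workers.
print(p_any_worker_affected(web_nodes=2, workers=6))
```

Under these assumptions a single worker has a 1 in 2 chance with two web nodes, and the chance that some worker is hit grows quickly with the number of workers.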

deniseyu commented on June 18, 2024

thanks @radoslav-tomov and @alexbakar for spending time investigating this!

> I am wondering how to get more details from the web node, for example the logs from the "old" web node (as the old VM is being destroyed the logs are gone).

I think the only way to get logs from a destroyed VM is to enable log forwarding at deploy time and have the logs exported to somewhere like Papertrail 😕

Next part is speculative, not necessarily a request for contributors to implement:

I wonder, since BOSH introduced a new lifecycle hook for pre_stop, if there's another way to address the timing issue without turning off parallel deployment. Maybe web nodes can wait (up to a configurable threshold) until there are no more registered workers before restarting, for example. That should allow some time overlap for the slowest BOSH operations without prolonging the overall deployment time by too much, I think.
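
The waiting idea above could look roughly like this. It is a speculative sketch in the same spirit as the suggestion, not an existing hook in the release. `count_registered_workers` is a hypothetical callable (for example, something that asks the web node how many workers it still has registered), and `threshold_s` is the configurable threshold mentioned above.

```python
import time
from typing import Callable


def wait_for_no_workers(
    count_registered_workers: Callable[[], int],  # hypothetical lookup, not a real Concourse API
    threshold_s: float,                           # configurable upper bound on the wait
    poll_s: float = 5.0,
) -> bool:
    """Block until no workers are registered or the threshold passes.

    Returns True if the web node saw zero registered workers, and
    False if it gave up because the threshold was reached.
    """
    deadline = time.monotonic() + threshold_s
    while count_registered_workers() > 0:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A pre_stop script would call this before letting the web node stop and would proceed either way. The return value only tells it whether the wait ended cleanly.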

cc @cirocosta and @pivotal-jamie-klassen for additional input!

alexbakar commented on June 18, 2024

Thanks @deniseyu for the help.

For the analysis of the issue I've forwarded the logs and later checked them using Kibana.
I've checked the pre_stop stage description. I wonder whether we always know the timing scheme of the redeployment (i.e. when the recreation of the group of web instances starts and when that of the group of workers starts). Can we miss some of the workers when the pre_stop of the web node is called? I mean, what is the chance of a worker recreation (and, respectively, its draining) being triggered right after the pre_stop of the web node has finished? If there is no such chance, then the approach of using pre_stop is fine. Otherwise the race condition might still happen in some cases.
