Draining of concourse worker taking very long

concourse commented on June 18, 2024

Draining of concourse worker taking very long

from concourse-bosh-release.

Comments (7)

deniseyu commented on June 18, 2024

Hi @Marc-Assmann, sorry it's taken a while for someone to get back to you.

Without being able to poke around in this environment now, it's difficult to say what exactly went wrong. There are questions I would ask about the deployment, but I appreciate that since this was 2 months ago, that data might not be around anymore! Have you run into this problem since, and is this an issue your team is still facing?

alexbakar commented on June 18, 2024

I was able to reproduce this issue several times on 5.2.0 and 5.5.0. Our setup contains two web nodes and six workers. I have analyzed the issue and this is my theory for what happens:

a redeployment of Concourse is triggered (at the BOSH level);
when the draining starts, the USR2 signal triggers the retirement of the worker. A corresponding retire command is correctly sent to the TSA, which can then forward it to the ATC (both running on a web node);
as the whole Concourse deployment is being redeployed, it might happen that the web and worker nodes are redeployed at the same time;
in some cases the web node VM (to which the retire command has been sent) might be recreated. Note that the ATC on this old web node has not yet marked the worker as "retiring" in the DB;
during the recreation of the web node there seems to be no meaningful response to the calling worker node, so the worker's drain logic doesn't react;
the drain logic on the worker node just waits until the timeout passes (in our case it is 12 hours). In that period new jobs are still scheduled on this worker (and executed). After the timeout the drain script signals the worker process with TERM, QUIT, and finally KILL, and after that the VM can be recreated.
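
The waiting behavior in the last step can be sketched as a small simulation. This is not the actual drain script from concourse-bosh-release: the function `drain`, its parameters, and the `FakeClock` helper are hypothetical names chosen for illustration, and the pauses the real script may insert between signals are left out.

```python
from typing import Callable, List


class FakeClock:
    """Hypothetical test clock so the sketch runs without real waiting."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


def drain(worker_running: Callable[[], bool],
          send_signal: Callable[[str], None],
          drain_timeout: float,
          poll_interval: float,
          now: Callable[[], float],
          sleep: Callable[[float], None]) -> List[str]:
    """Simulate the drain wait and return the signals sent, in order."""
    sent: List[str] = []

    def signal(name: str) -> None:
        send_signal(name)
        sent.append(name)

    # Ask the worker to retire. If the web node that should record the
    # request is being recreated, nothing tells this loop the request was lost.
    signal("USR2")

    # Wait for the worker to exit on its own, up to the drain timeout.
    deadline = now() + drain_timeout
    while worker_running() and now() < deadline:
        sleep(poll_interval)

    # Escalate only once the wait is over and the worker is still alive.
    for name in ("TERM", "QUIT", "KILL"):
        if not worker_running():
            break
        signal(name)
    return sent
```

With a worker that never retires, a 12 hour timeout means the loop spends the full 12 hours polling before the first TERM is sent, which matches the delay described above.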

I think concourse/concourse#2523 might be the same.

alexbakar commented on June 18, 2024

The same issue also happens if the draining performs "landing", not "retiring". I am wondering how to get more details from the web node, for example the logs from the "old" web node (as the old VM is being destroyed the logs are gone). Any ideas?

radoslav-tomov commented on June 18, 2024

Hi,

I've debugged this and managed to reproduce it locally via docker-compose. It happens when:

a SIGUSR1/SIGUSR2 signal has been sent to the worker process;
the worker receives the signal, but the call from the TSA client fails.

The workerBeacon then fails with an error and is restarted.

The bottom line is that, in the case of a BOSH deployment, the drain script keeps running until it reaches the drain timeout, which is then followed by TERM/15/QUIT/2/KILL.
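
The string TERM/15/QUIT/2/KILL can be read as a signal escalation schedule. The sketch below assumes it alternates signal names with wait times in seconds (send TERM, wait 15, send QUIT, wait 2, send KILL); that reading is an assumption, not something the thread confirms, and `parse_schedule` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def parse_schedule(schedule: str) -> List[Tuple[str, Optional[int]]]:
    """Split 'SIG/secs/SIG/secs/SIG' into (signal, wait_seconds) steps.

    Assumption: a numeric part following a signal name is the number of
    seconds to wait after sending that signal. The last signal usually has
    no wait, which is represented as None.
    """
    parts = schedule.split("/")
    steps: List[Tuple[str, Optional[int]]] = []
    i = 0
    while i < len(parts):
        name = parts[i]
        wait: Optional[int] = None
        # Consume the following part as a wait time if it is numeric.
        if i + 1 < len(parts) and parts[i + 1].isdigit():
            wait = int(parts[i + 1])
            i += 1
        steps.append((name, wait))
        i += 1
    return steps
```

Under that assumption, `parse_schedule("TERM/15/QUIT/2/KILL")` yields three steps: TERM with a 15 second wait, QUIT with a 2 second wait, and a final KILL.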

radoslav-tomov commented on June 18, 2024

It seems the behavior can be explained by the web and worker instance groups being updated in parallel. In some cases the timing is such that a web node and a worker node are updated simultaneously, which can cause the issue if the worker node is connected to the specific web node being updated. In our case, having just 2 web nodes makes the probability of hitting this exact race condition quite high.

It can be avoided by not running updates of both instance groups in parallel.
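
To put a rough number on "quite high", here is a back-of-the-envelope sketch. It assumes each draining worker is connected to one web node chosen uniformly and independently, and that exactly one web node is being updated at that moment. Both are simplifying assumptions, and `p_any_worker_affected` is a hypothetical helper.

```python
def p_any_worker_affected(web_nodes: int, draining_workers: int) -> float:
    """Chance that at least one draining worker is connected to the web
    node currently being updated, under the uniform model described above."""
    if web_nodes < 1 or draining_workers < 0:
        raise ValueError("need at least one web node and a non-negative worker count")
    # A single worker picks the updating web node with probability 1 / web_nodes.
    p_single = 1.0 / web_nodes
    # Complement of "no draining worker is connected to that web node".
    return 1.0 - (1.0 - p_single) ** draining_workers
```

Under this model, with two web nodes a single draining worker has a 0.5 chance of talking to the node being updated, and two workers draining at once raise that to 0.75. Adding web nodes lowers the per-worker chance but does not remove the race.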

deniseyu commented on June 18, 2024

Thanks @radoslav-tomov and @alexbakar for spending time investigating this!

> I am wondering how to get more details from the web node, for example the logs from the "old" web node (as the old VM is being destroyed the logs are gone).

I think the only way to get logs from a destroyed VM is to enable log forwarding at deploy time and export them to somewhere like Papertrail 😕

The next part is speculative, not necessarily a request for contributors to implement:

I wonder, since BOSH introduced a new lifecycle hook for pre_stop, if there's another way to address the timing issue without turning off parallel deployment. For example, maybe web nodes could wait (up to a configurable threshold) until there are no more registered workers before restarting. That should allow some time overlap for the slowest BOSH operations without prolonging the overall deployment time by too much, I think.
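
The waiting idea above could look roughly like the sketch below. It is only an illustration of the control flow, not an existing Concourse or BOSH feature: `wait_for_workers_to_leave` and the `count_registered_workers` callback are hypothetical, and how a web node would actually count its registered workers is left open.

```python
from typing import Callable


def wait_for_workers_to_leave(count_registered_workers: Callable[[], int],
                              threshold_seconds: float,
                              poll_interval: float,
                              now: Callable[[], float],
                              sleep: Callable[[float], None]) -> bool:
    """Block until no workers are registered or the threshold has passed.

    Returns True if the web node can stop with no workers attached, and
    False if the threshold was reached first and it stops anyway.
    """
    deadline = now() + threshold_seconds
    while True:
        if count_registered_workers() == 0:
            return True
        if now() >= deadline:
            return False
        sleep(poll_interval)
```

The threshold keeps the hook from blocking a deployment indefinitely: in the worst case the web node waits for the configured time and then restarts as it does today.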

cc @cirocosta and @pivotal-jamie-klassen for additional input!

alexbakar commented on June 18, 2024

Thanks @deniseyu for the help.

For the analysis of the issue I forwarded the logs and later checked them using Kibana.
I've read the description of the pre_stop stage. I wonder whether we always know the timing of the redeployment (i.e. when the recreation of the group of web instances starts and when that of the group of workers starts). Can we miss some of the workers when the pre_stop of the web node is called? In other words, what is the chance that a worker recreation (and with it the draining) is triggered right after the pre_stop of the web node has finished? If there is no such chance, then the pre_stop approach is fine. Otherwise the race condition might still happen in some cases.
