Comments (7)
Hi @Marc-Assmann, sorry it's taken a while for someone to get back to you.
Without being able to poke around in this environment now, it's difficult to say what exactly went wrong. There are questions I would ask about the deployment, but I can appreciate that since this was 2 months ago, that data might not be around anymore! Have you run into this problem since, and is this an issue your team is still facing?
from concourse-bosh-release.
I was able to reproduce this issue several times on 5.2.0 and 5.5.0. Our setup contains two web nodes and six workers. I have analyzed the issue and this is my theory for what happens:
- a redeployment of Concourse is triggered (on Bosh level)
- when the draining starts, the USR2 signal triggers the retirement of the worker. A corresponding retire command is correctly sent to the TSA, which then forwards it to the ATC (both running on a web node);
- as the whole Concourse deployment is being redeployed, it can happen that the web and worker nodes are redeployed at the same time;
- in some cases the web node VM (to which the retire command was sent) is recreated; note that the ATC on this old web node has not yet marked the worker as "retiring" in the DB;
- during the recreation of the web node there is no meaningful response to the calling worker node, so the worker's drain logic doesn't react;
- the drain logic on the worker node just waits until the timeout passes (in our case 12 hours). During that period new jobs are still scheduled on this worker (and executed). After the timeout the drain script signals the worker process with TERM, QUIT, and finally KILL, and only then can the VM be recreated.
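The steps above can be sketched as a minimal polling loop. This is not the actual concourse-bosh-release drain script; `drain_worker` and the one-second poll interval are illustrative:

```shell
# Hypothetical sketch of the drain escalation described above: ask the
# worker to retire with USR2, wait up to the drain timeout, then fall
# back to TERM, QUIT and finally KILL.
drain_worker() {
  pid=$1; timeout=$2; waited=0

  kill -s USR2 "$pid" 2>/dev/null || return 0   # trigger retirement

  while kill -0 "$pid" 2>/dev/null; do          # worker still running?
    if [ "$waited" -ge "$timeout" ]; then
      # drain timed out (12 hours in the reported setup): escalate
      for sig in TERM QUIT KILL; do
        kill -s "$sig" "$pid" 2>/dev/null
        sleep 1
        kill -0 "$pid" 2>/dev/null || return 0
      done
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

The failure mode is the middle of this loop: when the retire request gets no meaningful response, nothing breaks out early, so the worker keeps accepting builds for the full timeout.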
I think concourse/concourse#2523 might be the same.
The same issue also happens if the draining performs "landing" rather than "retiring". I am wondering how to get more details from the web node, for example the logs from the "old" web node (since the old VM is destroyed, the logs are gone). Any ideas?
Hi,
I've debugged this and managed to reproduce it locally via docker-compose. It happens when:
- a SIGUSR1/SIGUSR2 signal has been sent to the worker process
- the worker receives the signal, but the call from the TSA client fails
Then the workerBeacon fails with an error and is restarted.
The bottom line is that in the case of a BOSH deployment, the drain script keeps running until it reaches the drain timeout, followed by TERM (15), QUIT (3), and finally KILL (9).
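A toy model of that failure mode (the names `run_beacon`, `tsa_retire`, and `TSA_UP` are illustrative, not real Concourse identifiers): the retire intent lives only in the beacon process, so when the TSA call fails and the beacon is restarted, the worker silently goes back to normal operation.

```shell
TSA_UP=0       # simulates the web node being recreated mid-drain
RETIRING=0

tsa_retire() { [ "$TSA_UP" -eq 1 ]; }   # stand-in for the TSA client call

run_beacon() {
  if [ "$RETIRING" -eq 1 ]; then
    if tsa_retire; then
      echo "beacon: retiring"
    else
      echo "beacon: retire failed, beacon exits with error"
      return 1
    fi
  else
    echo "beacon: registered, accepting builds"
  fi
}

# USR2 arrives while the web node is unreachable:
RETIRING=1
run_beacon || RETIRING=0   # supervisor restart loses the drain state
run_beacon                 # back to accepting builds, drain forgotten
```

Nothing on the worker side remembers that a retire was requested, which matches the observation that the drain script then simply runs out the clock.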
Seems that the behavior can be explained by the web and worker instance groups being updated in parallel. In some cases, the timing is enough to simultaneously update a web node and worker node, which might cause the issue if the worker node is connected to the specific web node being updated. In our case having just 2 web nodes makes the probability of hitting this exact race condition quite high.
It can be avoided by not running updates of both instance groups in parallel.
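In BOSH manifest terms this means keeping `serial: true` (the default) in the deployment's `update` block, rather than overriding it to `false`, so that the web instance group finishes updating before the workers start. A sketch with illustrative values:

```yaml
update:
  canaries: 1
  max_in_flight: 1
  canary_watch_time: 30000-60000
  update_watch_time: 30000-60000
  serial: true  # update instance groups one at a time, in manifest order
```

Individual instance groups can still override this with their own `update.serial` setting if only part of the deployment needs to be serialized.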
thanks @radoslav-tomov and @alexbakar for spending time investigating this!
> I am wondering how to get more details from the web node, for example the logs from the "old" web node (as the old VM is being destroyed the logs are gone).
I think the only way to get logs from a destroyed VM is to enable log forwarding at deploy time and set them up to export them to somewhere like Papertrail 😕
Next part is speculative, not necessarily a request for contributors to implement:
I wonder, since BOSH introduced a new lifecycle hook for pre_stop, if there's another way to address the timing issue without turning off parallel deployment. Maybe web nodes could wait (up to a configurable threshold) until there are no more registered workers before restarting, for example. That should allow some time overlap for the slowest BOSH operations without prolonging the overall deployment time by too much, I think.
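That idea could look roughly like the following pre_stop sketch for the web job. This is purely speculative: `GET /api/v1/workers` is a real ATC endpoint, but the script itself, the `TOKEN` handling, and the `PRE_STOP_TIMEOUT` variable are assumptions, not anything that exists in the release:

```shell
# Speculative pre_stop sketch: before BOSH stops this web node, wait (up
# to a configurable threshold) until no workers are registered with the
# ATC. ATC_URL, TOKEN and PRE_STOP_TIMEOUT are illustrative names.
wait_for_workers_to_leave() {
  threshold=${PRE_STOP_TIMEOUT:-300}; waited=0
  while [ "$waited" -lt "$threshold" ]; do
    count=$(curl -s -H "Authorization: Bearer $TOKEN" \
              "$ATC_URL/api/v1/workers" | jq 'length') || count=""
    if [ "${count:-1}" -eq 0 ]; then
      return 0                     # every worker has drained; safe to stop
    fi
    sleep 5
    waited=$((waited + 5))
  done
  return 1   # threshold reached; don't block the whole deploy forever
}
```

The threshold matters: returning 1 after a bounded wait keeps a stuck worker from blocking the deployment indefinitely, at the cost of occasionally falling back to today's behavior.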
cc @cirocosta and @pivotal-jamie-klassen for additional input!
Thanks @deniseyu for the help.
For the analysis of the issue I've forwarded the logs and later checked them using Kibana.
I've checked the pre_stop stage description. I wonder whether we always know the timing scheme of the redeployment (i.e. when the recreation of the group of web instances starts relative to the group of workers). Can we miss some of the workers when the web's pre_stop is called? In other words, what is the chance that a worker recreation (and its draining, respectively) is triggered right after the pre_stop of the web node has finished? If there is no such chance, then the approach of using pre_stop is fine; otherwise the race condition might still happen in some cases.