Comments (12)
We'll be fixing this properly in the next release; a solution is in the works. We're going to rip out those two columns entirely and change how we run checks for manually triggered builds: do it on the side that pulls from the queue, not the side that pushes to it and does all this locking.
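A rough sketch of the "check on the pull side" idea, assuming a single in-process consumer. Scheduler, trigger, and run_once are hypothetical names for illustration, not ATC code. The trigger only records the request; the consumer runs the check right before scheduling, so a failed check leaves the request queued instead of leaving a flag set.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple

Versions = Dict[str, str]


@dataclass
class Scheduler:
    """Hypothetical model: manual triggers are queued, and the consumer does the checking."""

    pending: Deque[str] = field(default_factory=deque)
    started: List[Tuple[str, Versions]] = field(default_factory=list)

    def trigger(self, job: str) -> None:
        # Push side: just record the request. No check, no lock, no persisted flag.
        self.pending.append(job)

    def run_once(self, check: Callable[[str], Versions]) -> None:
        # Pull side: the consumer runs the check immediately before scheduling.
        while self.pending:
            job = self.pending[0]
            versions = check(job)  # if this raises, the request stays at the head of the queue
            self.pending.popleft()
            self.started.append((job, versions))
```

If check raises (say, because a worker disappeared), run_once propagates the error and the request is still pending, so the next pass simply retries it; there is nothing to reset by hand.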
from atc.
Hi there!
We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.
The current status is as follows:
- #133300157 Jobs can get stuck at 'waiting for suitable set of input versions'
This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.
We have seen at least one example of a job hanging in this manner where all previous builds had succeeded and gone green.
For a time, we were inadvertently running Concourse 2.3.1 against Garden runC 1.0.1. We switched it back, but we're still not sure whether that helped.
The hanging job shows the message "waiting for a suitable set of input versions" in the UI. To unblock such jobs, we manually set the jobs.resource_checking column in the DB to false.
@schubert @tom-meyer @dsharp-pivotal @jingyimei
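To make the failure mode concrete, here is a toy model of a boolean "checking" flag, using Python's built-in sqlite3 as a stand-in for Postgres. The schema and the checked_build flow are assumptions for illustration, not ATC's actual code; only the column name and the remediation UPDATE come from this thread.

```python
import sqlite3
from typing import Callable, List


def make_db() -> sqlite3.Connection:
    # Minimal stand-in for the jobs table; only the columns needed here.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE jobs (name TEXT PRIMARY KEY, resource_checking BOOLEAN NOT NULL DEFAULT 0)")
    db.execute("INSERT INTO jobs (name) VALUES ('unit'), ('deploy')")
    return db


def checked_build(db: sqlite3.Connection, job: str, check: Callable[[str], None]) -> None:
    # Hypothetical flow: mark the job as checking, run the check, clear the mark.
    db.execute("UPDATE jobs SET resource_checking = 1 WHERE name = ?", (job,))
    check(job)  # if this raises, or the process dies here, the reset below never runs
    db.execute("UPDATE jobs SET resource_checking = 0 WHERE name = ?", (job,))


def stuck_jobs(db: sqlite3.Connection) -> List[str]:
    return [row[0] for row in db.execute("SELECT name FROM jobs WHERE resource_checking ORDER BY name")]


def remediate(db: sqlite3.Connection) -> int:
    # Same idea as the manual fix: clear every flag that is still set; returns rows changed.
    return db.execute("UPDATE jobs SET resource_checking = 0 WHERE resource_checking = 1").rowcount
```

Wrapping the reset in try/finally would cover the exception path, but not an ATC process that is killed between the two UPDATEs, which is why a bare flag like this is fragile.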
@cjcjameson any idea whether the ATC is still running into too many open files?
No, I don't think so. We are getting a lot of "atc.container-keepaliver.looking-up-container" followed by "atc.baggage-collector.could-not-locate-worker", but that's probably a separate issue.
Cool. I'm pretty sure those are fine.
We've been doing more spelunking, and our findings confirm the general diagnosis in the story description: the code that should let resource_checking resolve either doesn't get triggered or gets lost in a crash.
There may be more to it, however. We don't have to crash the ATC to demonstrate the behavior; we only need a long-running or semi-crashing container, and then to trigger another build. We will keep watching for reproductions of the error.
We often see this error as a consequence of the container hang documented here: cloudfoundry/guardian#54
Regardless of the guardian issue, however, we think the Concourse-side logic for the resource_checking field could be made more resilient to this general class of issues.
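One possible way to make such a flag more resilient, sketched here as an assumption rather than anything the Concourse team proposed: store an expiry time instead of a boolean, so a claim left behind by a crashed or hung holder lapses on its own. CheckLease, ttl, and renew are hypothetical names; in Postgres this would presumably be a timestamp column compared against now().

```python
import time
from typing import Callable, Dict


class CheckLease:
    """Hypothetical replacement for a boolean flag: a per-job claim that expires after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.expires_at: Dict[str, float] = {}

    def try_acquire(self, job: str) -> bool:
        now = self.clock()
        if self.expires_at.get(job, float("-inf")) > now:
            return False  # another holder's claim is still live
        # Either unclaimed, or the previous claim lapsed (e.g. its holder crashed).
        self.expires_at[job] = now + self.ttl
        return True

    def renew(self, job: str) -> None:
        # A healthy holder extends its claim while its check is still running.
        self.expires_at[job] = self.clock() + self.ttl

    def release(self, job: str) -> None:
        self.expires_at.pop(job, None)
```

The trade-off is choosing ttl: too short and a slow but healthy check loses its claim; too long and a stuck job waits that long before anything else can proceed.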
We've changed our best guess about what causes the container hangs (cloudfoundry/guardian#54): we now understand that they are likely related to stdout being left open, not directly attributable to zombie processes or daemons.
We are still trying to connect this 'hang' (containers staying open because children of their initial test process still have stdout open) to why Concourse's mount volume locking or resource checking might jam up under some failure condition.
For now, though, we're focusing on the underlying hang; we don't have a minimal reproduction of this on the Concourse end yet.
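The stdout mechanism can be illustrated outside of Garden with plain processes. In this sketch (which assumes a POSIX sh and sleep are available), a reader that waits for EOF on a child's stdout blocks until every process holding the pipe's write end has exited, even though the direct child, the shell, exits immediately. time_until_eof is a hypothetical helper, not part of any Concourse or Garden API.

```python
import subprocess
import time


def time_until_eof(shell_cmd: str) -> float:
    """Run shell_cmd under sh and return how many seconds it takes to see EOF on its stdout."""
    start = time.monotonic()
    proc = subprocess.Popen(["sh", "-c", shell_cmd], stdout=subprocess.PIPE)
    # read() returns only once every process holding the write end has closed it,
    # including background children that inherited the shell's stdout.
    proc.stdout.read()
    proc.stdout.close()
    proc.wait()
    return time.monotonic() - start
```

With time_until_eof("sleep 1 &") the shell exits at once, yet the read waits roughly a second because the backgrounded sleep still holds stdout; with time_until_eof("sleep 1 > /dev/null &") the background child no longer holds the pipe and EOF arrives almost immediately.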
I am seeing this often with Concourse 2.4.0, especially on a pipeline where I frequently trigger jobs and then expect downstream jobs waiting on passed to go off. I'm trying to work up the courage to put psql -c 'UPDATE jobs SET resource_checking = false WHERE resource_checking = true;' in a cron job, but I really don't want to. I saw this in 2.3.1 as well, but less often.
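If the reset does get automated, a narrower variant than a blanket UPDATE is to reset only jobs whose flag has stayed set across polls for longer than some grace period. This is a hypothetical sketch (StuckFlagWatchdog and grace are made-up names): the caller would feed it the result of SELECT name FROM jobs WHERE resource_checking and then run the UPDATE only for the names it returns.

```python
import time
from typing import Callable, Dict, Iterable, List


class StuckFlagWatchdog:
    """Hypothetical helper: report jobs whose resource_checking flag has been
    seen set on every poll for longer than `grace` seconds."""

    def __init__(self, grace: float, clock: Callable[[], float] = time.monotonic):
        self.grace = grace
        self.clock = clock
        self.first_seen: Dict[str, float] = {}

    def poll(self, checking_jobs: Iterable[str]) -> List[str]:
        now = self.clock()
        current = set(checking_jobs)
        # Forget jobs whose flag cleared on its own since the last poll.
        for job in list(self.first_seen):
            if job not in current:
                del self.first_seen[job]
        # Start timing jobs seen with the flag set for the first time.
        for job in current:
            self.first_seen.setdefault(job, now)
        # Only these names would be passed to a targeted UPDATE.
        return sorted(job for job, since in self.first_seen.items() if now - since > self.grace)
```

It cannot tell a flag that cleared and was set again between two polls from one that never cleared, so the grace period should comfortably exceed the longest legitimate check.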
I have had ATC crashes as well as semi-frequent worker crashes, so it's a fairly unfriendly environment. I also see more of the following in my postgres logs than would be ideal, if that helps at all:
WARNING: you don't own a lock of type ExclusiveLock
WARNING: you don't own a lock of type ExclusiveLock
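For context on that warning: PostgreSQL reports it when pg_advisory_unlock is called for a session-level advisory lock that the calling session does not hold. One way that can happen is acquiring the lock on one pooled connection and releasing it on another. The toy model below illustrates only that ownership rule; it is not Postgres, it ignores re-entrant lock counting, and whether this is what happens in the ATC isn't established in this thread. AdvisoryLocks and the session names are hypothetical.

```python
from typing import Dict, List

WARNING = "you don't own a lock of type ExclusiveLock"


class AdvisoryLocks:
    """Toy model of session-level advisory locks: only the owning session can unlock.
    Simplification: no re-entrant counting."""

    def __init__(self) -> None:
        self.owner: Dict[int, str] = {}
        self.warnings: List[str] = []

    def try_lock(self, session: str, key: int) -> bool:
        holder = self.owner.get(key)
        if holder is not None and holder != session:
            return False  # held by a different session
        self.owner[key] = session
        return True

    def unlock(self, session: str, key: int) -> bool:
        if self.owner.get(key) != session:
            # Postgres returns false here and logs the WARNING seen above.
            self.warnings.append(WARNING)
            return False
        del self.owner[key]
        return True
```

Note that after the failed unlock the lock is still held by the original session, so anything waiting on it keeps waiting.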
We have observed a job getting stuck at "waiting for suitable inputs", but with a new cause: we rolled a worker while it was in the middle of a build of that job. This resulted in a "connection reset by peer" error, and the next manually triggered build was not able to get off the ground.
Hello again!
All stories related to this issue have been accepted, so I'm going to automatically close this issue.
At the time of writing, the following stories have been accepted:
- #133300157 Jobs can get stuck at 'waiting for suitable set of input versions'
If you feel there is still more to be done, or if you have any questions, leave a comment and we'll reopen if necessary!
Related Issues
- is there support for custom base url path?
- favicon retains green/red color after navigating back to pipeline in chrome
- Input and output paths cannot be absolute
- Usage of resource not detected
- UI using Windows (Chrome/Firefox) doesn't load job results.
- Default pipeline view should show all jobs when using groups
- Job history refresh is very annoying and dates are wrong
- Add X-Accel-Buffering header to SSE endpoints
- ATC web ui no showing job information
- ATC not setting necessary security related HTTP headers
- Dragging the view around for a pipeline always goes to the job the drag started on
- Error saying no workers satisfying: resource type 'docker-image', platform 'linux'
- Building pulsing is buggy on concourse 2.5.1
- Legend doesn't render properly at some browser zoom levels
- job name shouldn't be tied to the job history
- Feature request: base path in CONCOURSE_EXTERNAL_URL
- As a user I would like to run a task from the pipeline view
- ATC Should have a healthcheck endpoint attached to it
- Long resource versions cannot be saved into the database