cloudfoundry / capi-release Goto Github PK

View Code? Open in Web Editor NEW

23.0 45.0 100.0 4.88 MB

Bosh Release for Cloud Controller and friends

License: Apache License 2.0

HTML 43.01% Shell 22.41% Ruby 34.58%

cloudfoundry bosh-release bosh

capi-release's Introduction

Cloud Foundry CAPI Bosh Release

This is the bosh release for Cloud Foundry's Cloud Controller API.

CI: CAPI Concourse Pipelines

Components

Cloud Controller: The primary API of Cloud Foundry.
Cloud Controller Clock: Triggers periodic jobs for the Cloud Controller.
Cloud Controller Workers: Execute background jobs for the Cloud Controller.
Webdav Blobstore: An optional stand-alone blobstore for the Cloud Controller.
NFS Mounter: Connects Cloud Controller with an NFS blobstore.
CC Uploader: Uploads files from Diego to the Cloud Controller.
TPS Watcher: Reports crash events from Diego to the Cloud Controller.

For more details on the integration between Diego and Capi Release, see Diego Design Notes.

Configuring Release

Contributing

Read Contribution Guidelines
Public Pivotal Tracker project showing current team priorities

Testing

Run CATS

capi-release's People

Contributors

Stargazers

Watchers

Forkers

shalako alexjh apshirley ryanpei abelhu ericpromislow petergtz jhvhs emalm ishustava ericfortinsp geofffranks samze vmware-archive flangewad cf-container-networking utako cloudfoundry-attic tinygrasshopper idev4u gcoon151 simonjjones drnic idoru dgodd bingosummer marcschunk pamplemoussecache ssurenr xiaozhu36 vmlinuxer cloudfoundryonazure cafxx herrjulz gdankov alphagov mirceaalexruse zhangweiinhpe hpcloud armfoundry licshire keymon cf-routing loggregator gmrodgers johha govau oozie svrc tstjep tjvman madamkiwi jspawar aminjam mariash infra-red bwasmith johncornish bedrock-forks c0d1ngm0nk3y kirederik dpb587-pivotal andrew-edgar flothinkspi-forks andreas-kupries pivotal-yugam-sharma weiquan0605 akhettal isabella232 pivotal-marcela-campo nirvananimbusa paas-ta jenspinney eirini-forks andy-paine sap-contributions belinda-liu rluan springerpe kinjelom zpascal cunnie acrmp svkrieger pivotalgeorge fhambrec jpalermo ohkyle evanfarrar ebroberson klapkov a-b marcpaquette

capi-release's Issues

cloud_controller_ng pre-backup-lock does not work when there is more than 1 CC instance

Issue

When there is more than 1 CC (the multi-az case is most common), the pre-backup-lock script fails when running bbr because the api is never unavailable because bbr runs these scripts serially (I think).

Steps to Reproduce

Deploy cf with > 1 instance of the api job and attempt to run bbr.

Current result

Failure text from bbr should look something like:

...
[bbr] 2017/10/09 16:57:36 INFO - Locking cloud_controller_ng on api/34b834f6-f161-485d-89d1-2f507e49b8b3 for backup...
[bbr] 2017/10/09 16:59:19 INFO - Done.
[bbr] 2017/10/09 16:59:19 ERROR - pre-backup-lock script for job cloud_controller_ng failed on api/34b834f6-f161-485d-89d1-2f507e49b8b3.
Stdout: 0
Endpoint https://api.apps.bam-bbr.cf-app.com/ did not go down after 90 seconds

Stderr:
...

cloud_controller_api_health_check is flakey

Issue

Occasionally the CAPI healthcheck fails, for reasons that seem to have nothing to do with the status of the API itself. This causes 20 seconds of API downtime, even though the API is fine, because it takes 20 seconds for the route registrar to re-check the healthiness of the API.

Context

On our BOSH lite environment we're seeing CATs failures, which manifest themselves like so:

[2018-05-08 00:36:03.68 (UTC)]> cf api api.snitch.cf-app.com --skip-ssl-validation 
Setting api endpoint to api.snitch.cf-app.com...
API endpoint not found at 'https://api.snitch.cf-app.com'
FAILED

When we check route_registrar.log on the api machine, we're seeing that the health check script had failed:

[2018-05-08 00:35:56+0000] {"timestamp":"1525739756.149144173","source":"Route Registrar","message":"Route Registrar.Script failed to exit within timeout","lo
g_level":1,"data":{"script":"/var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_health_check","stderr":"  % Total    % Received % Xferd  Average Speed
   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0
  0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0\r  0     0    0     0    0     0      0
    0 --:--:--  0:00:02 --:--:--     0\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000","stdout":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000","timeout":3000000000}}

When the health check fails, route registrar deregisters the api.SYSTEM_DOMAIN route and not check again for another 20sec (by default in cf-deployment). It will result in api downtime for this time period.

We saw no errors in the cloud_controller logs that will indicate that the process has crashed.

Steps to Reproduce

We don't yet have steps for triggering this failure on demand. Anything that caused the curl to intermittently fail should demonstrate the problem, though.

Expected result

We expect not to see API downtime unless the API is actually down. It would be nice for the health check script to be more resilient and not cause downtime unnecessarily.

Possible Fix

Retry logic in the health check script.

cc @njbennett

Unknown Error Has Occurred When Running CATS

Issue

When running CATS we get the following failure:

[2018-02-21 20:05:21.32 (UTC)]> cf delete-buildpack CATS-7-BPK-f7686884-77b3-4bd6-6 -f 
Deleting buildpack CATS-7-BPK-f7686884-77b3-4bd6-6...
FAILED
Error deleting buildpack CATS-7-BPK-f7686884-77b3-4bd6-6
An unknown error occurred.

Context

capi-release 1.28.0 (but noticed this with other versions)
multi-node galera mysql cluster

Steps to Reproduce

Deploy CF with mult-node galera mysql cluster and run cats (8 ginkgo nodes)

Expected result

CATs pass

Current result

We've seen CATS fail many times with this same error but while running many different commands. It is a flake and if we rerun the tests they pass.

The cloud_controller_worker logs indicate that the mysql server has gone away.
https://gist.github.com/joshzarrabi/bd37c23bd7b5a3e0a064f1eeab28b80d

We x-team paired with the mysql team and they showed us logs that indicate if the mysql proxies switched nodes. About 2000s (30m) before the error occurred, the proxies did switch nodes.

Furthermore, we queried the open connections to the database and noticed that there were many, long-running (~4000s) connections to the database. We think there might be an issue with cloud_controller_worker not properly noticing connections changes (ref jeremyevans/sequel#368 (comment)).

Possible Fix

We are curious if you're configuring this value to validate connections (http://sequel.jeremyevans.net/rdoc-plugins/files/lib/sequel/extensions/connection_validator_rb.html).

cc @kkallday @APShirley @ndhanushkodi

Add app_ssh_endpoint configurable

Issue

SSH endpoint of cloud foundry deployments are not configurable.

Context

As an CF operator, I wish to keep SSH access within private network. The first thing I wish to do is to change the URL of Diego SSH access.

Steps to Reproduce

Not Applicable

Expected result

Not Applicable

Current result

Not Applicable

Possible Fix

Check for a input from user before setting this defaults:

capi-release/jobs/cloud_controller_ng/templates/cloud_controller_ng.yml.erb

Line 86 in 7eb8f94

 app_ssh_endpoint: <%= "ssh." + p("system_domain") + ":" + p("app_ssh.port").to_s %> 

Deploy succeeded but ruby_buildpack was missing

Issue

We deployed CAPI 1.40.0 to Azure. The deploy succeeded but trying to push an app failed due to the "ruby_buildpack" not being present in the CAPI DB (although other buildpacks were present).

In the post-start log we see:

[2017-12-22 00:05:36+0000] + chpst -u vcap:vcap bundle exec rake buildpacks:install
[2017-12-22 00:05:39+0000] SEQUEL DEPRECATION WARNING: ...
[2017-12-22 00:05:40+0000] + [[ 0 -ne 0 ]]
[2017-12-22 00:05:40+0000] + popd
[2017-12-22 00:05:40+0000] + exit 0

In the cloud_controller_ng.log, we see:

{"timestamp":1513901227.9414902,"message":"Buildpack ruby_buildpack failed to install or update. Error: #<Faraday::ConnectionFailed wrapped=#<Net::OpenTimeout: Net::OpenTimeout>>","log_level":"error","source":"cc.background","data":{},"thread_id":47339562332440,"fiber_id":47339590905600,"process_id":12124,"file":"/var/vcap/data/packages/cloud_controller_ng/46683eb3e4b842f2d70df86eb544df41b20d252f/cloud_controller_ng/app/jobs/runtime/buildpack_installer.rb","lineno":38,"method":"rescue in perform"}
{"timestamp":1513901227.9422777,"message":"Request failed: 500: {\"error_code\"=>\"UnknownError\", \"description\"=>\"An unknown error occurred.\", \"code\"=>10001, \"test_mode_info\"=>{\"description\"=>\"Net::OpenTimeout\", \"error_code\"=>\"CF-ConnectionFailed\", \"backtrace\"=>[\"/var/vcap/packages/ruby-2.3/lib/ruby/2.3.0/net/http.rb:930:in `connect'\", \"/var/vcap/packages/ruby-2.3/lib/ruby/2.3.0/net/http.rb:863:in `do_start'\", \"/var/vcap/packages/ruby-2.3/lib/ruby/2.3.0/net/http.rb:852:in `start'\", \"/var/vcap/packages/ruby-2.3/lib/ruby/2.3.0/net/http.rb:1398:in `request'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday-0.11.0/lib/faraday/adapter/net_http.rb:80:in `perform_request'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday-0.11.0/lib/faraday/adapter/net_http.rb:38:in `block in call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday-0.11.0/lib/faraday/adapter/net_http.rb:85:in `with_net_http_connection'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday-0.11.0/lib/faraday/adapter/net_http.rb:33:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday_middleware-0.11.0.1/lib/faraday_middleware/response/follow_redirects.rb:78:in `perform_with_redirection'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday_middleware-0.11.0.1/lib/faraday_middleware/response/follow_redirects.rb:66:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday-0.11.0/lib/faraday/rack_builder.rb:139:in `build_response'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/faraday-0.11.0/lib/faraday/connection.rb:377:in `run_request'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/http_response_helper.rb:27:in `set_up_response'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/http/http_request.rb:143:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/http/retry_policy.rb:41:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/http/retry_policy.rb:41:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/http/http_request.rb:104:in `block in with_filter'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/http/signer_filter.rb:28:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/http/signer_filter.rb:28:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/http/http_request.rb:104:in `block in with_filter'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/service.rb:36:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/filtered_service.rb:34:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-core-0.1.8/lib/azure/core/signed_service.rb:41:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-storage-0.12.1.preview/lib/azure/storage/service/storage_service.rb:53:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-storage-0.12.1.preview/lib/azure/storage/blob/blob_service.rb:59:in `call'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/azure-storage-0.12.1.preview/lib/azure/storage/blob/block.rb:154:in `put_blob_block'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/fog-azure-rm-0.3.1/lib/fog/azurerm/requests/storage/put_blob_block.rb:12:in `put_blob_block'\", \"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/fog-azure-rm-0.3.1/lib/fog/azurerm/requests/storage/multipart_save_block_blob.rb:74:in `block (2 levels) in multipart_save_block_blob'\"]}}","log_level":"error","source":"cc.background","data":{"job_guid":"f44f29ac-94c8-495a-8e33-55d597f14d45"},"thread_id":47339562332440,"fiber_id":47339590905600,"process_id":12124,"file":"/var/vcap/data/packages/cloud_controller_ng/46683eb3e4b842f2d70df86eb544df41b20d252f/cloud_controller_ng/app/jobs/logging_context_job.rb","lineno":43,"method":"block in log_error"}

Seems like there are two issues:

Why did the deploy succeed despite the buildpack upload failing? Is the buildpack upload asynchronous?
Should the fog azure client have retried on networking errors from Azure? I think it is standard practice on most IaaS clients to retry on 500s and temporary networking errors with an exponential backoff.

Let us know if you need any additional info.

@ljfranklin && @utako, Former CAPIbaras

No support for Quota definition updates

I have reported the issue in BOSH (see full description: cloudfoundry/bosh#1201), but likely it can be fixed by changing the way updates are handled by the CC scripts in this release.

Inconsistent routes between CC and BBS

We're seeing a strange issue where cf apps lists this app as running and with a route to it configured:

$ cf apps
Getting apps in org my-org / space my-space as myuser...
cf app OK

name         requested state   instances   memory   disk   urls
my_app       started           1/1         256M     1G     my_app.mydomain.net

But when curling the app, the gorouter returns 404:

$ curl  my_app.mydomain.net
404 Not Found: Requested route ('my_app.mydomain.net') does not exist.

We checked BBS and indeed its routes for that app is empty:

"routes": {
    "cf-router": [],
    "..."
}

This didn't just happen to one app, but actually to 16 out of 400, which were not touched at the time through cf or similar.

There are 2 questions now:

How was it possible that those routes got lost in the first place when the apps and their routes where not touched? Or, asked differently: Is it possible that for some reason CC/nsync removed the routes temporarily and then dropped the necessary re-addition accidentally, maybe because BBS wasn't up temporarily or whatever?
Given that this is the current state, shouldn't the nsync-bulker find this missing route and add it? After looking at nsync-bulker my understanding is that its stale-detection logic does not look at route discrepancies. Should it? This last piece is underlined by the fact that temporarily mapping an additional route also made the old one working again. So the trigger was enough to sync (through nsync-listener) what's in CC into the BBS.

Steps to Reproduce

Unfortunately, we don't know how to reproduce this. This issue is more aimed at understanding what could have caused this situation and if a change in the nsync-bulker might be necessary.

Thanks,
Peter

/cc @smoser-ibm @julzdiverse

Url-encode the database username

Issue

Azure MySQL database's server login name is in the format username@servername.
https://docs.microsoft.com/en-us/azure/mysql/quickstart-create-mysql-server-database-using-azure-portal#get-the-connection-information

To workaround this issue, we can use username%40servername as the database username. But the format which backup-and-restore-sdk-release requires is username@servername. This is a conflict.

Expected result

When using external databases, we can use username@servername as the database username. Both capi-release and backup-and-restore-sdk-release can accept the format.

Current result

We need to use username%40servername as the database username for capi-release. But backup-and-restore-sdk-release doesn't accept it.

Possible Fix

bingosummer@2dfd952

No support for Quota definition updates

This is a re-open of #4

AFAIK this has not been addressed. At least updating documentation would be nice (where you would explicitly list all kinds of bootstrap things like quotas, security groups etc.).

blobstore job fails to start after vm crash or reboot

Issue

The blobstore job fails to start after the VM was rebooted.

Context

The nginx.stderr.log from the failure shows the following line hundreds of times:

nginx: [emerg] open() "/var/vcap/sys/run/blobstore/nginx.pid" failed (2: No such file or directory)

The control script appears to rely on the pre-start script to setup that directory:

function setup_blobstore_directories {
  local run_dir=/var/vcap/sys/run/blobstore
  local log_dir=/var/vcap/sys/log/blobstore
  local data=/var/vcap/store/shared
  local tmp_dir=$data/tmp/uploads
  local nginx_webdav_dir=/var/vcap/packages/nginx_webdav

  mkdir -p $run_dir
  mkdir -p $log_dir
  mkdir -p $data
  mkdir -p $tmp_dir
  chown -R vcap:vcap $run_dir $log_dir $data $tmp_dir $nginx_webdav_dir "${nginx_webdav_dir}/.."
}

According to the time stamps from the log, the pre-start script did run 3-days before the reboot:

-rw-r--r-- 1 vcap vcap 41200 Jul 20 14:21 nginx.stderr.log
-rw-r--r-- 1 vcap vcap     0 Jul 17 17:23 nginx.stdout.log
-rw-r----- 1 vcap vcap   622 Jul 17 17:23 pre-start.stderr.log
-rw-r----- 1 vcap vcap     0 Jul 17 17:23 pre-start.stdout.log

Unfortunately, most of the directories that are created by that script live on temporary file systems that bosh sets up. In particular, /var/vcap/sys/run:

# df /var/vcap/data/sys/run
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs               1024    16      1008   2% /var/vcap/data/sys/run

Since it's a tmpfs, it's memory only file system and all data gets lost on a reboot. That means that the directory used for the nginx pidfile is gone when the blobstore control script starts.

Steps to Reproduce

Deploy cloud foundry
Reboot the blobstore_z1 job

Expected result

The blobstore job recovers when the reboot is complete.

Current result

The blobstore job fails to recover. This causes the cloud controllers, cloud controller workers, and the runtimes to fail.

/v2/spaces/<SPACE_GUID>/security_groups returns duplicate entries

Issue

I received a defect pointing out a curious behaviour with the query /v2/spaces/<SPACE_GUID>/security_groups. The response of that query contains duplicate security groups instances if the same instance has been bound to different spaces. In my case I receive the 4 times the same instance, 1 for the global rule, and 3 more per each one of the bound spaces to it.

Context

This was tested in a fresh bosh-lite with latest CF and deployed by cf-deployment 49a9d51ae75bbafc2e1f5c1ba644e3d08afb4e7d

We are also seeing this issue in Bluemix.

Steps to Reproduce

cf security-groups
Getting security groups as admin
OK

     Name              Organization   Space
#0   public_networks
#1   dns               myorg          space1
     dns               myorg          space2
     dns               myorg          space3
#2   load_balancer

Expected result

Get DNS security group only once.

Current result

I am getting the group one for the global and 1 time per associate space.

{
   "total_results": 5,
   "total_pages": 1,
   "prev_url": null,
   "next_url": null,
   "resources": [
      {
         "metadata": {
            "guid": "66194193-b8be-423e-b7fc-18c352901b87",
            "url": "/v2/security_groups/66194193-b8be-423e-b7fc-18c352901b87",
            "created_at": "2017-06-27T13:54:29Z",
            "updated_at": "2017-06-27T13:54:29Z"
         },
         "entity": {
            "name": "public_networks",
            "rules": [
               {
                  "destination": "0.0.0.0-9.255.255.255",
                  "protocol": "all"
               },
               {
                  "destination": "11.0.0.0-169.253.255.255",
                  "protocol": "all"
               },
               {
                  "destination": "169.255.0.0-172.15.255.255",
                  "protocol": "all"
               },
               {
                  "destination": "172.32.0.0-192.167.255.255",
                  "protocol": "all"
               },
               {
                  "destination": "192.169.0.0-255.255.255.255",
                  "protocol": "all"
               }
            ],
            "running_default": true,
            "staging_default": true,
            "spaces_url": "/v2/security_groups/66194193-b8be-423e-b7fc-18c352901b87/spaces",
            "staging_spaces_url": "/v2/security_groups/66194193-b8be-423e-b7fc-18c352901b87/staging_spaces"
         }
      },
      {
         "metadata": {
            "guid": "90c7e9e1-1029-410c-b2c3-ddbcdd8d4564",
            "url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564",
            "created_at": "2017-06-27T13:54:29Z",
            "updated_at": "2017-06-27T13:54:29Z"
         },
         "entity": {
            "name": "dns",
            "rules": [
               {
                  "destination": "0.0.0.0/0",
                  "ports": "53",
                  "protocol": "tcp"
               },
               {
                  "destination": "0.0.0.0/0",
                  "ports": "53",
                  "protocol": "udp"
               }
            ],
            "running_default": true,
            "staging_default": true,
            "spaces_url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564/spaces",
            "staging_spaces_url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564/staging_spaces"
         }
      },
      {
         "metadata": {
            "guid": "90c7e9e1-1029-410c-b2c3-ddbcdd8d4564",
            "url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564",
            "created_at": "2017-06-27T13:54:29Z",
            "updated_at": "2017-06-27T13:54:29Z"
         },
         "entity": {
            "name": "dns",
            "rules": [
               {
                  "destination": "0.0.0.0/0",
                  "ports": "53",
                  "protocol": "tcp"
               },
               {
                  "destination": "0.0.0.0/0",
                  "ports": "53",
                  "protocol": "udp"
               }
            ],
            "running_default": true,
            "staging_default": true,
            "spaces_url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564/spaces",
            "staging_spaces_url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564/staging_spaces"
         }
      },
      {
         "metadata": {
            "guid": "90c7e9e1-1029-410c-b2c3-ddbcdd8d4564",
            "url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564",
            "created_at": "2017-06-27T13:54:29Z",
            "updated_at": "2017-06-27T13:54:29Z"
         },
         "entity": {
            "name": "dns",
            "rules": [
               {
                  "destination": "0.0.0.0/0",
                  "ports": "53",
                  "protocol": "tcp"
               },
               {
                  "destination": "0.0.0.0/0",
                  "ports": "53",
                  "protocol": "udp"
               }
            ],
            "running_default": true,
            "staging_default": true,
            "spaces_url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564/spaces",
            "staging_spaces_url": "/v2/security_groups/90c7e9e1-1029-410c-b2c3-ddbcdd8d4564/staging_spaces"
         }
      },
      {
         "metadata": {
            "guid": "6dc1130d-f1a5-414e-87ca-28649af58568",
            "url": "/v2/security_groups/6dc1130d-f1a5-414e-87ca-28649af58568",
            "created_at": "2017-06-27T13:54:29Z",
            "updated_at": "2017-06-27T13:54:29Z"
         },
         "entity": {
            "name": "load_balancer",
            "rules": [
               {
                  "destination": "10.244.0.34",
                  "protocol": "all"
               }
            ],
            "running_default": true,
            "staging_default": false,
            "spaces_url": "/v2/security_groups/6dc1130d-f1a5-414e-87ca-28649af58568/spaces",
            "staging_spaces_url": "/v2/security_groups/6dc1130d-f1a5-414e-87ca-28649af58568/staging_spaces"
         }
      }
   ]
}

un-namespace `release_level_backup` property

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

un-namespace release_level_backup property

Context

The release_level_backup property here: https://github.com/cloudfoundry/capi-release/blob/develop/jobs/bbr-cloudcontrollerdb/spec#L19 is currently namespaced with cloudcontroller (job name), probably because it was modelled after the bbr exemplar release.

We (bbr) have recently learned that the convention for new job-level properties is for there not to be any namespacing, so we've updated our exemplar release. Could you un-namespace the release_level_backup property in your release?

Rotating `internal_api_password` while pushing breaks an application for a while

Issue

While rotating internal_api_password some application may end up in a 'corrupted/non-recoverable' state for a while.

Context

We tried to change internal_api_password, deploy cf and then diego while continuously pushing a sample application.
A push was in progress while cf was being updated and the push started to hang. After some time we interrupted the push command.
The next push failed due to Error restarting application: Server error, status code: 400, error code: 100001, message: The app is invalid: VCAP::CloudController::BuildCreate::StagingInProgress.
We tried to delete the app then, but we got the error Server error, status code: 500, error code: 10011, message: Database error.

Steps to Reproduce

Run the following command in an endless loop

timeout 180 cf push APP_NAME

Modify internal_api_password
Deploy cf
Deploy diego

Expected result

Application lifecycle should work for all applications after the cf and diego deployment.

Current result

After diego is deployed, applications lifecycle works again for apps that have been not pushed during the cf deployment.
Apps pushed during cf deployment will be in the described state. I don't know if the user can fix the application or not.

After some time (I retried the next day), some timeout probably kicked in and I was able to delete the broken app.

Possible Fix

No concrete idea, maybe some improvement in error handling / rollback action in the push handling.

Configure two API endpoints for single deployment

Hi!
I want to configure two different API endpoint for my CF deployment. In deployment manifest i added routes for API, UAA, etc. But when i try to set cf api api.cf.1.example.com, it redirects me too login.cf.2.example.corp which is not accessible from first network. It looks like CLI using property system_domain which is set to cf.2.example.corp. Is there any way to add additional API endpoint? Thanks!

~ → CF_TRACE=true cf api --skip-ssl-validation api.cf.1.example.com
Setting api endpoint to api.cf.1.example.com...
REQUEST: [2016-06-14T22:01:51+03:00]
GET /v2/info HTTP/1.1
Host: api.cf.1.example.com
Accept: application/json
Content-Type: application/json
User-Agent: go-cli 6.19.0+b29b4e0 / darwin
RESPONSE: [2016-06-14T22:01:52+03:00]
HTTP/1.1 200 OK
Content-Length: 646
Content-Type: application/json;charset=utf-8
Date: Tue, 14 Jun 2016 19:01:52 GMT
Server: nginx
X-Content-Type-Options: nosniff
X-Vcap-Request-Id: d65953c2-b665-4366-6a1e-0ca5fabf8654
X-Vcap-Request-Id: d65953c2-b665-4366-6a1e-0ca5fabf8654::33d6c84c-d63f-4e5c-9213-bec863380272
{"name":"","build":"","support":"http://support.cloudfoundry.com","version":0,"description":"","authorization_endpoint":"https://login.cf.2.example.corp","token_endpoint":"https://uaa.cf.2.example.corp","min_cli_version":null,"min_recommended_cli_version":null,"api_version":"2.56.0","app_ssh_endpoint":"ssh.cf.2.example.corp:2222","app_ssh_host_key_fingerprint":"b0:ef:fa:6d:90:e0:08:71:5a:c2:79:83:4e:c5:eb:14","app_ssh_oauth_client":"ssh-proxy","routing_endpoint":"https://api.cf.2.example.corp/routing","logging_endpoint":"wss://loggregator.cf.2.example.corp:443","doppler_logging_endpoint":"wss://doppler.cf.2.example.corp:4443"}
OK
API endpoint:   https://api.cf.1.example.com (API version: 2.56.0)
Not logged in. Use 'cf login' to log in.

Automatically assigning subdomain routes to apps within a space

As a developer using the cloudfoundry platform it would be really useful to be able to namespace the hostname of apps per space so that we only need one manifest per application and we don't need to add custom routes for every deployment to each space.

An example:
I an app I would like to deploy to two spaces: dev and ci.
The app name in my manifest is: foobar
When I deploy my service to the dev space I would get a route: foobar.dev.mydomain.com
When I deploy my service to the ci space I would get a route: foobar.ci.mydomain.com

Currently this would fail as the dev deployment would take the route foobar.mydomain.com and I would have to specify the host for the ci deployment.

There are workarounds for this but it would be much more seamless if spaces could be assigned subdomains

No support for Security Group definition updates

This is a re-opening of #9

Issue

When you change existing SG definitions in cf manifest or add new security groups and apply changes with BOSH, the changes are not applied even though instances are updated by BOSH. Behaviour looks similar to #4

Context

I want to update existing SG or add new SG.

Steps to Reproduce

Update existing SG or add new SG in CF manifest. Bosh deploy, watch BOSH update controller VM.

Expected result

Changes are applied and new SGs exist when you do cf security-groups.

Current result

Changes are not applied and no new security groups are created.

Possible Fix

Either apply changes or at least document that even though changes from manifest are applied against environment, your release is ignoring them.

Versions

BOSH v255.8, CF v233

Met a problem,the file_descriptor_limit param configuration api can be support?

Issue

Hello,capi team

Recently, I met a problem,when i push a tomcat application to cf,sometimes appear "too many open files",the default dfs is 16384,maybe this is an application bug,whatever his for a moment,but now can I change this parameter (not the "instance_file_descriptor_limit" parameter in the CC config) when i push app? such as like cf push app app_name –ulimit nofile=20480:40960.

Thanks.

cf-acceptance-tests services helpers don't respect provided timeouts

This file has hardcoded definitions for the timeout variables: https://github.com/cloudfoundry/cf-acceptance-tests/blob/master/helpers/services/globals.go

This doesn't respect the timeouts that the person running the CATS can define in the config file: https://github.com/cloudfoundry/cf-acceptance-tests/blame/05f5a76bcc211a71aa7bf27f3b96d8818f85f491/README.md#L124-L127

No Docker-Support in Manifest-File

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

Sadly there is no way to define a Docker-image in manifest.yml at the moment.
Instead you have to define it on every cf push, via -o.

Possible Fix

Add anything like docker_image to manifest.yml
Would be awesome!

App uploads fail intermittently when GCS is the blobstore

Issue

The Zipkin CATs suite regularly fails when run against an environment using a GCS blobstore. This appears to be because these tests are pushing a realistically sized Java Spring application, rather than a tiny test app.

We've now turned off the Zipkin tests in the environment experiencing the issue, but an example of the output is available here: https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-deployment/jobs/fresh-cats/builds/970

Steps to Reproduce

Deploy a cf-deployment with use-gcs-blobstore.yml or use-gcs-blobstore-service-account.yml
Assemble a CATS integration-config.json for that environment that enables zipkin tests
Run CONFIG=~/workspace/my-env/integration_config.json ./bin/test -focus "Zipkin" -untilItFails

Optional step 4:
parallel cf push bubble-dog-{} ::: A B C D E F G H I J K L M N O P
in an app directory containing a bit less than 1G of dog gifs

Note: This was tested on an environment with a bunch more customizations, which we assume aren't relevant but haven't confirmed.

Expected result

Test continues to pass indefinitely, or until the load balancer throws a 502

Current result

Within ~5 runs, we see the test fail with the following error:

Waiting for API to complete processing files...
Job (1e460ba3-08e9-4a52-b068-e9f642a558ba) failed: An unknown error occurred.
FAILED

If you're running the optional step #4 you are likely to see this failure immediately.

Possible Fix

We might just need to increase the blobstore timeout when using GCS.

Google recommends retrying with truncated exponential backoff in response to GCS errors

Please put the cloud controller job spec in the conventional location

Issue

I need to reference the cloud controller job spec to discover the use of a manifest property. I expect to find it in /jobs/cloud_controller_ng. Instead I must trace the thread through to cloud_controller's own repo and find the job spec hidden in /bosh/jobs/cloud_controller_ng/spec.

First, the job spec is a BOSH release artifact, and has no meaning in the context of a standalone component.

Second, this location his highly unconventional and so challenging to find.

Third, the job spec should be considered your configuration documentation for the component within the context of the BOSH release. Putting it in the conventional place enables operators to understand how to configure the component via a BOSH manifest.

Steps to Reproduce

Look for the CAPI job spec
Spend much longer finding it than it should take

Expected result

Find the CAPI job spec at https://github.com/cloudfoundry/capi-release/blob/master/jobs/cloud_controller_ng/spec

Current result

Find that https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng is a link to ../src/cloud_controller_ng/bosh/jobs/cloud_controller_ng.
Retrace your steps up the tree and back down through /src/cloud_controller_ng @ 4423884 which takes you to the CC repo
Get lost, try looking through several directories, before going back to https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng to remind yourself of the unconventional path
Eventually find your way to https://github.com/cloudfoundry/cloud_controller_ng/blob/b44ae3cb312841b7dd8f8b3cf39eab6fe95ede55/bosh/jobs/cloud_controller_ng/spec

Possible Fix

Put the spec in the expected location: capi-release under /jobs/cloud_controller_ng/spec

CC, worker, and clock jobs require hm9000.port property even if using Diego backend

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

When I deploy Cloud Foundry with a Diego backend, and remove all HM9000 instance groups and properties, I get:

     - Unable to render templates for job 'cloud_controller_ng'. Errors are:
       - Error filling in template 'cloud_controller_api.yml.erb' (line 208: Can't find property '["hm9000.port"]')
   - Unable to render jobs for instance group 'clock_global'. Errors are:
     - Unable to render templates for job 'cloud_controller_clock'. Errors are:
       - Error filling in template 'cloud_controller_clock.yml.erb' (line 181: Can't find property '["hm9000.port"]')
   - Unable to render jobs for instance group 'qpi_worker'. Errors are:
     - Unable to render templates for job 'cloud_controller_worker'. Errors are:
       - Error filling in template 'cloud_controller_worker.yml.erb' (line 180: Can't find property '["hm9000.port"]')

I'd expect that if I'm not even deploying HM9k, capi jobs shouldn't fail to render templates if I don't supply the HM9000 port.

Please remove "domain" property in job specs

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

The top-level domain property in several job specs should be removed in favour of just using system_domain.

Context

The property is specified in a few jobs:

Fixing this issue will help unblock this story to remove the property from manifest generation templates and confusing documentation.

The domain property is a constant source of confusion. See this mailing list thread. (Examples of the confusion between domain and system_domain can be found regularly on the mailing lists, GitHub issues, and Slack questions).

Steps to Reproduce

N/A

Expected result

No more tears.

Current result

Tears.

Possible Fix

Remove properties from job specs, and references to them in job ERB templates.

`name of issue` screenshot

N/A

VCAP_APPLICATION doesn't have org id/org name

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

VCAP_APPLICATION doesn't have org id/org name
[provide quick introduction so this issue can be triaged]

Context

when you do cf env my-app, you get the VCAP_APPLICATION.
If has the app name, app guid, space name, target api etc. But it is missing org name.
[provide more detailed introduction]

Steps to Reproduce

[ordered list the process to finding and recreating the issue, example below]
cf env my-app
OK

System-Provided:

{
"VCAP_APPLICATION": {
"application_id": "2aeb77ed-7622-466a-9b98-39458eab903c",
"application_name": "java-starter-a",
"application_uris": [
"java-starter-a.xxx"
],
"application_version": "66796b97-f7ec-4717-84dd-d4784e029eb8",
"cf_api": "https://xxx.xxx.xx",
"limits": {
"disk": 1024,
"fds": 16384,
"mem": 1024
},
"name": "java-starter-a",
"space_id": "1e34e388-78cc-4fbc-9b69-ea0c892d653d",
"space_name": "dev",
"uris": [
"java-starter-a.xxxx"
],
"users": null,
"version": "66796b97-f7ec-4717-84dd-d4784e029eb8"
}
}

Expected result

[describe what you would expect to have resulted from this process]

org id:
org_name:

Current result

[describe what you you currently experience from this process, and thereby explain the bug]

Possible Fix

[not obligatory, but suggest fixes or reasons for the bug]

`name of issue` screenshot

[if relevant, include a screenshot]

No support for Security Group definition updates

Issue

Context

I want to update existing SG or add new SG.

Steps to Reproduce

Update existing SG or add new SG in CF manifest. Bosh deploy, watch BOSH update controller VM.

Expected result

Changes are applied and new SGs exist when you do cf security-groups.

Current result

Changes are not applied and no new security groups are created.

Possible Fix

Either apply changes or at least document that even though changes from manifest are applied against environment, your release is ignoring them.

Versions

BOSH v255.8, CF v233

cloud_controller_ng pre-restore-lock exits before monit finished

Issue

For cloud_controller_ng the pre-restore-lock script does not wait till monit has fully stopped the jobs. So in the case that the restore is very quick, the subsequent post-restore-unlock script fails.

Context

Running DRATS

Steps to Reproduce

On a cloud_controller_ng job run /var/vcap/jobs/cloud_controller_ng/bin/bbr/pre-restore-lock && /var/vcap/jobs/cloud_controller_ng/bin/bbr/post-restore-unlock

Expected result

Success, ends in a state where everything is currently running.

Current result

Failing with stderr output monit: action failed -- Other action already in progress -- please try again later. monit summary shows that workers are down.

Possible Fix

Wait for monit actions to complete in pre-restore-lock before exiting.

CAPI Failure When UAA Isn't Available on Internal Address Is Late and Obscure

If UAA isn't set to have uaa.service.cf.internal in uaa.zones.internal.hostnames, or is otherwise unavailable, CC emits 502s with no error message on all endpoints that attempt token validation.

It should probably fail to start, instead. Failing that, it should emit a clear error message.

Steps to Reproduce

Override the uaa.zones.internal.hostnames to nil in your stub, deploy, then try to use CC. I've not tried this, so you could instead deploy with the minimal-aws example manifest prior to the fix here.

Expected result

The CC should probably fail to come up if it can't talk to UAA after a deploy. This would allow the canary to prevent all of the CCs from becoming non-functional in a misconfigured upgrade scenario.

If not, the message should at least say "yo I can't get the key from UAA because 404s on this wack internal address" or something.

Current result

From the user's perspective:

$ cf orgs
Getting orgs as admin...

FAILED
Server error, status code: 502, error code: 0, message:

From the logs:

{"timestamp":1485217641.3378894,"message":"Fetching uaa verification keys failed","log_level":"error","source":"cc.uaa_verification_keys","data":{},"thread_id":47193615883740,"fiber_id":47193621784980,"process_id":28295,"file":"/var/vcap/data/packages/cloud_controller_ng/aa586e54ee45aa382a0d5ebbab32e1c8aa048953.1-9face0a59275e9a96297203adbc563da1e7b8afd/cloud_controller_ng/lib/cloud_controller/uaa/uaa_verification_keys.rb","lineno":22,"method":"update_keys"}
{"timestamp":1485217641.3382447,"message":"Failed communicating with UAA: The UAA was unavailable","log_level":"error","source":"cc.security_context_setter","data":{},"thread_id":47193615883740,"fiber_id":47193621784980,"process_id":28295,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/middleware/security_context_setter.rb","lineno":22,"method":"rescue in call"}

And this is after the deploy is successful, so only post-deployment testing catches it.

isolation segment feature [migrated from diego-release]

Original issue: cloudfoundry/diego-release#247
Original poster: @YunSangJun

I have tested isolation segment feature on follow cf version.

cf v250
Diego release v1.4.1. Release notes for v1.4.1 · v1.4.0 · v1.3.1 · v1.3.0.
Garden-Runc release v1.0.4. Release notes for v1.0.4.
cflinuxfs2-rootfs release v1.44.0. Release notes for v1.44.0 · v1.43.0 · v1.42.0.
stemcell 3312.12

I found api which bind segment to org but there is no api that bind segment to space.
http://v3-apidocs.cloudfoundry.org/version/3.3.0/#entitle-one-or-more-organizations-for-an-isolation-segment

When do you plan to release the api?

GET /v2/events next_url parameter changed

Issue

When I use next_url parameter to see List all Events through CAPI (GET /v2/events)
I think there is a changes with the next_url parameter

Context

next_url parameter is different
To see next event log, "next_url" parameter not working

cf-release 251 (Contains CAPI release v1.15.0)
"next_url": "/v2/events?order-by=timestamp&order-by=id&order-direction=asc&page=2&results-per-page=50"

cf-release 247 (Contains CAPI release v1.11.0)
"next_url": "/v2/events?order-direction=asc&page=2&results-per-page=50"

I didn't see anything about order-by parameter
https://apidocs.cloudfoundry.org/251/events/list_all_events.html

Has any plan about support next_url parameter?

Expected result

As before, with next_url parameter can see next page events

$ cf curl "/v2/events?order-direction=asc&page=2&results-per-page=50" | more
{
"total_results": 2366,
"total_pages": 48,
"prev_url": "/v2/events?order-direction=asc&page=1&results-per-page=50",
"next_url": "/v2/events?order-direction=asc&page=3&results-per-page=50",
"resources": [
{
"metadata": {
"guid": "bef7dbbe-7a64-4de3-8d58-93bfc2572c84",
"url": "/v2/events/bef7dbbe-7a64-4de3-8d58-93bfc2572c84",
"created_at": "2017-03-14T01:14:28Z",
"updated_at": "2017-03-14T01:14:28Z"
},
"entity": {
"type": "audit.app.process.crash",
..
..
..

Current result

$ cf curl "/v2/events?order-by=timestamp&order-by=id&order-direction=asc&page=2&results-per-page=50"
{
"code": 10012,
"description": "Cannot order by: id",
"error_code": "CF-OrderByParameterInvalid"
}

pre-start.erb.sh chown code is redundant

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

In blobstore's pre-start.sh , chown is doing more work than it needs to.

Context

In this line:

chown -R vcap:vcap $run_dir $log_dir $data $tmp_dir $nginx_webdav_dir "${nginx_webdav_dir}/.."

chown is running over /var/vcap/packages/nginx_webdav twice

Steps to Reproduce

Use two terminal windows. One to deploy cf, and the other to ssh into the VM
and watch the processes. Run something like watch -c 'ps auxww | grep chown | grep -v -e grep -e watch

Expected result

chown should run for about 5 seconds

Current result

chown runs for about 5 minutes. This is because the nginx_webdav
package doesn't need any preprocessing, so /var/vcap/packages/nginx_webdav is a
symlink to /var/vcap/packages-src/HASH and therefore "${nginx_webdav_dir}/.."
resolves to /var/vcap/packages-src which contains the full complement of CF
packages.

Possible Fix

The reference to "${nginx_webdav_dir}/.." is a code smell.
If you really want to chown -R both dirs, it would be better to write the code like this:

local packages_dir=/var/vcap/packages
local nginx_webdav_dir=$packages_dir/nginx_webdav
...
chown ... $nginx_webdav_dir $packages_dir

cf-acceptance-tests security groups tests should test precondition

The security groups tests will fail if you already have wide-open security groups. It would be nice to have a precondition (BeforeEach) that asserts that the existing security groups are sufficiently closed off that the test has a chance of passing. If not, it could fail with a custom error message saying that your security groups are too open for this test to pass, and the user should either change their sec gps or not run the test.

/cc @wendorf

The job `cloud_controller_ng` failed when updating instance api

Please see cloudfoundry/cloud_controller_ng#1047 for detailed information.

After some investigating, I found the root cause.
In https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/monit, the timeout to start the process cloud_controller_ng is 30 seconds.
In https://github.com/cloudfoundry/capi-release/blob/develop/shared_job_templates/blobstore_waiter.sh.erb, wait_for_blobstore will try 30 times to curl blobstore and sleep 1 second between each try. It means, wait_for_blobstore may return "Blobstore is available" after 30s, because sometimes blobstore needs a longer time to be ready. But at that moment, monit regarded cloud_controller_ng as failed. In fact, the process cloud_controller_ng has been up.

The possible fix is to increase the timeout in the monit file.

Cloud Controller pre_start.sh script should not restart directly the consul_agent

Issue

cloud_controller_ng pre-start.sh script restarts directly the consul agent by calling its ctl script, starting it with as root user instead of vcap. If consul_agent gets restarted, it will fail due permissions trying to write on files owned by root

Context

cloud_controller_ng pre-start.sh script restarts directly the consul agent by calling its ctl script. This was added here as part of #128916909 (I cannot find the story url).

But the consul_agent is meant to run as vcap:vcap user and group and the monit file specifies that.

If we restart consul_agent by calling the ctl directly from cloud_controller pre-start.sh, consul_agent will run as root. That will work, but its files (config&data) will be created as root, and later the process won't be able to start due permissions after the VM gets restarted or monit restart consul_agent is executed.

More info in this conversation https://cloudfoundry.slack.com/archives/bosh-core-dev/p1478184425003326

Steps to Reproduce and Current result

Deploy cloud_controller with consul_agent.
consul_agent will be running as root.
run monit restart consul_agent. It will fail starting because these files are owned by root. /var/vcap/jobs/consul_agent/config/*.json and /var/vcap/data/consul_agent/serf/*
Or restart the VM. consul_agent will not start properly.

In general:

monit restart consul_agent: Starts consul agent as vcap:vcap. Would fail if file permissions are wrong.
monit restart all or bosh restart api: will start consul_agent as root, as cloud_controller_ng executes the pre-start.sh

Expected result

consul_agent should not be restarted from cloud controller, and it should always run as whatever its defined in the consul release. Currently is vcap:vcap.

Current result

See "Steps to Reproduce"

Possible Fix

Stop restarting consul_agent from cloud_controller.
If you need to wait for consul_agent for service discovery of the DB, it is possible to:
- add retry logic in CC
- rely on monit restart service feature.
- wait in the ctl script until consul_agent is started
Bosh dev team mention that "bosh need to get on the feature that will start basic services before other jobs. Coming up"

Workaround

Run monit restart all or bosh restart api. Will restart the consul_agent as root.

Hard to debug service broker CATS test

Adding this issue here instead of cf-acceptance-tests repo to give visibility to the CAPI team.

We have a failed CI job which runs cf-acceptance-tests against a CF deployment: https://runtime.ci.cf-app.com/pipelines/cf-release/jobs/a1-diego-cats/builds/65

• Failure in Spec Setup (BeforeEach) [314.151 seconds]
Service Instance Lifecycle Asynchronous operations [BeforeEach] when there is an existing service instance can delete a service instance 
/tmp/build/357ec5d1/gopath/src/github.com/cloudfoundry/cf-acceptance-tests/services/service_instance_lifecycle_test.go:379

  Timed out after 300.000s.
  Expected process to exit.  It did not.

  /tmp/build/357ec5d1/gopath/src/github.com/cloudfoundry/cf-acceptance-tests/helpers/services/broker.go:116

It appears to take > 5 minutes to cf start. Most tests do an AppReport after a cf start fails which does a cf logs --recent dump. This is very useful because the timestamps help tell the story of what exactly is taking too long (creating container, downloading blobs, uploading blobs, fetching Ruby gems, etc.). That would be desirable here, might be difficult given the structure of the broker DSL.

Generally seems undesirable to have lots of orchestration and assertions hidden inside test helpers.

nsync fails to update desired state in BBS silently

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

nsync fails to update desired state in BBS silently.

Context

If capi.nsync.cc.base_url is provided without a protocol, e.g. api.system-domain.com instead of https://api.system-domain.com, then rather than failing fast at deployment time, the bulker will continually fail to update freshness because of an invalid scheme "".

Eventually, stale LRPs are left around in Diego and aren't GC'd, leading to insufficient resources.

Steps to Reproduce

Deploy CF + Diego with capi.nsync.cc.base_url set to api.YOUR_SYSTEM_DOMAIN without any http:// or https:// scheme in the URL.

Expected result

When misconfiguring the URL like this, I'd expect to nsync to fail on the deployment, since the configuration it has been given is bad and will never work.

Current result

Currently nsync deploys correctly, and only after a long time do I eventually see lots of app pushes fail due to insufficient resources.

Possible Fix

Validate the input in the startup or pre-start scripts, and fail fast if the input is invalid.

cf-acceptance-tests services tests don't provide output logs

I have several flaky service instance lifecycle tests. The code looks deterministic, namely that when create an asynchronous service instance, the fake broker should first respond saying that it is in_progress, and then respond saying it's complete. It appears to continue saying it's in_progress. We were wondering if there was some caching of the response happening anywhere, and wanted to check if requests were making it all the way to the service broker app. We see the requests making it to the router. The problem is the CATS in question don't dump the service broker app's logs when the test fails.

It appears that default_fog_configuration no longer works

Issue

default_fog_connection: doesn't appear to work in release 234

capi-release/jobs/cloud_controller_ng/spec

Line 242 in 9cc12b6

cc.default_fog_connection.provider:

Context

Though a valid default value exists in the spec in order for my CC to work properly I had to provide fog_connection for each of the blobstore configs

Steps to Reproduce

Deploy CF Release 234 with fog. Don't specify any config for any of the blobstores (should default to fog right?)
Attempt to interact with the blobstore

Expected result

Interaction works

Current result

Get an error in the logs similar to this (from an hm9000.start but the trace will be similar enough)

{"timestamp":1460742570.8665578,"message":"exception processing subscription for: 'hm9000.start' '{\"message_id\"=>\"c6bc5978-82cb-4a53-715b-8ab472b63e0c\", \"droplet\"
=>\"5c783707-a3b0-4dbe-919d-57cc70068f6d\", \"version\"=>\"367ffbd2-c0e5-4db3-9a46-3e601c932c23\", \"instance_index\"=>0}' \n#<NoMethodError: undefined method `downcase
' for nil:NilClass>\n /var/vcap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_n
g/lib/cloud_controller/blobstore/fog/fog_client.rb:27:in `local?'\n/var/vcap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284c
f7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_controller/blobstore/url_generator.rb:64:in `droplet_download_url'\n/var/vcap/data/packages/cloud_controller_ng
/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_controller/dea/start_app_message.rb:11:in `initialize
'\n/var/vcap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_control
ler/dea/client.rb:252:in `new'\n/var/vcap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_co
ntroller_ng/lib/cloud_controller/dea/client.rb:252:in `start_instance_at_index'\n/var/vcap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-
f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_controller/dea/client.rb:137:in `block in start_instances'\n/var/vcap/data/packages/cloud_control
ler_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_controller/dea/client.rb:135:in `each'\n/var/vc
ap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_controller/dea/cl
ient.rb:135:in `start_instances'\n/var/vcap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_controller/dea/hm9000/respondent.rb:69:in `process_hm9000_start'\n/var/vcap/data/packages/cloud_controller_ng/469145d8fb043950582100c0c1a179c756fcff1f.1-f99d5f059a284cf7ac19ff6e00ffb5520513de4d/cloud_controller_ng/lib/cloud_controller/dea/hm9000/respondent.rb:25:in `block in handle_requests'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/cf-message-bus-0.3.4/lib/cf_message_bus/message_bus.rb:88:in `yield'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/cf-message-bus-0.3.4/lib/cf_message_bus/message_bus.rb:88:in `run_handler'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/cf-message-bus-0.3.4/lib/cf_message_bus/message_bus.rb:23:in `block (2 levels) in subscribe'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:1067:in `call'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:1067:in `block in spawn_threadpool'","log_level":"error","source":"cc.runner","data":{},"thread_id":47172617773120,"fiber_id":47172624694260,"process_id":13900,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/cf-message-bus-0.3.4/lib/cf_message_bus/message_bus.rb","lineno":90,"method":"rescue in run_handler"}

Possible Fix

Remove default_fog_connection asking everyone to specify a fog configuration for each blobstore type or fix it.

"cf disable-diego" for Docker app results in confusing error message [migrated from diego-release]

Original issue: cloudfoundry/diego-release#199
Original poster: @jochenehret

Hi Diego developers,

when we push a Docker app to Diego and then call cf disable-diego <appname>, we get a confusing error message and the app status is inconsistent:

$ cf disable-diego lattice-app
Setting lattice-app Diego support to false
FAILED
Error: CF-AppPackageNotFound - The app package could not be found: 92e75ab6-d631-4069-9fb5-c6cd1dfedd6a
{
   "code": 150002,
   "description": "The app package could not be found: 92e75ab6-d631-4069-9fb5-c6cd1dfedd6a",
   "error_code": "CF-AppPackageNotFound"
}

The app is not listed in cf diego-apps any more, but still running. The status is inconsistent:

$ cf apps
Getting apps in org test / space diego-test as admin...
OK

name           requested state   instances   memory   disk   urls
lattice-app    started           ?/1         1G       1G     lattice-app.cfapps.hcp-dev04.aws.sapcloud.io

I know that it doesn't make sense to disable Diego for Docker apps, but can you still try to provide a better error message?

Thanks and Best Regards,

Jochen.

cf deployment failed with 1 of 1 post-start scripts failed. Failed Jobs: cloud_controller_ng.

Issue

cf deployment failed after Updating instance api

Context

i'm trying to install cloud foundry on vsphere,but there is some problems.
command:
bosh -e bosh -d cf deploy cf-deployment/cf-deployment.yml --vars-store cf-deployment/env-repo/deployment-vars.yml -v system_domain=test.com -o cf-deployment/operations/scale-to-one-az.yml
result:

Task 229
Task 229 | 06:51:37 | Preparing deployment: Preparing deployment (00:00:04)
Task 229 | 06:51:50 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 229 | 06:51:51 | Updating instance api: api/c7569305-a073-4c92-a941-3d7a2b7411e9 (0) (canary) (00:01:38)
L Error: Action Failed get_task: Task e789016d-b853-44a2-64c4-913edb115b87 result: 1 of 1 post-start scripts failed. Failed Jobs: cloud_controller_ng.
Task 229 | 06:53:29 | Error: Action Failed get_task: Task e789016d-b853-44a2-64c4-913edb115b87 result: 1 of 1 post-start scripts failed. Failed Jobs: cloud_controller_ng.
Task 229 Started Wed Apr 25 06:51:37 UTC 2018
Task 229 Finished Wed Apr 25 06:53:29 UTC 2018
Task 229 Duration 00:01:52
Task 229 error
Updating deployment:
Expected task '229' to succeed but state is 'error'
Exit code 1

I downloadthe logs.
cf.api.0-20180427-084548-843073093.zip

Pls help to analyse data and give me some suggestion to fix the problem.
Thanks.

Deploys that roll the blobstore VM sometimes take a long time

Issue

In one of our environments deploying CF with internal blobstore, sometimes, when the blobstore VM is updated, it takes a really long time to update it (30-45 min).

Possible causes

We think this happens because the pre-start script for the blobstore job does a recursive chown of all persistent and ephemeral disk directories. Since BOSH doesn't have a timeout for this script, it will take as long as it takes to do a recursive operation on these directories.

Our persistent and ephemeral disks seem pretty empty:

# df -ah
/dev/sda3       3.5G  828M  2.5G  25% /var/vcap/data
/dev/sda3       3.5G  828M  2.5G  25% /var/log
tmpfs           1.0M   20K 1004K   2% /var/vcap/data/sys/run
/dev/sda3       3.5G  828M  2.5G  25% /tmp
/dev/sda3       3.5G  828M  2.5G  25% /var/tmp
/dev/sda1       2.8G  1.3G  1.4G  47% /home
/dev/sdb1        99G   11G   83G  12% /var/vcap/store

Steps to Reproduce

Run a bosh deploy that will result in a blobstore roll (e.g. stemcell update). The blobstore persistent disk must have some data in it in order to see elevated deploy times.

Expected result

It would be great if this operation was smarter and could take less time every time you deploy, as it is causing quite significant app pushability downtime.

Current result

It takes 30-45 min to update the blobstore VM.

Allow Operator to specify third-party `ca_cert` when communicating with blobstore

Issue

We're testing using Minio as an internal blobstore for CF. Minio is an s3-compatible blobstore and works with the fog-aws gem more or less out of the box. But we're hitting a small snag when trying to get CAPI to talk to Minio over HTTPS. We have self-signed certs on our Minio server, but I don't see that CAPI allows us to specify blobstore.ca_cert or similar to tell CAPI to trust the certs returned by the Minio server. Due to this, we get SSL verification errors unless we do the workaround described below.

Workaround

We can workaround this by redeploying our BOSH Director with a new trusted certificate, but this requires an extra manual step from our users.

Apps pushed with `v3-push --no-start` cannot `build`

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

Apps pushed with v3-push --no-start cannot build

Context

We v3-push --no-start an app to apps manager. In Apps Manager we tried to start the app. Starting the app, in Apps Manager, consist of building a droplet if one does not exist.

Steps to Reproduce

v3-push --no-start this app: https://github.com/cloudfoundry/cf-acceptance-tests/tree/master/assets/syslog-drain-listener

the api calls we are making in order:

v3/apps/66a01d5a-733c-457d-9d71-b3848c0e455c/packages
POST to /api/v3/builds with package_guid returned from [1]
v3/builds/8cecc7f9-96fa-45d3-ac14-cb3a0e90ae15, build_guid returned from [2] --- leads to error

Expected result

The build succeeds and the app starts

Current result

The build fails with

"state": "FAILED",
  "error": "NoAppDetectedError - An app was not successfully detected by any available buildpack",

Possible Fix

[not obligatory, but suggest fixes or reasons for the bug]

`name of issue` screenshot

Official Cloud Foundry docs missing info for webdav configuration.

The official stub documented for deploying Cloud Foundry on AWS includes stuff about webdav configuration:

https://github.com/cloudfoundry/cf-release/blob/fe6c99602c42fcc733ea75ff03f32040b176b346/spec/fixtures/aws/cf-stub.yml#L64-L95

But the documentation on how to fill out this stub doesn't explain how to fill out the values for replacement:

https://github.com/cloudfoundry/docs-deploying-cf/blob/master/aws/cf-stub.html.md.erb

/cc @utako @SocalNick

Spec description for cc.default_quota_definition wrong

The spec files for the various CC jobs all reference cc.default_quota_definition and give the description as "Local to use a local (NFS) file system. AWS to use AWS.".

This is obviously not correct, and should be improved to be meaningful.

Some useful context: the spiff-based manifest-generation approach has reduced people's reliance on spec files. I suspect that as we move away from it (and are thus trying to configure releases based on the spec files themselves), more issues with the legibility and accuracy of spec files will tend to arise.

Please add patch for nginx-http-upload-module

Issue

The patch for the http-upload-module of nginx seems not to be included in the shipped nginx binary. Please review nginx-upload-module commit aba1e3f

Context

Wrong initialization of upload path cause segmentation faults. We can observe this behavior when Dynatrace agent instruments the nginx.

Steps to Reproduce

Add Dynatrace agent to cloud_controller VM, restart nginx, cf push an app.

Expected result

App pushed to CF successfully

Current result

Nginx causes segmentation faults.

Possible Fix

vkholodkov/nginx-upload-module@aba1e3f

Latest commit from 2.2 branch

Unable to build capi-release on linux

Using Ubuntu 16.04.1 LTS, I'm getting an error during pre-packaging of cloud_controller_ng. Using system ruby 2.3.1:

$ bosh create-release --tarball=/tmp/capi.tgz
Building a release from directory '/home/lyle/workspace/capi-release':
  Running prep scripts:
    Running command: 'bash -x /home/lyle/workspace/capi-release/packages/cloud_controller_ng/pre_packaging', stdout: 'Fetching gem metadata from https://rubygems.org/.......
Fetching version metadata from https://rubygems.org/..
Fetching dependency metadata from https://rubygems.org/.
Fetching https://github.com/nats-io/ruby-nats
Fetching https://github.com/cloudfoundry/vcap-concurrency.git
Fetching https://github.com/cloudfoundry/delayed_job_sequel.git
Fetching https://github.com/zipmark/rspec_api_documentation.git
Using rake 11.3.0
Using i18n 0.7.0
Using json 1.8.3
Using erubis 2.7.0
Using multipart-post 2.0.0
Using unf_ext 0.0.7.2
Using vcap-concurrency 0.1.0 from https://github.com/cloudfoundry/vcap-concurrency.git (at 2a5b017@2a5b017)
Using thor 0.19.1
Using bundler 1.13.7
Using unf 0.1.4
Using nats 0.5.1 from https://github.com/nats-io/ruby-nats (at 8571cf9@8571cf9)
Using delayed_job_sequel 4.1.0 from https://github.com/cloudfoundry/delayed_job_sequel.git (at master@908f388)
Updating files in vendor/cache
  * rake-11.3.0.gem
Could not find i18n-0.7.0.gem for installation
', stderr: '+ set -e -x
+ cd /home/lyle/.bosh/tmp/bosh-resource-archive537005973/capi-release/src/cloud_controller_ng
+ BUNDLE_WITHOUT=development:test
+ bundle package --all --no-install --path ./vendor/cache
':
      exit status 7

Exit code 1

Weirdly running bundle package --all --no-install --path ./vendor/cache locally fails the first time but passes the second try. Turns out after the first run Bundler adds BUNDLE_DISABLE_SHARED_GEMS=true to ./.bundle/confg after the first run. Adding this environment variable to the pre-packaging script causes the pre-packaging to succeed:

BUNDLE_DISABLE_SHARED_GEMS=true BUNDLE_WITHOUT=development:test bundle package --all --no-install --path ./vendor/cache

Still not sure why I'm seeing this behavior or why that variable is changing things. Will keep digging...

some clock_jobs never run if last_completed_at < last_started_at value

Issue

In a scenario where a clock_job's last_completed_at value is less than the last_started_at value, any job without a defined timeout value will never execute. This can happen if the job or server crashes before updating the db.
The need_to_run_job method in /lib/cloud_controller/clock/distributed_executor.rb will always evaluate as nil rather than true or false.
For some reason, the cc.jobs.global.timeout_in_seconds property isn't being applied to jobs that don't have their own explicit value set.

After adding some simple debug lines to the code:

 def need_to_run_job?(job, interval, timeout, fudge=0)
  last_started_at = job.last_started_at
  return true if last_started_at.nil?

  last_completed_at = job.last_completed_at
  @logger.info "debug_line:  last_completed_at = #{last_completed_at}"
  interval_has_elapsed = now >= (last_started_at + interval - fudge)
  @logger.info "debug_line:  interval_has_elapsed = #{interval_has_elapsed}"
  last_run_completed = last_completed_at && (last_completed_at >= last_started_at)
  @logger.info "debug_line:  last_run_completed = #{last_run_completed}"
  timeout_elapsed = timeout && (now >= (last_started_at + timeout))
  @logger.info "debug_line:  timeout value = #{timeout}"
  @logger.info "debug_line:  timeout_elapsed = #{timeout_elapsed}"
  @logger.info "debug_line:  need_to_run_job = #{interval_has_elapsed && (last_run_completed || timeout_elapsed)}"
  interval_has_elapsed && (last_run_completed || timeout_elapsed)
end

Here is the output:

"timestamp":1510261901.005922, "message":"debug_line: Job name = pending_droplets"
"timestamp":1510261901.0065062,"message":"debug_line: last_completed_at = 2017-10-09 19:36:00 UTC"
"timestamp":1510261901.0071208,"message":"debug_line: interval_has_elapsed = true"
"timestamp":1510261901.0076826,"message":"debug_line: last_run_completed = false"
"timestamp":1510261901.0081966,"message":"debug_line: timeout value = "
"timestamp":1510261901.0087192,"message":"debug_line: timeout_elapsed = "
"timestamp":1510261901.0092213,"message":"debug_line: need_to_run_job = "
"timestamp":1510261901.0098224,"message":"Skipping enqueue for pending_droplets. Job last started at 2017-11-09 19:46:06 UTC, last completed at 2017-10-09 19:36:00 UTC, interval: 300"

Format of VCAP_SERVICES whitespace causing problems switching from DEA to Diego

Issue

We are trying to move our users applications from DEA to Diego. In the process we have noticed several instances where the application works when running in DEA but fail when switching to Diego caused by differences in the whitespace of VCAP_SERVICES between the 2 runtimes.

Since VCAP_SERVICES is created by the CC the first question is why is Diego's representation different than DEA. A quick search found this (https://github.com/cloudfoundry/dea_ng/blob/7683f22a51c294d5f5fcc023734752c03ab98237/lib/dea/env.rb#L52) in the DEA code which would appear to be the cause.

Here is a more detailed look at 2 of the problem scenarios we've run into thus far:

We found a number of our applications using a home grown JSON parser. This parser was easy enough to fix but it makes the migration to Diego slightly more complex for the customers using this dependency since they need to upgrade before they can switch.
Another more complex situation we ran into was with a bug in the thirdpary Java iText library that attempts to parse env itself (https://github.com/itext/itextpdf/blob/2.1.7/itext/src/rtf/com/lowagie/text/rtf/parser/destinations/RtfDestinationFontTable.java#L585). This library is old, thirdparty, and a free OSS version is no longer maintained making it more difficult to fix.

In both scenarios we can certainly figure out fixes and make our users upgrade to new versions of libraries as a requirement to moving to Diego. But, before doing that I thought I'd ensure this isn't something the capi team wants to potentially address at a higher level, potentially making it easier for all CF user's to move to Diego.

Steps to Reproduce

Look at environment of a DEA vs Diego app and see the formatting difference of VCAP_SERVICES is significant

Expected result

Expect an application with buggy code to work switching between DEA and Diego.

Current result

An application that has buggy code processing environment variables could break when switching between DEA and Diego.

Possible Fix

It would be nice if VCAP_SERVICES was formatted the same for DEA and Diego to ease application migration.

At the very least going forward, it might be a good idea to be more specific about how VCAP_SERVICES, VCAP_APPLICATION, etc. whitespace should be formatted such that tests can be written to help ensure format stays consistent as CC continues to evolve. I would hate to some day upgrade CC only to find VCAP_SERVICES formatting has changed exposing bugs in customer applications with how they are processing these variables. Perhaps stripping all white space would be a reasonable standard?

Reference: This is the original Diego issue I created before I realized it probably wasn't a Diego problem: cloudfoundry/diego-release#221

PostgreSQL recommends updating to 9.4.9, capi-release libpq package uses 9.4.6

According to the PostgreSQL announcement mailing list, 9.4.9 introduces security updates and they suggest upgrading ASAP. The official article from the PostgreSQL Global Development Group is a little more mild and doesn't quite say to do it ASAP. At any rate, the postgres-release team will be upgrading. We noticed that capi-release uses a 9.4.6 blob for your libpq package. Y'all may want to upgrade soon as well, although since this is a patch release, it should be fine to have 9.4.6 libpq talking to a 9.4.9 server.

Typos in stager_config.json.erb could prevent bridged stager from running.

See

capi-release/jobs/stager/templates/stager_config.json.erb

Lines 37 to 41 in 2dd99a4

 config[:bbs_client_cache_size] = valu 

 end 

 if_p("capi.stager.bbs.max_idle_conns_per_host") do |value| 

 config[:bbs_max_idle_conns_per_host] = valu

If either property capi.stager.bbs.client_session_cache_size or capi.stager.bbs.max_idle_conns_per_host is non-null the stager won't run.

	config[:bbs_client_cache_size] = valu
	end

	if_p("capi.stager.bbs.max_idle_conns_per_host") do \|value\|
	config[:bbs_max_idle_conns_per_host] = valu