kubernetes-retired / kube-aws Goto Github PK

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS

License: Apache License 2.0

Shell 10.35% Go 89.06% Makefile 0.50% Dockerfile 0.09%

high-availability kubernetes-cluster aws cloudformation infrastructure-as-code

kube-aws's Introduction

Kubernetes on AWS (kube-aws)

Note: The master branch may be in an unstable or even broken state during development. Please use releases instead of the master branch in order to get stable binaries.

kube-aws is a command-line tool to create/update/destroy Kubernetes clusters on AWS.

Contributing

This repository is about to enter read-only mode, and no further updates will be made here.

If you would like to contribute, please find our fork located here

Details of how to develop kube-aws are in our Developer Guide.

kube-aws's People

Contributors

Stargazers

Watchers

Forkers

mumoshu kenden pieterlange camilb bfallik qqshfox cknowles mekanoe colhom gianrubio camsteack nicr9 joshua kerin heschlie mefellows dexhorthy everpeace csats dimpavloff neoandroid pyar6329 jackzzj smileisak cclear vidyasagargudapati jgmize artushin ankon hotelsdotcom reiinakano helinwang whereisaaron mgilbir checkr neillturner jmcarp vipinsthename iameli linearregression terradatum soellman fsero etsangsplk cheungpat jollinshead philips maniacs-ops msanterre hasbro17 miradatv swestcott ajoslin zhangtiny123 almogbaku trmiller bunnyyiu mennink jeremyd martinlogan zsoltk-iw paulcichonski karlem mrwlad rochacon jpb adnankobir tyrannasaurusbanks robertgstoehl jlk velo vvelazquez donghaima ytsarev spatronis iflix trinitronx erickfaustino pikuzi 40a devonartis tomas-edwardsson cloud-robotics anushkasandaruwan lucasproclc ef-labs itsmaverick amitkumarj441 adyromantika wallentx dvdthms sathiyas tmjd asifdxtreme mbssaiakhil cloud-architecture keivnhuang karthiklml thinegan joenoyard

kube-aws's Issues

Production Quality Deployment

Copy of the old issue with a lot of boxes ticked thanks to contributions by @colhom @mumoshu @cgag and many others. Old list follows, will update where necessary.

The goal is to offer a "production ready solution" for provisioning a coreos kubernetes cluster on AWS. These are the major functionality blockers that have been thought of:

Proposal: enable to attach user defined security groups for worker instances.

Hi community, thanks for sharing the great project. I'm very happy if I could contribute this project would be better.

Motivation

I think restricting access to ec2 instances by security group is a common pattern. For example, assume the situation that you had some database instances and you wanted to restrict access to those db instances by some specific security group.

Proposal

In such situations, I believe that kube-aws could be more useful if kube-aws can attach user defined security groups to its worker nodes.

Current possible workaround and its disadvantage

We can use the security group which kube-aws will generate for worker nodes, let this be sg-w. For above example, that means, we can configure the security group which will be attached to db instances, let this be sg-db, so that sg-db allows ingress from sg-w.

Unfortunately, in this case, this doesn't allow user to destroy the k8s cluster unless user modified sg-db becausesg-w is referred from sg-db and sg-w is managed by kube-aws. I believe that this is not preferred by many users because this solution introduces undesirable dependencies.

Proposal: Clouformation wait for signal

In the current cloudformation template, kube-aws use PauseTime flag to wait for create/update resources.

    UpdatePolicy" : {
        "AutoScalingRollingUpdate" : {
          "MinInstancesInService" : "{{.ControllerCount}}",
          "MaxBatchSize" : "1",
          "PauseTime" : "PT2M"
        }
      }

My proposal is to use cfn-signal to control the resources [1][2] and only send the signal after the last service was started. I can implement this proposal and I'm here to listen for suggestions.

[Unit]
Description=Cloudformation Signal Ready
After=docker.service
Requires=docker.service
After=etcd2.service
Requires=etcd2.service

[Install]
WantedBy=multi-user.target

[Service]
Type=oneshot
EnvironmentFile=/etc/environment
ExecStart=/bin/bash -c '\
eval $(docker run crewjam/ec2cluster); \
docker run --rm crewjam/awscli cfn-signal \
    --resource MasterAutoscale --stack $TAG_AWS_CLOUDFORMATION_STACK_ID \
    --region $REGION; \
'

[1] https://aws.amazon.com/blogs/devops/use-a-creationpolicy-to-wait-for-on-instance-configurations/
[2] http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html

Documentation of long-term managing capabilities

Maybe this is only documentation issue...

But for me it appears to be unclear if this project is meant to support created kubernetes cluster over time or just initial provisioning.
If it is meant to be long term support i miss a docuimetnation (wiki?), that states clearly what to do i common situation like scaling to new number if nodes or another management stuff.

Lucky, there is propably not so much, but still worth to document it clear in detail for those who not involved in this project so much - but just want to use it. Or step before, get idea of it.

Documentation on external DNS seems out of date

Hello all,

I just setup a Kubernetes cluster with version v0.9.1. I see changes since version 0.8:

etcd cluster is now deployed on its own instances
controller is in an autoscalling group
controller is behind a load balancer

The doc: https://coreos.com/kubernetes/docs/latest/kubernetes-on-aws-launch.html, says:

Otherwise, navigate to the DNS registrar hosting the zone for the provided external DNS name. Ensure a single A record exists, routing the value of externalDNSName defined in cluster.yaml to the externally-accessible IP of the master node instance.

You can invoke kube-aws status to get the cluster API IP address after cluster creation, if necessary. This command can take a while.

I did what I did with version 0.8, I set an A record in my zone file to the public IP address of my controller node. However it is now behind a load balancer and the security group of the controller prevent HTTPS access from "0.0.0.0" and constraints it to only the load balancer and the worker nodes.

To access my cluster, I had to modify the security group and add a HTTPS inbound rule with "0.0.0.0". I guess I'm not supposed to do that. What are the recommandation here ?

Regards,

Option to create and automatically mount EFS volumes

While NFS is kind of a dirty solution for cluster storage, i think it fills a basic (highly available storage across the cluster) need for most operators. Users can use hostPath volumes to access their persistent data from any node in any AZ.

Should probably include some caveats about what kinds of data you should (not) store on this.

It's an easy patch, will contribute PR soon.

Feature: Rescheduler add-on

Do we want to support the rescheduler add-on out of the box? Useful links - k8s, kops

Internet Gateway needs to be added to VPC before cluster can be created.

Again I believe that there is a need for mentioning this in the documentation.

Update the `coreos/awscli` docker image according to the needs

Initial ideas

It will settle between the below two images.

What is working in today's kube-aws:
```
FROM alpine:3.1

RUN apk --update add bash curl less groff jq python py-pip && pip install --upgrade awscli s3cmd && mkdir /root/.aws
```
According to:
- https://quay.io/repository/coreos/awscli/image/e33d9681c7bf50d591d9fcce3db7d21fbf261affde613e10c0abebb74f6b6a1c
- https://quay.io/repository/coreos/awscli/build/49336018-390e-4c2a-aa44-1f84d67949ea
What is going to be needed for future work:

cc @gianrubio

e.g. #49 (comment) which requires the cfn-signal command contained in the aws-cfn-bootstrap package to be included in a official-ish/controllable docker image(i.e. quay.io/coreos/awscli)

I've talked about it with @pieterlange and this is what it would look like:
```
FROM alpine
RUN apk update && apk add python py-pip && rm -rf /var/cache/apk/*
RUN pip install awscli
RUN pip install https://s3.amazonaws.com/cloudformation-examples/aws-cfn-bootstrap-latest.tar.gz 
```

Requirements

Basically, the new image is ok as long as it:
- keeps backward-compatibility(possibly including all the packages contained in the former image
- contains only what is required for us as exactly as possible
- sizes as small as possible
- is not based on very out-dated alpine

Implementation

It'll be something like:

FROM alpine:3.4

RUN apk --no-cache --update add bash curl less groff jq python py-pip && \
  pip install awscli s3cmd https://s3.amazonaws.com/cloudformation-examples/aws-cfn-bootstrap-latest.tar.gz && \
  apk del remove py-pip && \
  mkdir /root/.aws

Umbrella documentation issue

There's a bit of documentation already out there, but some of it is hard to find and a lot of it is outdated. The move to the dedicated repo included a bunch of improvements of which nearly all are mostly documented in github issues.

Proposal for documentation checklist:

calico-node.service and calico-policy-controller are failing to start

I have been running a kube-aws E2E test against a cluster created from #34 (with minor fixes basically unrelated to how etcd+calico works) and seeing the error like:

from journalctl -e -u calico-node.service:

Nov 06 06:56:24 ip-10-0-0-80.ap-northeast-1.compute.internal systemd[1]: Started Calico per-host agent.
Nov 06 06:56:24 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
Nov 06 06:56:25 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]: image: using image from local store for image name quay.io/calico/node:v0.19.0
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]: Traceback (most recent call last):
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]:   File "startup.py", line 292, in <module>
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]:     client = IPAMClient()
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]:   File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 228, in __init__
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]:     "%s" % (ETCD_CA_CERT_FILE_ENV, etcd_ca))
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]: pycalico.datastore_errors.DataStoreError: Invalid ETCD_CA_CERT_FILE. Certificate Authority cert is required and must be
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal rkt[25660]: Calico node failed to start
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal systemd[1]: calico-node.service: Unit entered failed state.
Nov 06 06:56:29 ip-10-0-0-80.ap-northeast-1.compute.internal systemd[1]: calico-node.service: Failed with result 'exit-code'.
Nov 06 06:56:30 ip-10-0-0-80.ap-northeast-1.compute.internal systemd[1]: calico-node.service: Service hold-off time over, scheduling restart

from kubectl get po --all-namespaces:

$ kubectl get po --all-namespaces
NAMESPACE       NAME                                                                     READY     STATUS              RESTARTS   AGE
calico-system   calico-policy-controller-ip-10-0-0-102.ap-northeast-1.compute.internal   1/2       CrashLoopBackOff    6          10m
calico-system   calico-policy-controller-ip-10-0-0-103.ap-northeast-1.compute.internal   1/2       CrashLoopBackOff    5          9m
kube-system     heapster-v1.2.0-3646253287-hp1cm                                         0/2       ContainerCreating   0          11m
kube-system     kube-apiserver-ip-10-0-0-102.ap-northeast-1.compute.internal             1/1       Running             0          10m
kube-system     kube-apiserver-ip-10-0-0-103.ap-northeast-1.compute.internal             1/1       Running             0          8m
kube-system     kube-controller-manager-ip-10-0-0-102.ap-northeast-1.compute.internal    1/1       Running             0          9m
kube-system     kube-controller-manager-ip-10-0-0-103.ap-northeast-1.compute.internal    1/1       Running             0          8m
kube-system     kube-dns-v20-qdkp8                                                       0/3       ContainerCreating   0          11m
kube-system     kube-proxy-ip-10-0-0-102.ap-northeast-1.compute.internal                 1/1       Running             0          10m
kube-system     kube-proxy-ip-10-0-0-103.ap-northeast-1.compute.internal                 1/1       Running             0          8m
kube-system     kube-proxy-ip-10-0-0-58.ap-northeast-1.compute.internal                  1/1       Running             0          9m
kube-system     kube-proxy-ip-10-0-0-59.ap-northeast-1.compute.internal                  1/1       Running             0          10m
kube-system     kube-proxy-ip-10-0-0-60.ap-northeast-1.compute.internal                  1/1       Running             0          10m
kube-system     kube-proxy-ip-10-0-0-61.ap-northeast-1.compute.internal                  1/1       Running             0          10m
kube-system     kube-scheduler-ip-10-0-0-102.ap-northeast-1.compute.internal             1/1       Running             0          9m
kube-system     kube-scheduler-ip-10-0-0-103.ap-northeast-1.compute.internal             1/1       Running             0          8m

I'm not sure whether it is caused by the changes from the PR or not right now.

KMS encrypt of private keys causes unnecessary CloudFormation replacements

Each KMS encrypt operation results in different ciphertext, causing updates of the userdata section inside of the launchconfiguration. This results in complete cluster rollovers on no-op's/changes to only one cluster component (eg instance type)

Haven't checked what happens with nodepools yet.

Some solutions to this issue:

put encrypted blobs in an S3 bucket
cache encrypted blobs and only update userdata on actual updates

I like the S3 bucket idea more if only because we don't need to keep local kube-aws state but would like some input from the community.

Allow workers to sit inside private subnets within an existing VPC

Use case proposal:

Use an existing VPC, e.g. has an existing RDS deployment inside
Private subnets used for at least workers, perhaps controllers too if we use a bastion (first use case we want public controllers so we can SSH into the cluster)
Use existing NAT Gateway per AZ for HA with the appropriate route tables used per worker AZ

Similar to kubernetes/kops#428.

It was part of what coreos/coreos-kubernetes#716 was intended to cover. Pre-filing this issue so we can move discussion out of #35.

Creating cluster with an existing subnet

Hello,
We can specify an existing VPC, but not an existing subnet. At work, I can't create network stuff. I have forked the file stack-template.json but It would be cool to have the option built-in.

Thanks.

Conformance tests failing in 794a22f

Failure [627.107 seconds]
[BeforeSuite] BeforeSuite
/go/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/e2e.go:154

  Nov  2 00:45:26.164: Error waiting for all pods to be running and ready: 1 / 10 pods in namespace "kube-system" are NOT in the desired state in 10m0s
  POD                              NODE                                          PHASE   GRACE CONDITIONS
  heapster-v1.2.0-3646253287-cdm3e ip-10-0-0-200.ap-northeast-1.compute.internal Pending       [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2016-11-02 00:11:52 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2016-11-02 00:11:52 +0000 UTC ContainersNotReady containers with unready status: [heapster heapster-nanny]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2016-11-02 00:11:51 +0000 UTC  }]


  /go/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/e2e.go:124
------------------------------

Ran 105 of 385 Specs in 627.110 seconds
FAIL! -- 0 Passed | 105 Failed | 0 Pending | 280 Skipped --- FAIL: TestE2E (627.14s)
FAIL

List of IAM permissions

Hi, it would be nice to have precise list of IAM permissions needed to be successfully deploy a cluster.

For example to delegate cluster creation to a non-admin IAM account.

Kube node drainer service fails to start.

I think the line Restart=on-failure from kube-node-drainer.service should be removed.

Getting these errors:

Failed to restart kubelet.service: Unit kube-node-drainer.service is not loaded properly: Invalid argument.
kube-node-drainer.service: Service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.

master: flanneld seems to be timing out while running decrypt-tls-asset in ExecStartPre

With 7ea5f6b, I've seen an error message like:

Nov 17 08:48:10 ip-10-0-0-216.ap-northeast-1.compute.internal systemd[1]: flanneld.service: Start-pre operation timed out. Terminating.
Nov 17 08:48:11 ip-10-0-0-216.ap-northeast-1.compute.internal etcdctl[8586]: open /etc/kubernetes/ssl/etcd-client.pem: no such file or directory
Nov 17 08:48:11 ip-10-0-0-216.ap-northeast-1.compute.internal systemd[1]: flanneld.service: Control process exited, code=exited status=1

Full log can be seen at https://gist.github.com/mumoshu/6f9fe119f882d3fcda40322d209123d8

It seems that after decrypt-tls-assets timing out, systemd continues to run next ExecStartPre, which also end up with an error like etcd-client.pem: no such file or directory(it might be so because you systemd terminated decrypt-tls-assets which is intended to generate that file!)

It seems to take about 3 min 30 sec until flanneld fully gets up and running.
Could we shorten it by removing unnecessary timeouts like this?

does hostPort not work on kube-aws/CoreOS?

I am unable to get it to docker expose the port on the host. See kubernetes/kubernetes#23920 (comment) also

Proposal: AWS environment variable file

It is currently impossible to directly utilize cloudformation Refs inside of the userdata / cloud-init files as those files get passed verbatim to the UserData parameters in the stack-template.json file.

I think it's useful to provide an /etc/aws-environment file in which to pass a list of AWS environment variables by Ref, for the end user to utilize using systemd's EnvironmentFile option .

We can easily append this right now by using Fn::Join to the UserData as the last section in cloud_init is write_files.

Immediate usecase: passing in cloudformation wait condition urls.

kube-dns-v20 Number of replicas too low (1) for multi-node cluster availability.

The number of replicas for kube-dns-v20 is set by default to 1 when setting up a new cluster. This means that during a rolling-upgrade this service will go down multiple times while it gets re-scheduled to different nodes. Upping the default replica count to 2, or 3 would be a better solution for people un-aware of this setting. I'm hoping that the next time I spin up a cluster I won't have to remember to do this manually. (config option or default > 1).

rkt too old in CoreOS stable

Based on the report here kube-aws uses CoreOS Linux stable but Kubernetes doesn't like this old version of rkt: coreos/bugs#1640

How is everyone going to handle a leap second added to the end of 2016, for kube-aws created clusters?

IFAIK,

CoreOS uses systemd-timesyncd instead of ntpd or chronyd to sync time, by default.
It is possible to switch to ntpd as described in the CoreOS doc
We usually(?) handle leap second with the slew feature which can be seen in ntpd, chronyd
The slew feature make the leap second invisible and the clock is gradually fixed over time. Theoretically, it will be enough to completely avoid issues from a leap second.

However:

systemd-timesyncd's slew feature is limited. It applies the slew feature only when the gap of the clock is less than 0.4 sec according to timesyncd's source
ntpd.service included in coreos-overlay doesn't seem to provide -x -g option to enable "slew" according to coreos-overlay's source

Feature: Node pools

succeed to coreos/coreos-kubernetes#667

I'd like to add GKE's "Node Pools" like feature to kube-aws.

With that feature, we can differentiate the following things per pool without maintaining multiple etcd clusters and kubernetes controller planes:

instance type
storage size
instance profile
security group
min/max spot price
auto scaling group or spot fleet
node labels

It will eventually give us more granular control over:

security
- in combination with node selectors, we can
  1. restrict nodes which are able to communicate with databases containing sensitive information and
  2. schedule only necessary pods onto the nodes
cost effectiveness
- in combination with node selectors, we can
  1. make pool(s) backed by spot instances and label nodes with e.g. spot=true
  2. explicitly schedule pods which can be spawned/killed more aggressively onto those nodes
scalability
- creating a separate pool per availability zone allows us utilizing cluster-autoscaler in the multi-AZ setup (see coreos/coreos-kubernetes#668)
etc
- If you are aware of one, let us know!

Beware that until the taint-and-toleration feature is implemented into Kubernetes, we'd need to give pods specific node selectors to prevent them from being scheduled onto undesirable nodes.

Edit: Taints and tolerations are partially supported in Kubernetes 1.4.x and introduced in kube-aws since #132

TODOs(possibly done in separate pull requests)

kube-aws nodepool update (#130)
- fix kube-aws nodepool update which was always failing 😉 (#140)
kube-aws nodepool validate #161
kube-aws nodepool destroy (#139)
Node labels configurable via node-pools/yourpoolname/cluster.yaml #149
Spot Fleet support (#112 #113)
confirm #68 doesn't affect node pools
redo #49 for node pools (#128)

Non-TODOs

Multi region deployment a.k.a deploy node pools in another region(s)

Restarting kubelet forgets CNI state and reuses pod IPs

Previously coreos/coreos-kubernetes#755.

I have a cluster built with kube-aws-v0.8.3 and hyperkube-v1.4.5_coreos.0.

While trying to shake out an EBS volume mount problem on one of the minions, I restarted kubelet using systemctl restart kubelet on that host. After the restart, it seems active pod IPs are getting reassigned.

core@ip-10-0-48-65 ~ $ journalctl | grep Determined.pod.ip 
Nov 07 22:31:20 ip-10-0-48-65.ec2.internal kubelet-wrapper[1706]: I1107 22:31:20.320259    1706 docker_manager.go:2162] Determined pod ip after infra change: "kubernetes-dashboard-v1.4.1-wbo4r_kube-system(e6f3c346-a539-11e6-b0a2-06cd71806322)": "10.2.37.2"
Nov 07 22:31:20 ip-10-0-48-65.ec2.internal kubelet-wrapper[1706]: I1107 22:31:20.549372    1706 docker_manager.go:2162] Determined pod ip after infra change: "kube-dns-v20-83ltq_kube-system(e6f4151f-a539-11e6-b0a2-06cd71806322)": "10.2.37.3"
Nov 07 22:31:20 ip-10-0-48-65.ec2.internal kubelet-wrapper[1706]: I1107 22:31:20.608372    1706 docker_manager.go:2162] Determined pod ip after infra change: "heapster-v1.2.0-3646253287-1nj4n_kube-system(e6f4a391-a539-11e6-b0a2-06cd71806322)": "10.2.37.4"
Nov 07 22:31:36 ip-10-0-48-65.ec2.internal kubelet-wrapper[1706]: I1107 22:31:36.128090    1706 docker_manager.go:2162] Determined pod ip after infra change: "heapster-v1.2.0-4088228293-q7nnq_kube-system(f0f4afa3-a539-11e6-b0a2-06cd71806322)": "10.2.37.5"
Nov 08 16:25:09 ip-10-0-48-65.ec2.internal kubelet-wrapper[1706]: I1108 16:25:09.883226    1706 docker_manager.go:2162] Determined pod ip after infra change: "sterling-etcd-txvpf_default(cca2979b-a5cf-11e6-b0a2-06cd71806322)": "10.2.37.6"
Nov 08 16:28:54 ip-10-0-48-65.ec2.internal kubelet-wrapper[1706]: I1108 16:28:54.127206    1706 docker_manager.go:2162] Determined pod ip after infra change: "sterling-etcd-2180225719-92xfx_default(6639c26d-a5d0-11e6-b0a2-06cd71806322)": "10.2.37.7"
[kubelet restart happens here]
Nov 08 20:43:38 ip-10-0-48-65.ec2.internal kubelet-wrapper[10530]: I1108 20:43:38.716862   10530 docker_manager.go:2162] Determined pod ip after infra change: "sterling-etcd-2180225719-g6a5e_default(3c3850ff-a5ee-11e6-b0a2-06cd71806322)": "10.2.37.2"
Nov 08 21:03:54 ip-10-0-48-65.ec2.internal kubelet-wrapper[10530]: I1108 21:03:54.104308   10530 docker_manager.go:2162] Determined pod ip after infra change: "haproxy-init-3189418104-vd5cn_default(dafe6a15-a5f6-11e6-b0a2-06cd71806322)": "10.2.37.3"
Nov 08 21:05:21 ip-10-0-48-65.ec2.internal kubelet-wrapper[10530]: I1108 21:05:21.789693   10530 docker_manager.go:2162] Determined pod ip after infra change: "bash_default(0f4309fa-a5f7-11e6-b0a2-06cd71806322)": "10.2.37.4"
Nov 08 21:06:29 ip-10-0-48-65.ec2.internal kubelet-wrapper[10530]: I1108 21:06:29.234689   10530 docker_manager.go:2162] Determined pod ip after infra change: "haproxy-init_default(3778d187-a5f7-11e6-b0a2-06cd71806322)": "10.2.37.5"
Nov 08 21:11:11 ip-10-0-48-65.ec2.internal kubelet-wrapper[10530]: I1108 21:11:11.348714   10530 docker_manager.go:2162] Determined pod ip after infra change: "haproxy-init_default(dfa25588-a5f7-11e6-b0a2-06cd71806322)": "10.2.37.6"
Nov 08 21:58:29 ip-10-0-48-65.ec2.internal kubelet-wrapper[10530]: I1108 21:58:29.094155   10530 docker_manager.go:2162] Determined pod ip after infra change: "sterling-etcd-2180225719-sj3a3_default(7a7ea798-a5fe-11e6-b0a2-06cd71806322)": "10.2.37.7"

This message from the time o the restart seems to explain the problem:

Nov 07 22:31:20 ip-10-0-48-65.ec2.internal kubelet-wrapper[1706]: 2016/11/07 22:31:20 Error retriving last reserved ip: Failed to retrieve last reserved ip: open /var/lib/cni/networks/podnet/last_reserved_ip: no such file or directory

UserData for the Controller and Worker

Hi,

We are used to put some customised tasks into the AWS Cloud Init part for adding own logging and other stuff to AWS instances.

I wonder if the userdata can be decrypted to allow us achieve it ?

Base64 decode it and then followed by using the KMS key to decrypt ?

Thanks

Ken

Bump kubernetes version to v1.4.5+coreos.0

https://github.com/coreos/kubernetes/releases/tag/v1.4.5%2Bcoreos.0

Proposal: Switch CloudFormation to YAML

Quite self explanatory, is there any interest in switching to the YAML form of CloudFormation? If there is interest of at least POC, I can submit a pull request for it. If we go ahead, we'll need to pick the right timing for a merge as some other pulls will likely need updating.

Destructive defaults on 'kube-aws render'

I just accidentially ran kube-aws render without specifying whether i wanted to render the credentials or stack templates. The default now renders both, even if credentials already exist, overwriting the old credentials in the process.

I feel this is dangerous. We probably want to add a --force option to the kube-aws render credentials command.

Command output:

$ ./kube-aws render
WARNING: 'kube-aws render' is deprecated. See 'kube-aws render --help' for usage
Generating TLS credentials...
-> Generating new TLS CA
-> Generating new TLS assets
Success! Stack rendered to stack-template.json.

Next steps:
1. (Optional) Validate your changes to cluster.yaml with "kube-aws validate"
2. (Optional) Further customize the cluster by modifying stack-template.json or files in ./userdata.
3. Start the cluster with "kube-aws up".

Creating cluster with an existing IAM roles

Hello,
At work, I can't create roles. I have forked the file stack-template.json but It would be cool to have the option built-in.

Thanks.

repo: Enable Codecov integration

To encourage everyone including me to write tests

What happened to signed releases?

Hey everybody,

Are there any plans to sign future releases of kube-aws?

When this project was a part of coreos/coreos-kubernetes, a signature was provided with each release and you had the option of verifying it before untarring/installing. There still seems to be some vestigial documentation explaining how to go about verifying releases in the README.md.

Does who/whatever creates the releases not have access to CoreOS's signing key? Should we clear out the aforementioned parts of the docs?

repo: Enable Travis CI builds

To automate unit testing which for now I have to run against each pull request manually!
I have write but not admin privileges so we have to ask anyone with admin privileges to enable it.

Fix instability while starting kubelet on worker/controller nodes

i.e. coreos/coreos-kubernetes#697

Flannel temporarily logged L3 miss/Route not found errors with the kubedns pod's ip after replacing worker nodes

Within a worker node in a main cluster(not node pool) created via kube-aws update with the kube-aws binary built from cb2a9cf which is the latest master plus the node pools feature, two errors, Route for 10.2.86.2 not found and L3 miss: 10.2.86.2, have been periodically observed.
They seemed to be preventing kubedns from being fully up and all the kube-aws E2E tests were failing until flanneld finally stopped complaining with those errors and then kubedns became up.

I'm not aware of the root cause(s) of this issue and fix(s) right now.
How to work around?
I guess we have to wait for some time until it starts working again.

I've just waited 10 or more minutes until flannel finally stopped complaining and kubedns became up, ran kube-aws E2E again, and verified it was working at that time.

journalctl -e:

Nov 29 04:00:59 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:00:59.447507 01378 network.go:225] L3 miss: 10.2.86.2
Nov 29 04:00:59 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:00:59.447584 01378 network.go:229] Route for 10.2.86.2 not found
Nov 29 04:00:59 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1047]: Downloading ACI:  8.65 MB/36.4 MB
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1047]: Downloading ACI:  8.67 MB/36.4 MB
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:03.455575 01378 network.go:225] L3 miss: 10.2.86.2
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:03.455620 01378 network.go:229] Route for 10.2.86.2 not found
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal dockerd[1585]: time="2016-11-29T04:01:03.782349506Z" level=info msg="Container 1a6f7a10aadaad041c2453e480fa2e2b92b77d8621bb4f82d9e008cb73d01c0d
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal dockerd[1585]: time="2016-11-29T04:01:03.794613394Z" level=warning msg="container 1a6f7a10aadaad041c2453e480fa2e2b92b77d8621bb4f82d9e008cb73d01
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal kubelet-wrapper[1491]: I1129 04:01:03.795873    1491 docker_manager.go:2443] checking backoff for container "dnsmasq" in pod "kube-dns-v20-boc9
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal kubelet-wrapper[1491]: E1129 04:01:03.821522    1491 docker_manager.go:746] Logging security options: {key:seccomp value:unconfined msg:}
Nov 29 04:01:03 ip-10-0-0-45.ap-northeast-1.compute.internal kubelet-wrapper[1491]: I1129 04:01:03.958906    1491 docker_manager.go:2443] checking backoff for container "kubedns" in pod "kube-dns-v20-boc9
Nov 29 04:01:04 ip-10-0-0-45.ap-northeast-1.compute.internal kubelet-wrapper[1491]: E1129 04:01:04.019310    1491 docker_manager.go:746] Logging security options: {key:seccomp value:unconfined msg:}
Nov 29 04:01:04 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:04.461484 01378 network.go:225] L3 miss: 10.2.86.2
Nov 29 04:01:04 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:04.461528 01378 network.go:229] Route for 10.2.86.2 not found
Nov 29 04:01:04 ip-10-0-0-45.ap-northeast-1.compute.internal kubelet-wrapper[1491]: I1129 04:01:04.828621    1491 operation_executor.go:900] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/d5
Nov 29 04:01:05 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:05.463523 01378 network.go:225] L3 miss: 10.2.86.2
Nov 29 04:01:05 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:05.463568 01378 network.go:229] Route for 10.2.86.2 not found
Nov 29 04:01:05 ip-10-0-0-45.ap-northeast-1.compute.internal kubelet-wrapper[1491]: I1129 04:01:05.847086    1491 operation_executor.go:900] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/d5
Nov 29 04:01:07 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1047]: Downloading ACI:  8.69 MB/36.4 MB
Nov 29 04:01:11 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:11.430824 01378 network.go:225] 
Nov 29 04:01:11 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:11.430871 01378 network.go:229] Route for 10.2.86.2 not found
Nov 29 04:01:12 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:12.433536 01378 network.go:225] L3 miss: 10.2.86.2
Nov 29 04:01:12 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:12.433578 01378 network.go:229] Route for 10.2.86.2 not found
Nov 29 04:01:13 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:13.435535 01378 network.go:225] L3 miss: 10.2.86.2
Nov 29 04:01:13 ip-10-0-0-45.ap-northeast-1.compute.internal rkt[1378]: I1129 04:01:13.435580 01378 network.go:229] Route for 10.2.86.2 not found

subnet.env for flanneld:

core@ip-10-0-0-45 ~ $ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.2.0.0/16
FLANNEL_SUBNET=10.2.86.1/24
FLANNEL_MTU=8951
FLANNEL_IPMASQ=true

running processes:

core@ip-10-0-0-45 ~ $ ps aux | grep docker
root      1491  2.5  2.7 391284 107536 ?       Ssl  03:42   1:02 /kubelet --api-servers=https://kubeawstest2.my.example.com --network-plugin-dir=/etc/kubernetes/cni/net.d --network-plugin=cni --container-runtime=docker --rkt-path=/usr/bin/rkt --rkt-stage1-image=coreos.com/rkt/stage1-coreos --register-node=true --allow-privileged=true --pod-manifest-path=/etc/kubernetes/manifests --cluster_dns=10.3.0.10 --cluster_domain=cluster.local --cloud-provider=aws --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml --tls-cert-file=/etc/kubernetes/ssl/worker.pem --tls-private-key-file=/etc/kubernetes/ssl/worker-key.pem
root      1585  1.5  1.4 795364 55756 ?        Ssl  03:42   0:36 docker daemon --host=fd:// --mtu=8951 --selinux-enabled
root      1605  0.0  0.3 344388 13940 ?        Ssl  03:42   0:00 containerd -l /var/run/docker/libcontainerd/docker-containerd.sock --runtime runc --start-timeout 2m
root      2022  0.0  0.0 198952  2872 ?        Sl   03:44   0:00 containerd-shim f4e32cb437b1dd5b2827aef86ee24351eae1a112550999716668fc5c5f5b1a62 /var/run/docker/libcontainerd/f4e32cb437b1dd5b2827aef86ee24351eae1a112550999716668fc5c5f5b1a62 runc
root      2270  0.0  0.0 198952  3504 ?        Sl   03:45   0:00 containerd-shim 81d018b436fd59575e4d83606a421a36abc66006298c7501e260d713304a640e /var/run/docker/libcontainerd/81d018b436fd59575e4d83606a421a36abc66006298c7501e260d713304a640e runc
root      6002  0.0  0.1 133416  5660 ?        Sl   03:49   0:00 containerd-shim f77979ab53a0a750f09a7044d40fa54fdcf360c3bb05930bb686c710968aac25 /var/run/docker/libcontainerd/f77979ab53a0a750f09a7044d40fa54fdcf360c3bb05930bb686c710968aac25 runc
root      6446  0.0  0.1 198952  5612 ?        Sl   03:49   0:00 containerd-shim 08ade40b65540e60e602b792b6bc4178f3bad71faa86132f675c4e0f36842419 /var/run/docker/libcontainerd/08ade40b65540e60e602b792b6bc4178f3bad71faa86132f675c4e0f36842419 runc
root     10968  0.0  0.0 198952  3508 ?        Sl   04:22   0:00 containerd-shim 1e491087f589fb66ea762f933565ab1dd05383371e9e141445824b124b174456 /var/run/docker/libcontainerd/1e491087f589fb66ea762f933565ab1dd05383371e9e141445824b124b174456 runc
core     11104  0.0  0.0   6760   916 pts/0    S+   04:23   0:00 grep --colour=auto docker

Other, more complete logs can be seen at https://gist.github.com/mumoshu/da18f8f3ba46688f0a3e5e51889f60d1

Avoid the 51200 bytes limitation errors of CloudFormation

With the current master 6255751, when creating a test cluster for E2e testing, I've finally hit the limitation error:

*your template body here* at 'templateBody' failed to satisfy constraint: Member must have length less than or equal to 51200

fix(test): rkt+stable is supported now

kube-aws had been intentionally failing fast when tried to use rkt with CoreOS stable which had been bundled with versions of rkt too old for Kubernetes.
However, starting today, it is ok to use rkt+stable because the latest stable includes rkt newer enough. (Also see #21)

Our CI caught that fact and started failing several hours ago. Let fix!
https://travis-ci.org/coreos/kube-aws/builds/172463595

request for clarification: why does kube-aws update take down my controller

I am very new to kubernetes and coreos, but i'm confused about why the following commands take down my cluster for a few minutes (on v0.9.1):

# change workerCount in cluster.yaml from 1 to 2
kube-aws render stack
kube-aws validate
kube-aws update

I was expecting another worker node to be launched.

Missing git history from coreos-kubernetes/tree/master/multi-node/aws

The first commit message in this repo says:
"Move everything under the multi-node/aws directory from coreos/coreos-kubernetes c25ac21"

But all the commit history from the original location is lost
https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws

Could you maybe extract the directory's history?
http://stackoverflow.com/questions/1662753/export-subtree-in-git-with-history

Also, should https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws be stopped/blocked if this repo is to become the official location?

etcd2 cluster doesnt start

Im not sure whether i have something configured incorrectly or whether this is a bug so apologies if this is posted as an issue and its not. I have looked for docs around this config and cant see anything obviously wrong but happy to read more if someone can point me in the right direction.

Problem is that using default config the etcd2 cluster wont start.

Im using v0.9.1-rc.2

below is the etc2d config in userdata/cloud-config-etcd file which is unchanged from what was generated using the kube-aws render command

_units:
- name: etcd2.service
drop-ins:
- name: 20-etcd2-aws-cluster.conf
content: |
[Unit]
Requires=decrypt-tls-assets.service
After=decrypt-tls-assets.service

        [Service]
        Environment=ETCD_NAME=%H

        Environment=ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
        Environment=ETCD_PEER_CERT_FILE=/etc/etcd2/ssl/etcd.pem
        Environment=ETCD_PEER_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem

        Environment=ETCD_CLIENT_CERT_AUTH=true
        Environment=ETCD_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
        Environment=ETCD_CERT_FILE=/etc/etcd2/ssl/etcd.pem
        Environment=ETCD_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem

        Environment=ETCD_INITIAL_CLUSTER_STATE=new
        Environment=ETCD_INITIAL_CLUSTER={{.EtcdInitialCluster}}
        Environment=ETCD_DATA_DIR=/var/lib/etcd2
        Environment=ETCD_LISTEN_CLIENT_URLS=https://%H:2379
        Environment=ETCD_ADVERTISE_CLIENT_URLS=https://%H:2379
        Environment=ETCD_LISTEN_PEER_URLS=https://%H:2380
        Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=https://%H:2380
        PermissionsStartOnly=true
        ExecStartPre=/usr/bin/chown -R etcd:etcd /var/lib/etcd2
  enable: true
  command: start_

my cluster.yaml contains this:

hostedZoneId: "sandbox.testwaikato.kiwi"
and
etcdCount: 3

all other DNS and etcd config in it is default.

The problem is that when the etcd tries to start with this config there is an error:

Nov 15 20:10:52 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: Starting etcd2...
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd2/ssl/etcd.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_ELECTION_TIMEOUT=1200
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_INITIAL_CLUSTER=ip-172-19-76-198.ap-southeast-2.compute.internal=https://ip-172-19-76-198.ap-southeast-2.compute.intern
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_NAME=ip-172-19-76-198.sandbox.testwaikato.kiwi
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/etcd2/ssl/etcd.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: etcd Version: 2.3.7
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: Git SHA: fd17c91
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: Go Version: go1.7.1
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: Go OS/Arch: linux/amd64
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: found invalid file/dir lost+found under data dir /var/lib/etcd2 (Ignore this if you are upgrading etcd)
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: peerTLS: cert = /etc/etcd2/ssl/etcd.pem, key = /etc/etcd2/ssl/etcd-key.pem, ca = , trusted-ca = /etc/etcd2/ssl/ca.pem, client-cert-auth = false
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: listening for peers on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: clientTLS: cert = /etc/etcd2/ssl/etcd.pem, key = /etc/etcd2/ssl/etcd-key.pem, ca = , trusted-ca = /etc/etcd2/ssl/ca.pem, client-cert-auth = true
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: listening for client requests on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: stopping listening for client requests on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: stopping listening for peers on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: couldn't find local name "ip-172-19-76-198.sandbox.testwaikato.kiwi" in the initial cluster configuration
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: Failed to start etcd2.
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: etcd2.service: Unit entered failed state.
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: etcd2.service: Failed with result 'exit-code'.

running hostname on each etcd server returns the non aws hostname
e.g
ip-172-19-76-198.sandbox.testwaikato.kiwi

if i then go onto my etcd2 servers and change the /etc/systemd/system/etc2.service.d/20-etcd2-aws-cluster.conf file by replacing the %H references with the aws DNS values for the host

e.g

Environment=ETCD_LISTEN_CLIENT_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2379
Environment=ETCD_LISTEN_PEER_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2380
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2380

and start the service it works.

It looks to me like the Environment=ETCD_INITIAL_CLUSTER=ip-172-19-76-198.ap-southeast-2.compute.internal=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2380,ip-172-19-77-197.ap-southeast-2.compute.internal=https://ip-172-19-77-197.ap-southeast-2.compute.internal:2380,ip-172-19-76-199.ap-southeast-2.compute.internal=https://ip-172-19-76-199.ap-southeast-2.compute.internal:2380 line is using the aws DNS entries but by having %H in the user data i get ip-172-19-76-199.sandbox.testwaikato.kiwi in my config and even though they both resolve etcd wont start because of this?

So is this a bug or is there someway to set the config to either my local dns names OR aws hostnames in userdata/cloud-config-etcd ?

etcd management

There are currently some people forking because we're not sure about the current etcd solution. Lets discuss the issues in this topic. A lot of us seem to center around @crewjam's etcd solution but there's also others:

I have a personal preference for https://crewjam.com/etcd-aws/ (https://github.com/crewjam/etcd-aws) but we should definitely have this conversation as a community (as i tried in the old repo coreos/coreos-kubernetes#629)
Lets combine our efforts @colhom @camilb @dzavalkinolx

Branches for inspiration:

https://github.com/pieterlange/kube-aws/tree/feature/external-etcd (refers to etcd-aws managed by separate cloudformation template)
@camilb's work
- https://github.com/camilb/etcd-aws/tree/ssl (SSL support for etcd-aws)
- https://github.com/camilb/coreos-kubernetes/tree/etcd-asg

Currently missing features:

backup/restore
node cycling
node discovery from ASG
cluster recovery from complete failure

As noted in the overall production readiness issue #9 there's also work being done on etcd being hosted inside of kubernetes itself, which is probably where all of this is going in the end.

Can't build executable kube-aws from the lastest release source code.

I must say that I'm not familiar with Go now...

Download the source code package
And try to run ./build, but failed with NOT a git repo

Then I cloned the repo, and run ./build
Some packages could not found at GOROOT and GOPATH, I noticed there a glide.yaml and glide.lock file, so try to install glide and run glide install .
And there's a warning:
[WARN] The name listed in the config file (github.com/coreos/kube-aws) does not match the current location (.)

Continue to run ./build command, and still raise errors:

Building kube-aws for GOOS=linux GOARCH=amd64 Building kube-aws df86ecf75b30cc80b8994699bdf31832cbee6257 cmd/kube-aws/command_destroy.go:8:2: cannot find package "github.com/coreos/kube-aws/cluster" in any of: $GOROOT/src/github.com/coreos/kube-aws/cluster (from $GOROOT) $GOPATH/src/github.com/coreos/kube-aws/cluster (from $GOPATH) cmd/kube-aws/command_destroy.go:9:2: cannot find package "github.com/coreos/kube-aws/config" in any of: $GOROOT/src/github.com/coreos/kube-aws/config (from $GOROOT) $GOPATH/src/github.com/coreos/kube-aws/config (from $GOPATH) cmd/kube-aws/command_render.go:16:2: cannot find package "github.com/coreos/kube-aws/tlsutil" in any of: $GOROOT/src/github.com/coreos/kube-aws/tlsutil (from $GOROOT) $GOPATH/src/github.com/coreos/kube-aws/tlsutil (from $GOPATH) cmd/kube-aws/command_destroy.go:6:2: cannot find package "github.com/spf13/cobra" in any of: $GOROOT/src/github.com/spf13/cobra (from $GOROOT) $GOPATH/src/github.com/spf13/cobra (from $GOPATH)

glide list shows:

INFO] Package github.com/aws/aws-sdk-go/aws/awserr found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/awserr [INFO] Package github.com/aws/aws-sdk-go/aws/credentials found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/credentials [INFO] Package github.com/aws/aws-sdk-go/aws/client found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/client [INFO] Package github.com/aws/aws-sdk-go/aws/corehandlers found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/corehandlers [INFO] Package github.com/aws/aws-sdk-go/aws/credentials/stscreds found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/credentials/stscreds [INFO] Package github.com/aws/aws-sdk-go/aws/defaults found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/defaults [INFO] Package github.com/aws/aws-sdk-go/aws/request found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/request [INFO] Package github.com/aws/aws-sdk-go/private/endpoints found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/endpoints [INFO] Package github.com/go-ini/ini found in vendor/ folder [WARN] Version not set for package github.com/go-ini/ini [INFO] Package github.com/aws/aws-sdk-go/aws/awsutil found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/awsutil [INFO] Package github.com/aws/aws-sdk-go/aws/client/metadata found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/client/metadata [INFO] Package github.com/aws/aws-sdk-go/aws/signer/v4 found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/signer/v4 [INFO] Package github.com/aws/aws-sdk-go/private/protocol found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol [INFO] Package github.com/aws/aws-sdk-go/private/protocol/query found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/query [INFO] Package github.com/aws/aws-sdk-go/private/waiter found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/waiter [INFO] Package github.com/aws/aws-sdk-go/private/protocol/ec2query found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/ec2query [INFO] Package github.com/aws/aws-sdk-go/private/protocol/restxml found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/restxml [WARN] Package github.com/coreos/kube-aws/config is not installed [INFO] Not found in vendor/: github.com/coreos/kube-aws/config (1) [WARN] Package github.com/coreos/kube-aws/cluster is not installed [INFO] Not found in vendor/: github.com/coreos/kube-aws/cluster (1) [WARN] Package github.com/coreos/kube-aws/tlsutil is not installed [INFO] Not found in vendor/: github.com/coreos/kube-aws/tlsutil (1) [INFO] Package github.com/inconshreveable/mousetrap found in vendor/ folder [WARN] Version not set for package github.com/inconshreveable/mousetrap [INFO] Package github.com/spf13/pflag found in vendor/ folder [WARN] Version not set for package github.com/spf13/pflag [INFO] Package github.com/aws/aws-sdk-go/private/protocol/jsonrpc found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/jsonrpc [INFO] Package github.com/coreos/coreos-cloudinit/config found in vendor/ folder [WARN] Version not set for package github.com/coreos/coreos-cloudinit/config [INFO] Package github.com/coreos/yaml found in vendor/ folder [WARN] Version not set for package github.com/coreos/yaml [WARN] Package github.com/coreos/kube-aws/coreosutil is not installed [INFO] Not found in vendor/: github.com/coreos/kube-aws/coreosutil (1) [INFO] Package github.com/aws/aws-sdk-go/service/sts found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/service/sts [INFO] Package github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds [INFO] Package github.com/aws/aws-sdk-go/aws/credentials/endpointcreds found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/credentials/endpointcreds [INFO] Package github.com/aws/aws-sdk-go/aws/ec2metadata found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/aws/ec2metadata [INFO] Package github.com/jmespath/go-jmespath found in vendor/ folder [WARN] Version not set for package github.com/jmespath/go-jmespath [INFO] Package github.com/aws/aws-sdk-go/private/protocol/rest found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/rest [INFO] Package github.com/aws/aws-sdk-go/private/protocol/query/queryutil found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/query/queryutil [INFO] Package github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil [INFO] Package github.com/aws/aws-sdk-go/private/protocol/json/jsonutil found in vendor/ folder [WARN] Version not set for package github.com/aws/aws-sdk-go/private/protocol/json/jsonutil [ERROR] Error listing dependencies: Error resolving imports

Then I reset the changes from glide install, and use go get to install the missing deps, run command ./build again and show the following:
`Building kube-aws df86ecf

# _$HOME/kube-aws/cmd/kube-aws
cmd/kube-aws/command_init.go:60: undefined: config.DefaultClusterConfig
cmd/kube-aws/command_render.go:93: undefined: config.KubeConfigTemplate
cmd/kube-aws/command_render.go:109: undefined: config.CloudConfigController
cmd/kube-aws/command_render.go:110: undefined: config.CloudConfigWorker
cmd/kube-aws/command_render.go:111: undefined: config.CloudConfigEtcd
cmd/kube-aws/command_render.go:112: undefined: config.StackTemplateTemplate`

Could you please give some suggestions or release a binary package?

decrypt-tls-assets needlessly taking time

I've noticed that the current implementation of decrypt-tls-assets running rkt run multiple times seems to be resulting in adding up to 20 seconds(or possibly more?) in total run time.

One of the root causes seems to be that rkt run runs integrity check on a container image. Possibly related rkt/rkt#1350
Without bypassing the check, we can make total run time shorter via running rkt run just once and loop over all the tls assets to decrypt inside the container.

I've benchmarked three implementations from v1 to v3. v1 is the original implementation.
https://gist.github.com/mumoshu/d1bb67ba48b975eccbec4bdf83559ade

v2 looks good as it has a good balance of simplicity and speed.

vanilla v0.9.1-rc.4:
https://gist.github.com/mumoshu/8473065955a4602677fc2698c3720abf

v0.9.1-rc.4 + v2 decrypt-tls-assets:
https://gist.github.com/mumoshu/8bec0b49e7628ce5d929b467b1f70e15

Restarts triggered too early/too much for install-kube-system.service and install-calico-system.service

I've looked into journalctl output from my controller node and saw:

Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Control process exited, code=exited status=3
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Failed to start install-kube-system.service.
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Unit entered failed state.
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Failed with result 'exit-code'.
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal coreos-cloudinit[875]: 2016/11/23 02:01:44 Result of "start" on "install-kube-system.service": failed
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal coreos-cloudinit[875]: 2016/11/23 02:01:44 Calling unit command "start" on "install-calico-system.service"'
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Starting install-calico-system.service...
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemctl[1132]: activating
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-calico-system.service: Control process exited, code=exited status=3
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Failed to start install-calico-system.service.
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-calico-system.service: Unit entered failed state.
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-calico-system.service: Failed with result 'exit-code'.
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal coreos-cloudinit[875]: 2016/11/23 02:01:44 Result of "start" on "install-calico-system.service": failed
Nov 23 02:01:44 ip-10-0-0-57.ap-northeast-1.compute.internal coreos-cloudinit[875]: 2016/11/23 02:01:44 Calling unit command "start" on "cfn-signal.service"

It seems that install-kube-system and install-calico-system are retried every second and spamming like this for about ten minutes, until they finally succeeds to pass health checks in ExecStartPre=/usr/bin/curl http://127.0.0.1:8080/version:

Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Control process exited, code=exited status=7
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Failed to start install-kube-system.service.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Unit entered failed state.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Failed with result 'exit-code'.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Service hold-off time over, scheduling restart.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Stopped install-kube-system.service.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Starting install-kube-system.service...
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemctl[11810]: active
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemctl[11818]: active
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal curl[11822]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal curl[11822]:                                  Dload  Upload   Total   Spent    Left  Speed
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal curl[11822]: [149B blob data]
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Control process exited, code=exited status=7
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Failed to start install-kube-system.service.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Unit entered failed state.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Failed with result 'exit-code'.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Service hold-off time over, scheduling restart.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Stopped install-kube-system.service.
Nov 23 02:11:17 ip-10-0-0-57.ap-northeast-1.compute.internal systemd[1]: Starting install-kube-system.service...

It doesn't seem to prevent controller nodes to be up or make it not functional.
But we should reduce these useless errors to allow users focus on actual errors if any.

Destroy do not clean up everything

It looks like on destroy now all resources are cleaned up.
What i've seen as left over is:

Create New Role Role Actions

IAM > Roles > kubernetes-master
IAM > Roles > kubernetes-minion

and i believe security Groups too.. But my they are deleted meanwhile, so cannot provide names.

Bump Kubernetes version to 1.4.6+coreos.0

in the same way as coreos/coreos-kubernetes#760

Feature: Spot Fleet support for worker nodes

Quite self explanatory but I'd like to add this to kube-aws.

Upstream issue: kubernetes/kubernetes#24472

Initial Implementation in this project: #113
Documentation: https://github.com/coreos/kube-aws/blob/master/Documentation/kubernetes-on-aws-node-pool.md#deploying-a-node-pool-powered-by-spot-fleet

Spot fleet backed worker nodes are supported since v0.9.2-rc.3:

# Launch a main cluster
kube-aws init ...
kube-aws render
kube-aws up ...

# Launch a node pool powered by Spot Fleet
kube-aws node-pools init --node-pool-name mypoolname ...
echo -e "worker:\n  spotFleet:\n    targetCapacity: 3\n" >> node-pools/mypoolname/cluster.yaml
kube-aws node-pools render --node-pool-name mypoolname
kube-aws node-pools up --node-pool-name mypoolname --s3-uri ...

An experimental feature to automatically taint nodes with user-provided taints is supported since v0.9.2-rc.4(not yet released) so we can ensure only pods tolerant to frequent node terminations are scheduled to spot instances/spot-fleet-powered nodes:

Utilizing Spot Fleet gives us chances to dramatically reduce cost being spent on EC2 instances powering Kubernetes worker nodes
AWS says cost reduction is up to 90%. I can confirm that in my daily used region ap-northeast-1 it is up to 89% right now, with slightly varying cost for each instance type.

I believe that on top of the recent work on Node Pools #46, it is easier than ever to implement a POC of the Spot Fleet support.
I'll send a pull request to show it shortly.
I'd appreciate your feedback(s)!

Several concerns I've came up with until now:

cluster-autoscaler doesn't support Spot Fleets
- If you want to make nodes in a spot fleet auto-scaled, you probably need to tinker resulting CloudFormation templates to include appropriate configuration. See https://aws.amazon.com/jp/blogs/aws/new-auto-scaling-for-ec2-spot-fleets/ for the official announcement of autoscaling for fleets.
- Upstream issue: kubernetes-retired/contrib#2066
  - We need to teach cluster-autoscaler how it selects which node pool to expand
    - It shouldn't select an ASG which is suspended and a spot fleet which all/part of groups are beyond the bid price
    - If a pending node can be scheduled in an ASG or a spot fleet, it should select the one according to user preference
It seems there's no way to use cfn-signal like we did for standard, asg-based worker nodes to hold CloudFormation's creation/update completion until e.g. kubelet's become ready
I'm not yet sure how we can rolling-update nodes in a spot fleet like we did for standard asg-based worker nodes.
I'm assuming users already have the aws-ec2-spot-fleet-role IAM role created in their AWS accounts automatically by accessing Spot Fleets in AWS Console at least once
- But if an user had not, kube-aws nodepool up will fail while copy-pasting an error message from CloudFormation like IAM role aws-ec2-spot-fleet-role doesn't exist, which may be useless to the user as it doesn't provide any information to notify the user needed to arrive Spot Fleet in AWS console at least once
- We could create such IAM role like described in https://cloudonaut.io/3-simple-ways-of-saving-up-to-90-of-ec2-costs/, instead of assuming/referencing the possibly existing IAM role

TODOs:

Add more tests
Make rootVolumeSize and rootVolumeIOPS for each launchConfiguration defaults to worker.spotFleet.unitRootVolumeSize * weightedCapacity, worker.spotFleet.unitRootVolumeIOPS * weightedCapacity respectively
Node labels #149
Taints #132
- Relevant k8s taints and toleration experience from community: https://medium.com/@alejandro.ramirez.ch/reserving-a-kubernetes-node-for-specific-nodes-e75dc8297076#.lr7x178mo
- The kubelet options like --register-node=true --register-schedurable=false followed by kubectl taint and kubectl uncordon would avoid any race-condition which can result in undesired pods getting scheduled to undesired nodes while adding a new node/kubelet to a cluster.
  - Relevant upstream issue: kubernetes/kubernetes#15108
- Relevant upstream issue: kubernetes/kubernetes#35166
  - It is recommended to not taint nodes by default. However, I'd definitely like to do that so our users in their early stages are saved from confusions like "I use Spot Fleet to power my worker nodes and why my pod(s) frequently goes down?"
Experimental.LoadBalancer.Names is not taken into account #167
- Is it safe to add spot instances launched by a spot fleet automatically via aws-cli?
  - How to clean up after the spot instance terminates?
Create Name tags on a spot instance with the same value as workers in a main cluster #167
Integration tests with CloudFormation (Requires an AWS account and runs stack validations in many combinations of cluster configurations in cluster.yamls)

kube-aws update message formatting

After running kube-aws update, the message is incorrectly formatted:

λ kube-aws update
Update stack: {
  StackId: "arn:aws:cloudformation:us-east-1:1111111111:stack/my-kube-stack/8e136f00-b9bb-11e6-a757-50fae987c09a"
}
Success! Your AWS resources are being updated:
%!(EXTRA string=Cluster Name:		my-kube-stack
Controller DNS Name:	my-kube-stack-ElbAPISe-J31VEJ0AB17A-2048286780.us-east-1.elb.amazonaws.com
)

Creating new cluster in brand new VPC requires "DNS Hostnames" setting to be turned on

I believe that this is needed in the documentation/quickstart. It was a small stumble for me. Without it Etcd hostname couldn't be resolved by the controller.

http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html#vpc-dns-hostnames

Dec 05 14:08:55 ip-10-0-0-231.us-west-2.compute.internal etcdctl[8593]: Error: client: etcd cluster is unavailable or misconfigured
Dec 05 14:08:55 ip-10-0-0-231.us-west-2.compute.internal etcdctl[8593]: error #0: dial tcp: lookup ip-10-0-0-4.us-west-2.compute.internal: no such host

EBS volumes using old-generation standard magnetic disks for controller and etcd

This issue was already explained once in coreos/coreos-kubernetes#554 and was fixed in coreos/coreos-kubernetes#559 but is missing for controller now in the stack-template.

Also the etcd (maybe on purpose?!) cluster is even not considered to be configurable at all.