openshift / installer

Install an OpenShift 4.x cluster

Home Page: https://try.openshift.com

License: Apache License 2.0

Go 86.88% Shell 2.62% Dockerfile 0.03% Makefile 0.08% HCL 8.56% Smarty 0.02% Python 1.09% PowerShell 0.72%

installer's Introduction

OpenShift Installer

Supported Platforms

Quick Start

First, install all build dependencies.

Clone this repository. Then build the openshift-install binary with:

hack/build.sh

This will create bin/openshift-install. This binary can then be invoked to create an OpenShift cluster, like so:

bin/openshift-install create cluster

The installer will show a series of prompts for user-specific information and use reasonable defaults for everything else. In non-interactive contexts, prompts can be bypassed by providing an install-config.yaml.
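For example, a non-interactive run can stage a prepared install-config.yaml in an asset directory and point the installer at it with --dir (a sketch; the directory name is arbitrary):

mkdir ocp-assets
cp install-config.yaml ocp-assets/
bin/openshift-install create cluster --dir ocp-assets

With the config file already present in the asset directory, the interactive prompts should be skipped.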

If you have trouble, refer to the troubleshooting guide.

Connect to the cluster

Details for connecting to your new cluster are printed by the openshift-install binary upon completion, and are also available in the .openshift_install.log file.

Example output:

INFO Waiting 10m0s for the openshift-console route to be created...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run
    export KUBECONFIG=/path/to/installer/auth/kubeconfig
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.${CLUSTER_NAME}.${BASE_DOMAIN}:6443
INFO Login to the console with user: kubeadmin, password: 5char-5char-5char-5char
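Once the kubeconfig has been exported as shown above, access can be sanity-checked with the oc client (assuming oc is installed locally):

export KUBECONFIG=/path/to/installer/auth/kubeconfig
oc whoami
oc get nodes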

Cleanup

Destroy the cluster and release associated resources with:

openshift-install destroy cluster

Note that you will almost certainly want to clean up the installer state files as well, including auth/, terraform.tfstate, etc. The best practice is to always pass the --dir argument to create and destroy. If you want to reinstall from scratch, rm -rf the asset directory beforehand.
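A minimal create/destroy cycle that keeps all state in one disposable directory might look like this (a sketch; ocp-assets is an arbitrary name):

bin/openshift-install create cluster --dir ocp-assets
bin/openshift-install destroy cluster --dir ocp-assets
rm -rf ocp-assets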

installer's People

Contributors

abhinavdahiya, alexsomesan, andfasano, barbacbd, bfournie, cpanato, crawford, enxebre, fedosin, ggreer, jcpowermac, jstuever, kans, kyoto, mandre, mxinden, openshift-ci[bot], openshift-merge-bot[bot], openshift-merge-robot, patrickdillon, pawanpinjarkar, pierreprinetti, quentin-m, r4f4, rna-afk, rwsu, squat, staebler, wking, zaneb


installer's Issues

Controller Manager pod exit loop

Version

$ openshift-install version
bin/openshift-install v0.1.0-73-g25c367c1106670027cd6d3937004d63fd222c635
Terraform v0.11.8

Platform (aws|libvirt|openshift):

libvirt

What happened?

Cluster spins up. controller-manager is in a crash loop with new pods being created for each new instance.

I1009 20:16:53.177131       1 controller_manager.go:37] Starting controllers on 0.0.0.0:8443 (v4.0.0-alpha.0+02f888e-285)
I1009 20:16:53.178288       1 controller_manager.go:48] DeploymentConfig controller using images from ""
I1009 20:16:53.178298       1 controller_manager.go:54] Build controller using images from ""
I1009 20:16:53.178352       1 standalone_apiserver.go:101] Started health checks at 0.0.0.0:8443
I1009 20:16:53.179275       1 leaderelection.go:185] attempting to acquire leader lease  kube-system/openshift-master-controllers...
I1009 20:16:53.192729       1 leaderelection.go:253] lock is held by controller-manager-f5hgz and has not yet expired
I1009 20:16:53.192742       1 leaderelection.go:190] failed to acquire lease kube-system/openshift-master-controllers

What you expected to happen?

The controller manager should stay up.

How to reproduce it (as minimally and precisely as possible)?

$ bin/openshift-install cluster

Anything else we need to know?

I am attempting to get further logs of the exiting process.
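A quick way to capture logs from the crash-looping pods, assuming they run in kube-system as the leader-election lease above suggests and that the installer's kubeconfig is in use:

oc -n kube-system get pods | grep controller-manager
oc -n kube-system logs <pod-name> --previous

The --previous flag returns the logs of the last terminated container instance, which is usually where the exit reason shows up.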

References

show more cluster info after installer finish

As of now, when the installer finishes its job, it only prints how many resources were applied; it does not tell me anything else...

Some suggestions for what we should print out after installer finish:

  • A path to the kubeconfig and how to use it (something users can copy & paste)
  • List of public hostnames for masters perhaps (including the core@ username for login). Right now, I have to either go to the EC2 console to figure out what the public hostnames for my masters and bootstrap node are, or parse the terraform vars...
  • An API URL to use to login with the user I specified? (oc login https://<hostname>:6443/)
  • The help (?) button in setup should probably also show the "hidden" environment variable you can set to not be asked the question on next installer run ;-)

It would also be a good idea to warn when the public SSH key variable hasn't been set, since you won't be able to SSH into your cluster; probably ask users if they are OK with that :-)

Also, I noticed that the cluster is not fully up when the installer completes, which might not be a great experience... If we don't want to poll API server readiness in the installer, we should probably tell users to wait for the cluster to fully provision?

dns names for etcd urls cannot be resolved

The hostname plus dns suffix in endpoints is insufficient to get a working etcd connection.

After merging the etcd endpoints and wiring the operators to consume the information (plus openshift/cluster-openshift-apiserver-operator#33), we still fail to access etcd. Debugging the daemonset, oc -n openshift-apiserver debug daemonset.apps/apiserver, will show

sh-4.2# nslookup test1-etcd-0.tt.testing
;; reply from unexpected source: 10.2.0.6#53, expected 10.3.0.10#53

It looks like we'll either be blocked until DNS/SDN functions properly or until the endpoint can provide a working IP address.
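For anyone reproducing this, the resolver the pod uses and the record itself can be checked from the same debug pod (a sketch; it assumes the cluster DNS service is expected at 10.3.0.10, as in the output above):

cat /etc/resolv.conf
nslookup test1-etcd-0.tt.testing 10.3.0.10

The "reply from unexpected source" message says the answer came back from 10.2.0.6 rather than the 10.3.0.10 service IP that was queried, which points at the SDN/DNS plumbing rather than at the records themselves.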

@abhinavdahiya @aaronlevy @mfojtik @smarterclayton @sttts @ironcladlou

libvirt installer leaves lots of artifacts when failing

Error: Error applying plan:

3 error(s) occurred:

  • libvirt_volume.master: 1 error(s) occurred:

  • libvirt_volume.master: storage volume 'master0' already exists

  • libvirt_network.tectonic_net: 1 error(s) occurred:

  • libvirt_network.tectonic_net: Error updating autostart for network: virError(Code=55, Domain=19, Message='Requested operation is not valid: cannot set autostart for transient network')

  • module.bootstrap.libvirt_volume.bootstrap: 1 error(s) occurred:

  • libvirt_volume.bootstrap: storage volume 'bootstrap' already exists

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

I clean up some stuff, then re-run, then it complains about other stuff already existing, rinse, repeat.
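Until destroy handles this, the leftovers named in the errors can be removed by hand with virsh (a sketch; the volume names come from the output above, the pool is assumed to be default, and the network is assumed to be named tectonic as in other reports on this page):

virsh -c qemu:///system vol-delete master0 --pool default
virsh -c qemu:///system vol-delete bootstrap --pool default
virsh -c qemu:///system net-destroy tectonic
virsh -c qemu:///system net-undefine tectonic

net-undefine is only needed if the network was defined persistently; the autostart error above suggests it ended up transient, in which case net-destroy alone removes it.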

InstallConfig Cannot Be Embedded in Kube Types Due to missing DeepCopyInto Methods

I've been working on re-using InstallConfig within Hive's ClusterDeployment CRD and run into some problems that could use discussion:

  • Can't be used within a kube type due to missing kube generated methods (DeepCopyInto). Will need to be maintained as a full Kubernetes object for this to work. At present I'm copying your code, and getting codegen working in our code base, with a few tweaks that are required.
  • Use of IPNet for CIDRs doesn't work in Kube. Need a custom type or string I believe.
  • InstallConfig carries ObjectMeta which can trigger some things we don't necessarily want in our repo. (this one may not be a big deal)

Should the canonical source of the cluster config type live in the Installer? Would the config ever contain options that the installer ignored or did not act on? (and perhaps Hive or other actors would?) (I think the answer is probably "no" here)

How can we share defaulting and validation code? Ideally we want to inform an API caller their config is not valid without relying on the Installer to fail in a pod we're running. Quicker feedback will be important and ideally we should all be sharing the same code to do so.

Is InstallConfig appropriately named? Per last arch call we agreed it's not just an install time thing. Would ClusterConfig be more accurate?

Some options to clean this up:

(1) Keep InstallConfig in the installer. Hook up Kubernetes code gen and commit to all the guarantees required for the type going forward: treat it as an externally facing API object that must be greater than or equal to the serialization of any embedding format.

(2) Hand it over to Hive, letting us maintain the Kube generation and API contract, and vendor it into your repo. (I would propose a breakout of something like ClusterConfig in the Hive repo, with InstallConfig remaining in your repo and having fields ClusterConfig and Admin (which doesn't map nicely to the kube secrets we'd use for this info).)

(3) Place the ClusterConfig definition directly into the core OpenShift API server. I don't know if this would fly, but I kind of like the idea: it makes it very official, we get the API server for free along with better options for versioning, validation, and defaulting than CRDs, and it would signify how hard it is, or should be, to change.

(4) Spin into a separate project and repo we all vendor.

Open to other suggestions. Please let me know what you think.

TLS cert/key data duplicated in ignition configs

This is a dev/code optimization issue. Not a functionality bug.

The code for ignition config takes the cert/key pairs and dumps them into files. Here:
https://github.com/openshift/installer/blob/master/pkg/asset/ignition/bootstrap.go#L241-L246

Also, the operators manifest assets will take those key/certs and put them into a secret. Here:
https://github.com/openshift/installer/blob/master/modules/bootkube/resources/manifests/kube-apiserver-secret.yaml#L8-L23

The secret should likely exist. Not sure if we need all those files through the ignition target.

/cc @crawford

deployment on libvirt hanging forever

Versions

Tectonic version (release or commit hash):

41077a9d818756c16ffbf073499a7ccef0a79c60

Terraform version (terraform version):

Terraform v0.11.8

Platform (aws|libvirt):

libvirt

What happened?

deployed with libvirt but deployment is not finishing

What you expected to happen?

deployment successful

How to reproduce it (as minimally and precisely as possible)?

deploy with libvirt

Anything else we need to know?

I deployed with a customized libvirt URI, qemu+ssh://[email protected]/system, and used Red Hat CoreOS release 3.10 for the image. That worked well, i.e. the bootstrap and master0 nodes are there with the appropriate IPs, but the deployment is stuck on the following:

-- Logs begin at Tue 2018-10-02 19:58:10 UTC. --
Oct 02 20:19:14 testk-bootstrap bootkube.sh[21652]: Found Machine Config Operator's image: registry.svc.ci.openshift.org/openshift/origin-v4.0-20181002182152@sha256:d2c602936193f81a808d477b065c01dab18d15914b2ec71c380f310d4a44b441
Oct 02 20:19:14 testk-bootstrap bootkube.sh[21652]: Rendering MCO manifests...
Oct 02 20:19:15 testk-bootstrap tectonic.sh[888]: kubectl --namespace kube-system get pods --output custom-columns=STATUS:.status.phase,NAME:.metadata.name --no-headers=true failed. Retrying in 5 seconds...
Oct 02 20:19:15 testk-bootstrap bootkube.sh[21652]: Trying to pull registry.svc.ci.openshift.org/openshift/origin-v4.0-20181002182152@sha256:d2c602936193f81a808d477b065c01dab18d15914b2ec71c380f310d4a44b441...Failed
Oct 02 20:19:15 testk-bootstrap bootkube.sh[21652]: unable to find image: unable to pull registry.svc.ci.openshift.org/openshift/origin-v4.0-20181002182152@sha256:d2c602936193f81a808d477b065c01dab18d15914b2ec71c380f310d4a44b441
Oct 02 20:19:15 testk-bootstrap systemd[1]: bootkube.service: main process exited, code=exited, status=125/n/a
Oct 02 20:19:15 testk-bootstrap systemd[1]: Unit bootkube.service entered failed state.
Oct 02 20:19:15 testk-bootstrap systemd[1]: bootkube.service failed.
Oct 02 20:19:20 testk-bootstrap tectonic.sh[888]: kubectl --namespace kube-system get pods --output custom-columns=STATUS:.status.phase,NAME:.metadata.name --no-headers=true failed. Retrying in 5 seconds...
Oct 02 20:19:21 testk-bootstrap systemd[1]: bootkube.service holdoff time over, scheduling restart.
Oct 02 20:19:24 testk-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Oct 02 20:19:24 testk-bootstrap systemd[1]: Starting Bootstrap a Kubernetes cluster...
Oct 02 20:19:25 testk-bootstrap bootkube.sh[21833]: Found Machine Config Operator's image: registry.svc.ci.openshift.org/openshift/origin-v4.0-20181002182152@sha256:d2c602936193f81a808d477b065c01dab18d15914b2ec71c380f310d4a44b441
Oct 02 20:19:25 testk-bootstrap bootkube.sh[21833]: Rendering MCO manifests...
Oct 02 20:19:26 testk-bootstrap tectonic.sh[888]: kubectl --namespace kube-system get pods --output custom-columns=STATUS:.status.phase,NAME:.metadata.name --no-headers=true failed. Retrying in 5 seconds...
Oct 02 20:19:26 testk-bootstrap bootkube.sh[21833]: Trying to pull registry.svc.ci.openshift.org/openshift/origin-v4.0-20181002182152@sha256:d2c602936193f81a808d477b065c01dab18d15914b2ec71c380f310d4a44b441...Failed
Oct 02 20:19:26 testk-bootstrap bootkube.sh[21833]: unable to find image: unable to pull registry.svc.ci.openshift.org/openshift/origin-v4.0-20181002182152@sha256:d2c602936193f81a808d477b065c01dab18d15914b2ec71c380f310d4a44b441
Oct 02 20:19:26 testk-bootstrap systemd[1]: bootkube.service: main process exited, code=exited, status=125/n/a
Oct 02 20:19:26 testk-bootstrap systemd[1]: Unit bootkube.service entered failed state.
Oct 02 20:19:26 testk-bootstrap systemd[1]: bootkube.service failed.
Oct 02 20:19:31 testk-bootstrap tectonic.sh[888]: kubectl --namespace kube-system get pods --output custom-columns=STATUS:.status.phase,NAME:.metadata.name --no-headers=true failed. Retrying in 5 seconds...
Oct 02 20:19:31 testk-bootstrap systemd[1]: bootkube.service holdoff time over, scheduling restar



References


the shipped terraform is not used when the installer is called via PATH lookup

I ran into this issue trying to destroy a libvirt cluster:

$ tectonic destroy --dir=test1

Error:
Terraform doesn't allow running any operations against a state
that was written by a future Terraform version. The state is
reporting it is written by Terraform '0.11.7'

Please run at least that version of Terraform to continue.

Between create and destroy, I have installed terraform to my PATH.
I suspect there is something odd in the logic that finds the terraform binary, as removing terraform from PATH makes the error go away and the destroy succeeds.
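A quick way to see which terraform binaries are in play while debugging this (nothing here is specific to the installer):

which -a terraform
terraform version

If a system-wide terraform shadows whatever the installer ships or expects, temporarily removing it from PATH, as described above, is the workaround.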

Add OWNERS file

There should be an OWNERS file (like this one) to allow prow to manage who is allowed to review and/or approve PRs.
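A minimal OWNERS file in the prow format would look something like the sketch below (the usernames are placeholders, not a proposal for who the owners should be):

cat > OWNERS <<EOF
approvers:
  - alice
reviewers:
  - bob
EOF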

Documentation/dev/build: Stale tree (references tectonic-installer/)

The build docs have:

tectonic
├── config.tf
├── examples
├── modules
├── steps
└── tectonic-installer
    ├── darwin
    │   ├── tectonic
    │   ├── terraform
    │   └── terraform-provider-matchbox
    └── linux
        ├── tectonic
        ├── terraform
        └── terraform-provider-matchbox

But that seems to have been adjusted in e22e31f (coreos/tectonic-installer#3142), which renamed tectonic-installer to installer:

$ git show e22e31f1083 | grep package_dir
-    package_dir = "tectonic-installer/darwin",
-    package_dir = "tectonic-installer/linux",
-    package_dir = "tectonic-installer/darwin",
-    package_dir = "tectonic-installer/linux",
-    package_dir = "tectonic-installer/darwin",
+    package_dir = "tectonic-%s" % TECTONIC_VERSION,
-    package_dir = "tectonic-installer/linux",
-    package_dir = "examples",
+    package_dir = "installer",

Somebody who understands the current structure should update Documentation/dev/build.md to match.

install fails due to route53 issue

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
data.terraform_remote_state.assets: Refreshing state...
data.aws_availability_zones.azs: Refreshing state...
data.aws_region.current: Refreshing state...
data.aws_route53_zone.tectonic: Refreshing state...

Error: Error refreshing state: 1 error(s) occurred:

* module.dns.data.aws_route53_zone.tectonic: 1 error(s) occurred:

* module.dns.data.aws_route53_zone.tectonic: data.aws_route53_zone.tectonic: no matching Route53Zone found


I provided no route53 zone information as it was listed as optional.

    # (optional) If set, the given Route53 zone ID will be used as the internal (private) zone.
    # This zone will be used to create etcd DNS records as well as internal API and internal Ingress records.
    # If set, no additional private zone will be created.
    #
    # Example: `"Z1ILINNUJGTAO1"`
    # privateZone:
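Since the zone lookup failed even though the option is marked optional, a quick way to check whether a public hosted zone actually exists for the chosen base domain is the AWS CLI (a sketch; example.com stands in for the base domain from the config):

aws route53 list-hosted-zones-by-name --dns-name example.com

If nothing matching the base domain comes back, the data.aws_route53_zone lookup above has nothing to find, which would explain the error.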

Feature: Proxy container images on the bootstrap node?

Especially with #281 on the table, it would be great if we had a way to mirror/cache images locally while bringing up the masters, and then a way to stuff those images into image streams once the cluster registry was up. The bootstrap node is reachable by the masters as they're coming up, so it seems like a good place to put this. Thoughts?

unsupported protocol error raised calling init

Since PR #280 I get:

$ tectonic init --config ~/tectonic.libvirt.yaml 
FATA[0000] Get : unsupported protocol scheme ""

Debugging a little I see the following:

  DEBU[0034] nexting                                       layer=debugger
> github.com/openshift/installer/pkg/types/config/libvirt.(*Libvirt).UseCachedImage() /home/aim/go-projects/openshift-installer/src/github.com/openshift/installer/pkg/types/config/libvirt/cache.go:48 (PC: 0xb611fc)
    43:	
    44:		cache := diskcache.New(cacheDir)
    45:		transport := httpcache.NewTransport(cache)
    46:		resp, err := transport.Client().Get(libvirt.Image)
    47:		if err != nil {
=>  48:			return err
    49:		}
    50:		if resp.StatusCode != 200 {
    51:			return fmt.Errorf("%s while getting %s", resp.Status, libvirt.Image)
    52:		}
    53:		defer resp.Body.Close()
(dlv) p err
error(*net/url.Error) *{
	Op: "Get",
	URL: "",
	Err: error(*net/http.badStringError) *{
		what: "unsupported protocol scheme",
		str: "",},}
(dlv) p cacheDir
"/home/aim/.cache/openshift-install/libvirt"

Default for aws region causes AMI lookup to fail

Versions

0284d4e

Platform (aws|libvirt):

aws

What happened?

Installer tried to use default region us-east-1 (N. Virginia) in URL.

✔ ~/go/src/github.com/openshift/installer [master ↓·15|✔] 
10:58 $ ./bin/openshift-install cluster
? Email Address [email protected]
???? The password of the cluster administrator. This will be used to log in to the console.
? Password ********
? SSH Public Key /home/auser/.ssh/id_rsa.pub
? Base Domain crtest.openshift.com
? Cluster Name rhcos-cr3
? Pull Secret {  "auths": {    ....   } }
? Platform aws
? Region us-east-1 (N. Virginia)
FATA[0326] failed to generate asset: failed to generate asset "Machine API Operator": failed to lookup RHCOS AMI: InvalidEndpointURL: invalid endpoint uri
caused by: parse https://ec2.us-east-1 (N. Virginia).amazonaws.com/: invalid character " " in host name 

What you expected to happen?

URL should have only contained us-east-1.

How to reproduce it (as minimally and precisely as possible)?

Follow prompts for default/interactive aws cluster creation.

libvirt install: use fqdn for libvirt instance names

[root@mguginop50 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 1     master0                        running
 2     bootstrap                      running
 3     worker-sdb9x                   running

This output should match the resolvable FQDNs of the cluster. That would make cleaning up / managing easier.

libvirt: can't create workers with non-default storage path

If your libvirt default storage pool is not /var/lib/libvirt/images then the libvirt-machine-controller fails to create the workers:

$ oc logs pod/clusterapi-controllers-85f6bfd9d5-6rbb8 -n openshift-cluster-api -c libvirt-machine-controller
...
I0921 18:24:47.612590       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:47.612648       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:47.614725       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base
I0921 18:24:48.016159       1 controller.go:79] Running reconcile Machine for worker-2fp6s
I0921 18:24:48.023462       1 actuator.go:70] Checking if machine worker-2fp6s for cluster dev exists.
I0921 18:24:48.023638       1 logs.go:41] [DEBUG] Check if a domain exists
I0921 18:24:48.029976       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.030852       1 controller.go:123] reconciling machine object worker-2fp6s triggers idempotent create.
I0921 18:24:48.033465       1 actuator.go:46] Creating machine "worker-2fp6s" for cluster "dev".
I0921 18:24:48.036047       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.036107       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-2fp6s for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:48.038373       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base

I worked around it with a bind mount. If it's not configurable then it would be handy if it was.
(also, there's a typo "Coud" in the error message)
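For what it's worth, the path the controller assumes can be compared against the actual default pool configuration with virsh (a sketch):

virsh pool-dumpxml default | grep path

If that prints something other than /var/lib/libvirt/images, you are in the situation described above and the bind-mount workaround (or making the path configurable) applies.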

Drop changes.md and start a new change log after our next release?

A lot of changes have been made since the last time changes.md was touched:

$ git log -1 changes.md | cat
commit 8f1a763ee32512980bd0613b2c0d54f7a227daed
Author: sudhaponnaganti <[email protected]>
Date:   Tue Sep 12 12:29:05 2017 -0700

    Update changes.md
$ git log --oneline --no-merges 8f1a763e..HEAD | wc -l
1251

Given the Tectonic -> OpenShift rebranding and significant overhauling, can we scrap changes.md and start a fresh change log when we cut our first OpenShift-branded release? I'm certainly not excited about reviewing that many commits to work up detailed change notes ;).

On the other hand, it may be useful to have someone with a good understanding of both the Tectonic 1.7.3 installer and the new OpenShift installer who can write migration notes to help previous Tectonic-installer users across the transition. In that case, maybe we'd keep the existing changes.md.

aws example is unclear

Trying to fill in the aws.yaml per the instructions, there's a lack of help for figuring out what I'm supposed to do with it.

admin:
  email: "[email protected]"
  password: "verysecure"
  sshKey: "ssh-ed25519 AAAA..."

What are these values? SSH credentials for AWS? A cluster-admin account that's going to be created? Something else entirely?

InstallConfig: Add machine-pool defaults

Spun off from this comment, and with a positive nod from @abhinavdahiya here, we should add:

  // AWS is the default configuration used when installing on AWS for machine pools which do not define their own platform configuration.
  DefaultMachinePlatform *AWSMachinePoolPlatform `json:"defaultMachinePlatform,omitempty"`

to AWSPlatform (and similarly for LibvirtPlatform) to allow configurations to avoid repeating the same machine pool platform information twice when they want the same config for both pools.

Minimal installer role policy definition required

Versions

Tectonic version (release or commit hash):

78e6f76292331a7fadc57ce3f362e558890fde84

Terraform version (terraform version):

Terraform v0.11.7

Platform (aws|azure|openstack|metal|vmware):

aws

What happened?

Attempting a cluster install, I encountered missing permissions in the minimal tectonic-installer role defined at: https://coreos.com/tectonic/docs/latest/files/aws-policy.json

.....
module.vpc.aws_route.to_nat_gw[0]: Creation complete after 1s (ID: r-rtb-0e5f8e681080289494)
module.vpc.aws_route.to_nat_gw[1]: Creation complete after 1s (ID: r-rtb-464190201080289494)

Error: Error applying plan:

4 error(s) occurred:

* module.vpc.aws_elb.api_external: 1 error(s) occurred:

* aws_elb.api_external: AccessDenied: User: arn:aws:sts::870454162620:assumed-role/tectonic-installer/some_session_name is not authorized to perform: iam:CreateServiceLinkedRole on resource: arn:aws:iam::870454162620:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing
	status code: 403, request id: 29e31655-9a66-11e8-b637-51e763ff4feb
* module.vpc.aws_elb.console: 1 error(s) occurred:

* aws_elb.console: AccessDenied: User: arn:aws:sts::870454162620:assumed-role/tectonic-installer/some_session_name is not authorized to perform: iam:CreateServiceLinkedRole on resource: arn:aws:iam::870454162620:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing
	status code: 403, request id: 293152be-9a66-11e8-becb-479fbfd4bfac
* module.vpc.aws_elb.api_internal: 1 error(s) occurred:

* aws_elb.api_internal: AccessDenied: User: arn:aws:sts::870454162620:assumed-role/tectonic-installer/some_session_name is not authorized to perform: iam:CreateServiceLinkedRole on resource: arn:aws:iam::870454162620:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing
	status code: 403, request id: 29e93174-9a66-11e8-a014-91fa91cc8907
* module.vpc.aws_elb.tnc: 1 error(s) occurred:

* aws_elb.tnc: AccessDenied: User: arn:aws:sts::870454162620:assumed-role/tectonic-installer/some_session_name is not authorized to perform: iam:CreateServiceLinkedRole on resource: arn:aws:iam::870454162620:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing
	status code: 403, request id: 2932b24f-9a66-11e8-becb-479fbfd4bfac

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.


FATA[0172] Failed to run Terraform: exit status 1       

What you expected to happen?

A minimal installer role should be defined in GitHub, and the setup process for that role should be in the README.

How to reproduce it (as minimally and precisely as possible)?

- Setup tectonic-installer role with: https://coreos.com/tectonic/docs/latest/files/aws-policy.json
- Run new cluster install. 

References

https://coreos.com/tectonic/docs/latest/install/aws/requirements.html

examples directory still used though it's removed from the repo

Versions

06a4afb

What happened?

The following still reference examples/ which is no longer available:

What you expected to happen?

If examples/ is needed it should be in the repo (or a reference on how to get it should be available). Otherwise, the references to examples/ should be removed.

How to reproduce it (as minimally and precisely as possible)?

Look at the files :-D

Errors during init provide insufficient guidance when missing required fields

Versions

Tectonic version (release or commit hash):

78e6f76292331a7fadc57ce3f362e558890fde84

Terraform version (terraform version):

Terraform v0.11.7

Platform (aws|azure|openstack|metal|vmware):

aws

What happened?

As a new user to tectonic init, I tried populating the aws config example. I was clearly missing some fields in the file, but the error output is not always indicative of the missing field:

15:47 $ tectonic init --config=examples/tectonic.aws.yaml
ERRO[0000] Found 1 error in the cluster definition:     
ERRO[0000] error 1: open : no such file or directory    
FATA[0000] found 1 cluster definition error   

What you expected to happen?

Error should identify which field is missing.

How to reproduce it (as minimally and precisely as possible)?

Use an aws config file that is missing required fields.

InstallConfig machine pool typing and validation

Spun off from this thread, @dgoodwin expressed concerns about Machines corner cases. What if no pool named master is configured? What if multiple pools named master are configured? What if a pool named somethingTheInstallerDoesNotUnderstand is configured? I'd floated:

Master MachinePool `json:"master"`

Worker MachinePool `json:"worker"`

but @dgoodwin pointed out compat issues between that approach and ClusterDeployment information.

@abhinavdahiya suggested we leave Machines alone, but we don't currently have any validation for Machines, and I expect we'll want to have some at some point.

@dgoodwin suggested leaving Machines as it stands but adding a typed variable for the name with constants for the master and worker names (he suggested control-plane and compute for the names) with verification checking for exactly one control-plane and at least one compute. He didn't give an opinion on the somethingTheInstallerDoesNotUnderstand case.

I'm not clear enough on the trade-offs to have much of an opinion here, but think we should sort out how we want verification to work here before InstallConfig gets too popular and migration becomes difficult ;).

Documentation/dev/libvirt-howto.md: URL for RHCOS image 404's

Following the instructions in Documentation/dev/libvirt-howto.md for the first time, the URL for the operating system image 404's:

$ curl http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/latest/rhcos.qcow2.qemu.gz
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.8.0</center>
</body>
</html>
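Until the docs are fixed, the URL can be checked directly, and a locally downloaded image substituted; the sketch below uses the OPENSHIFT_INSTALL_LIBVIRT_IMAGE override that appears in other reports on this page, with a placeholder path:

curl -I http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/latest/rhcos.qcow2.qemu.gz
export OPENSHIFT_INSTALL_LIBVIRT_IMAGE='file:///path/to/rhcos.qcow2'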

Image resize support for libvirt/development use cases

When using qcow2's and libvirt one of the first steps is to resize the image. A temporary workaround was proposed (see openshift/os#228) to keep people from needing to do this manually for development purposes.

A cleaner approach for the future would be to have the installer be able to resize the image to what is requested (maybe with a minimum size limit).
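For reference, the manual resize that the workaround automates is a one-liner with qemu-img (a sketch; the +8G growth is arbitrary):

qemu-img info rhcos.qcow2
qemu-img resize rhcos.qcow2 +8G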

examples/*.yaml contain non-YAML example comments

For example, we have:

  # (optional) Extra AWS tags to be applied to created resources.
  #
  # Example: `{ "key" = "value", "foo" = "bar" }`
  # extraTags:

But that example should use : instead of = (openshift/release#1209). I expect this is because the examples were previously generated from Terraform configurations (using terraform-examples until d61abd4, coreos/tectonic-installer#3137). With the coming asset-graph approach replacing (I think) these config files, it may not be worth cleaning this up. But I thought I'd file an issue to raise awareness until we get around to dropping those examples.

openshift-install may be ignoring region chosen for install config

Testing openshift-install in a container this morning, I started getting an error like:

bash-4.2# openshift-install cluster
? Email Address [email protected]
? Password [? for help] ********
? SSH Public Key <none>
? Base Domain REDACTED
? Cluster Name dgoodwin1
? Pull Secret REDACTED
? Platform aws
? Region us-east-1 (N. Virginia)
FATA[0046] failed to generate asset: failed to determine default AMI: MissingRegion: could not find region configuration 

I made a guess that it was ignoring what I selected and instead looking for ~/.aws/config which contains a default region to use. When I ran my container with "-v /home/dgoodwin/.aws/config:/home/user/.aws/config:Z" this particular error went away, leading me to believe it may not properly be loading the region from the install config that gets generated.

Guess would be that Terraform is not loading AWS creds the same way as the installer.
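A possible diagnostic while this is sorted out, assuming only environment variables are available in the container, is to set the region explicitly and see which steps pick it up:

export AWS_REGION=us-east-1          # commonly read by the Go SDK
export AWS_DEFAULT_REGION=us-east-1  # read by the AWS CLI and Terraform's AWS provider

Whether each component honors these variables is part of what this issue is probing, so treat this as a diagnostic rather than a fix.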

Unpacking tectonic-dev.tar.gz generates "implausibly old time stamp" warnings

$ git describe --always
ae41b0a
$ bazel build tarball
INFO: Analysed target //:tarball (0 packages loaded).
INFO: Found 1 target...
Target //:tectonic-dev up-to-date:
  bazel-bin/tectonic-dev.tar.gz
INFO: Elapsed time: 0.163s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
$ tar -xzf bazel-bin/tectonic-dev.tar.gz 2>&1 | head -n3
tar: ./tectonic-dev/config.tf: implausibly old time stamp 1969-12-31 16:00:00
tar: ./tectonic-dev/modules/aws/etcd/ignition.tf: implausibly old time stamp 1969-12-31 16:00:00
tar: ./tectonic-dev/modules/aws/etcd/nodes.tf: implausibly old time stamp 1969-12-31 16:00:00
$ tar --version | head -n1
tar (GNU tar) 1.26

Those "implausibly old time stamp" warnings are probably so we get reproducible tarballs (without drifting timestamps) when we don't actually care about the timestamps on these files. In order to avoid distracting people running tar, do we want to change our suggestion to:

tar -xz --warning=no-timestamp -f bazel-bin/tectonic-dev.tar.gz

That would make life less noisy for folks running the command, but a bit more noisy for folks reading the README. Thoughts?

non-human readable fields in marshalled InstallConfig

The current pkg/types.InstallConfig's networking section marshals to

    networking:
      podCIDR:
        IP: 10.2.0.0
        Mask: //8AAA==
      serviceCIDR:
        IP: 10.3.0.0
        Mask: //8AAA==
      type: canal

This is not human readable or editable, which the InstallConfig needs to be.
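For comparison, a human-editable rendering of the same values, using CIDR strings the way the example config elsewhere on this page does, would look like:

    networking:
      podCIDR: 10.2.0.0/16
      serviceCIDR: 10.3.0.0/16
      type: canal

(//8AAA== is the base64 encoding of ffff0000, i.e. a /16 mask, so the two renderings carry the same data.)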

Kubelet can't start API server if it needs a fresh certificate from API server

In the local dev case, one may only have provisioned a single master. If one restarts that master, then on restart the kubelet will fail like so if the certificate has expired:

Aug 24 15:41:58 test1-master-0 systemd[1]: Started Kubernetes Kubelet.
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --rotate-certificates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluste
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --allow-privileged has been deprecated, will be removed in a future version
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubele
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kube
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069056   19185 server.go:418] Version: v1.11.0+d4cacc0
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069191   19185 server.go:496] acquiring file lock on "/var/run/lock/kubelet.lock"
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069220   19185 server.go:501] watching for inotify events for: /var/run/lock/kubelet.lock
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069373   19185 plugins.go:97] No cloud provider specified.
Aug 24 15:41:59 test1-master-0 docker[19162]: E0824 15:41:59.071501   19185 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2018-08-23 17:08:07 +0000 UTC
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.072551   19185 certificate_store.go:131] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Aug 24 15:41:59 test1-master-0 docker[19162]: F0824 15:41:59.093262   19185 server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://test1-api.mco.testing:6443/apis/certificates.k8s.io/v1beta1/certifica
Aug 24 15:41:59 test1-master-0 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Aug 24 15:41:59 test1-master-0 systemd[1]: Unit kubelet.service entered failed state.
Aug 24 15:41:59 test1-master-0 systemd[1]: kubelet.service failed.

@aaronlevy says:

So what I'm thinking happened: we give master nodes a short-lived certificate (30 min IIRC) during the initial bootstrap. The intention is that this gets rotated out after that 30 minutes. However, if there was a single master and a reboot timed such that the rotation didn't happen (and now it's expired)… that puts us in a bit of a pickle.
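A quick way to confirm that an expired client certificate is what is blocking the kubelet, using the path from the log above:

openssl x509 -noout -enddate -subject -in /var/lib/kubelet/pki/kubelet-client-current.pem

If notAfter is in the past, the kubelet cannot authenticate to request a new certificate, which is the chicken-and-egg problem described here.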

Error "unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:" when trying to create a new-app

Versions

Tectonic version (release or commit hash):

ffca3e23530bad0efb2ebaef8d5d9cb147b6374f

Terraform version (terraform version):

Terraform v0.11.7

Platform (aws|azure|openstack|metal|vmware):

aws

What happened?

I created the AWS cluster using the readme.  It spun up a cluster.  I am trying to deploy this

https://github.com/sclorg/nodejs-ex

I then issued the following commands (with the appropriate kubeconfig):
oc new-project nodejs-echo  --display-name="nodejs" --description="Sample Node.js app"
oc project nodejs-echo

oc --config=kubeconfig create -f https://raw.githubusercontent.com/openshift/origin/master/examples/image-streams/image-streams-centos7.json

# This is where the trouble starts
oc --config=kubeconfig new-app https://github.com/sclorg/nodejs-ex -l name=myapp
--> Found image 1bb994c (39 hours old) in image stream "nodejs-echo/nodejs" under tag "10" for "nodejs"

    Node.js 10.9.0 
    -------------- 
    Node.js  available as docker container is a base platform for building and running various Node.js  applications and frameworks. Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.

    Tags: builder, nodejs, nodejs-10.9.0

    * The source repository appears to match: nodejs
    * A source build using source code from https://github.com/sclorg/nodejs-ex will be created
      * The resulting image will be pushed to image stream tag "nodejs-ex:latest"
      * Use 'start-build' to trigger a new build
    * This image will be deployed in deployment config "nodejs-ex"
    * Port 8080/tcp will be load balanced by service "nodejs-ex"
      * Other containers can access this service through the hostname "nodejs-ex"

error: Unable to to get list of available resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

What you expected to happen?

I would expect the app to build to completion

How to reproduce it (as minimally and precisely as possible)?

See above

Anything else we need to know?

Here is the config I used to create the cluster

admin:
  email: [email protected]
  password: xxxxx
  sshKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC7GW7qcnemBaZMw+8UNfgqwTvv6pPKftQtBZowfXfPRq5TWivRMnW/5SlW5eiLWrYsgAfnC/bREuBnE6lRmxeH2rKXb/WW1aLqNAe5XvC2xYaH9kXpQoaQWD6hbWL0mHfHowj1sWOrN25YETLbVJ3Wh4rLugIPIDKteB91HoZlv5N5Tao4yZO7YxMMScqDEo//Asdua+jCzCEuNZmSyqYZjuO8Ub20FP76Di2avR4qtrxgakIi2SxpwtCZLJndJCYI7COTSuF1WCv+uMjafLTaCeEtLw5GnoLCpb7nPqocgBFWKcJ58RkxX76JPOpdGLCLasvxS7/6pmDMyB/eO3vH
aws:
  region: us-west-1
baseDomain: crtest.openshift.com
ca: null
containerLinux:
  version: latest
etcd:
  nodePools:
  - etcd
iscsi: null
licensePath: xxxxx
master:
  nodePools:
  - master
name: rhcos-cr
networking:
  podCIDR: 10.2.0.0/16
  serviceCIDR: 10.3.0.0/16
nodePools:
- count: 3
  name: etcd
- count: 1
  name: master
- count: 3
  name: worker
platform: aws
pullSecretPath: xxxxx
worker:
  nodePools:
  - worker


References

N/A
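For the metrics.k8s.io error above, one way to narrow it down, assuming the metrics-server components live in kube-system as on other clusters from this installer, is to check the aggregated APIService and its backing pods:

oc --config=kubeconfig get apiservices v1beta1.metrics.k8s.io
oc --config=kubeconfig -n kube-system get pods | grep metrics-server

If the APIService reports unavailable, the API server cannot complete discovery, which is what new-app trips over.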

kube-addon-operator keeps crashing

Let me know if this isn't the right place for component issues.

Versions

# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● ostree://rhcos:openshift/3.10/x86_64/os
                   Version: 4.0.4357 (2018-08-15 18:29:40)
                    Commit: 75574ee3d0b788386b6fe715b2179b0425f98c46ea21724b1967be16b8faee4a
# rpm -qa | grep ^origin
origin-node-3.11.0-0.alpha.0.893.80abd58.x86_64
origin-clients-3.11.0-0.alpha.0.893.80abd58.x86_64
origin-hyperkube-3.11.0-0.alpha.0.893.80abd58.x86_64

Terraform version (terraform version):

$ terraform version
Terraform v0.11.7

Your version of Terraform is out of date! The latest version
is 0.11.8. You can update by downloading from www.terraform.io/downloads.html

Platform (aws|azure|openstack|metal|vmware):

libvirt

What happened?

The kube-addon-operator keeps crashing:

$ oc get -n tectonic-system pods
NAME                                         READY     STATUS    RESTARTS   AGE
directory-sync-d84d84d9f-frfgs               1/1       Running   0          2h
kube-addon-operator-6bdf5dcd67-zwgkb         1/1       Running   14         2h
tectonic-alm-operator-79b6996f74-rd77l       1/1       Running   0          2h
tectonic-channel-operator-f545c8db8-lt5kr    1/1       Running   0          2h
tectonic-clu-6b8d87785f-pnx9g                1/1       Running   0          2h
tectonic-node-agent-7cqxf                    1/1       Running   2          2h
tectonic-node-agent-tts6j                    1/1       Running   3          2h
tectonic-stats-emitter-d87f669fd-mxldv       2/2       Running   0          2h
tectonic-utility-operator-6c7b696f79-6mhlv   1/1       Running   0          2h
$ oc logs -n tectonic-system kube-addon-operator-6bdf5dcd67-zwgkb
I0823 19:25:00.088020       1 run.go:60] kube-addon-operator starting
I0823 19:25:00.287656       1 leaderelection.go:174] attempting to acquire leader lease...
E0823 19:25:00.493659       1 event.go:260] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-addon", GenerateName:"", Namespace:"tectonic-system", SelfLink:"
/api/v1/namespaces/tectonic-system/configmaps/kube-addon", UID:"5dd8de70-a6f4-11e8-8854-be841c320f7d", ResourceVersion:"24966", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63670639711, loc:(*time.Location)(0x197566
0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"kube-addon-operator-6bdf5dcd6
7-zwgkb\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2018-08-23T16:48:30Z\",\"renewTime\":\"2018-08-23T19:25:00Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]
string(nil), ClusterName:""}, Data:map[string]string(nil)}' due to: 'no kind is registered for the type v1.ConfigMap'. Will not report event: 'Normal' 'LeaderElection' 'kube-addon-operator-6bdf5dcd67-zwgkb became leader'
I0823 19:25:00.493747       1 leaderelection.go:184] successfully acquired lease tectonic-system/kube-addon
I0823 19:25:00.493781       1 run.go:99] started leading: running kube-addon-operator
I0823 19:25:01.587697       1 update.go:53] Upgrade triggered, req: 1, comp: 0
I0823 19:25:24.187116       1 update.go:78] Start upgrading
I0823 19:25:25.322595       1 component.go:79] Updating component ClusterRoleBinding/metrics-server:system:auth-delegator
I0823 19:25:26.089057       1 component.go:101] Finished update of component: ClusterRoleBinding/metrics-server:system:auth-delegator
I0823 19:25:27.096713       1 component.go:79] Updating component RoleBinding/kube-system/metrics-server-auth-reader
I0823 19:25:27.454396       1 component.go:101] Finished update of component: RoleBinding/kube-system/metrics-server-auth-reader
I0823 19:25:28.462451       1 component.go:79] Updating component APIService/v1beta1.metrics.k8s.io
I0823 19:25:28.789396       1 component.go:101] Finished update of component: APIService/v1beta1.metrics.k8s.io
I0823 19:25:29.888677       1 component.go:79] Updating component Deployment/kube-system/metrics-server
I0823 19:25:31.387669       1 component.go:101] Finished update of component: Deployment/kube-system/metrics-server
I0823 19:25:32.397426       1 component.go:79] Updating component Service/kube-system/metrics-server
I0823 19:25:32.787819       1 component.go:101] Finished update of component: Service/kube-system/metrics-server
I0823 19:25:33.791862       1 component.go:79] Updating component ServiceAccount/kube-system/metrics-server
I0823 19:25:34.088068       1 component.go:101] Finished update of component: ServiceAccount/kube-system/metrics-server
I0823 19:25:35.095802       1 component.go:79] Updating component ClusterRole/system:metrics-server
I0823 19:25:35.399154       1 component.go:101] Finished update of component: ClusterRole/system:metrics-server
I0823 19:25:36.413012       1 component.go:79] Updating component ClusterRoleBinding/system:metrics-server
I0823 19:25:36.636727       1 component.go:101] Finished update of component: ClusterRoleBinding/system:metrics-server
I0823 19:25:37.665680       1 component.go:79] Updating component ConfigMap/openshift-web-console/webconsole-config
I0823 19:25:37.908471       1 component.go:101] Finished update of component: ConfigMap/openshift-web-console/webconsole-config
I0823 19:25:38.917026       1 component.go:79] Updating component ServiceAccount/openshift-web-console/webconsole
I0823 19:25:39.192305       1 component.go:101] Finished update of component: ServiceAccount/openshift-web-console/webconsole
I0823 19:25:40.200972       1 component.go:79] Updating component Service/openshift-web-console/webconsole
I0823 19:25:40.401035       1 component.go:101] Finished update of component: Service/openshift-web-console/webconsole
I0823 19:25:41.409767       1 component.go:79] Updating component Deployment/openshift-web-console/webconsole
I0823 19:25:43.091352       1 component.go:101] Finished update of component: Deployment/openshift-web-console/webconsole
I0823 19:25:44.134619       1 component.go:79] Updating component ClusterRoleBinding/registry-registry-role
I0823 19:25:44.414042       1 component.go:101] Finished update of component: ClusterRoleBinding/registry-registry-role
I0823 19:25:45.421271       1 component.go:79] Updating component ServiceAccount/default/registry
I0823 19:25:45.689676       1 component.go:101] Finished update of component: ServiceAccount/default/registry
I0823 19:25:46.696835       1 update.go:121] Not current object default/docker-registry is running
I0823 19:25:46.696853       1 component.go:196] Install mode, no original object is needed, skipping.
I0823 19:25:46.697898       1 component.go:226] No original or current manifest found, it's in install mode for manifest for default/docker-registry
I0823 19:25:46.697914       1 component.go:79] Updating component Service/default/docker-registry
E0823 19:25:46.936398       1 update.go:66] Error updating: Failed update of component: Service/default/docker-registry due to: Service "docker-registry" is invalid: spec.clusterIP: Invalid value: "10.3.0.25": provided IP is not in the val
id range. The range of valid IPs is 172.18.0.0/16
I0823 19:25:57.945730       1 update.go:53] Upgrade triggered, req: 1, comp: 0
I0823 19:26:11.187993       1 update.go:78] Start upgrading
...

What you expected to happen?

The kube-addon-operator not crashing.

How to reproduce it (as minimally and precisely as possible)?

Run installer on top of RHCOS. Also pulled in #134 and #150.

Race during libvirt network creation

Following https://github.com/openshift/installer/blob/master/Documentation/dev/libvirt-howto.md targeting RHCOS, on the first run of tectonic install --dir=test1, I always hit:

libvirt_network.tectonic_net: Creating...
  addresses.#:             "" => "1"
  addresses.0:             "" => "192.168.124.0/24"
  bridge:                  "" => "tt0"
  dns_forwarder.#:         "" => "1"
  dns_forwarder.0.address: "" => "8.8.8.8"
  domain:                  "" => "mco.testing"
  mode:                    "" => "nat"
  name:                    "" => "tectonic"
null_resource.console_dns: Creating...
null_resource.console_dns: Provisioning with 'local-exec'...
null_resource.console_dns (local-exec): Executing: ["/bin/sh" "-c" "virsh -c qemu:///system net-update tectonic add dns-host \"<host ip='192.168.124.50'><hostname>test1</hostname></host>\" --live --config"]
null_resource.console_dns (local-exec): error: failed to get network 'tectonic'
null_resource.console_dns (local-exec): error: Network not found: no network with matching name 'tectonic'

libvirt_network.tectonic_net: Creation complete after 6s (ID: e16cd2a2-111a-4d9c-8ebb-679185c98b0a)
module.libvirt_base_volume.libvirt_volume.coreos_base: Still creating... (10s elapsed)
null_resource.console_dns: Still creating... (10s elapsed)
module.libvirt_base_volume.libvirt_volume.coreos_base: Still creating... (20s elapsed)
null_resource.console_dns: Still creating... (20s elapsed)
module.libvirt_base_volume.libvirt_volume.coreos_base: Creation complete after 28s (ID: /var/lib/libvirt/images/coreos_base)

Error: Error applying plan:

1 error(s) occurred:

* null_resource.console_dns: Error running command 'virsh -c qemu:///system net-update tectonic add dns-host "<host ip='192.168.124.50'><hostname>test1</hostname></host>" --live --config': exit status 1. Output: error: failed to get network 'tectonic'
error: Network not found: no network with matching name 'tectonic'

Then rerunning tectonic install --dir=test1 works and completes the installation.

Looks like a race between network creation and calling virsh -c qemu:///system net-update?
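As a diagnostic, you can confirm that the network does eventually exist (and that only the ordering is wrong) by checking libvirt directly between the failing run and the retry:

virsh -c qemu:///system net-list --all

If tectonic shows up there, the net-update provisioner simply fired before network creation finished, which matches the timeline in the output above.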

install-config.yml edits don't affect manifests/cluster-config.yaml

Versions

Tectonic version (release or commit hash):

41077a9d818756c16ffbf073499a7ccef0a79c60

Platform (aws|libvirt):

libvirt

What happened?

manifests/cluster-config.yaml has wrong data, especially in libvirt section.

What you expected to happen?

My changes should be respected.

How to reproduce it (as minimally and precisely as possible)?

export OPENSHIFT_INSTALL_PLATFORM=libvirt
export OPENSHIFT_INSTALL_BASE_DOMAIN=tt2.testing
export OPENSHIFT_INSTALL_CLUSTER_NAME=test2
export OPENSHIFT_INSTALL_PULL_SECRET_PATH=~/coreos.pull.json
export OPENSHIFT_INSTALL_LIBVIRT_URI=qemu+tcp://192.168.124.1/system
export OPENSHIFT_INSTALL_SSH_PUB_KEY_PATH=~/.ssh/id_rsa.pub
export OPENSHIFT_INSTALL_EMAIL_ADDRESS='[email protected]'
export OPENSHIFT_INSTALL_PASSWORD='supergoodpasswordgoeshere'
export OPENSHIFT_INSTALL_LIBVIRT_IMAGE='file:///home/mgugino/images/rhcos-qemu.qcow2'

bin/openshift-install install-config
# make changes to install-config.yml
bin/openshift-install ignition-configs
bin/openshift-install manifests

References

https://gist.github.com/michaelgugino/8b4bbab8f77cb842802f24b14828b11e
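One quick way to confirm the symptom, assuming the manifests are rendered into ./manifests as the issue title suggests, is to grep for a value you changed in install-config.yml and compare it with what landed in the rendered config:

grep -i libvirt install-config.yml manifests/cluster-config.yaml

If the two disagree, the edit was not picked up when the later assets were generated.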

openshift-install inconsistent AWS credentials usage

Similar to #335, it appears the installer may be unpredictable with AWS credentials. I was experimenting with running in a container using env vars for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Because it was in a container I had no ~/.aws/credentials file. This was good enough for the installer's initial phases, however once terraform ran I hit this error:

data.terraform_remote_state.assets: Refreshing state...
data.aws_region.current: Refreshing state...
data.aws_availability_zones.azs: Refreshing state...
data.aws_route53_zone.tectonic: Refreshing state...

Error: Error refreshing state: 1 error(s) occurred:

* module.dns.data.aws_route53_zone.tectonic: 1 error(s) occurred:

* module.dns.data.aws_route53_zone.tectonic: data.aws_route53_zone.tectonic: no matching Route53Zone found


ERRO[0056] terraform failed: failed to run Terraform: exit status 1
FATA[0056] failed to generate asset: failed to read tfstate file "/tmp/openshift-install-314417690/infra.tfstate": open /tmp/openshift-install-314417690/infra.tfstate: no such file or directory

On a hunch from others hitting similar issues yesterday, I mounted in "-v /home/dgoodwin/.aws/credentials:/home/user/.aws/credentials:Z"; in that file the default profile held my desired credentials, and terraform was then happy to complete the install.

It appears the installer and terraform are not loading credentials the same way, specifically terraform does not appear to respect the standard env vars for AWS creds.
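Until that is fixed, a workaround for containerized runs, assuming only the two standard environment variables are available, is to synthesize the shared credentials file that the Terraform step appears to require (the heredoc below expands the two variables into the file):

mkdir -p ~/.aws
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = ${AWS_ACCESS_KEY_ID}
aws_secret_access_key = ${AWS_SECRET_ACCESS_KEY}
EOF

This mirrors the bind-mount workaround above without needing a credentials file on the host beforehand.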

Teach installer to specify OS version through oscontainer image URLs

As mentioned in #267, the latest public RHCOS AMI will likely not be the latest RHCOS we want. Right now, MachineConfigs hardcode ://dummy as the OSImageURL: https://github.com/openshift/machine-config-operator/blob/a91ab5279755f87f0953f294e9add7584761a489/pkg/controller/template/render.go#L208. This should instead be the URL to the oscontainer image the installer would like to have installed. In more concrete terms: we could e.g. hardcode it for now as is currently done in ami.go for AMIs until we're ready to just always pick the latest, but also have something analogous to ec2AMIOverride, e.g. osImageURLOverride?

The MCD will then immediately update the node to the desired oscontainer upon startup.

InstallConfig: Pointers for Networking, etc.?

Spun off from this comment, we have several properties which are structures instead of pointers (e.g. Networking). If we make those pointers, we can get working omitempty handling for them (golang/go#11939) at the cost of slightly more complicated Go handling (but with slightly more efficient shallow copies ;). @abhinavdahiya pointed out that we'll want to have the defaults filled in when we push these into the cluster, so consumers don't have to internalize (or vendor) the defaulting logic. But I think we may still want the pointers to support folks using our type to generate InstallConfig to feed into the installer? For example:

  1. Some personal generator uses our package to create an InstallConfig YAML. They feed that into...
  2. openshift-install init, which injects default opinions for networking, etc.
  3. openshift-install something-else pushes the fully-defaulted YAML into the cluster.
  4. Operators fetch the fully-defaulted YAML from the cluster.

Is (1) a use-case that we want to support? Having Networking and other options where the installer can choose sane defaults without user assistance would make life easier for folks doing (1).

Question: Installing a nested libvirt cluster from a Linux VM

This is mostly a duplicate of #201, but I just want to understand: if I have a Linux VM (assume CentOS/Fedora), can I use the installer on top of it to provision a single-node cluster (provided I have the required libvirt dependencies but no nested virtualization)?

aws: Plan for Multi-AZ Support

A proposal adapted from what we originally planned in cluster-operator:

  • add "subnets" array to platform.aws. Name, region, and CIDR which must be within vpcCIDRBlock.
  • add array of subnet names above to MachinePool which those machines will be balanced across. Requires AWS actuator support.

Both of these changes look additive and optional, so they are something we could add post-4.0, but I just wanted to raise this in case anyone spots looming problems with the current format.

Stop using docker

We'd like to not have to ship docker in newer versions of the OS. We will support running podman - and podman runs more nicely as a systemd unit.
