
googlecloudplatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos

License: Apache License 2.0

Makefile 43.51% Shell 46.31% Go 2.47% Python 7.71%

kubeflow-distribution's Introduction

Google Cloud distribution of Kubeflow

The official documentation is available here.

To deploy a full-fledged Kubeflow on a Google Cloud Kubernetes cluster, follow the steps below.

Kubeflow is deployed as follows:

  • Deploy a management cluster using the manifests in management.

    • The management cluster runs KCC and optionally ConfigSync
    • The management cluster is used to create all Google Cloud resources for Kubeflow (e.g. the GKE cluster)
    • A single management cluster can be used for multiple projects or multiple Kubeflow deployments
  • Deploy Kubeflow cluster using the manifests in kubeflow.

    • kubeflow contains a kustomization rule for each component.
    • Component manifests are pulled from the upstream kubeflow/manifests repository into each component folder's upstream/ directory.
    • The Makefile uses kustomize and kubectl to generate and apply resources (see the sketch after this list).
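
For orientation, hydrating and applying a single component boils down to something like the following (a minimal sketch; the paths and context name are illustrative, not the exact ones used by the Makefile):

# Hydrate one component with kustomize, then apply the result with kubectl (illustrative)
kustomize build <component-dir> -o .build/<component>.yaml
kubectl --context="${KF_CTXT}" apply -f .build/<component>.yaml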

For more information about packages, refer to the kpt packages guide.

Getting Started

  1. Use the management blueprint to spin up a management cluster.
  2. Use the kubeflow blueprint to create a Kubeflow deployment.

Development

Sample material

To get a sense of how the Kubeflow components are used together in an ML workflow, try the basic example kubeflow-e2e-mnist.ipynb using a Notebook in Kubeflow. It makes use of the Notebook, Volume, Pipelines, AutoML, and KServe components.

Test Grid

kubeflow-distribution's People

Contributors

bconsolvo, bobgy, chensun, edi-bice-by, fabito, gkcalat, jlewi, kelvins, kindomlee, linchin, philippslang, prachirp, richardsliu, subodh101, sunilfernandes, zijianjoy


kubeflow-distribution's Issues

Pin manifests commit in blueprint and create auto-update pipeline

Right now the blueprint is tracking the head of master rather than pinning to a specific commit:
https://github.com/kubeflow/gcp-blueprints/blob/09077334e2fb3417e1875be3cde8b160ec42297d/kubeflow/Makefile#L29

A better approach would be to pin to a particular commit and then automate opening up PRs whenever there is a change to kubeflow/manifests.

This would have the added benefit of triggering auto-deployments whenever kubeflow/gcp-blueprints is updated.

kpt fn throws error directory not empty

The first time I apply my kpt function I get the following error:

kpt fn run .build --image=gcr.io/kubeflow-images-public/kustomize-fns/image-prefix@sha256:7c5b8e0834fc06a2a254307046a32266b0f22256ab34fc12e4beda827b02b628

Error: remove /home/jlewi/git_jlewi-kubeflow-dev/kubeflow-deployments/gcp-private-0527/.build: directory not empty

However, the transform appears to be applied correctly, and when I rerun the kpt function the error is not reported.

The contents of the .build directory are checked in here:

https://github.com/jlewi/kubeflow-dev/tree/3e117b0afd66e77649a5f6f675dd188faaa2c9cb/kubeflow-deployments/gcp-private-0527/.build

Setup triage action workflow

I have configured the triage workflow but I think we are missing the required secrets. I think we need a GitHub token with permission to modify the repo.

Management cluster should not use CNRM in namespaced mode

We are currently recommending installing CNRM in namespaced mode.

I'm not sure that's what we should recommend.

  • Namespaced mode is inconvenient when managing multiple projects, as we end up having to create
    multiple deployments of the CNRM system

  • Using ACM to install CNRM doesn't appear to install it in namespaced mode.

Default workload identity bindings for deployments that need GCP permissions

I've found quite a few deployments don't work properly due to a lack of GCP permissions, including:

  • profiles manager
2020-06-23T08:51:18.266Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "profile", "request": "/default-profile", "error": "googleapi: Error 403: Permission iam.serviceAccounts.getIamPolicy is required to perform this operation on service account projects/gongyuan-pipeline-test/serviceAccounts/[email protected]., forbidden"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
  • pipelines ui, visualization server deployments
  • default-editor (and default-viewer, maybe later) KSA in user namespace (needed by pipeline runs, tensorboard instances)

/cc @jlewi

I think we should bind workload identity with these KSAs by default.

Proposal

Add workload identity binding for

  • profiles manager - admin GSA
  • pipelines ui, visualization server - user GSA

default-editor will be automatically bound to user GSA when profiles manager works properly.

@jlewi I can implement this, thoughts?
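
For reference, a Workload Identity binding between a KSA and a GSA generally looks like the following (a sketch with hypothetical names; the real GSAs/KSAs would be the ones listed above):

# Allow the KSA to impersonate the GSA (names are hypothetical)
gcloud iam service-accounts add-iam-policy-binding \
  "kf-admin@${PROJECT}.iam.gserviceaccount.com" \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT}.svc.id.goog[kubeflow/profiles-controller-service-account]"

# Annotate the KSA so GKE injects credentials for that GSA
kubectl annotate serviceaccount profiles-controller-service-account -n kubeflow \
  iam.gke.io/gcp-service-account="kf-admin@${PROJECT}.iam.gserviceaccount.com"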

Add instructions for deploying with ConfigSync/ACM

We should add instructions for deploying with ConfigSync/ACM.

ConfigSync can install KCC so we don't have to do that piece.

However, the current version of KCC is too old and incompatible with some of our specs. So we need to wait for the next release of ACM.

Profile deployment: configmap kubeflow-config not found

There are usages of the ConfigMap like:

env:
        - name: USERID_HEADER
          valueFrom:
            configMapKeyRef:
              key: userid-header
              name: kubeflow-config

However, kubeflow-config is generated by a configMapGenerator, so it should have a suffix. I'm not sure exactly why, but kustomize didn't append the suffix in the profile deployment.

Reproduced with kustomize version 3.1.0. (I cannot use kustomize 3.5.4 to test; there was another error.)
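
For context, kustomize appends a content-hash suffix to generated ConfigMaps and rewrites references to them; a minimal sketch of the generator, plus the option that disables the suffix (shown only for illustration, not necessarily the right fix here):

cat <<'EOF' > kustomization.yaml
configMapGenerator:
- name: kubeflow-config
  literals:
  - userid-header=<your-header-value>
generatorOptions:
  disableNameSuffixHash: true   # suppress the hash suffix so plain-name references keep working
EOF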

Secure blueprint for the management cluster

Per #33 &
https://github.com/kubeflow/gcp-blueprints/blob/master/kubeflow/deploy_private.md

We now have a recipe for deploying KF on private GKE.

One of the gaps though is that our blueprint for the management cluster isn't using private GKE. Making the KF clusters more secure than the management cluster doesn't make sense.

So we probably want to provide a recipe for setting up the blueprint clusters.

AnthosCLI should now support all CNRM resources so we should be able to bootstrap all resources for a private GKE deployment using AnthosCLI.

Our management clusters probably don't need Istio, so the setup should hopefully be simpler.

Error: resources must be annotated with config.kubernetes.io/index to be written to files

When we pull packages with kpt, we get the following error:

Error: resources must be annotated with config.kubernetes.io/index to be written to files

This is an issue with kpt kptdev/kpt#541 not handling the case where the annotations field is present but empty on resources.

We could also clean up our manifests and add a validation test to prevent the annotations field from being present but set to null/empty (it should just not be present).

Best way to whitelist DockerHub and Quay.io firewall rules

On private deployments we want to deny all external traffic by default. One of the exceptions is to
allow traffic to DockerHub so we can pull docker images stored there.

Right now we do this just by creating a firewall rule that whitelists traffic to the DockerHub site. We get the IPs just by resolving the domains, e.g.

nslookup index.dockerhub.io
nslookup dockerhub.io
nslookup registry-1.docker.io

I don't think there is any guarantee that these IP addresses are static.

Opening this issue to track whether we can come up with a better solution.
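
For reference, the current approach amounts to roughly the following (a sketch; the network name, rule name, and port are assumptions):

# Resolve the registry endpoints and allow egress to those IPs (the IPs may change over time)
DOCKER_IPS=$(dig +short registry-1.docker.io | paste -sd, -)
gcloud compute firewall-rules create allow-egress-dockerhub \
  --network="${NETWORK}" \
  --direction=EGRESS \
  --action=ALLOW \
  --rules=tcp:443 \
  --destination-ranges="${DOCKER_IPS}"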

Verify that all CNRM resources are in a ready and healthy state

A lot of common problems could be surfaced just by checking if all the CNRM resources are in a ready state.

For example, a common failure mode is a CNRM resource (e.g. a firewall rule) which is not in a ready state because it refers to an invalid resource.

We should write a simple Go binary (maybe a kfctl subcommand) to check that all resources are in a healthy state to help identify any problems.
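
Until such a tool exists, a rough shell equivalent of the check might look like this (a sketch, assuming kubectl is pointed at the management cluster):

# For every CNRM resource type, print each resource's Ready condition
for crd in $(kubectl get crds -o name | grep cnrm.cloud.google.com | cut -d/ -f2); do
  echo "=== ${crd} ==="
  kubectl get "${crd}" --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
done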

Make regional deployments work; what to do about the disks for metadata

We would like to support regional deployments of Kubeflow, i.e. use a regional cluster. Ideally this would be the default deployment since regional clusters are more reliable and there are only minimal cost savings to using a zonal cluster.

One issue we need to resolve is what to do about the disks backing up the metadata DB.

We should be able to use regional PDs.
https://cloud.google.com/compute/docs/disks/high-availability-regional-persistent-disk
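
A hedged sketch of what that could look like with the GKE PD CSI driver (parameters taken from the regional PD documentation, not validated against this blueprint):

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-standard
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
  replication-type: regional-pd   # replicate the disk across two zones in the region
volumeBindingMode: WaitForFirstConsumer
EOF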

EDIT: please thumbs up on the issue if it's important to you.

get-pkg doesn't delete tests dir

When a user runs

make get-pkg

The directory $(MANIFESTS_DIR)/tests isn't deleted. It looks like the problem is that kpt pkg get is returning an error, so the subsequent rm commands aren't running.
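
One possible mitigation (a sketch; the variable names are illustrative, not the Makefile's actual ones) is to let the cleanup run even when kpt pkg get exits non-zero:

# Don't let a kpt error short-circuit the cleanup (illustrative)
kpt pkg get "${MANIFESTS_REPO}@${MANIFESTS_REF}" "${MANIFESTS_DIR}" || echo "kpt pkg get failed; continuing"
rm -rf "${MANIFESTS_DIR}/tests"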

ACM and iap enabler will conflict - switch to using RCToken

The IAP enabler pod will try to patch the policy with the backend service ID, which is the audience. This has been a source of problems because the backend can change.

With ASM we should be able to use a configurable audience (RCTOKEN).
https://cloud.google.com/service-mesh/docs/iap-integration#configure_the_iap_access_list

This should allow us to pick and use a deterministic audience so we won't run into issues with the iap-enabler pod fighting ACM to update the ingress policy.

It doesn't look like we can set the RCToken on IAP using a GKEBackendConfig so we might need to write a simple daemon to do that.

Avoid duplicating the ISTIO Ingressgateway service when using ASM

When we hydrate the manifests for Istio from the IstioControlPlane operator
https://github.com/kubeflow/manifests/blob/master/gcp/v2/asm/istio-operator.yaml

we end up generating the following IngressGateway.yaml file (attached as IngressGateway.yaml.txt).

This defines the service "istio-ingressgateway".

The problem is that this K8s service doesn't contain the annotation:

beta.cloud.google.com/backend-config: '{"ports": {"http2":"iap-backendconfig"}}'

which is needed to associate it with a BackendConfig to configure IAP.

In the past we just duplicated this resource:
https://github.com/kubeflow/manifests/blob/master/istio/iap-gateway/base/istio-ingressgateway.yaml

In general this worked because it would be applied after the Istio config.

With ACM this starts to be more problematic because we end up with two resources with the same name and ACM doesn't allow this.

There are a couple of possible options:

  1. We could create a second ingressgateway K8s service with a different name and use that
    for our load balancer
  2. We could use a kustomize function to transform the existing ISTIO service and add the
    appropriate annotation

#1 has the drawback that we risk getting out of sync with the configs generated by the Istio ingressgateway.

#2 is better in this regard because it is more of a template-free solution.

#2 has the disadvantage, though, that it depends on newer functionality in kustomize. I'm not even sure it's available in any of the existing releases (it is available on master).
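
Whichever mechanism is used, the transformation itself is small; as a plain kustomize strategic-merge patch it would look roughly like this (a sketch; file names are hypothetical):

cat <<'EOF' > iap-annotation-patch.yaml
# Add the BackendConfig annotation to the existing istio-ingressgateway Service
apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-system
  annotations:
    beta.cloud.google.com/backend-config: '{"ports": {"http2":"iap-backendconfig"}}'
EOF

cat <<'EOF' > kustomization.yaml
resources:
- IngressGateway.yaml            # the hydrated Istio output referenced above
patchesStrategicMerge:
- iap-annotation-patch.yaml
EOF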

Related to: #4

Make "hack" its own package that is fetched with kpt

Right now we are duplicating scripts like "create_context.sh" in manifests/hack and kubeflow/hack.

Instead, we could make hack its own kpt package and then fetch it into upstream so we can reuse the same scripts across the blueprints.
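
Fetching it would then look like any other upstream package (a sketch; the repository path and ref are hypothetical):

kpt pkg get https://github.com/kubeflow/gcp-blueprints.git/hack@master ./hack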

Remove dependency on istioctl

Currently we need to use istioctl to install ASM.

AnthosCLI wasn't working to install ASM on an existing cluster. I think the problem is that AnthosCLI doesn't have an option to just hydrate the ASM manifests. AnthosCLI will try to hydrate and apply.

Kubeflow deployment fails "webhook.cert-manager.io" unavailable

Deploying Kubeflow fails the first time you run it with the error:

Error from server (InternalError): error when creating ".build/kubeflow-apps/cert-manager.io_v1alpha2_certificate_admission-webhook-cert.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request

The problem is that it's trying to create the certificate for the Kubeflow admission controller and failing because cert-manager isn't available yet.

Simply waiting and retrying fixes the problem. Is there a better solution though?
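
One option is to gate the apply on the webhook being available, e.g. (a sketch; the deployment and namespace names are the usual cert-manager defaults, not verified against this blueprint):

# Wait for the cert-manager webhook before applying Certificate resources
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-webhook -n cert-manager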

istio-ready test is failing - missing deployment istio-egressgateway

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1280857598203006977

>           raise ApiException(http_resp=r)
E           kubernetes.client.rest.ApiException: (404)
E           Reason: Not Found
E           HTTP response headers: HTTPHeaderDict({'Audit-Id': 'b7c92170-04c6-4418-8aa3-dd0cff342cbe', 'Content-Type': 'application/json', 'Date': 'Wed, 08 Jul 2020 13:37:18 GMT', 'Content-Length': '240'})
E           HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.apps \"istio-egressgateway\" not found","reason":"NotFound","details":{"name":"istio-egressgateway","group":"apps","kind":"deployments"},"code":404}

Setup autodeploy for GCP blueprints

We should set up the auto-deploy infrastructure to auto-deploy from blueprints.

This way we ensure that our GCP blueprint is up to date and working.

CLI (kfctl)? to apply CloudEndpoints resources

The Service Management API is not available via the restricted VIP yet:
https://cloud.google.com/vpc-service-controls/docs/restricted-vip-services

As a result the CloudEndpoints controller won't work when the GKE cluster is configured to use the restricted VIP.

A simple solution is to have the user run commands on their machine to create the endpoint and the DNS entry.

I created a CLI version of the cloud endpoints controller to do this.
https://github.com/jlewi/cloud-endpoints-controller/blob/master/cmd/main.go

We should think about integrating that as a gcp subcommand in kfctl.

[KF 1.1] GCP Blueprints Release

Tracking bug for release of KF for GCP on 1.1

To release blueprints we need to do the following

  1. Cut a 1.1 branch for gcp-blueprints
    • Pin to a particular commit of kubeflow/manifests
  2. Update docs on www.kubeflow.org to use blueprints

Docs for installing on existing cluster

A lot of users have requested help installing Kubeflow on existing clusters. Opening this issue to track documenting how to do that.

At a high level I think you would want to do something like the following:

  • When you configure the blueprint, point it at your existing cluster (see the sketch after this list)

    • e.g. set the cluster name and location to those of your existing cluster
  • Remove CNRM resources for resources you don't want to create

    • e.g. the cluster resource
  • Possibly update any references to point at your existing resources.
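
If the blueprint's kpt setters are used for this, pointing at an existing cluster might look roughly like the following (a sketch; the setter names are hypothetical and depend on the blueprint's Kptfile):

kpt cfg set ./kubeflow name "${EXISTING_CLUSTER_NAME}"
kpt cfg set ./kubeflow location "${EXISTING_CLUSTER_LOCATION}"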

Cluster stuck in non-ready state because channel update rejected

While deploying GCP private clusters I encountered a problem where the command

kubectl --context=gcp-private-dev-mgmt wait --for=condition=Ready --timeout=600s  containercluster gcp-private-0527

would time out. It turned out the problem was that the last state for the cluster was:

  - lastTransitionTime: "2020-05-29T00:15:38Z"
    message: 'Update call failed: the desired mutation for the following field(s)
      is invalid: [releaseChannel.0.Channel]'
    reason: UpdateFailed
    status: "False"
    type: Ready

gcloud describe, however, shows the cluster to be in the STABLE channel, so it's not clear why the update would fail:

gcloud beta --project=$PROJECT container clusters describe --region=$REGION $CLUSTER 
...
releaseChannel:
  channel: STABLE

CNRM containercluster missing ipAllocationPolicy fields needed to create private GKE clusters

It looks like in CNRM 1.9.1 the CRD for ContainerCluster is missing the ipAllocationPolicy fields needed to create a private GKE cluster.

To work around that, get the CRD and change the schema to:

...
              ipAllocationPolicy:
                properties:
                  useIpAliases:
                    type: boolean
                  createSubnetwork:
                    type: boolean
                  subnetworkName:
                    type: string
                  clusterIpv4CidrBlock:
                    type: string
                  clusterSecondaryRangeName:
                    type: string
                  servicesIpv4CidrBlock:
                    type: string
                  servicesSecondaryRangeName:
                    type: string
                type: object
...
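
Roughly, the workaround could be applied like this (a sketch; the CRD name follows the CNRM naming convention and should be double-checked):

# Export the CRD, add the ipAllocationPolicy fields shown above, then re-apply it
kubectl get crd containerclusters.container.cnrm.cloud.google.com -o yaml > containercluster-crd.yaml
# ... edit containercluster-crd.yaml to include the schema above ...
kubectl apply -f containercluster-crd.yaml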

Load secrets from secret manager

Right now users supply secrets like the OAuth client ID and secret via environment variables.

A better approach would be to store the secrets in GCP Secret Manager. We could then run a process in the cluster that creates K8s secrets from Secret Manager.
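
For example, a small in-cluster job (or a manual step) could do something like the following (a sketch; the Secret Manager secret names and the K8s secret name are hypothetical):

# Pull the OAuth client credentials from Secret Manager and mirror them into a K8s secret
CLIENT_ID=$(gcloud secrets versions access latest --secret=kubeflow-oauth-client-id)
CLIENT_SECRET=$(gcloud secrets versions access latest --secret=kubeflow-oauth-client-secret)
kubectl create secret generic kubeflow-oauth -n istio-system \
  --from-literal=client_id="${CLIENT_ID}" \
  --from-literal=client_secret="${CLIENT_SECRET}"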

istio proxy sidecars have repeated warning messages: wasm log stackdriver_inbound: Stackdriver logging api call error: 13

Reproduce steps

  1. Create a Jupyter notebook instance
  2. kubectl logs istio-proxy -n
  3. Observe the error logs below

Description

I initially found these error logs when developing KFP multi-user mode, and then realized they show up in every istio sidecar container.
I'm not seeing any issues as of now, but this is probably something we need to take a look at and fix.

[Envoy (Epoch 0)] [2020-07-06 10:34:21.649][65][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1677] wasm log stackdriver_inbound: Stackdriver logging api call error: 13
2020-07-06T10:34:31.617560Z	info	token	Prepared federated token request
2020-07-06T10:34:31.622653Z	info	token	Prepared federated token request
2020-07-06T10:34:31.628111Z	info	token	Received federated token response after 10.508742ms
2020-07-06T10:34:31.628226Z	info	token	Federated token will expire in 3600 seconds
2020-07-06T10:34:31.628255Z	info	token	Prepared access token request
2020-07-06T10:34:31.633483Z	info	token	Received federated token response after 10.796653ms
2020-07-06T10:34:31.633730Z	info	token	Federated token will expire in 3600 seconds
2020-07-06T10:34:31.633792Z	info	token	Prepared access token request
2020-07-06T10:34:31.640949Z	info	token	Received access token response after 12.68243ms
2020-07-06T10:34:31.641195Z	error	token	access token response does not have access token{
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

2020-07-06T10:34:31.641220Z	warn	stsServerLog	token manager fails to generate token: access token response does not have access token. {
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

E0706 10:34:31.641426325      67 oauth2_credentials.cc:152]  Call to http server ended with error 500 [{
  "error": "invalid_target",
  "error_description": "access token response does not have access token. {\n  \"error\": {\n    \"code\": 404,\n    \"message\": \"Requested entity was not found.\",\n    \"status\": \"NOT_FOUND\"\n  }\n}\n",
  "error_uri": ""
}].
[Envoy (Epoch 0)] [2020-07-06 10:34:31.641][65][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1677] wasm log stackdriver_inbound: Stackdriver logging api call error: 13
2020-07-06T10:34:31.651403Z	info	token	Received access token response after 17.588733ms
2020-07-06T10:34:31.651550Z	error	token	access token response does not have access token{
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

2020-07-06T10:34:31.651563Z	warn	stsServerLog	token manager fails to generate token: access token response does not have access token. {
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

E0706 10:34:31.651741260      66 oauth2_credentials.cc:152]  Call to http server ended with error 500 [{
  "error": "invalid_target",
  "error_description": "access token response does not have access token. {\n  \"error\": {\n    \"code\": 404,\n    \"message\": \"Requested entity was not found.\",\n    \"status\": \"NOT_FOUND\"\n  }\n}\n",
  "error_uri": ""
}].
[Envoy (Epoch 0)] [2020-07-06 10:34:31.651][64][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1677] wasm log stackdriver_inbound: Stackdriver logging api call error: 13

[ACM] Support structured repository

Currently our ACM story doesn't use a structured repository. The main reason is that we would need to refactor the manifests, e.g. put cluster-scoped resources in one directory and namespaced resources in a different directory.

Could we reorganize the files using a kustomize function?

Using a flat repository will get pretty unmanageable.

Can we use subdirectories with an unstructured repository?
