
googlecloudplatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos

License: Apache License 2.0

Makefile 43.51% Shell 46.31% Go 2.47% Python 7.71%

kubeflow-distribution's Introduction

Google Cloud distribution of Kubeflow

The official documentation is available here.

To deploy a full-fledged Kubeflow on a Google Cloud Kubernetes cluster, follow the steps below.

Kubeflow is deployed as follows:

  • Deploy a management cluster using the manifests in management.

    • The management cluster runs KCC and optionally ConfigSync
    • The management cluster is used to create all Google Cloud resources for Kubeflow (e.g. the GKE cluster)
    • A single management cluster can be used for multiple projects or multiple Kubeflow deployments
  • Deploy Kubeflow cluster using the manifests in kubeflow.

    • kubeflow contains a kustomization rule for each component.
    • Component manifests are pulled from the upstream kubeflow/manifests repository into each component folder's upstream/ directory.
    • The Makefile uses kustomize and kubectl to generate and apply resources (see the sketch after this list).
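
For orientation, hydrating and applying a single component boils down to something like the following (a minimal sketch; the paths and context name are illustrative, not the exact ones used by the Makefile):

# Hydrate one component with kustomize, then apply the result with kubectl (illustrative)
kustomize build <component-dir> -o .build/<component>.yaml
kubectl --context="${KF_CTXT}" apply -f .build/<component>.yaml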

For more information about packages, refer to the kpt packages guide.

Getting Started

  1. Use the management blueprint to spin up a management cluster.
  2. Use the kubeflow blueprint to create a Kubeflow deployment.

Development

Sample material

To get a sense of how the Kubeflow components are used together in an ML workflow, try the basic example kubeflow-e2e-mnist.ipynb using a Notebook in Kubeflow. It makes use of the Notebook, Volume, Pipelines, AutoML, and KServe components.

Test Grid

kubeflow-distribution's People

Contributors

bconsolvo, bobgy, chensun, edi-bice-by, fabito, gkcalat, jlewi, kelvins, kindomlee, linchin, philippslang, prachirp, richardsliu, subodh101, sunilfernandes, zijianjoy


kubeflow-distribution's Issues

Pin manifests commit in blueprint and create auto-update pipeline

Right now the blueprint is tracking the head of master rather than pinning to a specific commit:
https://github.com/kubeflow/gcp-blueprints/blob/09077334e2fb3417e1875be3cde8b160ec42297d/kubeflow/Makefile#L29

A better approach would be to pin to a particular commit and then automate opening up PRs whenever there is a change to kubeflow/manifests.

This would have the added benefit of triggering auto-deployments whenever kubeflow/gcp-blueprints is updated.

kpt fn throws error directory not empty

The first time I apply my kpt function I get the following error:

kpt fn run .build --image=gcr.io/kubeflow-images-public/kustomize-fns/image-prefix@sha256:7c5b8e0834fc06a2a254307046a32266b0f22256ab34fc12e4beda827b02b628

Error: remove /home/jlewi/git_jlewi-kubeflow-dev/kubeflow-deployments/gcp-private-0527/.build: directory not empty

However, the transform appears to be applied correctly, and when I rerun the kpt function the error is not reported.

The contents of the .build directory are checked in here:

https://github.com/jlewi/kubeflow-dev/tree/3e117b0afd66e77649a5f6f675dd188faaa2c9cb/kubeflow-deployments/gcp-private-0527/.build

Setup triage action workflow

I have configured the triage workflow but I think we are missing the required secrets. I think we need a GitHub token with permission to modify the repo.

Management cluster should not use CNRM in namespaced mode

We are currently recommending installing CNRM in namespaced mode.

I'm not sure that's what we should recommend.

  • Namespaced mode is inconvenient when managing multiple projects, as we end up having to create
    multiple deployments of the CNRM system

  • Using ACM to install CNRM doesn't appear to install it in namespaced mode.

Default workload identity bindings for deployments that need GCP permissions

I've found quite a few deployments don't work properly due to a lack of GCP permissions, including:

  • profiles manager
2020-06-23T08:51:18.266Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "profile", "request": "/default-profile", "error": "googleapi: Error 403: Permission iam.serviceAccounts.getIamPolicy is required to perform this operation on service account projects/gongyuan-pipeline-test/serviceAccounts/[email protected]., forbidden"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
  • pipelines ui, visualization server deployments
  • default-editor (and default-viewer, maybe later) KSA in user namespace (needed by pipeline runs, tensorboard instances)

/cc @jlewi

I think we should bind workload identity with these KSAs by default.

Proposal

Add workload identity binding for

  • profiles manager - admin GSA
  • pipelines ui, visualization server - user GSA

default-editor will be automatically bound to user GSA when profiles manager works properly.

@jlewi I can implement this, thoughts?
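
For reference, a Workload Identity binding between a KSA and a GSA generally looks like the following (a sketch with hypothetical names; the real GSAs/KSAs would be the ones listed above):

# Allow the KSA to impersonate the GSA (names are hypothetical)
gcloud iam service-accounts add-iam-policy-binding \
  "kf-admin@${PROJECT}.iam.gserviceaccount.com" \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT}.svc.id.goog[kubeflow/profiles-controller-service-account]"

# Annotate the KSA so GKE injects credentials for that GSA
kubectl annotate serviceaccount profiles-controller-service-account -n kubeflow \
  iam.gke.io/gcp-service-account="kf-admin@${PROJECT}.iam.gserviceaccount.com"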

Add instructions for deploying with ConfigSync/ACM

We should add instructions for deploying with ConfigSync/ACM.

ConfigSync can install KCC so we don't have to do that piece.

However, the current version of KCC is too old and incompatible with some of our specs. So we need to wait for the next release of ACM.

Profile deployment: configmap kubeflow-config not found

There are usages of the ConfigMap like:

env:
        - name: USERID_HEADER
          valueFrom:
            configMapKeyRef:
              key: userid-header
              name: kubeflow-config

However, kubeflow-config is generated by a configMapGenerator, so it should have a suffix. I'm not sure exactly why, but kustomize didn't append the suffix in the profile deployment.

Reproduced with kustomize version 3.1.0. (I cannot use kustomize 3.5.4 to test; there was another error.)
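
For context, kustomize appends a content-hash suffix to generated ConfigMaps and rewrites references to them; a minimal sketch of the generator, plus the option that disables the suffix (shown only for illustration, not necessarily the right fix here):

cat <<'EOF' > kustomization.yaml
configMapGenerator:
- name: kubeflow-config
  literals:
  - userid-header=<your-header-value>
generatorOptions:
  disableNameSuffixHash: true   # suppress the hash suffix so plain-name references keep working
EOF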

Secure blueprint for the management cluster

Per #33 &
https://github.com/kubeflow/gcp-blueprints/blob/master/kubeflow/deploy_private.md

We now have a recipe for deploying KF on private GKE.

One of the gaps though is that our blueprint for the management cluster isn't using private GKE. Making the KF clusters more secure than the management cluster doesn't make sense.

So we probably want to provide a recipe for setting up the blueprint clusters.

AnthosCLI should now support all CNRM resources so we should be able to bootstrap all resources for a private GKE deployment using AnthosCLI.

Our management clusters probably don't need Istio, so the setup should hopefully be simpler.

Error: resources must be annotated with config.kubernetes.io/index to be written to files

When we pull packages with kpt, we get the following error:

Error: resources must be annotated with config.kubernetes.io/index to be written to files

This is an issue with kpt kptdev/kpt#541 not handling the case where the annotations field is present but empty on resources.

We could also clean up our manifests and add a validation test to prevent the annotations field from being present but set to null/empty (it should just not be present).

Best way to whitelist DockerHub and Quay.io firewall rules

On private deployments we want to deny all external traffic by default. One of the exceptions is to
allow traffic to DockerHub so we can pull docker images stored there.

Right now we do this just by creating a firewall rule that whitelists traffic to the DockerHub site. We get the IPs just by resolving the domains, e.g.

nslookup index.dockerhub.io
nslookup dockerhub.io
nslookup registry-1.docker.io

I don't think there is any guarantee that these IP addresses are static.

Opening this issue to track whether we can come up with a better solution.
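
For reference, the current approach amounts to roughly the following (a sketch; the network name, rule name, and port are assumptions):

# Resolve the registry endpoints and allow egress to those IPs (the IPs may change over time)
DOCKER_IPS=$(dig +short registry-1.docker.io | paste -sd, -)
gcloud compute firewall-rules create allow-egress-dockerhub \
  --network="${NETWORK}" \
  --direction=EGRESS \
  --action=ALLOW \
  --rules=tcp:443 \
  --destination-ranges="${DOCKER_IPS}"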

Verify that all CNRM resources are in a ready and healthy state

A lot of common problems could be surfaced just by checking if all the CNRM resources are in a ready state.

For example, a common failure mode is a CNRM resource (e.g. a firewall rule) which is not in a ready state because it refers to an invalid resource.

We should write a simple Go binary (maybe a kfctl subcommand) to check that all resources are in a healthy state to help identify any problems.
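
Until such a tool exists, a rough shell equivalent of the check might look like this (a sketch, assuming kubectl is pointed at the management cluster):

# For every CNRM resource type, print each resource's Ready condition
for crd in $(kubectl get crds -o name | grep cnrm.cloud.google.com | cut -d/ -f2); do
  echo "=== ${crd} ==="
  kubectl get "${crd}" --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
done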

Make regional deployments work; what to do about the disks for metadata

We would like to support regional deployments of Kubeflow, i.e. use a regional cluster. Ideally this would be the default deployment since regional clusters are more reliable and there are only minimal cost savings to using a zonal cluster.

One issue we need to resolve is what to do about the disks backing up the metadata DB.

We should be able to use regional PDs.
https://cloud.google.com/compute/docs/disks/high-availability-regional-persistent-disk
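
A hedged sketch of what that could look like with the GKE PD CSI driver (parameters taken from the regional PD documentation, not validated against this blueprint):

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-standard
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
  replication-type: regional-pd   # replicate the disk across two zones in the region
volumeBindingMode: WaitForFirstConsumer
EOF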

EDIT: please thumbs up on the issue if it's important to you.

get-pkg doesn't delete tests dir

When a user runs

make get-pkg

The directory $(MANIFESTS_DIR)/tests isn't deleted. It looks like the problem is that kpt pkg get is returning an error, so the subsequent rm commands aren't running.
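
One possible mitigation (a sketch; the variable names are illustrative, not the Makefile's actual ones) is to let the cleanup run even when kpt pkg get exits non-zero:

# Don't let a kpt error short-circuit the cleanup (illustrative)
kpt pkg get "${MANIFESTS_REPO}@${MANIFESTS_REF}" "${MANIFESTS_DIR}" || echo "kpt pkg get failed; continuing"
rm -rf "${MANIFESTS_DIR}/tests"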

ACM and iap enabler will conflict - switch to using RCToken

The IAP enabler pod will try to patch the policy with the backend service ID, which is the audience. This has been a source of problems because the backend can change.

With ASM we should be able to use a configurable audience (RCTOKEN).
https://cloud.google.com/service-mesh/docs/iap-integration#configure_the_iap_access_list

This should allow us to pick and use a deterministic audience so we won't run into issues with the iap-enabler pod fighting ACM to update the ingress policy.

It doesn't look like we can set the RCToken on IAP using a GKEBackendConfig so we might need to write a simple daemon to do that.

Avoid duplicating the ISTIO Ingressgateway service when using ASM

When we hydrate the manifests for Istio from the IstioControlPlane operator
https://github.com/kubeflow/manifests/blob/master/gcp/v2/asm/istio-operator.yaml

we end up generating the following IngressGateway.yaml file (attached as IngressGateway.yaml.txt).

This defines the service "istio-ingressgateway".

The problem is that this K8s service doesn't contain the annotation:

beta.cloud.google.com/backend-config: '{"ports": {"http2":"iap-backendconfig"}}'

which is needed to associate it with a BackendConfig to configure IAP.

In the past we just duplicated this resource:
https://github.com/kubeflow/manifests/blob/master/istio/iap-gateway/base/istio-ingressgateway.yaml

In general this worked because it would be applied after the Istio config.

With ACM this starts to be more problematic because we end up with two resources with the same name and ACM doesn't allow this.

There are a couple of possible options:

  1. We could create a second ingressgateway K8s service with a different name and use that
    for our load balancer
  2. We could use a kustomize function to transform the existing ISTIO service and add the
    appropriate annotation

#1 has the drawback that we risk getting out of sync with the configs generated by the Istio ingressgateway.

#2 is better in this regard because it is more of a template-free solution.

#2 has the disadvantage, though, that it depends on newer functionality in kustomize. I'm not even sure it's available in any of the existing releases (it is available on master).
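
Whichever mechanism is used, the transformation itself is small; as a plain kustomize strategic-merge patch it would look roughly like this (a sketch; file names are hypothetical):

cat <<'EOF' > iap-annotation-patch.yaml
# Add the BackendConfig annotation to the existing istio-ingressgateway Service
apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-system
  annotations:
    beta.cloud.google.com/backend-config: '{"ports": {"http2":"iap-backendconfig"}}'
EOF

cat <<'EOF' > kustomization.yaml
resources:
- IngressGateway.yaml            # the hydrated Istio output referenced above
patchesStrategicMerge:
- iap-annotation-patch.yaml
EOF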

Related to: #4

Make "hack" its own package that is fetched with kpt

Right now we are duplicating scripts like "create_context.sh" in manifests/hack and kubeflow/hack.

Instead, we could make hack its own kpt package and then fetch it into upstream so we can reuse the same scripts across the blueprints.
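
Fetching it would then look like any other upstream package (a sketch; the repository path and ref are hypothetical):

kpt pkg get https://github.com/kubeflow/gcp-blueprints.git/hack@master ./hack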

Remove dependency on istioctl

Currently we need to use istioctl to install ASM.

AnthosCLI wasn't working to install ASM on an existing cluster. I think the problem is that AnthosCLI doesn't have an option to just hydrate the ASM manifests. AnthosCLI will try to hydrate and apply.

Kubeflow deployment fails "webhook.cert-manager.io" unavailable

Deploying Kubeflow fails the first time you run it with the error:

Error from server (InternalError): error when creating ".build/kubeflow-apps/cert-manager.io_v1alpha2_certificate_admission-webhook-cert.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request

The problem is that it's trying to create the certificate for the Kubeflow admission controller and failing because cert-manager isn't available yet.

Simply waiting and retrying fixes the problem. Is there a better solution though?
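
One option is to gate the apply on the webhook being available, e.g. (a sketch; the deployment and namespace names are the usual cert-manager defaults, not verified against this blueprint):

# Wait for the cert-manager webhook before applying Certificate resources
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-webhook -n cert-manager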

istio-ready test is failing - missing deployment istio-egressgateway

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1280857598203006977

>           raise ApiException(http_resp=r)
E           kubernetes.client.rest.ApiException: (404)
E           Reason: Not Found
E           HTTP response headers: HTTPHeaderDict({'Audit-Id': 'b7c92170-04c6-4418-8aa3-dd0cff342cbe', 'Content-Type': 'application/json', 'Date': 'Wed, 08 Jul 2020 13:37:18 GMT', 'Content-Length': '240'})
E           HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.apps \"istio-egressgateway\" not found","reason":"NotFound","details":{"name":"istio-egressgateway","group":"apps","kind":"deployments"},"code":404}

Setup autodeploy for GCP blueprints

We should set up the auto-deploy infrastructure to auto-deploy from blueprints.

This way we ensure that our GCP blueprint is up to date and working.

CLI (kfctl)? to apply CloudEndpoints resources

The Service Management API is not available via the restricted VIP yet:
https://cloud.google.com/vpc-service-controls/docs/restricted-vip-services

As a result the CloudEndpoints controller won't work when the GKE cluster is configured to use the restricted VIP.

A simple solution is to have the user run commands on their machine to create the endpoint and the DNS entry.

I created a CLI version of the cloud endpoints controller to do this.
https://github.com/jlewi/cloud-endpoints-controller/blob/master/cmd/main.go

We should think about integrating that as a gcp subcommand in kfctl.

[KF 1.1] GCP Blueprints Release

Tracking bug for release of KF for GCP on 1.1

To release blueprints we need to do the following

  1. Cut a 1.1 branch for gcp-blueprints
    • Pin to a particular commit of kubeflow/manifests
  2. Update docs on www.kubeflow.org to use blueprints

Docs for installing on existing cluster

A lot of users have requested help installing Kubeflow on existing clusters. Opening this issue to track documenting how to do that.

At a high level I think you would want to do something like the following:

  • When you configure the blueprint, point it at your existing cluster (see the sketch after this list)

    • e.g. set the cluster name and location to those of your existing cluster
  • Remove CNRM resources for resources you don't want to create

    • e.g. the cluster resource
  • Possibly update any references to point at your existing resources.
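
If the blueprint's kpt setters are used for this, pointing at an existing cluster might look roughly like the following (a sketch; the setter names are hypothetical and depend on the blueprint's Kptfile):

kpt cfg set ./kubeflow name "${EXISTING_CLUSTER_NAME}"
kpt cfg set ./kubeflow location "${EXISTING_CLUSTER_LOCATION}"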

Cluster stuck in non-ready state because channel update rejected

While deploying GCP private clusters I encountered a problem where the command

kubectl --context=gcp-private-dev-mgmt wait --for=condition=Ready --timeout=600s  containercluster gcp-private-0527

would time out. It turned out the problem was that the last state for the cluster was:

  - lastTransitionTime: "2020-05-29T00:15:38Z"
    message: 'Update call failed: the desired mutation for the following field(s)
      is invalid: [releaseChannel.0.Channel]'
    reason: UpdateFailed
    status: "False"
    type: Ready

gcloud describe, however, shows the cluster to be in the STABLE channel, so it's not clear why the update would fail:

gcloud beta --project=$PROJECT container clusters describe --region=$REGION $CLUSTER 
...
releaseChannel:
  channel: STABLE

CNRM containercluster missing ipAllocationPolicy fields needed to create private GKE clusters

It looks like in CNRM 1.9.1 the CRD for ContainerCluster is missing the ipAllocationPolicy fields needed to create a private GKE cluster.

To work around that, get the CRD and change the schema to:

...
              ipAllocationPolicy:
                properties:
                  useIpAliases:
                    type: boolean
                  createSubnetwork:
                    type: boolean
                  subnetworkName:
                    type: string
                  clusterIpv4CidrBlock:
                    type: string
                  clusterSecondaryRangeName:
                    type: string
                  servicesIpv4CidrBlock:
                    type: string
                  servicesSecondaryRangeName:
                    type: string
                type: object
...
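
Roughly, the workaround could be applied like this (a sketch; the CRD name follows the CNRM naming convention and should be double-checked):

# Export the CRD, add the ipAllocationPolicy fields shown above, then re-apply it
kubectl get crd containerclusters.container.cnrm.cloud.google.com -o yaml > containercluster-crd.yaml
# ... edit containercluster-crd.yaml to include the schema above ...
kubectl apply -f containercluster-crd.yaml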

Load secrets from secret manager

Right now users supply secrets like the OAuth client ID and secret via environment variables.

A better approach would be to store the secrets in GCP Secret Manager. We could then run a process in the cluster that creates K8s secrets from Secret Manager.
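
For example, a small in-cluster job (or a manual step) could do something like the following (a sketch; the Secret Manager secret names and the K8s secret name are hypothetical):

# Pull the OAuth client credentials from Secret Manager and mirror them into a K8s secret
CLIENT_ID=$(gcloud secrets versions access latest --secret=kubeflow-oauth-client-id)
CLIENT_SECRET=$(gcloud secrets versions access latest --secret=kubeflow-oauth-client-secret)
kubectl create secret generic kubeflow-oauth -n istio-system \
  --from-literal=client_id="${CLIENT_ID}" \
  --from-literal=client_secret="${CLIENT_SECRET}"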

istio proxy sidecars have repeated warning messages: wasm log stackdriver_inbound: Stackdriver logging api call error: 13

Reproduce steps

  1. Create a Jupyter notebook instance
  2. kubectl logs istio-proxy -n
  3. Observe the error logs below

Description

I initially found these error logs when developing KFP multi-user mode, and then realized they show up in every istio sidecar container.
I'm not seeing any issues as of now, but this is probably something we need to take a look at and fix.

[Envoy (Epoch 0)] [2020-07-06 10:34:21.649][65][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1677] wasm log stackdriver_inbound: Stackdriver logging api call error: 13
2020-07-06T10:34:31.617560Z	info	token	Prepared federated token request
2020-07-06T10:34:31.622653Z	info	token	Prepared federated token request
2020-07-06T10:34:31.628111Z	info	token	Received federated token response after 10.508742ms
2020-07-06T10:34:31.628226Z	info	token	Federated token will expire in 3600 seconds
2020-07-06T10:34:31.628255Z	info	token	Prepared access token request
2020-07-06T10:34:31.633483Z	info	token	Received federated token response after 10.796653ms
2020-07-06T10:34:31.633730Z	info	token	Federated token will expire in 3600 seconds
2020-07-06T10:34:31.633792Z	info	token	Prepared access token request
2020-07-06T10:34:31.640949Z	info	token	Received access token response after 12.68243ms
2020-07-06T10:34:31.641195Z	error	token	access token response does not have access token{
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

2020-07-06T10:34:31.641220Z	warn	stsServerLog	token manager fails to generate token: access token response does not have access token. {
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

E0706 10:34:31.641426325      67 oauth2_credentials.cc:152]  Call to http server ended with error 500 [{
  "error": "invalid_target",
  "error_description": "access token response does not have access token. {\n  \"error\": {\n    \"code\": 404,\n    \"message\": \"Requested entity was not found.\",\n    \"status\": \"NOT_FOUND\"\n  }\n}\n",
  "error_uri": ""
}].
[Envoy (Epoch 0)] [2020-07-06 10:34:31.641][65][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1677] wasm log stackdriver_inbound: Stackdriver logging api call error: 13
2020-07-06T10:34:31.651403Z	info	token	Received access token response after 17.588733ms
2020-07-06T10:34:31.651550Z	error	token	access token response does not have access token{
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

2020-07-06T10:34:31.651563Z	warn	stsServerLog	token manager fails to generate token: access token response does not have access token. {
  "error": {
    "code": 404,
    "message": "Requested entity was not found.",
    "status": "NOT_FOUND"
  }
}

E0706 10:34:31.651741260      66 oauth2_credentials.cc:152]  Call to http server ended with error 500 [{
  "error": "invalid_target",
  "error_description": "access token response does not have access token. {\n  \"error\": {\n    \"code\": 404,\n    \"message\": \"Requested entity was not found.\",\n    \"status\": \"NOT_FOUND\"\n  }\n}\n",
  "error_uri": ""
}].
[Envoy (Epoch 0)] [2020-07-06 10:34:31.651][64][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1677] wasm log stackdriver_inbound: Stackdriver logging api call error: 13

[ACM] Support structured repository

Currently our ACM story doesn't use a structured repository. The main reason is that we would need to refactor the manifests, e.g. put cluster-scoped resources in one directory and namespaced resources in a different directory.

Could we reorganize the files using a kustomize function?

Using a flat repository will get pretty unmanageable.

Can we use subdirectories with an unstructured repository?
