Blueprint: Preparing a GKE cluster for apps distributed by a third party

This repository contains a blueprint that creates and secures a Google Kubernetes Engine (GKE) cluster that is ready to host custom apps distributed by a third party. The blueprint uses federated learning as an example use case for hosting custom third-party apps inside your cluster. Specifically, the blueprint creates and configures a GKE cluster and related infrastructure so that the cluster is ready to participate in cross-silo federated learning.

Federated learning is a machine learning approach that allows a loose federation of participants (for example, a group of organisations) to collaboratively improve a shared model without sharing any sensitive data. In cross-silo federated learning, each participant contributes its own data and compute resources, called a silo. Each silo trains the shared model using only its local data and compute resources. Training results are shared with the federation owner, who updates the shared model and redistributes it to the silos for further training rounds, and the process repeats. This way, silos can collaborate to improve the model without sharing data.

This blueprint suggests using a GKE cluster as the compute infrastructure for a silo. The cluster is designed to host containerised apps, distributed by the federation owner, that train the model against local data and manage interaction between the silo and the federation owner. Because these apps are created by the federation owner, they must be treated as untrusted or semi-trusted workloads within the silo cluster. Therefore, the silo cluster is configured according to security best practices, and additional controls are put in place to isolate and constrain the trainer workloads. The blueprint uses Anthos features to automate and optimise the configuration and security of the cluster.

The initial version of the blueprint creates infrastructure in Google Cloud. It can be extended to Anthos clusters running on premises or on other public clouds.

Out of scope

This blueprint is focussed on creating and configuring GKE clusters. The following items are out of scope for the blueprint:

  • Creation and orchestration of the federated learning workflows.
  • Management of the federated learning consortium.
  • Preparation of local training data.
  • Deployment and management of the federated learning apps.
  • Communication requirements between the cluster and the federation owner.

Getting started

To deploy this blueprint you need:

  • A Google Cloud project with billing enabled.
  • Owner permissions on the project.
  • Cloud Shell. The instructions assume that you deploy the blueprint from Cloud Shell.
  • Terraform. You create the infrastructure using Terraform. The blueprint uses a local backend; configure a remote backend for anything other than experimentation.

Understanding the repository structure

This repository has the following key directories:

  • terraform: contains the Terraform code used to create the project-level infrastructure and resources, such as a GKE cluster, a VPC network, and firewall rules. It also installs Anthos components into the cluster.

  • configsync: contains the cluster-level resources and configurations that are applied to your GKE cluster.

  • tenant-config-pkg: a kpt package that you can use as a template to configure new tenants in the GKE cluster.

Architecture

The blueprint uses a multi-tenant architecture. The federated learning workloads are treated as a tenant within the cluster. These tenant workloads are grouped in a dedicated namespace, and isolated on dedicated cluster nodes. This way, you can apply security controls and policies to the nodes and namespace that host the tenant workloads.
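
To make this concrete, the following is a minimal sketch of what the dedicated tenant namespace could look like. The namespace name matches the blueprint's default tenant ('fltenant1'); the sidecar-injection label is indicative only, because Anthos Service Mesh deployments often use a revision label (istio.io/rev) instead. See the configsync directory for the resources the blueprint actually applies.

```yaml
# Sketch: a dedicated namespace for the tenant workloads.
apiVersion: v1
kind: Namespace
metadata:
  name: fltenant1
  labels:
    # Enables automatic sidecar injection for the mesh; an ASM install may use
    # a revision label such as istio.io/rev: <revision> instead.
    istio-injection: enabled
```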

Infrastructure

The following diagram describes the infrastructure created by the blueprint.

The infrastructure created by the blueprint includes:

  • A VPC network and subnet.
  • A private GKE cluster. The blueprint helps you create GKE clusters that implement recommended security settings, such as those described in the GKE hardening guide. For example, the blueprint helps you:
    • Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorised networks.
    • Use shielded nodes that use a hardened node image with the containerd runtime.
    • Harden isolation of tenant workloads using GKE Sandbox.
    • Enable Dataplane V2 for optimised Kubernetes networking.
    • Encrypt cluster secrets at the application layer.
  • Two GKE node pools.
    • You create a dedicated node pool to exclusively host tenant apps and resources. The nodes have taints to ensure that only tenant workloads are scheduled onto the tenant nodes.
    • Other cluster resources are hosted in the default node pool.
  • VPC Firewall rules
    • Baseline rules that apply to all nodes in the cluster.
    • Additional rules that apply only to the nodes in the tenant node-pool (targeted using the node Service Account below). These firewall rules limit egress from the tenant nodes.
  • Cloud NAT to allow egress to the internet.
  • Cloud DNS rules configured to enable Private Google Access, so that apps within the cluster can access Google APIs without traversing the internet.
  • Service Accounts used by the cluster.
    • A dedicated Service Account used by the nodes in the tenant node pool.
    • A dedicated Service Account for use by tenant apps via Workload Identity (discussed later; the Kubernetes side of the binding is sketched below).
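
As an illustration of that last point, the following is a minimal sketch of the Kubernetes side of the Workload Identity binding: a Kubernetes service account annotated with the Google Cloud service account it impersonates. The service account names and project ID are placeholders, not the blueprint's actual values.

```yaml
# Sketch: Kubernetes service account for tenant apps, linked to a Google Cloud
# service account via Workload Identity. Names and project ID are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fltenant1-apps        # hypothetical tenant service account name
  namespace: fltenant1
  annotations:
    iam.gke.io/gcp-service-account: fltenant1-apps@YOUR_PROJECT_ID.iam.gserviceaccount.com
```

For the binding to work, the Google Cloud service account also needs an IAM binding that grants roles/iam.workloadIdentityUser to this Kubernetes service account; that kind of project-level wiring is handled by the Terraform code.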

Applications

The following diagram describes the apps and resources within the GKE cluster.

The cluster includes:

  • Config Sync, which keeps cluster configuration in sync with config defined in a Git repository.
    • The config defined by the blueprint includes namespaces, service accounts, network policies, Policy Controller policies and Istio resources that are applied to the cluster.
    • See the configsync directory for the full set of resources applied to the cluster.
  • Policy Controller enforces policies ('constraints') for your clusters. These policies act as 'guardrails' and prevent any changes to your cluster that violate security, operational, or compliance controls.
    • Example policies enforced by the blueprint include:
      • Selected constraints similar to PodSecurityPolicy
      • Selected constraints from the template library, including:
        • Prevent creation of external services (Ingress, NodePort/LoadBalancer services)
        • Allow pods to pull container images only from a named set of repos (an example constraint is sketched after this list)
    • See the resources in the configsync/policycontroller directory for details of the constraints applied by this blueprint.
  • Anthos Service Mesh (ASM) is powered by Istio and enables managed, observable, and secure communication across your services. The blueprint includes service mesh configuration that is applied to the cluster using Config Sync. The following points describe how this blueprint configures the service mesh (the key resources are sketched after this list).
    • The root istio namespace (istio-system) is configured with:
      • A PeerAuthentication resource that allows only STRICT mTLS communication between services in the mesh.
      • AuthorizationPolicies that:
        • deny all communication between services in the mesh by default,
        • allow communication to a set of known external hosts (such as example.com).
      • An Egress Gateway that acts as a forward proxy at the edge of the mesh.
      • VirtualService and DestinationRule resources that route traffic from sidecar proxies through the egress gateway to external destinations.
    • The tenant namespace is configured for automatic sidecar proxy injection (see the next section).
    • Note that the mesh does not include an Ingress Gateway.
    • See the servicemesh directory for the cluster-level mesh config.
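
To illustrate the image-repository guardrail mentioned above, here is a sketch of a constraint based on the K8sAllowedRepos template from the Policy Controller constraint template library. The constraint name, namespace and repository prefix are placeholders; the constraints actually applied are in the configsync/policycontroller directory.

```yaml
# Sketch: only allow container images from a named registry path in the tenant
# namespace. Built on the K8sAllowedRepos template; values are placeholders.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: tenant-allowed-repos
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["fltenant1"]
  parameters:
    repos:
      - "gcr.io/YOUR_PROJECT_ID/"   # only images under this path are allowed
```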
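
The two root-namespace resources that set the mesh's security baseline look roughly like the following sketch: a mesh-wide STRICT mTLS policy and an empty deny-all AuthorizationPolicy. These are standard Istio patterns; refer to the servicemesh directory for the configuration the blueprint actually applies.

```yaml
# Sketch: require STRICT mTLS for all workloads in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Sketch: deny-all baseline. An AuthorizationPolicy with an empty spec in the
# root namespace matches every workload and, having no rules, denies all requests.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: istio-system
spec: {}
```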

The blueprint configures a dedicated namespace for tenant apps and resources:

  • The tenant namespace is part of the service mesh. Pods in the namespace receive sidecar proxy containers. The namespace-level mesh resources (sketched after this list) include:
    • A Sidecar resource that allows egress only to known hosts (outboundTrafficPolicy: REGISTRY_ONLY).
    • An AuthorizationPolicy that defines the allowed communication paths within the namespace. The blueprint only allows requests that originate from within the same namespace. This policy layers on top of the root deny-all policy in the istio-system namespace.
  • The tenant namespace has network policies to limit traffic to and from pods in the namespace (a sketch follows this list). For example, the network policies:
    • By default, deny all ingress and egress traffic to and from the pods. This acts as a baseline 'deny all' rule.
    • Allow traffic between pods in the namespace.
    • Allow egress to required cluster resources such as kube-dns, the service mesh control plane and the GKE metadata server.
    • Allow egress to Google APIs (via Private Google Access).
  • The pods in the tenant namespace are hosted exclusively on nodes in the dedicated tenant node pool.
    • Any pod deployed to the tenant namespace automatically receives a toleration and nodeAffinity to ensure that it is scheduled only on a tenant node (these fields are sketched after this list).
    • The toleration and nodeAffinity are automatically applied using Policy Controller mutations.
  • The apps in the tenant namespace use a dedicated Kubernetes service account that is linked to a Google Cloud service account using Workload Identity. This way you can grant appropriate IAM roles to interact with any required Google APIs.
  • The blueprint includes a sample RBAC ClusterRole that grants users permissions to interact with a limited set of resource types. The tenant namespace includes a sample RoleBinding that grants the role to an example user (a sketch follows this list).
    • For example, different teams might be responsible for managing apps within each tenant namespace.
    • Users and teams managing tenant apps should not have permissions to change cluster configuration or modify service mesh resources.
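
The namespace-level mesh resources described above look roughly like the following sketch. The resource names are placeholders; see the configsync directory for the versions the blueprint applies.

```yaml
# Sketch: restrict sidecar egress to hosts that are known to the mesh registry.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: fltenant1
spec:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY
---
# Sketch: allow requests only if they originate from within the same namespace.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-same-namespace
  namespace: fltenant1
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["fltenant1"]
```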
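
A minimal sketch of the tenant network policies follows: a baseline deny-all policy plus an allowance for traffic between pods in the same namespace. The policy names are placeholders, and the additional egress rules (kube-dns, the mesh control plane, the GKE metadata server, Google APIs) are omitted for brevity.

```yaml
# Sketch: deny all ingress and egress for every pod in the tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: fltenant1
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
# Sketch: allow traffic between pods within the tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: fltenant1
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: {}      # any pod in this namespace
  egress:
    - to:
        - podSelector: {}      # any pod in this namespace
```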
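
For the node isolation, the fields that the Policy Controller mutations add to tenant pods look roughly like the following sketch. The taint key, node label and image are hypothetical placeholders; the real values come from the Terraform node pool configuration and the mutation resources.

```yaml
# Sketch: scheduling fields that pin a tenant pod to the dedicated tenant nodes.
apiVersion: v1
kind: Pod
metadata:
  name: example-trainer
  namespace: fltenant1
spec:
  tolerations:
    - key: "tenant"            # hypothetical taint key on the tenant node pool
      operator: "Equal"
      value: "fltenant1"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "tenant"  # hypothetical node label on the tenant nodes
                operator: In
                values: ["fltenant1"]
  containers:
    - name: trainer
      image: gcr.io/YOUR_PROJECT_ID/trainer:latest   # placeholder image
```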
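
Finally, a sketch of the kind of RBAC grant the blueprint describes: a ClusterRole limited to a small set of resource types, bound to an example user inside the tenant namespace. The role name, resource list and user are hypothetical placeholders.

```yaml
# Sketch: a limited role for teams that manage apps in a tenant namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-developer
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
# Sketch: grant the role to an example user, scoped to the tenant namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-developer-binding
  namespace: fltenant1
subjects:
  - kind: User
    name: someone@example.com
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tenant-developer
```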

Deploy the blueprint

  • Open Cloud Shell.

  • Fork or clone this repo.

  • Change into the directory that contains the Terraform code: `cd terraform`

  • Review the terraform.tfvars file and replace values appropriately.

  • Set a Terraform environment variable for your project ID: `export TF_VAR_project_id=[YOUR_PROJECT_ID]`

  • Initialise Terraform: `terraform init`

  • Create the plan and review it so you know what's going on: `terraform plan -out terraform.out`

  • Apply the plan to create the cluster; note that this may take ~15 minutes to complete: `terraform apply terraform.out`

Test

See testing for some manual tests you can perform to verify your setup.

Add another tenant

Out of the box, the blueprint is configured with a single tenant called 'fltenant1'. Adding another tenant is a two-stage process:

  1. Create the project-level infrastructure and resources for the tenant (node pool, service accounts, firewall rules and so on). You do this by updating the Terraform config and re-applying it.
  2. Configure cluster-level resources for the tenant (namespace, network policies, service mesh policies and so on). You do this by instantiating and configuring a new instance of the tenant kpt package, and then applying it to the cluster.

See the relevant section in testing for instructions.
