Blueprint: Preparing a GKE cluster for apps distributed by a third party

This repository contains a blueprint that creates and secures a Google Kubernetes Engine (GKE) cluster that is ready to host custom apps distributed by a third party. The blueprint uses federated learning as an example use case for hosting custom third-party apps inside your cluster. Specifically, the blueprint creates and configures a GKE cluster and related infrastructure so that the cluster is ready to participate in cross-silo federated learning.

Federated learning is a machine learning approach that allows a loose federation of participants (for example, a group of organisations) to collaboratively improve a shared model without sharing any sensitive data. In cross-silo federated learning, each participant contributes its own data and compute resources, called a silo. Each silo trains the shared model using only its local data and compute resources. Training results are shared with the federation owner, who updates the shared model and redistributes it to the silos for further training rounds, and the process repeats. This way, silos can collaborate to improve the model without sharing data.

This blueprint suggests using a GKE cluster as the compute infrastructure for a silo. The cluster is designed to host containerised apps, distributed by the federation owner, that train the model against local data and manage interaction between the silo and the federation owner. Because these apps are created by the federation owner, they must be treated as untrusted or semi-trusted workloads within the silo cluster. Therefore, the silo cluster is configured according to security best practices, and additional controls are put in place to isolate and constrain the trainer workloads. The blueprint uses Anthos features to automate and optimise the configuration and security of the cluster.

The initial version of the blueprint creates infrastructure in Google Cloud. It can be extended to Anthos clusters running on premises or on other public clouds.

Out of scope

This blueprint is focussed on creating and configuring GKE clusters. The following items are out of scope for the blueprint:

  • Creation and orchestration of the federated learning workflows.
  • Management of the federated learning consortium.
  • Preparation of local training data.
  • Deployment and management of the federated learning apps.
  • Communication requirements between the cluster and the federation owner.

Getting started

To deploy this blueprint you need:

  • A Google Cloud project with billing enabled.
  • Owner permissions on the project.
  • Cloud Shell. The instructions assume that you deploy the blueprint from Cloud Shell.
  • Terraform. You create the infrastructure using Terraform. The blueprint uses a local backend; configure a remote backend for anything other than experimentation.

Understanding the repository structure

This repository has the following key directories:

  • terraform: contains the Terraform code used to create the project-level infrastructure and resources, such as a GKE cluster, a VPC network, and firewall rules. It also installs Anthos components into the cluster.

  • configsync: contains the cluster-level resources and configurations that are applied to your GKE cluster.

  • tenant-config-pkg: a kpt package that you can use as a template to configure new tenants in the GKE cluster.

Architecture

The blueprint uses a multi-tenant architecture. The federated learning workloads are treated as a tenant within the cluster. These tenant workloads are grouped in a dedicated namespace, and isolated on dedicated cluster nodes. This way, you can apply security controls and policies to the nodes and namespace that host the tenant workloads.
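
To make this concrete, the following is a minimal sketch of what the dedicated tenant namespace could look like. The namespace name matches the blueprint's default tenant ('fltenant1'); the sidecar-injection label is indicative only, because Anthos Service Mesh deployments often use a revision label (istio.io/rev) instead. See the configsync directory for the resources the blueprint actually applies.

```yaml
# Sketch: a dedicated namespace for the tenant workloads.
apiVersion: v1
kind: Namespace
metadata:
  name: fltenant1
  labels:
    # Enables automatic sidecar injection for the mesh; an ASM install may use
    # a revision label such as istio.io/rev: <revision> instead.
    istio-injection: enabled
```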

Infrastructure

The following diagram describes the infrastructure created by the blueprint.

The infrastructure created by the blueprint includes:

  • A VPC network and subnet.
  • A private GKE cluster. The blueprint helps you create GKE clusters that implement recommended security settings, such as those described in the GKE hardening guide. For example, the blueprint helps you:
    • Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorised networks.
    • Use shielded nodes that use a hardened node image with the containerd runtime.
    • Harden isolation of tenant workloads using GKE Sandbox.
    • Enable Dataplane V2 for optimised Kubernetes networking.
    • Encrypt cluster secrets at the application layer.
  • Two GKE node pools.
    • You create a dedicated node pool to exclusively host tenant apps and resources. The nodes have taints to ensure that only tenant workloads are scheduled onto the tenant nodes.
    • Other cluster resources are hosted in the default node pool.
  • VPC Firewall rules
    • Baseline rules that apply to all nodes in the cluster.
    • Additional rules that apply only to the nodes in the tenant node-pool (targeted using the node Service Account below). These firewall rules limit egress from the tenant nodes.
  • Cloud NAT to allow egress to the internet.
  • Cloud DNS rules configured to enable Private Google Access, so that apps within the cluster can access Google APIs without traversing the internet.
  • Service Accounts used by the cluster.
    • A dedicated Service Account used by the nodes in the tenant node pool.
    • A dedicated Service Account for use by tenant apps via Workload Identity (discussed later; the Kubernetes side of the binding is sketched below).
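
As an illustration of that last point, the following is a minimal sketch of the Kubernetes side of the Workload Identity binding: a Kubernetes service account annotated with the Google Cloud service account it impersonates. The service account names and project ID are placeholders, not the blueprint's actual values.

```yaml
# Sketch: Kubernetes service account for tenant apps, linked to a Google Cloud
# service account via Workload Identity. Names and project ID are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fltenant1-apps        # hypothetical tenant service account name
  namespace: fltenant1
  annotations:
    iam.gke.io/gcp-service-account: fltenant1-apps@YOUR_PROJECT_ID.iam.gserviceaccount.com
```

For the binding to work, the Google Cloud service account also needs an IAM binding that grants roles/iam.workloadIdentityUser to this Kubernetes service account; that kind of project-level wiring is handled by the Terraform code.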

Applications

The following diagram describes the apps and resources within the GKE cluster.

The cluster includes:

  • Config Sync, which keeps cluster configuration in sync with config defined in a Git repository.
    • The config defined by the blueprint includes namespaces, service accounts, network policies, Policy Controller policies and Istio resources that are applied to the cluster.
    • See the configsync directory for the full set of resources applied to the cluster.
  • Policy Controller enforces policies ('constraints') for your clusters. These policies act as 'guardrails' and prevent any changes to your cluster that violate security, operational, or compliance controls.
    • Example policies enforced by the blueprint include:
      • Selected constraints similar to PodSecurityPolicy
      • Selected constraints from the template library, including:
        • Prevent creation of external services (Ingress, NodePort/LoadBalancer services)
        • Allow pods to pull container images only from a named set of repos (an example constraint is sketched after this list)
    • See the resources in the configsync/policycontroller directory for details of the constraints applied by this blueprint.
  • Anthos Service Mesh (ASM) is powered by Istio and enables managed, observable, and secure communication across your services. The blueprint includes service mesh configuration that is applied to the cluster using Config Sync. The following points describe how this blueprint configures the service mesh (the key resources are sketched after this list).
    • The root istio namespace (istio-system) is configured with:
      • A PeerAuthentication resource that allows only STRICT mTLS communication between services in the mesh.
      • AuthorizationPolicies that:
        • deny all communication between services in the mesh by default,
        • allow communication to a set of known external hosts (such as example.com).
      • An Egress Gateway that acts as a forward proxy at the edge of the mesh.
      • VirtualService and DestinationRule resources that route traffic from sidecar proxies through the egress gateway to external destinations.
    • The tenant namespace is configured for automatic sidecar proxy injection (see the next section).
    • Note that the mesh does not include an Ingress Gateway.
    • See the servicemesh directory for the cluster-level mesh config.
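
To illustrate the image-repository guardrail mentioned above, here is a sketch of a constraint based on the K8sAllowedRepos template from the Policy Controller constraint template library. The constraint name, namespace and repository prefix are placeholders; the constraints actually applied are in the configsync/policycontroller directory.

```yaml
# Sketch: only allow container images from a named registry path in the tenant
# namespace. Built on the K8sAllowedRepos template; values are placeholders.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: tenant-allowed-repos
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["fltenant1"]
  parameters:
    repos:
      - "gcr.io/YOUR_PROJECT_ID/"   # only images under this path are allowed
```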
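
The two root-namespace resources that set the mesh's security baseline look roughly like the following sketch: a mesh-wide STRICT mTLS policy and an empty deny-all AuthorizationPolicy. These are standard Istio patterns; refer to the servicemesh directory for the configuration the blueprint actually applies.

```yaml
# Sketch: require STRICT mTLS for all workloads in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Sketch: deny-all baseline. An AuthorizationPolicy with an empty spec in the
# root namespace matches every workload and, having no rules, denies all requests.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: istio-system
spec: {}
```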

The blueprint configures a dedicated namespace for tenant apps and resources:

  • The tenant namespace is part of the service mesh. Pods in the namespace receive sidecar proxy containers. The namespace-level mesh resources (sketched after this list) include:
    • A Sidecar resource that allows egress only to known hosts (outboundTrafficPolicy: REGISTRY_ONLY).
    • An AuthorizationPolicy that defines the allowed communication paths within the namespace. The blueprint only allows requests that originate from within the same namespace. This policy layers on top of the root deny-all policy in the istio-system namespace.
  • The tenant namespace has network policies to limit traffic to and from pods in the namespace (a sketch follows this list). For example, the network policies:
    • By default, deny all ingress and egress traffic to and from the pods. This acts as a baseline 'deny all' rule.
    • Allow traffic between pods in the namespace.
    • Allow egress to required cluster resources such as kube-dns, the service mesh control plane and the GKE metadata server.
    • Allow egress to Google APIs (via Private Google Access).
  • The pods in the tenant namespace are hosted exclusively on nodes in the dedicated tenant node pool.
    • Any pod deployed to the tenant namespace automatically receives a toleration and nodeAffinity to ensure that it is scheduled only on a tenant node (these fields are sketched after this list).
    • The toleration and nodeAffinity are automatically applied using Policy Controller mutations.
  • The apps in the tenant namespace use a dedicated Kubernetes service account that is linked to a Google Cloud service account using Workload Identity. This way you can grant appropriate IAM roles to interact with any required Google APIs.
  • The blueprint includes a sample RBAC ClusterRole that grants users permissions to interact with a limited set of resource types. The tenant namespace includes a sample RoleBinding that grants the role to an example user (a sketch follows this list).
    • For example, different teams might be responsible for managing apps within each tenant namespace.
    • Users and teams managing tenant apps should not have permissions to change cluster configuration or modify service mesh resources.
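
The namespace-level mesh resources described above look roughly like the following sketch. The resource names are placeholders; see the configsync directory for the versions the blueprint applies.

```yaml
# Sketch: restrict sidecar egress to hosts that are known to the mesh registry.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: fltenant1
spec:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY
---
# Sketch: allow requests only if they originate from within the same namespace.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-same-namespace
  namespace: fltenant1
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["fltenant1"]
```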
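
A minimal sketch of the tenant network policies follows: a baseline deny-all policy plus an allowance for traffic between pods in the same namespace. The policy names are placeholders, and the additional egress rules (kube-dns, the mesh control plane, the GKE metadata server, Google APIs) are omitted for brevity.

```yaml
# Sketch: deny all ingress and egress for every pod in the tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: fltenant1
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
# Sketch: allow traffic between pods within the tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: fltenant1
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: {}      # any pod in this namespace
  egress:
    - to:
        - podSelector: {}      # any pod in this namespace
```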
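
For the node isolation, the fields that the Policy Controller mutations add to tenant pods look roughly like the following sketch. The taint key, node label and image are hypothetical placeholders; the real values come from the Terraform node pool configuration and the mutation resources.

```yaml
# Sketch: scheduling fields that pin a tenant pod to the dedicated tenant nodes.
apiVersion: v1
kind: Pod
metadata:
  name: example-trainer
  namespace: fltenant1
spec:
  tolerations:
    - key: "tenant"            # hypothetical taint key on the tenant node pool
      operator: "Equal"
      value: "fltenant1"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "tenant"  # hypothetical node label on the tenant nodes
                operator: In
                values: ["fltenant1"]
  containers:
    - name: trainer
      image: gcr.io/YOUR_PROJECT_ID/trainer:latest   # placeholder image
```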
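
Finally, a sketch of the kind of RBAC grant the blueprint describes: a ClusterRole limited to a small set of resource types, bound to an example user inside the tenant namespace. The role name, resource list and user are hypothetical placeholders.

```yaml
# Sketch: a limited role for teams that manage apps in a tenant namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-developer
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
# Sketch: grant the role to an example user, scoped to the tenant namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-developer-binding
  namespace: fltenant1
subjects:
  - kind: User
    name: someone@example.com
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tenant-developer
```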

Deploy the blueprint

  • Open Cloud Shell.

  • Fork or clone this repo.

  • Change into the directory that contains the Terraform code: `cd terraform`

  • Review the terraform.tfvars file and replace values appropriately.

  • Set a Terraform environment variable for your project ID: `export TF_VAR_project_id=[YOUR_PROJECT_ID]`

  • Initialise Terraform: `terraform init`

  • Create the plan and review it so you know what's going on: `terraform plan -out terraform.out`

  • Apply the plan to create the cluster; note that this may take ~15 minutes to complete: `terraform apply terraform.out`

Test

See testing for some manual tests you can perform to verify your setup.

Add another tenant

Out of the box, the blueprint is configured with a single tenant called 'fltenant1'. Adding another tenant is a two-stage process:

  1. Create the project-level infrastructure and resources for the tenant (node pool, service accounts, firewall rules and so on). You do this by updating the Terraform config and re-applying it.
  2. Configure cluster-level resources for the tenant (namespace, network policies, service mesh policies and so on). You do this by instantiating and configuring a new instance of the tenant kpt package, and then applying it to the cluster.

See the relevant section in testing for instructions.
