tmobile / magtape

MagTape Policy-as-Code for Kubernetes

License: Apache License 2.0

magtape kubernetes policy opa tmobile admission-controller webhook policy-as-code python

magtape's Introduction


MagTape

MagTape is a Policy-as-Code tool for Kubernetes that allows for evaluating Kubernetes resources against a set of defined policies to inform and enforce best practice configurations. MagTape includes variable policy enforcement, notifications, and targeted metrics.

MagTape builds on the Kubernetes Admission Webhook concept and uses Open Policy Agent (OPA) for its generic policy language and engine.

Our goal with MagTape is to show an example of wrapping additional business logic and features around OPA's core, not to be a competitor. While MagTape is not primarily meant to be a security tool, it can easily enforce security policy.

Overview

MagTape examines kubernetes objects against a set of defined policies (best practice configurations/security concepts) and can deny/alert on objects that fail policy checks. The webhook is written in Python using the Flask framework.

Prereqs

A modern version of Kubernetes with the admissionregistration.k8s.io API enabled. Verify this with the following command:

$ kubectl api-versions | grep admissionregistration.k8s.io

The result should be:

admissionregistration.k8s.io/v1

In addition, the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers should be added and listed in the correct order in the admission-control flag of kube-apiserver.

NOTE: MagTape has been tested and is known to work for Kubernetes versions 1.13+ on various distros/cloud providers (DOKS, GKE, EKS, AKS, PKS, and KinD).

NOTE: MagTape v2.4.0+ no longer supports Kubernetes versions below v1.19.0. Please use MagTape v2.3.3 for earlier versions of Kubernetes.

Permissions

MagTape requires cluster-admin permissions to deploy to Kubernetes since it needs access to create/read/update/delete cluster scoped resources (ValidatingWebhookConfigurations, Events, etc.).

MagTape's default RBAC permissions include get, list, and watch access to Secret resources across all namespaces in the cluster. This allows for lookup of user-defined Slack Incoming Webhook URLs. If this feature is not needed, the magtape-read ClusterRole can be adjusted to remove these permissions.
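
If you do remove them, the rule to delete would look roughly like the following. This is a sketch only; the exact layout of the magtape-read ClusterRole in this repo may differ.

# Hypothetical sketch of the Secret access rule in the magtape-read ClusterRole
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list", "watch"]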

Quickstart

You can use the following command to install MagTape and the example policies from this repo with sane defaults. This won't have all features turned on as they require more configuration up front. Please see the Advanced Install section for more details.

NOTE: The quickstart installation is not meant for production use. Please read through the Advanced Install and Cautions sections, and as always, use your best judgement when configuring MagTape for production scenarios.

NOTE: The master branch of this repository is considered a working branch and may not always be in a functioning state. It's best to select a specific tag for a stable version of MagTape.

$ kubectl apply -f https://raw.githubusercontent.com/tmobile/magtape/v2.4.0/deploy/install.yaml

This will do the following:

  • Create the magtape-system namespace
  • Create cluster and namespace scoped roles/rolebindings
  • Deploy the MagTape workload and related configs
  • Deploy the example policies from this repo

Once this is complete, you can do the following to test:

Create and label a test namespace

$ kubectl create ns test1
$ kubectl label ns test1 k8s.t-mobile.com/magtape=enabled

Deploy some test workloads

# These examples assume you're in the root directory of this repo
# Example with no failures

$ kubectl apply -f ./testing/deployments/test-deploy01.yaml -n test1

# Example with deny
# You should get immediate feedback that this request was denied.

$ kubectl apply -f ./testing/deployments/test-deploy02.yaml -n test1

# Example with failures, but no deny
# While this request won't be denied, a K8s Event will be generated
# and can be viewed with "kubectl get events -n test1"

$ kubectl apply -f ./testing/deployments/test-deploy03.yaml -n test1

Beyond the Basics

Now that you've seen the basics of MagTape, try out some of the other features.

Cleanup

Remove all MagTape deployed resources

# Assumes you're in the root directory of this repo
$ kubectl delete -f deploy/install.yaml
$ kubectl delete validatingwebhookconfiguration magtape-webhook

Policies

The policy examples below are available within this repo. They can be ignored, or custom policies can be added. Policies use OPA's Rego language with a specific format to define policy metadata and the output message. This special formatting is required, as it enables the additional functionality of MagTape.

  • Liveness Probe (Check ID: MT1001)
  • Readiness Probe (Check ID: MT1002)
  • Resource Limits (Check ID: MT1003)
  • Resource Requests (Check ID: MT1004)
  • Pod Disruption Budget (Check ID: MT1005)
  • Istio Port Name/Number Mismatch (Check ID: MT1006)
  • Singleton Pods (Check ID: MT1007)
  • Host Port (Check ID: MT1008)
  • emptyDir Volume (Check ID: MT1009)
  • Host Path (Check ID: MT1010)
  • Privileged Pod Security Context (Check ID: MT2001)
  • Node Port Range (Check ID: MT2002)

More detailed info about these policies can be found here.

The policy metadata is defined within each policy similar to this:

policy_metadata = {

    # Set MagTape Policy Info
    "name": "policy-resource-requests",
    "severity": "LOW",
    "errcode": "MT1004",
    "targets": {"Deployment", "StatefulSet", "DaemonSet", "Pod"},

}
  • name - Defines the name of the specific policy. This should be unique per policy.
  • severity - Defines the severity level of a specific policy. This correlates with the DENY_LEVEL to determine if a policy should result in a deny or not.
  • errcode - A unique code that can be used, typically in reference to an FAQ, to look up additional information about the policy, what produces a failure, and how to resolve failures.
  • targets - This controls which Kubernetes resources the policy targets. Each target should be the singular of the Kubernetes resource as found in the Kind field. Special care should be taken to make sure all target resources maintain similar JSON data paths within the policy logic, or that differences are handled appropriately.
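
For orientation, a minimal policy skeleton is sketched below. The package path, rule name, and the exact shape of the emitted info object are assumptions for illustration only; use the example policies in this repo as the authoritative reference for the expected format.

package kubernetes.admission.policy_example

policy_metadata = {

    # Set MagTape Policy Info
    "name": "policy-example",
    "severity": "LOW",
    "errcode": "MT9999",
    "targets": {"Deployment", "StatefulSet", "DaemonSet", "Pod"},

}

deny[info] {
    # Only evaluate request objects this policy targets
    policy_metadata.targets[input.request.kind.kind]

    # Actual policy logic goes here (eg. flag a missing field)

    msg := sprintf("Example check failed for \"%v\"", [input.request.object.metadata.name])
    info := {
        "name": policy_metadata.name,
        "severity": policy_metadata.severity,
        "errcode": policy_metadata.errcode,
        "msg": msg,
    }
}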

Policies follow normal OPA operations for policy discovery. MagTape provides configuration to OPA to filter which configmaps it targets for discovery. If you're adding your own policies make sure to apply the following labels to the configmap:

app=opa
openpolicyagent.org/policy=rego

Example creating a policy configmap with appropriate labels from an existing Rego file

# Create a policy from a Rego file
$ kubectl create cm my-special-policy -n magtape-system --from-file=my-special-policy.rego --dry-run=client -o yaml | \
kubectl label --local app=opa openpolicyagent.org/policy=rego -f - --dry-run=client -o yaml > my-special-policy-cm.yaml

OPA will add/update the openpolicyagent.org/policy-status annotation on the policy configmaps to show they've been loaded successfully or if there are any syntax/validation issues.

Writing policies that reference resources outside of the request object

As part of the integration MagTape has with OPA, the kube-mgmt service is also deployed within the MagTape pod. In short, kube-mgmt replicates resources from the Kubernetes cluster into OPA to allow for additional context with policies. kube-mgmt requires permissions to build the resource cache and those permissions should be updated accordingly when policies are developed that expand the scope of resources needed.

Please reference the kube-mgmt documentation on caching for additional information on how to configure kube-mgmt to watch new resource types and adjust the permissions in the magtape-read clusterrole accordingly.
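
For illustration, once kube-mgmt is replicating a resource type, a policy can reference the cached data under OPA's data document. A hedged sketch (the exact path depends on how kube-mgmt replication is configured; cluster data typically lands under data.kubernetes):

# Hypothetical lookup of the request's namespace labels from the kube-mgmt cache
request_ns_labels := labels {
    ns := data.kubernetes.namespaces[input.request.namespace]
    labels := ns.metadata.labels
}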

Deny Level

Each policy is assigned a severity level: "LOW", "MED", or "HIGH". This is used to determine which policy checks result in an actual deny and which are passive (alerting only).

The Deny Level is set within the deployment via an environment variable (MAGTAPE_DENY_LEVEL) and can be set to "OFF", "LOW", "MED", or "HIGH". The Deny Level has an inverse relationship to the Severity of the defined checks, which works as follows:

Deny Level    Severities Blocked
OFF           None
LOW           HIGH
MED           HIGH, MED
HIGH          HIGH, MED, LOW

This configuration provides flexibility around controlling which checks should result in a "deny" and allows for a progressive approach as the platform and its users mature.
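
For example, to change the Deny Level on a running installation (the deployment name and namespace here are assumptions based on the quickstart install):

# Block policy failures with HIGH or MED severity
$ kubectl set env deployment/magtape -n magtape-system MAGTAPE_DENY_LEVEL=MED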

Health Check

MagTape has a rudimentary health check endpoint configured at /healthz. The endpoint returns JSON output including the name of the pod running the webhook, the datetime of the request, and the overall health. This is nothing fancy; if the Flask app is running at all, the health will report ok.
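
A quick way to poke the endpoint from a local machine (assuming a port-forward to the webhook pod as shown in the Troubleshooting section; -k skips TLS verification since the cert won't be trusted locally):

$ curl -k https://localhost:5000/healthz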

Image

MagTape uses a few images for operation. Please reference the image repos for more information on the image structure and contents.

K8s Events

K8s Events can be generated for policy failures via the MAGTAPE_K8S_EVENTS_ENABLED environment variable.

Setting this variable to TRUE will cause a Kubernetes event to be created in the target namespace of the request object when a policy failure occurs. This will provide a more native method to passively inform users on policy failures (regardless of whether or not the request is denied).
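
Both steps in command form (the deployment name and namespace here are assumptions based on the quickstart install):

# Turn on K8s Event generation
$ kubectl set env deployment/magtape -n magtape-system MAGTAPE_K8S_EVENTS_ENABLED=TRUE

# View policy failure events in a target namespace
$ kubectl get events -n test1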

Slack Alerts

Slack alerts can be enabled and controlled via the following environment variables:

  • MAGTAPE_SLACK_ENABLED
  • MAGTAPE_SLACK_PASSIVE
  • MAGTAPE_SLACK_WEBHOOK_URL_BASE
  • MAGTAPE_SLACK_WEBHOOK_URL_DEFAULT
  • MAGTAPE_SLACK_USER
  • MAGTAPE_SLACK_ICON
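
A sample configuration in the same format as the examples below (values here are illustrative only):

MAGTAPE_SLACK_ENABLED="TRUE"
MAGTAPE_SLACK_PASSIVE="FALSE"
MAGTAPE_SLACK_WEBHOOK_URL_DEFAULT="https://hooks.slack.com/services/XXXXXXXX/XXXXXXXXXXXX"
MAGTAPE_SLACK_USER="mtbot"
MAGTAPE_SLACK_ICON=":rotating_light:"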

Override base domain for Slack Incoming Webhook URL

Some airgapped environments may need to use a forwarder/proxy service to assist in sending alerts to the Slack API. The MAGTAPE_SLACK_WEBHOOK_URL_BASE environment variable allows you to override the base domain of the Slack Incoming Webhook URL to target the forwarding/proxy service. This assumes the forwarding/proxy service accepts a Slack-compliant payload and that the endpoint differs from the default Slack Incoming Webhook URL in domain only (ie. the protocol and trailing paths remain the same).

EXAMPLE:

MAGTAPE_SLACK_WEBHOOK_URL_DEFAULT="https://hooks.slack.com/services/XXXXXXXX/XXXXXXXXXXXX"
MAGTAPE_SLACK_WEBHOOK_URL_BASE="slack-proxy.example.com"

This configuration replaces hooks.slack.com with slack-proxy.example.com, and the resulting URL will be:

https://slack-proxy.example.com/services/XXXXXXXX/XXXXXXXXXXXX

NOTE: The MAGTAPE_SLACK_WEBHOOK_URL_BASE environment variable is optional; if not specified, the URL will remain unchanged from what is set in MAGTAPE_SLACK_WEBHOOK_URL_DEFAULT.

Default Alert Target

When alerts are enabled they will be sent to the Slack Incoming Webhook URL defined in the MAGTAPE_SLACK_WEBHOOK_URL_DEFAULT environment variable. This is meant to be a channel controlled by the MagTape Webhook administrators.

User-defined Alert Target

When alerts are enabled they can be sent to a user-defined Slack Incoming Webhook URL in addition to the default mentioned above. This can be configured via a Kubernetes Secret resource in a target namespace. The secret should be named magtape-slack and the Slack Incoming Webhook URL should be set as the value (typical base64 encoding) for the webhook-url key. This will allow end-users to receive alerts in their desired Slack Channel for request objects targeting their own namespace.

EXAMPLE:

$ kubectl create secret generic magtape-slack -n my-cool-namespace --from-literal=webhook-url="https://hooks.slack.com/services/XXXXXXXX/XXXXXXXXXXXX"

Alert Format

Slack alert examples:

Slack Alert Deny Screenshot

Slack Alert Fail Screenshot

NOTE: For Slack Alerts to work, you will need to configure a Slack Incoming Webhook and set the environment variable for the webhook deployment as noted above.

Metrics

Prometheus-formatted metrics are exposed on the /metrics endpoint. Tracked metrics include:

  • CPU, Memory, and HTTP error rate
  • Number of requests passed, failed, and total
  • Breakdown by namespace
  • Breakdown by policy
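
To eyeball the raw metrics without Prometheus, port-forward to a webhook pod and scrape the endpoint directly. A sketch, assuming metrics are served on the same TLS port as the webhook (-k skips verification of the locally untrusted cert):

$ kubectl port-forward <pod_name> -n magtape-system 5000:5000
$ curl -k https://localhost:5000/metrics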

Grafana dashboards showing Cluster, Namespace, and Policy scoped metrics are available in the metrics directory. An example Prometheus ServiceMonitor resource is located here.

These dashboards are simple, but serve a few purposes:

  • How busy the MagTape app itself is (ie. should the resources or replica count be increased/decreased)
  • What Namespaces seem to produce the most policy failures (Could indicate the team is struggling with certain concepts, there's something malicious going on, etc.)
  • What policies seem to be the most problematic (Maybe an opportunity to target education/training for specific topics based on the policy scope)

We've found that sometimes thinking about operations from a metrics perspective can lead you to develop a policy that is more about tracking how frequently some action occurs rather than explicitly if it should be allowed or denied. Your mileage may vary!

Testing

  • Create namespace for testing and label it appropriately

    $ kubectl create ns test1
    $ kubectl label ns test1 k8s.t-mobile.com/magtape=enabled
  • Deploy test deployment to Kubernetes cluster

    $ kubectl apply -f test-deploy02.yaml -n test1

    NOTE: MagTape should deny this workload and should provide feedback similar to this:

    $ kubectl apply -f test-deploy02.yaml -n test1
    
    Error from server: error when creating "test-deploy02.yaml": admission webhook "magtape.webhook.k8s.t-mobile.com" denied the request: [FAIL] HIGH - Found privileged Security Context for container "test-deploy02" (MT2001), [FAIL] LOW - Liveness Probe missing for container "test-deploy02" (MT1001), [FAIL] LOW - Readiness Probe missing for container "test-deploy02" (MT1002), [FAIL] LOW - Resource limits missing (CPU/MEM) for container "test-deploy02" (MT1003), [FAIL] LOW - Resource requests missing (CPU/MEM) for container "test-deploy02" (MT1004)

Test Samples Available

Info on testing resources can be found in the testing directory

NOTE: These manifests are meant to test deploy-time validation; some pods related to these test manifests may fail to come up properly. A failing pod doesn't represent an issue with MagTape.

Cautions

Production Considerations

  • By default, the MagTape Validating Webhook Configuration is set to fail "closed", meaning that if the webhook is unreachable or doesn't return an expected response, requests to the Kubernetes API will be blocked. Please adjust the configuration if this doesn't fit your environment (see the example after this list).
  • MagTape supports operation with multiple replicas that can increase availability and performance for critical clusters.
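
If fail-open behavior is preferred, the failurePolicy on the webhook can be flipped to Ignore. A sketch, assuming the webhook is the first entry in the configuration:

# Make the magtape-webhook fail "open" (use with care; requests that
# would normally be denied will be admitted while the webhook is down)
$ kubectl patch validatingwebhookconfiguration magtape-webhook --type=json \
  -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'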

Break Glass Scenarios

MagTape can be enabled and disabled on a per namespace basis by utilizing the k8s.t-mobile.com/magtape label on namespace resources. In emergency situations the label can be removed from a namespace to disable policy assessment for workloads in that namespace.

If there are cluster-wide issues you can disable MagTape completely by removing the magtape-webhook Validating Webhook Configuration and deleting the MagTape deployment.
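
In command form (the test namespace and the deployment name here are examples/assumptions based on the quickstart install):

# Disable policy assessment for a single namespace by removing the label
$ kubectl label ns test1 k8s.t-mobile.com/magtape-

# Disable MagTape cluster-wide
$ kubectl delete validatingwebhookconfiguration magtape-webhook
$ kubectl delete deployment magtape -n magtape-system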

Troubleshooting

Certificate Trust

The ValidatingWebhookConfiguration needs to have a CA bundle that includes the CA that signed the TLS cert used to secure the MagTape webhook. Without this, the required trust between the K8s API and the webhook will not exist and the webhook won't function correctly. More info is available here.
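
One way to sanity-check the trust relationship is to decode the CA bundle registered with the K8s API and inspect it (a sketch; assumes a single webhook entry in the configuration):

$ kubectl get validatingwebhookconfiguration magtape-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | \
  openssl x509 -noout -subject -issuer -dates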

Access MagTape API from local machine

$ kubectl get pods # to get the name of the running pod
$ kubectl port-forward <pod_name> -n <namespace> 5000:5000

Use Curl to perform HTTP POST to MagTape

$ curl -vkX POST https://localhost:5000/ -d @test.json -H "Content-Type: application/json"

Follow logs of the webhook pod

$ kubectl get pods # to get the name of the running pod
$ kubectl logs <pod_name> -n <namespace> -f

magtape's People

Contributors

dependabot[bot], freakin, ilrudie, jsteichen12, kamleshjoshi8102, phenixblue, pramod74, xytian315


magtape's Issues

Fix DockerHub Secret References

What happened:

It looks like the secrets for referencing the DockerHub username/password got changed in the workflow file, but not in the repo. I can't see why the name change was necessary, so we need to make the workflow match what's configured already.

What you expected to happen:

Image Builds for releases will succeed

Add support for server-side warnings for K8s v1.19+

What would you like to be added:

Add functionality to take advantage of the server-side warnings enabled in Kubernetes v1.19.

More info:

Why is this needed:

This will allow surfacing policy failures back to the client (kubectl/client-go), even in cases where the admission response is not a denial.

Add HPA for MagTape

What would you like to be added:

Add a Horizontal Pod Autoscaler resource to the MagTape deployment artifacts.

Why is this needed:

This will be used to scale out replicas vs scaling up workers/threads per pod. This is related to the remediation of the issue noted in #48

Instrument Distributed Tracing for MagTape

What would you like to be added:

Instrument Distributed Tracing for the MagTape application. Ideally using OpenTelemetry packages.

Why is this needed:

To give more robust telemetry collection for MagTape. This is helpful for development of new features with performance in mind, tracking regressions between releases, for tracking the impact of additional policies over time, and general troubleshooting of performance related issues.

Reorganize Rego tests/mock data

What would you like to be added:

Would like to reorganize the Rego policy unit tests and mocked data. We should add some additional verbiage specific to policy contributions in CONTRIBUTING.md (expected file layout, test coverage, etc.)

Why is this needed:

  • Having the tests in their own packages seems against standards
  • Having the mocked data inline with the tests can be a bit busy/doesn't lend well to reuse.

Verify all K8s resource manifests have standard labels

What would you like to be added:

Need to make sure all kubernetes resource manifests in ./deploy/manifests have a standard set of labels:

  • app=magtape
  • resource=<resource_type> (ie. resource=deployment)

Why is this needed:

Standardization to easily identify all installed MagTape resources.

Extend NodePort policy functional testing

What would you like to be added:

Since #45 has now been merged we should be able to extend the functional testing for the NodePort policy.

Why is this needed:

We're currently not accurately testing the NodePort policies.

These tests require specific annotations on the target namespace for testing and will require a specifically formatted script that adheres to the pattern specified by the new functional testing framework.

Review CI for pinning utilities to specific versions

What would you like to be added:

During work on #77 I encountered an issue with a change to the version of the kubectl utility used in the ubuntu-latest GitHub Actions image. v1.19.0 seemed to produce errors for the compare-manifest CI job. I added a ci-bootstrap Make target to download a specific version of kubectl and replace the default in the container image.

Why is this needed:

We should review the work I did and assess the process for more long-term usage and to extend to any other utilities we want (opa, kustomize, kind, etc.). The Gatekeeper project had some examples of doing this that can be used for reference.

Add matrix to test multiple K8s versions

What would you like to be added:

Add a matrix for Kubernetes versions to the e2e CI check Action

Why is this needed:

We need to identify a target range of Kubernetes versions to test each release against/support. We currently only test against the latest (currently 1.18.2) in the version of KinD that's used in the e2e check Action.

Add Shellcheck CI checks

What would you like to be added:

Need to add Shellcheck CI Checks for all bash scripts in the repo.

Why is this needed:

To ensure Bash scripts adhere to a standard for consistency and best practices. This should help to maintain trusted tooling within the repo.

Add multi-arch image builds

What would you like to be added:

We currently build container images for magtape and magtape-init for amd64 architecture only. We should start building for arm64 and ppc64le at a minimum. Probably good to check out a few other projects that are doing multi-arch builds and include any other architectures that seem relevant.

Why is this needed:

Wider support for hardware architectures that are gaining popularity within the Kubernetes community.

Support arm64 Architecture

What would you like to be added:

We need to have MagTape support deployment to arm64 based cluster environments.

We have multi-arch builds of the magtape-init and magtape container images, but we need supported images for opa and kube-mgmt as well.

Related to open-policy-agent/opa#2233 for arm64 support with OPA.

Why is this needed:

Further deployment flexibility

Migrate to Gunicorn WSGI Server

What would you like to be added:

Currently the native Flask HTTP server is used within the container image for MagTape. This is not ideal for production use and should be updated to Gunicorn or some other production-ready WSGI server.

Why is this needed:

Better performance and resiliency

Disable name suffix in configmapGenerator

What would you like to be added:

Need to disable the name suffix for the configmapGenerator in the base kustomization.yaml

Why is this needed:

This is needed to be consistent with the advanced install workflow.

Move end-user Slack Webhook URL to Secret

What would you like to be added:

Move end-user Slack Webhook URL to Secret

Why is this needed:

Currently an end-user can supply their own Slack Incoming Webhook URL as an annotation on their namespace to direct alerts at their own Slack channel for policy violations within their namespace. As the Slack Incoming Webhook URL is considered sensitive, this should be moved to a Secret resource.

The two ideas I have for this are:

  • Use a namespace label to specify a Secret resource to read the information from (ie. k8s.t-mobile.com/slack-webhook-secret: <my_custom_secret_name>)
    • The expected namespace label should be globally configurable via ENV var similar to MAGTAPE_SLACK_USER_LABEL
  • Use a consistent Secret resource name (ie. magtape-slack-secret)
    • The expected secret name should be globally configurable via ENV var similar to MAGTAPE_SLACK_USER_SECRET

Cleanup:

The end goal should involve the cleanup of the existing configs/tests. For example:

  • Remove the existing MAGTAPE_SLACK_ANNOTATION ENV var

Fix Bring Your Own Cert Docs

What would you like to be added:

The references to the MAGTAPE_TLS_SECRET environment variable should be removed and the documentation for the "Bring Your Own Cert" (BYOC) model needs to be corrected.

The BYOC model requires an annotation on the magtape-tls secret. Details are within the magtape-init code.

Why is this needed:

To correctly describe the BYOC scenario and configuration.

Add Rego test automation

What would you like to be added:

Need to add Rego unit tests to CI checks

Why is this needed:

Increase automated checking to build a higher level of confidence in policies and related changes.

Extend testing options for functional-test automation

What would you like to be added:

Currently there's not a good way to perform per test setup/breakdown for things outside of k8s artifacts.

Why is this needed:

Some tests may require setting up specific scenarios before/after a given functional test. Example:

NodePort test should add an annotation to the target namespace and then remove it when done.

Ideally this is done in a generic way where each resource type can have a setup/breakdown related hook.

Allow specification of more diverse kubectl verbs in functional testing

What would you like to be added:
Investigate update to functional testing to allow verbs such as kubectl create instead of only supporting apply.

Why is this needed:
Adds capability to the CI to test a more diverse spectrum of policies covering a wider variety of changes which might be requested against a cluster.

Add documentation for fix release procedure

What would you like to be added:

Need to enhance contributor docs to include steps for fix release procedures.

Why is this needed:

To provide a consistent maintainer experience when backporting bug/security fixes.

Add ability to disable policy per namespace

What would you like to be added:

It would be nice to have the ability to disable individual policies on a per-namespace basis.

Why is this needed:

This allows for flexibility to granularly disable policies without the need to completely remove a policy from a cluster, or having to lower the severity level of the policy at a global cluster level to meet the needs of a specific namespace.

Ideally I think something along the lines of a label on the namespace could work. Something like:

k8s.t-mobile.com/magtape-disable-<policy_name>: true

If the label is found for a specific policy, we need to add logic to skip denies but still track failures for alerting/event creation. Because this is primarily valid in environments where end-users can't manipulate their assigned namespace resource, this should also be a global toggle with an ENV var similar to MAGTAPE_ENABLE_NS_TOGGLE.

Add metric for Webhook Cert Expiration

What would you like to be added:

Would like to have a background process/sidecar to track a metric for Webhook cert expiration (ie. Num days left)

Why is this needed:

General observability concerns and tracking for lifecycle touch points (certificate rotation).

Linter check fails without any Python changes

What happened:

The lint job in the python-checks workflow is failing when no changes to Python files are made.

What you expected to happen:

Linting should pass if no changes to Python files have been made

How to reproduce it (as minimally and precisely as possible):

Open any PR; the check will fail.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

Add logic to handle Github Action Workflow dependencies on releases

What would you like to be added:

Add logic to handle Github Action Workflow dependencies on releases

Why is this needed:

Currently the release flow executes e2e tests prior to the new container image build because the same flow is used for PR/push to master. Need to make sure image build happens before e2e tests for release prep (push to master). This may require pre-release image builds or refactoring the Github Action workflows overall.

Add support for post-assessment Webhook

What would you like to be added:

Add functionality to allow for calling a user defined endpoint for policy failures (possibly passes as well).

Not sure if the granularity should be a single global configuration for a MagTape installation, different endpoint per policy, etc.

  • Should be bypassed if no config is provided
  • Should have a timeout value and should not cause a failure in the policy assessment if the call to the endpoint fails
  • Ideally this can happen asynchronously and be non-blocking to the end-user request

Why is this needed:

To allow for integration with existing systems for alerting/reporting.

Need to update README for Testing

What would you like to be added:

The testing README located here needs to be updated for the most recent changes to the functional testing framework.

Reference #45 for more context

Why is this needed:

The existing info is slightly outdated given the changes to functional-tests.yaml

Pods crash when scheduled on nodes with >24 CPUs

What happened:

Installing and running MagTape on worker nodes with 24 or more CPUs generates a high number of threads with Gunicorn and there appears to be a memory leak of some sort.

What you expected to happen:

Pods to startup normally

How to reproduce it (as minimally and precisely as possible):

Run the simple install in a cluster with worker nodes that have 24 or more CPUs

Anything else we need to know?:

Experienced on worker nodes with 24 cores x 128GB RAM

Example output from MagTape container logs:

[2020-10-02 04:52:27 +0000] [107] [INFO] Booting worker with pid: 107
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:62)
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:63)
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:64)
[2020-10-02 04:52:27 +0000] [62] [INFO] Worker exiting (pid: 62)
[2020-10-02 04:52:27 +0000] [64] [INFO] Worker exiting (pid: 64)
[2020-10-02 04:52:27 +0000] [63] [INFO] Worker exiting (pid: 63)
[2020-10-02 04:52:29 +0000] [108] [INFO] Booting worker with pid: 108
[2020-10-02 04:52:29 +0000] [109] [INFO] Booting worker with pid: 109
[2020-10-02 04:52:30 +0000] [1] [INFO] Unhandled exception in main loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 211, in run
    self.manage_workers()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
    self.spawn_workers()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
    self.spawn_worker()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 567, in spawn_worker
    pid = os.fork()
OSError: [Errno 12] Out of memory

Environment:

  • Kubernetes version (use kubectl version): v1.15.5
  • Worker Node OS: Ubuntu 16.04
  • Cloud provider or hardware configuration:
  • Others:

Update CHANGELOG and release docs post-v2.3.0 release

What would you like to be added:

  • Add changes to CHANGELOG.md noted from v2.3.0 release
  • Add clarification to release docs based on feedback from the v2.3.0 release

Why is this needed:

We made some last minute updates to the changelog for the v2.3.0 release and need to make sure those changes make their way back into the actual CHANGELOG.md file.

Also noted some gaps in the release documentation that need to be added.

Enhance docs around kube-mgmt cache customization

What would you like to be added:

Need to add some verbiage to the docs that covers the possibility of adjusting resources (CPU/MEM) as you add additional Kubernetes resources to be replicated by kube-mgmt.

Why is this needed:

I've noticed on some large clusters with many objects that the HPA kicks in automatically following an initial deployment of MagTape and remains at max replicas. This seems to be associated with kube-mgmt replication (ie. more resource types/number of a given type of resource on the cluster).

Possibly look into adjusting the sync interval or adjusting resource allocation for the kube-mgmt container. At the very least we need to call it out in the docs.

Enable functional tests to have a descriptive name

What would you like to be added:
Allow a descriptive name to be associated with each functional test which can be printed during testing.

Why is this needed:
Enable more meaningful output during functional testing to make it easier to determine exactly what is being tested.

Documentation update:
Once this is implemented and descriptive names are set for each test the Test Samples Available table can be removed from readme.md

Review CI for Required Checks

What would you like to be added:

Need to review CI configuration with regards to required checks. I've seen some other projects that have minimal required checks (vs. almost all checks for MagTape being required).

Why is this needed:

If we can unmark certain checks as required, we can add path filters to minimize CI checks for small things like docs updates, etc. (ie. No need to run e2e checks if only docs are updated, or no need to run Rego checks if no Rego files are touched in a PR).

Need to add versioning for QuickStart install link in README

What would you like to be added:

We need to add a versioned reference for the install.yaml linked in the main README.

Why is this needed:

Currently visitors to the repo will pull the latest install.yaml linked to the master branch, which could be under active development/not in a working state. Moving to a versioned reference provides a higher degree of stability to the casual visitor and allows us to maintain active development on the master branch.

This should be tied to the set-release-version make target for updating on new releases.

Bump KinD node Image versions used in CI

What would you like to be added:

Bump the KinD node image version to be on the latest dot releases for each currently supported minor version.

KinD node images are defined here in our e2e checks workflow:

kind-node-image: kindest/node:v1.19.1@sha256:98cf5288864662e37115e362b23e4369c8c4a408f99cbc06e58ac30ddc721600

Reference the KinD release pages for current node images.

Why is this needed:

There have been some security fixes and associated new Kubernetes upstream releases

Update PR Template

What would you like to be added:

  • Need to add /kind release to PR type section in PR template

Why is this needed:

Additional relevance for the PR template that should make the contributor experience a bit better.

Add documentation for customizing webhook label

What would you like to be added:
Need to add documentation on how to customize the namespace label used with the webhook labelselector.

Why is this needed:
Additional flexibility in customization

Fix typos in Policies Doc

What would you like to be added:

Fix a few typos in the Policies doc.

  • NodePort policy: The nodePort annotation on the namespace should be "k8s.t-mobile.com/nodeportRange". Set the annotation to "na" if no nodePort range is to be set; that is treated as an exception value.
  • emptyDir policy: Using emptyDir leads to consumption of ephemeral storage on the underlying nodes and can fill up easily, affecting others on the platform.

There are probably others, so a good overall review would be nice.

Why is this needed:

Make our docs clear and understandable.

Investigate porting magtape-init to Go

What would you like to be added:

Investigate effort level/advantages for moving the magtape-init code to Golang.

Why is this needed:

This came up in conversation for a couple of reasons:

  • Migrating the core magtape code to Go in order to consume the OPA Go library and move away from the sidecar
  • Potential simplification of TLS bits in init process and better support for extended cert/key validation

Add more detail around contrib scenarios

What would you like to be added:

Need to add additional documentation around different contributor scenarios.

Examples:

  • Run linting/formatting if you edit Python files (#80)
  • Run linting/formatting if you edit Rego files (#60/#80)
  • Rebuild install manifest if you edit YAML manifests (#80)

Why is this needed:

Lower barrier to entry for new contributors and to help me not have to remember!

Add conditional CI Checks

What would you like to be added:

Need to add some conditional logic to certain CI checks to increase efficiency for some PR's.

Why is this needed:

Not all checks need to run on a given PR unless certain files change. Github Actions has filtering capabilities, but it breaks things if you enable that with a required check (https://github.community/t5/GitHub-Actions/Feature-request-conditional-required-checks/m-p/36938#M2735) and then the check doesn't trigger in a PR. Until there's a solution from Github we may need to add the conditional logic into the CI checks themselves.

Use something like this in a helper function that lives somewhere in ./hack. This should be generic enough to work for any number of directory/file paths and for any CI check.

$ git --no-pager diff --name-only --ignore-blank-lines HEAD $REF -- app

Probably not an exhaustive list, but good to start collecting the specific paths we want to trigger on for each set of checks:

Python Checks

    - /app
    - /.github/workflows

e2e Checks

    - /app
    - /deploy/manifests
    - /policies
    - /hack
    - /.github/workflows

Manifests Checks

    - /deploy/manifests
    - /hack
    - /.github/workflows

Investigate porting magtape core code to Go

What would you like to be added:

Investigate migrating core MagTape code from Python to Golang

Need to know general idea of functionality with OPA Go library and assess UX in project lifecycle improvements as well as installation/testing of MagTape.

Why is this needed:

Potential usability simplification and performance increase by consuming the OPA Go library and moving away from the sidecar

Install times out waiting for CSR approval

What happened:
Ran kubectl apply -f https://raw.githubusercontent.com/tmobile/magtape/master/deploy/install.yaml
magtape-init on Kubernetes 1.18 (KinD) timed out waiting for the CSR to be approved

What you expected to happen:
MagTape to deploy on my test system

How to reproduce it (as minimally and precisely as possible):
Running a 1.18 version of Kubernetes, apply the install.yaml.

Anything else we need to know?:
INFO: Waiting for certificate approval
INFO: Timed out reading certificate request "magtape-svc.magtape-system.cert-request"
forbidden: user not permitted to approve requests with signerName "kubernetes.io/legacy-unknown"","reason":"Forbidden"

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: v1.16.6-beta.0
    Server Version: v1.18.2
  • Cloud provider or hardware configuration:
  • Others:
    kind v0.8.1 go1.14.6 darwin/amd64

Add descriptions to functional tests

What would you like to be added:

Add descriptions for all existing functional tests to support the new functionality added in #86

May also be good to add some comments to the contrib docs around this pattern now that we have a more solid structure to the testing framework.

Why is this needed:

More descriptive output to track what each functional test is actually testing for in a simple human readable format.

Background scanning for policy violations

What would you like to be added:

A mechanism to scan and alert on Kubernetes resources that are already deployed in the cluster (ie. past the initial admission control workflow).

  • Probably needs to run in a configurable interval
  • Could be background daemon or sidecar, or a completely separate pod.
  • Could be a good thing to look at doing in Golang
  • Maybe think through the possibility of an enforcement action in addition to alerts (ie. scale to 0 pods on a Deployment with a privileged pod spec)
  • Not sure if we'd want a separate severity/deny level for the background scanning vs. the admission response flow

Why is this needed:

This would cover brownfield environments or scenarios where new policies are added/policy severity changes and resources may be long-lived/deployed infrequently

Validate cert/key pairs for init workflow

What would you like to be added:

Add functionality to the magtape-init workflow to validate TLS Cert/Key relationship for both self-generated pairs and the BYOC mode.

Why is this needed:

To catch init errors sooner in the process to make the UX smoother and more robust.
