deepsquare-io / clusterfactory

Kubernetes-based infrastructure orchestration tool that automates the process of deploying, managing and monitoring compute-optimized clusters, from bare metal servers to VMs and containers.

Home Page: https://docs.clusterfactory.io

License: Apache License 2.0

Shell 15.32% Mustache 3.52% JavaScript 2.26% CSS 8.18% TypeScript 0.09% HCL 43.95% Makefile 0.34% MDX 26.34%
clusters argocd helm k8s kubernetes

clusterfactory's People

Contributors

amnesium, darkness4, deepsquare-bot[bot], renovate[bot], squarefactory-bot[bot], stratumbespoke-mark

clusterfactory's Issues

Feature request: Version Tracking

Add a GitHub CI workflow (with a daily cron schedule) which checks the versions of the extensions/applications:

  • K0s (+k0sctl)
  • Bitnami MetalLB
  • Traefik
  • cert-manager
  • csi-driver-nfs
  • Kubevirt
  • [ ] Multus? It already always tracks master.
  • Prometheus?
  • [ ] Slurm? It is updated through Renovate.
  • [ ] xCAT? It is updated through Renovate.
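
At its core, such a check compares the version pinned in the repository with the latest upstream release. Below is a minimal sketch of that comparison only. The `normalize` and `is_newer` helpers are hypothetical (they are not part of the repository), the sketch assumes a `sort` that supports `-V` (GNU coreutils or BusyBox), pre-release suffixes are not handled specially, and fetching the upstream version is deliberately left out.

```shell
#!/bin/sh
# Hypothetical helpers for a version-tracking job (not from the repository).

# normalize VERSION: strip a leading "v" so "v1.2.3" and "1.2.3" compare equal.
normalize() {
  printf '%s\n' "${1#v}"
}

# is_newer CURRENT LATEST: succeeds when LATEST is strictly newer than CURRENT.
# Assumes `sort -V` (version sort) is available.
is_newer() {
  current=$(normalize "$1")
  latest=$(normalize "$2")
  [ "$current" != "$latest" ] &&
    [ "$(printf '%s\n%s\n' "$current" "$latest" | sort -V | tail -n 1)" = "$latest" ]
}
```

A CI step could then run something like `is_newer "$pinned" "$upstream" && echo "update available"`, where both variables are filled in by whatever lookup the job uses.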

Release: v0.6

The plan:

  • License and make public everything
  • Add documentation
  • Add CODE OF CONDUCT
  • Add GitHub issue templates
  • Style and lay out the documentation
  • Fix the Discord link.
  • Maintain GitHub Discussions.
  • Rename csquare to <country code>-<city>-<index>.deepsquare.run with the internal DNS
  • Move the OnDemand URL to .deepsquare.run with the external DNS + SSL
  • [ ] Add the YUM URL to .deepsquare.run with the external DNS + SSL. Won't do: we will think about a better YUM repository.
  • Decouple our IaC from the examples:
    • Use git push --mirror and create a branch "our cluster" or something.
  • Squash to avoid secrets leaking
  • [ ] Add Algolia to documentation
  • Add Google Analytics
  • Replace the diagrams

The CVMFS repositories cannot be renamed. (Unless we delete and recreate.)

Support MetalLB 0.13.0

MetalLB 0.13.0 is now available, with its chart bitnami/metallb 4.0.0.

configInline is now deprecated and CRDs are now preferred.

This major release includes the changes and features available in MetalLB from version 0.13.0. Those changes include the deprecation of configmaps for configuring the service and using CRDs instead. If you are upgrading from a previous version, you can follow the official documentation on how to migrate the configuration from a configMap to CRDs.

This major change impacts the Getting Started guide.
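
To give an idea of what the CRD-based configuration looks like, here is a minimal sketch that prints an `IPAddressPool` and a matching `L2Advertisement` (both from the `metallb.io/v1beta1` API). The `render_metallb_l2` helper, the pool name, the namespace and the address range are all hypothetical placeholders; the actual values depend on the cluster, and a BGP setup would need different resources.

```shell
#!/bin/sh
# render_metallb_l2 NAME NAMESPACE RANGE
# Hypothetical helper: prints a layer 2 MetalLB configuration as CRDs.
render_metallb_l2() {
  name=$1
  namespace=$2
  range=$3
  cat <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  addresses:
    - ${range}
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  ipAddressPools:
    - ${name}
EOF
}
```

The output can be reviewed and committed, or piped straight into the cluster, for example `render_metallb_l2 main-pool metallb 192.168.0.100-192.168.0.110 | kubectl apply -f -`.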

Use Network Policies to control the traffic of a namespace

Example:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - ipBlock:
            cidr: 172.17.0.0/16
            except:
              - 172.17.1.0/24
        - namespaceSelector:
            matchLabels:
              project: myproject
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 6379
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 5978

Documentation on the issue: https://hub.armosec.io/docs/c-0054

Documentation for implementation: https://kubernetes.io/docs/concepts/services-networking/network-policies/

Enable HTTP3 on Traefik

Enabling HTTP3 is easy with Traefik:

experimental:
  http3:
    enabled: true

ports:
  websecure:
    port: 443
    http3: true
    expose: true
    exposedPort: 443
    protocol: TCP
    tls:
      enabled: true

Remember to also forward UDP port 443 on the routers, since HTTP/3 runs over QUIC, which uses UDP!

Remove cfctl.yaml `extensions`

Do not use k0s extensions: they have a lot of issues and are hard to debug. We had too many "state" and deployment issues.

It is better to redeploy everything manually or with ArgoCD.

MetalLB: move to core.
Traefik: move to ArgoCD.
cert-manager: move to core.
csi-driver-nfs: move to ArgoCD.

Replace all the helm applications and use the subchart pattern

Because we are declaring all the values inside the application.yaml, this is not GitOps friendly. We have to use the subchart pattern.

Example:

Create a chart repo with only:

app-subchart/
├── Chart.yaml
└── values.yaml

Chart.yaml

apiVersion: v2
name: wordpress
description: A Helm chart for Kubernetes
type: application
version: 0.1.0
appVersion: "9.0.3"

dependencies:
  - name: wordpress
    version: 9.0.3
    repository: https://charts.helm.sh/stable
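
The tree above also lists a values.yaml. With the subchart pattern, Helm reads overrides for a dependency from a top-level key named after that dependency. The sketch below writes such a file; the `write_subchart_values` helper and the `replicaCount` override are hypothetical and only illustrate the nesting.

```shell
#!/bin/sh
# write_subchart_values DIR DEPENDENCY
# Hypothetical helper: writes DIR/values.yaml with the overrides nested
# under the dependency's name, as Helm expects for subcharts.
write_subchart_values() {
  dir=$1
  dependency=$2
  mkdir -p "$dir"
  cat > "$dir/values.yaml" <<EOF
# Values for a dependency are nested under the dependency's name.
${dependency}:
  # Hypothetical override; replace it with the real values of the chart.
  replicaCount: 1
EOF
}
```

For the example above this would be `write_subchart_values app-subchart wordpress`, followed by `helm dependency update app-subchart` to fetch the dependency locally.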

Not only does it solve our GitOps problems, it is also the recommended solution: https://github.com/argoproj/argocd-example-apps/blob/master/helm-dependency

Reference: https://www.youtube.com/watch?v=VyuVFtp2-2M

This is a heavy breaking change as it changes all the helm applications.

Feature request: Implements App of Apps pattern for ArgoCD

We have literally arranged the whole repository to support the app of apps pattern.

Basically we need to:

  1. Allow the AppProject to deploy to the argocd namespace (and maybe others).
  2. Group the Apps, Secrets, Volumes... into a Kustomize project.
  3. Create an ArgoCD Application which deploys that Kustomize project.

For example, for monitoring:

argo/monitoring/
├─ base/
│  ├─ apps/
│  ├─ ingresses/
│  ├─ volumes/
│  ├─ secrets/
│  └─ kustomization.yaml
├─ app-of-apps.yml
├─ app-project.yml
└─ namespace.yml

With this pattern, even the secrets, volumes, ingresses... get tracked by ArgoCD and we wouldn't need to use kubectl apply -k anymore (except for app-project.yml, app-of-apps.yml and namespace.yml).

We also need to rename app-project.yml to project.yml, to match the ArgoCD examples.

Reference: https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/
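
As a rough illustration of step 3, the parent Application could look like the manifest printed by the sketch below. The `render_app_of_apps` helper, the `-app-of-apps` name suffix, the repository URL and `targetRevision: HEAD` are assumptions made for the example. The destination namespace is argocd because that is where the child Application resources live.

```shell
#!/bin/sh
# render_app_of_apps NAME REPO_URL SRC_PATH
# Hypothetical helper: prints a parent ArgoCD Application that deploys the
# Kustomize project found at SRC_PATH in the Git repository REPO_URL.
render_app_of_apps() {
  name=$1
  repo_url=$2
  src_path=$3
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${name}-app-of-apps
  namespace: argocd
spec:
  project: ${name}
  source:
    repoURL: ${repo_url}
    targetRevision: HEAD
    path: ${src_path}
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
EOF
}
```

For the monitoring layout above, the call would be along the lines of `render_app_of_apps monitoring https://github.com/example/repo.git argo/monitoring/base > argo/monitoring/app-of-apps.yml`, with the URL replaced by the real repository.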

Research and implement an MVP for cluster autoscaling

Let's focus on Exoscale and a single availability group first.

There are multiple solutions, but this is my proposal. The idea is based on Slurm cloud bursting. We need to solve two problems:

  1. How do we spawn a VM that joins a private network? The private repository with Ansible can help us.
  2. How do we make the VM join the cluster? cfctl could be the answer.

Therefore, we should combine these two features, because cfctl is similar to our Ansible.

To add more details, here is the join mechanism for k0s:

  1. On a controller node, call k0s token create --role=worker to create a join token. For extra security, the token must have an expiration time: k0s token create --role=worker --expiry=1h.
  2. On the new virtual machine, via SSH, install k0s and call sudo k0s install worker --token-file /path/to/token/file.
  3. Then start: sudo k0s start.

Ejecting a node is also easy: cordon it, drain it, then run kubectl delete node. Then delete the VM.
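
The ejection sequence can be sketched as a small script. The `run` and `eject_node` helpers are hypothetical, and the `--ignore-daemonsets` and `--delete-emptydir-data` drain flags are an assumption about what a typical node needs, not something decided in this issue. By default the script only prints the commands; set `DRY_RUN=0` to execute them. Deleting the VM is left to the cloud tooling.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually run them.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# eject_node NODE: cordon, drain, then remove the node from the cluster.
eject_node() {
  node=$1
  run kubectl cordon "$node"
  # Assumed flags: skip DaemonSet pods and allow eviction of emptyDir pods.
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run kubectl delete node "$node"
}
```

Calling `eject_node worker-1` in dry-run mode lists the three kubectl commands in order, which makes it easy to review the plan before running it for real.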

This feature will certainly take time.

Clarification on job process location

When SLURM jobs are launched, where do they execute?

Are all users' job processes running on a given node executing inside that node's slurm-controller container from the StatefulSet?

i.e. https://github.com/SquareFactory/ClusterFactory/blob/main/helm/slurm-cluster/templates/slurm-controller/statefulset.yml#L110

Or is there an extra layer of indirection I'm not spotting where they get spawned independently sandboxed through k0s in their own per job + per node pod?

I suppose the former must be the case since you are using SLURM to manage cgroup for CPU and GPU affinity.

ClusterFactory looks like it checks a lot of boxes for me. Thank you for putting this together and documenting it so well.

Create a Helm repository

name: Helm CI
on:
  push:
    paths:
      - .github/workflows/helm.yaml
      - helm/**
    branches:
      - 'main'

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Install Helm
        uses: azure/setup-helm@v3

      - name: Configure Git
        run: |
          git config user.name "$GITHUB_ACTOR"
          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"

      - name: Run chart-releaser
        uses: helm/[email protected]
        with:
          charts_dir: helm
        env:
          CR_TOKEN: '${{ secrets.GITHUB_TOKEN }}'
          CR_RELEASE_NAME_TEMPLATE: 'chart-{{ .Name }}-{{ .Version }}'

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Repository problems

These problems occurred while renovating this repository. View logs.

  • WARN: Package lookup failures

Warning

Renovate failed to look up the following dependencies: Failed to look up helm package rook-ceph-cluster, Failed to look up helm package rook-ceph.

Files affected: helm-subcharts/rook-ceph-cluster/Chart.yaml, helm-subcharts/rook-ceph/Chart.yaml


Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

github-actions
.github/workflows/check-version.yaml
  • tibdex/github-app-token v2
  • actions/checkout v4
  • peter-evans/create-pull-request v6
  • ubuntu 22.04
.github/workflows/deploy-docs.yaml
  • actions/checkout v4
  • pnpm/action-setup v2
  • actions/setup-node v4
  • peaceiris/actions-gh-pages v3
  • actions/checkout v4
.github/workflows/lint-pr.yaml
  • amannn/action-semantic-pull-request v5
.github/workflows/release.yaml
  • actions/create-release v1
.github/workflows/test-deploy-docs.yaml
  • actions/checkout v4
  • pnpm/action-setup v2
  • actions/setup-node v4
helm-values
helm/389ds/values.yaml
helm/csi-driver-cvmfs/values.yaml
  • cloudve/csi-cvmfsplugin v1.0.1
  • quay.io/k8scsi/csi-node-driver-registrar v2.1.0
  • quay.io/k8scsi/csi-attacher v3.1.0
  • quay.io/k8scsi/csi-provisioner v2.1.2
helm/cvmfs-server/values.yaml
  • ghcr.io/squarefactory/cvmfs-server 1.2.0
helm/cvmfs-service/values.yaml
helm/grendel/values.yaml
helm/ipmi-exporter/values.yaml
  • docker.io/prometheuscommunity/ipmi-exporter v1.8.0
  • jimmidyson/configmap-reload v0.9.0
helm/keycloak/values.yaml
helm/slurm-cluster/values.yaml
  • ghcr.io/deepsquare-io/slurm 23.02.6-2-controller-rocky9.2
  • ghcr.io/deepsquare-io/slurm 23.02.6-2-login-rocky9.2
  • ghcr.io/deepsquare-io/slurm 23.02.6-2-rest-rocky9.2
  • ghcr.io/deepsquare-io/slurm 23.02.6-2-prometheus-exporter-rocky9.2
  • ghcr.io/squarefactory/open-ondemand 2.0.28-slurm22.05-dex
  • ghcr.io/deepsquare-io/slurm 23.02.6-2-db-rocky9.2
helm/squid/values.yaml
  • docker.io/ubuntu/squid 5.2-22.04_beta
helmv3
helm-subcharts/harbor/Chart.yaml
  • harbor 1.14.2
helm-subcharts/kube-prometheus-stack/Chart.yaml
  • kube-prometheus-stack 52.1.0
helm-subcharts/mariadb/Chart.yaml
  • mariadb 14.1.4
helm-subcharts/rook-ceph-cluster/Chart.yaml
  • rook-ceph-cluster 1.11.8
helm-subcharts/rook-ceph/Chart.yaml
  • rook-ceph 1.11.8
helm-subcharts/supervisor/Chart.yaml
  • supervisor 0.17.0
npm
web/package.json
  • @algolia/client-search 4.23.3
  • @docusaurus/core 3.3.2
  • @docusaurus/module-type-aliases 3.3.2
  • @docusaurus/plugin-google-gtag 3.3.2
  • @docusaurus/preset-classic 3.3.2
  • @docusaurus/theme-classic 3.3.2
  • @docusaurus/tsconfig 3.3.2
  • @docusaurus/types 3.3.2
  • @mdx-js/react 3.0.1
  • @tsconfig/docusaurus 2.0.3
  • clsx 2.1.1
  • prism-react-renderer 2.3.1
  • react 18.3.1
  • react-dom 18.3.1
  • typescript 5.4.5
terraform
terraform/exoscale/main.tf
terraform/exoscale/modules/k0s_instances/versions.tf
  • exoscale ~> 0.58.0
  • hashicorp/terraform >= 1.3.0
terraform/exoscale/modules/router/versions.tf
  • cidr 0.1.0
  • exoscale ~> 0.58.0
  • hashicorp/terraform >= 1.3.0
terraform/exoscale/modules/storage/versions.tf
  • exoscale ~> 0.58.0
  • hashicorp/terraform >= 1.3.0
terraform/exoscale/versions.tf
  • cidr 0.1.0
  • exoscale ~> 0.58.0
  • hashicorp/terraform >= 1.3.0
terraform/ovh/main.tf
terraform/ovh/modules/k0s_instances/versions.tf
  • cidr 0.1.0
  • openstack ~> 1.54.0
  • hashicorp/terraform >= 1.3.0
terraform/ovh/modules/router/versions.tf
  • cidr 0.1.0
  • openstack ~> 1.54.0
  • ovh 0.43.1
  • hashicorp/terraform >= 1.3.0
terraform/ovh/modules/storage/versions.tf
  • cidr 0.1.0
  • openstack ~> 1.54.0
  • hashicorp/terraform >= 1.3.0
terraform/ovh/versions.tf
  • cidr 0.1.0
  • openstack ~> 1.54.0
  • hashicorp/terraform >= 1.3.0

  • Check this box to trigger a request for Renovate to run again on this repository

Documentation: about HA.

We have successfully set up HA. We can now write a guide about it!

  • About joining
  • About uninstalling if there are errors
    • etcdctl member remove
    • k0s reset
    • multus config delete
  • About ejecting a controller
    • kubectl cordon
    • kubectl drain
    • kubectl delete
    • etcdctl member remove
    • k0s reset

The CNI plugins aren't installed by default anymore.

We need to download the plugins before the installation of k0s:

curl -fsSL https://github.com/containernetworking/plugins/releases/download/v1.1.1/cni-plugins-linux-amd64-v1.1.1.tgz -o plugins.tgz
tar -xf plugins.tgz
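
The two commands above can be wrapped so that the version and architecture are not hard-coded. This is only a sketch: the `cni_plugins_url` and `install_cni_plugins` helpers are hypothetical, and `/opt/cni/bin` is assumed as the destination because it is the conventional CNI plugin directory; adjust it if the nodes expect the binaries elsewhere. Writing to that directory normally requires root.

```shell
#!/bin/sh
# cni_plugins_url VERSION [ARCH]: print the release tarball URL.
cni_plugins_url() {
  version=$1
  arch=${2:-amd64}
  printf 'https://github.com/containernetworking/plugins/releases/download/%s/cni-plugins-linux-%s-%s.tgz\n' \
    "$version" "$arch" "$version"
}

# install_cni_plugins VERSION [DEST]: download and extract the plugins.
# DEST defaults to /opt/cni/bin (an assumption, see the text above).
install_cni_plugins() {
  url=$(cni_plugins_url "$1")
  dest=${2:-/opt/cni/bin}
  archive=$(mktemp)
  mkdir -p "$dest"
  curl -fsSL "$url" -o "$archive"
  tar -xzf "$archive" -C "$dest"
  rm -f "$archive"
}
```

For the release mentioned above, this would be `install_cni_plugins v1.1.1`, run before installing k0s.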
