
roadmap's Introduction

Tinkerbell


License

Tinkerbell is licensed under the Apache License, Version 2.0. See LICENSE for the full license text. Some of the projects used by Tinkerbell may be governed by different licenses; please refer to their specific license texts.

Tinkerbell is a CNCF project.

Community

The Tinkerbell community meets bi-weekly on Tuesday. The meeting details can be found here.


What's Powering Tinkerbell?

The Tinkerbell stack consists of several microservices and a gRPC API:

Tink

Tink is the short-hand name for tink-server and tink-worker. tink-worker and tink-server communicate over gRPC and are responsible for processing workflows. The CLI is the user-interactive piece for creating workflows and their building blocks: templates and hardware data.
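
For illustration, a minimal Template as accepted by the v1alpha1 API might look like the following (a sketch; the name, image, and command are illustrative):

apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: hello-world
  namespace: tink-system
spec:
  data: |
    version: "0.1"
    name: hello-world
    global_timeout: 600
    tasks:
      - name: "hello"
        worker: "{{.device_1}}"
        actions:
          - name: "say-hello"
            image: alpine
            timeout: 60
            command: ["echo", "hello world"]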

Smee

Smee is Tinkerbell's DHCP server. It handles DHCP requests, hands out IPs, and serves up iPXE. It uses the Tinkerbell client to pull and push hardware data. It only responds to a predefined set of MAC addresses so it can be deployed in an existing network without interfering with existing DHCP infrastructure.

Hegel

Hegel is the metadata service used by Tinkerbell and OSIE. It collects data from both and transforms it into a JSON format to be consumed as metadata.

OSIE

OSIE is Tinkerbell's default in-memory installation environment for bare metal. It installs operating systems and handles deprovisioning.

Hook

Hook is the newly introduced alternative to OSIE. It's the next iteration of the in-memory installation environment to handle operating system installation and deprovisioning.

PBnJ

PBnJ is an optional microservice that can communicate with baseboard management controllers (BMCs) to control power and boot settings.

Building

Use make help. The most interesting targets are make all (or just make) and make images. make all builds all the binaries for your host OS and CPU so they can be run directly. make images builds all the binaries for Linux/x86_64 and packages them into Docker images.

Configuring OpenTelemetry

Rather than adding a bunch of command line options or a config file, OpenTelemetry is configured via environment variables. The most relevant ones are below; for others, see https://github.com/equinix-labs/otel-init-go.

Currently this covers tracing only; metrics need to be discussed with the community.

Env Variable                  Required  Default
OTEL_EXPORTER_OTLP_ENDPOINT   no        localhost
OTEL_EXPORTER_OTLP_INSECURE   no        false
OTEL_LOG_LEVEL                no        info

To work with a local opentelemetry-collector, try the following. For examples of how to set up the collector to relay to various services, take a look at otel-cli.

export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_INSECURE=true
./cmd/tink-server/tink-server <stuff>

Website

For complete documentation, please visit the Tinkerbell project hosted at tinkerbell.org.


roadmap's Issues

Integrate Rufio with the core Tinkerbell Stack

Disclaimer: this is more of a train of thought than a well-thought-out idea, but I'd like to discuss it.

Rufio is currently an 'optional' component. An orchestration component is responsible for arranging the Rufio Jobs and Tink Workflows (this is essentially what CAPT does for provisioning Kubernetes clusters).

Is there a use-case for tighter integration between Tink Core and Rufio by supplying fields on Workflows? Would it, for example, be useful to allow users to specify some sort of "boot strategy" on a Workflow that results in a netboot?

Assuming we worked out the details of a "boot strategy", the result would be a fully automated machine provisioning solution that can be executed with a single workflow, instead of requiring an additional orchestration layer.

An example boot strategy could be "issue the series of commands to netboot every 60s until the first action transitions to 'running'".
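
As a sketch only, such a Workflow might look like the following; the bootStrategy field is hypothetical and not part of any current API:

apiVersion: tinkerbell.org/v1alpha1
kind: Workflow
metadata:
  name: provision-node-1
spec:
  templateRef: ubuntu-install
  hardwareRef: node-1
  bootStrategy:          # hypothetical field, for discussion only
    mode: netboot
    retryInterval: 60s   # reissue netboot commands until the first action is 'running'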

Support BMC actions as part of a workflow

Summary

Currently, running Rufio actions against hardware is not built into a workflow. Users must create the jobs/tasks manually. Making this built into a workflow, and optionally enabled, would be very valuable.

The initial thought here is that this would be the responsibility of the tink controller. We could start off with just a single option: something that would get a machine into a network booting state. The user experience would probably be just a boolean, for example netboot: true. This would be an opt-in feature, and it would probably only run as an initial step, not something you could specify during the course of a running workflow.
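
A sketch of what that opt-in might look like on a Workflow; the netboot field is hypothetical:

apiVersion: tinkerbell.org/v1alpha1
kind: Workflow
metadata:
  name: provision-node-1
spec:
  templateRef: ubuntu-install
  hardwareRef: node-1
  netboot: true   # hypothetical opt-in; the tink controller would create the Rufio job before the first action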

Hardware Monitoring and Alerting

Instrument monitoring and alerting of hardware managed by Tinkerbell.

Redfish may provide APIs to achieve the behavior.

Redfish is created by the DMTF: "Redfish® is a standard designed to deliver simple and secure management for converged, hybrid IT and the Software Defined Data Center (SDDC). Both human readable and machine capable, Redfish leverages common Internet and web services standards to expose information directly to the modern tool chain."

https://www.dmtf.org/standards/redfish

Tinkerbell CLI

Use-cases:

  • Easy installation of the stack.
  • Self testing a deployment.
  • Generating API objects.

Conditional Tinkerbell Template Actions

Overview

Tinkerbell defines a Template object that contains Actions. Actions represent an activity that contributes to the provisioning of a machine (the primary Tinkerbell use-case). Actions are flexible because they are OCI images that can be developed and maintained by third parties, and this flexibility carries over to Templates.

Templates themselves, however, aren't particularly flexible. For use-cases such as CAPI/CAPT, where the same template is used to provision the same kind of node (such as control plane nodes) but different steps are necessary depending on the hardware, this can be difficult to model with Templates.

Proposal

Provide control-flow capabilities in Templates that enable toggling of individual actions. This could work in a similar fashion to GitHub Actions' if statement.

# GitHub Actions example
jobs:
  job_name:
    if: EXPRESSION

The semantics of if are to run the job if the EXPRESSION evaluates to true.

We could create something similar for Tinkerbell Template actions (note: the historical concept of a 'task' has been omitted for simplicity, as it will be removed in future versions of Tinkerbell).

actions:
- name: "write-file"
  if: EXPRESSION
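
For illustration only, an expression might reference hardware facts; both the syntax and the .Hardware namespace below are assumptions, not an existing API:

actions:
- name: "partition-second-disk"
  if: '{{ gt (len .Hardware.Disks) 1 }}'   # hypothetical: run only when the machine has more than one disk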

Rationale

Adding expression capabilities with if in Tinkerbell Templates adds complexity in the form of maintenance. It's non-trivial, and the Go standard library doesn't offer expression evaluation. Leveraging third-party libraries for evaluating expressions would be ideal.

The particular CAPI/CAPT example used is quite specific to CAPI/CAPT. It's possible that a CAPT solution could be created that decouples templates from CAPT TinkerbellMachineTemplate objects and alleviates that specific problem, as there's nothing in Tinkerbell core inherently preventing a user from creating a different template for a specific kind of machine today.

Rearchitect `in_use` flag

The in_use flag found on the Hardware CRD in the Tink repository is overloaded depending on the client reading or updating it. In Cluster API it indicates a machine has been provisioned, while in the Tinkerbell stack it has loose semantics that generally prevent DHCP from being served for that Hardware.

We want to rearchitect this flag, possibly remove it in favor of simpler solutions, to make understanding the system state easier.

AuthN/AuthZ

Tinkerbell doesn't have strong AuthN/AuthZ support. This has been raised in tinkerbell/tink#507 with some ideas on how we could address it.

K8s Operator for Tinkerbell Stack Management

The Tinkerbell stack is a set of containers that could be managed by a Kubernetes operator. Initial stack deployment is reasonably trivial, but it becomes more complex with stack upgrades. We have seen instances where users implement their own logic to perform Tinkerbell stack management.
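
A hypothetical sketch of what an operator-managed resource could look like; no such API exists today:

apiVersion: tinkerbell.org/v1alpha1
kind: Stack
metadata:
  name: tinkerbell
spec:
  version: v0.10.0   # hypothetical single stack version; the operator reconciles each service to match
  services:
    smee: {}
    hegel: {}
    rufio: {}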

Automated testing for Playground deployments

Currently we don't have any automated testing for the Vagrant deployments of the Playground. We should write automated functional tests to validate the deployments as best we can.


Secure Boot

Add support across the Tinkerbell stack for secure boot.

Resource Validation

Tinkerbell's primary backend is Kubernetes, which acts as the data source for Hardware, Workflows, and Templates. When these objects are submitted to the cluster they do not undergo any validation. This theme of work is to address general issues encountered by users when submitting data to Tinkerbell.
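
One possible approach, noted here as an assumption rather than a settled design, is CEL validation rules embedded in the CRD schemas (available since Kubernetes 1.25). For example, attached at the Hardware spec level:

# sketch of a CEL rule in the Hardware CRD schema; the rule and message are illustrative
x-kubernetes-validations:
- rule: "self.interfaces.all(i, !has(i.dhcp) || !has(i.dhcp.mac) || i.dhcp.mac.matches('^([0-9a-f]{2}:){5}[0-9a-f]{2}$'))"
  message: "dhcp.mac must be a lowercase, colon-separated MAC address"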

Project (Combined with CRD Refactor)
https://github.com/orgs/tinkerbell/projects/26

Related
tinkerbell/tink#532

Documentation Updates

Significant architectural changes have happened to Tinkerbell in the last 12 months; the documentation does not reflect these changes.

To minimize documentation effort we want the following:

  1. High level documentation that helps depict the Tinkerbell system should reside on the docs website.
  2. Known setups/working hardware/supported technologies should be documented on the website.
  3. Contributing guidelines, project structure, and lower-level architectural detail should exist in project repositories. We expect this kind of documentation to remain volatile for some time, and it's easier to keep it in lock-step with the code when it lives close to the code.

Project
https://github.com/orgs/tinkerbell/projects/17

Related
Document how to run Tinkerbell in production
Add documentation for passing cloud-init metadata
Hardware Documentation - customising hardware spec

Migrate to a Single Tinkerbell Version

Tinkerbell is composed of several microservices. All Tinkerbell microservices are at semver major version 0, and we have historically made breaking changes in minor version increments. The volatile nature of major version 0 has made it difficult for users to know which versions of our services are compatible with each other.

This ticket is to track migration to a singular version that represents a set of known to work versions of the Tinkerbell microservices.

Auto enrollment of nodes

Overview

There have been various requests to auto enroll devices with some sort of MAC filtering. Auto enrollment could mean bringing a device online ready to process workflows, or it could mean defining a default workflow to be run on all devices that auto enroll.

It may be useful to think of running a default workflow as a feature that is configurable independently of auto enrolling a device. This would define auto enrollment as simply bringing a Tink Worker online on the device, and it would allow operators to define workflows manually as well as through an automated approach.
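
As a discussion aid, a hypothetical configuration separating the two features might look like:

# hypothetical configuration; none of these fields exist today
autoEnrollment:
  enabled: true
  macAllowlist:
    - "52:54:00:*"       # enroll only devices matching these MAC patterns
  defaultWorkflow:       # optional and configurable independently of enrollment
    templateRef: discovery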

Archive tinkerbell/hub and move actions to dedicated actions repository

The tinkerbell/hub repository contains Tinkerbell supported actions that are commonly used in provisioning. The repository was originally written to publish actions to https://artifacthub.io. Other code in the repository is tool based and used for generating and publishing the actions.

https://artifacthub.io explicitly advertises itself as a place to "Find, install and publish Kubernetes packages".

  • Publishing to Artifact Hub feels slightly inappropriate given actions are not Kubernetes packages; instead we can publish to quay.io or ghcr.io.
  • The remaining tooling in the repository is for action generation. Given we rarely generate actions, and those produced by third parties seem unlikely to use this tool, it doesn't seem worthwhile to maintain.
  • With the repository publishing to a new registry and most of the code removed (excluding the actions themselves), we can archive the hub repository in favor of a more easily identified repository name such as actions.

Support device restart or kexec as part of workflows

Summary

When provisioning devices, users inevitably need to restart the device (or kexec). To date, users achieve a restart through an action, which can lead to incorrect status reporting on Workflows.

When the restart action is run it instructs the kernel to perform a system restart. The restart process races against the action exiting and Tink Worker reporting the action as successful. In the case of kexec, we rarely - if ever - see the action transition to a success state. This generally leaves workflows to time out, which is misleading for users.

This issue is to track the introduction of restart/kexec to workflows as a built-in feature, removing the need to include an action. Other proposals are welcome.
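
One possible shape for such a built-in, sketched purely for discussion (the postAction field is hypothetical):

apiVersion: tinkerbell.org/v1alpha1
kind: Workflow
metadata:
  name: provision-node-1
spec:
  templateRef: ubuntu-install
  hardwareRef: node-1
  postAction: reboot   # hypothetical: Tink Worker reboots (or kexecs) after the final action reports success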

Support RKE2 deployments

Overview

RKE2 is a Kubernetes distribution developed by Rancher that targets government environments.

Several community members have expressed a desire to use RKE2 with some of them running into trouble. This ticket is to experiment and provide a clear path forward for users wanting to leverage RKE2.

Tinkerbell v1alpha2 API

The Custom Resource Definitions defined in the Tink repository were mapped from the old Postgres backend without much thought given to the data and its organization.

We've identified duplicate and hard to understand fields on CRDs. We would like to refactor the CRDs to better represent the data they contain.

Project
https://github.com/orgs/tinkerbell/projects/26

Support kubernetes secret for providing user-data to cloud-init

Context

Currently, when using Hegel, if we want to provide user-data to cloud-init we need to pass it via the Hardware spec.
For example:

apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: hw1
  namespace: tink-system
spec:
  userData: |
    #cloud-config
    ---
    user: <USERNAME>
    password: <PLAINTEXT_PASSWORD>
    chpasswd: {expire: False}
    ssh_pwauth: True

Hegel then serves the user-data on HEGEL_IP:HEGEL_PORT/2009-04-04/user-data and meta-data on HEGEL_IP:HEGEL_PORT/2009-04-04/meta-data/.

cloud-init can read this user-data and meta-data when its datasource is configured correctly.
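
For example, pointing cloud-init's EC2 datasource at Hegel might look like the following; the file path and address are illustrative:

# /etc/cloud/cloud.cfg.d/10_tinkerbell.cfg
datasource:
  Ec2:
    metadata_urls: ["http://HEGEL_IP:HEGEL_PORT"]
    strict_id: false   # allow the EC2 datasource to run on non-EC2 platforms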

This behavior works fine as long as the user-data does not contain any sensitive information, though embedding user-data in the spec can still cause formatting issues.

Proposal

If user-data contains sensitive data like passwords, license keys, etc., it might not be desirable to put it in the Hardware spec in plaintext, where it can be read by anyone with read access to the Hardware CR.

To help with this, we could move the user-data into a Kubernetes Secret object and reference that object in the Hardware spec.
Hegel can then use this Secret reference to pull the user-data.
New spec example:

apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: hw1
  namespace: tink-system
spec:
  userDataRef:
     name: <SECRET_NAME>
     namespace: <SECRET_NAMESPACE>

This approach has a few benefits:

  • We avoid putting sensitive user-data in the Hardware spec.
  • Access to this Secret can be restricted to only the required users, e.g. the cluster-admin and the Hegel service account.
  • A Secret stores data base64-encoded, which helps preserve the formatting of user-data when the secret is created directly from a user-data file (see the sketch below).
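
A minimal sketch of the referenced Secret; the userData key name is an assumption, not a settled contract:

apiVersion: v1
kind: Secret
metadata:
  name: hw1-userdata
  namespace: tink-system
type: Opaque
stringData:
  userData: |   # assumed key name; Hegel would read it via the Hardware's userDataRef
    #cloud-config
    user: <USERNAME>
    password: <PLAINTEXT_PASSWORD>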

Add support for pulling in and using Secrets

We need to design a mechanism for pulling in and using secrets.

From conversation with Nathan:

“I think we’re going to handle it by having a privileged worker simply running totally separate from unprivileged.”

We will also need a “more formal architecture and option and implementation of how you do secrets management.”

Extra notes:

  1. An out of the box "tink secrets" container
  2. Integrating with k8s-secrets or Vault
