eclipse-bluechi / bluechi

Eclipse BlueChi is a systemd service controller intended for multi-node environments with a predefined number of nodes and with a focus on highly regulated ecosystems such as those requiring functional safety.

Home Page: https://bluechi.readthedocs.io/en/latest/

License: GNU Lesser General Public License v2.1

Languages: Makefile 0.19%, C 62.02%, Meson 1.65%, Shell 1.72%, Roff 1.14%, Python 33.29%
Topics: containers, controller, linux, podman, services, systemd

bluechi's People

Contributors

alexlarsson, artiomdivak, darth-mera, dofmind, dougsland, dracher, engelmi, ericcurtin, eriksjolund, ewchong, iiqbal2000, irishair, lsm5, mkemel, mwperina, pbrilla-rh, psss, pypingou, raballew, rhatdan, sandrobonazzola, sdunnagan, yarboa, ygalblum


bluechi's Issues

Configuration: Add configuration file location as CLI param to node

Please describe what you would like to see

Currently, the location of the configuration file is hard-coded (see here). The location of the config file for the node should be a CLI parameter like it currently is in the orchestrator.

Please describe the solution you'd like

The implementation in the orchestrator can be used as a reference.

Out of scope: passing the hashmap values to the NodeParams or OrchParams

Move manager dbus API to system bus

Currently the dbus API that the manager exports is on the user/session bus, which is nice when testing as it means we can run things as the user. However, the proper place for this is the system bus, so when things are more stable, we should move it there.

e2e Testing infrastructure

Workflow for e2e Testing

As a developer I would like to see e2e testing in gating as part of merge requests.

How to achieve this?

Simulate different compute resources with containers, running systemd in the container to enable dbus messaging among nodes.

Test framework use case

e2e tests should cover the minimum essential operations:

  • hirte is up with the default configuration
  • hirte-agent is up and registered

Port the linked list macros from the POC

Please describe what you would like to see

Port the various LIST_ macros from orch.h

Acceptance Criteria

Maybe throw these macros in a header in the new project, and add a unit test for them.
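
A minimal sketch of what these macros could look like (names and shapes assumed from the typical systemd-style LIST_ helpers, not copied from orch.h):

#include <stddef.h>

/* Hypothetical sketch of intrusive linked-list macros in the style of
 * systemd's LIST_ helpers. */
#define LIST_HEAD(t, name) t *name
#define LIST_FIELDS(t, name) t *name##_next, *name##_prev

#define LIST_INIT(name, item)                             \
        do {                                              \
                (item)->name##_next = NULL;               \
                (item)->name##_prev = NULL;               \
        } while (0)

#define LIST_PREPEND(name, head, item)                    \
        do {                                              \
                (item)->name##_next = (head);             \
                if ((head) != NULL)                       \
                        (head)->name##_prev = (item);     \
                (item)->name##_prev = NULL;               \
                (head) = (item);                          \
        } while (0)

#define LIST_FOREACH(name, i, head) \
        for ((i) = (head); (i) != NULL; (i) = (i)->name##_next)

Usage would then be LIST_FIELDS(Node, nodes) inside the struct, LIST_HEAD(Node, nodes) for the owner, and LIST_FOREACH(nodes, n, head) to iterate.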

Move from POC: [Node] connect to systemd

Please describe what you would like to see

The node needs to be connected to systemd in order to be able to re-/start/stop and monitor units.

Please describe the solution you'd like

For reference, see the implementation from the POC.
Maybe adding a simple unit test (just a dedicated <xyz>_test.c file containing a main()) can help to verify that the connection works.

Out of scope: adding listeners or services
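
A minimal sketch of such a connection using the sd-bus API (the function name is an assumption; error handling kept simple):

#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

/* Sketch: connect the node to the systemd instance via the system bus. */
static int node_connect_to_systemd(sd_bus **ret_bus) {
        sd_bus *bus = NULL;

        int r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                return r;
        }

        *ret_bus = bus;
        return 0;
}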


Node: connect to orchestrator

Please describe what you would like to see

The node needs to create a DBus connection and connect it to the orchestrator via TCP.

Please describe the solution you'd like

The node uses the systemd dbus library to create a new DBus connection to the orchestrator via TCP. The required inputs, host and port, are either

  • given via cli option or
  • read from the configuration file (also given to the node via cli option, e.g. node -c /path/to/config)

Please describe your use case

We use TCP to remotely connect the nodes to the orchestrator on a different machine.
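
A rough sketch of such a TCP connection with sd-bus (the address format tcp:host=...,port=... is supported by sd_bus_set_address(); the function and parameter names are assumptions):

#include <stdio.h>
#include <systemd/sd-bus.h>

/* Sketch: create a peer-to-peer sd-bus connection to the orchestrator over
 * TCP. Host and port come from the CLI option or the configuration file. */
static int node_connect_to_orchestrator(const char *host, const char *port, sd_bus **ret_bus) {
        sd_bus *bus = NULL;
        char address[256];
        int r;

        snprintf(address, sizeof(address), "tcp:host=%s,port=%s", host, port);

        r = sd_bus_new(&bus);
        if (r < 0)
                return r;

        r = sd_bus_set_address(bus, address);
        if (r < 0)
                goto fail;

        r = sd_bus_start(bus);
        if (r < 0)
                goto fail;

        *ret_bus = bus;
        return 0;

fail:
        sd_bus_unref(bus);
        return r;
}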

Move fmt and lint to meson

Please describe what you would like to see

Instead of using makefile targets for formatting and linting (make fmt and make lint), these tasks should be integrated into meson to have a single tool for running everything.

Please describe the solution you'd like

One way to achieve this could be to move the current makefile targets to bash scripts (e.g. hack/fmt.sh, hack/lint.sh, etc.) and create a meson target as described here. For example, having the target

run_target('fmt', command : ['hack/fmt.sh'])

would be called like this:

meson compile fmt

Acceptance Criteria:

  • Makefile removed
  • formatting (+check) can be triggered via meson
  • linting (+auto fix) can be triggered via meson
  • CI steps are updated

Implement `hirtectl list-units`

hirtectl should have a command list-units that calls ListUnitFiles either on the manager or, if you specify a node name, on that particular node.

Implement logging

Please describe what you would like to see

The orchestrator and node output logs, and the verbosity (debug, info, etc.) can be defined on start.

Please describe the solution you'd like

Probably a simple implementation similar to https://github.com/rxi/log.c/blob/master/src/log.h
Looking at systemd's logging, it currently seems too complex and feature-rich for our simple use case.

The verbosity should be passed as a CLI option, e.g. -v <level> (where level is a simple number).

As soon as there is an implementation, all fprintf calls etc. should be replaced accordingly.

Please describe your use case

It would be great to have an easy way of logging to get an understanding of how the system behaves while it is running.
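
A minimal sketch of what such a logger could look like (names like hirte_log and LOG_LEVEL_* are assumptions, in the spirit of rxi/log.c):

#include <stdarg.h>
#include <stdio.h>

typedef enum { LOG_LEVEL_DEBUG, LOG_LEVEL_INFO, LOG_LEVEL_WARN, LOG_LEVEL_ERROR } LogLevel;

/* set from the -v <level> CLI option */
static LogLevel log_level = LOG_LEVEL_INFO;

static const char *const log_level_names[] = { "DEBUG", "INFO", "WARN", "ERROR" };

/* Sketch: print the message to stderr if its level passes the threshold. */
static void hirte_log(LogLevel level, const char *fmt, ...) {
        if (level < log_level)
                return;

        va_list args;
        va_start(args, fmt);
        fprintf(stderr, "[%s] ", log_level_names[level]);
        vfprintf(stderr, fmt, args);
        fprintf(stderr, "\n");
        va_end(args);
}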

Add xml file describing the hirte dbus APIs.

We should distribute a set of files similar to /usr/share/dbus-1/interfaces/org.freedesktop.systemd1.Manager.xml that contain a machine readable description of the hirte dbus APIs.

Change all headers to use "pragma once".

Please describe what you would like to see

In systemd, pragma once serves to prevent headers from being included multiple times. Our headers currently use inclusion guards, and we should change this to pragma once to be consistent with systemd.
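
For illustration, the change for a hypothetical node.h would look like this:

/* Before: classic include guard */
#ifndef NODE_H
#define NODE_H
/* declarations ... */
#endif

/* After: one line, no guard macro that has to stay unique */
#pragma once
/* declarations ... */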

Move from POC: [Orchestrator] create peer bus

Please describe what you would like to see

On a connection request from a node, the orchestrator creates a new peer dbus.

Please describe the solution you'd like

For reference, see the implementation from the POC.
Maybe use a factory pattern to create peer dbus (e.g. see here for factory pattern in C)

Please describe your use case

Creating a peer dbus that does not use the dbus daemon.


Configuration: Map the hashmap to NodeParams and OrchParams

Please describe what you would like to see

The hashmap resulting from reading the configuration file needs to be mapped to the NodeParams and OrchParams structs.

Please describe the solution you'd like

A function (e.g. map_to_node_params) is added to binchihua/src/ini/mapper.h (the file needs to be created if it does not exist). It maps a hashmap to the NodeParams struct.

Acceptance Criteria:

  • mapping function for hashmap -> NodeParams
  • mapping function for hashmap -> OrchParams
  • unit test for the mappings

Linting

Please describe what you would like to see

In order to statically analyze the code and enforce best-practice rules, there should be a linter in place, available to run locally and in the CI.

Please describe the solution you'd like

clang-tidy seems to be a good option for this. The following should be implemented:

  • a first, initial rule set,
  • a Makefile target,
  • an integration into the CI

Port strndupa_safe() from systemd

strndupa_safe() is a safer variant of strndupa(), duplicating a bounded number of bytes on the stack. To replace strdup() calls with strndupa_safe(), it needs to be ported from the systemd library.
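
For reference, glibc's strndupa() is itself a statement-expression macro around alloca(); a safe variant along those lines might look like this (a sketch of the idea, not the exact systemd code):

#include <alloca.h>
#include <string.h>

/* Sketch: duplicate at most n bytes of s onto the caller's stack,
 * always NUL-terminating the copy. */
#define strndupa_safe(s, n)                                 \
        ({                                                  \
                const char *_s = (s);                       \
                size_t _len = strnlen(_s, (n));             \
                char *_d = alloca(_len + 1);                \
                memcpy(_d, _s, _len);                       \
                _d[_len] = '\0';                            \
                _d;                                         \
        })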

Move from POC: [Node] function to start a unit

Please describe what you would like to see

The dbus connection from #20 is used to start units.

Please describe the solution you'd like

For reference, see the implementation from the POC.
The callback for the systemd start unit call can be empty or just print out a simple message (for now).
Maybe adding a simple unit test (just a dedicated <xyz>_test.c file containing a main()) can help to verify that the systemd unit can be started.

Important tech note: Instead of "isolate", use "replace". This is the default behavior of systemctl start (see here for reference) and should be used instead of isolate, as isolate causes all other units to be stopped.
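
For illustration, a synchronous variant of the call could look like this (the POC uses an asynchronous call with a callback; the function name is an assumption):

#include <stdio.h>
#include <systemd/sd-bus.h>

/* Sketch: start a unit through systemd's Manager API using "replace" mode,
 * the systemctl start default, instead of "isolate". */
static int node_start_unit(sd_bus *bus, const char *unit_name) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *reply = NULL;

        int r = sd_bus_call_method(
                        bus,
                        "org.freedesktop.systemd1",
                        "/org/freedesktop/systemd1",
                        "org.freedesktop.systemd1.Manager",
                        "StartUnit",
                        &error,
                        &reply,
                        "ss",
                        unit_name,
                        "replace");
        if (r < 0)
                fprintf(stderr, "Failed to start unit %s: %s\n", unit_name, error.message);

        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        return r;
}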


Support authentication of nodes

We should be able to define in the manager configuration details about each node, and this should include authentication information for each of the nodes (like a client certificate or similar) such that we can verify that the right node is connecting.

Add a signal handler for client application

Please describe what you would like to see

A signal handler gives the application an opportunity to do whatever clean-up is needed before the kernel removes the process.
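
A minimal sketch of such a handler (flag-based, so the main loop decides when to clean up):

#include <signal.h>

static volatile sig_atomic_t terminated = 0;

static void handle_signal(int signo) {
        (void) signo;
        terminated = 1; /* the main loop checks this flag and cleans up */
}

/* Sketch: install the handler for the signals we want to act on. */
static void install_signal_handlers(void) {
        struct sigaction sa = { .sa_handler = handle_signal };
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);
        sigaction(SIGTERM, &sa, NULL);
}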

CLI: Enhance CLI options handling: generate optstring automatically based on options[] array

Currently all CLI options are provided in an array of structs, e.g. in orch/opt.c:

const struct option options[] = {
                        {ARG_PORT, required_argument, 0, ARG_PORT_SHORT},
                        {ARG_CONFIG, required_argument, 0, ARG_CONFIG_SHORT},
                        {NULL, 0, 0, '\0'}};

When adding new CLI options, we add macro constants to common/opt.h and new members to the options[] array,
but we also have to manually edit the optstring passed to getopt_long(). Currently there is a function that returns this string hardcoded.

It is possible to generate the optstring from the options[] array, which will make adding new CLI options easier.
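
A sketch of how the optstring could be derived (assuming only no_argument and required_argument options, as in the current array):

#include <getopt.h>
#include <stdlib.h>

/* Sketch: build the getopt_long() optstring from the options[] array.
 * The caller frees the result. */
static char *build_optstring(const struct option *options) {
        size_t n = 0;
        while (options[n].name != NULL)
                n++;

        /* worst case: every option contributes "x:" plus the terminating NUL */
        char *optstring = malloc(2 * n + 1);
        if (optstring == NULL)
                return NULL;

        char *p = optstring;
        for (size_t i = 0; i < n; i++) {
                *p++ = (char) options[i].val;
                if (options[i].has_arg == required_argument)
                        *p++ = ':';
        }
        *p = '\0';
        return optstring;
}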

Agent should try to reconnect when connection goes down

The agent should detect when the connection to the manager gets broken and do something about it. This could be either retry, with exponential backoff, or just exit with an error, or a combination of the two. Possibly this should be a configuration option.
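
A sketch of the retry variant with exponential backoff (agent_try_connect() is a hypothetical helper, assumed to return 0 on success):

#include <unistd.h>

int agent_try_connect(void); /* hypothetical: 0 on success, negative on failure */

/* Sketch: retry the connection, doubling the delay up to a cap. */
static int agent_reconnect_loop(void) {
        unsigned int delay = 1; /* seconds */
        const unsigned int max_delay = 60;

        for (;;) {
                if (agent_try_connect() == 0)
                        return 0;

                sleep(delay);
                if (delay < max_delay)
                        delay *= 2;
        }
}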

Properly escape node names in dbus object paths

The object path for a node is /org/container/hirte/node/$NODENAME, which is currently computed in node_new(). This should actually escape the name to ensure that it is a proper dbus object path, as per the dbus spec:

 The following rules define a valid object path. Implementations must not send or accept messages with invalid object paths.

    The path may be of any length.

    The path must begin with an ASCII '/' (integer 47) character, and must consist of elements separated by slash characters.

    Each element must only contain the ASCII characters "[A-Z][a-z][0-9]_"

    No element may be the empty string.

    Multiple '/' characters cannot occur in sequence.

    A trailing '/' character is not allowed unless the path is the root path (a single '/' character). 

Systemd already does similar escaping for unit names, so we should just do whatever it does.
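
A rough sketch of the idea, hex-escaping every byte outside [A-Za-z0-9] (in the spirit of systemd's label escaping, not its exact implementation):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: escape a node name so it is always a valid dbus object path
 * element; '_' acts as the escape character and is itself escaped. */
static char *node_name_escape(const char *name) {
        size_t len = strlen(name);
        char *escaped = malloc(len * 3 + 1); /* worst case: every byte escaped */
        if (escaped == NULL)
                return NULL;

        char *p = escaped;
        for (size_t i = 0; i < len; i++) {
                char c = name[i];
                if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                    (c >= '0' && c <= '9'))
                        *p++ = c;
                else
                        p += sprintf(p, "_%02x", (unsigned char) c);
        }
        *p = '\0';
        return escaped;
}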

Orchestrator: use configuration file on start

Please describe what you would like to see

When starting the orchestrator, it should read and use the configuration file to bootstrap itself.

Please describe the solution you'd like

The configuration for the orchestrator could look like

[Configuration]
Port=<port>   # required
ExpectedNodes=<ip1>,<ip2>,...,<ip n>   # could be empty, then no connection request is accepted

The path to the file should be passed via cli option, e.g. orch -c /path/to/file.

Implement the Job-related manager operations

Some of the operations on a node are long-running, like StartUnit, ReloadUnit, StopUnit, etc. These operations create a job object that tracks the state changes of the job, and there are two signals on the manager (JobNew and JobRemoved) which are emitted to track the completion of jobs. This API is essentially a direct copy of the systemd job API and needs to be implemented for these operations to work.

One issue here is that we don't want multiple outstanding (and potentially conflicting) jobs for the same unit, so there is a per-unit queue which stores queued jobs for a unit if there is a currently executing job already. This is also similar to what systemd does.

Move from POC: [Orchestrator] offer service for starting systemd units

Please describe what you would like to see

The orchestrator is connected to the user dbus and offers a service for starting a systemd unit on it.

Please describe the solution you'd like

For reference, see

  • connecting to the local user dbus: see here
  • adding a vtable (service) to the dbus: see here
  • defining a vtable: see here
    • without the signals, just the method
    • method can be empty or just print out an info


Testing

I'm opening this issue as a place to discuss developing a plan on how to test this project, as per our agreement in the chat.

Choose build system

Please describe what you would like to see

Plain Makefiles are ok in the beginning, but as the project grows we should switch to a proper build system.

Please describe the solution you'd like

A build system should be selected and integrated for local and CI usage. Possible ones are

  • autotools (older, standard and widely used) - PR already here
  • meson (newer and therefore less used, faster and simpler according to doc)
  • ...

Add default configuration location

The agent and the manager should read configuration files in a default location if not otherwise specified. This should be in /etc somewhere, like maybe /etc/hirte.conf and /etc/hirte-node.conf.

Apply code formatting rules

Please describe what you would like to see

Based on the discussions in #55:

  • Replace all #ifndef xyz / #define xyz include guards with #pragma once
  • Re-order the #includes in all source files to have the system headers first, followed by an empty line and then the local includes

Properly handle agent disconnect

When an agent disconnects we must look at all outstanding requests and jobs for it and terminate them, otherwise a client may wait forever for a response.

Better error handling and logging framework

Our current handling of errors is rather naive. All we do is print stuff on stderr and return false. Once hirte is in production I think we want something more manageable. For example, we should probably have some minimal helpers for logging issues that have a log level which can be tweaked during testing via the configuration. And, it should probably go to the systemd journal at least when in production (although during testing, being able to get it to stderr is probably nice too).

We should also probably pick up the systemd error reporting mechanism of returning -errno style integers instead of just true/false for "success"/"error". That way callers can get at least some more information on exactly what failed. In many cases we already call into systemd APIs which can give this info. For example:

bool node_export(Node *node) {
        Manager *manager = node->manager;

        int r = sd_bus_add_object_vtable(
                        manager->user_dbus,
                        &node->export_slot,
                        node->object_path,
                        NODE_INTERFACE,
                        node_vtable,
                        node);
        if (r < 0) {
                fprintf(stderr, "Failed to add node vtable: %s\n", strerror(-r));
                return false;
        }

        return true;
}

This would probably be nicer if it was something like:

int node_export(Node *node) {
        Manager *manager = node->manager;

        int r = sd_bus_add_object_vtable(
                        manager->user_dbus,
                        &node->export_slot,
                        node->object_path,
                        NODE_INTERFACE,
                        node_vtable,
                        node);
        if (r < 0) {
                hirte_log(HIRTE_LOG_WARNING, "Failed to add node vtable: %s\n", strerror(-r));
        }

        return r;
}

Containerize the node and orchestrator

Please describe what you would like to see

Both the node and the orchestrator are containerized.

Please describe the solution you'd like

A base image can be defined and used to containerize both applications. However, as the project progresses, different requirements will probably arise, e.g. to connect to systemd the node probably needs to mount the systemd private socket.

Please describe your use case

By containerizing them it is possible to start multiple nodes. It also makes manual and automatic testing easier.

Implement proxy services

A proxy service is a service that can run on a node and mirror a service on another node for the purposes of remote dependencies.

The way it works is this:

  • hirte ships with a proxy service template file hirte-remote-service@.service
  • Another service file depends on hirte-remote-service@nodename_servicename.service
  • hirte-remote-service@nodename_servicename.service, when started, calls via dbus to the local agent and tells it to create a proxy object for the service.
  • The manager sees the newly created proxy object on the agent, ensures that the service is running on the target node and then tells the original agent that it's running now (or that starting it failed)
  • The proxy service gets notified of this success or failure
  • In the manager, the proxy gets handled similar to a monitor on the target service, and when it changes state (i.e. when it stops) the original agent is told to stop the proxy
  • When the agent gets told the service is stopped, it stops the proxy service

The template service would look something like this:

[Unit]
Description=Hirte proxy service

[Service]
ExecStart=/usr/libexec/hirte-proxy-service "%i"
Type=oneshot
RemainAfterExit=yes
KillMode=mixed

This is a one-shot, remain-after-exit service. On startup it calls hirte-proxy-service with the nodename__servicename part of the name as argument. This blocks until the target service is known to be running (EXIT_SUCCESS) or it failed to start (EXIT_FAILURE). If it failed, the proxy service will be considered failed and dependencies fail. If it returns success, the proxy service is now running. If we do nothing, it will be considered running forever; however, eventually the agent will get told the service died, and the agent will then tell systemd to stop the proxy service.

Here is some proof of concept code for how this could work:
https://gist.github.com/alexlarsson/bed968e0043f5ba3b22637be08bf19ac

To be Deleted

Please describe what you would like to see

Port from POC stubbed-out versions of Orchestrator, Node, Manager and Job. Implement just enough to capture the containment relationships between these objects. For example, Orchestrator contains a list of Node, and Manager contains a list of Job. Do this just for orch for now, and we'll come back to client and node.

[Orchestrator, Node, Client] Stop the event loop

Please describe what you would like to see

Currently, the orchestrator and node can only be stopped by killing the process. It would be great to extend the tooling - the client in this case - to be able to trigger a shutdown of the node/orchestrator, e.g. via cli:

client -o shutdown

Please describe the solution you'd like

This can be achieved by

  • node + orchestrator + client: connecting to the local user dbus
  • node + orchestrator: creating a shutdown service and providing it on the user dbus
    The service requires the event loop (sd_event) instance and calls sd_event_exit - this will cause the blocked sd_event_loop to exit (see the sketch below)
  • client: adding a cli option to call the shutdown service on the local user dbus
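
A minimal sketch of such a shutdown service (interface and method names are assumptions):

#include <systemd/sd-bus.h>
#include <systemd/sd-event.h>

/* Sketch: a Shutdown dbus method whose userdata is the sd_event instance;
 * sd_event_exit() makes the blocking sd_event_loop() return. */
static int method_shutdown(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        sd_event *event = userdata;

        int r = sd_event_exit(event, 0);
        if (r < 0)
                return r;

        return sd_bus_reply_method_return(m, "");
}

static const sd_bus_vtable shutdown_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Shutdown", "", "", method_shutdown, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_VTABLE_END
};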

Please describe your use case

This is potentially useful for automation, e.g. tests.

e2e Tests as part of github workflow

Add tests to cover various hirte dbus scenarios

As part of the review resolving issue #118, the following idea came up:
#104 (comment)

Testing library and framework

The Python package https://github.com/rhinstaller/dasbus could be integrated easily inside the pytest framework.

hirte scenarios

test_registered_node_proxy()
/org/containers/hirte/node/foo
test_offline_node_proxy()
/org/containers/hirte/node/bar
test_not_exist_node_proxy()
/org/containers/hirte/node/fye

Move from POC: [Client] Trigger start systemd unit

Please describe what you would like to see

The client connects to the user dbus and can trigger the service offered by the orchestrator for starting a systemd unit on all nodes (requires #22).

Please describe the solution you'd like

For a reference implementation see the client POC.

Please describe your use case

Having a dedicated client for triggering the services offered by the orchestrator. Starting with the "unit start" feature.


Node: use configuration file on start

Please describe what you would like to see

When starting a node, it should read and use the configuration file to bootstrap itself.

Please describe the solution you'd like

The configuration for the node could look like

[Configuration]
OrchestratorIP=<ip>       # required
OrchestratorPort=<port>   # required

The path to the file should be passed via cli option, e.g. node -c /path/to/file.

Implement properties and monitoring

Implement unit properties and monitoring.

In systemd each unit object has a set of properties (implemented using the standard dbus properties interface). We want to mirror at least some of these properties in hirte.

First, this involves the GetUnitProperties() call on the Node, which can get the properties of a particular unit of a particular node.

Secondly, you can call CreateMonitor() on the manager. This creates a new monitor object, which exists until it is manually closed, or the creating process disconnects from the bus (which can be detected using dbus events). This object has methods that allow you to add subscriptions to events. For example, you can say "tell me every time foo.service properties change on node bar". This will make hirte talk to the agent on node bar and subscribe (a standard dbus call) for all events, and forward all the matching ones to hirte, which then will emit a signal on the monitor object.

Implementation note: We don't need a monitor object in the agent; instead the manager consolidates all the current subscriptions for a particular node and just sends changes to that single per-node subscription state to the agent.

CC: @sdunnagan

Build as a messaging library rather than 3 apps

Please describe what you would like to see

Blue Chihuahua should maybe be delivered as a messaging library rather than as a set of applications. Experiment with changing the POC to be built as a library and tested with orch, client and node test programs. If the team agrees with this approach, change the new project to be built as a library.

Add handling for SIGTERM in orchestrator and node event loops

Please describe what you would like to see

SIGTERM gives the application an opportunity to do clean-up before it shuts down. This can be done by registering a signal handler for signals that can be caught, such as SIGTERM and SIGINT, which I've done for the client.

In the case of the orchestrator and node, an event loop is run, and in that case the event loop should handle SIGTERM. This GitHub issue is to write a function that sets up the event loops to handle SIGTERM, as sketched below.
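
A minimal sketch of such a setup function (sd-event can own the signals itself; passing a NULL handler to sd_event_add_signal() installs a default handler that calls sd_event_exit()):

#include <signal.h>
#include <systemd/sd-event.h>

/* Sketch: route SIGTERM/SIGINT through the event loop. The signals must be
 * blocked first so they are delivered via the event loop, not a handler. */
static int event_loop_setup_signals(sd_event *event) {
        sigset_t mask;

        sigemptyset(&mask);
        sigaddset(&mask, SIGTERM);
        sigaddset(&mask, SIGINT);
        sigprocmask(SIG_BLOCK, &mask, NULL);

        int r = sd_event_add_signal(event, NULL, SIGTERM, NULL, NULL);
        if (r < 0)
                return r;

        return sd_event_add_signal(event, NULL, SIGINT, NULL, NULL);
}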

