
azimuth's Introduction

Azimuth provides a self-service portal for managing cloud resources, with a focus on simplifying the use of cloud for high-performance computing (HPC) and artificial intelligence (AI) use cases.

Azimuth is currently capable of targeting OpenStack clouds, with the supported platforms ranging from single-machine workstations with web-based remote console and desktop access to entire Slurm clusters and platforms such as JupyterHub that run on Kubernetes clusters.

To run Azimuth on your own cloud, please see the Azimuth Deployment Documentation.

Developers looking to modify Azimuth should see Setting up a local development environment.

Users of Azimuth wanting to know more should see the user documentation.


What is Azimuth?

Azimuth was originally developed for the JASMIN Cloud as a simplified version of the OpenStack Horizon dashboard, with the aim of reducing complexity for less technical users.

It has since grown to offer additional functionality with a continued focus on simplicity for scientific use cases, including the ability to provision complex platforms via a user-friendly interface. The platforms range from single-machine workstations with web-based remote console and desktop access to entire Slurm clusters and platforms such as JupyterHub that run on Kubernetes clusters.

Services are exposed to users via the Zenith application proxy, without consuming floating IPs or requiring SSH keys.

Here, you can see Stig Telfer (CTO) and Matt Pryor (Senior Tech Lead and Azimuth project lead) from StackHPC presenting Azimuth at the OpenInfra Summit in Berlin in 2022:

Azimuth - self service cloud platforms for reducing time to science

Key features of Azimuth include:

  • Supports multiple Keystone authentication methods simultaneously:
    • Username and password, e.g. for LDAP integration.
    • Keystone federation for integration with existing OpenID Connect or SAML 2.0 identity providers.
    • Application credentials, allowing the distribution of easily revocable credentials, e.g. for training, or for integrating with clouds that use federation where implementing the required trust is not possible.
  • On-demand Platforms
    • Unified interface for managing Kubernetes and CaaS platforms.
    • Kubernetes-as-a-Service and Kubernetes-based platforms
      • Operators provide a curated set of templates defining available Kubernetes versions, networking configurations, custom addons etc.
      • Uses Cluster API to provision Kubernetes clusters.
      • Supports Kubernetes version upgrades with minimal downtime using rolling node replacement.
      • Supports auto-healing clusters that automatically identify and replace unhealthy nodes.
      • Supports multiple node groups, including auto-scaling node groups.
      • Supports clusters that can utilise GPUs and accelerated networking (e.g. SR-IOV).
      • Installs and configures addons for monitoring + logging, system dashboards and ingress.
      • Kubernetes-based platforms, such as JupyterHub, as first-class citizens in the platform catalog.
    • Cluster-as-a-Service (CaaS)
      • Operators provide a curated catalog of appliances.
      • Appliances are Ansible playbooks that provision and configure infrastructure.
        • Ansible calls out to Terraform to provision infrastructure.
      • Uses AWX, the open-source version of Ansible Tower, to manage Ansible playbook execution and Consul to store Terraform state.
  • Application proxy using Zenith:
    • Zenith uses SSH tunnels to expose services running behind NAT or a firewall to the internet using operator-controlled, random domains.
      • Exposed services do not need to be directly accessible to the internet.
      • Exposed services do not consume a floating IP.
    • Zenith supports an auth callout for proxied services, which Azimuth uses to secure proxied services.
    • Used by Azimuth to provide access to platforms, e.g. web consoles and monitoring dashboards.
  • Simplified interface for managing basic OpenStack resources:
    • Automatic detection of networking, with auto-provisioning of networks and routers if required.
    • Create, update and delete machines with automatic network detection.
    • Create, delete and attach volumes.
    • Allocate, attach and detach floating IPs.
    • Configure instance-specific security group rules.

Try Azimuth

If you have access to a project on an OpenStack cloud, you can try Azimuth!

A demo instance of Azimuth can be deployed on an OpenStack cloud by following these simple instructions. All that is required is an account on an OpenStack cloud and a host that is capable of running Ansible. Admin privileges on the target cloud are not normally required.

Timeline

This section shows a timeline of the significant events in the development of Azimuth:

  • Autumn 2015: Development begins on the JASMIN Cloud Portal, targeting JASMIN's VMware cloud.
  • Spring 2016: JASMIN Cloud Portal goes into production.
  • Early 2017: JASMIN Cloud plans to move to OpenStack, cloud portal v2 development begins.
  • Summer 2017: JASMIN's OpenStack cloud goes into production, with the JASMIN Cloud Portal v2.
  • Spring 2019: Work begins on JASMIN Cluster-as-a-Service (CaaS) with StackHPC.
  • Summer 2019: JASMIN CaaS beta roll out.
  • Spring 2020: JASMIN CaaS in use by customers, e.g. the ESA Climate Change Initiative Knowledge Exchange project.
  • Summer 2020: Production rollout of JASMIN CaaS.
  • Spring 2021: StackHPC forks the JASMIN Cloud Portal to develop it for IRIS.
  • Summer 2021: Zenith application proxy developed and used to provide web consoles in Azimuth.
  • November 2021: The StackHPC fork is detached and rebranded as Azimuth.
  • December 2021: StackHPC Slurm appliance integrated into CaaS.
  • January 2022: Native Kubernetes support added using Cluster API (previously supported by JASMIN as a CaaS appliance).
  • February 2022: Support for exposing services in Kubernetes using Zenith.
  • March 2022: Support for exposing services in CaaS appliances using Zenith.
  • June 2022: Unified platforms interface for Kubernetes and CaaS.
  • October 2022: Support for Kubernetes platforms in the unified platforms interface.

Architecture

Azimuth consists of a Python backend providing a REST API (distinct from the OpenStack API) and a JavaScript frontend written in React.

At its core, Azimuth is just an OpenStack client. When a user authenticates with Azimuth, it negotiates with Keystone (using either username/password or federation) to obtain a token, which is stored in a cookie in the user's browser. Azimuth then uses this token to talk to the OpenStack API on behalf of the user whenever the user submits requests to the Azimuth API via the Azimuth UI.

Azimuth Core Architecture Diagram

When the Zenith application proxy and Cluster-as-a-Service (CaaS) subsystems are enabled, this picture becomes more complicated - see Azimuth Architecture for more details.

Deploying Azimuth

Although Azimuth itself is a simple Python + React application that is deployed onto a Kubernetes cluster using Helm, a fully functional Azimuth deployment is much more complex and has many dependencies:

  • A Kubernetes cluster.
  • Persistent storage for Kubernetes configured as a Storage Class.
  • The NGINX Ingress Controller for exposing services in Kubernetes.
  • AWX for executing Cluster-as-a-Service jobs.
  • Cluster API operators + custom Kubernetes operator for Kubernetes support.
  • Zenith application proxy with authentication callout wired into Azimuth.
  • Consul for Zenith service discovery and Terraform state for CaaS.

To manage this complexity, we use Ansible to deploy Azimuth and all of its dependencies. See the Azimuth Deployment Documentation for more details.

Setting up a local development environment

See Setting up a local development environment.


azimuth's Issues

Contributors guide?

Hey folks. Are there any contribution guidelines and/or Azimuth development getting started docs?

Nunjucks templated values don't update in UI

Appliance properties displayed in the UI via nunjucks templates don't update when the appliance is updated. For example, we allow the user to update the number of IPUs their machine has available. If the user changes the number, and updates the appliance, the new IPU count isn't displayed in the UI (the IPU count is also populated into the instance metadata, and is correctly updated there).

There appears to be nothing the user can do to force an update: refreshing the details pane doesn't help; refreshing the main dashboard page doesn't help; logging out and back in doesn't help.

Feature: event handler for irreconcilable (k8s) clusters

Description

There are situations where a kubernetes cluster administrator may use the UI to
create / update a cluster that is beyond the scope of the underlying cloud
infrastructure. Such a request may be considered "irreconcilable". Like certain
types of romantic poetry, it is full of subtle longing for things that will
likely never come to pass.
One such situation is where the user submits a request to update the cluster,
and in doing so specifies a set of resources that exceed the given quota for the
tenancy where the cluster is deployed. After such a request has been issued, the
update button for that cluster becomes unusable and the only ways to get back to
a good state are to resolve the issue on the OpenStack side, delete the
cluster, or intervene using the Azimuth CLI.
Note that the k8s cluster under management and the applications it runs remain
functional during this blip.

Desired behaviour

The user has the option to roll back to a previously valid cluster configuration after a timeout.
For instance, after the timeout has been exceeded, the user would be presented with a modal
that gives a choice of either to continue to wait or to revert the cluster to the last known good state.

One possible implementation suggestion

  • Add a representation for "previously good k8s cluster form state" to the store
  • Also add a state to represent "irreconcilable k8s cluster"
  • Modify KubernetesClusterModalForm and useKubernetesClusterFormState to
    use a "copy-on-write" style pattern to store the result of
    initialState(kubernetesCluster) before attempting the update.
  • In the case that we enter the "irreconcilable k8s cluster", allow the user to
    re-apply the stored "previously good k8s cluster form state"

Azimuth can present FIPs for selection which aren't actually available

Selected an IP on Arcus when creating a platform with a FIP; creation failed with a popup:
{"parameter_values":{"cluster_floating_ip":"External IP is not available."}}

Horizon showed the FIP as Active but not mapped to a fixed IP. However the CLI showed it was mapped to a port on the ilab-60 network which was not owned by the project being used.
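A server-side fix could filter the candidate list down to floating IPs that are genuinely unattached and owned by the requesting project. A hedged sketch (the function name is hypothetical; the dict keys follow the Neutron floating IP representation):

```python
def available_floating_ips(floating_ips, project_id):
    """Return only the floating IPs that are free for this project to attach.

    Each item is assumed to be a dict shaped like a Neutron floating IP,
    e.g. {"floating_ip_address": ..., "port_id": ..., "project_id": ...}.
    """
    return [
        fip
        for fip in floating_ips
        if fip.get("port_id") is None  # not mapped to any port
        and fip.get("project_id") == project_id  # owned by this project
    ]


pool = [
    {"floating_ip_address": "1.2.3.4", "port_id": None, "project_id": "p"},
    {"floating_ip_address": "1.2.3.5", "port_id": "x", "project_id": "p"},
]
assert [f["floating_ip_address"] for f in available_floating_ips(pool, "p")] == ["1.2.3.4"]
```

This would have excluded the Arcus FIP above, since it was mapped to a port owned by a different project.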

Advance create platform dialog as soon as platform has been selected

We've had a question from our Azimuth users: when an appliance has been selected in the "Create a new platform / Pick a platform type" dialog, could it automatically progress to the "Configure platform" page, rather than the current workflow that requires clicking the "Next" button? Thanks

UI: server active tasks means table is too wide

When you reboot servers, the table width goes wrong, and on Chrome you don't get scroll bars or any obvious clues, just buttons hidden off the right-hand side.

Maybe we need to constrain the width of something here?

Create cluster failed with a 500

[INFO] [2023-02-23 14:10:37,613] [azimuth.cluster_engine.drivers.awx.driver:108] [ThreadPoolExecutor-0_1] [[email protected]] [[email protected]] Found 2 inventories

Log labels
  | app | azimuth
  | component | api
  | container | api
  | filename | /var/log/pods/azimuth_azimuth-api-665cccdc4b-m2c7z_c6dc809a-f1b3-4400-9e0a-5a9c276cc828/api/0.log
  | instance | azimuth
  | job | azimuth/azimuth
  | namespace | azimuth
  | node_name | azimuth-cl1-md-0-5e2cfe39-sbz78
  | pod | azimuth-api-665cccdc4b-m2c7z
  | stream | stdout
Detected fields
  | Time | 1677161437613
  | tsNs | 1677161437613671193
[ERROR] [2023-02-23 14:10:37,106] [django.request:241] [ThreadPoolExecutor-0_0] Internal Server Error: /api/tenancies/3a7dd6b6832a4dc2bf0d1cf3784f943b/clusters/
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/views/generic/base.py", line 84, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 509, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 469, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
    raise exc
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/decorators.py", line 50, in handler
    return func(*args, **kwargs)
  File "/application/azimuth/views.py", line 141, in wrapper
    return view(*args, **kwargs)
  File "/application/azimuth/views.py", line 183, in wrapper
    return view(*args, **kwargs)
  File "/application/azimuth/views.py", line 113, in wrapper
    return view(*args, **kwargs)
  File "/application/azimuth/views.py", line 61, in wrapper
    return view(*args, **kwargs)
  File "/application/azimuth/views.py", line 971, in clusters
    cluster = cluster_manager.create_cluster(
  File "/application/azimuth/cluster_engine/engine.py", line 280, in create_cluster
    cluster = self._driver.create_cluster(
  File "/application/azimuth/cluster_engine/drivers/awx/driver.py", line 37, in wrapper
    return f(*args, **kwargs)
  File "/application/azimuth/cluster_engine/drivers/awx/driver.py", line 550, in create_cluster
    _ = self._from_inventory(inventory, ctx)
  File "/application/azimuth/cluster_engine/drivers/awx/driver.py", line 383, in _from_inventory
    name = params.pop("cluster_name")
KeyError: 'cluster_name'
172.21.36.128 - - [23/Feb/2023:14:10:37 +0000] "POST /api/tenancies/3a7dd6b6832a4dc2bf0d1cf3784f943b/clusters/ HTTP/1.1" 500 145 "https://portal.apps.gbnwp-cl1.ipu.graphcore.ai/tenancies/3a7dd6b6832a4dc2bf0d1cf3784f943b/platforms" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/109.0"
Log labels
  | app | azimuth
  | component | api
  | container | api
  | filename | /var/log/pods/azimuth_azimuth-api-665cccdc4b-m2c7z_c6dc809a-f1b3-4400-9e0a-5a9c276cc828/api/0.log
  | instance | azimuth
  | job | azimuth/azimuth
  | namespace | azimuth
  | node_name | azimuth-cl1-md-0-5e2cfe39-sbz78
  | pod | azimuth-api-665cccdc4b-m2c7z
  | stream | stderr
Detected fields
  | Time | 1677161437137
  | tsNs | 1677161437137744717
  |   | 2023-02-23 14:10:37 | 172.18.120.192 - - [23/Feb/2023:14:10:37 +0000] "GET /api/tenancies/4f19cf5df697497e9222505361cf75b8/quotas/ HTTP/1.1" 200 370 "https://portal.apps.gbnwp-cl1.ipu.graphcore.ai/tenancies/4f19cf5df697497e9222505361cf75b8/platforms" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.15"
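The 500 boils down to an unguarded `params.pop("cluster_name")` in the AWX driver's `_from_inventory`. A more defensive variant would turn the missing key into a clear, catchable validation error instead of an internal server error. A sketch only (the exception type and helper name are hypothetical, not the actual fix):

```python
class InvalidClusterInventory(Exception):
    """Raised when an AWX inventory lacks a variable Azimuth expects."""


def pop_required(params: dict, key: str):
    """Like params.pop(key), but with an actionable error message."""
    try:
        return params.pop(key)
    except KeyError:
        raise InvalidClusterInventory(
            f"inventory is missing required variable {key!r}"
        ) from None


params = {"cluster_name": "demo"}
assert pop_required(params, "cluster_name") == "demo"
```

The view layer could then map `InvalidClusterInventory` to a 4xx response with the message, rather than letting the bare `KeyError` bubble up as a 500.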

Helm chart fails when adding awx with helm upgrade

If you don't read the section about adding clusters until quite late, you end up trying to add AWX via helm upgrade. When you do, this happens:
Error: UPGRADE FAILED: unable to recognize "": no matches for kind "AWX" in version "awx.ansible.com/v1beta1"

Servers with multiple IPs only show up with one internal IP

While we don't support creating servers with two networks attached, servers created outside of Azimuth only seem to show as having one IP. It might be nice to show a list of IPs in that case? Total edge case though.

Enable per appliance filtering or other specification of applicable flavours

It would be most useful to have the capability of filtering the list of flavours, or otherwise specifying a list of applicable flavours, per appliance type. Trivial examples:

  • Present only "Intel" flavours in the dropdown for appliances dedicated to Intel workloads
  • Present only flavours with RAM > 128 GB for appliances requiring large allocations

Some errors make the screen go blank, need to be more defensive?

For example, with the CRD driver you can delete a cluster type while still having a cluster of that type, and the whole screen goes blank. The React error messages suggest we are missing some error handlers that would pop up a friendlier message so we fail more gracefully. These are mostly unexpected cases, like a malformed choices parameter in the UI meta causing the screen to go blank.

Update slurm with changed number of compute nodes doesn't deploy additional node

  • Create a cluster with 1x compute node
  • Press Update and add another compute node.
  • No additional node was created. The stackhpc.terraform.infra : Provision infrastructure using Terraform step was an "OK" (not changed).
  • From K8s:
$ kubectl -n az-rcp-cloud-portal-demo describe clusters.caas.azimuth.stackhpc.com slurm-v1

Status:
  Applied Extra Vars:
  ...
    compute_count:                      1
    ...

so it looks like this didn't get updated.

This used to work.
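A quick way to confirm the symptom is to diff the extra vars you submitted against the `Applied Extra Vars` reported in the status above. An illustrative stdlib sketch (the function name is hypothetical):

```python
def extra_vars_drift(desired: dict, applied: dict) -> dict:
    """Return the keys whose applied value differs from the desired value."""
    return {
        key: {"desired": desired.get(key), "applied": applied.get(key)}
        for key in desired
        if applied.get(key) != desired.get(key)
    }


# The update requested 2 compute nodes, but the status still shows 1
desired = {"compute_count": 2}
applied = {"compute_count": 1}
assert extra_vars_drift(desired, applied) == {
    "compute_count": {"desired": 2, "applied": 1}
}
```

Any non-empty drift here means the operator never received (or never applied) the updated value, which matches the unchanged Terraform step in the report.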
