clusterlabs / striker

The Anvil! Intelligent Availability™ Platform - Striker UI and ScanCore decision engine

Home Page: https://www.alteeve.com/w/Anvil!

Perl 96.24% HTML 1.69% CSS 0.07% JavaScript 0.19% PLpgSQL 1.55% Shell 0.22% C 0.03%

striker's Introduction

Anvil! m2 - Striker + ScanCore v2.0.1

Welcome to the v2.0.0 release of the Anvil! m2 Intelligent Availability™ platform!

What is an Anvil! Platform?

It is the first server platform designed with the singular focus of protecting your servers and keeping them running under even extreme fault conditions.
It is fully self-contained, making it ideal for totally offline operation.
It is a "self driving" server availability platform that continuously monitors internal and external data sources, compiling and analyzing the data and autonomously deciding when to act, and what action to take, to protect your servers. It is ideally suited for extended remote deployments and "hands off" operation.
It is based on an extensively field tested, open architecture with full data, mechanical and electrical redundancy, allowing any component to be failed, removed and replaced without the need for a maintenance window. The Anvil! platform has over five years of real-world deployment across dozens of sites and a historic uptime of over 99.9999%.
It is extremely easy to use, minimizing the opportunity for human error and making it as simple as possible for "remote hands" to effect repairs and replacements with no prior availability experience and minimal technical knowledge.

In short, it is a server platform that just won't die.

How do you build an Anvil!?

It's quite easy, but it does require a little more space than a README allows for.

How to Build an m2 Anvil!

The Anvil! was designed and extensively tested on Primergy servers, Brocade ICX switches, and APC SmartUPS UPSes and Switched PDUs. That said, the Anvil! platform is designed to be hardware agnostic and should work just fine on Dell, Cisco UCS, NEC, Lenovo x-series, and other tier-1 server vendors.

Alteeve, the company behind the Anvil! project, actively supports the open source community. We also offer commercial support contracts to assist with any stage of deployment, operation and custom development.


striker's Issues

Provide an option to remove dangling LVs

That is, LVs on the nodes that are not connected to any existing server.

Virsh won't connect cdrom/disk if the VM name has mixed case

In the menu used to attach a DVD or CD ISO to a VM, if the VM's name contains mixed case, the ISO is displayed as attached but it won't actually be.

Add a "notes" section to the servers so that users can make notes about their VMs.

Add sync'ing of striker cache and ssh_config

Title says all

Do not allow a node to be withdrawn when it is SyncSource

Likewise, if doing a cold-shutdown, make sure the SyncTarget node goes down first. Not doing this causes DRBD to complain and not exit when rgmanager is asked to stop.

Media Library isn't uploading files and passing a URL results in a 0-byte file

Something appears to have broken Media Library uploads...

driver name='qemu' added twice to server XML definition when sys::server::{bcn,sn,ifn}_nic_driver set in striker.conf

Example:

<interface type='bridge'>
  <mac address='52:54:00:8e:ea:75'/>
  <source bridge='ifn_bridge1'/>
  <target dev='vnet0'/>
  <model type='virtio'/>
  <alias name='net0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
  <driver name='qemu'/>
  <driver name='qemu'/>
</interface>
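
A minimal sketch of one way to guard against the duplicate shown above, written in Python for illustration rather than in Striker's own Perl. The function name `dedupe_driver_elements` is hypothetical; it treats two `<driver>` children as duplicates when their attributes are identical and keeps only the first.

```python
import xml.etree.ElementTree as ET


def dedupe_driver_elements(interface_xml: str) -> str:
    """Return the interface XML with repeated identical <driver> children removed."""
    interface = ET.fromstring(interface_xml)
    seen = set()
    # Iterate over a copy so that removing children is safe.
    for child in list(interface):
        if child.tag != "driver":
            continue
        # Two <driver> elements count as duplicates when their attributes match.
        key = tuple(sorted(child.attrib.items()))
        if key in seen:
            interface.remove(child)
        else:
            seen.add(key)
    return ET.tostring(interface, encoding="unicode")


if __name__ == "__main__":
    sample = """<interface type='bridge'>
  <source bridge='ifn_bridge1'/>
  <model type='virtio'/>
  <driver name='qemu'/>
  <driver name='qemu'/>
</interface>"""
    print(dedupe_driver_elements(sample))
```

This only cleans up an already generated definition. The underlying cause (the driver being added twice when the striker.conf variables are set) would still need to be fixed where the XML is built.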

The vm.sh downloaded to nodes is a 404 message

Probably a simple issue with either the vm.sh written to the ISO or the kickstart call to grab it.

Verify that node 2's PDU fencing is off -> on

User reports that node 2's PDU fencing was left as 'reboot'.

Need to clean up the web pages that reference guacamole (in-browser VNC) since guacd/tomcat are gone

The web UI has to be cleaned up to remove the places where it tries to give the user in-browser VNC access to a server that has just been configured.

Currently, the "Oh No!" icon appears. guacd/tomcat6 were removed because guac wouldn't support "spice".

Loading a backup appears to leave the ownership of cache files as root:root

hosts file is not leaving a space between host names in some cases

Hit IPMI specifically on 1.1.6.

Alert email settings don't work with ScanCore yet

What it says on the tin.

Can't disable Install Target

It looks like the big shell call at /var/www/cgi-bin/lib/AN/Cluster.pm line 8613 isn't running.

Uploading a file from your local machine renders poorly in Media Library

Remember to fix before v1.1.5

Have striker check to see if 'anvil-safe-start' is running and, if so, print a banner asking the user to wait

Don't want the user to think there is a problem when, for example, anvil-safe-start is holding for a rapid resync to complete.

Make sure 'striker-configure-vmm' works properly

Currently, if the user is logged into gnome when it is run for the first time, the settings are wiped out when the user logs out. Find a way to get virt-manager to re-read the files (or find some other way to protect the changes).

Switch corosync/cman back to rrp=none

00:56 @fabbione digimer: the error means something in kernel is still holding the socket (DLM) and the socket has not been cleaned properly
00:57 @fabbione digimer: basically, the only way out is to reboot
00:57 @fabbione digimer: also, you shouldn't be using SCTP in the first place.
00:57 @fabbione DLM and SCTP don't work well together because of some missing RFC/drafts in SCTP implementations
01:00 < digimer> well fuck
01:00 < digimer> isn't sctp required for rrp?
01:00 @fabbione yes but we don't support DLM on rrp due to the kernel limitations of SCTP

;_;

Have a home button/link on all the web pages to take you to the first page

Some pages have a "back" arrow button and others don't. It would be good to have a "home" button/link on ALL the pages so the user can restart navigation from the first page.

Show dates and md5sums on files in the media library

This will be useful when uploading multiple ISO files, or several versions of the same ISO that share a file name but carry different modifications (for example, to the .ks file or the package set).
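
A minimal sketch of how the date and md5sum for a media file could be gathered, in Python for illustration. The function name `describe_file` and the returned field names are hypothetical and not taken from Striker.

```python
import hashlib
import os
import time


def describe_file(path: str, chunk_size: int = 1024 * 1024) -> dict:
    """Collect the name, size, modification time and md5sum of one file."""
    digest = hashlib.md5()
    # Read in chunks so that large ISO files are not loaded into memory at once.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size": info.st_size,
        "modified": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(info.st_mtime)),
        "md5sum": digest.hexdigest(),
    }
```

Hashing a multi-gigabyte ISO takes noticeable time, so a real implementation would probably cache the sum and recompute it only when the size or modification time changes.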

Currently unable to delete an Anvil! from Striker

The delete function only deletes the Anvil! from the local dashboard, and then it gets put back when striker-merge-dashboards is run. Have it delete from both dashboards at the same time.

Unsure how to handle the case where one dashboard is offline at the time... For now, we'll probably ignore that use case. In Striker v3, all this will move into ScanCore's DB which can handle cases like this.

Make guacamole restart smarter

Occasionally, the 'restart' fails to start the service. When this happens, sleep and try again. Repeat up to X times before giving up.
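
The retry behaviour described here can be sketched as follows, in Python for illustration. The function `restart_with_retries` and its parameters are hypothetical; `restart` stands in for whatever call actually restarts the service and is assumed to return True on success.

```python
import time
from typing import Callable


def restart_with_retries(restart: Callable[[], bool],
                         max_attempts: int = 5,
                         delay_seconds: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call restart() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        if restart():
            return True
        # Only sleep when another attempt will follow.
        if attempt < max_attempts:
            sleep(delay_seconds)
    return False
```

Passing `sleep` in as a parameter keeps the loop testable without real delays.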

If there is no network connection when apcupsd starts, the daemon needs to be restarted

If an-cm doesn't get anything useful out of apcaccess, try restarting the daemon once every X seconds or Y checks.

Example:

[root@pp-c01n02 ~]# apcaccess
APC : 001,018,0471
DATE : 2012-08-23 17:24:26 -0400
HOSTNAME : pp-c01n02.plasticplus.ca
VERSION : 3.14.10 (13 September 2011) redhat
UPSNAME : pp-u01
CABLE : Ethernet Link
DRIVER : SNMP UPS Driver
UPSMODE : Stand Alone
STARTTIME: 2012-08-23 12:01:08 -0400
STATUS :
MBATTCHG : 0 Percent
MINTIMEL : 0 Minutes
MAXTIME : 0 Seconds
NUMXFERS : 0
TONBATT : 0 seconds
CUMONBATT: 0 seconds
XOFFBATT : N/A
STATFLAG : 0x07000000 Status Flag
END APC : 2012-08-23 17:24:29 -0400

[root@pp-c01n02 ~]# /etc/init.d/apcupsd restart
Shutting down UPS monitoring (apcupsd.ups0.conf): [ OK ]
Shutting down UPS monitoring (apcupsd.ups1.conf): [ OK ]
Starting UPS monitoring (apcupsd.ups0.conf): [ OK ]
Starting UPS monitoring (apcupsd.ups1.conf): [ OK ]

[root@pp-c01n02 ~]# apcaccess
APC : 001,050,1252
DATE : 2012-08-23 17:25:03 -0400
HOSTNAME : pp-c01n02.plasticplus.ca
VERSION : 3.14.10 (13 September 2011) redhat
UPSNAME : APCUPS
CABLE : Ethernet Link
DRIVER : SNMP UPS Driver
UPSMODE : Stand Alone
STARTTIME: 2012-08-23 17:25:03 -0400
MODEL : Smart-UPS 1500
STATUS : ONLINE
LINEV : 118.0 Volts
LOADPCT : 20.0 Percent Load Capacity
BCHARGE : 100.0 Percent
TIMELEFT : 66.0 Minutes
MBATTCHG : 0 Percent
MINTIMEL : 0 Minutes
MAXTIME : 0 Seconds
MAXLINEV : 119.0 Volts
MINLINEV : 118.0 Volts
OUTPUTV : 118.0 Volts
SENSE : High
DWAKE : 000 Seconds
DSHUTD : 000 Seconds
DLOWBATT : 02 Minutes
LOTRANS : 106.0 Volts
HITRANS : 127.0 Volts
RETPCT : 34779152.0 Percent
ITEMP : 25.0 C Internal
ALARMDEL : 30 seconds
BATTV : 27.0 Volts
LINEFREQ : 60.0 Hz
LASTXFER : Line voltage notch or spike
NUMXFERS : 0
TONBATT : 0 seconds
CUMONBATT: 0 seconds
XOFFBATT : N/A
SELFTEST : OK
STESTI : OFF
STATFLAG : 0x07000008 Status Flag
MANDATE : 06/08/2012
SERIALNO : xxxxxxxxxxxxx
BATTDATE : 06/08/2012
NOMOUTV : 120 Volts
NOMBATTV : 34779152.0 Volts
HUMIDITY : 2971549696.0 Percent
AMBTEMP : 2971549696.0 C
EXTBATTS : 34779152
BADBATTS : -1323417600
FIRMWARE : UPS 08.3 / MCU 14.0
END APC : 2012-08-23 17:25:06 -0400
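
Comparing the two outputs above, the stale daemon reports an empty STATUS field while the restarted one reports ONLINE. A minimal sketch of a check built on that observation, in Python for illustration; the function names are hypothetical, and treating an empty or missing STATUS as "nothing useful" is an assumption drawn only from this example.

```python
def parse_apcaccess(output: str) -> dict:
    """Split 'KEY : value' lines from apcaccess output into a dictionary."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, since values such as dates contain colons.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def needs_restart(output: str) -> bool:
    """True when apcaccess returned no usable STATUS value."""
    return parse_apcaccess(output).get("STATUS", "") == ""
```

A caller would run this once every X seconds or Y checks, as the issue suggests, and restart the daemon only when `needs_restart` returns True.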

Deleting a server needs to remove the server from the 'server' DB table

Get live migration working with 4kn drives

See: https://bugzilla.redhat.com/show_bug.cgi?id=1285921

Test fix in oalbrigt/resource-agents@16fb53b.

SELinux appears to be left 'enforcing' in Striker install

If so, fix.

Scancore DB didn't initialize in the latest build

Regression introduced in the last day... Find and fix.

Editing an existing Anvil! causes the ',' between the node names to be lost

This causes Striker to think the node name for node1 is 'node1 node2' instead of splitting, breaking things pretty badly.

Changing the IP of a node causes the IP alone to be set in /etc/hosts

No name is set. Check for IPs with no hostnames and skip.
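
A minimal sketch of the proposed check, in Python for illustration. The function name `drop_nameless_host_entries` is hypothetical; it skips any line that holds an address with no hostname after it and leaves comments and blank lines untouched.

```python
def drop_nameless_host_entries(hosts_text: str) -> str:
    """Return hosts-file text without lines that contain only an address."""
    kept = []
    for line in hosts_text.splitlines():
        # Ignore any trailing comment before counting fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 1:
            # An address with no hostname after it; skip it.
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```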

Build an Anvil! install ISO generator

Take a stock RHEL or CentOS 6 ISO and use it as input to generate an Anvil! install ISO to avoid running afoul of trademarks.

DLM hanging but corosync working causes total cluster hang

An example would be a storage failure blocking DLM lock-space but leaving corosync operational. Possibly able to reproduce by intentionally disconnecting storage from a node.

ScanCore could spawn a child process that tries to call 'clustat' and if it doesn't respond within a timeout, calls 'echo c > /proc/sysrq-trigger'. Note that this would have to happen inside ScanCore proper as agents may not be loadable.

Alternatively, ScanCore could try to ssh into the peer after the timeout and if it can't login (as may be the case with failed storage) fence the peer.
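
The timeout part of this idea can be sketched as below, in Python for illustration. The function name `responds_within` is hypothetical. It only reports whether a command finished in time; what to do on a timeout (crashing the node or fencing the peer) is deliberately left out.

```python
import subprocess
import sys
from typing import Sequence


def responds_within(command: Sequence[str], timeout_seconds: float) -> bool:
    """Run a command and report whether it finished before the timeout."""
    try:
        subprocess.run(command,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout_seconds,
                       check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        return False
    except OSError:
        # The command is missing or not executable; treat it as no response.
        return False
    return True


if __name__ == "__main__":
    # Stand-in command; a real check would call 'clustat' instead.
    print(responds_within([sys.executable, "-c", "pass"], timeout_seconds=10.0))
```

One caveat: a process blocked in uninterruptible I/O on failed storage may not die when killed, so the parent should not wait on it indefinitely in a real implementation.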

striker-push-ssh isn't invoked on peer striker

Title says all

Add a boot ordering function for servers when 'anvil-safe-start' is used

Many clients have servers that need to be booted in a certain order and with set delays.

Support in striker.conf only, no support in Striker v2.0's WebUI.

Striker.conf:

# A user may wish for servers to boot in a certain order and with possible
# delays between machine boots. This section allows for this configuration.
# 
# Server boot ordering is configured by defining each server with a unique
# variable integer. This integer has no bearing on the boot order, it simply
# provides a way to distinguish between entries.
# 
# The values are in the form: "(boot order):(delay):(server name)".
# 
# Any undefined server will be booted as if it were set to '1:0:x' and, thus,
# will boot immediately.
# 
# One or more servers can have the same 'boot order' value. All servers in the
# lower boot order must have booted before the next boot order is evaluated.
# Once a server is ready for boot, the 'delay' will be checked. If a delay is
# set, it will sleep for the defined number of seconds before actually starting.
# 
# So in the example set below;
#1. vm01-foo and vm02-bar will be booted without delay
#2. vm04-bang will boot once set #1 servers have booted without delay, but 
#    vm03-baz will wait 30 seconds before booting.
#3. vm05-boop will boot once set #2 servers have booted.
# 
# If a given server fails to boot, the servers in subsequent sets will not
# boot.
#server::boot_order::vm01-foo           =   1:0
#server::boot_order::vm02-bar           =   1:0
#server::boot_order::vm03-baz           =   2:30
#server::boot_order::vm04-bang          =   2:0
#server::boot_order::vm05-boop          =   3:60
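
A minimal sketch of how entries in this format could be turned into a boot plan, in Python for illustration. It assumes the entries have been uncommented and follow the `server::boot_order::<name> = <order>:<delay>` form used in the example lines; the function names are hypothetical. Within a set, servers are listed by ascending delay, which mirrors the description of vm04-bang booting before vm03-baz.

```python
import re
from collections import defaultdict

# Matches lines such as: server::boot_order::vm03-baz = 2:30
ENTRY = re.compile(r"^server::boot_order::(\S+)\s*=\s*(\d+):(\d+)\s*$")


def parse_boot_order(config_text: str) -> dict:
    """Map each boot-order number to a list of (server name, delay) pairs."""
    groups = defaultdict(list)
    for line in config_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            name = match.group(1)
            order = int(match.group(2))
            delay = int(match.group(3))
            groups[order].append((name, delay))
    return dict(groups)


def boot_plan(config_text: str) -> list:
    """Return (order, servers) tuples sorted by order, servers by delay."""
    groups = parse_boot_order(config_text)
    return [(order, sorted(groups[order], key=lambda entry: entry[1]))
            for order in sorted(groups)]
```

Commented-out lines are ignored, so the example block as written above would produce an empty plan until the entries are enabled.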

Investigate adding SpiceHTML5 for in-browser remote access to servers

See: http://www.ovirt.org/Features/SpiceHTML5

Add load shedding to ScanCore/anvil-safe-stop

Possible reboot bug when running Install Manifest

Possible reproducer:

Run the install with IPMI + PDU fencing on a pair of nodes so that install fails when it hits the IPMI LAN.
Delete cluster.conf on both nodes, fix install manifest on striker.
Reboot nodes from their command line (to get the new IPs up)
Reload the manifest from the start, using the updated IPs.
Reboot of node 1 occurred, but Striker failed to reconnect.
Restart the manifest run using the 'Reboot' button after the error message.
Note that both nodes appear to not need a reboot, but the DRBD will fail to attach r0 on node 2 (likely because it didn't get rebooted so the partition table wasn't updated)

Verify that changing VM hardware specs doesn't damage the XML definition

Reported happening by a user. Add a check to verify the new XML has a closing element and roll-back if not.
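
A minimal sketch of such a check, in Python for illustration. The function name `choose_definition` is hypothetical. It relies on an XML parser rejecting any document whose elements are not all closed, and additionally assumes the root element is `<domain>`, as it is for libvirt server definitions.

```python
import xml.etree.ElementTree as ET


def choose_definition(old_xml: str, new_xml: str) -> str:
    """Return new_xml if it is well-formed, otherwise fall back to old_xml."""
    try:
        root = ET.fromstring(new_xml)
    except ET.ParseError:
        # Truncated or unbalanced XML, e.g. a missing closing element.
        return old_xml
    if root.tag != "domain":
        # Assumption: a server definition always has <domain> as its root.
        return old_xml
    return new_xml
```

Parsing the whole document catches more damage than looking only for the closing element, at very little extra cost.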

Significantly reduce logging volume

Adapt the Cluster.pm -> record() function to take log levels. Set the default log level to 1 in striker.conf.
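
The idea can be sketched as follows, in Python for illustration rather than as a change to the Perl `record()` function itself. The names `make_recorder`, `configured_level` and `level` are hypothetical; the configured level stands in for the value that would be read from striker.conf.

```python
import sys
import time


def make_recorder(configured_level: int = 1, stream=sys.stderr):
    """Build a record() function that drops messages above configured_level."""
    def record(message: str, level: int = 1) -> bool:
        # Higher levels are more verbose; skip anything above the configured one.
        if level > configured_level:
            return False
        stream.write(f"{time.strftime('%H:%M:%S')} {message}\n")
        return True
    return record
```

With the default level of 1, only calls made at level 1 or lower are written, so existing verbose calls can be demoted by giving them a higher level instead of deleting them.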

The color scheme on the web UI could use a revamp

Currently, various colors such as "purple", "blue" and "green" are used across the web pages. It's not always clear where the links are (for example, "turn the LED on" is a link that performs an action).

It looks like the "purple" colored words/phrases are links. Maybe making them bold would be enough to make them stand out.

Add usage info to ScanCore

Pushing a config to an Anvil! node doesn't display cleanly.

A quick SSH based check/call should fix this.

Node Manifest installation stalls on creating partition, attempts to create 0GB drive

The Stage 2 Node install from a single striker stalls if a VM partition is already created.

Drive layout prior to running the manifest:

Number  Start   End     Size    Type      File system     Flags
 1      1049kB  538MB   537MB   primary   ext4            boot
 2      538MB   43.5GB  42.9GB  primary   ext4
 3      43.5GB  47.8GB  4295MB  primary   linux-swap(v1)
 4      47.8GB  1497GB  1449GB  extended                  lba
 5      47.8GB  1497GB  1449GB  logical

striker.log output:

13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Model: FTS PRAID EP420i (scsi)]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Disk /dev/sda: 1394GiB]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Sector size (logical/physical): 512B/512B]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Partition Table: msdos]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: []
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Number Start End Size Type File system Flags]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [0.00GiB 0.00GiB 0.00GiB Free Space]
13:09:03 AN::InstallManifest.pm 7686; start: [0.00], end: [0.00], size: [0.00]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [1 0.00GiB 0.50GiB 0.50GiB primary ext4 boot]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [2 0.50GiB 40.5GiB 40.0GiB primary ext4]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [3 40.5GiB 44.5GiB 4.00GiB primary linux-swap(v1)]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [4 44.5GiB 1394GiB 1350GiB extended lba]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [5 44.5GiB 1394GiB 1350GiB logical]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: []
13:09:03 AN::InstallManifest.pm 7691; node: [10.20.10.201], disk: [sda], start: [0.00], end: [0.00].
13:09:03 AN::InstallManifest.pm 7712; node: [10.20.10.201], disk: [sda], type: [extended], partition_size: [all]
13:09:03 AN::InstallManifest.pm 7739; snode: [10.20.10.201], disk: [sda], type: [extended], start: [0.00 GiB], end: [0.00 GiB]
13:09:03 AN::InstallManifest.pm 7746; shell_call: [parted -a opt /dev/sda mkpart extended 0.00GiB 100%]

Steps to recreate:

Initialize Anvil from dashboard, already have a partition for VMs created.

Removing the partitions causes the installer to create them from scratch, and circumvents the issue.
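
In the log above, the only "Free Space" line reports 0.00GiB for start, end and size, and the installer then goes on to call mkpart from 0.00GiB. One possible guard, offered as a sketch and not as a confirmed fix, is to ignore free-space regions whose size rounds to zero. It is written in Python for illustration; the function name and the `minimum_gib` threshold are hypothetical, and the line format is assumed to match the GiB output shown in the log.

```python
import re

# Matches lines such as: 0.00GiB 0.00GiB 0.00GiB Free Space
FREE = re.compile(r"^\s*([\d.]+)GiB\s+([\d.]+)GiB\s+([\d.]+)GiB\s+Free Space\s*$")


def usable_free_regions(parted_lines, minimum_gib: float = 0.01) -> list:
    """Return (start, end, size) tuples for free regions of a usable size."""
    regions = []
    for line in parted_lines:
        match = FREE.match(line)
        if not match:
            continue
        start, end, size = (float(value) for value in match.groups())
        # Skip slivers that parted reports as 0.00GiB.
        if size >= minimum_gib:
            regions.append((start, end, size))
    return regions
```

If the returned list is empty, an installer could stop with a clear message, or reuse an existing partition, rather than attempting to create a new one.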

Generating an ISO based on CentOS fails to copy EFI over iso-read

Rework the hardware status (disks, raid controller etc.) display in the web UI

Currently in the web UI, each disk's status is shown as a block with complete details. It would be better to show just the state, and maybe the temperature, of all the drives in a table, with links to a detailed view giving all the pertinent details such as error counts.

Running ScanCore on a node for the first time, when UPSes are not accessible, generates a warning that comms was lost to the UPSes

Should simply ignore them...

Warning:
  Connection to the UPS: [an-ups01.alteeve.ca] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored. 
         Graceful shutdown in the case of a power loss may not function until 
         communication is restored!

Warning:
  Connection to the UPS: [an-ups02.alteeve.ca] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored. 
         Graceful shutdown in the case of a power loss may not function until 
         communication is restored!
