clusterlabs / striker

The Anvil! Intelligent Availability™ Platform - Striker UI and ScanCore decision engine

Home Page: https://www.alteeve.com/w/Anvil!

Perl 96.24% HTML 1.69% CSS 0.07% JavaScript 0.19% PLpgSQL 1.55% Shell 0.22% C 0.03%

striker's People

Contributors

digimer, howdoicomputer, legrady, nummysquee, prothon, ylei-tsubame

striker's Issues

Start Remote Session

There should be a button on the Cluster page that enables remote support. Only once it is clicked, it should allow Alteeve to remote into your dashboard and, by proxy, assist you with your clusters.

Currently unable to delete an Anvil! from Striker

The delete function only deletes the Anvil! from the local dashboard, and then it gets put back when striker-merge-dashboards is run. Have it delete from both dashboards at the same time.

Unsure how to handle the case where one dashboard is offline at the time... For now, we'll probably ignore that use case. In Striker v3, all this will move into ScanCore's DB which can handle cases like this.

Make sure 'striker-configure-vmm' works properly

Currently, if the user is logged into GNOME when 'striker-configure-vmm' is run for the first time, the settings are wiped out when the user logs out. Find a way to get virt-manager to re-read the files (or find some other way to protect the changes).

The color scheme on the web UI could use a revamp

Currently, various colors such as "purple", "blue" and "green" are used across the web pages. It's not always clear where the links are (for example, "turn the LED on" is a link that performs an action).

It looks like the "purple" words and phrases are links. Maybe making them bold would be enough to make them stand out.

Running ScanCore on a node for the first time, when UPSes are not accessible, generates a warning that comms was lost to the UPSes

Should simply ignore them...

Warning:
  Connection to the UPS: [an-ups01.alteeve.ca] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored. 
         Graceful shutdown in the case of a power loss may not function until 
         communication is restored!

Warning:
  Connection to the UPS: [an-ups02.alteeve.ca] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored. 
         Graceful shutdown in the case of a power loss may not function until 
         communication is restored!
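One way to read "simply ignore them" is to warn only when a UPS that was previously reachable becomes unreachable. The following is a minimal sketch of that rule in Python, not ScanCore's actual Perl code; the function name and the "ok"/"lost" state strings are hypothetical.

```python
def should_warn_ups_lost(previous_state, current_state):
    """Return True only when a UPS that was reachable before is now lost.

    previous_state is None when no earlier scan has recorded this UPS,
    which is the first-run case that should stay silent.
    """
    if current_state != "lost":
        return False
    return previous_state == "ok"
```

With this rule, a first run against unreachable UPSes records the state without alerting, and a later loss of a known-good UPS still raises the warning.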

Switch corosync/cman back to rrp=none

00:56 @fabbione digimer: the error means something in kernel is still holding the socket (DLM) and the socket has not been cleaned properly
00:57 @fabbione digimer: basically, the only way out is to reboot
00:57 @fabbione digimer: also, you shouldn't be using SCTP in the first place.
00:57 @fabbione DLM and SCTP don't work well together because of some missing RFC/drafts in SCTP implementations
01:00 < digimer> well fuck
01:00 < digimer> isn't sctp required for rrp?
01:00 @fabbione yes but we don't support DLM on rrp due to the kernel limitations of SCTP

;_;

DLM hanging but corosync working causes total cluster hang

An example would be a storage failure blocking DLM lock-space but leaving corosync operational. Possibly able to reproduce by intentionally disconnecting storage from a node.

ScanCore could spawn a child process that tries to call 'clustat' and if it doesn't respond within a timeout, calls 'echo c > /proc/sysrq-trigger'. Note that this would have to happen inside ScanCore proper as agents may not be loadable.

Alternatively, ScanCore could try to ssh into the peer after the timeout and, if it cannot log in (as may be the case with failed storage), fence the peer.
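The first option (a child process with a timeout) could look roughly like the sketch below. It is written in Python for illustration only; ScanCore itself is Perl, and the function names and the 30 second default are assumptions. The child is not waited on after the timeout, because a process blocked on hung storage may never exit.

```python
import subprocess
import sys


def command_responds(cmd, timeout):
    """Run cmd and report whether it exited within timeout seconds."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # Best-effort kill only. We deliberately do not wait for the child,
        # because a process blocked on hung storage may never exit.
        proc.kill()
        return False


def crash_node():
    """Equivalent of 'echo c > /proc/sysrq-trigger'. This crashes the machine."""
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write("c")


def watchdog(timeout=30):
    """Crash the node if 'clustat' does not return in time (timeout is an assumption)."""
    if not command_responds(["clustat"], timeout):
        crash_node()


if __name__ == "__main__":
    # Harmless demonstration using the Python interpreter instead of clustat.
    print(command_responds([sys.executable, "-c", "pass"], 10))
```

Any exit of the command, successful or not, counts as a response here; only a hang triggers the crash path.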

Show dates and md5sums on files in the media library

This will be useful when uploading multiple ISO files, or different versions of the same file (i.e. the ISO file name is the same for each version) with different modifications, for example to the .ks file or the packages.
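As a rough illustration of the data the media library would need to show, the sketch below collects the modification date and md5sum for each file in a directory. It is Python rather than Striker's Perl, and the function names and the idea of passing the library directory as a parameter are assumptions.

```python
import hashlib
import os
from datetime import datetime


def md5sum(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so that large ISO files are not read into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_media(directory):
    """Return (name, modified date, md5sum) for each regular file, sorted by name."""
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        modified = datetime.fromtimestamp(os.path.getmtime(path))
        rows.append((name, modified.strftime("%Y-%m-%d %H:%M:%S"), md5sum(path)))
    return rows
```

Hashing multi-gigabyte ISO files takes time, so a real implementation would probably want to cache the sum and recompute it only when the file's size or date changes.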

Make guacamole restart smarter

Occasionally, the 'restart' fails to start the service. When this happens, sleep and try again. Repeat up to X times before giving up.
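The retry loop described above can be sketched as follows. This is a Python illustration, not the project's code; the function name, the defaults for the number of tries and the delay, and the convention that the restart callable returns True on success are all assumptions.

```python
import time


def restart_with_retries(restart, max_tries=5, delay=2.0, sleep=time.sleep):
    """Call restart() until it returns True or max_tries is reached.

    Returns the attempt number that succeeded, or None if every try failed.
    """
    for attempt in range(1, max_tries + 1):
        if restart():
            return attempt
        if attempt < max_tries:
            # Give the service a moment to settle before trying again.
            sleep(delay)
    return None
```

The `restart` callable would wrap whatever command actually restarts the service; the issue does not name it, so it is left to the caller.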

Add a boot ordering function for servers when 'anvil-safe-start' is used

Many clients have servers that need to be booted in a certain order and with set delays.

Support in striker.conf only, no support in Striker v2.0's WebUI.

Striker.conf:

# A user may wish for servers to boot in a certain order and with possible
# delays between machine boots. This section allows for this configuration.
# 
# Server boot ordering is configured by defining each server with a unique
# variable integer. This integer has no bearing on the boot order, it simply
# provides a way to distinguish between entries.
# 
# The values are in the form: "(boot order):(delay):(server name)".
# 
# Any undefined server will be booted as if it were set to '1:0:x' and, thus,
# will boot immediately.
# 
# One or more servers can have the same 'boot order' value. All servers in the
# lower boot order must have booted before the next boot order is evaluated.
# Once a server is ready for boot, the 'delay' will be checked. If a delay is
# set, it will sleep for the defined number of seconds before actually
# starting.
# 
# So in the example set below:
#1. vm01-foo and vm02-bar will be booted without delay.
#2. vm04-bang will boot without delay once set #1 servers have booted, but
#    vm03-baz will wait 30 seconds before booting.
#3. vm05-boop will boot 60 seconds after set #2 servers have booted.
# 
# If a given server fails to boot, the servers in subsequent sets will not
# boot.
#server::boot_order::vm01-foo           =   1:0
#server::boot_order::vm02-bar           =   1:0
#server::boot_order::vm03-baz           =   2:30
#server::boot_order::vm04-bang          =   2:0
#server::boot_order::vm05-boop          =   3:60
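To make the intended grouping concrete, the sketch below parses uncommented entries in the form used by the example lines above ('server::boot_order::<name> = <order>:<delay>') and groups them into boot sets. It is a Python illustration only; the function name and the returned structure are assumptions, not Striker's implementation.

```python
import re
from collections import defaultdict

# Matches e.g. "server::boot_order::vm03-baz           =   2:30"
ENTRY = re.compile(r"^server::boot_order::(\S+)\s*=\s*(\d+):(\d+)\s*$")


def parse_boot_order(lines):
    """Return [(order, [(server, delay_seconds), ...]), ...] sorted by order."""
    stages = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # comments, blank lines and unrelated settings
        name = match.group(1)
        order = int(match.group(2))
        delay = int(match.group(3))
        stages[order].append((name, delay))
    return [(order, stages[order]) for order in sorted(stages)]
```

A boot routine would then walk the sets in order, start each server after its delay, and stop if any server in the current set fails to boot.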

Node Manifest installation stalls on creating partition, attempts to create 0GB drive

The Stage 2 Node install from a single striker stalls if a VM partition is already created.

Drive layout prior to running the manifest:

Number  Start   End     Size    Type      File system     Flags
 1      1049kB  538MB   537MB   primary   ext4            boot
 2      538MB   43.5GB  42.9GB  primary   ext4
 3      43.5GB  47.8GB  4295MB  primary   linux-swap(v1)
 4      47.8GB  1497GB  1449GB  extended                  lba
 5      47.8GB  1497GB  1449GB  logical

striker.log output:

13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Model: FTS PRAID EP420i (scsi)]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Disk /dev/sda: 1394GiB]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Sector size (logical/physical): 512B/512B]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Partition Table: msdos]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: []
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Number Start End Size Type File system Flags]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [0.00GiB 0.00GiB 0.00GiB Free Space]
13:09:03 AN::InstallManifest.pm 7686; start: [0.00], end: [0.00], size: [0.00]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [1 0.00GiB 0.50GiB 0.50GiB primary ext4 boot]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [2 0.50GiB 40.5GiB 40.0GiB primary ext4]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [3 40.5GiB 44.5GiB 4.00GiB primary linux-swap(v1)]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [4 44.5GiB 1394GiB 1350GiB extended lba]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [5 44.5GiB 1394GiB 1350GiB logical]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: []
13:09:03 AN::InstallManifest.pm 7691; node: [10.20.10.201], disk: [sda], start: [0.00], end: [0.00].
13:09:03 AN::InstallManifest.pm 7712; node: [10.20.10.201], disk: [sda], type: [extended], partition_size: [all]
13:09:03 AN::InstallManifest.pm 7739; snode: [10.20.10.201], disk: [sda], type: [extended], start: [0.00 GiB], end: [0.00 GiB]
13:09:03 AN::InstallManifest.pm 7746; shell_call: [parted -a opt /dev/sda mkpart extended 0.00GiB 100%]
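In the log above, the only "Free Space" row is the 0.00GiB one, and its start and end values appear to be what end up in the 'mkpart' call. A possible guard, sketched here in Python rather than the project's Perl, is to ignore free-space rows below a minimum size. The function name and the 0.01 GiB threshold are assumptions.

```python
import re

# Matches rows such as "0.00GiB 0.00GiB 0.00GiB Free Space"
FREE_ROW = re.compile(
    r"^\s*([\d.]+)GiB\s+([\d.]+)GiB\s+([\d.]+)GiB\s+Free Space\s*$"
)


def usable_free_regions(rows, min_size_gib=0.01):
    """Return (start, end, size) in GiB for free regions worth partitioning."""
    regions = []
    for row in rows:
        match = FREE_ROW.match(row)
        if match is None:
            continue  # partition rows, headers and blank lines
        start, end, size = (float(value) for value in match.groups())
        if size >= min_size_gib:
            regions.append((start, end, size))
    return regions
```

An empty result would then mean "nothing to create", so the installer could skip the 'mkpart' step instead of requesting a partition starting at 0.00GiB.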

Steps to recreate:

Initialize the Anvil! from the dashboard on nodes that already have a partition for VMs created.

Removing the partitions causes the installer to create them from scratch, and circumvents the issue.

If there is no network connection when apcupsd starts, the daemon needs to be restarted

If an-cm doesn't get anything useful out of apcaccess, try restarting the daemon once every X seconds or Y checks.

Example:

[root@pp-c01n02 ~]# apcaccess
APC : 001,018,0471
DATE : 2012-08-23 17:24:26 -0400
HOSTNAME : pp-c01n02.plasticplus.ca
VERSION : 3.14.10 (13 September 2011) redhat
UPSNAME : pp-u01
CABLE : Ethernet Link
DRIVER : SNMP UPS Driver
UPSMODE : Stand Alone
STARTTIME: 2012-08-23 12:01:08 -0400
STATUS :
MBATTCHG : 0 Percent
MINTIMEL : 0 Minutes
MAXTIME : 0 Seconds
NUMXFERS : 0
TONBATT : 0 seconds
CUMONBATT: 0 seconds
XOFFBATT : N/A
STATFLAG : 0x07000000 Status Flag
END APC : 2012-08-23 17:24:29 -0400

[root@pp-c01n02 ~]# /etc/init.d/apcupsd restart
Shutting down UPS monitoring (apcupsd.ups0.conf): [ OK ]
Shutting down UPS monitoring (apcupsd.ups1.conf): [ OK ]
Starting UPS monitoring (apcupsd.ups0.conf): [ OK ]
Starting UPS monitoring (apcupsd.ups1.conf): [ OK ]

[root@pp-c01n02 ~]# apcaccess
APC : 001,050,1252
DATE : 2012-08-23 17:25:03 -0400
HOSTNAME : pp-c01n02.plasticplus.ca
VERSION : 3.14.10 (13 September 2011) redhat
UPSNAME : APCUPS
CABLE : Ethernet Link
DRIVER : SNMP UPS Driver
UPSMODE : Stand Alone
STARTTIME: 2012-08-23 17:25:03 -0400
MODEL : Smart-UPS 1500
STATUS : ONLINE
LINEV : 118.0 Volts
LOADPCT : 20.0 Percent Load Capacity
BCHARGE : 100.0 Percent
TIMELEFT : 66.0 Minutes
MBATTCHG : 0 Percent
MINTIMEL : 0 Minutes
MAXTIME : 0 Seconds
MAXLINEV : 119.0 Volts
MINLINEV : 118.0 Volts
OUTPUTV : 118.0 Volts
SENSE : High
DWAKE : 000 Seconds
DSHUTD : 000 Seconds
DLOWBATT : 02 Minutes
LOTRANS : 106.0 Volts
HITRANS : 127.0 Volts
RETPCT : 34779152.0 Percent
ITEMP : 25.0 C Internal
ALARMDEL : 30 seconds
BATTV : 27.0 Volts
LINEFREQ : 60.0 Hz
LASTXFER : Line voltage notch or spike
NUMXFERS : 0
TONBATT : 0 seconds
CUMONBATT: 0 seconds
XOFFBATT : N/A
SELFTEST : OK
STESTI : OFF
STATFLAG : 0x07000008 Status Flag
MANDATE : 06/08/2012
SERIALNO : xxxxxxxxxxxxx
BATTDATE : 06/08/2012
NOMOUTV : 120 Volts
NOMBATTV : 34779152.0 Volts
HUMIDITY : 2971549696.0 Percent
AMBTEMP : 2971549696.0 C
EXTBATTS : 34779152
BADBATTS : -1323417600
FIRMWARE : UPS 08.3 / MCU 14.0
END APC : 2012-08-23 17:25:06 -0400
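The difference between the two outputs above is that STATUS is empty before the restart and ONLINE after it. A check along those lines, with a restart every Y failed checks, could be sketched as below. This is Python for illustration, not an-cm's code; the function names and the default of 5 checks are assumptions.

```python
def parse_apcaccess(output):
    """Turn 'KEY : value' lines from apcaccess into a dict of strings."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, since dates contain colons too.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def looks_useful(fields):
    """Treat a missing or empty STATUS as 'nothing useful from apcaccess'."""
    return bool(fields.get("STATUS"))


def should_restart(fields, failed_checks, restart_every=5):
    """Restart the daemon on every restart_every-th consecutive failed check."""
    if looks_useful(fields):
        return False
    return failed_checks > 0 and failed_checks % restart_every == 0
```

When `should_restart` returns True, the caller would run the restart shown in the example ('/etc/init.d/apcupsd restart') and keep counting failed checks until STATUS is populated again.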

Possible reboot bug when running Install Manifest

Possible reproducer:

  1. Run the install with IPMI + PDU fencing on a pair of nodes so that install fails when it hits the IPMI LAN.
  2. Delete cluster.conf on both nodes, fix install manifest on striker.
  3. Reboot nodes from their command line (to get the new IPs up)
  4. Reload the manifest from the start, using the updated IPs.
  5. Reboot of node 1 occurred, but Striker failed to reconnect.
  6. Restart the manifest run using the 'Reboot' button after the error message.
  7. Note that both nodes appear to not need a reboot, but the DRBD will fail to attach r0 on node 2 (likely because it didn't get rebooted so the partition table wasn't updated)
