clusterlabs / striker
The Anvil! Intelligent Availability™ Platform - Striker UI and ScanCore decision engine
Home Page: https://www.alteeve.com/w/Anvil!
Something appears to have broken Media Library uploads...
There should be a button on the Cluster page that enables remote support. This should allow Alteeve, only when you click it, to remote into your dashboard and, by proxy, assist you with your clusters.
If so, fix.
For example, if 'clustat' fails to return because DLM is hung, the user will not be able to load the Anvil!'s control panel. Provide a mechanism to allow the user to fence both nodes to restart the entire cluster stack.
The delete function only deletes the Anvil! from the local dashboard, and then it gets put back when striker-merge-dashboards is run. Have it delete from both dashboards at the same time.
Unsure how to handle the case where one dashboard is offline at the time... For now, we'll probably ignore that use case. In Striker v3, all this will move into ScanCore's DB which can handle cases like this.
Remember to fix before v1.1.5
Some pages have a "back" arrow button and others don't. It would be good to have a "home" button/link on ALL pages so the user can restart navigation from the top.
Don't want the user to think there is a problem when, for example, anvil-safe-start is holding for a rapid resync to complete.
This causes Striker to think the node name for node1 is 'node1 node2' instead of splitting, breaking things pretty badly.
That is, LVs on the nodes that are not connected to any existing server.
The web UI has to be cleaned up to remove the attempts to offer the user in-browser VNC access to a server that has just been configured.
Currently, the "Oh No!" icon appears. guacd/tomcat6 were removed because guac wouldn't support "spice".
Currently, if the user is logged into gnome when it is run for the first time, the settings are wiped out when the user logs out. Find a way to get virt-manager to re-read the files (or find some other way to protect the changes).
Remember to fix before 1.1.5
It looks like the big shell call at /var/www/cgi-bin/lib/AN/Cluster.pm line 8613 isn't running.
No name is set. Check for IPs with no hostnames and skip.
Take a stock RHEL or CentOS 6 ISO and use it as input to generate an Anvil! install ISO to avoid running afoul of trademarks.
Title says all
Currently, various colors such as "purple", "blue" and "green" are used across the web pages. It's not always clear where the links are (for example, "turn the LED on" is a link that performs an action).
It looks like the "purple" colored words/phrases are links. Maybe making them bold so they stand out would be enough.
Reported happening by a user. Add a check to verify the new XML has a closing element and roll-back if not.
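A minimal sketch of the check requested above, written in Python for illustration rather than in Striker's own Perl. The function name write_definition_safely and the ".bak" backup file are assumptions, not part of Striker. The idea is to parse the new XML before touching the existing file, so a document missing its closing element is rejected and the old file stays in place.

```python
import os
import shutil
import tempfile
import xml.etree.ElementTree as ET


def write_definition_safely(path, new_xml):
    """Replace the file at 'path' with 'new_xml' only if it parses cleanly.

    Returns True if the file was updated, False if the new XML was rejected
    and the existing file was left untouched. (Hypothetical helper.)
    """
    try:
        # A truncated document (no closing element) raises ParseError here.
        ET.fromstring(new_xml)
    except ET.ParseError:
        return False

    # Keep a copy of the old file so a manual roll-back stays possible.
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")

    # Write to a temporary file in the same directory, then rename it over
    # the original so a crash mid-write cannot leave a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(new_xml)
    os.replace(tmp_path, path)
    return True
```

Parsing checks well-formedness only; it does not confirm that libvirt would accept the definition.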
Should simply ignore them...
Warning:
Connection to the UPS: [an-ups01.alteeve.ca] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored.
Graceful shutdown in the case of a power loss may not function until
communication is restored!
Warning:
Connection to the UPS: [an-ups02.alteeve.ca] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored.
Graceful shutdown in the case of a power loss may not function until
communication is restored!
Likewise, if doing a cold-shutdown, make sure the SyncTarget node goes down first. Not doing this causes DRBD to complain and not exit when rgmanager is asked to stop.
Title says all
What it says on the tin.
A quick SSH based check/call should fix this.
Probably a simple issue with either the vm.sh written to the ISO or the kickstart call to grab it.
00:56 @fabbione digimer: the error means something in kernel is still holding the socket (DLM) and the socket has not been cleaned properly
00:57 @fabbione digimer: basically, the only way out is to reboot
00:57 @fabbione digimer: also, you shouldn't be using SCTP in the first place.
00:57 @fabbione DLM and SCTP don't work well together because of some missing RFC/drafts in SCTP implementations
01:00 < digimer> well fuck
01:00 < digimer> isn't sctp required for rrp?
01:00 @fabbione yes but we don't support DLM on rrp due to the kernel limitations of SCTP
;_;
An example would be a storage failure blocking DLM lock-space but leaving corosync operational. Possibly able to reproduce by intentionally disconnecting storage from a node.
ScanCore could spawn a child process that tries to call 'clustat' and if it doesn't respond within a timeout, calls 'echo c > /proc/sysrq-trigger'. Note that this would have to happen inside ScanCore proper as agents may not be loadable.
Alternatively, ScanCore could try to ssh into the peer after the timeout and if it can't login (as may be the case with failed storage) fence the peer.
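A rough sketch of the timeout idea above, in Python for illustration only (ScanCore itself is Perl). The names command_responds and check_with_watchdog are hypothetical, and the action taken on a hang is passed in as a callback so the sketch does not hard-code the sysrq crash or a fence call. The child is deliberately not waited on after the kill, on the assumption that a process blocked on a hung DLM may never exit.

```python
import subprocess


def command_responds(command, timeout_seconds):
    """Return True if 'command' exits within the timeout, False otherwise."""
    try:
        child = subprocess.Popen(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The command could not be started at all.
        return False
    try:
        child.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Request a kill but do not wait for it: a child stuck in
        # uninterruptible sleep would block this process as well.
        child.kill()
        return False
    return True


def check_with_watchdog(command, timeout_seconds, on_hang):
    """Run 'command'; call 'on_hang()' if it does not return in time."""
    if command_responds(command, timeout_seconds):
        return True
    on_hang()
    return False
```

In the scenario described, command would be ['clustat'] and on_hang would be whichever recovery action is chosen. Note that this sketch also treats a command that cannot be started as a hang, which a real implementation would probably want to distinguish.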
What it says on the tin. Easy to test... Add and then remove a VM from cluster.conf. Doesn't have to actually exist.
This will be useful when uploading multiple ISO files, or the same file (i.e. the ISO file name is the same for each version) with different modifications, for example to the .ks file or the packages.
Occasionally, the 'restart' fails to start. When this happens, sleep and try again. Repeat up to X times before giving up.
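The sleep-and-retry behaviour described above could look roughly like this. It is a Python sketch for illustration, not Striker code; the function name retry and its parameters are assumptions, with attempts standing in for the "X times" in the request.

```python
import time


def retry(action, attempts, delay_seconds, sleep=time.sleep):
    """Call 'action' until it returns True or 'attempts' calls were made.

    Returns the number of the attempt that succeeded, or 0 if every
    attempt failed. (Hypothetical helper.)
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        # Only sleep when another attempt will follow.
        if attempt < attempts:
            sleep(delay_seconds)
    return 0
```

The sleep function is a parameter so the delay can be replaced in tests; action would wrap whatever call performs the restart and reports whether it came up.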
If someone tries to run ./striker-installer directly, without using the install ISO, it will fail because the AN!Repo isn't installed.
Many clients have servers that need to be booted in a certain order and with set delays.
Support in striker.conf only, no support in Striker v2.0's WebUI.
Striker.conf:
# A user may wish for servers to boot in a certain order and with possible
# delays between machine boots. This section allows for this configuration.
#
# Server boot ordering is configured by defining each server with a unique
# variable integer. This integer has no bearing on the boot order, it simply
# provides a way to distinguish between entries.
#
# The values are in the form: "(boot order):(delay):(server name)".
#
# Any undefined server will be booted as if it were set to '1:0:x' and, thus,
# will boot immediately.
#
# One or more servers can have the same 'boot order' value. All servers in the
# lower boot order must boot before the next boot order is evaluated. Once a
# server is ready for boot, the 'delay' will be checked. If a delay is set,
# it will sleep for the defined number of seconds before actually starting.
#
# So in the example set below;
#1. vm01-foo and vm02-bar will be booted without delay
#2. vm04-bang will boot once set #1 servers have booted without delay, but
# vm03-baz will wait 30 seconds before booting.
#3. vm05-boop will boot once set #2 servers have booted.
#
# If a given server fails to boot, the servers in subsequent sets will not
# boot.
#server::boot_order::vm01-foo = 1:0
#server::boot_order::vm02-bar = 1:0
#server::boot_order::vm03-baz = 2:30
#server::boot_order::vm04-bang = 2:0
#server::boot_order::vm05-boop = 3:60
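To illustrate how entries like the ones above might be read, here is a small Python sketch (Striker itself is Perl, and parse_boot_order is a hypothetical name). It assumes the 'server::boot_order::<name> = <order>:<delay>' form used in the example lines, once uncommented, and groups servers into sets by boot order with the shortest delay first inside each set.

```python
import re
from collections import defaultdict

# Matches e.g. "server::boot_order::vm03-baz = 2:30"; commented-out lines
# (starting with '#') do not match and are skipped.
BOOT_ORDER_LINE = re.compile(
    r"^\s*server::boot_order::(?P<name>\S+)\s*=\s*(?P<order>\d+):(?P<delay>\d+)\s*$"
)


def parse_boot_order(config_text):
    """Return [(order, [(server_name, delay_seconds), ...]), ...]."""
    groups = defaultdict(list)
    for line in config_text.splitlines():
        match = BOOT_ORDER_LINE.match(line)
        if match is None:
            continue  # comments, blank lines, unrelated settings
        order = int(match.group("order"))
        groups[order].append((match.group("name"), int(match.group("delay"))))
    # Lowest boot order first; within a set, shortest delay first.
    return [
        (order, sorted(groups[order], key=lambda server: server[1]))
        for order in sorted(groups)
    ]
```

A caller would walk the returned list in order, start every server in a set after its delay, and stop if any server in the set fails to boot, as the comments above describe.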
Currently in the web UI, each disk's status is shown as a block with complete details. It would be better to show just the drive state, and maybe the temperature, of all the drives in a table, with links to a display that gives all the pertinent details such as error counts.
Hit IPMI specifically on 1.1.6.
The Stage 2 Node install from a single striker stalls if a VM partition is already created.
Drive layout prior to running the manifest:
Number  Start   End     Size    Type      File system     Flags
 1      1049kB  538MB   537MB   primary   ext4            boot
 2      538MB   43.5GB  42.9GB  primary   ext4
 3      43.5GB  47.8GB  4295MB  primary   linux-swap(v1)
 4      47.8GB  1497GB  1449GB  extended                  lba
 5      47.8GB  1497GB  1449GB  logical
striker.log output:
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Model: FTS PRAID EP420i (scsi)]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Disk /dev/sda: 1394GiB]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Sector size (logical/physical): 512B/512B]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Partition Table: msdos]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: []
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [Number Start End Size Type File system Flags]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [0.00GiB 0.00GiB 0.00GiB Free Space]
13:09:03 AN::InstallManifest.pm 7686; start: [0.00], end: [0.00], size: [0.00]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [1 0.00GiB 0.50GiB 0.50GiB primary ext4 boot]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [2 0.50GiB 40.5GiB 40.0GiB primary ext4]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [3 40.5GiB 44.5GiB 4.00GiB primary linux-swap(v1)]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [4 44.5GiB 1394GiB 1350GiB extended lba]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: [5 44.5GiB 1394GiB 1350GiB logical]
13:09:03 AN::InstallManifest.pm 7680; node: [10.20.10.201], disk: [sda], return: []
13:09:03 AN::InstallManifest.pm 7691; node: [10.20.10.201], disk: [sda], start: [0.00], end: [0.00].
13:09:03 AN::InstallManifest.pm 7712; node: [10.20.10.201], disk: [sda], type: [extended], partition_size: [all]
13:09:03 AN::InstallManifest.pm 7739; snode: [10.20.10.201], disk: [sda], type: [extended], start: [0.00 GiB], end: [0.00 GiB]
13:09:03 AN::InstallManifest.pm 7746; shell_call: [parted -a opt /dev/sda mkpart extended 0.00GiB 100%]
Steps to recreate:
Initialize an Anvil! from the dashboard on nodes that already have a partition for VMs created.
Removing the partitions causes the installer to create them from scratch, and circumvents the issue.
Regression introduced in the last day... Find and fix.
Adapt the Cluster.pm -> record() function to take log levels. Set the default log level to 1 in striker.conf.
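As a sketch of what level-aware logging could look like, shown in Python for illustration (the real record() lives in Cluster.pm and is Perl). The names make_recorder and configured_level are assumptions; configured_level stands in for the default of 1 that would be read from striker.conf.

```python
import sys


def make_recorder(configured_level=1, stream=sys.stderr):
    """Build a record() function that honours a configured log level."""

    def record(message, level=1):
        # Messages above the configured level are dropped.
        if level > configured_level:
            return False
        stream.write(message + "\n")
        return True

    return record
```

With the default level of 1, a call such as record("debug detail", level=2) is dropped, while record("node joined") is written.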
If an-cm doesn't get anything useful out of apcaccess, try restarting the daemon once every X seconds or Y checks.
Example:
[root@pp-c01n02 ~]# apcaccess
APC : 001,018,0471
DATE : 2012-08-23 17:24:26 -0400
HOSTNAME : pp-c01n02.plasticplus.ca
VERSION : 3.14.10 (13 September 2011) redhat
UPSNAME : pp-u01
CABLE : Ethernet Link
DRIVER : SNMP UPS Driver
UPSMODE : Stand Alone
STARTTIME: 2012-08-23 12:01:08 -0400
STATUS :
MBATTCHG : 0 Percent
MINTIMEL : 0 Minutes
MAXTIME : 0 Seconds
NUMXFERS : 0
TONBATT : 0 seconds
CUMONBATT: 0 seconds
XOFFBATT : N/A
STATFLAG : 0x07000000 Status Flag
END APC : 2012-08-23 17:24:29 -0400
[root@pp-c01n02 ~]# /etc/init.d/apcupsd restart
Shutting down UPS monitoring (apcupsd.ups0.conf): [ OK ]
Shutting down UPS monitoring (apcupsd.ups1.conf): [ OK ]
Starting UPS monitoring (apcupsd.ups0.conf): [ OK ]
Starting UPS monitoring (apcupsd.ups1.conf): [ OK ]
[root@pp-c01n02 ~]# apcaccess
APC : 001,050,1252
DATE : 2012-08-23 17:25:03 -0400
HOSTNAME : pp-c01n02.plasticplus.ca
VERSION : 3.14.10 (13 September 2011) redhat
UPSNAME : APCUPS
CABLE : Ethernet Link
DRIVER : SNMP UPS Driver
UPSMODE : Stand Alone
STARTTIME: 2012-08-23 17:25:03 -0400
MODEL : Smart-UPS 1500
STATUS : ONLINE
LINEV : 118.0 Volts
LOADPCT : 20.0 Percent Load Capacity
BCHARGE : 100.0 Percent
TIMELEFT : 66.0 Minutes
MBATTCHG : 0 Percent
MINTIMEL : 0 Minutes
MAXTIME : 0 Seconds
MAXLINEV : 119.0 Volts
MINLINEV : 118.0 Volts
OUTPUTV : 118.0 Volts
SENSE : High
DWAKE : 000 Seconds
DSHUTD : 000 Seconds
DLOWBATT : 02 Minutes
LOTRANS : 106.0 Volts
HITRANS : 127.0 Volts
RETPCT : 34779152.0 Percent
ITEMP : 25.0 C Internal
ALARMDEL : 30 seconds
BATTV : 27.0 Volts
LINEFREQ : 60.0 Hz
LASTXFER : Line voltage notch or spike
NUMXFERS : 0
TONBATT : 0 seconds
CUMONBATT: 0 seconds
XOFFBATT : N/A
SELFTEST : OK
STESTI : OFF
STATFLAG : 0x07000008 Status Flag
MANDATE : 06/08/2012
SERIALNO : xxxxxxxxxxxxx
BATTDATE : 06/08/2012
NOMOUTV : 120 Volts
NOMBATTV : 34779152.0 Volts
HUMIDITY : 2971549696.0 Percent
AMBTEMP : 2971549696.0 C
EXTBATTS : 34779152
BADBATTS : -1323417600
FIRMWARE : UPS 08.3 / MCU 14.0
END APC : 2012-08-23 17:25:06 -0400
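The two outputs above differ most visibly in the STATUS line, which is empty before the restart and ONLINE after it. A minimal Python sketch of a check built on that observation follows; the function names are hypothetical, and using an empty STATUS as the test for "nothing useful" is an assumption drawn from this one example, not an-cm's actual logic.

```python
def parse_apcaccess(output):
    """Turn 'KEY : value' lines from apcaccess output into a dict."""
    values = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, separator, value = line.partition(":")
        if separator:
            values[key.strip()] = value.strip()
    return values


def needs_restart(output):
    """Return True when the output carries no usable UPS status."""
    return parse_apcaccess(output).get("STATUS", "") == ""
```

A caller would count consecutive True results or elapsed seconds and restart apcupsd once the chosen threshold (the X seconds or Y checks above) is reached.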
Possible reproducer;
Example:
<interface type='bridge'>
<mac address='52:54:00:8e:ea:75'/>
<source bridge='ifn_bridge1'/>
<target dev='vnet0'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
<driver name='qemu'/>
<driver name='qemu'/>
</interface>
In the menu used to attach a DVD or CD ISO to a VM, if the VM's name contains any mixed case, the ISO is displayed as attached but won't actually be.
User reports that node 2's PDU fencing was left as 'reboot'.