I have some MegaRAID controllers which are returning the following metrics:
megaraid_healthy 0 <== there's a problem
megaraid_failed 0
megaraid_degraded 0
megaraid_battery_backup_healthy 1
This is odd: the controller says it needs attention, but it's not obvious why.
On closer inspection: storcli.py returns battery_backup_healthy 1 if the BBU state is 0 or 32. I'm getting 32, and the battery is also "Degraded":
# /opt/MegaRAID/storcli/storcli64 /cALL show all J | less
...
"Status" : {
==> "Controller Status" : "Needs Attention",
"Memory Correctable Errors" : 0,
"Memory Uncorrectable Errors" : 0,
"ECC Bucket Count" : 0,
"Any Offline VD Cache Preserved" : "No",
==> "BBU Status" : 32,
"PD Firmware Download in progress" : "No",
"Support PD Firmware Download" : "No",
"Lock Key Assigned" : "No",
"Failed to get lock key on bootup" : "No",
"Lock key has not been backed up" : "No",
"Bios was not detected during boot" : "No",
"Controller must be rebooted to complete security operation" : "No",
"A rollback operation is in progress" : "No",
"At least one PFK exists in NVRAM" : "No",
"SSC Policy is WB" : "No",
"Controller has booted into safe mode" : "No",
"Controller shutdown required" : "No"
},
...
"BBU_Info" : [
{
"Model" : "iBBU",
==> "State" : "Dgd (Needs Attention)",
"RetentionTime" : "48 hours +",
"Temp" : "29C",
"Mode" : "-",
"MfgDate" : "2014/02/10",
"Next Learn" : "2019/06/27 01:33:42"
}
]
My best guess is that the controller "Needs Attention" because of the battery status, but I can't find documentation for what status=32 means. Can you point to some info which says that 32 is healthy?
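To make the mismatch concrete, here is a minimal sketch that puts the two signals side by side. It assumes the check in storcli.py amounts to a membership test on the values 0 and 32, as described above, and that "Optimal" is the only healthy State string (an assumption based on the Cachevault_Info output I have from a healthy controller). The function names and the trimmed sample dict are hypothetical and only mirror the JSON excerpt shown here; this is not the actual storcli.py code.

```python
# Hypothetical sketch, not the actual storcli.py code.
HEALTHY_BBU_STATUSES = (0, 32)  # values described above as being treated as healthy


def bbu_healthy_by_status(status):
    """Mirror the rule described above: 1 if "BBU Status" is 0 or 32, else 0."""
    return int(status in HEALTHY_BBU_STATUSES)


def bbu_state_needs_attention(bbu_info):
    """True if any BBU_Info entry has a State other than "Optimal".

    Assumption: "Optimal" is the only healthy State value.
    """
    return any(entry.get("State") != "Optimal" for entry in bbu_info)


# Trimmed copy of the values from the storcli output above.
sample = {
    "Status": {"Controller Status": "Needs Attention", "BBU Status": 32},
    "BBU_Info": [{"Model": "iBBU", "State": "Dgd (Needs Attention)"}],
}

print(bbu_healthy_by_status(sample["Status"]["BBU Status"]))  # 1
print(bbu_state_needs_attention(sample["BBU_Info"]))          # True
```

With these inputs the numeric status passes the health check while the State string says the battery is degraded, which is the contradiction I'm asking about.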
For comparison, here's what MegaCLI says on the same controller:
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 4014 mV
Current: 0 mA
Temperature: 29 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 75 %
Charger System State: 49169
Charger System Ctrl: 0
Charging current: 512 mA
Absolute state of charge: 77 %
Max Error: 9 %
Exit Code: 0x00
Perhaps 32 means "manual learn is required"? But in that case, I'd say it's not "healthy", in the sense that some attention is required.
On another controller, which is healthy, the BBU state is 0. This one has Cachevault_Info rather than BBU_Info:
"Cachevault_Info" : [
{
"Model" : "CVPM02",
"State" : "Optimal",
"Temp" : "30C",
"Mode" : "-",
"MfgDate" : "2014/05/30"
}
]
(Aside 1: storcli.py provides a metric megaraid_cv_temperature for the temperature from Cachevault_Info, but not for the temperature from BBU_Info.)
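As a rough illustration of what exporting the BBU temperature could look like, here is a minimal sketch that parses the "Temp" string from a BBU_Info entry and formats a sample line. The metric name megaraid_bbu_temperature is only a proposal, and the controller label, the helper names, and the output format are assumptions for illustration; I have not checked how storcli.py labels its existing metrics.

```python
import re


def parse_temp_celsius(temp):
    """Return the integer Celsius value from a string like "29C", or None."""
    match = re.fullmatch(r"\s*(-?\d+)\s*C\s*", str(temp))
    return int(match.group(1)) if match else None


def bbu_temperature_lines(controller_index, bbu_info):
    """Build sample lines for the proposed megaraid_bbu_temperature metric.

    The "controller" label is hypothetical.
    """
    lines = []
    for entry in bbu_info:
        temp = parse_temp_celsius(entry.get("Temp", ""))
        if temp is not None:  # skip entries with a missing or unparseable Temp
            lines.append(
                'megaraid_bbu_temperature{controller="%d"} %d'
                % (controller_index, temp)
            )
    return lines


print(bbu_temperature_lines(0, [{"Model": "iBBU", "Temp": "29C"}]))
```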
On a different controller, which doesn't have a BBU at all, I get megaraid_battery_backup_healthy 0. In other words: it's flagging the battery as "bad" even though the controller is healthy and there's no action required. The JSON for this controller reports the BBU status as "NA".
(Aside 2: in this state I would be inclined to drop the megaraid_battery_backup_healthy metric entirely. Otherwise we get a false alarm about a bad battery, especially since there's no other metric saying whether the BBU is present or not. On the other hand, I can suppress this alarm if megaraid_healthy is 1, which it is.)
So in summary:
- Can anyone confirm what BBU status 32 means?
- Is it correct for storcli.py to report the battery as "healthy" in this condition, even though the overall controller health is "needs attention"?
- Should we return BBU_Info temperature as a different metric, e.g. megaraid_bbu_temperature?
- Should we suppress the megaraid_battery_backup_healthy metric if the BBU is not present (status="NA")? Or have a different metric for BBU present/absent?
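
For the last question, one possible shape is sketched below: emit a presence metric in every case, and emit the health metric only when a BBU is actually fitted. This assumes a controller without a BBU reports its BBU status as "NA". The metric name megaraid_battery_backup_present and the function bbu_metrics are hypothetical, and the default healthy set of 0 and 32 simply mirrors the current behaviour described above rather than endorsing it.

```python
def bbu_metrics(bbu_status, healthy_statuses=(0, 32)):
    """Return a dict of metric name -> value for one controller.

    Assumption: a controller without a BBU reports "BBU Status" as "NA".
    megaraid_battery_backup_present is a hypothetical metric name.
    """
    if bbu_status == "NA":
        # No battery fitted: report absence and omit the health metric,
        # so that alerts on "healthy == 0" cannot fire for a missing BBU.
        return {"megaraid_battery_backup_present": 0}
    return {
        "megaraid_battery_backup_present": 1,
        "megaraid_battery_backup_healthy": int(bbu_status in healthy_statuses),
    }


print(bbu_metrics("NA"))
print(bbu_metrics(32))
```

The trade-off is that dashboards expecting megaraid_battery_backup_healthy on every controller would see the series disappear for BBU-less controllers, so the presence metric is there to make that absence explicit.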