
bjoerrrn / storj-system-health.sh


storagenode (storj) tool to inform operators almost immediately about fatal, audit or general errors via discord push message and email alerts.

Home Page: https://discordapp.com/users/371404709262786561

License: GNU General Public License v3.0

Shell 100.00%
docker storj storj-node healthcheck error-reporting discord-bot mail-alerts linux-shell linux

storj-system-health.sh's Introduction

Stand With Ukraine


storj-system-health.sh


this linux/macos shell script checks whether a storj node (from the storj project) runs into errors and alerts the operator via discord push messages as well as emails. it requires at least one storj node running with docker on linux.

features

  • multinode support 🌍
  • optional discord (quick notifications) and/or mail (with error details) alerts 📥 🔔
  • alerts, in case: ⚠️
    • audit, suspension and/or online scores are below a threshold (storj node disqualification risk)
    • audit timeouts are recognized (pending audits; disqualification risk)
    • audit time lags: the time between download started and download finished is larger than 3 mins (storj node disqualification risk)
    • a threshold of repair gets/puts and downloads/uploads is reached (storj node disqualification risk)
    • there was no get/put at all in the last hour (storj node disqualification risk)
    • any other fatal error occurs, incl. issues with docker stability
    • storj node version is outdated
    • the node is offline (docker container not started)
  • reports: 📰
    • disk usage
    • success rates audits, downloads, uploads, repair up-/downloads
    • estimated payouts for today and current month
    • today's upload and download statistics
  • optimized for crontab and command line usage 💻
  • supports redirected logs to a file
  • only requires curl, jq, bc and (optionally) swaks to run 🔥

optimized / tested for

  • debian bullseye 🐧
  • macos monterey 🍎 (jq + swaks installed with brew)

dependencies

  • storj node up and running within a docker container
  • curl (http requests)
  • jq 1.6 ⚠️ (JSON parsing)
  • bc (arbitrary precision calculator)
  • swaks (mail sending, smtp)
  • discord.sh (discord pushes)
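
a quick way to confirm that the tools listed above are on the PATH before the first run. this is a minimal sketch: `missing_tools` is a hypothetical helper name and not part of storj-system-health.sh.

```shell
# Hypothetical helper: print every given command that is not found on PATH.
# Prints an empty line when nothing is missing.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing:+$missing }$tool"
    fi
  done
  printf '%s\n' "$missing"
}

# Example call with the tools named in the dependency list:
#   missing_tools docker curl jq bc swaks
```

swaks is only needed for mail alerts, so an operator who uses discord only can ignore it in the output.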

setting up storj system health

  1. optional: setup a webhook in the desired discord text channel
  2. optional: grab your smtp email authentication data
  3. download (or clone) a copy of discord.sh *
  4. download (or clone) a copy of storj-system-health.sh and storj-system-health.credo **
  5. optional: setup discord and mail variables in storj-system-health.credo
  6. Go nuts 🚀

* wget https://raw.githubusercontent.com/ChaoticWeg/discord.sh/master/discord.sh
** wget https://raw.githubusercontent.com/dusselmann/storj-system-health.sh/main/storj-system-health.sh && wget https://raw.githubusercontent.com/dusselmann/storj-system-health.sh/main/storj-system-health.credo

setting up variables in *.credo

you will need to modify these variables in *.credo for your specific node and smtp mail server configuration. the *.credo file itself must not contain comments or blank lines; the annotations below are only there to explain each variable:

## discord settings
DISCORDON=true          # enables (true) or disables (false) discord pushes
DISCORDURL=https://discord.com/api/webhooks/...
                        # your discord webhook url

## mail settings
MAILON=true             # enables (true) or disables (false) email messages
MAILFROM=""             # your "from:" mail address
MAILTO=""               # your "to:" mail address
MAILSERVER=""           # your smtp server address
MAILUSER=""             # your user name from smtp server
MAILPASS=""             # your password from smtp server

## alerting settings
SATPINGFREQ=3600        # in case satellite scores are below threshold, 
                        # value in seconds, when next alert will be sent earliest
                        
## storj node docker names and urls
NODES=storagenode       # storage node names, multiple: separated with comma, 
                        # e.g. storagenode,storagenode-a,storagenode-b
NODEURLS=localhost:14002
                        # storage node dashboard urls, multiple: separated with comma, 
                        # e.g. localhost:14002,192.168.171.5:14002

## node data mount points
MOUNTPOINTS=/mnt/node   # your storage node mount point, multiple: separated with comma
                        # e.g. /mnt/node,/mnt/node-a,/mnt/node-b
                        # enter 'source' from the docker run command here

## specify redirected logs per node
NODELOGPATHS=/          # put your relative path + log file name here,
                        # in case you've redirected your docker logs with
                        # e.g. config.yaml: 'log.output: "/app/config/node.log"'
                        #  /                       -> for non-redirected logs
                        #  /node.log               -> for single node redirect
                        #  /,/                     -> for 2 nodes with non-redirected logs
                        #  /node1.log,/node2.log   -> for 2 nodes with redirects
                        #  /node.log,/             -> only 1st is redirected
                        #  /mnt/hdd1/node.log      -> full path possible, too

## log selection specifics - in alignment with cronjob settings
LOGMIN=60               # latest log horizon to have a detailed view on, in minutes
                        # -> change this, if your cronjob runs more often than 60m
LOGMAX=720              # larger log horizon for overall statistics, in minutes
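
since the real *.credo file must not contain comments or blank lines, an annotated copy like the one above can be reduced mechanically. the following is a minimal sketch, not part of the project; `strip_credo` and `split_list` are hypothetical helper names, and the second one only illustrates how a comma-separated value such as NODES breaks down into entries.

```shell
# Hypothetical helper: drop "#" comments and blank lines from an annotated template.
# Caveat: a value that itself contains "#" (for example a password) would be cut off.
strip_credo() {
  sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d'
}

# Hypothetical helper: print one entry per line for a comma-separated value.
split_list() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Example:
#   strip_credo < annotated.credo > storj-system-health.credo
#   split_list "storagenode,storagenode-a,storagenode-b"
```

NODES, NODEURLS, MOUNTPOINTS and NODELOGPATHS are positional lists, so each should contain the same number of entries in the same node order.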

make sure your script is executable by running the following command. prefix it with 'sudo' if admin privileges are required.

chmod u+x storj-system-health.sh  # or:
sudo chmod u+x storj-system-health.sh

chmod u+x discord.sh  # or:
sudo chmod u+x discord.sh

usage

you can run the script in debug mode to force a push message to your discord channel (if enabled) although no error was found - or without the debug flag to run it in silent mode via crontab (see automation chapter).

./storj-system-health.sh -d   # for a regular discord push message or:
./storj-system-health.sh      # for silent mode

optionally you can pass another path to *.credo, in case it has another name or source:

./storj-system-health.sh -c /home/pi/anothername.credo

in order to use the estimated payout information, which looks like so:

message:  [sn1] : hdd 38.62% > OK 0.25$ / 11.77$

... you should schedule your crontab to run the script around 23:55 UTC. adjust the timing if you have several nodes and/or huge log files to analyse: the script needs to finish before the next full hour, ideally by 23:59:59 UTC at the latest.

it also supports a help command for further details:

./storj-system-health.sh -h

automation with crontab

to let the health check run automatically, here's a crontab example for linux, which runs the script three times per hour (at minutes 15, 35 and 55).

15,35,55  * * * *   pi      /home/pi/storj-system-health.sh -d  > /dev/null

for macos please be aware of the following specifics:

  • use crontab -e and crontab -l, although it is deprecated (for now it works)
  • you do not have to use the user name, it's to be executed with the current user
  • use full paths to your script and credo file
  • find out your standard path with echo $PATH and set it in crontab
SHELL=/bin/sh
PATH="/opt/homebrew/opt/sqlite/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
# UNIX:
30 *    * * *   pi      cd /home/pi/scripts/ && ./storj-system-health.sh
59 1    * * *   pi      cd /home/pi/scripts/ && ./storj-system-health.sh -Ed
# MACOS
# 30    * * * *  /Users/me/storj-system-health.sh >> /Users/me/Desktop/checks.txt 2>&1
# 59    1 * * *  /Users/me/storj-system-health.sh -Ed -c /Users/me/my.credo >> /Users/me/Desktop/checks.txt 2>&1

example screenshots

an "ok" message

ok message

a message saying, that there are fatal errors

fatal error message

another message saying, that there are general errors

general error message

satellite score issues

satellite issues

success rates per node

success rates

explanation:

(repair) downloads / (repair) uploads:
c = cancelled rate
f = failed rate
s = success rate

audits : 
r = recoverable audit rate
c = critical audit fail rate
s = audit success rate
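
the readme does not spell out how these rates are computed. assuming each rate is simply the share of one outcome among all attempts of that type, the arithmetic can be sketched as follows. `rate` is a hypothetical helper, and awk is used here for the floating point math instead of bc to keep the sketch portable.

```shell
# Hypothetical helper: percentage of COUNT out of TOTAL, with two decimals.
# Prints 0.00 when TOTAL is zero to avoid a division by zero.
rate() {
  awk -v c="$1" -v t="$2" 'BEGIN {
    if (t == 0) print "0.00"
    else printf "%.2f\n", c * 100 / t
  }'
}

# Example: 50 downloads, of which 45 succeeded, 3 failed and 2 were cancelled:
#   rate 45 50   # s
#   rate 3 50    # f
#   rate 2 50    # c
```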

contributing

issues and pull requests are welcome. for major changes, please open an issue first to discuss what you would like to change.

if you want to contact me directly, feel free to do so via discord: https://discordapp.com/users/371404709262786561

license

GPL-3.0

storj-system-health.sh's People

Contributors

bjoerrrn, f-systemes


storj-system-health.sh's Issues

add an analysis and warning mechanism of too huge audit request time lags

Is your feature request related to a problem? Please describe.
a time lag of audit requests is not logged into the storage node log file.

problems with these time lags lead fairly quickly to disqualification, without any warning to the storage node operator (SNO).

Describe the solution you'd like
analyse the log extract selected by minlog and check for long time lags between audit requests

example select statement:

cat /mnt/WD1003/logs/sn1.log | grep 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | grep -E "GET_AUDIT" | jq -R '. | split("\t") | (.[4] | fromjson) as $body | {SatelliteID: $body."Satellite ID", ($body."Piece ID"): {(.[0]): .[3]}}' | jq -s 'reduce .[] as $item ({}; . * $item)'

example result from the command:

{
  "SatelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
  "NG6KAUMU7TP22DNGROKBU2MRRNV675QYEOJC3X2BXH4OCML6BPNQ": {
    "2022-06-28T21:24:39.646Z": "download started",
    "2022-06-28T21:24:40.002Z": "downloaded"
  },
  "IADTQX62PCZQEJRRYPCKNWX3QSPG7A3U53IBWPQRSX6ZMH6I45UQ": {
    "2022-06-28T21:32:40.597Z": "download started",
    "2022-06-28T21:32:40.893Z": "downloaded",
    "2022-07-09T20:00:10.698Z": "download started",
    "2022-07-09T20:00:10.995Z": "downloaded"
  },
  "MZEPH4JSGSAJZ72QQV4YOYYVGLER7KOQPBUB2VEANL4MPNSZDBTA": {
    "2022-06-28T21:58:56.184Z": "download started",
    "2022-06-28T21:58:56.454Z": "downloaded"
  },
  "GFATHGO2WFBZNAOQJKXYNHTFKH2T5T4OXK3BEL7U62FNK5ZRR6OQ": {
    "2022-06-28T22:08:49.765Z": "download started",
    "2022-06-28T22:08:50.089Z": "downloaded"
  },
...
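
A possible next step on JSON of this shape is to compute the lag per audit and keep only the slow ones. This is a minimal sketch under a few assumptions: `audit_lags` is a hypothetical helper, the input is a complete JSON object like the one above, and fractional seconds are dropped, so lags are reported in whole seconds. It needs jq with regex support, which standard jq 1.6 builds include.

```shell
# Hypothetical helper: read the per-piece JSON on stdin and print
# "<piece id> <lag in seconds>" for every audit slower than MAX_SECONDS ($1).
audit_lags() {
  jq -r --argjson max "$1" '
    # Drop the fractional seconds, then convert the timestamp to epoch seconds.
    def epoch: sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601;
    del(.SatelliteID)
    | to_entries[]
    | .key as $piece
    | (.value | to_entries | sort_by(.key)) as $events
    | range(0; ($events | length) - 1) as $i
    # Pair each "download started" with the "downloaded" event right after it.
    | select($events[$i].value == "download started"
             and $events[$i + 1].value == "downloaded")
    | (($events[$i + 1].key | epoch) - ($events[$i].key | epoch)) as $lag
    | select($lag > $max)
    | "\($piece) \($lag)"
  '
}

# Example: list audits that took longer than 3 minutes.
#   ... | jq -s 'reduce .[] as $item ({}; . * $item)' | audit_lags 180
```

A "download started" entry without a following "downloaded" entry is skipped by this sketch; such cases would need separate handling as pending audits.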

Additional context

Alert on a) low thresholds for success rates + b) storage node status

  • Low audit success rate (<95%)
  • Low repair success rate (<95%). Risk of getting disqualified.
  • Low customer download success rate (<90%). No disqualification risk.
  • Low upload success rate (<90%). No disqualification risk.
  • No upload or download activity for quite some time.
  • Storagenode not running
  • Storagenode unable to checkin including pingback error.
  • ...
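
The threshold checks in this list could be sketched roughly as follows. `below` and `check_rate` are hypothetical helpers, the rates are assumed to be percentages that were computed beforehand, and awk does the floating point comparison.

```shell
# Hypothetical helper: exit 0 when RATE ($1) is below THRESHOLD ($2).
below() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r < t) }'
}

# Hypothetical helper: print an alert line when a success rate is too low.
# Arguments: name, rate in percent, threshold in percent, risk note.
check_rate() {
  if below "$2" "$3"; then
    printf 'ALERT: %s success rate %s%% is below %s%% (%s)\n' "$1" "$2" "$3" "$4"
  fi
}

# Example with the thresholds from the list above:
#   check_rate repair 94.5 95 "risk of getting disqualified"
#   check_rate upload 97 90 "no disqualification risk"
```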

maybe this can help: https://forum.storj.io/t/guide-to-debug-my-storage-node-uplink-s3-gateway-satellite/1372

New version notification

Add delay of 10 days, before showing a new storj version.

Then, display message just once a day.
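
One way to express this rule, as a minimal sketch: `should_notify` is a hypothetical helper, all times are epoch seconds, and "once a day" is read here as "at least 24 hours since the last notification".

```shell
# Hypothetical helper: exit 0 when a new-version message should be shown.
# Arguments: release time, current time, time of the last notification (0 if none).
should_notify() {
  release=$1
  now=$2
  last=$3
  delay=$((10 * 86400))   # wait 10 days after the release
  interval=86400          # then repeat at most once per 24 hours
  [ $((now - release)) -ge "$delay" ] && [ $((now - last)) -ge "$interval" ]
}

# Example:
#   should_notify "$release_epoch" "$(date +%s)" "$last_notified_epoch" && echo "new version available"
```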

URGENT: Prevent mail alerts from spamming the mail account

There needs to be some logic that stops the same error log extract from being sent again and again, or at least sends it much less often.

Usually, older error messages should disappear after 24h, as the log selection is limited. Anyway, one e-mail for each new error found should be fine.
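
A minimal sketch of such a "one mail per new error" rule, assuming a small state file that remembers what has already been mailed. `is_new_error` and the state file are hypothetical and not part of the script. Log lines usually start with a timestamp, so the error text would need to be stripped of it before being passed in, otherwise every occurrence looks new.

```shell
# Hypothetical helper: exit 0 only the first time a given error text is seen.
# Checksums of already mailed errors are kept, one per line, in STATE_FILE ($1).
is_new_error() {
  state_file=$1
  fingerprint=$(printf '%s' "$2" | cksum | cut -d ' ' -f 1)
  if grep -qx "$fingerprint" "$state_file" 2>/dev/null; then
    return 1
  fi
  printf '%s\n' "$fingerprint" >> "$state_file"
}

# Example:
#   is_new_error /tmp/storj-mailed-errors "$error_text" && echo "send mail"
```

Clearing the state file once a day would match the 24h horizon mentioned above.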

Pending audits: run script again automatically

In case of pending audits: run script again automatically in order to verify that audits are working well.

Send a warning message only if the "verification run" of the script still reports pending audits, instead of warning immediately on the first run.
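
The intended flow could look roughly like this. It is only a sketch: `confirm_pending` is a hypothetical wrapper, and the check command passed to it stands in for whatever part of the script detects pending audits (assumed here to exit 0 while audits are pending).

```shell
# Hypothetical wrapper: exit 0 only when pending audits are seen twice in a row.
# Arguments: check command (exits 0 while audits are pending), delay in seconds.
confirm_pending() {
  check=$1
  delay=$2
  "$check" || return 1   # first run finds nothing pending: no warning
  sleep "$delay"
  "$check"               # the verification run decides
}

# Example:
#   confirm_pending has_pending_audits 300 && echo "warn: audits still pending"
```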
