holgerhees / smartserver
SmartHome Server deployment setup
Home Page: http://www.intranet-of-things.com/smarthome/infrastructure/server/setup/
License: GNU General Public License v3.0
I'm trying to resolve the alerts that I have. With this issue I would like to let you know about a special situation; maybe you can take it into account and make the code more robust.
The alert is System service service state with the Prometheus expression system_service_state{job="system_service",hostname=""} == 0
which gets triggered because in my system LibreNMS has no VLANs (I don't know whether it's a wrong configuration or whether my switch, a TP-Link, does not provide this information via SNMP).
The call returns 404:
# curl -H 'X-Auth-Token: XXXXXX' http://librenms:8000/api/v0/resources/vlans
{
"status": "error",
"message": "VLANs do not exist"
}
If I modify the code to print the stacktrace I see:
Traceback (most recent call last):
File "/opt/system_service/lib/scanner/handler/librenms.py", line 45, in _run
self._processLibreNMS()
File "/opt/system_service/lib/scanner/handler/librenms.py", line 80, in _processLibreNMS
self._processVLANs()
File "/opt/system_service/lib/scanner/handler/librenms.py", line 160, in _processVLANs
_vlan_json = self._get("resources/vlans")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/system_service/lib/scanner/handler/librenms.py", line 319, in _get
raise NetworkException("Got wrong response status code: {}".format(r.status_code), self.config.startup_error_timeout if not self._isInitialized() else self.config.remote_suspend_timeout)
lib.scanner.handler.librenms.NetworkException: Got wrong response status code: 404
For now I have commented out the calls to _processVLANs and _processFDP, and the alert is gone.
I don't know whether this situation can occur for others (so the code needs to be changed to handle it) or whether I need to configure LibreNMS or the switch to report VLANs.
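One way the handler could tolerate this is to treat a 404 on an optional resource as "no data" rather than as a failure. The following is only a sketch, not the project's code: HttpStatusError, get_optional and the fake fetch functions are hypothetical names, and it assumes the HTTP status code is available on the raised exception.

```python
class HttpStatusError(Exception):
    """Hypothetical error that carries the HTTP status code of a failed request."""

    def __init__(self, status_code):
        super().__init__("Got wrong response status code: {}".format(status_code))
        self.status_code = status_code


def get_optional(fetch, path, default=None):
    """Call fetch(path) and return `default` if the resource does not exist (404).

    Any other status code is re-raised, so real failures still trigger the alert.
    """
    try:
        return fetch(path)
    except HttpStatusError as e:
        if e.status_code == 404:
            return default
        raise


def fake_fetch_missing(path):
    # Simulates a 404 answer such as "VLANs do not exist"
    raise HttpStatusError(404)


def fake_fetch_broken(path):
    # Simulates a real server-side failure
    raise HttpStatusError(500)


vlans = get_optional(fake_fetch_missing, "resources/vlans", default=[])
```

With this shape, a switch that reports no VLANs yields an empty list, while a 500 or a timeout still surfaces as an error.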
Hello,
By default on openSUSE 15.5 the Ansible version is 2.11.12, and since 2.8 there is a change that should be considered.
I receive
The requested handler 'refresh prometheus' was not found in either the main handlers list nor in the listening handlers list
and I suppose that all import_tasks statements, like import_tasks: roles/prometheus/shared/add_config.yml
should be replaced by include_tasks, like include_tasks: roles/prometheus/shared/add_config.yml
In the documentation you mention Ansible 2.10.7 - I guess it is affected by this change as well.
In my system I have an NVMe drive:
# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
The command smartctl --scan | grep -oP "^[A-z/]+"
returns (note the missing 0 in /dev/nvme0):
/dev/sda
/dev/sdb
/dev/nvme
If I modify the command to smartctl --scan | grep -oP "^[A-z0-9/]+"
and change hardware_smartd\templates\smartd.conf to guess the device type (e.g. using -d auto), then the issue is fixed on my system.
These changes are in 1beaf45 - maybe you want to incorporate these changes.
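The difference between the two character classes can be reproduced with Python's re module, which treats these classes the same way as grep -P. This is just an illustration using the scan output quoted above; device_names is a hypothetical helper, and the second pattern uses [A-Za-z0-9/] as a slightly stricter variant of the proposed [A-z0-9/] (the range A-z also covers a few punctuation characters that sit between Z and a).

```python
import re

# smartctl --scan output as quoted in this report
SCAN_OUTPUT = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
"""


def device_names(scan_output, pattern):
    """Return the leading match of `pattern` for every line that matches."""
    names = []
    for line in scan_output.splitlines():
        m = re.match(pattern, line)
        if m:
            names.append(m.group(0))
    return names


# Original class: digits are not included, so the match stops before "0"
original = device_names(SCAN_OUTPUT, r"[A-z/]+")

# Class including digits: the full device name is captured
fixed = device_names(SCAN_OUTPUT, r"[A-Za-z0-9/]+")
```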
Prometheus has some internal metrics like ALERTS_FOR_STATE that contain the timestamp when an alert was triggered. This metric carries all alert labels plus one additional label, 'alertname'.
I think the expression for this alert is not restrictive enough, because once the alert has been triggered it is never removed again (not even after the condition that triggered it has become false).
I propose changing the expression as follows (it excludes the ALERTS_FOR_STATE time series, i.e. it matches only series without the alertname label):
{alertname="",chart=~"smartd_log.*",family="temperature",job="netdata"} > 50
I'm not sure if this should be applied to other alerts...
Telegraf broadens the number of entry points for Influxdb and makes collecting metrics much easier.
I love this project, just what I was looking for. A couple of suggestions:
It would be great to have some kind of tree/graph showing the dependencies between the various subsystems. Then users could know which parts can be safely dropped or modified without side effects.
An overview of how Vagrant and Ansible interact to build out the system would help us understand how it all comes together.
Provide some simple FAQ/tutorial sections, something like:
Recent browser updates no longer allow self-signed certs to be accepted, and there don't seem to be any good instructions on how to overcome this. Two possible approaches:
or
I really hope this can be fixed, I'd like to start developing with Smartserver.
Next alarm on my system - maybe I have discovered another issue!?
On my system the alarm Weather consumer (station) is triggered because nothing publishes data to the MQTT topic +/weather/station/#. The openmeteo provider publishes to +/weather/provider/, so has_any_updates is always False.
I think the StationConsumer should be instantiated only for cloud_mosquitto - but maybe I misunderstood your documentation.
I was using parts of the repo to help build my own system. But now I'd like to also use the staging/production mix and migrate my ansible roles into this.
It's not clear in https://github.com/HolgerHees/smartserver/wiki/Setup:-Create#create-your-own-setup how to:
vagrant --config=demo --os=ubuntu up --provision works and provisions to the VM running locally.
sudo ansible-playbook -i config/demo/server.ini server.yml will only run if I add the production IP to config/demo/server.ini.
Point 5: copying files is unclear - copy config/demo/ to the production server?
Perhaps an example where we:
server.ini
,ansible-playbook -i config/demo/server.ini --extra-vars "target=production" server.yml
At least this would be my thinking after playing around with your setup very briefly. I might be missing something so take all of this with a grain of salt.
Regarding
On my system (recently updated) the output of the command
/usr/bin/zypper ps -s
is:
No processes using deleted files found.
No core libraries or services have been updated since the last system boot.
Reboot is probably not necessary.
The following exception is thrown:
Traceback (most recent call last):
File "/opt/update_daemon/system_update_check", line 31, in <module>
repo = plugin.Repository()
File "/opt/update_daemon/plugins/os/opensuse.py", line 19, in __init__
self.needs_restart = m.group(0) is not None
AttributeError: 'NoneType' object has no attribute 'group'
For the moment I fixed this with an additional check:
self.needs_restart = False
if m is not None:
    self.needs_restart = m.group(0) is not None
but I'm not sure the logic is correct (needs_restart = False) as long as the output says "Reboot is probably not necessary."
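Since group(0) of a successful match is always the matched text and never None, the check can be reduced to testing the match object itself. The sketch below assumes this is the intent; needs_restart and RESTART_PATTERN are hypothetical, the pattern is only a placeholder because the regular expression used in opensuse.py is not shown here, and the second sample string is made up for the demonstration.

```python
import re

# Placeholder pattern, NOT the one from opensuse.py (that one is not shown above)
RESTART_PATTERN = r"Reboot is (suggested|required)"


def needs_restart(zypper_output, pattern=RESTART_PATTERN):
    """Return True only if the restart pattern occurs in the output.

    re.search returns None when nothing matches, so no .group() call
    (and no AttributeError) is involved.
    """
    return re.search(pattern, zypper_output) is not None


# Output quoted in this report: no match, so no restart is needed
no_reboot_output = """\
No processes using deleted files found.
No core libraries or services have been updated since the last system boot.
Reboot is probably not necessary.
"""

# Made-up text that matches the placeholder pattern
reboot_output = "Reboot is required.\n"

result_no_reboot = needs_restart(no_reboot_output)
result_reboot = needs_restart(reboot_output)
```

Under that reading, needs_restart = False is the expected result for the output "Reboot is probably not necessary."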
My ISP offers a parental control service, which I have enabled. When this feature is enabled, I guess all DNS requests to servers other than the local one (like Google DNS 8.8.8.8) are dropped.
With parental control enabled on ISP side:
jupiter:~/git/smartserver # nslookup www.google.com 8.8.8.8
;; connection timed out; no servers could be reached
Without parental control enabled:
jupiter:~ # nslookup www.google.com 8.8.8.8
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
Name: www.google.com
Address: 142.250.180.228
Name: www.google.com
Address: 2a00:1450:400d:80c::2004
Sometimes the smartserver Ansible scripts fail because of that. For example, just now, after you updated the Alpine version, the creation of the dnsmasq container failed. Names are not resolved because 8.8.8.8 (which is used by Docker) is not reachable on port 53.
#5 [2/2] RUN apk --no-cache add dnsmasq tzdata\n
#5 0.081 fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/main/x86_64/APKINDEX.tar.gz\n
#5 5.085 WARNING: fetching https://dl-cdn.alpinelinux.org/alpine/v3.20/main: temporary error (try again later)\n
#5 5.085 fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/community/x86_64/APKINDEX.tar.gz\n
#5 10.09 WARNING: fetching https://dl-cdn.alpinelinux.org/alpine/v3.20/community: temporary error (try again later)\n
#5 10.09 ERROR: unable to select packages:\n
#5 10.09 dnsmasq (no such package):\n
#5 10.09 required by: world[dnsmasq]\n
#5 10.09 tzdata (no such package):\n
#5 10.09 required by: world[tzdata]\n
After I disabled the parental control on the ISP side, the Ansible scripts worked as expected.
It's nothing critical, I just wanted to let you know - maybe there is an easy fix.
For the moment, my workarounds are:
The query
results = self.influxdb.query('SELECT "group","value" FROM "netflow_size" WHERE time >= now() - 358m AND "group"::tag != \'normal\'')
returns an empty array on my system. This leads to an exception (IndexError: list index out of range) here
Adding a check for an empty array solves the issue. Proposed changes in this commit.
Thank you!
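The kind of guard meant here can be sketched as follows. The structure of the query result is not shown in this report, so the sketch only assumes that it behaves like a list; first_or_none is a hypothetical helper name, and the sample row is made up.

```python
def first_or_none(results):
    """Return the first element of `results`, or None if it is empty.

    Indexing an empty list with [0] raises IndexError; checking for
    emptiness first avoids that.
    """
    if not results:
        return None
    return results[0]


empty_case = first_or_none([])
filled_case = first_or_none([{"group": "observed", "value": 1}])
```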
I'm not sure whether it's a defect or a wrong configuration on my side, but I wanted to let you know anyway.
I'm getting the following error:
[ERROR] - [m_service.lib.influxdb:66] - Traceback (most recent call last):
File "/etc/system_service/lib/influxdb.py", line 62, in run
messurements += callback()
^^^^^^^^^^
File "/etc/system_service/lib/trafficwatcher/trafficblocker/trafficblocker.py", line 264, in getMessurements
messurements.append("trafficblocker,extern_ip={},blocking_state={},blocking_reason={},blocking_list={},blocking_count={} value=\"{}\"".format(ip, data["state"], data["reason"], data["blocklist"], data["count"], data["last"]))
~~~~^^^^^^^^^^^^^
KeyError: 'blocklist'
If I delete the /dataDisk/var/lib/system_service/trafficblocker.json file (version 3) the error is gone, but the file is empty for the moment. Previously, the file contained entries like this (all of them missing blocklist):
"146.190.40.112": {
"count": 1,
"created": 1692759673.4107,
"details": "scanning",
"last": 1692759617.3242,
"reason": "apache",
"state": "blocked",
"type": "unknown",
"updated": 1692759673.4107
},
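A tolerant way to build the measurement line for such older entries is to fall back to a default when the key is missing. This is only a sketch: traffic_blocker_line is a hypothetical function, and the fallback value "unknown" is an assumption, not something defined by the project. The format string and the sample entry are the ones quoted above.

```python
def traffic_blocker_line(ip, data):
    """Build the measurement line, tolerating entries without 'blocklist'."""
    # Entries written by an older file version have no 'blocklist' key;
    # "unknown" is an assumed placeholder value.
    blocklist = data.get("blocklist", "unknown")
    return (
        "trafficblocker,extern_ip={},blocking_state={},blocking_reason={},"
        "blocking_list={},blocking_count={} value=\"{}\""
    ).format(ip, data["state"], data["reason"], blocklist, data["count"], data["last"])


# Entry as quoted in this report (no 'blocklist' key)
old_entry = {
    "count": 1,
    "created": 1692759673.4107,
    "details": "scanning",
    "last": 1692759617.3242,
    "reason": "apache",
    "state": "blocked",
    "type": "unknown",
    "updated": 1692759673.4107,
}

line = traffic_blocker_line("146.190.40.112", old_entry)
```

An alternative would be a one-time migration that adds the missing key when the file is loaded, which keeps the formatting code unchanged.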
Hello Holger,
I've tried to use your recent changes to the gallery and I think I found an edge case that may cause image_proxy.php to never finish.
In my environment (the cameras are not configured to accept Basic auth, or the returned format is not recognized by Imagick) it seems that curl_exec of the camera snapshot URL may throw an exception. In subsequent calls apcu_delete( $url . ":fetch" ); is then never called and the loop never exits. The workaround is to restart the PHP container to clear the cache.
The PHP logs look like this (with some of your debugging statements uncommented):
NOTICE: PHP message: fetch http://hostname/ipcamera/cam01/ipcamera.jpg
NOTICE: PHP message: 0.0033631324768066
NOTICE: PHP message: fetched size 282
NOTICE: PHP message: calculated size 282
NOTICE: PHP message: PHP Fatal error: Uncaught ImagickException: no decode delegate for this image format `' @ error/blob.c/BlobToImage/363 in /dataDisk/htdocs/gallery/image_proxy.php:148
Stack trace:
#0 /dataDisk/htdocs/gallery/image_proxy.php(148): Imagick->readImageBlob()
#1 /dataDisk/htdocs/gallery/image_proxy.php(101): scaleImage()
#2 /dataDisk/htdocs/gallery/image_proxy.php(65): fetchUrl()
#3 /dataDisk/htdocs/gallery/image_proxy.php(14): getData()
#4 {main}
thrown in /dataDisk/htdocs/gallery/image_proxy.php on line 148
Maybe you could consider fixing this by handling such exceptions and cleaning up the cache.
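The cleanup idea is language independent, so here is a small Python sketch of it rather than PHP. A plain dict stands in for the APCu cache, and fetch_with_flag and failing_fetch are hypothetical names; the point is only that the ":fetch" marker is removed in a finally block, so it is cleared even when fetching or decoding throws.

```python
def fetch_with_flag(cache, url, fetch):
    """Run fetch(url) while a '<url>:fetch' marker is set in `cache`.

    The marker is always removed afterwards, even if fetch raises, so
    other requests waiting on it do not loop forever.
    """
    key = url + ":fetch"
    cache[key] = True  # signal "fetch in progress"
    try:
        return fetch(url)
    finally:
        cache.pop(key, None)  # cleared on success and on error


def failing_fetch(url):
    # Stands in for a snapshot that cannot be decoded
    raise ValueError("no decode delegate for this image format")


cache = {}
error_seen = False
try:
    fetch_with_flag(cache, "http://hostname/ipcamera/cam01/ipcamera.jpg", failing_fetch)
except ValueError:
    error_seen = True
```

In PHP the same shape would be a try/finally (or try/catch) around the fetch and scale steps, with the apcu_delete call in the finally part.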