
docker-flow-monitor's Introduction

Docker Flow Monitor


The goal of the Docker Flow Monitor project is to provide an easy way to reconfigure Prometheus every time a new service is deployed or an existing one is updated. It does not try to "reinvent the wheel", but to leverage the existing leaders and combine them through an easy-to-use integration. It uses Prometheus as the metrics storage and query engine and adds custom logic that allows on-demand reconfiguration.
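Concretely (a minimal sketch pieced together from the label examples in the issues below, not an authoritative reference), a service opts into monitoring through deploy-time labels that Docker Flow Swarm Listener forwards to the monitor, which then rewrites the Prometheus configuration and reloads it:

deploy:
  labels:
    - com.df.notify=true
    - com.df.scrapePort=8080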

Please visit the tutorial for a brief introduction or Configuring Docker Flow Monitor and Usage sections for more details.

Please join the #df-monitor Slack channel in DevOps20 if you have any questions, suggestions, or problems.


docker-flow-monitor's People

Contributors

andrh, atljlawrie, dorsany, itestaverde, patrickleet, puffin, redtex, thomasjpfan, vfarcic


docker-flow-monitor's Issues

servicePath is not well managed

Thanks for this stack.

I tried to configure demo micro service with following configuration in my docker stack:

deploy:
  labels:
    - com.df.notify=true
    - com.df.distribute=true
    - com.df.servicePath=/demo
    - com.df.port=8080

In my swarm-listener logs, I got:
monitor_swarm-listener.1.yykqznc875td@manager | 2017/10/30 12:00:18 Sending service created notification to http://monitor:8080/v1/docker-flow-monitor/reconfigure?distribute=true&port=8080&replicas=1&serviceName=demo_main&servicePath=%2Fdemo

But in my monitor_monitor logs, I got:
monitor_monitor.1.vcn8myrn9igg@manager | 2017/10/30 12:00:18 Processing /v1/docker-flow-monitor/reconfigure?distribute=true&port=8080&replicas=1&serviceName=demo_main&servicePath=**%!F(MISSING)**demo
monitor_monitor.1.vcn8myrn9igg@manager | GLOBAL_SCRAPE_INTERVAL=10s

I get this strange MISSING string. Any idea why?

Thanks
Phil
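A note on the symptom (my explanation, not from the thread): %!F(MISSING) is what Go's fmt/log packages emit when a string containing a percent escape such as %2F is passed as the format argument itself, which suggests the monitor logs the raw query string through Printf. A minimal Go sketch reproducing it:

package main

import "log"

func main() {
	path := "/reconfigure?servicePath=%2Fdemo"
	log.Printf(path)       // "%2F" is parsed as a format verb: ...servicePath=%!F(MISSING)demo
	log.Printf("%s", path) // safe: prints the query string verbatim
}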

scaling to 0 removes alert rules

I am trying to get the benefits of serverless with microservices, by autoscaling based on queue length. When there are no messages in the queue, the service can be scaled down to 0.

I'm using @thomasjpfan's https://github.com/thomasjpfan/docker-scaler, though a Jenkins job would work the same way, and a RabbitMQ Prometheus exporter. This allows me to scale without instrumenting the service specifically.

~ $ docker service logs voelhgexsur4
[email protected]    | 2018/02/20 07:21:32 Scale service down: rethink-denormalizer_service
[email protected]    | 2018/02/20 07:21:32 scale-service success: Scaling rethink-denormalizer_service from 1 to 0 replicas (min: 0)
[email protected]    | 2018/02/20 07:21:32 Alertmanager received message: Scaling rethink-denormalizer_service from 1 to 0 replicas (min: 0)
[email protected]    | 2018/02/20 07:31:32 Scale service down: rethink-denormalizer_service
[email protected]    | 2018/02/20 07:31:32 scale-service success: Scaling rethink-denormalizer_service from 1 to 0 replicas (min: 0)
[email protected]    | 2018/02/20 07:31:32 Alertmanager received message: Scaling rethink-denormalizer_service from 1 to 0 replicas (min: 0)

It works great.

The problem is that this brings the service down in the eyes of DFM and, correctly, of DFP.

In DFM this is an issue because the alerts needed to scale back up when there are messages in the queue again have been removed as part of the service-removed event.

Idea for scaling a service without Jenkins

While reviewing the source code for https://github.com/alexellis/faas, I saw they created a custom webhook: https://github.com/alexellis/faas/blob/master/gateway/handlers/alerthandler.go to scale a docker service. They have an alertmanager configured to signal the webhook when the service needs to be scaled up: https://github.com/alexellis/faas/blob/master/prometheus/alertmanager.yml.

Incorporating the custom webhook as a microservice to scale services would eliminate the need to use Jenkins to perform scaling.
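A minimal sketch of the idea (the scaler URL is hypothetical; this uses the standard Alertmanager webhook receiver, nothing DFM-specific):

route:
  receiver: scale-up

receivers:
- name: scale-up
  webhook_configs:
  - url: http://scaler:8080/v1/scale-up   # hypothetical scaling microservice endpoint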

marshal errors in logs, unable to load config file

Steps to reproduce:

  1. Check out the code from https://github.com/vfarcic/docker-flow-monitor.
  2. Uncomment the section below in stacks/docker-flow-monitor-proxy.yml:

#- com.df.serviceName=monitor
#- com.df.scrapeType=static_configs
#- com.df.scrapePort=9090

  3. Run ./scripts/dm-swarm-04.sh.
  4. Determine the monitor container id and get the log.
  5. Try to access DFM using open "http://$(docker-machine ip swarm-1)/monitor/config".

expected:
There should not be any errors in the log.

actual:
Marshal errors in the logs when trying to process "/etc/prometheus/prometheus.yml". This stops the integration between DFP and DFM.

Cannot access the endpoint, "Docker Flow Proxy: 503 Service Unavailable".

logs:

2017/11/08 01:47:17 /bin/sh -c prometheus -web.external-url="http://192.168.99.100/monitor" -web.route-prefix="/monitor" -config.file="/etc/prometheus/prometheus.yml" -storage.local.path="/prometheus" -web.console.libraries="/usr/share/prometheus/console_libraries" -web.console.templates="/usr/share/prometheus/consoles"
time="2017-11-08T01:47:17Z" level=info msg="Starting prometheus (version=1.8.1, branch=HEAD, revision=3a7c51ab70fc7615cd318204d3aa7c078b7c5b20)" source="main.go:87"
time="2017-11-08T01:47:17Z" level=info msg="Build context (go=go1.9.1, user=root@ab78fb101775, date=20171023-15:50:57)" source="main.go:88"
time="2017-11-08T01:47:17Z" level=info msg="Host details (Linux 4.4.93-boot2docker #1 SMP Wed Oct 18 17:00:16 UTC 2017 x86_64 923dc40e55ff (none))" source="main.go:89"
time="2017-11-08T01:47:17Z" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:254"
time="2017-11-08T01:47:17Z" level=error msg="Error loading config: couldn't load configuration (-config.file=/etc/prometheus/prometheus.yml): yaml: unmarshal errors:
line 10: cannot unmarshal !!seq into string" source="main.go:160"
2017/11/08 01:47:21 Processing /v1/docker-flow-monitor/reconfigure?distribute=true&port=9090&replicas=1&scrapePort=9090&scrapeType=static_configs&serviceDomain=192.168.99.100&serviceName=monitor&servicePath=%!F(MISSING)monitor
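For illustration (not taken from this report), Prometheus raises "cannot unmarshal !!seq into string" when a field that must be a scalar in prometheus.yml ends up holding a YAML sequence, for example:

# a string-valued field accidentally rendered as a list:
metrics_path:
- /metrics

# instead of:
metrics_path: /metrics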

No Autoscaling when Memory threshold is reached

Hi,
I am following the tutorial about Auto-scaling:
https://monitor.dockerflow.com/auto-scaling/

In order to make it easier to test the autoscaling when memory is NOT sufficient:

I reconfigured the service "go-demo_main" as follows:

  • decreased the memory limits to 25%, which is the easiest way forward,
  • changed the scale-down threshold to 1 and the scale-up threshold to 5,
  • updated alertFor to 1 minute of waiting so as not to wait long (screenshot omitted).

This is also visible in monitor/alerts (screenshot omitted).

I am observing the Docker stats (screenshot omitted).

It reads MEM% bigger than 30% for the tasks of the service go-demo_main, the scenario that is supposed to trigger a scale-out (screenshot omitted).

However, after 1 minute, no scale-out took place! The number of tasks per service is still the same before and after the period (screenshots omitted).

Below are some details:

$ docker service inspect go-demo_main --pretty | grep "com.df"

com.df.alertFor.1=5m <----
com.df.alertIf.1=@service_mem_limit:0.1 <----
...
com.df.scaleMin=1 <----
com.df.scrapePort=8080
com.df.servicePath=/demo

The logs are clean. No exception is seen, just:

GET request to /demo/hello

For the resources:

Resources:
  Reservations:
    Memory: 4MiB
  Limits:
    Memory: 20MiB
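For reference, a sketch of the labels this kind of setup relies on, pieced together from the inspect output above and the label examples elsewhere on this page (values illustrative):

deploy:
  labels:
    - com.df.notify=true
    - com.df.alertName.1=memlimit
    - com.df.alertIf.1=@service_mem_limit:0.1
    - com.df.alertFor.1=1m
    - com.df.scaleMin=1
    - com.df.scaleMax=5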

@service_mem_limit_nobuff not expanded properly in prometheus alert.rules

Hey guys, upon discovering that one of our services was alerting on high memory usage due to filling up its page cache, I tried to change the memory alert definition from com.df.alertIf=@service_mem_limit to com.df.alertIf=@service_mem_limit_nobuff. I changed this label in my stack file and redeployed the stack, but I am still seeing the old alert definition in Prometheus.

Upon further investigation, it appears that the alert that was sent to Prometheus was malformed YAML, and thus it fell back on the previous alert definitions. Here is what the invalid definitions look like, with a valid one included for reference. It looks like the shortcut was not expanded properly.

- alert: monitor_monitor_memlimit
  expr: container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitor_monitor"}/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitor_monitor"} > 0.9
  for: 30s
  labels:
    receiver: system
    service: monitor_monitor
  annotations:
    summary: "Memory of the service monitor_monitor is over 0.9"
- alert: nexus_haproxy_memlimit
  expr: @service_mem_limit_nobuff:0.8
  for: 30s
- alert: nexus_nexus_memlimit
  expr: @service_mem_limit_nobuff:0.8
  for: 30s
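For comparison, had the shortcut been expanded, the rule would presumably mirror the valid one above with the page cache subtracted (an assumption about the intended expansion; the actual definition lives in DFM and is not shown in this report):

- alert: nexus_nexus_memlimit
  expr: (container_memory_usage_bytes{container_label_com_docker_swarm_service_name="nexus_nexus"} - container_memory_cache{container_label_com_docker_swarm_service_name="nexus_nexus"}) / container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="nexus_nexus"} > 0.8
  for: 30s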

Here is what the logs for docker flow monitor look like for this.

[email protected] | level=info ts=2018-09-17T17:00:18.441343037Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=92.096158ms
[email protected] | 2018/09/17 18:55:04 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=8081&redirectWhenHttpProto=true&replicas=1&serviceDomain=nexus.staging-gridpl.us&serviceName=nexus_nexus
[email protected] | 2018/09/17 18:55:04 Adding alert memlimit for the service nexus_nexus
[email protected] | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_nexus_memlimit nexus_nexus 1})
[email protected] | 2018/09/17 18:55:04 Writing to alert.rules
[email protected] | 2018/09/17 18:55:04 Writing to prometheus.yml
[email protected] | 2018/09/17 18:55:04 Reloading Prometheus
[email protected] | 2018/09/17 18:55:04 pkill -HUP prometheus
[email protected] | level=info ts=2018-09-17T18:55:04.165360131Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
[email protected] | 2018/09/17 18:55:04 Prometheus was reloaded
[email protected] | level=error ts=2018-09-17T18:55:04.180208814Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 108: found character that cannot start any token"
[email protected] | level=error ts=2018-09-17T18:55:04.180249509Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
[email protected] | level=error ts=2018-09-17T18:55:04.18027167Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"
[email protected] | 2018/09/17 18:55:11 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=9000&replicas=2&serviceDomain=docker.staging-gridpl.us&serviceName=nexus_haproxy
[email protected] | 2018/09/17 18:55:11 Adding alert memlimit for the service nexus_haproxy
[email protected] | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_haproxy_memlimit nexus_haproxy 2})
[email protected] | 2018/09/17 18:55:11 Writing to alert.rules
[email protected] | 2018/09/17 18:55:11 Writing to prometheus.yml
[email protected] | 2018/09/17 18:55:11 Reloading Prometheus
[email protected] | 2018/09/17 18:55:11 pkill -HUP prometheus
[email protected] | level=info ts=2018-09-17T18:55:11.249477453Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
[email protected] | 2018/09/17 18:55:11 Prometheus was reloaded
[email protected] | level=error ts=2018-09-17T18:55:11.264679519Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 100: found character that cannot start any token"
[email protected] | level=error ts=2018-09-17T18:55:11.264715853Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
[email protected] | level=error ts=2018-09-17T18:55:11.26473839Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"

Use double underscore in ENV config override

There is ambiguity in the current ENV naming rules. For example, GLOBAL_SCRAPE_INTERVAL=10s can be treated as:

global:
  scrape_interval: 10s

as well as:

global:
  scrape:
    interval: 10s

To avoid ambiguity and parsing complexity, the next nesting level should be indicated with a double underscore, while a single underscore is treated literally.
With such a rule there is no ambiguity in GLOBAL__SCRAPE_INTERVAL=10s or GLOBAL__EXTERNAL_LABELS__LABEL_TYPE=production.
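Under the proposed rule, the second variable would map unambiguously to (my reading of the proposal, for illustration):

global:
  external_labels:
    label_type: production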

Jenkins can't access the internet

Hi,
Jenkins jobs fail since they can't resolve any host, even though the VirtualBox VM is correctly connected.
Could you add some configuration to the tutorial, or describe here how to get the dockerized Jenkins configured to access the internet?

I have already tried this, but it didn't work (I was already logged in)

docker@swarm-1:~$ cat /etc/resolv.conf
nameserver 10.0.2.3

docker@swarm-1:~$ sudo sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
                                                                                                                                 
docker@swarm-1:~$ docker run busybox nslookup google.com 
Server:		10.0.2.3
Address:	10.0.2.3:53

Non-authoritative answer:
Name:	google.com
Address: 216.58.205.174

*** Can't find google.com: No answer

$ docker-machine ls
NAME      ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER     ERRORS
swarm-1   *        virtualbox   Running   tcp://192.168.99.102:2376           v19.03.1   
swarm-2   -        virtualbox   Running   tcp://192.168.99.103:2376           v19.03.1  


# From the virtual machine:
docker@swarm-1:~$ ping github.com                                                                                                                      
PING github.com (140.82.118.4): 56 data bytes
64 bytes from 140.82.118.4: seq=0 ttl=63 time=44.182 ms
64 bytes from 140.82.118.4: seq=1 ttl=63 time=43.573 ms
^C
--- github.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 43.573/43.877/44.182 ms


docker@swarm-1:~$ ifconfig
docker0   Link encap:Ethernet  HWaddr 02:42:0F:84:20:53  
          inet addr:172.17.0.1  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

docker_gwbridge Link encap:Ethernet  HWaddr 02:42:19:07:CB:74  
          inet addr:172.18.0.1  Bcast:172.18.255.255  Mask:255.255.0.0
          inet6 addr: fe80::42:19ff:fe07:cb74/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5943 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9786 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2499747 (2.3 MiB)  TX bytes:3000996 (2.8 MiB)

eth0      Link encap:Ethernet  HWaddr 08:00:27:99:2E:5E  
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe99:2e5e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:622257 errors:0 dropped:0 overruns:0 frame:0
          TX packets:186322 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:634018464 (604.6 MiB)  TX bytes:11836469 (11.2 MiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:A1:B9:57  
          inet addr:192.168.99.102  Bcast:192.168.99.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fea1:b957/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:912367 errors:0 dropped:0 overruns:0 frame:0
          TX packets:833600 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:350365461 (334.1 MiB)  TX bytes:313476366 (298.9 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

veth24750a5 Link encap:Ethernet  HWaddr CA:F2:29:12:DA:2B  
          inet6 addr: fe80::c8f2:29ff:fe12:da2b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5914 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9776 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2580960 (2.4 MiB)  TX bytes:3000421 (2.8 MiB)

veth39634b3 Link encap:Ethernet  HWaddr 32:73:8C:15:9E:2D  
          inet6 addr: fe80::3073:8cff:fe15:9e2d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:44 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:3008 (2.9 KiB)

veth98e43dd Link encap:Ethernet  HWaddr 0A:4F:F3:F5:85:40  
          inet6 addr: fe80::84f:f3ff:fef5:8540/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:1874 (1.8 KiB)

vetha8e3b37 Link encap:Ethernet  HWaddr BE:46:29:7C:F3:08  
          inet6 addr: fe80::bc46:29ff:fe7c:f308/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:2224 (2.1 KiB)

vethbe4c7e3 Link encap:Ethernet  HWaddr 4A:4F:CE:78:FD:08  
          inet6 addr: fe80::484f:ceff:fe78:fd08/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:2826 (2.7 KiB)

vethd6473cb Link encap:Ethernet  HWaddr D2:8E:83:A6:B2:DA  
          inet6 addr: fe80::d08e:83ff:fea6:b2da/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:33 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:2294 (2.2 KiB)

vethd8a0383 Link encap:Ethernet  HWaddr 8A:12:3A:D0:71:7A  
          inet6 addr: fe80::8812:3aff:fed0:717a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:42 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:2896 (2.8 KiB)

vethd8a649c Link encap:Ethernet  HWaddr 0A:0A:0E:DA:A4:C1  
          inet6 addr: fe80::80a:eff:feda:a4c1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:29 errors:0 dropped:0 overruns:0 frame:0
          TX packets:73 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1989 (1.9 KiB)  TX bytes:4989 (4.8 KiB)


Alert definition not removed from prometheus

I believe I'm seeing an issue where after removing the com.df.alert labels from a service and redeploying it, the alert was not removed correctly from Prometheus.

I actually made the same change in a staging environment and a production environment, and in staging the alert was removed, but in production the alert remains defined in prometheus. I suspect this may be due to the alert being in a firing state in production while I performed the update of the service.

Interestingly, I can see that the monitor swarm listener does not contain the alert definition in its output at /v1/docker-flow-swarm-listener/get-services.

Below is the service definition. The commented lines were in the original service definition, and I redeployed with them commented out, intending to remove the alert definition.

logspout:
  image: gliderlabs/logspout:v3.2.4
  networks:
    - logging
  environment:
    - SYSLOG_FORMAT=rfc3164
  volumes:
    - /etc/hostname:/etc/host_hostname:ro
    - /var/run/docker.sock:/var/run/docker.sock
  command: syslog://logstash:51415
  deploy:
    mode: global
    labels:
      - com.df.notify=true
      - com.df.distribute=true
      # - com.df.alertName=memlimit
      # - com.df.alertIf=@service_mem_limit:0.85
      # - com.df.alertFor=30s
    resources:
      reservations:
        memory: 30M
      limits:
        memory: 75M

I'm wondering if this is known behavior, and how I should remove the alert rules from Docker Flow Monitor in a case like this.

Happy to provide any follow up info needed.

Passing engine or node labels to Prometheus

Service labels are useful but we have to use labels at the engine or node level to add metadata about environment and infrastructure, e.g. cloud provider, region, availability zone, instance, etc., which is pretty useful information in monitoring data.

Is there a mechanism or technique by which we can pass engine or node labels to Prometheus with DFM?
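A sketch based on the environment variables that appear in a compose file further down this page (whether they fully cover engine-level labels is exactly the open question here):

  monitor:
    image: dockerflow/docker-flow-monitor
    environment:
      - LISTENER_ADDRESS=swarm-listener
      - DF_GET_NODES_URL=http://swarm-listener:8080/v1/docker-flow-swarm-listener/get-nodes
      - DF_NODE_TARGET_LABELS=aws_region,role   # node labels forwarded to Prometheus targets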

Unable to activate alerts + must manually restart monitor to register new alerts

My main problem is that no matter how restrictive I set my memory limit, I cannot get the alert to show as active on the /alerts page in Prometheus. In the example below you will see I have set my service's mem_limit alert to 10%, where at rest the service in question uses at least 60% of its available memory limit, and the alert is set to trigger with no timespan. Yet no matter how long I wait, the alert says (0 active).


      resources:
        limits:
          memory: 1000M
      labels:
        - com.df.notify=true
        - com.df.alertName=memlimit
        - com.df.alertIf=@service_mem_limit:0.1

This is how the alert translates into Prometheus

alert: monitoring_elasticsearch_memlimit
expr: container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} / container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} > 0.1
labels:
  receiver: system
  service: monitoring_elasticsearch
annotations:
  summary: Memory of the service monitoring_elasticsearch is over 0.1

When I plug the expr into the Prometheus expression browser I get no data. Not even container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} seems to produce a result.

Here are the relevant docker-compose instructions

  swarm-listener:
    image: vfarcic/docker-flow-swarm-listener
    networks:
      - proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure
      - DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove
    deploy:
      placement:
        constraints: [node.role == manager]
  monitor:
    image: vfarcic/docker-flow-monitor:${TAG:-latest}
    environment:
      - LISTENER_ADDRESS=swarm-listener
      - GLOBAL_SCRAPE_INTERVAL=10s
    networks:
      - proxy
    deploy:
      placement:
        constraints:
          - node.role == manager
    ports:
      - 9090:9090
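For what it's worth, container_memory_usage_bytes is exported by cAdvisor, and the compose above contains no exporter for the monitor to scrape; a sketch of a cAdvisor service in the style of the one shown later on this page (values illustrative, not a confirmed fix):

  cadvisor:
    image: google/cadvisor
    networks:
      - proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    deploy:
      mode: global
      labels:
        - com.df.notify=true
        - com.df.scrapePort=8080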

It may be worth noting that I have not incorporated the alertmanager, as I didn't want to set it up and figured I could test my alert settings before moving on to that step. Am I wrong in assuming I can continue with docker-flow-monitor without alertmanager?

It's also worth noting that I am using proxy as the shared network between docker-flow-monitor and docker-flow-swarm-listener because I am also using docker-flow-proxy in this stack.

It may also be worth noting that I must manually restart the docker-flow-monitor service for new alerts to register in the Prometheus web console after spinning up services other than docker-flow-monitor. I am not sure if that is intended behavior; perhaps this is a sign of something else being wrong.

Nothing in the monitor logs seems to indicate anything is amiss either:

proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Requesting services from Docker Flow Swarm Listener
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Processing: [{"alertFor":"30s","alertIf":"@service_mem_limit:0.8","alertName":"memlimit","distribute":"true","replicas":"1","serviceName":"monitoring_kibana"},{"alertIf":"@service_mem_limit:0.1","alertName":"memlimit","distribute":"true","replicas":"1","serviceName":"monitoring_elasticsearch"},{"distribute":"true","port":"80","replicas":"1","serviceName":"proxy_letsencrypt-companion","servicePath":"/.well-known/acme-challenge"}]
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Writing to alert.rules
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Writing to prometheus.yml
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Starting Docker Flow Monitor
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Starting Prometheus
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 /bin/sh -c prometheus --config.file="/etc/prometheus/prometheus.yml" --storage.tsdb.path="/prometheus" --web.console.libraries="/usr/share/prometheus/console_libraries" --web.console.templates="/usr/share/prometheus/consoles"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425281311Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425401927Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.42546681Z caller=main.go:227 host_details="(Linux 4.4.0-1047-aws #56-Ubuntu SMP Sat Jan 6 19:39:06 UTC 2018 x86_64 ba5f63bfc96a (none))"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425555206Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.428645759Z caller=main.go:499 msg="Starting TSDB ..."
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.438652055Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.443432951Z caller=main.go:509 msg="TSDB started"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.443526522Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.444105035Z caller=main.go:486 msg="Server is ready to receive web requests."
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.444482222Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."

I am fully at a loss on how to debug this further. Perhaps I have made some mistake along the way or misunderstand what I should be expecting.

DFM config changes from file_sd_configs to dns_sd_configs

Using the flexible labeling feature of DFM, we are experiencing an issue where DFM initially creates a Prometheus job using the file_sd_configs options for services, but then changes to using the dns_sd_configs options for services. This results in all flexible labels being dropped from the scraped data -- we are left with only instance and job target labels.

It seems that this change occurs when DFM is updated with docker service update -- we are adding a new network to DFM with the --network-add flag. I should also point out that our DFM is connected to at least 8 different overlay networks (we isolate our services as much as possible), so multiple networks may not be a highly-utilized or tested scenario at present, but I think it is a feasible one from a design standpoint.

Image prom/alertmanager using new flag format

Thanks for sharing, vfarcic!

While running docker-flow-monitor-slack-9093.yml, I noticed that the latest prom/alertmanager image has changed the flag format, according to this post, from a single dash (-config.file) to a double dash (--config.file).
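In compose terms (matching the alert-manager service shown later on this page), the command needs the double-dash form:

  alert-manager:
    image: prom/alertmanager
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'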

Source prometheus.yml

Hi,
Where can I find your prometheus.yml, or how can I add rule_files to your Prometheus?
Thanks
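For reference, the generated configuration shown later on this page lives at /etc/prometheus/prometheus.yml inside the container and already references a rules file:

rule_files:
- /etc/prometheus/alert.rules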

Problem with "Additional scrapes"

There is a problem starting the container when I try to load a secret with the configuration inside. I am using this content:

  - job_name: 'mongo-instance'
    scrape_interval: 5s
    static_configs:
      - targets: ['xxx.xxx.xx.xxx:9100']

in a file named "scrape_mongodb"
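A sketch of how such a secret is attached (mirroring the prometheus_scraps_config secret in the larger compose file later on this page; names illustrative):

secrets:
  scrape_mongodb:
    file: ./scrape_mongodb

services:
  monitor:
    image: dockerflow/docker-flow-monitor
    secrets:
      - scrape_mongodb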

Support for rolling service updates

Question: does DFM support updates to services with rolling policies (e.g. --update-delay) applied? We're seeing some unexpected results in our environment, but want to check whether this is by design before filing a new feature or bug request.

Service labels don't work without node info

Is it correct that Server.go at lines 307 & 669 ignores additional service labels if NodeInfo is null?

I have a service external-service-green that I want to monitor. I added the additional label "appName"; DFSL returns this label to DFM, but DFM ignores it.

2019/02/09 18:59:26 Processing: [
{"distribute":"true","port":"9090","replicas":"1","serviceDomain":"192.168.12.130","serviceName":"monitor_monitor","servicePath":"/monitor"},
{"distribute":"true","port":"3000","replicas":"1","reqPathSearch":"/grafana","serviceName":"monitor_grafana","servicePath":"/grafana/,/grafana/public,/grafana/api"},
{"appName":"external-service","distribute":"true","metricsPath":"/external-service/metrics","replicas":"1","scrapePort":"8080","serviceName":"external-service-green"}
]

I don't see label appName for target external-service-green:

global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 1m
scrape_configs:
- job_name: external-service-green
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /external-service/metrics
  scheme: http
  dns_sd_configs:
  - names:
    - tasks.external-service-green
    refresh_interval: 30s
    type: A
    port: 8080

typo in 'replicas_more_than' shortcut

"@replicas_more_than":
  expanded: count(container_memory_usage_bytes{container_label_com_docker_swarm_service_name="{{ .Alert.ServiceName }}"}) > {{ .Alert.Replicas }}
  annotations:
    summary: The number of running replicas of the service {{ .Alert.ServiceName }} is more than {{ .Alert.Replicas }}
  labels:
    receiver: system
    service: "{{ .Alert.ServiceName }}"
    scale: up
    type: node

Shouldn't the scale value be down rather than up for this shortcut? It seems odd for the same scale value to be in place for both the replicas_less_than and replicas_more_than shortcuts.

com.df.metricsPath does not work

I added - com.df.metricsPath=/actuator/prometheus to my service labels, but it still shows up as metrics_path: /metrics in the Prometheus config.
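For reference, the expected result would be a job entry along these lines (the job name is illustrative; a working metrics_path entry of this shape appears in another issue on this page):

scrape_configs:
- job_name: my-service
  metrics_path: /actuator/prometheus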

Allow sub-value options in GLOBAL configurations

Currently (v0.32), df-monitor allows you to define environment variables that create the prometheus.yml. The variables should be named like GLOBAL_SCRAPE_INTERVAL=10s, and they are transformed to:

global:
  scrape_interval: 10s

The issue is about sub values inside global, for example:

global:
  labels:
    cluster: swarm

This is not possible to reproduce because GLOBAL_LABELS_CLUSTER=swarm creates:

global:
  labels_cluster: swarm

instead of the expected.

A possible solution is to use - for sub-values, something like GLOBAL-LABELS-CLUSTER=swarm.

Thanks!

How do flexible labels work?

@thomasjpfan What do com.df.env=prod and com.df.metricType=system do? Are they just informational labels that can be replaced by whatever is needed, or are they required? If so, what does each do? Also, where can one retrieve this info in Prometheus queries and alerts? Is there an additional metric, or is it added to all metrics sent from the exporter?

Also, I'm trying to decide if this can replace the basi/node-exporter image that sends the host name in a host metric. That allows me to join other metrics to it and get the node's host name instead of the node exporter's IP. Can flexible labels replace this function (and maybe even make it possible for cAdvisor metrics)?

DevOps 2.2 samples do not work

I started with the samples in the chapter Deploying And Configuring Prometheus. I first tried on my headless Ubuntu server, with the virtual machines up and running. The first sample:

docker stack deploy -c stacks/prometheus.yml monitor

I was not able to connect to http://$(docker-machine ip swarm-1):9090/config. I thought it was my proxy config to the headless server and did all sorts of ssh port forwarding, etc., always getting connection refused. So I moved to my iMac, did all the needed setup regarding docker and docker-machine, and got the same result. After googling, I managed to change the yml file to:

version: "3.2"

services:

  prometheus:
    image: prom/prometheus:${TAG:-latest}
    ports:
      - target: 9090
        published: 9090
        protocol: tcp
        mode: host

Then I was able to connect to the config page. The next example, in the chapter Integrating Docker Flow Monitor With Docker Flow Proxy, does not work either: the proxy is not able to connect to the swarm listener:

2018/12/27 18:23:41 Error: Fetching config from swarm listener failed: Get http://swarm-listener:8080/v1/docker-flow-swarm-listener/get-services: dial tcp 10.0.12.91:8080: connect: connection refused. Will retry in 5 seconds.
2018/12/27 18:23:46 Error: Fetching config from swarm listener failed: Get http://swarm-listener:8080/v1/docker-flow-swarm-listener/get-services: dial tcp 10.0.12.91:8080: connect: connection refused. Will retry in 5 seconds.

What is wrong here: a problem with my setup, or did something change with a version?

my setup:

BigiMac:docker-flow-monitor slavisalukic$ docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:21:31 2018
 OS/Arch:           darwin/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:52:55 2018
  OS/Arch:          linux/amd64
  Experimental:     false

BigiMac:docker-flow-monitor slavisalukic$ docker-machine version
docker-machine version 0.15.0, build b48dc28d

Monitor Nodes outside the swarm cluster

Hi,

Actually I have 3 swarm clusters, all separated in different VPC on AWS. One of the swarm is for the tools (Prometheus, Grafana, Jenkins, and so on). The 2 others are for production and staging.

Today, it's set up like this: on AWS I first created private hosted zones for my production and staging stacks. In those hosted zones I created SRV entries for my exporters (node-exporter, cadvisor, etc.). Then I created scrape configs that point to those SRV entries to dynamically get the list of internal node IPs, so Prometheus can scrape metrics from them.
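Presumably something along these lines (a sketch using Prometheus' standard dns_sd_configs; the record name is hypothetical):

scrape_configs:
- job_name: node-exporter-prod
  dns_sd_configs:
  - names:
    - _node-exporter._tcp.prod.internal   # hypothetical SRV record
    type: SRV
    refresh_interval: 30s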

This is not ideal because the lists of IPs (in the SRV entries) are not built up dynamically as I add/remove nodes from my clusters. Being dynamic would require a service that listens to Docker events to know when nodes are added or removed, and updates those SRV entries. Moreover, if dockerflow/docker-flow-monitor restarts, it does not get the alerts from those 2 stacks because it only looks at the swarm cluster it's running on.

To get my alerts from production and staging, I have to restart dockerflow/docker-flow-swarm-listener on those 2 stacks so it sends the alert configs to Prometheus.

Ideally, dockerflow/docker-flow-monitor would accept a list of LISTENER_ADDRESS values so it could call dockerflow/docker-flow-swarm-listener on multiple hosts. Same for DF_GET_NODES_URL.

What do you think? Does it make sense to you? That would be very helpful!

no targets and jobs under scrape_configs appear

Hi, I have some issues configuring Docker Flow Monitor in my swarm cluster. For some reason I don't get any targets in Prometheus.

I'm not sure why the scrape_configs section does not exist and why I don't have any jobs or targets under it.

Here is some of my swarm docker compose, which includes the proxy, monitor, and exporters:

version: '3.3'
volumes:
  prometheus_data:
  grafana_data:
  swarm-endpoints:
  txt_file_exporter_data:
networks:
  monitor:
    external: true
  proxy:
    external: true
  prod:
    external: true
configs:
  alert_manager_config:
    file: ./monitor/alertmanager/config.yml
  blackbox_exporter_config:
    file: ./monitor/blackbox-exporter/blackbox.yml
  grafana_ini_config:
    file: ./monitor/grafana/grafana.ini
  grafana_dashboard_allhosts_config:
    file: ./monitor/grafana/dashboards/monitor_all_hosts_rev1.json
  grafana_dashboard_application_config:
    file: ./monitor/grafana/dashboards/application_monitoring_rev1.json
  grafana_dashboard_system_config:
    file: ./monitor/grafana/dashboards/system_docker_monitoring_rev2.json
  grafana_provisioning_dashboard_config:
    file: ./monitor/grafana/provisioning/dashboards/provisioning_config_file.yml
  grafana_provisioning_datasources_config:
    file: ./monitor/grafana/provisioning/datasources/datasource.yml
secrets:
  prometheus_scraps_config:
    file: ./monitor/prometheus/scrape_swarm_prometheus.yml
services:
  proxy:
    image: dockerflow/docker-flow-proxy:18.07.18-74
    ports:
      - "80:80"
      - "443:443"
      #- "3001:3001"
    networks:
      proxy:
        aliases:
          - proxy
    environment:
      - LISTENER_ADDRESS=swarm-listener
      - MODE=swarm
      - DEBUG=true
    deploy:
#      replicas: 1
      mode: global
      placement:
        constraints: [node.role == manager]
      restart_policy:
        delay: 5s
    logging:
      options:
        max-size: 1g
  swarm-listener:
    image: dockerflow/docker-flow-swarm-listener:18.07.03-28
    privileged: true
    networks:
      - proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure
      - DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove
      - DF_NOTIFY_CREATE_SERVICE_URL=http://monitor:8080/v1/docker-flow-monitor/reconfigure
      - DF_NOTIFY_REMOVE_SERVICE_URL=http://monitor:8080/v1/docker-flow-monitor/remove
      - DF_NOTIFY_CREATE_NODE_URL=http://monitor:8080/v1/docker-flow-monitor/node/reconfigure
      - DF_NOTIFY_REMOVE_NODE_URL=http://monitor:8080/v1/docker-flow-monitor/node/remove
      - DF_INCLUDE_NODE_IP_INFO=true
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]
      restart_policy:
        delay: 5s
    logging:
      options:
        max-size: 1g
  monitor: #This is also include prometheus
   image: dockerflow/docker-flow-monitor
   environment:
     - LISTENER_ADDRESS=swarm-listener
     - DF_GET_NODES_URL=http://swarm-listener:8080/v1/docker-flow-swarm-listener/get-nodes
     - GLOBAL_SCRAPE_INTERVAL=10s
     #- ARG_WEB_ROUTE-PREFIX=/monitor
     - ARG_ALERTMANAGER_URL=http://alert-manager:9093
     - ARG_CONFIG_FILE=/etc/prometheus/prometheus.yml
     - ARG_STORAGE_TSDB_PATH=/prometheus
     - ARG_STORAGE_TSDB_RETENTION=10d
     - ARG_WEB_ENABLE-LIFECYCLE=
     - ARG_WEB_ENABLE-ADMIN-API=
     - GLOBAL__SCRAPE_INTERVAL=60s
     - GLOBAL__evaluation_interval=60s
     - GLOBAL__scrape_timeout=60s
     - DF_SCRAPE_TARGET_LABELS=metricType,url_healthcheck
     #- DF_NODE_TARGET_LABELS=aws_region,role
   secrets:
     - source: prometheus_scraps_config
       target: /run/secrets/scrape_swarm_prometheus.yml
       #uid: "0"
       mode: 444
   networks:
     - monitor
     - proxy
   ports:
     - 9090:9090
   deploy:
     replicas: 1
     placement:
       constraints: [node.role == manager]
     restart_policy:
       delay: 5s
   logging:
     options:
       max-size: 1g
   labels:
     com.df.notify: 'true'
  alert-manager:
    image: prom/alertmanager:v0.15.2
    configs:
      - source: alert_manager_config
        target: /etc/alertmanager/config.yml
        mode: 444
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    #ports:
    #  - 9093:9093
    networks:
      - monitor
    environment:
      - ADMIN_USER=${ADMIN_USER:-admin}
      - ADMIN_PASSWORD=${ADMIN_PASSWORD:-admin}
    logging:
      options:
        max-size: 1g
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]
    labels:
      com.df.notify: 'true'
  grafana:
    image: grafana/grafana:5.2.2
    volumes:
      - grafana_data:/var/lib/grafana:rw
    configs:
      - source: grafana_ini_config
        target: /etc/grafana/grafana.ini
        mode: 444
      - source: grafana_dashboard_allhosts_config
        target: /etc/grafana/dashboards/monitor_all_hosts_rev1.json
        mode: 444
      - source: grafana_dashboard_application_config
        target: /etc/grafana/dashboards/application_monitoring_rev1.json
        mode: 444
      - source: grafana_dashboard_system_config
        target: /etc/grafana/dashboards/system_docker_monitoring_rev2.json
        mode: 444
      - source: grafana_provisioning_dashboard_config
        target: /etc/grafana/provisioning/dashboards/provisioning_config_file.yml
        mode: 444
      - source: grafana_provisioning_datasources_config
        target: /etc/grafana/provisioning/datasources/datasource.yml
        mode: 444
    environment:
      - GF_SECURITY_ADMIN_USER=${ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${ADMIN_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - 3001:3001
    networks:
      - monitor
      - proxy
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]
    logging:
      options:
        max-size: 1g
    labels:
      com.df.notify: 'true'
      com.df.servicePath: "/monitor"
      com.df.reqPathSearchReplace: "/monitor,"
      com.df.port: 3001
  blackbox:
    image: prom/blackbox-exporter:v0.12.0
    #ports:
    #  - "9115:9115"
    networks:
      - monitor
      - prod
    configs:
      - source: blackbox_exporter_config
        target: /config/blackbox.yml
        mode: 444
    command:
      - '--config.file=/config/blackbox.yml'
      - '--log.level=debug'
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]
      resources:
        limits:
          cpus: '0.1'
          memory: '1gb'
    logging:
      options:
        max-size: 1g
    labels:
      com.df.notify: 'true'
      com.df.scrapePort: 9115
      com.df.scrapeNetwork: monitor
      com.df.metricType: url_healthcheck
  nodeexporter:
    image: prom/node-exporter:v0.16.0
    user: root
    privileged: true
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
      - /etc/hostname:/etc/host_hostname
      - txt_file_exporter_data:/etc/node-exporter:ro
    environment:
      - HOST_HOSTNAME=/etc/host_hostname
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|docker|tmpfs)($$|/)'
      - '--collector.filesystem.ignored-fs-types=^/(aufs|cgroup|devpts|mqueue|nsfs|sysfs|proc|tmpfs|loop|shm|none|overlay)($$|/)'
      - '--collector.textfile.directory=/etc/node-exporter'
    restart: always
    ports:
      - 9100:9100
    networks:
      - monitor
    deploy:
      mode: global
      restart_policy:
        delay: 5s
      resources:
        limits:
          cpus: '0.1'
          memory: '1gb'
    logging:
      options:
        max-size: 1g
    labels:
      com.df.notify: 'true'
      com.df.scrapeNetwork: monitor
      com.df.scrapePort: 9100
      com.df.metricType: system
  cadvisor:
    image: google/cadvisor:v0.30.2
    privileged: true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    restart: always
    ports:
      - 9101:9101
    networks:
      - monitor
    command:
      - '--port=9101'
    deploy:
      mode: global
      restart_policy:
        delay: 5s
      resources:
        limits:
          cpus: '0.1'
          memory: '1gb'
    logging:
      options:
        max-size: 1g
    labels:
      com.df.notify: 'true'
      com.df.scrapeNetwork: monitor
      com.df.scrapePort: 9101
      com.df.metricType: system

example of one app:

  nginx:
    image: nginx
    networks:
      proxy:
      site01:
        aliases:
         - nginx-site01.domain.local
    volumes:
      - /storage:/opt/nginx/html:ro
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
    deploy:
      mode: replicated
      replicas: 1
      endpoint_mode: dnsrr
      placement:
        constraints:
          - node.labels.site==site01
      restart_policy:
        delay: 5s
      labels:
        - com.df.notify='true'
        - com.df.healthurl=nginx-site01.domain.local
        - com.df.scrapeNetwork=monitor
        - com.df.metricType=url_healthcheck
        - com.df.alertName=mem_limit
        - com.df.alertIf=@service_mem_limit:0.8
        - com.df.alertFor=5s
        - com.df.scaleMin=2
        - com.df.scaleMax=4
        - com.df.port=443
        - com.df.srcPort=443
        - com.df.reqMode=sni
        - com.df.pathType-"req.ssl_sni -i -m reg"
        - com.df.servicePath="^(nginx-site01\\.)"

Here is the resulting file under /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 1m
  scrape_timeout: 1m
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alert-manager:9093
    scheme: http
    timeout: 10s
rule_files:
- /etc/prometheus/alert.rules

Some logs from the monitor:

docker logs monitor_monitor.1.l8ydgmdj2iyawtaa89ee35x2l
2018/08/27 11:16:51 Requesting services from Docker Flow Swarm Listener
2018/08/27 11:16:56 Processing: [{"distribute":"true","port":"8080","replicas":"1","reqPathSearchReplace":"\"/site01,\"","serviceName":"site01_app01","servicePath":"\"/site01\""},{"distribute":"true","port":"8080","replicas":"1","reqPathSearchReplace":"\"/site02,\"","serviceName":"site02_app01","servicePath":"\"/site02\""},{"distribute":"true","pathType":"req.ssl_sni -i -m reg","port":"9443","replicas":"1","reqMode":"sni","serviceName":"site01_apigateway","servicePath":"^(apigateway-site01\\.)","srcPort":"9443"},{"distribute":"true","pathType":"req.ssl_sni -i -m reg","port":"9443","replicas":"1","reqMode":"sni","serviceName":"site02_apigateway","servicePath":"^(apigateway-site02\\.)","srcPort":"9443"},{"alertFor":"5s","alertIf":"@service_mem_limit:0.8","alertName":"mem_limit","distribute":"true","healthurl":"nginx-site01.domain.local","metricType":"url_healthcheck","pathType-\"req.ssl_sni -i -m reg\"":"","port":"443","replicas":"1","reqMode":"sni","scaleMax":"4","scaleMin":"2","scrapeNetwork":"monitor","serviceName":"site01_nginx","servicePath":"\"^(nginx-site01\\\\.)\"","srcPort":"443"},{"alertFor":"5s","alertIf":"@service_mem_limit:0.8","alertName":"mem_limit","distribute":"true","healthurl":"nginx-site02.domain.local","metricType":"url_healthcheck","pathType-\"req.ssl_sni -i -m reg\"":"","port":"443","replicas":"1","reqMode":"sni","scaleMax":"4","scaleMin":"2","scrapeNetwork":"monitor","serviceName":"site02_nginx","servicePath":"\"^(nginx-site02\\\\.)\"","srcPort":"443"}]
2018/08/27 11:16:56 Requesting nodes from Docker Flow Swarm Listener
2018/08/27 11:16:56 Processing: [{"address":"10.132.0.10","availability":"active","hostname":"swarm-worker-3","id":"0abczxkaqmgvscwm7r0xafut2","role":"worker","state":"ready","versionIndex":"477591"},{"address":"10.132.0.8","availability":"active","hostname":"swarm-worker-1","id":"4a61xthy8rnd8tu08e48pabx6","role":"worker","state":"ready","versionIndex":"477591"},{"address":"0.0.0.0","availability":"active","hostname":"swarm-manager-2","id":"hn0r5pmdr2gruu6haneqo1b72","role":"manager","state":"ready","versionIndex":"477591"},{"address":"10.132.0.5","availability":"active","hostname":"swarm-manager-1","id":"qg5z99jfzvw1lbcfe7vbqfyom","role":"manager","state":"ready","versionIndex":"477591"},{"address":"10.132.0.7","availability":"active","hostname":"swarm-worker-2","id":"vnjn8uj8mptck9ik291muyqzf","role":"worker","state":"ready","versionIndex":"477591"}]
2018/08/27 11:16:56 Writing to alert.rules
2018/08/27 11:16:56 Writing to prometheus.yml
2018/08/27 11:16:56 Starting Prometheus
2018/08/27 11:16:56 /bin/sh -c prometheus --config.file="/etc/prometheus/prometheus.yml" --storage.tsdb.path="/prometheus" --storage.tsdb.retention="10d" --web.enable-lifecycle --web.console.libraries="/usr/share/prometheus/console_libraries" --web.console.templates="/usr/share/prometheus/consoles"
2018/08/27 11:16:56 Starting Docker Flow Monitor
level=info ts=2018-08-27T11:16:56.48316236Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=HEAD, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-08-27T11:16:56.483243102Z caller=main.go:223 build_context="(go=go1.10.3, user=root@5258e0bd9cc1, date=20180712-14:02:52)"
level=info ts=2018-08-27T11:16:56.483262247Z caller=main.go:224 host_details="(Linux 4.15.0-1018-gcp #19-Ubuntu SMP Thu Aug 16 13:38:55 UTC 2018 x86_64 95303c893d07 (none))"
level=info ts=2018-08-27T11:16:56.483277801Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-08-27T11:16:56.484625376Z caller=web.go:415 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-08-27T11:16:56.484599804Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-08-27T11:16:56.490405807Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-08-27T11:16:56.490472545Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-08-27T11:16:56.492077907Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-08-27T11:16:56.492137538Z caller=main.go:502 msg="Server is ready to receive web requests."

Logs from the listener:

docker service logs management_df-swarm-listener
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Starting Docker Flow: Swarm Listener
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Using Docker Client API version: 1.37
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Sending notifications for running services and nodes
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Listening to Docker Service Events
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Listening to Docker Node Events
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Sending node created notification to http://monitor:8080/v1/docker-flow-monitor/node/reconfigure?address=0.0.0.0&availability=active&hostname=swarm-manager-2&id=hn0r5pmdr2gruu6haneqo1b72&role=manager&state=ready&versionIndex=477482
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Sending node created notification to http://monitor:8080/v1/docker-flow-monitor/node/reconfigure?address=10.132.0.10&availability=active&hostname=swarm-worker-3&id=0abczxkaqmgvscwm7r0xafut2&role=worker&state=ready&versionIndex=477482
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Sending node created notification to http://monitor:8080/v1/docker-flow-monitor/node/reconfigure?address=10.132.0.7&availability=active&hostname=swarm-worker-2&id=vnjn8uj8mptck9ik291muyqzf&role=worker&state=ready&versionIndex=477482
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Sending node created notification to http://monitor:8080/v1/docker-flow-monitor/node/reconfigure?address=10.132.0.5&availability=active&hostname=swarm-manager-1&id=qg5z99jfzvw1lbcfe7vbqfyom&role=manager&state=ready&versionIndex=477482
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:28 Sending node created notification to http://monitor:8080/v1/docker-flow-monitor/node/reconfigure?address=10.132.0.8&availability=active&hostname=swarm-worker-1&id=4a61xthy8rnd8tu08e48pabx6&role=worker&state=ready&versionIndex=477482
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:33 Sending service created notification to http://monitor:8080/v1/docker-flow-monitor/reconfigure?distribute=true&port=8080&replicas=1&reqPathSearchReplace=%22%2Fsite02%2C%22&serviceName=site02_app01&servicePath=%22%2Fsite02%22
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:33 Sending service created notification to http://monitor:8080/v1/docker-flow-monitor/reconfigure?distribute=true&pathType=req.ssl_sni+-i+-m+reg&port=9443&replicas=1&reqMode=sni&serviceName=site02_apigateway&servicePath=%5E%28apigateway-site02%5C.%29&srcPort=9443
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:33 Sending service created notification to http://monitor:8080/v1/docker-flow-monitor/reconfigure?alertFor=5s&alertIf=%40service_mem_limit%3A0.8&alertName=mem_limit&distribute=true&healthurl=nginx-site02.domain.local&metricType=url_healthcheck&pathType-%22req.ssl_sni+-i+-m+reg%22=&port=443&replicas=1&reqMode=sni&scaleMax=4&scaleMin=2&scrapeNetwork=monitor&serviceName=site02_nginx&servicePath=%22%5E%28nginx-site02%5C%5C.%29%22&srcPort=443
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:33 Sending service created notification to http://monitor:8080/v1/docker-flow-monitor/reconfigure?distribute=true&port=8080&replicas=1&reqPathSearchReplace=%22%2Fsite01%2C%22&serviceName=site01_app01&servicePath=%22%2Fsite01%22
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:33 Sending service created notification to http://monitor:8080/v1/docker-flow-monitor/reconfigure?alertFor=5s&alertIf=%40service_mem_limit%3A0.8&alertName=mem_limit&distribute=true&healthurl=nginx-site01.domain.local&metricType=url_healthcheck&pathType-%22req.ssl_sni+-i+-m+reg%22=&port=443&replicas=1&reqMode=sni&scaleMax=4&scaleMin=2&scrapeNetwork=monitor&serviceName=site01_nginx&servicePath=%22%5E%28nginx-site01%5C%5C.%29%22&srcPort=443
management_df-swarm-listener.1.f0v58zj92u1l@swarm-manager-1    | 2018/08/27 11:06:33 Sending service created notification to http://monitor:8080/v1/docker-flow-monitor/reconfigure?distribute=true&pathType=req.ssl_sni+-i+-m+reg&port=9443&replicas=1&reqMode=sni&serviceName=site01_apigateway&servicePath=%5E%28apigateway-site01%5C.%29&srcPort=9443

Any idea?

Thanks!
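One detail that stands out when comparing the listener logs with the compose file above (an observation, not a confirmed diagnosis): the exporters declare their com.df.* labels at the container level, while the services that do appear in the listener notifications declare them under deploy, e.g.:

  nodeexporter:
    image: prom/node-exporter:v0.16.0
    deploy:
      mode: global
      labels:                         # service-level labels live under deploy
        - com.df.notify=true
        - com.df.scrapeNetwork=monitor
        - com.df.scrapePort=9100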

[FEATURE REQUEST] Use File based service discovery instead of DNS based for flexibility

Current Situation

Currently DFSL + DFM use DNS-based service discovery. AFAIK DFSL just watches for service changes through the Docker API and sends each change to DFM.

This approach is simple and effective but comes with some limitations:

  • We have no way to attach information about the node each task is running on, as Docker does not expose this through DNS at the moment.
  • We cannot add arbitrary labels to each service/task for more granular info.
  • Each time an exporter is restarted, its ID, which is based on the task IP, changes, so we lose continuity in the metrics.

As you can see, relabelling does not offer anything useful for identifying where a request is coming from:

(screenshot of the available target labels omitted)

Proposal

Prometheus allows other discovery systems to be used. While Swarm is not natively integrated (I've not seen any improvement in this area in a year), we could combine Docker API calls with file-based service discovery to enrich the collected metrics.

Example

~ $ curl -s -g -H "Content-Type: application/json" --unix-socket /var/run/docker.sock "http:/v1.24/tasks?filters={\"desired-state\":[\"running\"],\"service\":[\"monitoring_cadvisor\"]}" | jq '[.[] | {NodeID: .NodeID, Labels: .Spec.ContainerSpec.Labels, Addresses: .NetworksAttachments[0].Addresses[0] }]'
[
  {
    "NodeID": "zixucyzgugv4fmcf0funaqele",
    "Labels": {
      "com.docker.stack.namespace": "monitoring",
      "com.df.environment": "prod",
      "com.df.xxxx.scrapeTasks": "true"
    },
    "Addresses": "10.0.3.13/24"
  },
  {
    "NodeID": "qz9u95jos3fibvewxxbj9y5i8",
    "Labels": {
      "com.docker.stack.namespace": "monitoring",
      "com.df.environment": "prod",
      "com.df.xxxx.scrapeTasks": "true"
    },
    "Addresses": "10.0.3.12/24"
  },
  {
    "NodeID": "l0n731v8859ejdnnubqvjyuwd",
    "Labels": {
      "com.docker.stack.namespace": "monitoring",
      "com.df.environment": "prod",
      "com.df.xxxx.scrapeTasks": "true"
    },
    "Addresses": "10.0.3.11/24"
  }
]
$ curl -s --unix-socket /var/run/docker.sock http:/v1.31/nodes/zixucyzgugv4fmcf0funaqele | jq '.Description.Hostname'
"node-1"

These queries finally allow us to generate a structure like this for Prometheus:

[
  {
    "targets": [ "10.0.3.13:8080" ],
    "labels": {
      "env": "prod",
      "job": "cadvisor",
      "node": "node-1"
    }
  },
  {
    "targets": [ "10.0.3.12:8080" ],
    "labels": {
      "env": "prod",
      "job": "cadvisor",
      "node": "node-2"
    }
  },
  {
    "targets": [ "10.0.3.11:8080" ],
    "labels": {
      "env": "prod",
      "job": "cadvisor",
      "node": "node-3"
    }
  },
  {
    "targets": [ "10.0.0.8:9100" ],
    "labels": {
      "env": "prod",
      "job": "node-exporter",
      "node": "node-1"
    }
  },
  {
    "targets": [ "10.0.0.9:9100" ],
    "labels": {
      "env": "prod",
      "job": "node-exporter",
      "node": "node-2"
    }
  }
]

This could be stored in a file that is regenerated every time something about the exporters changes. The file can then be referenced from the prometheus.yml file:

- job_name: 'dummy'  # This is a default value, it is mandatory.
  file_sd_configs:
    - files:
      - /etc/prometheus/prometheus.sd.json

And with some relabelling, the node entry can be used as the target in Prometheus.
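To make the proposal concrete, below is a minimal sketch of such a generator using the official Docker Go client. The service name monitoring_cadvisor, the port 8080, the com.df.environment label, and the output path are assumptions carried over from the examples above; error handling is kept to a minimum and API shapes may differ slightly between SDK versions:

package main

import (
	"context"
	"encoding/json"
	"os"
	"strings"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

// sdEntry mirrors the structure Prometheus expects in file_sd_configs files.
type sdEntry struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	// Equivalent of the first curl above: running tasks of one service.
	args := filters.NewArgs()
	args.Add("desired-state", "running")
	args.Add("service", "monitoring_cadvisor")
	tasks, err := cli.TaskList(ctx, types.TaskListOptions{Filters: args})
	if err != nil {
		panic(err)
	}

	var entries []sdEntry
	for _, task := range tasks {
		if len(task.NetworksAttachments) == 0 || len(task.NetworksAttachments[0].Addresses) == 0 {
			continue
		}
		// Addresses come back in CIDR form, e.g. "10.0.3.13/24".
		ip := strings.Split(task.NetworksAttachments[0].Addresses[0], "/")[0]

		env := ""
		if task.Spec.ContainerSpec != nil {
			env = task.Spec.ContainerSpec.Labels["com.df.environment"]
		}

		// Equivalent of the second curl above: node ID -> hostname.
		node, _, err := cli.NodeInspectWithRaw(ctx, task.NodeID)
		if err != nil {
			continue
		}

		entries = append(entries, sdEntry{
			Targets: []string{ip + ":8080"},
			Labels: map[string]string{
				"env":  env,
				"job":  "cadvisor",
				"node": node.Description.Hostname,
			},
		})
	}

	// Prometheus watches files listed in file_sd_configs, so no reload
	// is required after updates.
	data, _ := json.MarshalIndent(entries, "", "  ")
	if err := os.WriteFile("/etc/prometheus/prometheus.sd.json", data, 0644); err != nil {
		panic(err)
	}
}

Running something like this on a timer, or triggered by DFSL notifications, would be enough to keep targets up to date without reloading Prometheus.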

Ability to define ScrapeInterval and ScrapeTimeout with labels

Wasn't sure if this needed to go here or with the swarm-listener.

With the size of our cluster, using a global config of 30-second scrapes with 10-second timeouts, cadvisor can't seem to keep up with Prometheus. Even if we turn off all the metrics that can be disabled via '-disable_metrics=tcp,udp,disk,network', the /metrics endpoint is about 2 MB. If I change my global settings to 60 / 30 things seem OK, but it would be nice to do it more granularly.

What is your opinion on supporting something like
com.df.scrapeInterval
com.df.scrapeTimeout
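
For illustration, usage could look like this in a stack file (a sketch of the proposed labels, not something DFM supports today; the service and values are made up):

  cadvisor:
    image: google/cadvisor
    deploy:
      labels:
        - com.df.notify=true
        - com.df.scrapePort=8080
        - com.df.scrapeInterval=60s
        - com.df.scrapeTimeout=30s

which DFM could translate into per-job overrides in the generated scrape config, roughly in the shape it already produces for DNS-based discovery:

- job_name: 'monitoring_cadvisor'
  scrape_interval: 60s
  scrape_timeout: 30s
  dns_sd_configs:
    - names: ['tasks.monitoring_cadvisor']
      type: A
      port: 8080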

additional configurations

I want to use these two configurations in your image:

remote_write:
  - url: "http://prometheus_postgresql_adapter:9201/write"
remote_read:
  - url: "http://prometheus_postgresql_adapter:9201/read"
How can I do this through environment variables in a compose file?
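
For what it's worth, here is a sketch of what this could look like if DFM mapped indexed environment variables onto the remote_write/remote_read sections (the REMOTE_WRITE_1_URL/REMOTE_READ_1_URL naming is an assumption on my side, not a documented feature):

  monitor:
    image: vfarcic/docker-flow-monitor
    environment:
      # Assumed naming scheme; not confirmed by the docs.
      - REMOTE_WRITE_1_URL=http://prometheus_postgresql_adapter:9201/write
      - REMOTE_READ_1_URL=http://prometheus_postgresql_adapter:9201/read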

Alternative to AlertIf @service_mem_limit shortcut

The @service_mem_limit shortcut uses the container_memory_usage_bytes metric as its base, which includes both application memory and the Linux page cache. That metric is fine for applications with low disk activity, such as web services. But for databases, file servers, and other applications with moderate to high disk activity, the shortcut will always tend toward 1 because of the nature of the Linux page cache, which tries to use all available memory. So my suggestion is to add a new shortcut whose base subtracts the value of container_memory_cache from container_memory_usage_bytes.
Example: If serviceName is set to my-service, @service_mem_limit_nobuff:0.8 would be expanded to (container_memory_usage_bytes{container_label_com_docker_swarm_service_name="my-service"}-container_memory_cache{container_label_com_docker_swarm_service_name="my-service"})/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="my-service"} > 0.8.
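
Usage would stay in line with the existing shortcut, e.g. via service labels (a sketch assuming the proposed @service_mem_limit_nobuff shortcut existed):

    deploy:
      labels:
        - com.df.notify=true
        - com.df.alertName=mem_limit_nobuff
        - com.df.alertIf=@service_mem_limit_nobuff:0.8
        - com.df.alertFor=30s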

Additional manual scrape configs

I have set up a swarm cluster with Docker Flow Monitor for monitoring. All is working fine so far. Now I have a use case where I have to define some scrape configs manually.

Why do I need to write this scrape config manually? I am using blackbox_exporter to probe an external service (outside of my control), and for that I need a bunch of relabel statements.

As far as I can see, this is not supported, since I can neither set relabelling through Docker labels nor define custom scrape configs. How should I approach this use case?

Thanks in advance for help.
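
For reference, this is the kind of scrape config I would need to add by hand, using the standard blackbox_exporter relabelling pattern (the target URL and the exporter address are placeholders):

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://external-service.example.com/health
  relabel_configs:
    # Pass the listed target to the exporter as the probe parameter.
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # Send the actual scrape to the blackbox exporter.
    - target_label: __address__
      replacement: blackbox-exporter:9115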

Specifying command-line argument with ARG_ fails for values containing "="

Hi there,

Took a look at docker-flow-monitor recently and ran into the following problem:
If a value contains an "=", the command-line argument is truncated.
Example:
ARG_LOG_FORMAT="logger:stdout?json=true"
results in:
-log.format="logger:stdout?json

Clearly, the issue originates in prometheus/util.go:getArgFromEnv(): environment variables in the form NAME=VALUE are split at every "=", and only the first two parts are considered.
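
A minimal, runnable sketch of the kind of fix, assuming the raw NAME=VALUE string is what gets split:

package main

import (
	"fmt"
	"strings"
)

func main() {
	env := "ARG_LOG_FORMAT=logger:stdout?json=true"

	// Current behaviour: splitting on every "=" and keeping only the
	// first two parts truncates values that contain the character.
	parts := strings.Split(env, "=")
	fmt.Println(parts[1]) // logger:stdout?json

	// Limiting the split to two parts keeps the full value intact.
	parts = strings.SplitN(env, "=", 2)
	fmt.Println(parts[1]) // logger:stdout?json=true
}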

Thanks in advance for a fix!

Include `job` filter for node resource checks

If we have multiple node exporter services, limited, say, to only workers or managers, the @node_[...] checks should be limited to that node exporter service's job, like this:

(sum(node_memory_MemTotal{job="monitor-exporters_node-exporter-sm"}) BY (instance) - sum(node_memory_MemFree{job="monitor-exporters_node-exporter-sm"} + node_memory_Buffers{job="monitor-exporters_node-exporter-sm"} + node_memory_Cached{job="monitor-exporters_node-exporter-sm"}) BY (instance)) / sum(node_memory_MemTotal{job="monitor-exporters_node-exporter-sm"}) BY (instance)

feature request: support multiple alert manager services via ARG_ALERTMANAGER_URL env var

According to official Alertmanager documentation,

It's important not to load balance traffic between Prometheus and its Alertmanagers, but instead, point Prometheus to a list of all Alertmanagers.

DFM, at present, appears to only support a single URL for Alertmanager via the ARG_ALERTMANAGER_URL environment variable.

This is a feature request to support multiple Alertmanager URLs. There are several instances of DF services that support similar mechanisms, either with indexes or comma-separation. Would it be possible to implement a similar mechanism to enable support of this feature in DFM?
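
For illustration, comma separation could look like this (a sketch of the requested behaviour, not something DFM currently supports; the service names are made up):

  monitor:
    image: vfarcic/docker-flow-monitor
    environment:
      - ARG_ALERTMANAGER_URL=http://alertmanager-1:9093,http://alertmanager-2:9093

DFM would then need to pass each URL to Prometheus individually rather than forwarding the whole value as a single flag.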
