metrumresearchgroup / gridengine_prometheus

Prometheus exporter for the Sun Grid Engine

License: MIT License

Go 77.92% Dockerfile 4.01% Shell 18.07%
prometheus-exporter qstat gridengine

gridengine_prometheus's Introduction

Prometheus Exporter for Sun Grid Engine

This is a Prometheus exporter for the Sun Grid Engine, meant to be run on your master nodes. It runs qstat on the command line, uses the gogridengine library to deserialize the XML output into native objects, and then formats them for Prometheus consumption. As long as qstat is on the PATH of the executing user, everything should work, because the command execution inherits its environment from that user.
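As a rough illustration of the XML-to-object step, here is a minimal sketch using only Go's standard encoding/xml package. It is not the gogridengine implementation: the struct names, the parseQstat helper, and the sample document are hypothetical, and the <name> element inside <Queue-List> is an assumption about the qstat output. Only the <resource> element shape is taken from the output quoted in the issues further down.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Resource mirrors one <resource name="..." type="...">value</resource> element.
type Resource struct {
	Name  string `xml:"name,attr"`
	Type  string `xml:"type,attr"`
	Value string `xml:",chardata"`
}

// QueueList mirrors one <Queue-List> element; only the fields used here are mapped.
// The <name> child is an assumption about the qstat schema.
type QueueList struct {
	Name      string     `xml:"name"`
	Resources []Resource `xml:"resource"`
}

// JobInfo is the document root (<job_info>).
type JobInfo struct {
	Queues []QueueList `xml:"queue_info>Queue-List"`
}

// parseQstat (hypothetical helper) decodes qstat-style XML into a JobInfo.
func parseQstat(data []byte) (JobInfo, error) {
	var info JobInfo
	err := xml.Unmarshal(data, &info)
	return info, err
}

// sample is a hand-written stand-in for real qstat output.
const sample = `<?xml version='1.0'?>
<job_info>
  <queue_info>
    <Queue-List>
      <name>all.q@node01</name>
      <resource name="mem_free" type="hl">0.0G</resource>
    </Queue-List>
  </queue_info>
</job_info>`

func main() {
	info, err := parseQstat([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, q := range info.Queues {
		for _, r := range q.Resources {
			fmt.Printf("%s %s=%s\n", q.Name, r.Name, r.Value)
		}
	}
}
```

In a real deployment the bytes would come from running qstat rather than from a constant; the sketch leaves that out so it can run anywhere.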

Environment Variables

  • TEST: set to true for test mode, which does not reach out to the command line but generates data instead.
  • LISTEN_PORT: defines which port the application listens on.
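The following is a minimal sketch of how these two variables could be read, assuming the default port of 9081 mentioned under Running. The Config type and loadConfig function are hypothetical names, and the exact-match comparison against "true" is an assumption; the exporter's own parsing may differ.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultPort is the default listen port named in this README.
const defaultPort = 9081

// Config (hypothetical) holds the two documented settings.
type Config struct {
	TestMode bool
	Port     int
}

// loadConfig reads TEST and LISTEN_PORT through the supplied lookup function
// (os.Getenv in normal use). A missing or invalid LISTEN_PORT falls back to
// the default port.
func loadConfig(getenv func(string) string) Config {
	cfg := Config{Port: defaultPort}
	// Assumption: only the exact string "true" enables test mode.
	cfg.TestMode = getenv("TEST") == "true"
	if raw := getenv("LISTEN_PORT"); raw != "" {
		if p, err := strconv.Atoi(raw); err == nil && p > 0 && p <= 65535 {
			cfg.Port = p
		}
	}
	return cfg
}

func main() {
	cfg := loadConfig(os.Getenv)
	fmt.Printf("test mode: %v, port: %d\n", cfg.TestMode, cfg.Port)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the logic testable without touching the process environment.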

Running

There is one optional flag available to the binary: --pidfile. It indicates where the pidfile for the application should be placed, and primarily serves to facilitate service managers such as upstart or systemd.

Default

./gridengine_prometheus

This will run the application on the default port (9081).

./gridengine_prometheus --pidfile /tmp/pid.pid

This will run the application on the default port and write its PID into a file located at /tmp/pid.pid.

Opinions

This exporter has various opinions about how data is reported, primarily based on the XML structures from qstat:

  • All metrics have a "hostname" key
    • This hostname is derived from the Qlist Name (split at the @ symbol)
    • This facilitates PromQL statements for locating / isolating queries by host
  • Hostname is the primary identifier for host level metrics. This includes:
    • Load Averages
    • Resource Values:
      • mem_free
      • swap_used
      • cpu ...
  • Job Details are recorded with the following labels:
    • hostname
    • Job Number
    • Job Name
    • Owner
  • Values that fit into this category are:
    • State (running = 1, not running = 0)
    • Priority
    • Slots

With these labels, it should be easy to create variable-driven dashboards that let scientists drill down to their specific jobs on any or all hosts at a time.
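The hostname derivation described in the list above (splitting the queue list name at the @ symbol) can be sketched as follows. The function name splitQueueInstance and the sample value all.q@node01 are hypothetical, and returning the queue part as well is an illustration only; the list above describes just the hostname label.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQueueInstance splits a "queue@host" name at the first "@".
// If there is no "@", the whole string is treated as the hostname and the
// queue is left empty.
func splitQueueInstance(name string) (queue, hostname string) {
	i := strings.Index(name, "@")
	if i < 0 {
		return "", name
	}
	return name[:i], name[i+1:]
}

func main() {
	queue, hostname := splitQueueInstance("all.q@node01")
	fmt.Printf("hostname=%q queue=%q\n", hostname, queue)
}
```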

Testing

For testing, there is an environment variable called TEST. When set to "true", the collector skips running qstat and instead generates XML (2 instances) containing some static and some invalid information. The invalid information exists for unit testing purposes, but also to ensure that base values are still reported by the collector. This is especially useful if you want to write custom Grafana dashboards, as you can set up Prometheus, the collector, and Grafana in a local compose instance and consume the generated data.

Grafana

If you want to work with Grafana or try the existing dashboards, the docker-compose file in this directory will set up:

  • Prometheus
  • Grafana
  • Exporter

The exporter runs in test mode and generates a mix of static and non-static content. Prometheus is automatically configured to scrape the exporter by name, and Grafana is set up with the username / password "admin / admin", although it will make you change the password on first setup.

gridengine_prometheus's People

Contributors

dpastoor, mehtaabb, shairozan, xfgavin

gridengine_prometheus's Issues

unexpected newline

I built a Docker image from the trunk version and got this error when executing the compiled binary:
/go/src/metrumresearchgroup/gridengine_prometheus # /go/bin/gridengine_exporter
/go/bin/gridengine_exporter: line 2: syntax error: unexpected newline

Attempting to parse Task identifier failed

When running prometheus from hosts_and_queues 6aad879 (but master probably also), I see in logs:

ERRO[10304] Attempting to parse Task identifier failed: strconv.ParseInt: parsing "59,60": invalid syntax
ERRO[10304] Unable to marshal the XML cleanly into an object  error="strconv.ParseInt: parsing \"59,60\": invalid syntax"

Do you need the XML, or is this enough?
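The log shows strconv.ParseInt being handed the whole field "59,60". One way to accept such a value is sketched below, under the assumption that the field is a comma-separated list of integers. The function name parseTaskIDs is hypothetical; this is not the gogridengine fix, and other forms the field may take (for example ranges) are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTaskIDs (hypothetical) parses a comma-separated task field such as
// "59,60" into integers. A single value such as "59" yields a one-element slice.
func parseTaskIDs(field string) ([]int64, error) {
	parts := strings.Split(field, ",")
	ids := make([]int64, 0, len(parts))
	for _, p := range parts {
		id, err := strconv.ParseInt(strings.TrimSpace(p), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing task id %q: %w", p, err)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	ids, err := parseTaskIDs("59,60")
	fmt.Println(ids, err)
}
```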

Recording State

After using it and creating some dashboards, I have found several bugs, issues, and possible features. We can create new issues or discuss them here.

job_state_value
This metric hides the real state of the job/task. Why not use something like an enum:

job_state{job_id="123", state="r", ...}      1
job_state{job_id="123", state="qw", ...}     0

There is an additional problem when you count running jobs with

count(job_state_value{} == 1)

namely a staleness issue, which causes the number of running jobs to be reported higher than it should be.
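To make the enum proposal concrete, here is a small sketch that renders one sample per state for a job, with 1 for the current state and 0 for the others. It uses only the standard library and prints exposition-style lines; the knownStates list (taken from state codes mentioned in these issues) is illustrative and incomplete, and stateSeries is a hypothetical name rather than anything in the exporter.

```go
package main

import "fmt"

// knownStates is an illustrative, incomplete list of qstat state codes.
var knownStates = []string{"r", "qw", "dt", "auo"}

// stateSeries (hypothetical) renders one sample per known state for a job:
// value 1 for the job's current state, 0 for every other state.
func stateSeries(jobID, current string) []string {
	lines := make([]string, 0, len(knownStates))
	for _, s := range knownStates {
		v := 0
		if s == current {
			v = 1
		}
		lines = append(lines, fmt.Sprintf("job_state{job_id=%q,state=%q} %d", jobID, s, v))
	}
	return lines
}

func main() {
	for _, line := range stateSeries("123", "r") {
		fmt.Println(line)
	}
}
```

With series shaped like this, running jobs could be counted with a query such as sum(job_state{state="r"}).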

Count "dt" and "auo" state as running

We are fairly passive in the state codes that are returned with qstat. So if qstat returns the state as "running" we report it as such.

For states like dt and auo, we want to update the job list before presentation to upstream services to indicate that they are errored state jobs

gridengine_exporter fails on SGE_ARCH

No matter how I set SGE_ARCH, gridengine_exporter fails with this message:

gridengine_exporter --sge_arch "lx-amd64" --sge_root "/mnt/software/sge"
Error: failed to validate SGE configuration: the SGE architecture has not been provided

Running in docker

Hi there,

I ran into another issue using Docker.
My sge_root is at /var/lib/gridengine and my sge_cell is fmri.cn,
so I wrote my docker-compose.yml as below:

version: '2'
services:
  sge-exporter:
    image: gridengine-exporter
    ports:
      - "9081:9081"
    command: "--pidfile /tmp/sge-exporter.pid --sge_root /mnt --sge_cell fmri.cn"
    environment:
      TEST: "true"
    volumes:
      - /var/lib/gridengine:/mnt:ro

When I ran it and requested metrics, I got this error:
sge-exporter_1 | time="2020-05-08T08:17:08Z" level=info msg="Getting ready to start exporter on port 9081"
sge-exporter_1 | time="2020-05-08T08:17:16Z" level=error msg="Couldn't locate binaryexec: "qstat": executable file not found in $PATH"
sge-exporter_1 | time="2020-05-08T08:17:16Z" level=error msg="There was an error processing the XML output" error="Couldn't locate the binary"

I then grabbed the compiled binary and tested it manually and it worked.

I wonder if there is something wrong with my yml?
Thank you.

Pending tasks only being reported once

If a job is pending, only one task is reported:

job_state_value{hostname="hostname",job_number="62752",name="test.sh",owner="user",task_id="0"} 0

although there are multiple tasks queued. I can provide xml example if needed.

Create separate labels for host and queue

At the moment, the queue is part of the hostname, and label_replace must be used. It would be cool to have 2 separate labels: queue, hostname.

Previously we were just splitting the name, and we currently take only the hostname. For the sake of carving up metrics, it would be good to provide both as labels.

Error when extracting resources

I guess that this problem is in the underlying library gogridengine (or in my setup), but I am raising the issue here since the problem comes up when running the exporter.

Commit: ed14702

When running the exporter and scraping it, I see the following errors:

ERRO[0002] There was an error extracting Free Memory from the resource list  error="Could not located the requested key"
ERRO[0002] There was an error extracting Used Memory from the resource list  error="Could not located the requested key"
ERRO[0002] There was an error extracting CPU Utilization from the resource list  error="Could not located the requested key"
ERRO[0002] There was an error extracting Free Memory from the resource list  error="Could not located the requested key"
ERRO[0002] There was an error extracting Used Memory from the resource list  error="Could not located the requested key"
ERRO[0002] There was an error extracting CPU Utilization from the resource list  error="Could not located the requested key"

I guess that the XML cannot be parsed. From the code, I see that the XML is obtained using qstat -u * -F -xml.

Here is a snippet of my output from running this command. Note that most of the XML is omitted.

<?xml version='1.0'?>
<job_info  xmlns:xsd="http://gridengine.sunsource.net/source/browse/*checkout*/gridengine/source/dist/util/resources/schemas/qstat/qstat.xsd?revision=1.11">
  <queue_info>
    <Queue-List>
      ...
      <resource name="mem_free" type="hl">0.0G</resource>
      ...
    </Queue-List>
...

Are my settings somehow wrong? Thanks for any help.
