aws-samples / aws-batch-operational-dashboards

Home Page: https://aws.amazon.com/batch/

License: MIT No Attribution

Python 100.00%
aws-batch grafana hpc job-metrics amazon-cloudwatch performance-monitoring

aws-batch-operational-dashboards's People

Contributors

amazon-auto, mhuguesaws

Forkers

mhuguesaws

aws-batch-operational-dashboards's Issues

Deployment failure for an account without an organization

During deployment with "sam deploy --stack-name ${BATCH_DASHBOARD_NAME} ...", the following error occurs if the account is not a member of an organization:

Resource handler returned message: "Your account is not a member of an organization. (Service: AWSSingleSignOn; Status Code: 400; Error Code: AccessDeniedException; Request ID: 7fdf6cad-1b9b-4d28-b959-26262ec7a900; Proxy: null) (Service: Grafana, Status Code: 403, Request ID: 66819d95-0456-4853-ae0b-ba5a8e02191e)" (RequestToken: 869a9138-6c5f-f50a-1c72-3bb9c43c4b00, HandlerErrorCode: AccessDenied)

EBS Read IOPS reported incorrectly

There are a couple of places where an 'Ops' metric should be used instead of a 'Bytes' metric.
For example, the template file batch-grafana-dashboard-template.json contains the following section, which uses EBSReadBytes_Average where EBSReadOps_Average should be used.

  "title": "EBS Read IOPS",
  "transformations": [
    {
      "id": "calculateField",
      "options": {
        "alias": "EBS Read IOPS",
        "binary": {
          "left": "EBSReadBytes_Average",
          "operator": "/",
          "reducer": "sum",
          "right": "60"
        },
        "mode": "binary",
        "reduce": {
          "include": [
            "EBSReadBytes_Average"
          ],
          "reducer": "sum"
        },
        "replaceFields": true
      }
    }
  ],
  "type": "timeseries"
}

],
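One possible way to apply the change described above is to patch the dashboard template programmatically. The sketch below is an assumption about how this could be done, not the fix adopted by the maintainers: the helper name fix_iops_panels is hypothetical, it assumes IOPS panels sit in a top-level "panels" list, and it only renames the field used by the calculateField transformation. The panel's CloudWatch query would still need to request the EBSReadOps metric, which this sketch does not touch.

```python
import json


def fix_iops_panels(dashboard: dict) -> int:
    """Rename *Bytes_* fields to *Ops_* in calculateField transformations
    of panels whose title contains "IOPS". Returns the number of fixes.

    Hypothetical helper: assumes panels live in a top-level "panels" list.
    """
    fixed = 0
    for panel in dashboard.get("panels", []):
        if "IOPS" not in panel.get("title", ""):
            continue
        for transformation in panel.get("transformations", []):
            if transformation.get("id") != "calculateField":
                continue
            options = transformation.get("options", {})
            binary = options.get("binary", {})
            left = binary.get("left", "")
            if "Bytes_" in left:
                binary["left"] = left.replace("Bytes_", "Ops_")
                fixed += 1
            reduce_options = options.get("reduce", {})
            if reduce_options.get("include"):
                reduce_options["include"] = [
                    name.replace("Bytes_", "Ops_")
                    for name in reduce_options["include"]
                ]
    return fixed


def patch_template(path: str) -> int:
    """Load the template at `path`, patch it in place, and write it back."""
    with open(path, encoding="utf-8") as handle:
        dashboard = json.load(handle)
    fixed = fix_iops_panels(dashboard)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dashboard, handle, indent=2)
    return fixed
```

Calling patch_template("batch-grafana-dashboard-template.json") would rewrite the file in place, so it is worth reviewing the resulting diff before deploying.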

Unable to retrieve table from AWS Glue!

The S3 bucket is empty, and when I looked at the Lambda function I saw the error below, which is handled as a warning.
I ended up debugging this because of an error I get in Grafana.

Any suggestions as to what the issue could be?

2024-05-02 12:19:00 3a0adf03-417f-4df3-991b-35cd2fae2363 WARN DynamoDBMetadataHandler:250 - doGetTable: Unable to retrieve table batch-op-dashboard-batchjobdata-41yi200clapo from AWSGlue in database/schema default. Falling back to schema inference. If inferred schema is incorrect, create a matching table in Glue to define schema (see README) com.amazonaws.services.glue.model.EntityNotFoundException: Entity Not Found (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; Request ID: 03a04a20-5584-43a8-bf70-fcff1ce3cf52; Proxy: null)

I will also test the feature branch!

Fix swapped vcpus and memory values

When a job definition is registered with memory first and vcpus second, the two values are swapped, which leads to an incorrect display in the dashboard.
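A minimal sketch of an order-independent lookup, assuming the swap comes from reading the job's resourceRequirements list by position. In AWS Batch each entry of that list carries a "type" ("VCPU", "MEMORY" or "GPU") and a string "value", so keying on "type" avoids depending on registration order. The function name and the returned keys are hypothetical and are not taken from the project's code.

```python
def parse_resource_requirements(requirements):
    """Return vCPU and memory values from a resourceRequirements list,
    regardless of the order in which the entries were registered.

    Hypothetical helper: the output keys "vcpus" and "memory_mib" are
    illustrative names, not fields used by the dashboard.
    """
    by_type = {item["type"].upper(): item["value"] for item in requirements}
    vcpus = by_type.get("VCPU")
    memory = by_type.get("MEMORY")
    return {
        # vCPU values may be fractional (for example "0.25" on Fargate).
        "vcpus": float(vcpus) if vcpus is not None else None,
        # Memory is expressed in MiB.
        "memory_mib": int(memory) if memory is not None else None,
    }
```

With this approach, a definition that lists MEMORY before VCPU yields the same result as one that lists VCPU first.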

Fix grafana update permissions

aws grafana update-permissions --workspace-id ${GRAFANA_ID} \
    --update-instruction-batch \
    action=ADD,role=ADMIN,users=[{$ADMIN_GROUP,type=SSO_GROUP}]

Parameter validation failed:
Invalid type for parameter updateInstructionBatch[0].users[0], value: 14d844e8-3081-7007-3d2b-1cec5daf3063, type: <class 'str'>, valid types: <class 'dict'>
Invalid type for parameter updateInstructionBatch[1].users[0], value: type=SSO_GROUP, type: <class 'str'>, valid types: <class 'dict'>
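The validation errors above say that each entry of users must be a dictionary. In the command as written, the group ID has no id= key, and the unquoted {...,...} appears to be split by shell brace expansion into two separate arguments, which would explain why two instructions were reported. As a hedged sketch, the same request can be built in Python with boto3, where the structure is explicit. The helper names below are hypothetical; update_permissions is the boto3 "grafana" client call for this API, and running grant_admin requires boto3 and valid AWS credentials.

```python
def build_admin_group_instruction(group_id: str) -> list:
    """Build an updateInstructionBatch payload that grants the ADMIN role
    to one IAM Identity Center (SSO) group.

    Each user entry is a dict with "id" and "type", as the error requires.
    """
    return [
        {
            "action": "ADD",
            "role": "ADMIN",
            "users": [{"id": group_id, "type": "SSO_GROUP"}],
        }
    ]


def grant_admin(workspace_id: str, group_id: str) -> dict:
    """Send the request to Amazon Managed Grafana (needs AWS credentials)."""
    import boto3  # imported here so the payload helper works without boto3

    client = boto3.client("grafana")
    return client.update_permissions(
        workspaceId=workspace_id,
        updateInstructionBatch=build_admin_group_instruction(group_id),
    )
```

For the CLI itself, adding id= before the group ID and quoting the whole shorthand value so the shell does not expand the braces would likely address both errors.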

Display jobs in RUNNING status

Currently, the dashboard only displays jobs that have a stoppedAt time, so only FAILED or SUCCEEDED jobs are shown.
It would be nice to see RUNNING jobs as well.
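One way to include RUNNING jobs is to fall back to the current time when stoppedAt is missing. The sketch below illustrates that idea only and is not the dashboard's actual query logic. It assumes job records shaped like AWS Batch job details, where startedAt and stoppedAt are epoch timestamps in milliseconds; the function name is hypothetical.

```python
import time


def job_runtime_seconds(job: dict, now_ms=None):
    """Return the elapsed runtime of a job in seconds.

    Jobs without "stoppedAt" (still RUNNING) are measured up to `now_ms`,
    which defaults to the current time. Jobs that never started return None.
    """
    started = job.get("startedAt")
    if started is None:
        return None
    end = job.get("stoppedAt")
    if end is None:
        end = now_ms if now_ms is not None else int(time.time() * 1000)
    return (end - started) / 1000.0
```

A finished job keeps its recorded duration, while a running job reports a duration that grows on each refresh.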

Display Instance Id in EC2 metrics

EC2 metrics display the name of the metric. It would be more useful to show the instance ID.
Solution: specify {{InstanceId}} as the alias in Grafana.

Cost Dashboard Instructions

The instructions for the Cost Dashboard are slightly off, as no data source in the JSON file references CloudWatch.

This is what I see:
[Screenshot: 2023-11-08, 12:17 PM]

Set container insights retention

By default, Container Insights data retention is 1 day.
This feature will set up Container Insights retention based on a variable.
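A minimal sketch of how a configurable retention could be applied, under several assumptions: that the Container Insights data in question is stored in the CloudWatch Logs group /aws/ecs/containerinsights/<cluster>/performance, that the retention is set with the CloudWatch Logs put_retention_policy call, and that the helper names and the cluster name argument are hypothetical. The project may implement this differently, for example directly in its deployment template.

```python
# Retention periods (in days) accepted by CloudWatch Logs.
VALID_RETENTION_DAYS = {
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
}


def retention_request(cluster_name: str, retention_days: int) -> dict:
    """Build the put_retention_policy arguments for a cluster's
    Container Insights log group (log group naming is an assumption)."""
    if retention_days not in VALID_RETENTION_DAYS:
        raise ValueError(f"unsupported retention period: {retention_days}")
    return {
        "logGroupName": f"/aws/ecs/containerinsights/{cluster_name}/performance",
        "retentionInDays": retention_days,
    }


def apply_retention(cluster_name: str, retention_days: int) -> None:
    """Apply the retention policy (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3

    logs = boto3.client("logs")
    logs.put_retention_policy(**retention_request(cluster_name, retention_days))
```

Validating the value up front gives a clearer error than the one returned by the service when an unsupported number of days is passed.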

Add GPU metric support

Currently the solution obtains CPU, memory, and EBS metrics through CloudWatch.
It would be great to have individual GPU metrics such as compute and memory utilization.

No Athena type available in Managed Grafana "Add Data Source" dialog

Hello,
After deploying the solution today, I followed the guide, but at the step that requires adding an Athena data source I am unable to do so:
Athena does not appear as an available data source type.
The previous data source (CloudWatch) was added successfully, but not this one.

Are there prerequisites missing for Athena to become available as a data source?

Fetch error: 404 Not Found Instantiating; when trying to add Athena data source

Hello,
When I try to add an Amazon Athena data source, I get the error below.
[Screenshot of the error]
I do not get prompted to add any information related to the Athena data source, as is shown in your tutorial. The moment I click on Athena on the "Add data source" page, it shows me "Data Source added" and I immediately get the error in the screenshot above.
I've tried the Athena plugin versions 2.13.5 and 2.14.0. I am running Grafana 9.4.
Is Athena part of the Enterprise Plugins, and is that why I'm getting this error?
