greenscheduler / cats Goto Github PK

CATS: the Climate-Aware Task Scheduler :cat2: :tiger2: :leopard:

Home Page: https://greenscheduler.github.io/cats/

License: MIT License

Python 100.00%

climate computing scheduling carbon carbon-footprint electricity electricity-consumption energy energy-consumption job-scheduler

cats's People

Contributors

Stargazers

Watchers

Forkers

abhidg sadielbartholomew colinsauze tlestang ljcolling andreww nicolaspayette

cats's Issues

Test failures

Currently we have three failing tests on main. Two of them are easy to fix (see #82).

The test that still fails comes from line 218 of __init__.py where we expect four values to be returned by get_runtime_config (in configure.py) but that function returns six values. I think we just need to plumb in jobinfo and PU to __init__ but I'm not totally sure which way around the fix should go. Looks like some merge conflict resolution gone wrong to me. I think this is the cause of the error reported in #81.

While we're at it, I think the type hints in get_runtime_config are out of sync with the values that actually get returned (four rather than six types listed). It's not the only issue that mypy finds:

 cats/check_clean_arguments.py:31: error: Need type annotation for "info" (hint: "info: Dict[<type>, <type>] = ...")  [var-annotated]
cats/check_clean_arguments.py:31: error: Argument 1 to "dict" has incompatible type "list[tuple[str | Any, ...]]"; expected "Iterable[tuple[Never, Never]]"  [arg-type]
cats/check_clean_arguments.py:56: error: Expected keyword arguments, {...}, or dict(...) in TypedDict constructor  [misc]
cats/configure.py:50: error: Module has no attribute "eror"; maybe "error"?  [attr-defined]
cats/configure.py:67: error: Incompatible return value type (got "tuple[Mapping[str, Any], APIInterface, str, int, list[tuple[int, float]] | None, Any]", expected "tuple[dict[Any, Any], APIInterface, str, int]")  [return-value]
cats/configure.py:87: error: Missing return statement  [return]
cats/__init__.py:199: error: Incompatible types in assignment (expression has type "bytes", variable has type "CATSOutput")  [assignment]
cats/__init__.py:209: error: Missing return statement  [return]
Found 8 errors in 3 files (checked 10 source files)

Worth adding a mypy test to our CI? Worth setting github up to preclude merging / pushing onto main when the tests don't pass?

Deploy the documentation site via a GitHub Action

A follow-on from #47, add a GitHub Action workflow to generate and host via GitHub Pages the built Sphinx documentation pages from the source .rst content.

As part of this Issue, the following should be completed:

Add a GitHub Action workflow to deploy the built docs site to GitHub Pages.
Set the workflow job to run to generate those pages (checking they all render and link as they should).
Configure GitHub Pages via 'Settings' on the repo to host the generated pages as a site.
Update 'LINK-PENDING' in the README in the first sentence under the 'Documentation' heading to point to the link of the landing/index page of the new hosted docs.
Update the repository webpage link (currently not specified) to be the link to the landing/index page of the new hosted docs.

If anyone wants to take this on, please feel free and if so, 'assign' yourself here so we know you are working on it. I can do it, but it might be some weeks before I get round to it, so until the point then when I assign myself, I won't be working on it. (And of course I'm happy to provide any guidance relating to the content and infrastructure I added in #47 towards getting this done.)

Essential info.

The documentation source including confirguration and makefile (etc.) is all contained under the docs/ repository. The README document in that directory explains the build process via the core command make html, which is what we need to automate via the Actions workflow.

Needs a manual/info page

It would aid greatly if (e.g.) "cats -h" gave a full list of each parameter and its meaning.
(e.g. what is -c COMMAND for? how would I find that out from the command line?)

NB I cannot see in https://greenscheduler.github.io/cats/quickstart.html#basic-usage where "-c COMMAND" is discussed

Although there is useful info in https://github.com/GreenScheduler/cats/blob/main/cats/__init__.py I do not get this displayed:

mkb@deb12:~/src/mmu-linux/mmu-bin$ ~/.local/bin/cats
usage: cats [-h] -d DURATION [-s {at}] [-a API] [-c COMMAND] [--dateformat DATEFORMAT] [-l LOCATION] [--config CONFIG] [--profile PROFILE] [--format {json}] [-f] [--cpu CPU] [--gpu GPU]
[--memory MEMORY]
cats: error: the following arguments are required: -d/--duration

Wrap schedulers, starting with at(1)

In the current design we use cats to generate some output (on standard out) that can be used as an argument to at to set the runtime. All other output goes to standard error. This makes handling out output a bit complicated and cats does not really look like a stand alone scheduler.

One option would be to rebuild the command line interface to cats such that it looks like a scheduler itself and under the hood calls at having done the calculation for the start time. This would probably use processing from the standard library. It could also open the door to letting us ship more than one command line programme.

Anyhow, this came up in #52 and it seems like it needs thinking about.

Increase test coverage

Get coverage as close to 100% before v1 release

Documentation

We need some documentation for the project. Both a framework to create / host the documentation (e.g. github pages or read the docs via sphinx) and a first pass at some content. This may involve trimming down the readme file.

Ability to constrain the suggested start time to a specific time window within which the job should start

Suggested in #29

Documented command to apply with `at` scheduler broken

HI, a quick one to note that the README command quoted to utilise the output of cats with the at scheduling command, namely:

command | at `python -m cats -d <job_duration> --loc

is not working. Clearly there is a missing or spurious backtick in there, but eve when one is added in a logical place, namely "<trivial job command to test> | at `python -m cats -d <job_duration> --loc <postcode>`" (or I even try it without, just to test to be sure, logically that wouldn't work to my knowledge though), the command won't work, e.g:

$ mkdir mydir | at `python -m cats -d 120 --loc "RG6 6ES"`
{'timestamp': datetime.datetime(2023, 5, 13, 12, 30), 'carbon_intensity': 102.0, 'est_total_carbon': 102.0}
syntax error. Last token seen: B
Garbled time

I am not sure of the context, but imagine this command might have worked before the latest changes to include the estimated carbon intensity, because obviously with the output now being a Python-like dict it is not a valid at timestamp input (unless processed with further commands e.g. by pipe to extract it), whereas if it was just the timestamp before that could have worked assuming the extra backtick.

Solution for now

Until the CLI input and output format is tidied and we can provide a means to grab the timestamp only, to pass to at etc., we could either:

apply some contrived awk pipe command or similar to parse the timestamp out, to get a working command for hooking up to at; or
simply remove the quoted command for now in the README and me say soon it will work such that there is an easy way to hook up with at and other schedulers, etc.

Configure pre-commit config

Suggested tool black, isort in #29, could also use ruff.

Capture our next steps

How do we take this forward?

Multinational support: research electricity grid APIs for other countries

From a Twitter thread relating to cats, some folk were enquiring as to whether it works or could work for locations outside of the UK (see original quotes given below for the context, if useful). I agree that it would be nice to provide support for countries not in Britain, assuming of course we can find and use APIs for other national electricity system grids comparable to the National Grid ESO API we have made use of so far.

The first step would be to research whether there are other such APIs we could make use of. Then we can get a feel for how multi-national in scope cats could be. Alternatively, we could decide to limit our scope solely to the UK, to avoid the complications. What do people think? It would be especially useful to hear from those with more knowledge of electricity systems than I (I have very little!).

Either way, we should clarify the location scope of cats in the documentation. Since our README doesn't mention explicitly that it only works for places in GB, we should add some brief text to clarify that straight away (to be updated if we eventually wider our location scope in line with this Issue). (I'll do that shortly in a commit.)

Twitter thread background

(See also the link above for original source.)

Does it work outside the U.S. ? I mean because of the grid data it depends on ?

It uses the UK @NationalGridESO API, but presumably other countries have equivalent ones?

Sorry for assuming U.S. was the default 😉. Probably, there's something equivalent elsewhere, too. Would be cool to add some resources to the http://README.md

Yeah, people from other countries should definitely raise issues with info for their grid.

Check system timezone

The carbon intensity data we are given is in UTC. But system time could be in a different timezone. We need to translate this when producing the output from cats, currently we don't. We should also write unit tests to check this and to check what happens when the clocks change as this is a common way systems like this break!

Code of conduct discussion

At our next catch up we should have a discussion about a contributor code of conduct. One approach would be to adopt the Contributor Covenant: https://www.contributor-covenant.org/version/2/1/code_of_conduct/code_of_conduct.md

We would need to include a reporting method (probably two people in case one of us manages to mess up).

Multinational support: research electricity grid APIs for other countries

Twitter thread background

(See also the link above for original source.)

Does it work outside the U.S. ? I mean because of the grid data it depends on ?

It uses the UK @NationalGridESO API, but presumably other countries have equivalent ones?

Sorry for assuming U.S. was the default 😉. Probably, there's something equivalent elsewhere, too. Would be cool to add some resources to the http://README.md

Yeah, people from other countries should definitely raise issues with info for their grid.

docs: remove `--jobinfo` and document config.yml

Add sample config.yml to the main branch
Remove --jobinfo from documentation

Implement Slurm Plugin

Implement the best strategy identified in #40

SLURM plugin

At some point it would be nice to use carbon intensity to help schedule tasks on HPC clusters. In principle the 'backend' of cats could help with this and the obvious approach is to somehow plug into SLURM. For example, on an under used cluster, it may be best to run user jobs only during low carbon intensity times and let the queue build up when carbon intensity is high. We would presumably need to build a SLURM plugin (https://slurm.schedmd.com/plugins.html) and work with a team managing a cluster. This issue is to keep track of ideas around this.

Asynchronous call to the carbon intensity API

To avoid calling the API more than once every 30min (as data doesn't change in between).
Options include:

Writing out a time-stamped csv, and if the csv is already present, bypass API call
More elegant cache methods

Display error if duration is greater than 48 hours

If api is carbonintensity.org.uk then we should not try to optimise start time for a task with duration > 48 hours

Make conflig.yml optional

We probably need to make this optional (assuming we don't need it in all cases) or create in on install (is that something we can do)?

I'll have a bash at step 1.

Allow customisation of date output through `--dateformat`

As suggested in #63, allow customisation of date format. Introduce a --dateformat option that takes strftime(3) syntax and outputs the date.

This is intended for customization for users into their existing workflows. We expect most usage through the supported --scheduler options, that will auto set appropriate formatting options.

Calculate corresponding carbon footprints

carbon intensity calculation

calculate the average carbon intensity over the duration of the job. For now this can be with the forecast data when the job starts, but we could later switch it to looking up the real data at the end of the job.

Versioning strategy, towards a v1.0 tool

Here and within the next meeting we should discuss versioning and how we want to approach it, in particular:

at what stage we can consider we have a version '1.0' package of cats, including alpha and/or beta candidates towards that;
what system is best to use for the versioning going forward, e.g. semantic, date-based, etc.

So it would be a good idea to try to think a bit about this in advance.

unknown fail

falls over on first try:
INSTALL via
pip install git+https://github.com/GreenScheduler/catsthub.com/GreenScheduler/cats

but...
$ ~/.local/bin/cats -d 10
WARNING:root:config file not found
WARNING:root:Unspecified carbon intensity forecast service, using carbonintensity.org.uk
WARNING:root:location not provided. Estimating location from IP address: M3.
Traceback (most recent call last):
File "/home/staff/banem/.local/bin/cats", line 8, in
sys.exit(main())
File "/home/staff/banem/.local/lib/python3.9/site-packages/cats/init.py", line 218, in main
config, CI_API_interface, location, duration = get_runtime_config(args)
ValueError: too many values to unpack (expected 4)

wich doesn't mean much to me! There are WARNINGs (not ERRORs) but then some unclear "ValueError" failure...

Implement caching of carbon intensity forecast

Currently a new request to carbonintensity.org.uk is made each time cats is run. In cats/__init__.py:

def findtime(postcode, duration):
  tuples = get_tuple(postcode) # API request
  result = writecsv(tuples, duration) # write intensity data to disk
				      # as csv timeseries
  # ...

Although the carbon intensity data obtained from the API is written on disk, this not taken advantage of. Instead, if the relevant carbon intensity data is already on disk, we'd like to reuse this data instead of making a new request each time.

The local carbon intensity forecast data is reusable if the last forecast datetime is beyond the expected finish datetime of the application, i.e. forecast_end > now() + runtime.

A possible approach is to reshuffle the responsabilites of both top-level functions api_query.get_tuple and parsedata.writecsv.

First, get_tuple could be responsible for ensuring that the right data is present on disk, and download it if not.
Then writecsv only cares about computing the best job start time, assuming correct intensity data is available. For instance,

# cats/__init__.py
def findtime(postcode, duration):
  tuples = get_tuple(postcode)
  result = writecsv(tuples, duration)

then becomes

# cats/__init__.py
def findtime(postcode, duration):
  # Check if cached carbon intensity data goes beyond
  # now() + duration, download new forecast if not
  # formerly `get_tuple()`
  ensure_cached_intensity_data(postcode, duration)
  # Then -- assuming data is available on disk -- compute
  # the best time to start the job.
  # formerly `writecsv()`
  result = get_best_start_time(duration)

This approach has the benefit is maitaining a good separation between talking to the API – and caching intensity data – and the calculation of the start time. We currently do almost have this, expect that the function returning the start time is also responsible for writing the intensity data on disk.

Another possible approach is to push the API query and data caching down to the current writecsv function:

def writecsv(data_path: str, duration=None) -> dict[str, int]:
    try:
	return cat_converter(data_path, method, duration)
    except MissingItensityDataError:
	cache_latest_intensity_forecast(postcode)
	return cat_converter(data_path, method, duration)

Run a test deployment

Deploy CATSv2 on a real cluster(s) that was/were offered to us at the scoping workshop.

Multinational support: research electricity grid APIs for other countries

Twitter thread background

(See also the link above for original source.)

Does it work outside the U.S. ? I mean because of the grid data it depends on ?

It uses the UK @NationalGridESO API, but presumably other countries have equivalent ones?

Sorry for assuming U.S. was the default 😉. Probably, there's something equivalent elsewhere, too. Would be cool to add some resources to the http://README.md

Yeah, people from other countries should definitely raise issues with info for their grid.

Make available carbon intensity for jobs now and delayed

In order to provide carbon savings estimates the GreenAlgorithmsCalculator needs to now about

The average carbon intensity if the job is started right now
The average carbon intensity if the job is delayed (to start at the time computed by cats).

(2) is currently returned by parsedata.writecsv

If I'm not mistaken (1) is currently not computed to we'd have to do it additionally - but all the ingredients are there.

I think this is the last remaining piece to allow cats to display carbon emissions savings? (see #20 )

CATS packaging

Once cats is in PyPi, we could look into packaging for distributions such as Fedora, Debian and other channels such as Homebrew and conda-forge. Not a priority until after 1.0. Suggested in #29

Support JSON output through `--format=json`

As suggested in #63 discussion, introduce a --format option that will support machine readable output in a specified JSON schema.

[minor] Make the tool work for carbon intensity intervals of arbitrary sizes

(just saving it as a small job for later if anyone has time)

For now, the tool assumes the carbon intensity forecasts are regular intervals throughout (e.g. every 30min). It would be good to either keep that assumption but check it by testing thee forecasts, or make the integration method work for any kind of intervals (which would be cleaner).
(discussed here briefly)

user-facing wrapper around `at` unix command

We want to have a cats executable at the minimum taking

the program to execute
expected duration of the program
[optional] loacation as UK postcode, if not infer location using global ip adress

cats myprog --dur 00:08 --loc E14

JOSS publication

Submit CATS to the Journal of Open Source Software

Draft readme and associated documents

We need the repository to act as an advertisement for the overall problem as well as the 'normal' stuff (contribution guide, code of conduct, getting started documentation reference documentation etc.) so this needs particular thought

Get package in PyPi

This should make it easy to install

TestPyPI: https://test.pypi.org/project/climate-aware-task-scheduler/0.1.0/

List of authors
Homepage if we want something other than GitHub
Tags
optional: CI/CD for auto-uploading new versions

Grab data from API

We need to pull the next 48 hours of data for the current location.

How to best estimate an average carbon intensity over the duration of a job

We have been discussing with @tlestang about what the data collected by the carbon intensity.org.uk represents, and therefore how best to calculate the average CI over a long period of time.

The API sends data for 30min periods, each value having a from and a to parameter (e.g. 50 gCO2e from 7:00am to 7:30am).

There are at least two ways these values can have been obtained (figure below).

Assuming the blue line is the real (continue) CI forecast:

The values provided can be timepoints on that curve (red dots) that are then provided for the next 30min (in the example above, the value for 9:00-9:30 would be 40)
Or they could have already been averaged over the 30min (green lines). In the example above, the value for 9:00-9:30 would be 45.
The consequence is that in case (1), it is best to approximate the blues curve by using some form of integration between the red dots (trapezoidal was used so far), while in (2), it is best to sum the respective averages (easier).

Probably best to have both options implemented for when/if we add now APIs, but also good to have a good understanding of what's best.

I'm emailing the people in charge to ask about that, but good to hear everyone's thoughts about that! (as it's quite an important part of the tool!)

request: output CI of suggested time period/s

Useful to have CATS say when CI is lowest (over 48h forecast) but it would also be useful to have CATS state what that value is. Then, for example, I can take my known energy to solution and directly calculated the CO2eq (rather than have to set up a config file and use an average power which is very approximate). Ta, M

Wider dissemination of CATS

Write a blogpost for the SSI
advertise CATS on appropriate mailing lists and Slacks, e.g. HPC-SIG or Slurm mailing list.
Email scoping workshop attendees

Setup CI testing

Once we have a test or two committed we should turn on testing. We may want to be clever about how we do this to minimise the carbon use of our tests and use this as an example for the turing way book chapter (see GreenScheduler/env-impact-of-open-research-chapter#1).

Handle absence of parameter files gracefully or provide sensible defaults

Currently, cats works fine if run from the git checkout as it finds config.yml and fixed_parameters.yml. When run out of tree, for example by using a pipx install, cats fails as it does not find parameter files.

Steps to reproduce:

$ cd <directory where cats is cloned>
$ pipx install .
$ cd <any other directory>
$ cats -d 5 --loc OX1 --jobinfo=cpus=2,gpus=0,memory=8,partition=CPU_partition
WARNING:root:config file not found
Traceback (most recent call last):
  File "/Users/abhidg/.local/bin/cats", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/abhidg/.local/pipx/venvs/climate-aware-task-scheduler/lib/python3.12/site-packages/cats/__init__.py", line 285, in main
    args.jobinfo, expected_partition_names=config["partitions"].keys()
                                           ~~~~~~^^^^^^^^^^^^^^
KeyError: 'partitions'

Expected output:

cats should show a user-friendly error in this case and suggest a config file.

Other points:

If the fixed_parameters.yaml file is fixed and non-configurable by the user, then it makes sense to inline it in the carbonFootprint module. Data files can be installed through pyproject.toml, but this may not be needed for this use case.

Docs: all auto-generated API reference entries missing

See #51 (comment).

Clarify units and accepted format(s) of `<job_duration>` option value

I can't find information for this in the README. The code seems to imply the accepted format is an integer representing seconds, but possibly some datetime formats are accepted too (I don't have much time to investigate). Please can someone specify this in more detail in the README.

An example command introduced with a sentence explaining what it does might be really instructive to illustrate the use of both present command-line options (e.g. postcode as a string).

Refactor the code for more modularity + self-explanatory function names

Looking at the code, I think it would be worth refactoring it to clean up a bit from the hackaday rush and keep each module separately: in particular it would make it easier to address communications between functions, as raised by #25, and facilitate future expansion to new countries by replacing the API part for #22. It would also facilitate asynchronous tasks when some modules need to be bypassed.

An example of what can be confusing at the moment: tracking back where the optimal start time is being computed. In __init__.py, the optimal time to run the job is provided by writecsv from the parsedata.py file (not immediately clear from the name). writecsv itself calls cat_converter from timeseries_conversion which in turns calls the function get_lowest_carbon_intensity. The last function makes sense, but the steps in between would be really difficult to guess!

My suggestion is that each component is a separate class in a separate file, for now these are:

UK API call to obtain carbon intensity time series
Find the optimal running window to minimise carbon intensities
Calculate carbon footprint

And each component is called directly from the main function, something like:

def main(arguments=None):
    parser = parse_arguments()
    args = parser.parse_args(arguments)

    # ... some stuff about config file

    CI = APIcarbonIntensity_UK(args).getCI()

    optimal_time = runtimeOptimiser(args).findRuntime(CI)

    print(f"Best job start time: {optimal_time['start_time']}")

    if carbonFootprint:
        estim = greenAlgorithmsCalculator(...).get_footprint()
        print(f"carbon footprint: {estim}")

I'm keen to start working on a branch in this direction, but would like to hear people's thoughts on that as I've probable forgotten some aspects! 😃

Investigate potential test systems for running a small Slurm cluster

Find a suitable system for testing Slurm on. This could be:

@sadielbartholomew's Raspberry pi Cluster
A cloud based cluster that we can temporarily create.
A physical server that we've been offered use of.

Write function to take timeseries of carbon intensity, runtime and output optimal time

Given a timeseries in the form of a CSV file with columns timestamp, carbon_intensity and a runtime, write a function that returns a timestamp that will minimize total carbon intensity.

What (should) happen at 48 hrs?

I've just done a demo of this and it turned out that the lowest carbon intensity predicted for Oxford was in 48 hrs time (the last half hour returned by the API). Currently we choose to schedule the start of the task then. Is this what we want to happen?

Probably worth thinking about some of this kind of edge case, and cooking up some example csv files so we can test them. But deciding what to do isn't obvious to me.

Remove unused carbon footprint files

Files like fixed_parameters.yaml are not being used anymore and should be removed.

Providing users with the option to pass their own API wrappers for carbon intensity

This is to continue the discussion @tlestang started with PR #43

From my comments there:

I see two different use cases here:

We or other contributors will want to add other CI APIs (e.g. for other countries), and we ideally want to make them part of CATS so that these new APIs are available to the whole community. In this case, it would be good to have all the URL/parsing codes in the same place and api_inferface is a good place for that (it also makes it easier to add things by copy-pasting). And in terms of how much hassle it is to add it, it's equivalent now and with the new code (api_interface needs to be modified either way, and current code requires messing with init.py as well), but the existing code doesn't allow user to easily pick an API, this is what the new argument --api-carbonintensity introduces.

Second use case is if users want to pass their own API wrapper directly to CATS without having to modify the code. And in this case I agree, it would be good to make it possible in an easier way. But how would that work in practice? It would be good to have an idea of how the user would do it if we want to implement it.

This issue is to discuss whether we want to implement (2) and how it would work in practice from the user's point of view.

greenscheduler / cats Goto Github PK

cats's People

Contributors

Stargazers

Watchers

Forkers

cats's Issues

Essential info.

Solution for now

Twitter thread background

Twitter thread background

Twitter thread background

Recommend Projects

Recommend Topics

Recommend Org

Jobs