This is just the repo for the web site at gleaner.io. It is a shameless copy of the Tailwind Toolbox landing page template.
It's deployed out to a Google object store served via GROW.
Scheduling approaches related to gleaner tooling
License: Apache License 2.0
Need to automate the build of the dagster code when a configuration file updates.
Need to document and automate the whole build process when a config file changes. It would be good to do this all the way to docker containers.
Need to diagram this flow out better in the documents.
If there is an error in a run, a container will be left behind.
We probably need a method that wraps the call to a container (in case we use another one), to be sure it is created, and removed if there is an error.
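A minimal sketch of such a wrapper, assuming the Docker SDK for Python; the image and container name below are placeholders, and the real call would pass through the gleaner command and mounts:

```python
import docker
from contextlib import contextmanager

@contextmanager
def managed_container(image, name, **kwargs):
    # create the container up front and guarantee removal, even on error
    client = docker.from_env()
    container = client.containers.create(image, name=name, **kwargs)
    try:
        yield container
    finally:
        container.remove(force=True)

# usage sketch (image/name are assumptions, not the real values):
# with managed_container("fils/gleaner:v3.0.11", "gleaner01_opentopography") as c:
#     c.start()
#     c.wait()
```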
For testing, it looks like the @graph needs to be moved out of the file with all the ops.
Something like this works with the @graph removed:
from implnet_ops_geocodes_demo_datasets import geocodes_demo_datasets_gleaner

def test_geocodes_demo_datasets_gleaner():
    # execute the graph in process for testing
    res = geocodes_demo_datasets_gleaner.execute_in_process()
    assert res.success
    # placeholder assertion from the Dagster tutorial; replace with a real node name
    assert res.output_for_node("find_highest_protein_cereal") == "Special K"
Maybe we can use the same set of @ops with parameters/context passed.
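A sketch of what that could look like, with a single op that gets the source name via run config (the op and job names here are invented):

```python
from dagster import job, op

@op(config_schema={"source": str})
def gleaner_run(context):
    # one shared op; the source to harvest arrives via run config
    source = context.op_config["source"]
    context.log.info(f"harvesting {source}")
    return source

@job
def harvest_job():
    gleaner_run()

# e.g. harvest_job.execute_in_process(
#     run_config={"ops": {"gleaner_run": {"config": {"source": "opentopography"}}}}
# )
```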
There are now separate runstats and repo_{name}_{loaded|issue}.log files.
Should these be uploaded?
Also, should we capture them to directories?
Option 1: put each source's files in its own directory.
Option 2: put them all in one place.
Can the latest runstat be an artifact? That way we would not need to dig too far after a run.
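One possible way to surface the latest runstat in Dagit is to log an AssetMaterialization from the op that writes it; a sketch, where the asset key and file path are assumptions:

```python
from dagster import AssetMaterialization, MetadataValue, op

@op
def report_runstat(context):
    # hypothetical path to wherever the run writes its runstats file
    runstat_path = "/tmp/runstats/latest.json"
    context.log_event(
        AssetMaterialization(
            asset_key="latest_runstat",
            description="Run statistics from the most recent run",
            metadata={"path": MetadataValue.path(runstat_path)},
        )
    )
```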
The following is being seen in the nabu prune call but not in the prefix call:
{
  "file": "/home/fils/src/Projects/gleaner.io/nabu/internal/objects/pipeload.go:56",
  "func": "github.com/gleanerio/nabu/internal/objects.PipeLoad",
  "level": "error",
  "msg": "JSONLDToNQ err: %sunexpected end of JSON input",
  "time": "2023-02-21T00:26:26Z"
}
{
  "file": "/home/fils/src/Projects/gleaner.io/nabu/internal/objects/gets3bytes.go:21",
  "func": "github.com/gleanerio/nabu/internal/objects.GetS3Bytes",
  "level": "info",
  "msg": "Issue with reading an object: gleaner.oih/summoned/africaioc/ffb59b01cf1d2de175c66576d2b69c7940dda8a5.jsonld",
  "time": "2023-02-21T00:26:26Z"
}
{
  "file": "/home/fils/src/Projects/gleaner.io/nabu/internal/objects/pipeload.go:41",
  "func": "github.com/gleanerio/nabu/internal/objects.PipeLoad",
  "level": "error",
  "msg": "gets3Bytes %v\\nThe specified key does not exist.",
  "time": "2023-02-21T00:26:26Z"
}
{
  "file": "/home/fils/src/Projects/gleaner.io/nabu/internal/graph/jsonldToNQ.go:17",
  "func": "github.com/gleanerio/nabu/internal/graph.JSONLDToNQ",
  "level": "info",
  "msg": "Error when transforming JSON-LD document to interface: unexpected end of JSON input",
  "time": "2023-02-21T00:26:26Z"
}
Then failure, because a container with that name already exists:
urllib.error.HTTPError: HTTP Error 409: Conflict
returned_value = gleanerio(("gleaner"), "opentopography")
File "/usr/src/app/./ops/implnet_ops_opentopography.py", line 177, in gleanerio
docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1.e9c742fnsem50khimxyxni0as
4a3e056a973f fils/gleaner:v3.0.11-development-df "/gleaner/gleaner --…" 5 hours ago Exited (1) 5 hours ago gleaner01_opentopography
f22d332a1701 fils/gleaner:v3.0.11-development-df "/gleaner/gleaner --…" 10 hours ago Exited (1) 10 hours ago gleaner01_geocodes_demo_datasets
earthcube@ip-172-31-2-108:~$ docker container rm f22d332a1701
f22d332a1701
earthcube@ip-172-31-2-108:~$ docker container rm 4a3e056a973f
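Until a proper wrapper exists, a pre-run cleanup along these lines (Docker SDK for Python; the name prefix is an assumption) would keep exited gleaner containers from blocking the next run:

```python
import docker

def remove_stale_containers(prefix="gleaner01_"):
    # remove exited containers left behind by earlier failed runs
    client = docker.from_env()
    for c in client.containers.list(all=True, filters={"name": prefix}):
        if c.status == "exited":
            c.remove()
```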
Wonder if, for headless, we can't spin up a container dedicated to that process.
Need some logic/documentation/ideas
While we can run locally, the configs need to get up to the PORTAINER_URL.
So if PORTAINER_URL is not equal to the Endpoints.docker.Host, toss cookies ;) (that is, fail loudly).
It looks like we will need to use docker contexts to set the endpoint for the docker scripts.
(venv) valentin@MacBook-Pro deployment % docker context show
desktop-linux
(venv) valentin@MacBook-Pro deployment % docker context inspect desktop-linux
[
{
"Name": "desktop-linux",
"Metadata": {},
"Endpoints": {
"docker": {
"Host": "unix:///Users/valentin/.docker/run/docker.sock",
"SkipTLSVerify": false
}
},
"TLSMaterial": {},
"Storage": {
"MetadataPath": "/Users/valentin/.docker/contexts/meta/fe9c6bd7a66301f49ca9b6a70b217107cd1284598bfc254700c989b916da791e",
"TLSPath": "/Users/valentin/.docker/contexts/tls/fe9c6bd7a66301f49ca9b6a70b217107cd1284598bfc254700c989b916da791e"
}
}
]
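A hedged sketch of that check, shelling out to docker context inspect and comparing against a PORTAINER_URL environment variable (the variable name and the fail-loudly behaviour are assumptions):

```python
import json
import os
import subprocess

def check_docker_endpoint():
    # compare the active docker context's endpoint with PORTAINER_URL
    out = subprocess.check_output(["docker", "context", "inspect"])
    host = json.loads(out)[0]["Endpoints"]["docker"]["Host"]
    portainer = os.environ.get("PORTAINER_URL")
    if portainer and portainer != host:
        raise RuntimeError(f"PORTAINER_URL ({portainer}) != docker endpoint ({host})")
```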
It would be good to let people know this is not the only way to do this.
Airflow, Prefect, and newer ones like Dagu (https://github.com/yohamta/dagu) are all valid alternatives. Indeed, a rather large amount of this could be done via cron too.
Need to be able to cancel a job, and not have the container hang around.
Could a list of sources be the first asset?
Then could sitemaps and information from sitemaps be the next asset to drive the system?
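A rough sketch of that asset graph in Dagster; the asset names and example sources are invented:

```python
from dagster import asset

@asset
def sources():
    # first asset: the list of sources, e.g. read from the gleaner config
    return ["opentopography", "geocodes_demo_datasets"]

@asset
def sitemap_urls(sources):
    # next asset: a sitemap URL per source, which could then drive the
    # per-source harvest jobs
    return {s: f"https://example.org/{s}/sitemap.xml" for s in sources}
```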
The iow generator is not doing the time distribution.
Need to take the generator code and make one version of it with arguments for running. Encode these into the Makefile too, and rename the Makefile entries for command-line completion.
The GLEANER_ variables are not getting utilized in a run.
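One way to make sure they reach the run, assuming the container is started through the Docker SDK, is to forward everything with the GLEANER_ prefix explicitly (a sketch, not the current code):

```python
import os

# collect all GLEANER_* variables from the scheduler's environment
gleaner_env = {k: v for k, v in os.environ.items() if k.startswith("GLEANER_")}

# then pass them along when the container is created, e.g.
# client.containers.create(image, environment=gleaner_env, ...)
```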
At present I need to review the "ops templates" and review the loading of triples. I think currently just the summoned triples are loaded.
Need to review and, if needed, add the prov triples as a new step in the loading done by Nabu.
Added headless to dagster stack.
If the gleaner container is connected manually to the network, then we can connect using
Some other changes are on the dev_eco branch: https://github.com/earthcube/scheduler/tree/dev_eco
Need to remove the passwords from the gleaner/nabu config file.
https://github.com/gitleaks/gitleaks/blob/master/scripts/pre-commit.py
http://eloquentcode.com/prevent-committing-secrets-with-a-pre-commit-hook
The sitemap is returned as text/plain, whereas application/xml is expected: ERROR reading sitemap XML.
(related to BeBOP-OBON/odis-interface#1)
Given:
https://geoconnex.us/sitemap.xml
and
https://geoconnex.us/sitemap/usgs/monitoring-location/nwisgw/nwisgw__20.xml
use nwisgw__20 for the name
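A small helper along these lines (the function name is made up) captures that rule:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def name_from_sitemap_url(url: str) -> str:
    # take the last path segment of the URL and drop the .xml extension
    return PurePosixPath(urlparse(url).path).stem

assert name_from_sitemap_url(
    "https://geoconnex.us/sitemap/usgs/monitoring-location/nwisgw/nwisgw__20.xml"
) == "nwisgw__20"
```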
So the development process for this across the three implementers was hideous...
see title
Figure out how to get the proper internal networking to make the Docker headless work with an http://headless:9000/ call.
Or do we do a stack for each run with a headless container?
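If we do not go the stack-per-run route, one option is to attach the gleaner container to the network the headless service already sits on; a Docker SDK sketch where the network and container names are assumptions:

```python
import docker

client = docker.from_env()
# attach the gleaner container to the same user-defined network as the
# headless service, so that http://headless:9000/ resolves inside it
network = client.networks.get("dagster_headless_net")
network.connect("gleaner01_opentopography")
```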
It looks like I can make the sitemap op, job, and schedule into one file and then append into the arrays for jobs and schedules in the repo file.
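A sketch of how the repo file could then stay generic while the generated per-source files are appended in; the module and function names follow the implnet_* pattern but are assumptions:

```python
from dagster import repository

# each generated per-source module exposes its job and schedule
from implnet_jobs_opentopography import implnet_job_opentopography
from implnet_sched_opentopography import implnet_sched_opentopography

jobs = [implnet_job_opentopography]
schedules = [implnet_sched_opentopography]

@repository
def gleaner_repo():
    # the generator only needs to append to the two lists above
    return jobs + schedules
```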
Run the load summary reports:
They should be installed when you install earthcube utilities ;)