
ao-workflows's Introduction

Vagrant scripts for BUDA platform instantiation

The base platform is built using Vagrant and VirtualBox:

  1. Install Vagrant and VirtualBox.
  2. Download or git clone this repository.
  3. cd into the unzipped directory or the clone.
  4. Install the VirtualBox guest additions plugin: vagrant plugin install vagrant-vbguest
  5. Run vagrant up to bring up a local instance.

Or for an AWS EC2 instance:

  1. Install the vbguest plugin: vagrant plugin install vagrant-vbguest
  2. Run vagrant up, or rename Vagrantfile.aws to Vagrantfile and run vagrant up --provider=aws

This will grind for a while, installing all the dependencies of the BUDA platform.

Once the initial install has completed, vagrant ssh will connect to the instance, where development, customization of the environment, and so on can be performed as on any headless server.

Once the platform is running, the jena-fuseki server will be listening on:

http://localhost:13180/fuseki

The lds-pdi application is accessible at:

http://localhost:13280/

(see https://github.com/buda-base/lds-pdi/blob/master/README.md for details about using these REST services)

The command vagrant halt will shut the instance down. After halting (or suspending) the instance, a further vagrant up will simply boot it without further downloads, and vagrant destroy will completely remove it.

If running an AWS instance, after provisioning, access the instance via ssh -p 15345, remove the Port 22 line from /etc/ssh/sshd_config, and run sudo systemctl restart sshd. This further secures the instance against attacks on port 22.


ao-workflows's Issues

Remove synced works from airflow

After a work has been synced, its debagged contents persist inside airflow at /home/airflow/bdrc/data, which is mapped on the real filesystem to AIRFLOW_DIR/AO-staging/Incoming. Rather than removing them manually, a successful sync should remove the source dir; the sync debagged task has access to this path (see the sketch below).
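A minimal sketch of that cleanup step, assuming the sync debagged task can call a helper like this once the sync succeeds; the staging path is taken from above, while the function and argument names are assumptions, not the actual DAG code:

# Sketch only: remove a work's debagged contents after a successful sync.
import shutil
from pathlib import Path

STAGING_ROOT = Path("/home/airflow/bdrc/data")  # maps to AIRFLOW_DIR/AO-staging/Incoming

def remove_synced_work(work_rid: str) -> None:
    """Delete the debagged contents of a work once its sync has succeeded."""
    work_dir = (STAGING_ROOT / work_rid).resolve()
    # Guard: only delete directories that really live under the staging root.
    if work_dir.is_dir() and STAGING_ROOT in work_dir.parents:
        shutil.rmtree(work_dir)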

Remediate IA

Fix the issues that the buda-base/ia-metadata-report identifies.

  • Identify the works which failed the derive process. We can't update those until they derive, so their misadjustment might not apply.
  • Identify the works that could be remediated by a rederive. These are all the misplaced works which do not have derive failures.
  • Build a task that rederives the open works. (see note below)
    Repeatedly apply the ia-metadata-report/fillcache and create-report shells until only works which have failed derive appear.

Derives can take quite a long time, so this work will have to be returned to repeatedly.

In an ideal world, the code should only submit a few works, and only submit more when the number of active works falls below a threshold (see the sketch below). Initially, this will be manual.
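A sketch of that throttling loop, assuming hypothetical helpers count_active_derives() and submit_for_derive() that wrap whatever the derive queue actually exposes; the threshold and batch size are assumptions:

# Sketch only: submit a small batch of works for rederive, and only top up
# when the number of active derives falls below a threshold.
MAX_ACTIVE = 5   # assumed threshold
BATCH_SIZE = 3   # assumed batch size

def maybe_submit(pending: list[str]) -> list[str]:
    """Submit up to BATCH_SIZE works if there is capacity; return the remainder."""
    if count_active_derives() >= MAX_ACTIVE:   # hypothetical helper
        return pending
    to_submit, remaining = pending[:BATCH_SIZE], pending[BATCH_SIZE:]
    for work in to_submit:
        submit_for_derive(work)                # hypothetical helper
    return remaining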

Create new records in glacier sync tracking db

glacier_staging_to_sync.py and extras.py have routines that perform CRUD operations against the drs.glacier_staging_progress table. Those routines should be able to create records if they don't exist (see the sketch below).
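A hedged sketch of the create-if-missing behaviour over a plain DB-API connection; the column name object_name is an assumption, not the actual drs schema:

# Sketch only: ensure a glacier_staging_progress row exists before updating it.
def upsert_progress(conn, object_name: str) -> None:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM glacier_staging_progress WHERE object_name = %s",
            (object_name,),
        )
        if cur.fetchone() is None:
            cur.execute(
                "INSERT INTO glacier_staging_progress (object_name) VALUES (%s)",
                (object_name,),
            )
    conn.commit()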

Reconcile docker file system with dip-pump

Airflow under docker writes to a bind-mounted volume, which results in a path that the dip-pump cannot see.

The dip_log event is generated from a shell running in the docker image, which sees a file path different from the native host's. In the docker compose file, the volumes are mounted:

      - ${ARCH_ROOT:-/mnt}/Archive0:/home/airflow/extern/Archive0
      - ${ARCH_ROOT:-/mnt}/Archive1:/home/airflow/extern/Archive1
      - ${ARCH_ROOT:-/mnt}/Archive2:/home/airflow/extern/Archive2
      - ${ARCH_ROOT:-/mnt}/Archive3:/home/airflow/extern/Archive3

Left of the colon is the host (local) path; right of the colon is the mount point inside the container.

This results in a dip_log dip_dest_path of /home/airflow/extern/Archive0 for the resulting record, which means that dip_log work will not be able to locate that path unless shims are made on sattva (e.g. ln -s /mnt/Archive0 /home/airflow/extern/Archive0) or the path is duplicated precisely in the docker image. The latter is the first path to explore:

      - ${ARCH_ROOT:-/mnt}/Archive0:/mnt/Archive0
      - ${ARCH_ROOT:-/mnt}/Archive1:/mnt/Archive1
      - ${ARCH_ROOT:-/mnt}/Archive2:/mnt/Archive2
      - ${ARCH_ROOT:-/mnt}/Archive3:/mnt/Archive3

bdrc-docker.sh can do this, in the same way it creates other resources. Only if that fails should we use a shim on the client hosts (which I really don't want to have to support on two machines!).

A simpler fix:

      - ${ARCH_ROOT:-/mnt}:/mnt

And reflect the changes in the DAG, Dockerfile-bdrc, and bdrc-docker-compose.yml (a quick path check is sketched below).
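A small sanity check, offered as an assumption rather than project code, that could run at DAG start to confirm the single /mnt bind mount gives the container the same archive paths the dip-pump expects:

# Sketch only: fail fast if the expected archive path is not visible
# inside the container.
import os

def assert_archive_visible(path: str = "/mnt/Archive0") -> None:
    if not os.path.isdir(path):
        raise RuntimeError(
            f"{path} is not mounted; check the volumes in bdrc-docker-compose.yml"
        )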

Download runs out of space

Processing this message:

[
  {
    "eventVersion": "2.1",
    "eventSource": "aws:s3",
    "awsRegion": "ap-northeast-2",
    "eventTime": "2024-04-06T00:11:23.730Z",
    "eventName": "ObjectRestore:Completed",
    "userIdentity": {
      "principalId": "AmazonCustomer:A1JPP2WW1ZYN4F"
    },
    "requestParameters": {
      "sourceIPAddress": "s3.amazonaws.com"
    },
    "responseElements": {
      "x-amz-request-id": "439897F6741FD9BA",
      "x-amz-id-2": "MF0oW9le+g8K5/R/uUks1QuFbZxNuSmZDWQ5utu8ZTcHEKSGFHzdFBEtebICzrPtG3YL1YVmffxhRw4nDPTZ1w=="
    },
    "s3": {
      "s3SchemaVersion": "1.0",
      "configurationId": "BagCreatedNotification",
      "bucket": {
        "name": "glacier.staging.nlm.bdrc.org",
        "ownerIdentity": {
          "principalId": "A1JPP2WW1ZYN4F"
        },
        "arn": "arn:aws:s3:::glacier.staging.nlm.bdrc.org"
      },
      "object": {
        "key": "Archive0/00/W1NLM4700/W1NLM4700.bag.zip",
        "size": 17017201852,
        "eTag": "41654cbd2a8f2d3c0abc83444fde825b-2029",
        "sequencer": "00638792A45B638391"
      }
    },
    "glacierEventData": {
      "restoreEventData": {
        "lifecycleRestorationExpiryTime": "2024-04-12T00:00:00.000Z",
        "lifecycleRestoreStorageClass": "DEEP_ARCHIVE"
      }
    }
  }
]

size is "size": 17,017,201,852" 17GB

[2024-04-05, 20:30:09 EDT] {taskinstance.py:2513} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='***' AIRFLOW_CTX_DAG_ID='sqs_scheduled_dag' AIRFLOW_CTX_TASK_ID='download_from_messages' AIRFLOW_CTX_EXECUTION_DATE='2024-04-06T00:20:00+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-04-06T00:20:00+00:00'
[2024-04-05, 20:30:09 EDT] {logging_mixin.py:188} INFO - using secrets
[2024-04-05, 20:30:09 EDT] {logging_mixin.py:188} INFO - section='ap_northeast'   ['default', 'ap_northeast']
[2024-04-05, 20:34:22 EDT] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
...
                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/s3transfer/download.py", line 643, in _main
    fileobj.write(data)
  File "/home/airflow/.local/lib/python3.11/site-packages/s3transfer/utils.py", line 379, in write
    self._fileobj.write(data)
OSError: [Errno 28] No space left on device
[2024-04-05, 20:34:22 EDT] {taskinstance.py:1149} INFO - Marking task as FAILED. dag_id=sqs_scheduled_dag, task_id=download_from_messages, execution_date=20240406T002000, start_date=20240406T003009, end_date=20240406T003422
[2024-04-05, 20:34:22 EDT] {standard_task_runner.py:107} ERROR - Failed to execute job 259 for task download_from_messages ([Errno 28] No space left on device; 13305)
[2024-04-05, 20:34:22 EDT] {local_task_job_runner.py:234} INFO - Task exited with return code 1

Two possible approaches:

  1. Bind mount the output. This exposes the writing area to host systems. If we put this area on /mnt/AO-staging-Incoming we have an internal log of downloaded bag.zips that we can delete from outside the container.
  2. Use a shared volume, and have the docker procedure erase the bag.zip when it is complete (sketched below).
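A hedged sketch of approach 2: check free space against the object size before downloading, then erase the bag.zip once processing completes. The bucket and key come from an event like the one above; download_dir and process_bag() are assumptions:

# Sketch only: guard the download with a free-space check and clean up after.
import shutil
from pathlib import Path

import boto3

def download_and_clean(bucket: str, key: str,
                       download_dir: str = "/home/airflow/bdrc/data") -> None:
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    free = shutil.disk_usage(download_dir).free
    if size > free:
        raise OSError(f"need {size} bytes for {key}, only {free} free in {download_dir}")
    dest = Path(download_dir) / Path(key).name
    s3.download_file(bucket, key, str(dest))
    try:
        process_bag(dest)              # hypothetical debag/sync step
    finally:
        dest.unlink(missing_ok=True)   # erase the bag.zip when done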

Unify Dockerfile and docker-compose

Running Airflow in Docker provides a cookbook approach to Airflow. This issue modifies the current stack to use best practices from this guide.

The existing approach was taken from a textbook; this issue calls for switching to official Docker sources and approved methods.

Clean up contributory material.

The Dockerfile wants everything to be relative to a given working directory. Since the Dockerfile sits in the root of the airflow-docker directory, a lot of material winds up there.

Push all the local sources into staging and change airflow-docker/Dockerfile to reflect this.

Add airflow-docker/staging/* to .gitignore.

Create AWS settings repo

The ao-workflows repo began the process of documenting AWS settings for some SQS queues and bucket events as code. This material lives in the public ao-workflows repository (which needed to be public for its documentation).

AWS settings shouldn't be in a public repo; create a private repo for the settings and move them out of ao-workflows.

Handle multiple regions

There's one DAG to initiate the workflow, but it looks in an SQS queue in the us-east-1 region, which is the default. The actual FPL and NLM archives are in different regions (a per-region polling sketch follows the table):

source | zone
NLM    | ap-northeast-1
FPL    | ap-southeast-1
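A sketch (not the actual DAG) of polling one queue per source region instead of only us-east-1; the queue URLs and account id are placeholders:

# Sketch only: one SQS client per region listed in the table above.
import boto3

REGION_QUEUES = {
    "ap-northeast-1": "https://sqs.ap-northeast-1.amazonaws.com/123456789012/nlm-restore-events",
    "ap-southeast-1": "https://sqs.ap-southeast-1.amazonaws.com/123456789012/fpl-restore-events",
}

def poll_all_regions() -> list[dict]:
    messages = []
    for region, queue_url in REGION_QUEUES.items():
        sqs = boto3.client("sqs", region_name=region)
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages.extend(resp.get("Messages", []))
    return messages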

Don't succeed when aws download fails

A download task had this error:

[2024-03-14T15:41:21.312+0000] {logging_mixin.py:188} INFO - using secrets
[2024-03-14T15:41:21.313+0000] {logging_mixin.py:188} INFO - section='default'   ['default']
[2024-03-14T15:41:21.314+0000] {logging_mixin.py:188} INFO - KeyError: 'region_name'

Yet the task succeeded; it should have failed (see the sketch below).
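A sketch of the intended behaviour, assuming the task body is wrapped so configuration errors propagate; the function body is an assumption about the real download task, not its actual code:

# Sketch only: let a missing AWS config key fail the task instead of being
# logged and swallowed.
from airflow.exceptions import AirflowFailException

def download_from_messages(**context):
    try:
        ...  # build the boto3 session from the configured section and download
    except KeyError as err:
        # e.g. KeyError: 'region_name' when the config section is incomplete
        raise AirflowFailException(f"download misconfigured: missing {err}") from err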

Missing SQS notifications

I initiated glacier restore on a number of glacier.staging.nlm.bdrc.org works (W1NLM4700-5000, 5100-5900).
All the ones that existed restored successfully. Most of them sent SQS messages that the sqs_scheduled_dag picked up and synced.
A random subset (4500, 4600, 5200, 5600, 5700) restored successfully, but no message was sent, so the DAG didn't pick them up.

Find out why and how to recover, or develop another input path. (Airflow's dataset facility could be a data bridge between DAGs; see the sketch below.)
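A hedged sketch of that bridge using Airflow Datasets: a recovery DAG publishes the works it finds, and the sync DAG can be scheduled on the same Dataset (schedule=[restored_works] in the consumer), so works whose SQS message never arrived still enter the flow. The DAG id, dataset URI, and lookup step are assumptions:

# Sketch only: publish recovered work ids on a Dataset another DAG listens to.
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

restored_works = Dataset("bdrc://restored-works")  # placeholder URI

@dag(start_date=pendulum.datetime(2024, 4, 1), schedule=None, catchup=False)
def manual_restore_recovery():
    @task(outlets=[restored_works])
    def list_restored_without_messages() -> list[str]:
        # placeholder: query S3/Glacier for restored objects that never
        # produced an SQS event (e.g. W1NLM4500, W1NLM4600 above)
        return ["W1NLM4500", "W1NLM4600"]

    list_restored_without_messages()

manual_restore_recovery()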
