
ao-workflows's Introduction

Vagrant scripts for BUDA platform instantiation

The base platform is built using Vagrant and VirtualBox:

  1. Install Vagrant and VirtualBox.
  2. Download or git clone this repository.
  3. cd into the unzipped directory or the clone.
  4. Install the VirtualBox guest additions plugin: vagrant plugin install vagrant-vbguest
  5. Run vagrant up to bring up a local instance.

Or for an AWS EC2 instance:

  1. Install the vbguest plugin: vagrant plugin install vagrant-vbguest
  2. Run vagrant up, or rename Vagrantfile.aws to Vagrantfile and run vagrant up --provider=aws

This will grind for a while, installing all the dependencies of the BUDA platform.

Once the initial install has completed, vagrant ssh will connect to the instance, where development, customization of the environment, and so on can be performed as on any headless server.

Once the platform is running, the jena-fuseki server will be listening on:

http://localhost:13180/fuseki

The lds-pdi application is accessible at:

http://localhost:13280/

(see https://github.com/buda-base/lds-pdi/blob/master/README.md for details about using these REST services)

The command vagrant halt will shut the instance down. After halting (or suspending) the instance, a further vagrant up will simply boot it without further downloads, and vagrant destroy will completely remove it.

If running an AWS instance, after provisioning, access the instance via ssh -p 15345, remove the Port 22 line from /etc/ssh/sshd_config, and run sudo systemctl restart sshd. This further secures the instance against attacks on port 22.


ao-workflows's Issues

Remove synced works from airflow

After a work has been synced, its debagged contents persist inside airflow at /home/airflow/bdrc/data, which is mapped on the real filesystem to AIRFLOW_DIR/AO-staging/Incoming. Rather than removing them manually, a successful sync should remove the source dir; the sync debagged task has access to this path (see the sketch below).
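A minimal sketch of that cleanup step, assuming the sync debagged task can call a helper like this once the sync succeeds; the staging path is taken from above, while the function and argument names are assumptions, not the actual DAG code:

# Sketch only: remove a work's debagged contents after a successful sync.
import shutil
from pathlib import Path

STAGING_ROOT = Path("/home/airflow/bdrc/data")  # maps to AIRFLOW_DIR/AO-staging/Incoming

def remove_synced_work(work_rid: str) -> None:
    """Delete the debagged contents of a work once its sync has succeeded."""
    work_dir = (STAGING_ROOT / work_rid).resolve()
    # Guard: only delete directories that really live under the staging root.
    if work_dir.is_dir() and STAGING_ROOT in work_dir.parents:
        shutil.rmtree(work_dir)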

Remediate IA

Fix the issues that the buda-base/ia-metadata-report identifies.

  • Identify the works which failed the derive process. We can't update those until they derive, so their misadjustment might not apply.
  • Identify the works that could be remediated by a rederive. These are all the misplaced works which do not have derive failures.
  • Build a task that rederives the open works. (see note below)
    Repeatedly apply the ia-metadata-report/fillcache and create-report shells until only works which have failed derive appear.

Derives can take quite a long time, so this work will have to be returned to repeatedly.

In an ideal world, the code should only submit a few works, and only submit more when the number of active works falls below a threshold (see the sketch below). Initially, this will be manual.
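A sketch of that throttling loop, assuming hypothetical helpers count_active_derives() and submit_for_derive() that wrap whatever the derive queue actually exposes; the threshold and batch size are assumptions:

# Sketch only: submit a small batch of works for rederive, and only top up
# when the number of active derives falls below a threshold.
MAX_ACTIVE = 5   # assumed threshold
BATCH_SIZE = 3   # assumed batch size

def maybe_submit(pending: list[str]) -> list[str]:
    """Submit up to BATCH_SIZE works if there is capacity; return the remainder."""
    if count_active_derives() >= MAX_ACTIVE:   # hypothetical helper
        return pending
    to_submit, remaining = pending[:BATCH_SIZE], pending[BATCH_SIZE:]
    for work in to_submit:
        submit_for_derive(work)                # hypothetical helper
    return remaining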

Create new records in glacier sync tracking db

glacier_staging_to_sync.py and extras.py have routines that perform CRUD operations against the drs.glacier_staging_progress table. Those routines should be able to create records if they don't exist (see the sketch below).
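A hedged sketch of the create-if-missing behaviour over a plain DB-API connection; the column name object_name is an assumption, not the actual drs schema:

# Sketch only: ensure a glacier_staging_progress row exists before updating it.
def upsert_progress(conn, object_name: str) -> None:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM glacier_staging_progress WHERE object_name = %s",
            (object_name,),
        )
        if cur.fetchone() is None:
            cur.execute(
                "INSERT INTO glacier_staging_progress (object_name) VALUES (%s)",
                (object_name,),
            )
    conn.commit()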

Reconcile docker file system with dip-pump

Airflow under docker writes to a bind-mounted volume, which results in a path that the dip-pump cannot see.

The dip_log event is generated from a shell running in the docker image, which sees a file path different from the native host's. In the docker compose file, the volumes are mounted:

      - ${ARCH_ROOT:-/mnt}/Archive0:/home/airflow/extern/Archive0
      - ${ARCH_ROOT:-/mnt}/Archive1:/home/airflow/extern/Archive1
      - ${ARCH_ROOT:-/mnt}/Archive2:/home/airflow/extern/Archive2
      - ${ARCH_ROOT:-/mnt}/Archive3:/home/airflow/extern/Archive3

Left of the colon is the host (local) path; right of the colon is the mount point inside the container.

This results in a dip_log dip_dest_path of /home/airflow/extern/Archive0 for the resulting record, which means that dip_log work will not be able to locate that path unless shims are made on sattva (e.g. ln -s /mnt/Archive0 /home/airflow/extern/Archive0) or the path is duplicated precisely in the docker image. The latter is the first path to explore:

      - ${ARCH_ROOT:-/mnt}/Archive0:/mnt/Archive0
      - ${ARCH_ROOT:-/mnt}/Archive1:/mnt/Archive1
      - ${ARCH_ROOT:-/mnt}/Archive2:/mnt/Archive2
      - ${ARCH_ROOT:-/mnt}/Archive3:/mnt/Archive3

bdrc-docker.sh can do this, in the same way it creates other resources. Only if that fails should we use a shim on the client hosts (which I really don't want to have to support on two machines!).

A simpler fix:

      - ${ARCH_ROOT:-/mnt}:/mnt

And reflect the changes in the DAG, Dockerfile-bdrc, and bdrc-docker-compose.yml (a quick path check is sketched below).
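A small sanity check, offered as an assumption rather than project code, that could run at DAG start to confirm the single /mnt bind mount gives the container the same archive paths the dip-pump expects:

# Sketch only: fail fast if the expected archive path is not visible
# inside the container.
import os

def assert_archive_visible(path: str = "/mnt/Archive0") -> None:
    if not os.path.isdir(path):
        raise RuntimeError(
            f"{path} is not mounted; check the volumes in bdrc-docker-compose.yml"
        )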

Download runs out of space

Processing this message:

[
  {
    "eventVersion": "2.1",
    "eventSource": "aws:s3",
    "awsRegion": "ap-northeast-2",
    "eventTime": "2024-04-06T00:11:23.730Z",
    "eventName": "ObjectRestore:Completed",
    "userIdentity": {
      "principalId": "AmazonCustomer:A1JPP2WW1ZYN4F"
    },
    "requestParameters": {
      "sourceIPAddress": "s3.amazonaws.com"
    },
    "responseElements": {
      "x-amz-request-id": "439897F6741FD9BA",
      "x-amz-id-2": "MF0oW9le+g8K5/R/uUks1QuFbZxNuSmZDWQ5utu8ZTcHEKSGFHzdFBEtebICzrPtG3YL1YVmffxhRw4nDPTZ1w=="
    },
    "s3": {
      "s3SchemaVersion": "1.0",
      "configurationId": "BagCreatedNotification",
      "bucket": {
        "name": "glacier.staging.nlm.bdrc.org",
        "ownerIdentity": {
          "principalId": "A1JPP2WW1ZYN4F"
        },
        "arn": "arn:aws:s3:::glacier.staging.nlm.bdrc.org"
      },
      "object": {
        "key": "Archive0/00/W1NLM4700/W1NLM4700.bag.zip",
        "size": 17017201852,
        "eTag": "41654cbd2a8f2d3c0abc83444fde825b-2029",
        "sequencer": "00638792A45B638391"
      }
    },
    "glacierEventData": {
      "restoreEventData": {
        "lifecycleRestorationExpiryTime": "2024-04-12T00:00:00.000Z",
        "lifecycleRestoreStorageClass": "DEEP_ARCHIVE"
      }
    }
  }
]

size is "size": 17,017,201,852" 17GB

[2024-04-05, 20:30:09 EDT] {taskinstance.py:2513} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='***' AIRFLOW_CTX_DAG_ID='sqs_scheduled_dag' AIRFLOW_CTX_TASK_ID='download_from_messages' AIRFLOW_CTX_EXECUTION_DATE='2024-04-06T00:20:00+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-04-06T00:20:00+00:00'
[2024-04-05, 20:30:09 EDT] {logging_mixin.py:188} INFO - using secrets
[2024-04-05, 20:30:09 EDT] {logging_mixin.py:188} INFO - section='ap_northeast'   ['default', 'ap_northeast']
[2024-04-05, 20:34:22 EDT] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
...
                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/s3transfer/download.py", line 643, in _main
    fileobj.write(data)
  File "/home/airflow/.local/lib/python3.11/site-packages/s3transfer/utils.py", line 379, in write
    self._fileobj.write(data)
OSError: [Errno 28] No space left on device
[2024-04-05, 20:34:22 EDT] {taskinstance.py:1149} INFO - Marking task as FAILED. dag_id=sqs_scheduled_dag, task_id=download_from_messages, execution_date=20240406T002000, start_date=20240406T003009, end_date=20240406T003422
[2024-04-05, 20:34:22 EDT] {standard_task_runner.py:107} ERROR - Failed to execute job 259 for task download_from_messages ([Errno 28] No space left on device; 13305)
[2024-04-05, 20:34:22 EDT] {local_task_job_runner.py:234} INFO - Task exited with return code 1

Two possible approaches:

  1. Bind mount the output. This exposes the writing area to host systems. If we put this area on /mnt/AO-staging-Incoming we have an internal log of downloaded bag.zips that we can delete from outside the container.
  2. Use a shared volume, and have the docker procedure erase the bag.zip when it is complete (sketched below).
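A hedged sketch of approach 2: check free space against the object size before downloading, then erase the bag.zip once processing completes. The bucket and key come from an event like the one above; download_dir and process_bag() are assumptions:

# Sketch only: guard the download with a free-space check and clean up after.
import shutil
from pathlib import Path

import boto3

def download_and_clean(bucket: str, key: str,
                       download_dir: str = "/home/airflow/bdrc/data") -> None:
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    free = shutil.disk_usage(download_dir).free
    if size > free:
        raise OSError(f"need {size} bytes for {key}, only {free} free in {download_dir}")
    dest = Path(download_dir) / Path(key).name
    s3.download_file(bucket, key, str(dest))
    try:
        process_bag(dest)              # hypothetical debag/sync step
    finally:
        dest.unlink(missing_ok=True)   # erase the bag.zip when done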

Unify Dockerfile and docker-compose

Running Airflow in Docker provides a cookbook approach to Airflow. This issue modifies the current stack to use best practices from this guide.

The existing approach was taken from a textbook; this issue calls for switching to official Docker sources and approved methods.

Clean up contributory material.

The Dockerfile wants everything to be relative to a given working directory. Since the Dockerfile sits in the root of the airflow-docker directory, a lot of material winds up there.

Push all the local sources into staging and change airflow-docker/Dockerfile to reflect this.

Add airflow-docker/staging/* to .gitignore.

Create AWS settings repo

The ao-workflows repo began the process of documenting AWS settings for some SQS queues and bucket events as code. This material lives in the public ao-workflows repository (which needed to be public for its documentation).

AWS settings shouldn't be in a public repo; create a private repo for the settings and move them out of ao-workflows.

Handle multiple regions

There's one DAG to initiate the workflow, but it looks in an SQS queue in the us-east-1 region, which is the default. The actual FPL and NLM archives are in different regions (a per-region polling sketch follows the table):

source | zone
NLM    | ap-northeast-1
FPL    | ap-southeast-1
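A sketch (not the actual DAG) of polling one queue per source region instead of only us-east-1; the queue URLs and account id are placeholders:

# Sketch only: one SQS client per region listed in the table above.
import boto3

REGION_QUEUES = {
    "ap-northeast-1": "https://sqs.ap-northeast-1.amazonaws.com/123456789012/nlm-restore-events",
    "ap-southeast-1": "https://sqs.ap-southeast-1.amazonaws.com/123456789012/fpl-restore-events",
}

def poll_all_regions() -> list[dict]:
    messages = []
    for region, queue_url in REGION_QUEUES.items():
        sqs = boto3.client("sqs", region_name=region)
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages.extend(resp.get("Messages", []))
    return messages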

Don't succeed when aws download fails

A download task had this error:

[2024-03-14T15:41:21.312+0000] {logging_mixin.py:188} INFO - using secrets
[2024-03-14T15:41:21.313+0000] {logging_mixin.py:188} INFO - section='default'   ['default']
[2024-03-14T15:41:21.314+0000] {logging_mixin.py:188} INFO - KeyError: 'region_name'

Yet the task succeeded; it should have failed (see the sketch below).
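A sketch of the intended behaviour, assuming the task body is wrapped so configuration errors propagate; the function body is an assumption about the real download task, not its actual code:

# Sketch only: let a missing AWS config key fail the task instead of being
# logged and swallowed.
from airflow.exceptions import AirflowFailException

def download_from_messages(**context):
    try:
        ...  # build the boto3 session from the configured section and download
    except KeyError as err:
        # e.g. KeyError: 'region_name' when the config section is incomplete
        raise AirflowFailException(f"download misconfigured: missing {err}") from err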

Missing SQS notifications

I initiated glacier restore on a number of glacier.staging.nlm.bdrc.org works (W1NLM4700-5000, 5100-5900).
All the ones that existed restored successfully. Most of them sent SQS messages that the sqs_scheduled_dag picked up and synced.
A random subset (4500, 4600, 5200, 5600, 5700) restored successfully, but no message was sent, so the DAG didn't pick them up.

Find out why and how to recover, or develop another input path. (Airflow's dataset facility could be a data bridge between DAGs; see the sketch below.)
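A hedged sketch of that bridge using Airflow Datasets: a recovery DAG publishes the works it finds, and the sync DAG can be scheduled on the same Dataset (schedule=[restored_works] in the consumer), so works whose SQS message never arrived still enter the flow. The DAG id, dataset URI, and lookup step are assumptions:

# Sketch only: publish recovered work ids on a Dataset another DAG listens to.
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

restored_works = Dataset("bdrc://restored-works")  # placeholder URI

@dag(start_date=pendulum.datetime(2024, 4, 1), schedule=None, catchup=False)
def manual_restore_recovery():
    @task(outlets=[restored_works])
    def list_restored_without_messages() -> list[str]:
        # placeholder: query S3/Glacier for restored objects that never
        # produced an SQS event (e.g. W1NLM4500, W1NLM4600 above)
        return ["W1NLM4500", "W1NLM4600"]

    list_restored_without_messages()

manual_restore_recovery()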
