
samvera-labs / hyrax-batch_ingest


Batch Ingest Plugin for Hyrax

License: Apache License 2.0

Ruby 87.33% JavaScript 0.60% CSS 0.56% HTML 11.17% SCSS 0.34%

hyrax-batch_ingest's Introduction

hyrax-batch_ingest

Batch Ingest Plugin for Hyrax

Installing

  1. Add the Hyrax Batch Ingest gem to your Gemfile

    # in your Gemfile
    gem 'hyrax-batch_ingest'
  2. Run bundle install

    # from your application's root directory...
    bundle install
    
  3. Run the Hyrax Batch Ingest installer

    # from your application's root directory...
    bin/rails generate hyrax:batch_ingest:install
    

    The installer does a few things:

    • Adds batch ingest routes:
      • /batches will list all batches.
      • /batches/[batch_id] will show details for a batch, including a list of all batch items.
      • /batches/[batch_id]/items/[batch_item_id] will show details for a single batch item within a batch.
    • Adds database migrations.
    • Includes Hyrax::BatchIngest::Ability in your application's Ability class at app/models/ability.rb.
  4. Run database migrations

    # from your application's root directory...
    bundle exec rails db:migrate
    

Configuration

By default, Hyrax Batch Ingest will try to load configuration from config/batch_ingest.yml.

You can tell Hyrax Batch Ingest to load a different configuration file at runtime like this:

# Inline syntax
Hyrax::BatchIngest.config.load_config('path/to/your_batch_ingest_config.yml')

# Block syntax, useful if you have additional configuration to set at runtime.
Hyrax::BatchIngest.configure do |config|
  config.load_config('path/to/your_batch_ingest_config.yml')
  # additional config...
end

Ingest Types

Each Batch Ingest has a specific type. The ingest type determines how the batch is read into the system and how it is mapped to the persistence layer (i.e. Fedora).

hyrax-batch_ingest's People

Contributors

afred, bkeese, cjcolvar, jasoncorum, mcwhitaker, phuongdh, yingfeng-iu


hyrax-batch_ingest's Issues

Dashboard

Determine needs and create mockups of a dashboard to report details of batches and filter to find specific batches.

Reporting

  • Information on validation errors (header not present, metadata field not found etc.)

Questions

  • Do uploaded files for batch need to be stored? (As in, storing the manifest file itself in the database indefinitely for future retrieval).

Interface

  • Batches are displayed in a table with each row corresponding to an individual batch
  • Each batch corresponds to an individual manifest file
  • The Job ID field is hyperlinked to a detail page for that batch
  • The detail page lists each child object created

depends on #14.

UI Upload large batch from remote location

Description:

When a user has a batch larger than 750MB, they are instructed to upload the batch to a remote location (S3, Box, Dropbox) and enter the remote location in the upload interface.

Done Looks Like:

  • User sees instructions on uploading large batch using remote location
  • Batch file is retrieved from the remote location that the user specifies.

Ability to cancel batch

Description:

As an ingester, I can cancel a batch I have initiated, so that I can save time if I realize a batch is failing hard

Does the cancel action stop where the batch is at, and items that have successfully completed are retained? Or should everything in the batch object, including items that were successful, be rolled back/deleted?

Done Looks Like:

  • Cancel batch button on the batch details page
  • Cancel batch button on the batch's line of the batch dashboard
  • The cancel batch button causes:
    - a request for confirmation from the user
    - the remaining items in the batch to become "cancelled"
    - the currently running item to continue running
    - a status check before each further batch item starts, to see whether it has been cancelled or is still enqueued; the item runs only if it is enqueued
    - the status of the remaining batch items to be set to cancelled
    - the batch status to be set to cancelled
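The cancel behaviour listed above can be sketched in plain Ruby. This is a minimal illustration, not the plugin's implementation: the CancellableBatch and CancellableItem structs, the status strings, and the method names are all hypothetical stand-ins for whatever the real models use.

```ruby
# Hypothetical in-memory stand-ins for the batch and batch item records.
CancellableItem = Struct.new(:id, :status)
CancellableBatch = Struct.new(:id, :status, :items)

# Mark the batch and every still-enqueued item as cancelled.
# Items that are running or already finished are left untouched.
def cancel_batch(batch)
  batch.items.each do |item|
    item.status = 'cancelled' if item.status == 'enqueued'
  end
  batch.status = 'cancelled'
  batch
end

# Guard called before starting each item: only enqueued items may run.
def run_item(item)
  return false unless item.status == 'enqueued'

  item.status = 'running'
  # ... the actual ingest work would happen here ...
  item.status = 'completed'
  true
end

batch = CancellableBatch.new(1, 'running', [
  CancellableItem.new(1, 'completed'),
  CancellableItem.new(2, 'running'),
  CancellableItem.new(3, 'enqueued')
])
cancel_batch(batch)
puts batch.items.map(&:status).inspect
```

The sketch retains completed items rather than rolling them back, which is only one of the two options raised in the description.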

Batch Item Payload meeting make a call!

Description

We have a final meeting on Batch payload, make a call, and write up issues. We spent a lot of time discussing what would go into the payload, then we discussed whether we need it.

Done looks like

  • Decide if payload is an optional cache (e.g. if source is volatile)
  • Decide if payload is really needed
  • If devs can't come to agreement, POs will ask for prototypes OR pluses and minuses and make call

Stubbing out Interfaces

Description

Groundwork for implementing the diagram developed: https://iu.app.box.com/file/320290778323

Done Looks Like

  • Stub out interface
  • Stub out tests for interface

TODO:

  • tests for routing
    • /batches (batch list)
    • /batches/[batch_id] (batch detail)
    • /batches/[batch_id]/[batch_item_id] (batch item details)

NU Batch Code Investigation and Move to hyrax-batch_ingest

Description

As we start to work on the batch ingestion plugin, investigate the work from NU that we can 'borrow'. Hopefully we will be able to take the work/concepts and make it more generic in nature.

Done looks like

  • investigate DONUT work
  • Document what we can borrow
  • create issues as necessary.

References

Core NU batch:

https://github.com/nulib/donut/tree/master/app/batch

There are also views and models within the application that should be investigated.

AMS - Submit Set of Zipped XML Files

Description

As an AMS user, I want to submit a batch containing a zipped file of xml files of pbcoreDescriptionDocuments and see it as a batch in the dashboard

Done Looks Like

  • After batch is uploaded, all batch items are represented in the dashboard with the status "enqueued"

Email Notification When Batch Job Begins

Description

As a submitter, I want to receive an email when a batch job has been read or errored before reader completes (see architecture diagram in #48).

Concerns

  • What are all of the messages that we might want to send, and in what situation does each get sent?
    - succeed
    - batch isn't initiated
    - batch errors during reading

Done Looks Like

  • When a batch job begins, the submitter receives an email from the system
  • The email contains text indicating the batch job has begun processing or the batch failed (to iterate on error messages if required).
  • The email contains a link to the dashboard detail view for the job

child of #54

Determine which fields are needed for 'batches' and 'batch_items' tables.

Background: Batch ingests are tracked in a relational database with two tables: batches, and batch_items, where a Batch has many BatchItems.

We need to figure out what the fields for those tables are.

Things to consider:

  • A BatchItem may represent more than one record in the persistence layer and/or more than one derivative binary file. So having a field for tracking the results of each batch item is necessary.

See Donut as a reference: https://github.com/nulib/donut/blob/master/db/schema.rb#L18-L36

Done when:

  1. Fields for the batches and batch_items tables have been determined.
  2. New ticket is created for actually creating tables in the db migration code.

depends on #12.

Create Rails Engine

  • Generate a Rails engine into existing repo
  • Remove default dummy test app
  • Install RSpec
  • Install engine_cart
  • Setup testing with engine_cart

Done looks like:

  • Passing test suite that uses Engine Cart
  • Tickets have been created for installing Hyrax into test app. (See #20)

Adding a name field to Batch

Description

In Avalon 6 batch spreadsheet, we have a name to refer to the manifest represented by the spreadsheet (it's the first column on the first row, while the second column of the first row is the submitter email address). This name field could be a useful reference to users when we display batch info on the dashboard (instead of or in addition to showing batch id). For AMP side, this could come from the UI input by users.

Configure which reader to use per SIP type

As a user submitting a SIP of a specific shape, the system knows what reader to use based on configuration.

Done Looks Like

  • Default SIP types can be configured in the application
  • SIPs can be specified by uploaders
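One way to picture this is a lookup table from SIP type to reader class. The sketch below is an assumption-laden illustration: the reader class names, the type keys, and the default are invented for the example and are not taken from the plugin.

```ruby
# Hypothetical reader classes; the plugin's real reader API is not shown here.
class SketchCsvReader; end
class SketchZippedXmlReader; end

# Assumed configuration shape: SIP type name => reader class.
READERS_BY_SIP_TYPE = {
  'avalon_csv' => SketchCsvReader,
  'ams_zipped_xml' => SketchZippedXmlReader
}.freeze

# Assumed application-level default, used when the uploader picks nothing.
DEFAULT_SIP_TYPE = 'avalon_csv'

# Return a reader for the uploader's chosen type, or for the default type.
def reader_for(sip_type = nil)
  key = sip_type || DEFAULT_SIP_TYPE
  reader_class = READERS_BY_SIP_TYPE.fetch(key) do
    raise ArgumentError, "No reader configured for SIP type #{key.inspect}"
  end
  reader_class.new
end

puts reader_for.class
puts reader_for('ams_zipped_xml').class
```

Failing loudly on an unknown type keeps a misconfigured upload from silently falling through to the wrong reader.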

Reporting on batch status

Determine information and metrics to report on a batch object, successful or errored. Possible information: current state, errors, # of objects created, total disk space used, ids of created objects, processing time, etc. Possibly similar information for each batch item.

Details

  • Each row of the dashboard corresponds to a single top level item (asset, AV work)
  • Each job gets a unique ID
  • Submitter Name
  • Original manifest filename
  • Date submitted
  • Status
    • Completed
    • Error
    • Pending
    • In Progress

depends on #14.

AMS xml mapper for Assets

Description

Reads batch item data from the database and persists the data correctly to Fedora.

Done Looks Like

  • Data in Fedora correctly
  • Model validation happens and errors are thrown accordingly

Allow optional cache for payload

Description

Payload (Source Data and Source Location) will have an optional cache. It will be configured via a configuration file. The API will allow it as a flag.

Done looks like

  • The source data field is populated
  • a batch can be executed with cached data

Determine Batch Format

Description

Fields in CSV format will correspond to values in a work type, allowing for dynamic adding of new fields as work type changes or other changes occur in the application. Note currently in NU batch, if a field exists in the model it should be accessible via batch.

Requirements

  • Fields are available for an existing item ID which would add the contents of a row to a pre-existing item in the system
  • Metadata fields are available for any work type

Done looks like

  • rules set up for how batch columns work and are named
  • repeatable field rules are documented.

Notes

Below are NU's current batch ingest instructions.

UI Upload batch size limit

Description

As an ingester, I can upload batches up to 750MB through the UI. If my file exceeds that size limit, I get a message telling me how to alternatively upload my batch.

A waiting time of up to 30 seconds or so is fine.

Done Looks Like:

  • User uploads a batch smaller than 750MB and it gets accepted.
  • User uploads a batch larger than 750MB and they get a message that gives them an alternative way to upload.
  • 750 MB is the default, and this value is able to be changed within the batch ingest yaml config file
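A rough sketch of the size check follows. The YAML key max_upload_size_mb is a hypothetical name, the limit is interpreted as 750 × 1024 × 1024 bytes, and a file exactly at the limit is accepted; none of these choices is specified in the issue.

```ruby
require 'yaml'

# Assumed interpretation of "750MB": 750 * 1024 * 1024 bytes.
DEFAULT_MAX_UPLOAD_BYTES = 750 * 1024 * 1024

# Read the limit from YAML text; 'max_upload_size_mb' is a hypothetical key.
def max_upload_bytes(config_yaml)
  config = YAML.safe_load(config_yaml) || {}
  megabytes = config['max_upload_size_mb']
  megabytes ? megabytes * 1024 * 1024 : DEFAULT_MAX_UPLOAD_BYTES
end

# Accept the upload, or return a message pointing at the alternative route.
def check_upload(size_bytes, limit_bytes)
  if size_bytes <= limit_bytes
    { accepted: true, message: nil }
  else
    { accepted: false,
      message: 'Batch exceeds the upload size limit. ' \
               'Upload it to a remote location and enter that location instead.' }
  end
end

limit = max_upload_bytes('max_upload_size_mb: 10')
puts check_upload(5 * 1024 * 1024, limit).inspect
puts check_upload(20 * 1024 * 1024, limit)[:message]
```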

Decide what BatchItem#payload needs to store

Description:
As a developer, I want a decision on what the BatchItem#payload needs to store.

Considerations:

  • What does the data need to be able to handle (how complicated might the data in here get?)
  • Does it need to handle nested objects?

Done Looks Like

  • A meeting scheduled to discuss and make some decisions regarding the payload
  • Any decisions made are stored in a shared location (e.g. wiki, github)

Note

Answers to this informed by ongoing development

Batch runner master issue

Description:

The batch runner is initiated when the batch starts and orchestrates the batch through its lifecycle.

Done looks like

  • Fires off jobs

  • Persists batches and batch items

  • Sends notifications

Submit batch by uploading through UI

Description:

As an AMS submitter, I can submit a batch from my local file system through the UI

Considerations:

  • The submitter will be a logged in user, so that will give the system the information about who's submitting, rather than it being in a SIP or Avalon-style manifest file.
  • The user sets additional parameters through drop down menu
  • The user uploads the file from local system

Done Looks Like:

  • Button on the batch list (dashboard) to submit a new batch
  • A modal appears when the button is selected, which presents users with:
    • User can select what type of SIP they are submitting (drop-down menu)
    • User can browse their file system and select a file to upload
    • User can set the Admin Set that the batch will be uploaded into (some users may only have access to one, but some will have access to many)
    • User can add other shared values for all records being ingested, as configured per application, per SIP (see comments for examples).
    • File type validation is performed against the file—if it does not match with the SIP type, the modal remains open and displays a message to the user that informs them of the correct file type to use
  • Configurable option to have the button appear or not (specified in application configuration file)

High level design

Description

We need a diagram that details the flow of the batch interface in a technical way, that includes decisions made.

Tasks

  • Dev meeting scheduled for high level design.
  • Create preliminary diagram to bring to meeting.
  • Dev meeting held for high level design.

Done Looks Like

  • A diagram of the system components and how data moves through them.
  • Diagram is placed in a shared location.
  • The team signs off on the diagram
  • Tickets for defining the APIs for each component (input and output).

Error Reporting / Error Handling

Description

Error reporting needs to be defined.

Questions

  • What is the level of granularity?

    • Errors are reported per batch item. Each element or attribute that is invalid is reported. It would be better if the value which was not expected was reported in addition to the raw line number in the file.
  • How are errors listed when files are being processed?

    • An item is completely read through for each of its values and all errors present are compiled. If a single object had multiple errors, for example, they would all be represented in the error message available to the user; processing would not stop on the first error and skip to the next batch item. This will allow users to identify each issue and correct them in full before resubmitting.
  • How to Treat Child Objects

    • If a child object contains errors and cannot be created, the parent item fails in full as well.
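The answers above suggest collecting every error for an item instead of stopping at the first one. A minimal sketch, assuming hypothetical required fields (title, date) and a hypothetical date format check purely for illustration:

```ruby
# Hypothetical required fields, used only for this illustration.
REQUIRED_FIELDS = %w[title date].freeze

# Validate one batch item and return every problem found,
# rather than stopping at the first error.
def item_errors(row_number, fields)
  errors = []
  REQUIRED_FIELDS.each do |name|
    value = fields[name]
    errors << "row #{row_number}: #{name} is missing" if value.nil? || value.strip.empty?
  end
  date = fields['date']
  if date && !date.strip.empty? && date !~ /\A\d{4}-\d{2}-\d{2}\z/
    # Report the unexpected value as well as the row number.
    errors << "row #{row_number}: date has unexpected value #{date.inspect}"
  end
  errors
end

# A parent fails in full if it, or any of its children, has errors.
def parent_failed?(parent_errors, children_errors)
  !parent_errors.empty? || children_errors.any? { |errs| !errs.empty? }
end

puts item_errors(2, 'title' => '', 'date' => 'yesterday')
```

Each message carries both the row number and the offending value, so a submitter can fix all problems in one pass before resubmitting.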

Dashboard - View List of Batch Objects

Description

As a user, I want to view a list of each batch job that I have submitted.

Done Looks Like

  • A page is available to users that displays each batch job they have submitted
  • This list is a table which includes the fields:
    • Status
    • Job ID
    • Adminset within which items will be/are created
    • Collection
    • Submitter
    • Filename
    • Submission Date
    • The Job ID field

Batch Model

Batch

Represents the batch object. Each batch contains one or more batch items.

Attributes

  • Identifier
  • Status
  • Error Message
  • Submitter email address
  • Original filename of submission file
  • AdminSet Name or Identifier
  • Collection Name/Identifier
  • Created_at timestamp
  • Updated_at timestamp
  • List of Batch Items (has_many relationship: one or more Batch Items in a Batch)

Batch Item

  • Identifier
  • Payload
  • ID of Batch (i.e. the parent of this Batch Item)
  • ID within batch (Row Number/XML Filename?)
  • AdminSet Name/Identifier
  • Collection Name/Identifier
  • Status
  • Error Message
  • Created_at timestamp
  • Updated_at timestamp

Batch Status

Below is the list of valid status states

  • Started (the batch or item is picked up/registered)
  • Initiated (the batch is processing)
  • Skipped (only occurs on rerun)
  • Complete (Contains Error(s))
  • Complete (Success)

Batch Item Status

Below is the list of valid status states

  • Started (the batch or item is picked up/registered)
  • Initiated (the batch is processing)
  • Skipped (only occurs on rerun)
  • Complete (Error)
  • Complete (Success)
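The two status lists differ only in how completion is reported, which suggests a batch's status could be derived from its items. The sketch below is one possible reading, not a documented rule: the status identifiers and the derivation logic are assumptions.

```ruby
# Item statuses from the list above, written as hypothetical identifiers.
ITEM_STATUSES = %w[started initiated skipped complete_error complete_success].freeze
FINISHED_ITEM_STATUSES = %w[skipped complete_error complete_success].freeze

# Derive a batch status from the statuses of its items (assumed rule).
def batch_status(item_statuses)
  unknown = item_statuses - ITEM_STATUSES
  raise ArgumentError, "unknown status: #{unknown.join(', ')}" unless unknown.empty?

  # Nothing has begun processing yet.
  return 'started' if item_statuses.all? { |status| status == 'started' }
  # At least one item is still waiting or processing.
  return 'initiated' unless item_statuses.all? { |status| FINISHED_ITEM_STATUSES.include?(status) }

  item_statuses.include?('complete_error') ? 'complete_with_errors' : 'complete_success'
end

puts batch_status(%w[complete_success complete_error skipped])
```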

Red/Green/Blue/Default (Mustard)

Batch Ingestion Phase 1

Description

At the end of .4, Avalon will have a minimal batch ingestion tool based on NU's hyrax-batch work. This tool will be functional as a first iteration, but features such as updates via csv, complex error checking, and advanced reporting will come later.

The batch tool will accept a CSV file with a meta-header row containing a submitter's email address, a header row which corresponds to the fields in Hyrax, and a row(s) for attached assets referencing files on a dropbox. A user will drop the assets and associated batch manifest into the dropbox. A cronjob (or AWS Lambda trigger) will kick off the batch ingestion process. The batch ingestion process will verify that the csv is properly formed or error. After initialization, it will iterate through each data row and attempt to ingest. If bad data exists in a row, the row will error. This will allow a user to update the row in the csv and re-ingest. Duplicates based on ??? will be skipped initially.
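The manifest layout described above (a meta-header row, a header row, then data rows) could be read roughly as follows. This sketch assumes the Avalon 6 convention of batch name in the first meta-header column and submitter email in the second, and numbers data rows from 1; the method name and result shape are hypothetical.

```ruby
require 'csv'

# Split a manifest into its meta-header, header, and data rows.
def parse_manifest(text)
  rows = CSV.parse(text)
  raise ArgumentError, 'manifest needs a meta-header row and a header row' if rows.size < 2

  meta, header, *data = rows
  items = data.each_with_index.map do |row, index|
    # Pair each header name with the value in the same column.
    # Note: a repeated header name keeps only its last value here.
    { row_number: index + 1, fields: header.zip(row).to_h }
  end
  { name: meta[0], submitter_email: meta[1], items: items }
end

manifest = <<~CSV
  Spring batch,submitter@example.edu
  Title,Date Issued
  First item,2018
  Second item,2019
CSV

result = parse_manifest(manifest)
puts result[:submitter_email]
puts result[:items].length
```

Repeatable fields would need a different pairing strategy than a plain hash, since duplicate column names collapse in this version.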

Done looks like

  • NU batch tool is investigated, key functionality is lifted as necessary
  • CSV format is determined. Ideally the header row only requires that a field exist in Hyrax
  • Reporting dashboard is created that includes (status of job, failure with error, success, and link to work)
  • Batch job verifies each row is valid; if a row is bad, the row is skipped and the dashboard shows a skipped row
  • Batch job ingests metadata into appropriate fields and kicks off derivative generation
  • Dashboard shows the status of the batch job and can be drilled into for insight into each row

Avalon Mapper

Description

Reads batch item data from the database and persists the data correctly to Fedora.

Done Looks Like

  • Data in Fedora correctly
  • Model validation happens and errors are thrown accordingly

BatchItem#payload Decision

Description

As a developer, I want a decision on what the BatchItem#payload needs to store.

Questions

  • What does the data need to do?
  • Are objects nested or flat?
  • What's in the payload?

Done Looks Like

  • Questions above are answered and decisions recorded in Box or a Github issue.

Email Notifications During Batch Processing

Description

As a user who is the submitter of a batch job, I want to receive emails when my submitted batch job has begun processing and when it has completed.

Done Looks Like

  • An email is sent to the submitter when a batch job has started
  • An email is sent to the submitter when a batch job has completed

depends on #14.
broken into #22 and #23

As a User I Can View Certain Batches

Description

Permissions need to be established.
Managers can view all batches of Admin Sets they manage.
Depositors can view only batches they created.

Done Looks Like

Manager of an Admin Set can see a batch for the Admin Set that someone else created.
Depositor of an Admin Set can see a batch they created, but cannot see a batch that someone else created.
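These rules can be expressed as a single predicate. The sketch below uses plain structs rather than the Hyrax::BatchIngest::Ability module; the struct fields and the method name are hypothetical, and the admin shortcut is borrowed from the separate requirement that administrators can view every batch.

```ruby
# Hypothetical stand-ins for a batch record and a signed-in user.
ViewableBatch = Struct.new(:id, :submitter, :admin_set)
Viewer = Struct.new(:name, :managed_admin_sets, :admin)

# A user may view a batch if they are an admin, submitted it themselves,
# or manage the Admin Set the batch belongs to.
def can_view_batch?(user, batch)
  return true if user.admin
  return true if batch.submitter == user.name

  user.managed_admin_sets.include?(batch.admin_set)
end

batch = ViewableBatch.new(1, 'dana', 'archives')
manager = Viewer.new('morgan', ['archives'], false)
depositor = Viewer.new('lee', [], false)

puts can_view_batch?(manager, batch)
puts can_view_batch?(depositor, batch)
```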

Batch runner fires off Jobs

Description:

For each batch item within a batch a background job is enqueued. Data is read in and validated. Records are stored and payload is optionally saved. The background job is kicked off.

Done looks like:

  • Data is read in and validated.
  • Records are stored and payload is optionally saved.
  • The background job is kicked off (enqueued)

child of #54

Dashboard - View Batch Object Detail

Description

As a user, I want to view details about a particular batch job. I will go to a page that lists all batches. When clicking on a batch, I will get a detail page about the batch object.

Done Looks Like

  • A detail page is available for each batch, in the same tabular format as the master batch job list. Column headers:

    • Status
    • Submission ID (of the batch item) - this will be row # in CSV submission or xml filename in a zipped xml submission.
    • Created Item (hyperlinked ID of the created item in Hyrax—value within the dashboard can be the object identifier rather than specifically the Fedora identifier)
  • This detail page gives a row for each batch item within the batch job (the top level asset)

  • Users can only view a batch detail page if they are the submitter or an admin

  • When the input is a CSV, Submission ID reflects the row number of the item within the CSV file, beginning at an index of 1

  • When a user clicks an Error value in the Status column, it displays the error returned by the application on why that item didn't complete. (see #13)

Question

  • What is the batch identifier used for the batch? (UUID like below and in Donut, Database ID?)

Implement the Avalon Reader

Description

Implement the reader interface for Avalon. As a result of this, we will have the common interface defined.

Submit batch from watched directory

Description

As a system, I want to pick up a batch for processing from a directory I am watching.

Done Looks Like

  • Write a rake task that looks in a specified place on disk and starts a new job if valid manifest files are found

  • A file deposited in a specific location will be picked up and processed by the application

  • This location is a URI—Avalon manifest files could be on a local filesystem, in S3, etc.

  • All subdirectories of the watched directory are searched for valid submission files

  • If a valid submission file is found, it is picked up and batch processing begins
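The recursive search could look something like the sketch below. It covers only a local filesystem (not S3 or other URIs), and it approximates "valid submission file" by file extension; both are simplifying assumptions, as are the method name and the default extension list.

```ruby
# Find candidate submission files anywhere under the watched directory.
# Assumption: validity is approximated by extension only.
def find_manifests(watched_dir, extensions: %w[.csv .zip])
  Dir.glob(File.join(watched_dir, '**', '*'))
     .select { |path| File.file?(path) && extensions.include?(File.extname(path).downcase) }
     .sort
end

# Example: list candidates under the current directory.
find_manifests('.').each { |path| puts path }
```

A rake task could call a method like this on each run and start a new batch for every path returned, after a fuller validity check of the manifest contents.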

Talk with Rob at Notch8 about their batch work

Description

We've done a lot of work on batch; let's make sure the other folks know what we're up to.

Done looks like

  • A meeting is scheduled with Avalon/WGBH team members and Rob/Notch8
  • Chattin' 'bout batch

Determine initial relational db schema

Background: we know that we're going to have a DB with representations of the batch having many batch_items.

Done when:

  • settle on terminology, and initial set of fields for each table.

Create Mailer

Description

Getting everything ready to go for any sent mail.

Done Looks Like

Avalon - Submit Avalon-style CSV

Description

As an Avalon user, I want to submit a batch to the system by dropping a CSV batch file and associated assets into a dropbox.

Done Looks Like

  • Batch files are picked up by the system and batch jobs created within the application
  • A batch manifest should be validated 
  • Each item in a batch should be ingested into the system or error with appropriate notifications (error reporting user story)

Monitoring batch process progression

Description:

On an item detail page for a batch, I will be able to see the status of items within a batch. At this phase we expect the user to do a page refresh to get updates (extra points for push notifications).

Done looks like

For each batch item in a batch, a status indication for:

  • initiated
  • working 
  • completed (with link to object created)
  • error (with error trapped)

Email Notification When Batch Job Completes

Description

As a user, I want to receive an email when a batch job that I have submitted is complete.

Done Looks Like

  • When a batch job completes, the submitter receives an email from the system
  • The email contains text about the batch job: whether it completed with no errors or completed but contained errors
  • The email contains a link to the dashboard detail view for the job

child of #54

Only administrators Are Able to View All Batch Jobs

Description

As a batch administrator, I want to view every batch system-wide from my batch job dashboard.

Done Looks Like

  • When an admin-level user navigates to the batch dashboard, they see every job created in the system
  • Other than viewing all batch jobs, the interface is the same as for non-admin users (i.e. different views for batch job list, batch job detail, and batch item detail)

Verification of Batch

As a system, I want to validate submitted manifest files against an agreed upon format and schema.

AMS - Submit CSV

Description

As an AMS user, I want to submit a batch containing a csv

Done Looks Like

Persist batches and batch items

Description

When a batch is completed, items will be saved. This depends on the output from the reader.

Done looks like

  • Records are saved in the database that represent the batch

child of #54
depends on #49
