sul-dlss / lyber-core

SULAIR Robot Framework and Infrastructure Tools

License: MIT License

Ruby 100.00%

infrastructure gem

lyber-core's Introduction

lyber_core

Robot Creation

Create a class that subclasses LyberCore::Robot

In the initializer, call super with the workflow name and the step name.
Your class's #perform_work method performs the actual work; druid is available as an instance variable.

module Robots
  module DorRepo
    module Accession

      class Shelve < LyberCore::Robot

        def initialize
          super('accessionWF', 'shelve')
        end

        def perform_work
          cocina_object.shelve
        end

      end

    end
  end
end

By default, the workflow step for the druid is set to the completed state, but you can optionally have it set to skipped by returning a ReturnState object, as shown below. You can also return custom notes this way.

module Robots
  module DorRepo
    module Accession

      class Shelve < LyberCore::Robot

        def initialize
          super('accessionWF', 'shelve')
        end

        def perform_work
          if some_logic_here_to_determine_if_shelving_occurs
            cocina_object.shelve
            # Set the final state to completed. To pass a custom note back to workflow, use:
            # LyberCore::ReturnState.new(status: 'completed', note: 'some custom note to pass back to workflow')
            LyberCore::ReturnState.new(status: 'completed')
          else
            # Nothing was done, so set the final state to skipped. To pass a custom note back to workflow, use:
            # LyberCore::ReturnState.new(status: 'skipped', note: 'some custom note to pass back to workflow')
            LyberCore::ReturnState.new(status: 'skipped')
          end
        end

      end

    end
  end
end

Robot Environment Setup

Create a config/boot.rb containing:

require 'rubygems'
require 'bundler/setup'
Bundler.require(:default)

LyberCore::Boot.up(__dir__)

# Any additional robot-specific configuration.

The configuration must include:

redis_url: ~

workflow:
  url: http://workflow.example.com/workflow
  logfile: 'log/workflow_service.log'
  shift_age: 'weekly'
  timeout: 60
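
As a minimal sketch of checking that the required settings are present, the snippet below parses the YAML above and lists any missing keys. The `missing_settings` helper and `REQUIRED_WORKFLOW_KEYS` constant are hypothetical names used for illustration; they are not part of lyber-core, which may load and validate its configuration differently.

```ruby
require 'yaml'

# Hypothetical list of the workflow settings shown in the required configuration.
REQUIRED_WORKFLOW_KEYS = %w[url logfile shift_age timeout].freeze

# Hypothetical helper (not a lyber-core API): returns the names of required
# settings that are absent from a parsed configuration hash.
# A key whose value is nil (YAML "~") still counts as present.
def missing_settings(config)
  missing = []
  missing << 'redis_url' unless config.key?('redis_url')
  workflow = config['workflow'] || {}
  REQUIRED_WORKFLOW_KEYS.each do |key|
    missing << "workflow.#{key}" unless workflow.key?(key)
  end
  missing
end

config = YAML.safe_load(<<~SETTINGS)
  redis_url: ~
  workflow:
    url: http://workflow.example.com/workflow
    logfile: 'log/workflow_service.log'
    shift_age: 'weekly'
    timeout: 60
SETTINGS

puts missing_settings(config).inspect
```

With the required configuration shown above, the list of missing settings is empty.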

And optionally:

# For Dor Services Client
dor_services:
  url:  'https://dor-services-test.stanford.test'
  token: secret-token

# For Cocina::Models::Mapping::Purl
purl_url: 'https://purl-example.stanford.edu'

# For DruidTools::Druid
stacks:
  local_workspace_root: ~

The following environment variables can optionally be set:

ROBOT_ENVIRONMENT
ROBOT_LOG_LEVEL

Robot Testing

Include the following in spec/spec_helper.rb:

ENV['ROBOT_ENVIRONMENT'] = 'test'
require File.expand_path("#{__dir__}/../config/boot")

include LyberCore::Rspec

Robots can be invoked with:

test_perform(robot, druid)

to avoid the workflow updates in perform().

lyber-core's Issues

Update to dor-services 5.x

This also entails updating all consumers of lyber-core to use the 5.x-based release (if they actually hit the backend in any way).

item_queued? is returning wrong result

This happens when there is more than one version. For example:

curl https://sul-lyberservices-test.stanford.edu/workflow/dor/objects/druid:hx908xy6904/workflows/assemblyWF
<workflow repository="dor" objectId="druid:hx908xy6904" id="assemblyWF">
  <process version="1" priority="0" note="" lifecycle="pipelined" laneId="default" elapsed="" attempts="0" datetime="2019-01-28T20:40:18+00:00" status="completed" name="start-assembly"/>
  <process version="1" priority="0" note="" lifecycle="" laneId="default" elapsed="" attempts="0" datetime="2019-01-28T20:40:18+00:00" status="skipped" name="jp2-create"/>
  <process version="1" priority="0" note="sul-robots1-test.stanford.edu" lifecycle="" laneId="default" elapsed="0.25" attempts="0" datetime="2019-01-28T20:40:18+00:00" status="completed" name="checksum-compute"/>
  <process version="1" priority="0" note="sul-robots1-test.stanford.edu" lifecycle="" laneId="default" elapsed="0.306" attempts="0" datetime="2019-01-28T20:40:18+00:00" status="completed" name="exif-collect"/>
  <process version="1" priority="0" note="sul-robots2-test.stanford.edu" lifecycle="" laneId="default" elapsed="0.736" attempts="0" datetime="2019-01-28T20:40:18+00:00" status="completed" name="accessioning-initiate"/>
  <process version="2" priority="0" note="" lifecycle="" laneId="default" elapsed="" attempts="0" datetime="2019-01-29T22:51:09+00:00" status="completed" name="start-assembly"/>
  <process version="2" priority="0" note="contentMetadata.xml exists" lifecycle="" laneId="default" elapsed="0.278" attempts="0" datetime="2019-01-29T22:51:09+00:00" status="skipped" name="content-metadata-create"/>
  <process version="2" priority="0" note="" lifecycle="" laneId="default" elapsed="0.0" attempts="0" datetime="2019-01-29T22:51:09+00:00" status="queued" name="jp2-create"/>
  <process version="2" priority="0" note="" lifecycle="" laneId="default" elapsed="0.0" attempts="0" datetime="2019-01-29T22:51:09+00:00" status="queued" name="checksum-compute"/>
  <process version="2" priority="0" note="" lifecycle="" laneId="default" elapsed="0.0" attempts="0" datetime="2019-01-29T22:51:09+00:00" status="queued" name="exif-collect"/>
  <process version="2" priority="0" note="" lifecycle="" laneId="default" elapsed="0.0" attempts="0" datetime="2019-01-29T22:51:09+00:00" status="queued" name="accessioning-initiate"/>
</workflow>

Caused by: sul-dlss/dor-workflow-client#56
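
To illustrate what a version-aware check could look like, here is a minimal sketch that only considers the process row from the highest version of a step. It assumes the `<process>` elements have already been parsed into hashes; the `queued_in_latest_version?` helper and the hash shape are hypothetical and do not describe how dor-workflow-client actually implements or fixed this.

```ruby
# Hypothetical data shape: one hash per <process> element, trimmed to a few
# rows from the workflow XML above.
PROCESSES = [
  { name: 'start-assembly',   version: 1, status: 'completed' },
  { name: 'jp2-create',       version: 1, status: 'skipped' },
  { name: 'checksum-compute', version: 1, status: 'completed' },
  { name: 'start-assembly',   version: 2, status: 'completed' },
  { name: 'jp2-create',       version: 2, status: 'queued' },
  { name: 'checksum-compute', version: 2, status: 'queued' }
].freeze

# Hypothetical helper: true only when the newest version of the step is queued.
def queued_in_latest_version?(processes, step_name)
  latest = processes.select { |process| process[:name] == step_name }
                    .max_by { |process| process[:version] }
  !latest.nil? && latest[:status] == 'queued'
end

puts queued_in_latest_version?(PROCESSES, 'jp2-create')
puts queued_in_latest_version?(PROCESSES, 'start-assembly')
```

Looking only at the first matching row would report jp2-create as skipped, while the newest version is queued.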

fast robots cause malformed log output for time

For example: Finished druid:wn335hh4709 in 7.802e-05s
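
One way to avoid scientific notation is to format the elapsed time with a fixed number of decimal places. The sketch below uses Ruby's `Kernel#format`; the `format_elapsed` helper is a hypothetical name, and the four-decimal precision is an assumption chosen for illustration.

```ruby
# Hypothetical helper (not part of lyber-core): render an elapsed time in
# fixed-point notation so very small values do not print as 7.802e-05.
def format_elapsed(seconds)
  format('%.4fs', seconds)
end

puts "Finished druid:wn335hh4709 in #{format_elapsed(7.802e-05)}"
puts "Finished druid:wn335hh4709 in #{format_elapsed(0.1384)}"
```

With four decimal places, 7.802e-05 is rendered as 0.0001s.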

Update lyber-core to return workflow context

The lyber-core gem has a Workflow class that wraps workflow operations. We should add the new ability to set and retrieve workflow context values there.

support robots being able to set a skipped status

We want to provide support for a robot to set the skipped status. Currently, after the .perform method is done, lyber-core always sets the status to either completed or error.

The proposal is to keep completed as the default, so that all existing robots stay backwards compatible, but to allow a robot to return an optional value indicating the desired end state.

To implement, we could allow the .perform method to return a Results class/struct that has the status (skipped or completed) that the framework should use on successful completion.
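
The proposal could be sketched roughly as follows. `ResultSketch`, `ALLOWED_STATUSES`, and `final_status` are hypothetical names used only for illustration; they are not the names or the implementation the gem ended up with.

```ruby
# Hypothetical stand-in for the proposed results object.
ResultSketch = Struct.new(:status, :note, keyword_init: true)

# The two end states the proposal allows on successful completion.
ALLOWED_STATUSES = %w[completed skipped].freeze

# Hypothetical framework-side helper: decide the final status from whatever
# the robot's work method returned. Any other return value keeps the old
# behavior and is treated as completed.
def final_status(return_value)
  return 'completed' unless return_value.is_a?(ResultSketch)

  unless ALLOWED_STATUSES.include?(return_value.status)
    raise ArgumentError, "unknown status: #{return_value.status}"
  end

  return_value.status
end

puts final_status(nil)
puts final_status(ResultSketch.new(status: 'skipped', note: 'nothing to do'))
```

Existing robots that return nothing meaningful are unaffected, because only an explicit results object changes the outcome.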

See https://github.com/sul-dlss/lyber-core/blob/master/lib/lyber_core/robot.rb#L77

sometimes robot finished but status left as 'queued'

LyberCore::Robot module includes `initialize` method, not modular

A module should not include an initialize method. The point of a module is modularity, i.e. methods that could be included in different classes and different types of classes. But if a module provides initialize then it is specifying fundamental class-defining behavior AND cannot be used with other such modules without order dependency and a foreknowledge of interdependency. That isn't modular.

LyberCore::Robot module includes an initialize method here:
https://github.com/sul-dlss/lyber-core/blob/master/lib/lyber_core/robot.rb#L51

Seems like it would be less of an anti-pattern to just be a base class.
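
The order dependency described above can be shown with a small, self-contained sketch. All class and module names here (`Named`, `Counted`, `Widget`, `RobotBase`, `ShelveSketch`) are hypothetical and exist only for illustration.

```ruby
# Two hypothetical modules that each define initialize.
module Named
  def initialize(name)
    @name = name
  end
end

module Counted
  def initialize(count)
    @count = count
  end
end

class Widget
  include Named
  include Counted # included last, so its initialize is found first
  attr_reader :name, :count
end

widget = Widget.new(3)
puts widget.count.inspect # Counted#initialize ran
puts widget.name.inspect  # Named#initialize never ran

# The base-class alternative: one unambiguous initializer, reached via super.
class RobotBase
  attr_reader :workflow_name, :step_name

  def initialize(workflow_name, step_name)
    @workflow_name = workflow_name
    @step_name = step_name
  end
end

class ShelveSketch < RobotBase
  def initialize
    super('accessionWF', 'shelve')
  end
end

puts ShelveSketch.new.step_name
```

Swapping the two include lines in `Widget` silently changes which initializer runs, which is the interdependency the issue objects to.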

logger does not output name of robot

When you run a robot manually, or configure robots to log to stdout, the log doesn't report which robot is running, so you get something like:

 INFO [2014-12-03 15:41:29] (748)  :: bh152hk2665 processing
 INFO [2014-12-03 15:41:30] (748)  :: bh152hk2665 completed in 0.1384s

See https://github.com/sul-dlss/lyber-core/blob/master/lib/lyber_core/robot.rb#L74

But I'd like to know the name of the robot, like this:

 INFO [2014-12-03 15:41:29] (748)  :: robot-name :: bh152hk2665 processing
 INFO [2014-12-03 15:41:30] (748)  :: robot-name :: bh152hk2665 completed in 0.1384s
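
A minimal sketch of the requested format, using the standard library `Logger` with a custom formatter. The `build_robot_logger` helper is hypothetical, and the layout is inferred from the sample lines above rather than taken from lyber-core's logging code.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: build a logger whose lines include the robot name
# between the process id and the message.
def build_robot_logger(io, robot_name)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, datetime, _progname, msg|
    timestamp = datetime.strftime('%Y-%m-%d %H:%M:%S')
    format("%5s [%s] (%d)  :: %s :: %s\n", severity, timestamp, Process.pid, robot_name, msg)
  end
  logger
end

io = StringIO.new
logger = build_robot_logger(io, 'robot-name')
logger.info('bh152hk2665 processing')
puts io.string
```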

Update workflow client to use non-deprecated error update endpoint

See sul-dlss/dor-workflow-client#126

some exceptions not caught

Sometimes robot exceptions are not caught correctly by the work method in lyber-core's robot.rb, and they are sent to Resque to be put in the failed queue:

Worker
sul-robots1-prod.stanford.edu:2219 on DOR_ACCESSIONWF_PUBLISH_DEFAULT at 6 minutes ago
Class
 Robots::DorRepo::Accession::Publish
Arguments
--- druid:qf593jg6933
...
Exception
Dor::Describable::CrosswalkError
Error
Unknown descMetadata namespace: nil
/home/lyberadmin/common-accessioning/shared/bundle/ruby/1.9.1/gems/dor-services-4.13.0/lib/dor/models/describable.rb:41:in `generate_dublin_core'
/home/lyberadmin/common-accessioning/shared/bundle/ruby/1.9.1/gems/dor-services-4.13.0/lib/dor/models/publishable.rb:54:in `publish_metadata'
/home/lyberadmin/common-accessioning/releases/20140910190822/robots/accession/publish.rb:17:in `perform'
/home/lyberadmin/common-accessioning/shared/bundle/ruby/1.9.1/gems/lyber-core-3.2.4/lib/lyber_core/robot.rb:67:in `block in work'
/usr/local/rvm/rubies/ruby-1.9.3-p484/lib/ruby/1.9.1/benchmark.rb:295:in `realtime'
/home/lyberadmin/common-accessioning/shared/bundle/ruby/1.9.1/gems/lyber-core-3.2.4/lib/lyber_core/robot.rb:66:in `work'
/home/lyberadmin/common-accessioning/shared/bundle/ruby/1.9.1/gems/lyber-core-3.2.4/lib/lyber_core/robot.rb:20:in `perform'

when sidekiq process times out, lyber-core does not catch the error and report it to workflow service

If a Sidekiq process that manages workers is shut down gracefully, as happens when the timeout window for hot-swapping old processes for new ones is reached after a robot deployment, any work in progress when the job is killed errors in a way that lyber-core doesn't trap and report to the workflow service. The workflow step is left in the started state instead of being put in a failed state. The job does hit the retry queue, but when a robot picks the job up from Sidekiq, the workflow service says the job isn't queued, so the job doesn't run. Here's an example in common-accessioning from 4:09 pm this afternoon, when the 4-day timeout started by this week's dependency update deployment was reached: https://app.honeybadger.io/projects/52894/faults/95009475/01H5BAMR6YHW10C8777YYY6TEV

Item druid:sz929gx7593 is not queued for checksum-compute (assemblyWF), but has status of 'started'. Will skip processing

I think there are likely many other ways that we can encounter that error message, and this is one of the newer ones (since we've only implemented Sidekiq hotswap in the last few months).

See also this Slack thread where we were discussing the aforementioned checksum-compute job, and figured out why it disappeared without us noticing an error at first: https://stanfordlib.slack.com/archives/C09M7P91R/p1689373245001619
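
One possible mechanism, stated here as an assumption rather than a diagnosis: if the shutdown surfaces inside the job as an exception outside the StandardError hierarchy (for instance one derived from Interrupt), a plain `rescue StandardError` never sees it, so no error status is reported. The sketch below demonstrates that Ruby behavior with hypothetical names (`ShutdownSketch`, `run_step`, `run_step_reporting_everything`); it is not lyber-core's code.

```ruby
# Hypothetical stand-in for a shutdown signal raised inside a running job.
# It subclasses Interrupt, so it is NOT a StandardError.
class ShutdownSketch < Interrupt; end

# Reports only StandardError failures; a shutdown slips past unreported.
def run_step(reported)
  yield
  reported << 'completed'
rescue StandardError => e
  reported << "error: #{e.message}"
end

# Reports every failure, then re-raises so the process can still shut down.
def run_step_reporting_everything(reported)
  yield
  reported << 'completed'
rescue Exception => e # rubocop:disable Lint/RescueException
  reported << "error: #{e.class}"
  raise
end

unreported = []
begin
  run_step(unreported) { raise ShutdownSketch }
rescue ShutdownSketch
  # The shutdown propagated without any status being recorded.
end

reported = []
begin
  run_step_reporting_everything(reported) { raise ShutdownSketch }
rescue ShutdownSketch
  # The shutdown still propagates, but an error status was recorded first.
end

puts unreported.inspect
puts reported.inspect
```

In the first case nothing is recorded, which matches a step being left in the started state; in the second, an error is recorded before the exception continues upward.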

cc @andrewjbtw