cdrh / datura

Datura is a Ruby gem that manages data (TEI-XML, CSVs, VRA-XML, etc.) and populates Solr / Elasticsearch instances. Datura also generates HTML for these formats so the contents can be served on the web.

Datura

Welcome to the documentation for Datura, a gem dedicated to transforming and posting data sources from CDRH projects. This gem is intended to be used with a collection containing TEI, VRA, CSVs, and more.

Looking for information about how to post documents? Check out the documentation for posting.

Install / Set Up Data Repo

Check that Ruby is installed, preferably 2.7.x or up. If you are using RVM, see the RVM section below.

If your project already has a Gemfile, add the gem "datura" line. If not, create a new directory and add a file named Gemfile (no extension).

source "https://rubygems.org"

# fill in the latest available release for the tag
gem "datura", git: "https://github.com/CDRH/datura.git", tag: "v0.0.0"

If this is the first datura repository on your machine, install Saxon as a system-wide executable. See the Saxon setup documentation.

Then, in the directory with the Gemfile, run the following:

gem install bundler
bundle install

bundle exec setup

The last step should add files and some basic directories. Have a look at the setup instructions to learn how to add your files and start working with the data!

RVM

RVM, or the Ruby Version Manager, is a handy way to manage multiple Ruby and Rails versions. Install RVM using instructions on the site, then add the following to your new Rails application, making sure to change the values for the Ruby version and app name:

echo 'ruby-x.x.x' > (app_name)/.ruby-version
echo '(app_name)' > (app_name)/.ruby-gemset
cd (app_name)
bundle install

CDRH convention is to use the Datura version as the base for the gemset name, e.g.:

v0.1.5 = datura_015 in .ruby-gemset

Local Development

Add this to your collection's Gemfile:

source 'https://rubygems.org'

gemspec path: '/path/to/local/datura/repo'

Then in your repo you can run:

bundle install
# create the gem package if the above doesn't work
gem install --local path/to/local/datura/pkg/datura-0.x.x.gem

You will need to recreate your gem package for some changes you make in Datura. From the DATURA directory, NOT your data repo directory, run:

bundle exec rake install

Note: You may also need to delete your scripts/.xslt-datura folder if you are making changes to the default Datura scripts.

First Steps

Test it out by running the about command to view all your options (bundle exec may not be necessary, but it is recommended):

bundle exec about

To set up a brand new collection run:

bundle exec setup

Refer to the documentation to learn more about how to configure your collection and about each of the scripts.

Tests

bundle install
rake test

datura's Issues

remove newlines from keyword fields

I cannot think of any situations where keywords should have return characters in them, since we are using them as exact matches / faceting.

Related to issues in API: CDRH/api#82
CDRH/api#81

We have two options I see:
We could make this change across all keyword fields automatically and then if some project does need a newline in a keyword field, we can cross that bridge later with an override, etc. Alternatively, we could add a line to each keyword field that smushes the contents, but then any overrides of those methods would need to imitate the behavior, as well.
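
A minimal sketch of the first option, assuming keyword values arrive as strings or arrays of strings. The helper names `normalize_keyword` and `normalize_keywords` are hypothetical and not part of Datura:

```ruby
# Hypothetical helper: collapse any run of whitespace (newlines, tabs,
# repeated spaces) into a single space and trim both ends.
def normalize_keyword(value)
  return value unless value.is_a?(String)

  value.gsub(/\s+/, " ").strip
end

# Accepts a single value or an array and always returns an array.
def normalize_keywords(values)
  Array(values).map { |value| normalize_keyword(value) }
end

puts normalize_keyword("Cather,\n    Willa")
```

Applying something like this in one shared place would mean overrides of individual field methods do not have to imitate the behavior themselves.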

Create sample collection spanning projects

We need a collection which can be used to verify that no unexpected changes are being made to collections. We should not use the example collection for this, as its gitignore is set up to fit brand new collections.

Concept:
Take two or three files per project (preferably sampling tricky spots and file types) and put them in the amalgamate collection, then generate all their PRODUCTION materials: html, solr xml, and ES json. Commit to the data repo to track.

When altering files in the data scripts / config, a programmer (or a task / hook) will run the test suite and regenerate all the sample files to check for changes. If those changes make sense, then the programmer is responsible for making sure that all affected collections are updated on their production site (so as to mitigate any surprises when updates are run several months later after everybody has forgotten about these changes). If the changes don't make sense, the programmer is alerted to a problem with their changes, or the need to customize the collection's overrides.

Move threading number to config

50 is too many for some of the less powerful servers
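
One possible shape for this, assuming the collection config is YAML. The `threads` key and the fallback of 5 are assumptions made for illustration, not existing Datura settings:

```ruby
require "yaml"

# Assumed fallback for servers that do not set a value.
DEFAULT_THREADS = 5

# Read the thread count from a config hash, falling back to the default
# when the key is missing or not a positive number.
def thread_count(config)
  count = Integer(config.fetch("threads", DEFAULT_THREADS))
  count.positive? ? count : DEFAULT_THREADS
end

config = YAML.safe_load("threads: 10\n")
puts thread_count(config)
```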

Add ead as a file type

Create tests that can be run for specific project

Example:
use oscys xslt script to transform TEI and check fields in / out

review file_type and file_x files for URL paths

Switch from string interpolation to using .join method for things like "#{out_es}/#{self.filename(false)}.xml"
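
A small sketch of the change using `File.join`, with `out_es` and `filename` standing in for the values used in the real code:

```ruby
# File.join inserts the separator and avoids doubled slashes when a
# directory already ends with one.
def es_output_path(out_es, filename)
  File.join(out_es, "#{filename}.xml")
end

puts es_output_path("output/es", "oscys.case.0179.006")
puts es_output_path("output/es/", "oscys.case.0179.006")
```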

Changes we want to make to script

1: Make it so you can have a replacement script for different types of files, i.e. both tei and vra

2: Make it so you can have test index and production index on the same solr server

Check existing schema before posting to ES

Probably in datamanager, before pushing to index check if there is a schema for given _type and shut everything down if there's not.

Then, slightly weirder because it will happen every time (?), check the fields about to be sent to ES for any fields with no type and no _t setting and end the script, then tell the user that they either need to add that field to the schema, or they need to append a type to the field.

Think about how that last one will be implemented that won't require lots of calls or json checks, etc.
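
One way to keep the second check cheap is to fetch the schema's field names once per run and compare each document against that in-memory set. The sketch below assumes that approach. The suffix list and the method names are hypothetical, and the real set of type suffixes would come from the API schema:

```ruby
require "set"

# Hypothetical list of suffixes that declare a field's type.
TYPE_SUFFIXES = %w[_t _k _d _i].freeze

# Return the fields in a document that are neither in the schema nor
# carrying a recognized type suffix.
def untyped_fields(doc, schema_fields)
  known = schema_fields.to_set
  doc.keys.reject do |field|
    known.include?(field) || field.end_with?(*TYPE_SUFFIXES)
  end
end

# Stop before posting and tell the user how to fix the problem.
def check_fields!(doc, schema_fields)
  missing = untyped_fields(doc, schema_fields)
  return if missing.empty?

  raise ArgumentError,
        "Add these fields to the schema or append a type suffix: #{missing.join(', ')}"
end

doc = { "title" => "Letter", "notes_t" => "draft", "mystery" => "?" }
puts untyped_fields(doc, ["title"]).inspect
```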

Add back in "shortname" or some other variable

Here's the problem I am running into:

I need to change the XSL to build the media server URL's. Right now the media_base looks like this:

media_base: https://cdrhmedia.unl.edu

So in order to build the complete URL, I need to add

images/cather-complete-letters

but I don't know that I want to pull cather-complete-letters from the folder name because the actual repo name is "complete letters" and the folder name could be different on different servers.

As Jess and I talked about it, I realized that since the folder name will never be committed to the repo it's probably not the best basis for building path URL's. Jess proposed we add a new variable, "folder_name" which will be passed through for things that require the folder path, and then add "shortname" back in. (ugh, sorry)

I'm open to other names besides "shortname" though - namespace? something else?

People showing up twice

http://rosie.unl.edu/earlywashingtondc/people/per.000002

set up dev to post to production server

Attempting to post solr to production from dev server gets the following error:

Error posting to Solr for oscys.case.0179.006.xml: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
Reason: You're speaking plain HTTP to an SSL-enabled server port.<br />
 Instead use the HTTPS scheme to access this URL, please.<br />
</p>
</body></html>

Likely, we need to look into how we are sending the POST via the ruby library. Tagging @techgique in case you think that error looks like it's a server config issue, but I suspect it's the data repo's solr posting:

https://github.com/CDRH/data/blob/master/scripts/ruby/lib/solr_poster.rb#L55

Clean up documentation

We've got documentation all over:

in google docs
in wikis
on the U/commcdrh drive

At first, someone should just go through everything and identify what needs to be preserved, and then we can work on getting it all in one place/updating.

solr proximity / phrase searching

The cody archive api schema behaves differently than the original. If you search "pine ridge" the old site will do highlighting (and relevance) for both words whereas the new core will emphasize the first word only. On the api site you can put the search in quotation marks to zero in, but the phrase searching should be behaving differently than this.

This may be a problem for other api sites, also. Austen was not stemming "pie" a few weeks ago, as Andy was good enough to point out to us.

@karindalziel

nokogiri upgrade

nokogiri critical severity vulnerability, update to 1.8.1

CSV Sample es output is writing to .xml instead of .json

image_id -> preview_image_?

Single valued field with full path to the image that should be used for search results, document preview, etc.

Previously was simply wfc.img.1838.jpg (filename) which was then constructed in cocoon / rails

Rethink error handling in scripts

We probably should run a variety of tests, with a variety of broken things (html xsl, solr xsl, bad solr request, malformed solr xml, malformed tei) and think about the most useful thing to print to the screen in each situation.

Not a super high priority, but will get more important the more projects are in the new system.

Test vs live index?

I noticed that in the config file there is only a place for one solr index. Should there be two for testing vs production? Or perhaps an override to temporarily put something into another index.

Cody page image within a table

The pages images within a table need some styling. See Life and Adventures of Buffalo Bill wfc.bks00006.

HTML TEI Transform Issues for Karin (Or Brian?)

clean up xsl even further to utilize class creator template and generic TEI names
maybe fold rules like this b4f3f4b into the default name creation?
fold in enhanced rules from cather for generic class creation
comment heavily
create default CSS file, perhaps as part of a center wide API demo repo
- as an example, the span created above for centering would always have to be block level for centering to work, so that should be included in default TEI CSS

Change tei to solr general script to use slug params

https://github.com/CDRH/data/blob/master/scripts/xslt/cdrh_to_solr/solr_transform_tei.xsl#L19

Rather than finding the slug from the project path, we could use the slug parameter that is already being passed into the xslt:
https://github.com/CDRH/data/blob/master/config/config.example.yml#L37

I'm having trouble getting neihardt's uris to come out correctly in solr and I discovered this, though I'm not sure that it is currently contributing to my neihardt problem.

errors in tei_to_es

Errors are failing silently in this class

Common.convert_tags bug

convert_tags is behaving differently in different places. I discovered this when trying to select the personography texts for the example collection. I believe that it may be because sometimes an entire Nokogiri::Document is passed in and other times it's a Nokogiri::Element.

Decide if only one of these should be allowed in
Write more unit tests to make sure behavior is as expected
Make sure example personography texts working

Update logging calls to use block parameters

See http://edgeguides.rubyonrails.org/debugging_rails_applications.html#impact-of-logs-on-performance

Should be a quick change
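
For reference, this is how the block form behaves with Ruby's standard `Logger`. The counter exists only to show which blocks actually run:

```ruby
require "logger"

logger = Logger.new($stdout)
logger.level = Logger::INFO

messages_built = 0

# With a block, the message is only built when the level is enabled,
# so this DEBUG block is never called at INFO level.
logger.debug { messages_built += 1; "Transformed #{%w[a.xml b.xml].inspect}" }

# INFO is enabled, so this block runs and the message is written.
logger.info { messages_built += 1; "Posted all specified files" }

puts messages_built
```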

explore using a mixin module for XmlToEs instead of class

I think it would make sense, but I've not used mixins before. Look into the viability of doing it this way, instead of having multiple inherited classes to get to something like VraToPersonography.
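
A rough sketch of the mixin idea, using made-up class and method names (`TeiFile`, `VraPersonographyFile`, `to_es`) rather than Datura's actual classes. The shared behavior lives in the module, and each class only supplies what differs:

```ruby
# Shared behavior: build the hash sent to Elasticsearch from whatever
# the including class reports for id and title.
module XmlToEs
  def to_es
    { "id" => id, "title" => title }
  end
end

class TeiFile
  include XmlToEs
  attr_reader :id, :title

  def initialize(id, title)
    @id = id
    @title = title
  end
end

class VraPersonographyFile
  include XmlToEs
  attr_reader :id

  def initialize(id)
    @id = id
  end

  # A format-specific rule, with no chain of parent classes required.
  def title
    "Person #{id}"
  end
end

puts TeiFile.new("oscys.case.0001.001", "Petition").to_es.inspect
puts VraPersonographyFile.new("per.000002").to_es.inspect
```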

Version data with collections

Come up with a workflow / system to make sure that big changes to the data repo don't require changing ALL the collections immediately.

Brainstorming:

alter data repo to be an includable gem or similar per collection
use multiple data directory locations (this is complicated because of html / xml file locations)
hook to verify pulls to data repo on different servers (for example, whitman server whose data repo may not be updated as frequently)

Make URL's configurable?

Right now we hard code the paths to the html, images, etc, following the paths we designed

https://docs.google.com/document/d/1WGy2DutUZ-a0ygIfd80jq8eADIbGD5k7ppzSVqGJ7i8/edit#

https://github.com/CDRH/data/blob/master/scripts/ruby/lib/to_es/tei_to_es/fields.rb#L213

But if this is generalized, these paths should probably be configurable by environment in a private.yml file or elsewhere

Add VRA to the list of things which can be transformed

I added a new project to data, whitman-multimedia

Error running script, could use help troubleshooting

Tried to run: ./scripts/ruby/post_to_solr.rb whitman-multimedia -f vra -s -e test

Got the following error:

Should be using an alternate script
/var/www/html/data/scripts/ruby/lib/helpers.rb:30:in `get_xslt_path': undefined method `Exception' for main:Object (NoMethodError)
from ./scripts/ruby/post_to_solr.rb:39:in

cdrh_to_html not selecting attributes correctly?

When I was redoing some of the script for neihardt specific changes, I found that the notes template was not selecting attributes correctly in places like this: https://github.com/CDRH/data/blob/master/scripts/xslt/cdrh_tei_to_html/tei.p5.xsl#L763

<xsl:template match="note">
  <xsl:choose>
    <xsl:when test="@place='foot'">...</xsl:when>
    <xsl:when test="@type='editorial'"/>
    <xsl:otherwise>
      <!-- everything is ending up in the otherwise -->
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

I ended up having to put "current()" in the xpath:

<xsl:when test="current()/@type='bibliography'"/>

We should investigate a bit further to make sure that everything is working as expected in the main script.

Decide what media URL's should look like

Right now we have a media_base URL, which is typically used to build a URL like cdrhmedia.unl.edu/projectname/image/size/123.jpg

With the image server, we'll have URL's more like:

https://cdrhmedia.unl.edu/iiif/2/cather%2Fcat.let0019.001.jpg/full/!1000,1000/0/default.jpg

So we need to pass in the project name as well.

It could be that we just use the shortname to build that part of the URL, but I wanted to talk it through first. (Right now, the cather shortname is cather-complete-letters but I have named the IIIF folder "cather" - we'll need to standardize.)
We could also have a different variable for "iiif folder name" or something.

Here's an example of where I build the URL: https://github.com/Willa-Cather-Archive/complete-letters/blob/iiif/scripts/tei.p5.xsl#L849
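
To make the moving parts concrete, here is a sketch in Ruby that builds the IIIF URL shown above from a media base, a folder name, and a filename. The method name and its parameters are hypothetical, and the actual URL is built in XSLT:

```ruby
require "erb"

# The IIIF identifier is "folder/filename" with the slash percent-encoded,
# so the folder name has to be passed in alongside the media base.
def iiif_url(media_base, folder, filename, size: "!1000,1000")
  identifier = ERB::Util.url_encode("#{folder}/#{filename}")
  "#{media_base}/iiif/2/#{identifier}/full/#{size}/0/default.jpg"
end

puts iiif_url("https://cdrhmedia.unl.edu", "cather", "cat.let0019.001.jpg")
```

Whether `folder` comes from the shortname or from a separate "iiif folder name" setting is the open question. The sketch only shows that some such value is needed.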

collection vs _type

The API appears to be using collection instead of _type, and though that's not ideal at least for the meantime we should probably make sure that _type and collection always match.

It could be added during the es_post step of file_type, but then using the -o won't reflect all the contents that are actually being sent to ES. I think instead we might want to change how the various file types are using the transform and output step so that output happens more distinctly.

Cody search -- previous and next

@karindalziel when I search for Red Shirt on the Cody site, the previous and next links don't work. It does work to type in the page number and click GO.

es branch: clear solr seems not to work

When I tried to clear a solr index on the _es branch I got an error:

No hurry to fix, I can still clear using the old data repo

Feature request: Top level validation script

It would be really nice if we could run a script to validate TEI and HTML of a project in the data repo.

something like

ruby scripts/ruby/validate.rb family_letters -x tei

My thinking is that it would simply print a list of files that are not valid. Figuring out WHY they aren't valid is better handled in oxygen.

Potential to automatically namespace files with shortname or similar

Due to concerns about Whitman's non namespaced IDs potentially overwriting documents from other projects (https://github.com/whitmanarchive/whitman-issues/issues/27), @karindalziel has suggested that the data repository automatically add a project name (whitman, codyarchive, etc) to the front of IDs. The API would need to be aware of this and add them to requests / strip them from results so that orchid sites maintained the given IDs. Filenames shared across projects should not conflict because of how the data repositories and mediaserver have been set up, so the main concern would be two projects with the same filename pushing to the same elasticsearch index, which would favor whichever was most recently updated and functionally erase the previous.

This is not a "definitely implement" type of feature yet, but something that bears more thought and discussion.
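
For discussion purposes, the add / strip round trip could look something like this. The separator and the method names are assumptions, not an agreed design:

```ruby
# Hypothetical separator between the project name and the original ID.
ID_SEPARATOR = "."

# Prefix an ID with its project name before indexing.
def namespace_id(project, id)
  "#{project}#{ID_SEPARATOR}#{id}"
end

# Remove the prefix again before handing results back to a site.
def strip_namespace(project, namespaced_id)
  namespaced_id.delete_prefix("#{project}#{ID_SEPARATOR}")
end

indexed = namespace_id("whitman", "loc.00001")
puts indexed
puts strip_namespace("whitman", indexed)
```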

Ideas for multiple projects

This applies specifically to Whitman right now, but I could see using it for others.

Make the ability to have sub folders within projects, so instead of whitman-correspondence you could have a whitman folder with correspondence
since it's assumed those within a "project" will be writing to the same index, let the user set the index(es) for an entire project.
change clear script to refer to only the sub project - the best way to do this would probably be by using the slug property, but perhaps not. Maybe we need to add another field for subprojects.

create "new" repo that can be copied or generated

I think copying would be the easiest option

would have all the necessary empty directories (maybe omit solr since a dev will be setting those up?)
.gitignore ready to go

helpers.rb

Previously there had been helper methods loose in that file

Then I stuck ruby equivalents to the old "common" xsl code in Common in the helpers.rb file

Need to come up with better convention, whether it's two separate files, or one file with two modules (ew), or everything in single module

EAD file Docket number

Kaci, Erin was proofing the EAD file and we were wondering if I had put the Docket numbers in the correct place. Will you look at this and let us know what you think?

Check if ID will be overwritten by different project

I need to do a little testing to 100% confirm this is the case, but since we are using filenames as the IDs for documents in elasticsearch, I worry that two projects with the same ID for a file (particularly those which are not namespaced by project, see https://github.com/whitmanarchive/whitman-issues/issues/27) will overwrite each other silently.

@karindalziel suggests some kind of pre-indexing check, that would compare all the IDs that will be pushed against results already in the API index, and warn the user if there is already an ID by that name belonging to a different collection name.

Any other ideas?
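
A sketch of the pre-indexing check, assuming the IDs already in the index can be fetched once into a hash of ID to collection name. That lookup, and the method name, are hypothetical:

```ruby
# Return the IDs about to be posted that already exist in the index
# under a different collection.
def id_conflicts(ids, collection, existing)
  ids.select { |id| existing.key?(id) && existing[id] != collection }
end

existing = {
  "loc.00001" => "whitman",
  "wfc.bks00006" => "cody"
}

conflicts = id_conflicts(["loc.00001", "cat.let0019"], "cather", existing)
warn "IDs already used by another collection: #{conflicts.join(', ')}" unless conflicts.empty?
```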

Person's chart is incorrect

If you look at this person, you see 4 relationships:

http://rosie.unl.edu/earlywashingtondc/people/per.000010

But on their visualization, there are only two. http://rosie.unl.edu/earlywashingtondc/people/network/per.000010

Also, the person page links "owner of Shorter, Sally" and "Defendant against Shorter Ann", while the visualization page lists "owner of Shorter, Sally" and "childOf Shorter, Ann".

get_list params question

I'd like clarification on how params work for the get_text function described in tei_to_es.md:

get_text and get_list accept a few parameters.

xpath(s) : a string or array of strings
keep_tags : defaults to false, pass in true if you want to convert italics, bold, underline to HTML

get_text only:

delimiter : the separator between multiple items for get_text

get_text "/TEI/person", false, ","
#=> "Jadzia Dax, Geordi LaForge"

get_text @xpaths["people"], false, " &"
#=> "Jadzia Dax & Geordi LaForge"

The question is: do you have to pass in these params in a particular order, and do all always need to be present? So is there a way to pass through a param for delimiter and use the default for keep_tags?
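
As general Ruby background, optional positional parameters are filled left to right, so a later one cannot be set without also supplying the earlier ones. Keyword arguments lift that restriction. The two stand-ins below are not Datura's code. They take an array of strings instead of xpaths, and the join behavior is inferred from the examples above:

```ruby
# Positional defaults: to reach delimiter you must also pass keep_tags.
def get_text_positional(values, keep_tags = false, delimiter = ";")
  # keep_tags is ignored in this stand-in.
  values.join("#{delimiter} ")
end

# Keyword defaults: any option can be set on its own, in any order.
def get_text_keyword(values, keep_tags: false, delimiter: ";")
  # keep_tags is ignored in this stand-in.
  values.join("#{delimiter} ")
end

people = ["Jadzia Dax", "Geordi LaForge"]
puts get_text_positional(people, false, ",")
puts get_text_keyword(people, delimiter: " &")
```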

copy metadata (title) to text field

We have some items that don't have any "text" to add, but they do have metadata. It's probably good practice to at least add the title in as text to search. Possibly other metadata fields, needs thought.

Ambiguous rule match

Got the following error while trying to run OSCYS on a branch. Since that branch didn't look super involved, I have a bad feeling that we won't be able to run OSCYS at the moment. This error may affect more repositories, also:

There was an error transforming oscys.case.0036.007.xml: Recoverable error 
  XTRE0540: Ambiguous rule match for
  /TEI/text[1]/body[1]/div1[1]/div2[1]/table[4]/row[36]/cell[1]/hi[1]
Matches both "hi[@rend='subscript']" on line 586 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
and "hi[@rend]" on line 308 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
Recoverable error 
  XTRE0540: Ambiguous rule match for
  /TEI/text[1]/body[1]/div1[1]/div2[1]/table[4]/row[37]/cell[1]/hi[1]
Matches both "hi[@rend='subscript']" on line 586 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
and "hi[@rend]" on line 308 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
Recoverable error 
  XTRE0540: Ambiguous rule match for
  /TEI/text[1]/body[1]/div1[1]/div2[1]/table[4]/row[38]/cell[1]/hi[1]
Matches both "hi[@rend='subscript']" on line 586 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
and "hi[@rend]" on line 308 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
Recoverable error 
  XTRE0540: Ambiguous rule match for
  /TEI/text[1]/body[1]/div1[1]/div2[1]/table[4]/row[39]/cell[1]/hi[1]
Matches both "hi[@rend='subscript']" on line 586 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
and "hi[@rend]" on line 308 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
Recoverable error 
  XTRE0540: Ambiguous rule match for
  /TEI/text[1]/body[1]/div1[1]/div2[1]/table[7]/row[11]/cell[1]/hi[1]
Matches both "hi[@rend='subscript']" on line 586 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl
and "hi[@rend]" on line 308 of
  file:/var/local/www/data/scripts/xslt/cdrh_to_html/lib/html_formatting.xsl

Failed files are duplicated in logs

Sometimes failed files are duplicated in the logs:

I, [2015-07-28T08:05:45.245746 #62575] INFO -- : ===========================================
I, [2015-07-28T08:05:45.246080 #62575] INFO -- : Starting script at 2015-07-28 08:05:45 -0500
I, [2015-07-28T08:05:45.246147 #62575] INFO -- : Script running with following options: {:environment=>"test", :format=>nil, :solr_or_html=>nil, :commit=>true, :regex=>nil, :transform_only=>false, :update_time=>nil, :verbose=>true, :project=>"oscys"}
I, [2015-07-28T08:05:45.246192 #62575] INFO -- : Solr URL: http://rosie.unl.edu:8080/solr/api_oscys_test_alpha/update
E, [2015-07-28T08:14:27.004662 #62575] ERROR -- : Failed to transform following files for oscys: /var/www/html/data/projects/oscys/tei/oscys.case.0142.011.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0142.009.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0142.010.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0142.011.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0142.009.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0142.010.xml
#<Net::HTTPBadRequest:0x00000001dbd430>
/var/www/html/data/projects/oscys/tei/oscys.case.0306.002.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0292.001.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0306.002.xml
/var/www/html/data/projects/oscys/tei/oscys.case.0292.001.xml
I, [2015-07-28T08:14:27.004791 #62575] INFO -- : Posted all specified files for oscys successfully
I, [2015-07-28T08:14:27.841795 #62575] INFO -- : Script finished running at 2015-07-28 08:14:27 -0500
I, [2015-07-28T08:14:27.841911 #62575] INFO -- : Script completed in 00 hrs 08 mins 42 secs
I, [2015-07-28T08:32:46.361613 #6617] INFO -- : ===========================================
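
If the duplicates come from the same path being recorded more than once, collecting failures in a `Set` would be one simple fix. The class below is a hypothetical sketch, not the existing logging code:

```ruby
require "set"

# Records each failed path once, in the order first seen.
class FailureLog
  def initialize
    @paths = Set.new
  end

  def record(path)
    @paths.add(path)
  end

  def paths
    @paths.to_a
  end
end

failures = FailureLog.new
%w[
  oscys.case.0142.011.xml
  oscys.case.0142.009.xml
  oscys.case.0142.011.xml
].each { |path| failures.record(path) }

puts failures.paths
```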

Cody reindex

Even after I reindex, looks like the image is still not showing up for wfc.bks00010 -- "Buffalo Bill" from Prairie to Palace. Should be the same generic book image as for wfc.bks00018.

Move generated paths to config file

Right now, you set your media path in config and the scripts add our own convoluted paths to it. I'd like to move this path setting into the configs.

This will be especially helpful for running local environments, but it would also be quite useful for anyone who wants to use the repo but not our paths. We could also override paths for a particular project, which can be useful if images are stored elsewhere (luna, for example)

same as #103

(If this is easy, I'd prefer to do it in an upcoming sprint because it would really help my development process - if not, we should save for future grant/work)

image_thumb_loc, image_full_loc

In future, discuss adding list or lists of all the images affiliated with a given document. This way, users of the API will have a convenient way to access all of the images at once.

A complication of this is whether to include all images including pb / figures and not just document page scans, etc.

One file always fails, can't tell what it is in logs

When I run OSCYS after correcting all files, one file always fails, but I can't tell what it is. Here's the log:

I, [2015-07-28T08:32:46.361718 #6617] INFO -- : ===========================================
I, [2015-07-28T08:32:46.361778 #6617] INFO -- : Starting script at 2015-07-28 08:32:46 -0500
I, [2015-07-28T08:32:46.361844 #6617] INFO -- : Script running with following options: {:environment=>"test", :format=>nil, :solr_or_html=>nil, :commit=>true, :regex=>nil, :transform_only=>false, :update_time=>nil, :verbose=>true, :project=>"oscys"}
I, [2015-07-28T08:32:46.361884 #6617] INFO -- : Solr URL: http://rosie.unl.edu:8080/solr/api_oscys_test_alpha/update
E, [2015-07-28T08:41:29.375329 #6617] ERROR -- : Failed to transform following files for oscys: #<Net::HTTPBadRequest:0x00000001d0eb88>
I, [2015-07-28T08:41:29.375479 #6617] INFO -- : Posted all specified files for oscys successfully
I, [2015-07-28T08:41:30.107208 #6617] INFO -- : Script finished running at 2015-07-28 08:41:30 -0500
I, [2015-07-28T08:41:30.107283 #6617] INFO -- : Script completed in 00 hrs 08 mins 43 secs
I, [2015-07-28T08:44:35.756657 #33089] INFO -- : ===========================================