
buda-iiif-server's Introduction

Vagrant scripts for BUDA platform instantiation

The base platform is built using Vagrant and VirtualBox:

  1. Install Vagrant and VirtualBox.
  2. Download or git clone this repository.
  3. cd into the unzipped directory or cloned repository.
  4. Install the VirtualBox guest additions plugin: vagrant plugin install vagrant-vbguest
  5. Run vagrant up to summon a local instance.

Or for an AWS EC2 instance:

  1. Install the vbguest plugin: vagrant plugin install vagrant-vbguest
  2. Run vagrant up, or rename Vagrantfile.aws to Vagrantfile and run vagrant up --provider=aws

This will grind for a while, installing all the dependencies of the BUDA platform.

Once the initial install has completed, the command vagrant ssh will connect to the instance, where development, customization of the environment and so on can be performed as on any headless server.

The jena-fuseki server will be listening on:

http://localhost:13180/fuseki

The lds-pdi application is accessible at:

http://localhost:13280/

(see https://github.com/buda-base/lds-pdi/blob/master/README.md for details about using these REST services)

The command vagrant halt will shut the instance down. After halting (or suspending) the instance, a further vagrant up will simply boot the instance without further downloads, and vagrant destroy will completely remove the instance.

If running an AWS instance, after provisioning, access the instance via ssh -p 15345, remove Port 22 from /etc/ssh/sshd_config, and run sudo systemctl restart sshd. This will further secure the instance against attacks on port 22.


buda-iiif-server's Issues

reducing size of output jpg

Context: on S3, the tif corresponding to this image is below 30KB, but the output jpg on the iiif server is 461KB.

On the current website (using JAI), the corresponding image (here) is a png of about 30KB.

The png version is only 53KB, much more reasonable but still significantly more than the current website.

It seems hymir just uses the basic javax.imageio functions (see here) as provided by twelvemonkeys. The parameters that we can use in JPEGImageWriteParam look very limited. There doesn't seem to be a much better option in Java though.

This is an important issue for various reasons:

  • bigger files take longer to load
  • they cost us more to transfer to the user (we're paying for the bandwidth)
  • a factor of 10 in size is just completely unreasonable and probably indicates some deeper problems

Here are a few ideas to start dealing with the issue:

  • first, let's bring @TBRC-JimK in: Jim, you'll be developing some expertise in image processing in Java for the asset manager, so maybe we should share our docs, techniques, libraries, code, etc.?
  • a first easy action would be to tweak the Java jpg encoding quality values with the method suggested here (see the sketch after this list), 90% is probably sufficient
  • we should log the decoders/encoders used by javax.imageio to make sure the correct ones are used (and probably also make sure we understand what the correct ones are)
  • we can also run some experiments to understand why a png produced with JAI is half the size of a png produced with imageio, and report bugs or tweak the configuration if needed
  • (I'm not sure my diagnosis is right here) then we should understand why the output jpg is full color while the original tif is black and white. This will require some diving into the Java APIs and the internal image representation in Java
  • then we should understand whether, in these cases, we can tell the iiif viewer to prefer png over jpg. This will also require some diving, this time into the iiif APIs (that's probably a job for me)
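
A minimal sketch of the two easy actions above (logging which jpeg writer javax.imageio actually picks up, and encoding with an explicit quality instead of the writer's default); the 0.9 value and the helper names are just examples:

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JpegEncoding {

    private static final Logger log = LoggerFactory.getLogger(JpegEncoding.class);

    // Log every registered jpeg writer, to check which one (twelvemonkeys,
    // turbojpeg...) is actually used at runtime.
    public static void logJpegWriters() {
        Iterator<ImageWriter> it = ImageIO.getImageWritersByFormatName("jpeg");
        while (it.hasNext()) {
            log.info("registered jpeg writer: {}", it.next().getClass().getName());
        }
    }

    // Encode with an explicit quality (e.g. 0.9f) instead of the writer's default.
    public static void writeJpeg(BufferedImage img, OutputStream out, float quality)
            throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(out)) {
            writer.setOutput(ios);
            writer.write(null, new IIOImage(img, null, null), param);
        } finally {
            writer.dispose();
        }
    }
}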

logback configuration

There should be a system to configure logback so that when executing the jar we can override properties for the path to the log files, the log level, etc.
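
A minimal sketch of one way to do this, reloading logback from a file given as a system property at startup; the property name iiifserv.logConfig is a made-up example:

import java.io.File;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.joran.JoranConfigurator;
import ch.qos.logback.core.joran.spi.JoranException;
import org.slf4j.LoggerFactory;

public class LogConfig {

    // Call early in main(): java -Diiifserv.logConfig=/etc/buda/logback.xml -jar server.jar
    public static void applyExternalConfig() throws JoranException {
        String path = System.getProperty("iiifserv.logConfig");
        if (path == null) {
            return; // keep the logback.xml bundled in the jar
        }
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        JoranConfigurator configurator = new JoranConfigurator();
        configurator.setContext(context);
        context.reset(); // drop the configuration loaded from the classpath
        configurator.doConfigure(new File(path));
    }
}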

serving static images

In order to serve the error images, we should be able to serve an image in

src/main/resources/static/abc.def

when requesting the id

http://iiif.bdrc.io/static::abc.def/
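
A minimal sketch of what this could look like, assuming a Spring MVC controller (hymir is Spring Boot based); the mapping syntax and media type handling are illustrative only, and real code would need to sanitize the filename:

import org.springframework.core.io.ClassPathResource;
import org.springframework.core.io.Resource;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class StaticImageController {

    // Serves src/main/resources/static/abc.def for /static::abc.def/
    @GetMapping("/static::{filename}/")
    public ResponseEntity<Resource> serveStatic(@PathVariable String filename) {
        // real code should reject path separators etc. in filename
        Resource res = new ClassPathResource("static/" + filename);
        if (!res.exists()) {
            return ResponseEntity.status(HttpStatus.NOT_FOUND).build();
        }
        MediaType type = MediaType.APPLICATION_OCTET_STREAM;
        if (filename.endsWith(".png")) {
            type = MediaType.IMAGE_PNG;
        } else if (filename.endsWith(".jpg") || filename.endsWith(".jpeg")) {
            type = MediaType.IMAGE_JPEG;
        }
        return ResponseEntity.ok().contentType(type).body(res);
    }
}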

caching images

I'm a bit surprised, as I can't seem to find the code that handles the caching of S3 images... are the images cached at all in the normal context (i.e., not when building an archive)? If not, adding a cache should make a significant difference, especially since the same image is requested several times by the viewer at different sizes (thumbnail, big thumbnail and full).
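
If there is indeed no cache, here is a minimal sketch of the idea, a bounded in-memory LRU keyed by the S3 identifier; the class and method names are made up, and a real implementation would probably use a proper cache library and also cap the total byte size:

import java.util.LinkedHashMap;
import java.util.Map;

// Very small LRU cache for raw image bytes fetched from S3.
public class S3ImageCache {

    private final int maxEntries;
    private final Map<String, byte[]> cache;

    public S3ImageCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // access-ordered map: least recently used entries are evicted first
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > S3ImageCache.this.maxEntries;
            }
        };
    }

    public synchronized byte[] get(String s3Key) {
        return cache.get(s3Key);
    }

    public synchronized void put(String s3Key, byte[] data) {
        cache.put(s3Key, data);
    }
}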

public cache claim for open access images

For all the images that don't require authentication to be accessed (including the first and final 20 pages of a FairUse work), the cache control mechanism should advertise the image as public (see rfc7234). I think images restricted in China could have that too...

Fair use

The iiif server pdfs of Fair_use works always return 41 pages regardless of the user profile. In other words, an admin user has no way to access the entire work.

remove the /image/v2 prefix

I have something in mind, and I'm not sure if it's reasonable or doable (or in contradiction with the iiif spec), but basically it's the following: having permanent identifiers for images. This implies that an image is not dependent on the iiif version, and we need an iiif-server-independent way to refer to it, and thus we need to completely control the URL of the images in the image server. Hence the request. It's quite obvious for annotations: when someone annotates an image provided through iiif v2, we don't want their annotation to stop working when we move to iiif v3.

new identifier format?

In the same vein as #6, but in a less important way, maybe it would be nice to change the identifier format a little bit, so that it can be prefixed. For instance

<http://iiif.bdrc.io/image/v2/bdr:V00KG0545_I1KG20698::I1KG206980007.tif>

could be

@prefix bdi: <http://iiif.bdrc.io/image/v2/> .

bdi:bdr:V00KG0545_I1KG20698::I1KG206980007.tif

but although having :s in the local names seems to be allowed by the spec, I'm afraid this could confuse both poorly written libraries and humans, so maybe we should replace : with something else in the identifiers... maybe ,?

add metadata in the PDF export

Some metadata should be added to the exported PDF, such as:

  • the prefLabel of the work in Unicode
  • a reference to BDRC with the URI
  • a license indication (should be available in the admin data of the item)
  • a URI indication

the reference to iText should be removed
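
A minimal sketch of setting such metadata, assuming the export moves from iText to Apache PDFBox (the replacement library is not decided in this issue); the field values and helper name are placeholders:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public class PdfMetadata {

    // Adds document-information entries to an exported PDF.
    public static void addBdrcMetadata(PDDocument doc, String prefLabel,
            String workUri, String license) {
        PDDocumentInformation info = doc.getDocumentInformation();
        info.setTitle(prefLabel);                        // prefLabel of the work, in Unicode
        info.setAuthor("Buddhist Digital Resource Center");
        info.setSubject(workUri);                        // URI of the work on BDRC
        info.setCustomMetadataValue("License", license); // from the admin data of the item
        info.setCustomMetadataValue("Source", workUri);
    }
}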

implement country restriction

We have basically two options:

  • use Auth0 (see here)
  • implement it on the iiif server itself (not reading it from the auth token), which is more costly but more secure (through this method for instance; see the sketch after this list)
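
As a rough illustration of the second option, a sketch of a country check against a MaxMind GeoIP2 database; whether that matches the method linked above is an assumption, as are the class name and the policy of blocking when the lookup fails:

import java.io.File;
import java.io.IOException;
import java.net.InetAddress;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;

public class CountryRestriction {

    private final DatabaseReader reader;

    public CountryRestriction(File geoLiteCountryDb) throws IOException {
        // e.g. GeoLite2-Country.mmdb, path to be made configurable
        this.reader = new DatabaseReader.Builder(geoLiteCountryDb).build();
    }

    // Returns true if an image restricted in China can be served to this client IP.
    public boolean allowedForChinaRestricted(String clientIp) {
        try {
            String iso = reader.country(InetAddress.getByName(clientIp))
                    .getCountry().getIsoCode();
            return !"CN".equals(iso);
        } catch (IOException | GeoIp2Exception e) {
            return false; // unknown IPs: err on the safe side
        }
    }
}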

perf counters

The feedback I have on the iiif system is that it's very slow, and I tend to agree. Some aspects are not really due to the server, but I think some definitely are. It would be good to log some timings at the debug log level in order to get information about bottlenecks (see the sketch after this list). Basically, for each big operation there should be a log line with some size and timing information. A non-exhaustive list of operations:

  • getting results from ldspdi
  • downloading an image from S3
  • checking if an image is in the cache
  • getting from the cache
  • writing in the cache
  • converting an image to the desired format
  • cutting or resizing the image to fit the request (might not be distinguishable from the previous one)
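
A minimal sketch of the kind of timing helper and debug log this implies; the logger name, the message format and the s3.download call in the usage comment are made up:

import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PerfTimer {

    private static final Logger log = LoggerFactory.getLogger("io.bdrc.iiif.perf");

    // Wraps one of the big operations (S3 download, cache lookup, conversion...)
    // and logs how long it took at debug level.
    public static <T> T timed(String operation, String identifier, Callable<T> op) throws Exception {
        long start = System.nanoTime();
        T result = op.call();
        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        log.debug("{} for {} took {} ms", operation, identifier, ms);
        return result;
    }
}

// usage (byte counts would be logged by the caller, which knows the sizes):
// byte[] data = PerfTimer.timed("S3 download", s3Key, () -> s3.download(s3Key));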

white image bug when treating some images

Some grayscale images come out wrong when a modified version is produced. For instance

http://iiif.bdrc.io/bdr:V1PD127393_I1PD127464::I1PD1274640004.jpg/full/max/0/default.jpg

is fine while

http://iiif.bdrc.io/bdr:V1PD127393_I1PD127464::I1PD1274640004.jpg/full/max/90/default.jpg

is almost entirely white. Note that in the first case the original .jpg is accessed directly, so it doesn't go through the image saving pipeline with turbojpeg. So there may be another bug in the way turbojpeg is handled. We should verify that, and if the bug is indeed in the turbojpeg plugin, report a bug here.

Expired download links

There should be a way to manage the expiration of pdf/zip links (typical use case: a user starts a pdf generation, leaves, and comes back a few hours later).

exif metadata of images

It would be helpful if hymir added some exif metadata to the served images (when the format allows it). Something rather simple like the source of the download would be good enough. Does hymir allow that? If not, we should open an issue about it.

use IIIF auth API

Once the auth is done, the auth service should be advertised in the images that some users can't access. It seems it's mostly a matter of adding the auth service to the info.json, see this example.
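
As a rough illustration (assuming the IIIF Auth API 1.0; the auth endpoints under iiif.bdrc.io and the label are made up), the info.json of a restricted image would gain a service block along these lines, with a login service and a nested token service:

"service": {
    "@context": "http://iiif.io/api/auth/1/context.json",
    "@id": "https://iiif.bdrc.io/auth/login",
    "profile": "http://iiif.io/api/auth/1/login",
    "label": "Login to BDRC",
    "service": [
      {
        "@id": "https://iiif.bdrc.io/auth/token",
        "profile": "http://iiif.io/api/auth/1/token"
      }
    ]
}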

error 500 fetching first image of volume

serving rdf for images?

With the goal of having permanent identifiers for images, it seems reasonable that, if images are entities, an http request should provide some RDF in some serialization... This RDF would be different from the info.json (which is of type ImageService3 in iiif v3.0), but more similar to the information used in the presentation api, which in JSON-LD is (in the spec example):

{
    "id": "https://example.org/iiif/book1/res/page1.jpg",
    "type": "Image",
    "label": {"en": ["Page 1"], "es": ["Página 1"]},
    "format": "image/jpeg",
    "service": [
      {
        "id": "https://example.org/images/book1-page1",
        "type": "ImageService3",
        "profile": "level2"
      }
    ],
    "height": 2000,
    "width": 1500
}

BDRCImageService

Write a custom implementation of the enrichInfo method (info.json building).

cleanup of geolocation

I've changed a few things in Geolocation in presentation (see this commit); I think a similar set of changes could be applied here. No emergency though.

README update

The readme should be updated with the new instructions to run the server locally

use ebooks when present and relevant

In some cases (full volume downloads), it may be relevant to use the ebook if it's present on S3. Not all are generated, but when they are, they seem to follow this URL convention:

s3://archive.tbrc.org/Works/{md5}/{workid}/eBooks/{workid}-{volumenumber}.pdf

with the volume number padded to 3 digits. Example:

s3://archive.tbrc.org/Works/60/W22084/eBooks/W22084-001.pdf
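
A minimal sketch of building such a key from the convention above (the class, method and parameter names are made up):

public class EbookKeys {

    // Builds the S3 key for a pregenerated volume ebook, e.g.
    // Works/60/W22084/eBooks/W22084-001.pdf
    public static String ebookKey(String md5Prefix, String workId, int volumeNumber) {
        return String.format("Works/%s/%s/eBooks/%s-%03d.pdf",
                md5Prefix, workId, workId, volumeNumber);
    }
}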

It's not really any kind of priority, but I thought I'd report this to raise some awareness of the existence of these files. The main difference with the normal PDFs is that they contain bookmarks, a table of contents, a cover and a copyright notice. The ones on S3 are quite old, so it may be a bad idea to use them; maybe the new ones would make more sense... (but they're not on S3 yet?).

pdf file naming problem (double extension)

When on the download page (ex: here) and clicking on the link, the file that is saved by my browsers (both Firefox and Chrome) has a double .pdf extension (it's named bdr_V1PD96945_I1PD96947FAIR_USE_1-662.pdf.pdf); there is certainly a way to fix that... Also, I'm not sure it has much impact on the experience of most users, but the <a> could have a type="application/pdf" attribute.
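
One hedged way to make sure the suggested filename carries a single .pdf extension, assuming the name is (or will be) set through a Content-Disposition header; how the current name is actually produced hasn't been verified, and the helper name is illustrative:

import javax.servlet.http.HttpServletResponse;

public class PdfDownload {

    // Makes sure the suggested filename ends in exactly one ".pdf".
    public static void setPdfDisposition(HttpServletResponse response, String baseName) {
        String name = baseName.endsWith(".pdf") ? baseName : baseName + ".pdf";
        response.setContentType("application/pdf");
        response.setHeader("Content-Disposition", "attachment; filename=\"" + name + "\"");
    }
}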

pdf generation failures

  • cannot access the volume listing page for bdr:W22084 (link)

  • can access the volume listing page for bdr:W12827 (link) but then not the individual volume pdf download page, e.g. Volume 9 (link)

  • can access the individual volume pdf download page, e.g. Volume 10 (link), only after the corresponding zip volume has been requested (link)

client cache headers

I don't know if it's relevant in all hymir use cases, but it definitely is in ours: we want to instruct web browsers to cache images and info.json files. Currently there are no http cache instructions; there should be at least a configurable max-age. In our case it can be very large, as the images more or less never change.
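
A minimal sketch of what this could look like on our side, assuming a servlet filter placed in front of the image and info.json routes; the one-year max-age is just an example value, and public matches the open-access case discussed in an earlier issue:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Adds Cache-Control to image and info.json responses so that browsers keep them.
public class ImageCacheControlFilter implements Filter {

    private static final long MAX_AGE_SECONDS = 31536000L; // one year, to be made configurable

    @Override
    public void init(FilterConfig config) {
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        // "public" applies to open-access images; it would be dropped for restricted ones
        response.setHeader("Cache-Control", "public, max-age=" + MAX_AGE_SECONDS);
        chain.doFilter(req, res);
    }

    @Override
    public void destroy() {
    }
}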

cookie mechanism for images

If we want to support the inclusion of protected images through simple <img> tags, we need to support passing the token in an alternate way, as in this case it cannot be passed in a header (see this question and this one). So we could probably use cookies for this case. We don't have to support that now, but we might in the future, so it's important that we're cautious about the assumptions we make, including in the bdrc-auth-lib library (which should support this kind of case).
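
A rough sketch of the fallback this implies; the cookie name and the way bdrc-auth-lib would consume the token are assumptions:

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

public class TokenLookup {

    // Prefer the Authorization header, fall back to a cookie for plain <img> tags.
    public static String findToken(HttpServletRequest request) {
        String header = request.getHeader("Authorization");
        if (header != null && header.startsWith("Bearer ")) {
            return header.substring("Bearer ".length());
        }
        Cookie[] cookies = request.getCookies();
        if (cookies != null) {
            for (Cookie c : cookies) {
                if ("bdrc-iiif-token".equals(c.getName())) { // hypothetical cookie name
                    return c.getValue();
                }
            }
        }
        return null;
    }
}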

use mozjpeg

mozjpeg delivers jpgs that are quite a bit smaller than regular jpgs (it optimizes the encoding, especially for high contrast). It should be a drop-in replacement for libjpeg-turbo (the lib has the same name), so it's theoretically just a matter of installing mozjpeg instead of libjpeg-turbo... There is a script to do that under Debian that we could use in buda-base:

https://gist.github.com/Kelfitas/f3fb99984698ccd79414c6a29e9f4edd

but I don't think we need to do that now: we're not reencoding jpegs a lot (we're mostly serving them as is), so we wouldn't see many benefits...
