
buda-iiif-server's Introduction

Vagrant scripts for BUDA platform instantiation

The base platform is built using Vagrant and VirtualBox:

  1. Install Vagrant and VirtualBox.
  2. Download or git clone this repository.
  3. cd into the unzipped directory or cloned repository.
  4. Install the VirtualBox guest additions plugin: vagrant plugin install vagrant-vbguest
  5. Run vagrant up to summon a local instance.

Or for an AWS EC2 instance:

  1. Install the vbguest plugin: vagrant plugin install vagrant-vbguest
  2. Run vagrant up, or rename Vagrantfile.aws to Vagrantfile and run vagrant up --provider=aws

This will grind for a while, installing all the dependencies of the BUDA platform.

Once the initial install has completed, the command vagrant ssh will connect to the instance, where development, customization of the environment and so on can be performed as on any headless server.

The jena-fuseki server will be listening on:

http://localhost:13180/fuseki

The lds-pdi application is accessible at:

http://localhost:13280/

(see https://github.com/buda-base/lds-pdi/blob/master/README.md for details about using these REST services)

The command vagrant halt will shut the instance down. After halting (or suspending) the instance, a further vagrant up will simply boot the instance without further downloads, and vagrant destroy will completely remove the instance.

If running an AWS instance, after provisioning, access the instance via ssh -p 15345, remove Port 22 from /etc/ssh/sshd_config, and run sudo systemctl restart sshd. This will further secure the instance against attacks on port 22.


buda-iiif-server's Issues

reducing size of output jpg

Context: on S3, the tif corresponding to this image is below 30KB, but the output jpg on the iiif server is 461KB.

On the current website (using JAI), the corresponding image (here) is a png of about 30KB.

The png version is only 53KB, much more reasonable but still significantly more than the current website.

It seems hymir just uses the basic javax.imageio functions (see here) as provided by twelvemonkeys. The parameters that we can use in JPEGImageWriteParam look very limited. There doesn't seem to be a much better option in Java though.

This is an important issue for various reasons:

  • bigger files take longer to load
  • they cost us more to transfer to the user (we're paying for the bandwidth)
  • a factor of 10 in size is just completely unreasonable and probably indicates some deeper problems

Here are a few ideas to start dealing with the issue:

  • first, let's bring @TBRC-JimK in: Jim, you'll be developing some expertise in image processing in Java for the asset manager, so maybe we should share our docs, techniques, libraries, code, etc.?
  • a first easy action would be to tweak the Java jpg encoding quality values with the method suggested here (see the sketch after this list), 90% is probably sufficient
  • we should log the decoders/encoders used by javax.imageio to make sure the correct ones are used (and probably also make sure we understand what the correct ones are)
  • we can also run some experiments to understand why a png produced with JAI is half the size of a png produced with imageio, and report bugs or tweak the configuration if needed
  • (I'm not sure my diagnosis is right here) then we should understand why the output jpg is full color while the original tif is black and white. This will require some diving into the Java APIs and the internal image representation in Java
  • then we should understand whether, in these cases, we can tell the iiif viewer to prefer png over jpg. This will also require some diving, this time into the iiif APIs (that's probably a job for me)
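
A minimal sketch of the two easy actions above (logging which jpeg writer javax.imageio actually picks up, and encoding with an explicit quality instead of the writer's default); the 0.9 value and the helper names are just examples:

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JpegEncoding {

    private static final Logger log = LoggerFactory.getLogger(JpegEncoding.class);

    // Log every registered jpeg writer, to check which one (twelvemonkeys,
    // turbojpeg...) is actually used at runtime.
    public static void logJpegWriters() {
        Iterator<ImageWriter> it = ImageIO.getImageWritersByFormatName("jpeg");
        while (it.hasNext()) {
            log.info("registered jpeg writer: {}", it.next().getClass().getName());
        }
    }

    // Encode with an explicit quality (e.g. 0.9f) instead of the writer's default.
    public static void writeJpeg(BufferedImage img, OutputStream out, float quality)
            throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(out)) {
            writer.setOutput(ios);
            writer.write(null, new IIOImage(img, null, null), param);
        } finally {
            writer.dispose();
        }
    }
}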

logback configuration

There should be a system to configure logback so that when executing the jar we can override properties for the path to the log files, the log level, etc.
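
A minimal sketch of one way to do this, reloading logback from a file given as a system property at startup; the property name iiifserv.logConfig is a made-up example:

import java.io.File;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.joran.JoranConfigurator;
import ch.qos.logback.core.joran.spi.JoranException;
import org.slf4j.LoggerFactory;

public class LogConfig {

    // Call early in main(): java -Diiifserv.logConfig=/etc/buda/logback.xml -jar server.jar
    public static void applyExternalConfig() throws JoranException {
        String path = System.getProperty("iiifserv.logConfig");
        if (path == null) {
            return; // keep the logback.xml bundled in the jar
        }
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        JoranConfigurator configurator = new JoranConfigurator();
        configurator.setContext(context);
        context.reset(); // drop the configuration loaded from the classpath
        configurator.doConfigure(new File(path));
    }
}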

serving static images

In order to serve the error images, we should be able to serve an image in

src/main/resources/static/abc.def

when requesting the id

http://iiif.bdrc.io/static::abc.def/
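
A minimal sketch of what this could look like, assuming a Spring MVC controller (hymir is Spring Boot based); the mapping syntax and media type handling are illustrative only, and real code would need to sanitize the filename:

import org.springframework.core.io.ClassPathResource;
import org.springframework.core.io.Resource;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class StaticImageController {

    // Serves src/main/resources/static/abc.def for /static::abc.def/
    @GetMapping("/static::{filename}/")
    public ResponseEntity<Resource> serveStatic(@PathVariable String filename) {
        // real code should reject path separators etc. in filename
        Resource res = new ClassPathResource("static/" + filename);
        if (!res.exists()) {
            return ResponseEntity.status(HttpStatus.NOT_FOUND).build();
        }
        MediaType type = MediaType.APPLICATION_OCTET_STREAM;
        if (filename.endsWith(".png")) {
            type = MediaType.IMAGE_PNG;
        } else if (filename.endsWith(".jpg") || filename.endsWith(".jpeg")) {
            type = MediaType.IMAGE_JPEG;
        }
        return ResponseEntity.ok().contentType(type).body(res);
    }
}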

caching images

I'm a bit surprised, as I can't seem to find the code that handles the caching of S3 images... are the images cached at all in the normal context (i.e., not when building an archive)? If not, adding a cache should make a significant difference, especially since the same image is requested several times by the viewer at different sizes (thumbnail, big thumbnail and full).
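
If there is indeed no cache, here is a minimal sketch of the idea, a bounded in-memory LRU keyed by the S3 identifier; the class and method names are made up, and a real implementation would probably use a proper cache library and also cap the total byte size:

import java.util.LinkedHashMap;
import java.util.Map;

// Very small LRU cache for raw image bytes fetched from S3.
public class S3ImageCache {

    private final int maxEntries;
    private final Map<String, byte[]> cache;

    public S3ImageCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // access-ordered map: least recently used entries are evicted first
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > S3ImageCache.this.maxEntries;
            }
        };
    }

    public synchronized byte[] get(String s3Key) {
        return cache.get(s3Key);
    }

    public synchronized void put(String s3Key, byte[] data) {
        cache.put(s3Key, data);
    }
}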

public cache claim for open access images

For all the images that don't require authentication to be accessed (including the first and final 20 pages of a FairUse work), the cache control mechanism should advertise the image as public (see rfc7234). I think images restricted in China could have that too...

Fair use

The iiif server pdfs of Fair_use works always return 41 pages regardless of the user profile. In other words, an admin user has no way to access the entire work.

remove the /image/v2 prefix

I have something in mind, and I'm not sure if it's reasonable or doable (or in contradiction with the iiif spec), but basically it's the following: having permanent identifiers for images. This implies that an image is not dependent on the iiif version, and we need an iiif-server-independent way to refer to it, and thus we need to completely control the URL of the images in the image server. Hence the request. It's quite obvious for annotations: when someone annotates an image provided through iiif v2, we don't want their annotation to stop working when we move to iiif v3.

new identifier format?

In the same vein as #6, but in a less important way, maybe it would be nice to change the identifier format a little bit, so that it can be prefixed. For instance

<http://iiif.bdrc.io/image/v2/bdr:V00KG0545_I1KG20698::I1KG206980007.tif>

could be

@prefix bdi: <http://iiif.bdrc.io/image/v2/> .

bdi:bdr:V00KG0545_I1KG20698::I1KG206980007.tif

but although having :s in the local names seems to be allowed by the spec, I'm afraid this could confuse both poorly written libraries and humans, so maybe we should replace : with something else in the identifiers... maybe ,?

add metadata in the PDF export

Some metadata should be added to the exported PDF, such as:

  • the prefLabel of the work in Unicode
  • a reference to BDRC with the URI
  • a license indication (should be available in the admin data of the item)
  • a URI indication

the reference to iText should be removed
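
A minimal sketch of setting such metadata, assuming the export moves from iText to Apache PDFBox (the replacement library is not decided in this issue); the field values and helper name are placeholders:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public class PdfMetadata {

    // Adds document-information entries to an exported PDF.
    public static void addBdrcMetadata(PDDocument doc, String prefLabel,
            String workUri, String license) {
        PDDocumentInformation info = doc.getDocumentInformation();
        info.setTitle(prefLabel);                        // prefLabel of the work, in Unicode
        info.setAuthor("Buddhist Digital Resource Center");
        info.setSubject(workUri);                        // URI of the work on BDRC
        info.setCustomMetadataValue("License", license); // from the admin data of the item
        info.setCustomMetadataValue("Source", workUri);
    }
}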

implement country restriction

We have basically two options:

  • use Auth0 (see here)
  • implement it on the iiif server itself (not reading it from the auth token), which is more costly but more secure (through this method for instance; see the sketch after this list)
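
As a rough illustration of the second option, a sketch of a country check against a MaxMind GeoIP2 database; whether that matches the method linked above is an assumption, as are the class name and the policy of blocking when the lookup fails:

import java.io.File;
import java.io.IOException;
import java.net.InetAddress;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;

public class CountryRestriction {

    private final DatabaseReader reader;

    public CountryRestriction(File geoLiteCountryDb) throws IOException {
        // e.g. GeoLite2-Country.mmdb, path to be made configurable
        this.reader = new DatabaseReader.Builder(geoLiteCountryDb).build();
    }

    // Returns true if an image restricted in China can be served to this client IP.
    public boolean allowedForChinaRestricted(String clientIp) {
        try {
            String iso = reader.country(InetAddress.getByName(clientIp))
                    .getCountry().getIsoCode();
            return !"CN".equals(iso);
        } catch (IOException | GeoIp2Exception e) {
            return false; // unknown IPs: err on the safe side
        }
    }
}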

perf counters

The feedback I have on the iiif system is that it's very slow, and I tend to agree. Some aspects are not really due to the server, but I think some definitely are. It would be good to log some timings at the debug log level in order to get information about bottlenecks (see the sketch after this list). Basically, for each big operation there should be a log line with some size and timing information. A non-exhaustive list of operations:

  • getting results from ldspdi
  • downloading an image from S3
  • checking if an image is in the cache
  • getting from the cache
  • writing in the cache
  • converting an image to the desired format
  • cutting or resizing the image to fit the request (might not be distinguishable from the previous one)
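
A minimal sketch of the kind of timing helper and debug log this implies; the logger name, the message format and the s3.download call in the usage comment are made up:

import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PerfTimer {

    private static final Logger log = LoggerFactory.getLogger("io.bdrc.iiif.perf");

    // Wraps one of the big operations (S3 download, cache lookup, conversion...)
    // and logs how long it took at debug level.
    public static <T> T timed(String operation, String identifier, Callable<T> op) throws Exception {
        long start = System.nanoTime();
        T result = op.call();
        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        log.debug("{} for {} took {} ms", operation, identifier, ms);
        return result;
    }
}

// usage (byte counts would be logged by the caller, which knows the sizes):
// byte[] data = PerfTimer.timed("S3 download", s3Key, () -> s3.download(s3Key));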

white image bug when treating some images

Some grayscale images come out wrong when a modified version is produced. For instance

http://iiif.bdrc.io/bdr:V1PD127393_I1PD127464::I1PD1274640004.jpg/full/max/0/default.jpg

is fine while

http://iiif.bdrc.io/bdr:V1PD127393_I1PD127464::I1PD1274640004.jpg/full/max/90/default.jpg

is almost entirely white. Note that in the first case the original .jpg is accessed directly, so it doesn't go through the image saving pipeline with turbojpeg. So there may be another bug in the way turbojpeg is handled. We should verify that, and if the bug is indeed in the turbojpeg plugin, report a bug here.

Expired download links

There should be a way to manage the expiration of pdf/zip links (typical use case: a user starts a pdf generation, leaves, and comes back a few hours later).

exif metadata of images

It would be helpful if hymir added some exif metadata to the served images (when the format allows it). Something rather simple like the source of the download would be good enough. Does hymir allow that? If not, we should open an issue about it.

use IIIF auth API

Once the auth is done, the auth service should be advertised in the images that some users can't access. It seems it's mostly a matter of adding the auth service to the info.json, see this example.
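
As a rough illustration (assuming the IIIF Auth API 1.0; the auth endpoints under iiif.bdrc.io and the label are made up), the info.json of a restricted image would gain a service block along these lines, with a login service and a nested token service:

"service": {
    "@context": "http://iiif.io/api/auth/1/context.json",
    "@id": "https://iiif.bdrc.io/auth/login",
    "profile": "http://iiif.io/api/auth/1/login",
    "label": "Login to BDRC",
    "service": [
      {
        "@id": "https://iiif.bdrc.io/auth/token",
        "profile": "http://iiif.io/api/auth/1/token"
      }
    ]
}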

error 500 fetching first image of volume

serving rdf for images?

With the goal of having permanent identifiers for images, it seems reasonable that, if images are entities, an http request should provide some RDF in some serialization... This RDF would be different from the info.json (which is of type ImageService3 in iiif v3.0), but more similar to the information used in the presentation api, which in JSON-LD is (in the spec example):

{
    "id": "https://example.org/iiif/book1/res/page1.jpg",
    "type": "Image",
    "label": {"en": ["Page 1"], "es": ["Página 1"]},
    "format": "image/jpeg",
    "service": [
      {
        "id": "https://example.org/images/book1-page1",
        "type": "ImageService3",
        "profile": "level2"
      }
    ],
    "height": 2000,
    "width": 1500
}

BDRCImageService

Write a custom implementation of the enrichInfo method (info.json building).

cleanup of geolocation

I've changed a few things in Geolocation in presentation (see this commit); I think a similar set of changes could be applied here. No emergency though.

README update

The readme should be updated with the new instructions to run the server locally

use ebooks when present and relevant

In some cases (full volume downloads), it may be relevant to use the ebook if it's present on S3. Not all are generated, but when they are, they seem to follow this URL convention:

s3://archive.tbrc.org/Works/{md5}/{workid}/eBooks/{workid}-{volumenumber}.pdf

with the volume number padded to 3 digits. Example:

s3://archive.tbrc.org/Works/60/W22084/eBooks/W22084-001.pdf
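
A minimal sketch of building such a key from the convention above (the class, method and parameter names are made up):

public class EbookKeys {

    // Builds the S3 key for a pregenerated volume ebook, e.g.
    // Works/60/W22084/eBooks/W22084-001.pdf
    public static String ebookKey(String md5Prefix, String workId, int volumeNumber) {
        return String.format("Works/%s/%s/eBooks/%s-%03d.pdf",
                md5Prefix, workId, workId, volumeNumber);
    }
}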

It's not really any kind of priority, but I thought I'd report this to raise some awareness of the existence of these files. The main difference with the normal PDFs is that they contain bookmarks, a table of contents, a cover and a copyright notice. The ones on S3 are quite old, so it may be a bad idea to use them; maybe the new ones would make more sense... (but they're not on S3 yet?).

pdf file naming problem (double extension)

When on the download page (ex: here) and clicking on the link, the file that is saved by my browsers (both Firefox and Chrome) has a double .pdf extension (it's named bdr_V1PD96945_I1PD96947FAIR_USE_1-662.pdf.pdf); there is certainly a way to fix that... Also, I'm not sure it has much impact on the experience of most users, but the <a> could have a type="application/pdf" attribute.
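
One hedged way to make sure the suggested filename carries a single .pdf extension, assuming the name is (or will be) set through a Content-Disposition header; how the current name is actually produced hasn't been verified, and the helper name is illustrative:

import javax.servlet.http.HttpServletResponse;

public class PdfDownload {

    // Makes sure the suggested filename ends in exactly one ".pdf".
    public static void setPdfDisposition(HttpServletResponse response, String baseName) {
        String name = baseName.endsWith(".pdf") ? baseName : baseName + ".pdf";
        response.setContentType("application/pdf");
        response.setHeader("Content-Disposition", "attachment; filename=\"" + name + "\"");
    }
}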

pdf generation failures

  • cannot access the volume listing page for bdr:W22084 (link)

  • can access the volume listing page for bdr:W12827 (link) but then not the individual volume pdf download page, e.g. Volume 9 (link)

  • can access the individual volume pdf download page, e.g. Volume 10 (link), only after the corresponding zip volume has been requested (link)

client cache headers

I don't know if it's relevant in all hymir use cases, but it definitely is in ours: we want to instruct web browsers to cache images and info.json files. Currently there are no http cache instructions; there should be at least a configurable max-age. In our case it can be very large, as the images more or less never change.
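
A minimal sketch of what this could look like on our side, assuming a servlet filter placed in front of the image and info.json routes; the one-year max-age is just an example value, and public matches the open-access case discussed in an earlier issue:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Adds Cache-Control to image and info.json responses so that browsers keep them.
public class ImageCacheControlFilter implements Filter {

    private static final long MAX_AGE_SECONDS = 31536000L; // one year, to be made configurable

    @Override
    public void init(FilterConfig config) {
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        // "public" applies to open-access images; it would be dropped for restricted ones
        response.setHeader("Cache-Control", "public, max-age=" + MAX_AGE_SECONDS);
        chain.doFilter(req, res);
    }

    @Override
    public void destroy() {
    }
}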

cookie mechanism for images

If we want to support the inclusion of protected images through simple <img> tags, we need to support passing the token in an alternate way, as in this case it cannot be passed in a header (see this question and this one). So we could probably use cookies for this case. We don't have to support that now, but we might in the future, so it's important that we're cautious about the assumptions we make, including in the bdrc-auth-lib library (which should support this kind of case).
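
A rough sketch of the fallback this implies; the cookie name and the way bdrc-auth-lib would consume the token are assumptions:

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

public class TokenLookup {

    // Prefer the Authorization header, fall back to a cookie for plain <img> tags.
    public static String findToken(HttpServletRequest request) {
        String header = request.getHeader("Authorization");
        if (header != null && header.startsWith("Bearer ")) {
            return header.substring("Bearer ".length());
        }
        Cookie[] cookies = request.getCookies();
        if (cookies != null) {
            for (Cookie c : cookies) {
                if ("bdrc-iiif-token".equals(c.getName())) { // hypothetical cookie name
                    return c.getValue();
                }
            }
        }
        return null;
    }
}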

use mozjpeg

mozjpeg delivers jpgs that are quite a bit smaller than regular jpgs (it optimizes the encoding, especially for high contrast). It should be a drop-in replacement for libjpeg-turbo (the lib has the same name), so it's theoretically just a matter of installing mozjpeg instead of libjpeg-turbo... There is a script to do that under Debian that we could use in buda-base:

https://gist.github.com/Kelfitas/f3fb99984698ccd79414c6a29e9f4edd

but I don't think we need to do that now: we're not reencoding jpegs a lot (we're mostly serving them as is), so we wouldn't see many benefits...
