buda-base / buda-base Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 1.0 414 KB

Vagrant file creating a BUDA server instance

License: Apache License 2.0

Shell 39.36% HTML 52.65% Ruby 5.78% Python 2.21%

buda-base's Introduction

Vagrant scripts for BUDA platform instanciation

The base platform is built using Vagrant and VirtualBox:

Install Vagrant and VirtualBox.
Download or git clone this repository.
cd into the unzipped directory or git clone
install VirtualBox guest additions with vagrant plugin install vagrant-vbguest
run vagrant up to summon a local instance

Or for an AWS EC2 instance:

install the vbguest plugin: vagrant plugin install vagrant-vbguest
and run the command: vagrant up or rename Vagrantfile.aws to Vagrantfile and run vagrant up --provider=aws

This will grind awhile installing all the dependencies of the BUDA platform.

Once the initial install has completed the command: vagrant ssh will connect to the instance where development, customization of the environment and so on can be performed as for any headless server.

Similarly, the jena-fuseki server will be listening on:

http://localhost:13180/fuseki

Lds-pdi application is accessible at :

http://localhost:13280/

(see https://github.com/buda-base/lds-pdi/blob/master/README.md for details about using this rest services)

The command: vagrant halt will shut the instance down. After halting (or suspending the instance) a further: vagrant up will simply boot the instance without further downloads, and vagrant destroy will completely remove the instance.

If running an AWS instance, after provisioning access the instance via ssh -p 15345 and delete Port 22 from /etc/ssh/sshd_config and sudo systemctl restart sshd. This will further secure the instance from attacks on port 22.

buda-base's People

Contributors

Stargazers

Watchers

Forkers

shn926

buda-base's Issues

nginx config fails

nginx config fails with:

Jan 04 15:35:34 buda-local nginx[11166]: nginx: [emerg] bind() to 0.0.0.0:13680 failed (98: Address already in use)

that's because in the local version of vagrant, port 13680 is taken by both github-email and the javascript code, see https://github.com/buda-base/buda-base/blob/master/conf/nginx/local.conf#L15

improve aws credential configuration

The aws credentials are currently in the various .aws directories of the home directories of the users corresponding to the services (ex: /usr/local/iiifserv/.aws/), there is a way to override this configuration with variable environments (see doc), this should be done in the systemd scripts and the new files should be in /etc/buda/iiifserv/ (to take the same example).

remove diacritics in mutlilingual analyzer

In the mutli-lingual analyzer, it would be useful to have a mechanism that removes diacritics so that more hits are returned for Sanskrit, Pinyin, etc.

other anciliary repositories

Two other repos are not configured yet, but they're a bit special:

https://github.com/buda-base/buda-thumbnail-generator which is probably the most normal one, we can integrate it like the others
https://github.com/buda-base/entity-scorer which shouldn't run every day as it's a bit CPU hungry... but we can install them
leyden-import
stabi-import
idp-import
wikidata-import
bn-import
fpl-import
person-century-cache

provisioning of blmp and pdl fail

Running vagrant up locally yields the following provisioning errors:

==> default: Running provisioner: blmp (shell)...
    default: Running: /var/folders/dy/1fhs6l1173jgmm6pc0gltmxm0000gn/T/vagrant-shell20181011-41639-sxgyi0.sh
    default: >>>> Installing blmp...
    default: >>>> DATA_DIR:  /mnt/data
    default: /mnt/data/blmp /home/vagrant
    default: >>>> downloading owl-schema
    default: Cloning into 'owl-schema'...
    default: >>>> downloading & installing blmp-prototype_flow
    default: Cloning into 'blmp-prototype-flow'...
    default: Parsing scenario file install
    default: ERROR: [Errno 2] No such file or directory: 'install'
    default: Parsing scenario file build
    default: ERROR: [Errno 2] No such file or directory: 'build'
==> default: Running provisioner: pdl (shell)...
    default: Running: /var/folders/dy/1fhs6l1173jgmm6pc0gltmxm0000gn/T/vagrant-shell20181011-41639-6720cl.sh
    default: >>>> Installing pdl...
    default: >>>> DATA_DIR:  /mnt/data
    default: /mnt/data/pdl /home/vagrant
    default: >>>> downloading & installing pdl
    default: Cloning into 'public-digital-library'...
    default: Parsing scenario file install
    default: ERROR: [Errno 2] No such file or directory: 'install'
    default: Parsing scenario file build
    default: ERROR: [Errno 2] No such file or directory: 'build'

configure timeout of Fuseki requests

I think with the recent problems it could be a reasonable to be a bit clearer about how we handle timeouts, and we could start with Fuseki. Ideally we should be able to specify:

a time after which a query must end in general, forcing a query to stop and returning an error if it's longer than this value
in some exceptional cases, having an endpoint (not used by ldspdi) that allows longer queries, because we might need them

The only thing I was able to find for this configuration is these line in this doc:

[] rdf:type fuseki:Server ;
   # Server-wide context parameters can be given here.
   # For example, to set query timeouts: on a server-wide basis:
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout to for rest of query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;

but it's not very clear to me how it works: does it actually interrupt the TDB query thread after some time, or does it just make http sockets timeout? Also, I don't know yet if we could have a bdrcrw-admin endpoint which would be basically the same as the bdrcrw but with a different timeout...

@xristy you looked at this part of the code quite a lot, what do you think?

"exact" search vs. "contains"

It's kind of random to report the issue there, but a user is wondering if we could have an exact search vs. contains. I suspect it would result in a different Lucene query, but I can't find anything about that in the Lucene query syntax. @xristy do you know if it's possible?

weird Sanskrit highlighting

It's probably a bug in lucene-sa but I don't have time to confirm that now so I'm just opening it here:

http://library.bdrc.io/search?q=%22n%C4%81masa%E1%B9%85g%C4%ABti%22&lg=sa-x-iast&t=Work

shows some odd highlighting such as

add zh-Hani to analyzers

It should be possible to index and search strings in zh-Hani (using the same configuration as zh-Hans)

update nginx mime types

The file /etc/nginx/mime.types should be updated with the following sort of patch:

- application/font-woff      woff; 
+ font/woff                 woff;
+ font/woff2                woff2;

to give the correct content-types for fonts. See nginx#1243

make sure that components of a local vagrant use other local components

we should make sure that when a user summons a vagrant instance, all the components use the local components (local pdl should use local lds-pdi, local lds-pdi should use local fuseki, etc.). In the case of course where the user didn't specify a configuration

change Chinese indexing

Instead of happening in simplified + pinyin with diacritics, the indexes of Chinese data should be in traditional (as most of our data will be) and pinyin without diacritics

buda-iiif-presentation bug with CORS handling

The way iiif-presentation is deployed in buda-base creates some 403 errors during CORS pre-flight requests with authenticated users. Example:

curl -v 'http://buda1.bdrc.io:13480/2.1.1/collection/wio:bdr:W22084_1088' -X OPTIONS -H 'Access-Control-Request-Method: GET' -H 'Access-Control-Request-Headers: Authorization' -H 'Origin: http://library.bdrc.io'

while it works when not requesting authorization header:

curl -v 'http://buda1.bdrc.io:13480/2.1.1/collection/wio:bdr:W22084_1088' -X OPTIONS -H 'Access-Control-Request-Method: GET' -H 'Origin: http://library.bdrc.io'

This bug is specific to the tomcat deployment, as the equivalent request on a local jetty running the iiif-presentation code works fine.

I've spent a few hours trying to debug this but am giving up because of lack of time. I've tried to configure a CorsFilter in Tomcat as an experiment, but it transforms everything into 404. I have no idea what's causing this nor any lead.

BUDA_PROPS shouldn't be required

with latest version, when I run vagrant destroy, I'm getting

script aborted: BUDA_PROPS env variable must be set !
script aborted: BUDA_PROPS env variable must be set !

this shouldn't happen. If the env variable is not set, then the default configuration (suitable for a local instance where no s3 or auth0 access will be possible) should be used

bizarre query results

If I run

select distinct ?mw ?pl where {
  (?mw ?sc ?pl) text:query ( :publisherLocation "rda ram sa la"@en ) .
}

I'm getting 9,965 results, which contain some whacky data such as places (G4252), persons (P2JM83), etc. I really don't understand what's going on... if I want the correct results (none) i need to run it with highlight:

select distinct ?mw ?pl where {
  (?mw ?sc ?pl) text:query ( :publisherLocation "rda ram sa la"@en :highlight ) .
}

now, one guess is that looking in :publisherLocation actually looks in other properties because of the configuration: https://github.com/buda-base/buda-base/blob/master/conf/fuseki/ttl.erb#L88 but I think this is a bug... @xristy any thought?

CORS for google translate

see buda-base/public-digital-library#409 (comment)

.json configuration in /etc/buda

We should find a way to have the .json configuration that currently lie in

/usr/local/pdl/public-digital-library/public/config.json
/usr/local/blmp/blmp-prototype-flow/public/config.json

be in /etc/buda. Note that it contains keys so it can't be in the buda-base repo.

alternative way to request properties in text queries

What I'm trying to solve is the problem of queries that we need to write like this:

select ?foo where {
  {
     (?s ?sc ?lit) text:query ( rdfs:label ?L_NAME "highlight:" ).
  }
  union
  {
    (?TT ?sc ?lit2) text:query ( skos:altLabel ?L_NAME "highlight:" ).
  }
  union
  { 
    (?TT ?sc ?lit1) text:query ( skos:prefLabel ?L_NAME "highlight:").
  }
}

and so on for every field. While it might be useful in some cases to discriminate between these, the only discrimination we make in our queries is between:

names / titles (using rdfs:label)
etext chunks (using :chunkContents)
entity labels (using skos:prefLabel and skos:altLabel)
the rest (which we actually never really query)

There is a mapping between properties and Lucene fields in the configuration. I'm wondering if we could leverage that to map multiple properties (like skos:prefLabel and skos:altLabel in a first time) to the same Lucene field, so that when we search on skos:prefLabel is actually finds both?

I'm thinking that a more accurate way would be to query on the Lucene field name actually, it makes less sense from a LOD perspective, but it's closer to reality... changing the configuration to something like:

text:map (
         [ text:field "label" ;
           text:fieldUri bdo:label ;
           text:predicate skos:prefLabel ]

and query something like:

(?s ?sc ?lit) bdo:label ( rdfs:label ?L_NAME "highlight:" ).

that way we fully understand what's going on...

fuseki marple needs log config

The fuseki marple works more or less but the systemd needs work and the start.sh needs to redirect or otherwise set up the logging so it doesn't grab out/err

searching in English strings when using transliteration

The problem is: on tbrc.org when you search ngag dbang blo bzang you get results where the match is an English note containing the string, for instance:

according to this often unreliable source there were two candidates, from khams nag shod and dpa' shod; the dpa' shod candidate selected in 1813 using the golden urn and given name ngag dbang blo bzang thogs med bstan 'dzin rgya mtsho

in P1504. This note has no lang tag in the xml but is (correctly?) tagged as English in the ttl.

Here's a proposal: for transliteration lang tags (bo-x-ewts, inc-x-ndia and sa-x-ndia primarily, but why not also zh-latn-pinyin and others too), we could configure the system so that it also queries the English as if the original string was English... That should give results similar to the results on tbrc.org at least for transliteration. I think we can even manage that without reindexing...

Now, this means results would be different for the transliteration and the unicode, which is not ideal... there are a few ways out of this:

double indexing English as English and as Tibetan (not ideal neither...)
converting bo to bo-x-ewts in the client behind the scenes
doing so on the server

Limit eXists database backups

eXists, and the legacy site create a full database backup every day. This backup consumes about 1GB. Try to reduce backups to 1 incremental/day, and possible one full / week.

couchdb crash/restart cycle

The following snippet from /var/log/daemon.log illustrates a repeated cycle of couchdb crashing and restarting. It looks like a config file issue:

gen_server config terminated with reason: no match of right hand value {error,eacces} at config_writer:save_to_file/2(line:38)

couchdb-crash-restart-cycle.txt

mysteriously empty highlighted match

Consider the following query:

CONSTRUCT {
  bdr:MW22737 tmp:labelMatchSa ?labelMatch1 ;
              tmp:labelMatchInc ?labelMatch2 ;
              tmp:labelMatchIncNoHi ?labelMatch3 ;
} where {
      (?title1 ?sc1 ?labelMatch1) text:query ( rdfs:label "\"therī\""@sa-x-ndia "highlight:" ) .
      bdr:MW22737 bdo:hasTitle ?title1 .
  
  	  (?title2 ?sc2 ?labelMatch2) text:query ( rdfs:label "\"therī\""@inc-x-ndia "highlight:" ) .
      bdr:MW22737 bdo:hasTitle ?title2 .

  	  (?title3 ?sc3 ?labelMatch3) text:query ( rdfs:label "\"therī\""@inc-x-ndia ) .
      bdr:MW22737 bdo:hasTitle ?title3 .
}

and the configuration which is currently on buda1, corresponding to this commit where the configurations of inc-x-ndia and sa-x-ndia are very much similar (same analyzer used). The bug is that the result is as follows:

bdr:MW22737  tmp:labelMatchInc  ""@sa-alalc97 ;
        tmp:labelMatchIncNoHi  "vajravali : a sanskrit manuscript from nepal containing the ritual and delineation of mandalas"@sa-alalc97 ;
        tmp:labelMatchSa       "vajravali : a sanskrit manuscript from nepal containing ↦the ri↤tual and delineation of mandalas"@sa-alalc97 .

where the highlight is empty in the inc-x-ndia case, while it's working in the sa-x-ndia case... This is quite strange as if you test with some pali matches (like bdr:MW1FPL8617) it works fine...

webp library installation

The IIIF installation script should install the webp library. In the current state the server can't be compiled because of that

errors in iiifserv.sh

there are several errors during provisioning iiifserv.sh. A PDF of the provisioning log is attached.

The first error appears on page 3:

default: cp:
default: cannot create regular file '/etc/buda/share/geolite/GeoLite2-Country.mmdb'
default: : No such file or directory

The second that I saw is on page 13:

default: [INFO] BUILD FAILURE
default: [INFO] ------------------------------------------------------------------------
default: [INFO] Total time: 22.827 s
default: [INFO] Finished at: 2018-11-21T19:17:07+00:00
default: [INFO] Final Memory: 53M/370M
default: [INFO] ------------------------------------------------------------------------
default: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.21.0:test (default-test) on project buda-hymir: There are test failures.
default: [ERROR]
default: [ERROR] Please refer to /mnt/data/downloads/buda-iiif-server/target/surefire-reports for the individual test results.
default: [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, [date].dumpstream and [date]-jvmRun[N].dumpstream.
default: [ERROR] -> [Help 1]
default: [ERROR]
default: [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
default: [ERROR] Re-run Maven using the -X switch to enable full debug logging.
default: [ERROR]
default: [ERROR] For more information about the errors and possible solutions, please read the following articles:
default: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
default: chown:
default: cannot access 'target/*.jar'
default: : No such file or directory
default: cp: cannot stat 'target/buda-hymir-1.0.0-SNAPSHOT-exec.jar': No such file or directory
default: cp: cannot create regular file '/etc/buda/iiifserv/': No such file or directory

iiifserv-provisioning-log.pdf

too many open files

lds-pdi crashed this morning with the error:

06-Aug-2018 09:25:27.069 SEVERE [http-nio-13280-Acceptor-0] org.apache.tomcat.util.net.NioEndpoint$Acceptor.run Socket accept failed
 java.io.IOException: Too many open files
	at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
	at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
	at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
	at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:692)
	at java.lang.Thread.run(Thread.java:748)

I'm not really sure why, but maybe adding

      proxy_http_version 1.1;
      proxy_set_header Connection "Keep-Alive";
      proxy_set_header Proxy-Connection "Keep-Alive";

in the nginx config could help

update Geolite fetching

Geolite database is now only accessible with an access key, see this blog entry, so there should either be a way to configure that in the scripts (maybe the key should be in a file that by default contains nothing and that users of buda-base would fill with their key. Note that under no circumstance the key should be on the git repo. If that happens, the key should be revoked.

tomcat 9

we should update to tomcat9

note about discoveries

there's no satisfying repo to record that kind of feature request in, so recording it here...

Jann pointed me to https://medievaleavings.hcommons.org/our-archival-darlings/ which I think is a very good system that we could implement. Not sure how, it could be annotations that we publish on the front page, articles, etc.

Fuseki not responding

We had today the issue of fuseki not responding using RdfConnectionFuseki.put(Model, graph), the connection being established on newcorerw/data ;

The following line appears 10s of thousands times in the logs (incident time: 2020-04-29 18:20:10):

[2020-04-29 18:20:10] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >

followed by a full thread dump of the OpenJDK 64-Bit Server VM (25.242-b08 mixed mode)

Restarting Fuseki server solved the issue.

update to Debian 10

Debian 10 was released recently, we should use it instead of 9

create none.opt in fuseki database folder

When creating the fuseki database, the script should also add a none.opt (see doc), so that requests can work at full speed

cloning schema and shapes repo

The owl-schema repo should be cloned in $LDSPDI_HOME/owl-schema/

properties.erb should be updated for lds-pdi

https://github.com/buda-base/buda-base/blob/master/conf/lds-pdi/properties.erb is very outdated, a more reasonable version should be introduced

mirror website in Asia

The BDRC website is currently very slow in Asia (Taipei and India), much more so than in the US or France. My understanding is that this is mostly due to the geography. We should experiment with a mirror server in Asia

plug Fuseki into prometheus

Fuseki is using micrometer, I suspect it's feasible to plug it to prometheus...

searching on multiple languages for "en"

we could conflate several lang tags in the same analyzer:

en
bo-x-mixed
bo-x-phon-en-m-tbrc
bo-x-phon-en-m-thlib

metrics report

I think we could get some nice metrics through Prometheus and Mircrometer. Hymir (and anything using Spring Boot) already has the Micrometer metrics integrated, we could define a few interesting metrics (image dl from S3 in terms of size, image output, etc.), but first we would need to install all that in the system, hence this issue.

ldsearch issues in buda-base

It would be best if we have a release of ldsearch as a war file that can be downloaded from github or stashed in buda-base/conf/lds/ perhaps.

I don't see the benefit of installing mvn and git-core and cloning the ldsearch repo and then mvn which downloads all the dependencies every time the vagrant provisioning is run fresh.

Also unless ldsearch is supposed to be contacted on the same port as fuseki (13180 in the current buda-base) then it would be appropriate to setup a separate tomcat for ldsearch - and other app-server servlets.

add cron to clean yarn cache every (week/day?)

with the command yarn cache clean

installing Matomo

We should install Matomo for analytics. It requires a mysql database. The easiest way to set it up would be to take a small RDS instance I think. It provides backup and all that good stuff. It's GDPR compliant, doesn't require third party cookies (and the annoying message we have everywhere in Europe).

en search should find matches in en-x-mixed

For instance https://library.bdrc.io/search?q=%22Kajur%22&lg=en&t=Topic&s=closest%20matches%20forced doesn't find bdr:T2423

COOP header in NGINX

see https://web.dev/coop-coep/ , that might be useful

couchdb provisioning error on AWS

There's a couple of curl in the couchdb provisioner that reference port 13598. On the local installs, Vagrantfile includes:

config.vm.network :forwarded_port, guest: 13598, host: 5984 # couchdb

but this is not feasible on the AWS provisioning.

I haven't looked at what the curls are doing so I don't know if the AWS instance is properly provisioned or not.

It would be nice if there's a way to configure couchdb during installation to run on some other port than 5984 so that AWS and local instances are all on the same non-standard port. The AWS security group would need to be updated if the port can be changed.

I've attached a segment from the provisioning transcript that shows the error.

AWS-provisioning-error-couchdb.txt

add Pali and Indic

There should be a few additional analyzer configurations:

pi-x-iast
pi-x-iast-ndia (both mimicking their Sanskrit counterparts)
inc-x-iast
inc-x-iast-ndia (searching in both the Sanskrit and Pali, with probably more to come in the future)

buda-base / buda-base Goto Github PK

buda-base's Introduction

Vagrant scripts for BUDA platform instanciation

buda-base's People

Contributors

Stargazers

Watchers

Forkers

buda-base's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs